
ANNA UNIVERSITY - CHENNAI - JUNE 2010 & DECEMBER 2010
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

SUB CODE / SUB NAME: CS9221 DATABASE TECHNOLOGY

Part A – (10 × 2 = 20 Marks)

1. What is fragmentation? (JUNE 2010)
Fragmentation is a database server feature that allows you to control where data is stored at the table level. Fragmentation enables you to define groups of rows or index keys within a table, according to some algorithm or scheme. You use SQL statements to create the fragments and assign them to dbspaces.
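A minimal sketch of the idea in Python (the column, predicates, and dbspace names are hypothetical; a real server such as Informix declares fragments with SQL DDL and assigns them to dbspaces):

    # Expression-based fragmentation: each row is routed to a "dbspace"
    # by a partitioning predicate on one of its columns.
    FRAGMENTS = {
        "dbspace1": lambda row: row["region"] == "north",
        "dbspace2": lambda row: row["region"] == "south",
    }

    def route(row):
        # Return the dbspace whose expression the row satisfies.
        for dbspace, predicate in FRAGMENTS.items():
            if predicate(row):
                return dbspace
        return "remainder_dbspace"  # rows matching no expression

    print(route({"region": "north", "id": 1}))  # -> dbspace1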

2. What is concurrency control? (JUNE 2010) (NOV/DEC 2010)
Concurrency control is the activity of coordinating concurrent accesses to a database in a multiuser system. Concurrency control allows users to access a database in a multiprogrammed fashion while preserving the consistency of the data.

3. What is persistence? (JUNE 2010) (NOV/DEC 2010)
Persistence is the property of an object through which its existence transcends time (i.e., the object continues to exist after its creator ceases to exist) and/or space (i.e., the object's location moves from the address space in which it was created).

4. What is transaction processing? (JUNE 2010)
A transaction processing system (TPS) is an information system that processes data transactions in a database system and monitors transaction programs (a special kind of program). For example, when an electronic payment is made, the amount must be both withdrawn from one account and added to the other; it cannot complete only one of those steps. Either both must occur, or neither. In case of a failure preventing transaction completion, the partially executed transaction must be rolled back by the TPS.
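A minimal sketch of this all-or-nothing behaviour, using Python's sqlite3 (the table and the amounts are made up):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)",
                     [("alice", 100), ("bob", 0)])
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            conn.execute("UPDATE account SET balance = balance - 40 "
                         "WHERE name = 'alice'")
            conn.execute("UPDATE account SET balance = balance + 40 "
                         "WHERE name = 'bob'")
    except sqlite3.Error:
        pass  # a partial transfer was rolled back by the TPS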

5. What is the client/server model? (JUNE 2010)

The server in a client/server model is simply the DBMS, whereas the client is the database application serviced by the DBMS.


The client/server model of a database system is classified into the basic and the distributed client/server models.

6. What is the difference between data warehousing and data mining? (JUNE 2010)
Data warehousing: the process used to integrate and combine data from multiple sources and formats into a single unified schema. It provides the enterprise with a storage mechanism for its huge amounts of data.
Data mining: the process of extracting interesting patterns and knowledge from huge amounts of data. Data mining techniques can be applied to the data warehouse of an enterprise to discover useful patterns.

7. Why do we need normalization? (JUNE 2010)
Normalization is a process for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and for performance tuning.
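A minimal sketch of the idea in Python's sqlite3, with a hypothetical schema: a single orders table that repeats the customer's city in every row is split so that each fact is stored once:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (cust_id   INTEGER PRIMARY KEY,
                           cust_city TEXT);            -- stored once per customer
    CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                           cust_id  INTEGER REFERENCES customer(cust_id),
                           item     TEXT);             -- no repeated city data
    """)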

8. What is integrity? (JUNE 2010) (NOV/DEC 2010)
Integrity refers to the process of ensuring that a database remains an accurate reflection of the universe of discourse it is modeling or representing. In other words, there is a close correspondence between the facts stored in the database and the real world it models.

9. Give two features of multimedia databases. (JUNE 2010) (NOV/DEC 2010)
Multimedia database systems are used when it is required to administer huge amounts of multimedia data objects of different media types (optical storage, video tapes, audio records, etc.), so that they can be used (that is, efficiently accessed and searched) by as many applications as needed.
The objects of multimedia data are text, images, graphics, sound recordings, video recordings, signals, etc., that are digitized and stored.

10. What are deductive databases? (JUNE 2010) (NOV/DEC 2010)
A deductive database is the combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine which allows the derivation of information implied by the facts and rules.

A deductive database system specifies rules through a declarative language: a language in which we specify what to achieve, rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is closely related to the relational data model, and also to the field of logic programming and the Prolog language.
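A minimal forward-chaining sketch in Python, on a made-up fact base, for the Datalog-style rule ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z) (an illustration only, not a real deductive DBMS API):

    facts = {("parent", "amy", "bob"), ("parent", "bob", "carl")}

    def infer(facts):
        # Every parent is an ancestor; then close the set under the rule above.
        derived = {("ancestor", x, y) for (p, x, y) in facts if p == "parent"}
        changed = True
        while changed:
            changed = False
            for (_, x, y) in [f for f in facts if f[0] == "parent"]:
                for (_, y2, z) in list(derived):
                    if y2 == y and ("ancestor", x, z) not in derived:
                        derived.add(("ancestor", x, z))
                        changed = True
        return derived

    print(infer(facts))  # includes ('ancestor', 'amy', 'carl'), a derived fact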


11. What is query processing? (NOV/DEC 2010)
Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the query and translating it into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain suitable execution strategies, and then actually executing the query to get the result.

12. Give two features of object-oriented databases. (NOV/DEC 2010)
The features of object-oriented databases are:
• They provide persistent storage for objects.
• They may provide one or more of the following: a query language; indexing; transaction support with rollback and commit; the possibility of distributing objects transparently over many servers.

13. What is data warehousing? (NOV/DEC 2010)
Data warehousing is the process used to integrate and combine data from multiple sources and formats into a single unified schema. It provides the enterprise with a storage mechanism for its huge amounts of data.

14. What is normalization? (NOV/DEC 2010)
Normalization is a process for eliminating redundant data and establishing meaningful relationships among tables, based on rules, in order to maintain the integrity of data. It is done to save storage space and for performance tuning.

15. Mention two features of parallel databases. (NOV/DEC 2010)
• Speedup: queries are executed faster because more resources, such as processors and disks, are provided.
• Scaleup: increasing workloads are handled without increased response time, via an increase in the degree of parallelism.


Part B – (5 × 16 = 80 Marks) (JUNE 2010) & (DECEMBER 2010)

1. (a) Explain the architecture of distributed databases. (16) (JUNE 2010)
Or

(b) Discuss in detail the architecture of distributed databases. (16) (NOV/DEC 2010)


(b) Write notes on the following:
(i) Query processing (8) (JUNE 2010)


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency.
• Concurrency transparency
• Failure transparency

Example Transaction – SQL Version:

    Begin_transaction Reservation
    begin
        input(flight_no, date, customer_name);
        EXEC SQL UPDATE FLIGHT
            SET STSOLD = STSOLD + 1
            WHERE FNO = flight_no AND DATE = date;
        EXEC SQL INSERT
            INTO FC(FNO, DATE, CNAME, SPECIAL)
            VALUES (flight_no, date, customer_name, null);
        output("reservation completed");
    end {Reservation}

Properties of Transactions:
• ATOMICITY: all or nothing.
• CONSISTENCY: no violation of integrity constraints.
• ISOLATION: concurrent changes invisible (serializable).
• DURABILITY: committed updates persist.
These are the ACID properties of a transaction.

Atomicity: either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving a transaction's atomicity in the presence of transaction aborts (due to input errors, system overloads, or deadlocks) is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.

Consistency: internal consistency.
• A transaction which executes alone against a consistent database leaves it in a consistent state.
• Transactions do not violate database integrity constraints.
Transactions are correct programs.

Isolation: degrees of isolation:
Degree 0:
• Transaction T does not overwrite dirty data of other transactions. (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
Degree 1:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT (end of transaction).
Degree 2:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.
Degree 3:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.


• Other transactions do not dirty any data read by T before T completes.

Isolation – serializability:
• If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
• An incomplete transaction cannot reveal its results to other transactions before its commitment.
• This is necessary to avoid cascading aborts.

Durability: once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This is the job of database recovery.

Transaction transparency: ensures all distributed transactions maintain the distributed database's integrity and consistency.
• A distributed transaction accesses data stored at more than one location.
• Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
• The DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions.

Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
• Replication makes concurrency more complex.

Failure transparency: must ensure the atomicity and durability of the global transaction.
• This means ensuring that the subtransactions of the global transaction either all commit or all abort.

Classification: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
• Remote request
• Remote unit of work
• Distributed unit of work
• Distributed request

2. (a) Discuss the modeling and design approaches for object-oriented databases. (JUNE 2010)

Or
(b) Describe modeling and design approaches for object-oriented databases. (16) (NOV/DEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++, or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling: an object consists of three parts: structure (attributes and relationships to other objects, such as aggregation and association), behavior (a set of operations), and characteristic of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have for the attributes publishedBy and writtenBy complex types, Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables, linked by keys to the main table.

Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed-of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.

Generalization/specialization is the is-a relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
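A minimal Python sketch of these modeling ideas, following the Book example (the class and attribute names are illustrative):

    class Publisher:
        def __init__(self, name):
            self.name = name

        def insert(self, name, pid):
            # A method: defines behavior, invoked by sending a message.
            self.name, self.pid = name, pid

    class Book:
        def __init__(self, title, published_by):
            self.title = title                 # simple attribute
            self.published_by = published_by   # complex type: a Publisher object

    class ArtBook(Book):  # generalization/specialization: an ArtBook is a Book
        pass

    b = ArtBook("Impressionism", Publisher("Rose"))
    print(b.published_by.name)  # inherited attribute, association traversal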

Message: the means by which objects communicate; a message is a request from one object to another to execute one of its methods. For example: Publisher_object.insert("Rose", 123...), i.e., a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.

The main differences between relational database design and object-oriented database design include:
• Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages. (DECEMBER 2010)

Multi-Version Locks:
Multiversion concurrency control (abbreviated MCC or MVCC) is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but (generally) requires the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes.


Writes affect a future version; but at the transaction ID at which the read is working, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), then transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of the DB could be:

    Time    Object1    Object2
    t1      "Hello"    "Bar"
    t0      "Foo"      "Bar"

This indicates that the current state of the database (perhaps a key-value store) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

    Time    Object1    Object2      Object3
    t2      "Hello"    (deleted)    "Foo-Bar"
    t1      "Hello"    "Bar"
    t0      "Foo"      "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
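A minimal Python sketch of the read and write rules described above (the data structures and timestamps are illustrative assumptions, not a real DBMS API):

    # Versions of object P as (write_ts, value), in ascending timestamp order.
    versions = {"P": [(0, "Foo"), (1, "Hello")]}

    def read(obj, ts):
        # Read the most recent version written at or before ts: never blocks.
        return max((v for v in versions[obj] if v[0] <= ts),
                   key=lambda v: v[0])

    def write(obj, ts, value, rts):
        if ts < rts:  # TS(Ti) < RTS(P): abort and restart Ti
            raise RuntimeError("abort: a later transaction already read P")
        versions[obj].append((ts, value))

    print(read("P", 1))   # (1, 'Hello')
    write("P", 2, "World", rts=1)
    print(read("P", 1))   # still (1, 'Hello'), even after the t2 write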

Recovery

3. (a) Discuss in detail data warehousing and data mining. (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse:
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction-processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts decision support query load away from transaction processing systems.

When and how to gather data

19

• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from the data sources.
• Keeping the warehouse exactly synchronized with the data sources (e.g., using two-phase commit) is too expensive.
• It is usually OK to have slightly out-of-date data at the warehouse.
• Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use

Schema integration.
Data cleansing:
• E.g., correct mistakes in addresses (e.g., misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").
How to propagate updates:
• The warehouse schema may be a (materialized) view of the schema from the data sources.
• Efficient techniques exist for update of materialized views.
What data to summarize:
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.
Typically warehouse data is multidimensional, with very large fact tables:
• Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema (see the sketch after this list). More complicated schema structures:
– Snowflake schema: multiple levels of dimension tables.
– Constellation: multiple fact tables.
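A minimal star-schema sketch in Python's sqlite3 (the table and column names are assumptions for illustration):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_item  (item_id  INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    -- Central fact table: small integer keys into the dimension tables,
    -- plus the measures.
    CREATE TABLE fact_sale (item_id    INTEGER REFERENCES dim_item(item_id),
                            store_id   INTEGER REFERENCES dim_store(store_id),
                            sale_date  TEXT,
                            items_sold INTEGER,
                            price      REAL);
    """)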

Data Mining:
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required:
– pre-processing of data; choice of which type of pattern to find; post-processing to find novel patterns.

Applications of Data Mining:
Prediction based on past history:
• Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
• Predict whether a customer is likely to switch brand loyalty.
• Predict whether a customer is likely to respond to "junk mail".
• Predict whether a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification:


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns – Associations:

• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation:
• E.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules:
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:

    ∀ person P: P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
    ∀ person P: P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree:
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit-risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general:
– Different branches of the tree could grow to different levels.


– Different nodes at the same level may use different partitioning attributes.

Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either:
– all (or most) of the items at the node belong to the same class, or
– all attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm:

    Procedure GrowTree(S)
        Partition(S)

    Procedure Partition(S)
        if (purity(S) > δp or |S| < δs) then return
        for each attribute A
            evaluate splits on attribute A
        use the best split found (across all attributes) to partition
            S into S1, S2, ..., Sr
        for i = 1, 2, ..., r
            Partition(Si)

Other Types of Classifiers:
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
• Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers:
Bayesian classifiers require:

• computation of p(d | cj)
• precomputation of p(cj)
• p(d) can be ignored, since it is the same for all classes

To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate p(d | cj) = p(d1 | cj) · p(d2 | cj) · ... · p(dn | cj). Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
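A toy naive Bayesian classifier in Python, reusing the credit-risk attributes from the classification example (the training data is made up):

    from collections import Counter, defaultdict

    train = [({"degree": "masters",   "income": "high"}, "excellent"),
             ({"degree": "bachelors", "income": "mid"},  "good"),
             ({"degree": "masters",   "income": "high"}, "excellent")]

    prior = Counter(c for _, c in train)
    hist = defaultdict(Counter)          # hist[(class, attr)][value] = count
    for d, c in train:
        for a, v in d.items():
            hist[(c, a)][v] += 1

    def classify(d):
        def score(c):  # p(cj) times the per-attribute histogram estimates
            p = prior[c] / len(train)
            for a, v in d.items():
                p *= hist[(c, a)][v] / prior[c]
            return p
        return max(prior, key=score)

    print(classify({"degree": "masters", "income": "high"}))  # 'excellent'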

Histograms on multiple attributes are more expensive to compute and store.

Regression:
Regression deals with the prediction of a value, rather than a class.

• Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, a2, ..., an such that

    Y = a0 + a1·X1 + a2·X2 + ... + an·Xn

• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

• Regression aims to find coefficients that give the best possible fit.
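A minimal least-squares sketch in Python for one variable, Y = a0 + a1·X (the data points are made up):

    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.1, 3.9, 6.2, 7.8]

    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Closed-form least-squares coefficients for a line.
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    print(a0, a1)  # only an approximate fit: the data is noisy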

Association Rules:
• Retail shops are often interested in associations between the different items that people buy.
– Someone who buys bread is quite likely also to buy milk.
– A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

• Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.

• Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


• Left-hand side: antecedent; right-hand side: consequent.
• An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support: rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules:
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater). Naive algorithm (a sketch of support and confidence follows the algorithm):

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
– Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
– From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A); confidence of rule = support(A) / support(A - {b}).
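A small Python sketch of support and confidence for bread ⇒ milk, over a made-up population of transactions:

    transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                    {"bread"}, {"milk"}, {"bread", "milk", "jam"}]

    def support(itemset):
        # Fraction of the population containing every item in itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    print(support({"bread", "milk"}))       # 0.6
    print(confidence({"bread"}, {"milk"}))  # 0.75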

Other Types of Associations:
• Basic association rules have several limitations.
• Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
– Positive correlation: co-occurrence is higher than predicted.
– Negative correlation: co-occurrence is lower than predicted.
• Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
• Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer (not surprising, part of a known pattern); look for deviations from the value predicted using past patterns.


Clustering:
• Intuitively: finding clusters of points in the given data, such that similar points lie in the same cluster.

• Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k), such that the average distance of points from the centroid of their assigned group is minimized.
– Centroid: the point defined by taking the average of the coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster.

• Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the BIRCH clustering algorithm (more shortly).
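A minimal k-means sketch in Python, with toy one-dimensional data and naive initialization (algorithms like BIRCH are designed to scale far beyond this):

    points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
    centroids = [points[0], points[3]]        # naive initial guesses

    for _ in range(10):
        # Assign each point to its nearest centroid ...
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # ... then recompute each centroid as the mean of its cluster.
        centroids = [sum(ps) / len(ps) for ps in clusters.values() if ps]

    print(sorted(centroids))  # two centroids, near 1.0 and 8.0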

Hierarchical Clustering:
• Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases:

• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
– the limited and intermittent connectivity afforded by wireless communications;
– the limited life of the power supply (battery);
– the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture:

The general architecture of a mobile platform is illustrated in Fig. 30.1.


• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
– Fixed hosts are general-purpose computers configured to manage mobile units.
– Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications:
• The wireless medium has bandwidth significantly lower than that of a wired network.
• The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships:
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
– To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
– Mobile units must be unrestricted throughout the cells of a domain, while maintaining information access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments:
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
– The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
– One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
– Servers must keep track of client locations in order to efficiently route messages to them.
– Client data should be stored in the network location that minimizes the traffic necessary to access it.
– The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
– Client mobility also allows new applications that are location-based.

Data Management Issues:
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
– A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
– A server cannot connect to a client at will.
– Issues of wireless versus wired client connections and power conservation are generally immaterial.
– A client is free to manage its own data and transactions while it is disconnected; it can also perform its own recovery to some extent.
– A client has multiple ways of connecting to a server, and in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases:
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications:
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R model in detail. (JUNE 2010)
Or

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model:
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets:
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes:
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set. Examples:
    customer = (customer-name, social-security, customer-street, customer-city)
    account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes

Relationship Sets:
A relationship is an association among several entities. Example: the customer entity Hayes is related to the account entity A-102 via the depositor relationship set.

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set:
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles:
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues:

sect Role labels are optional and are used to clarify semantics of the relationship Design Issues

sect Use of entity sets vs attributesChoice mainly depends on the structure of the enterprise being modeled and on the

semantics associated with the attribute in questionsect Use of entity sets vs relationship sets

Possible guideline is to designate a relationship set to describe an action that occurs between entities

sect Binary versus n-ary relationship setsAlthough it is possible to replace a nonbinary (n-ary for n gt 2) relationship set by a

number of distinct binary relationship sets a n-ary relationship set shows more clearly that several entities participate in a single relationshipMapping Cardinalities

sect Express the number of entities to which another entity can be associated via a relationship set

sect Most useful in describing binary relationship setssect For a binary relationship set the mapping cardinality must be one of the following typesndash One to onendash One to manyndash Many to one


– Many to many
• We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship:
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationships:
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship:
• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies:
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
– y is a dominant entity (in the example below, loan).
– x is a subordinate entity (in the example below, payment).
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components:
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets:
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number: discriminator of the payment entity set.
• Primary key for payment: (loan-number, payment-number).

Specialization:
• Top-down design process: we designate subgroupings within an entity set that are distinctive from the other entities in the set.
• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization:
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization:
• Constraint on which entities can be members of a given lower-level entity set:
– condition-defined
– user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
– disjoint
– overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
– total
– partial

Aggregation:
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
– Treat the relationship as an abstract entity.
– Allows relationships between relationships.
– Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
– A customer takes out a loan.
– An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions:
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables:
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables:
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables:
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables:
• Method 1: form a table for the generalized entity set account; form a table for each entity set that is generalized (including the primary key of the generalized entity set).
• Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal and Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Databases:
Time Representation, Calendars, and Time Dimensions:
• Time is considered an ordered sequence of points in some granularity. The term chronon is used instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events:


• A single-time-point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events:
• Associated with a specific time period; a time period is represented by a start time and an end time.
Transaction time:
• The time when the information from a certain transaction becomes valid.
Bitemporal database:
• A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning:
Add to every tuple:
• a valid start time
• a valid end time
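A minimal tuple-versioning sketch in Python's sqlite3; the schema, the dates, and the open-ended '9999-12-31' sentinel are assumptions:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp_sal (emp_id INT, salary INT, "
                 "valid_start TEXT, valid_end TEXT)")
    conn.executemany("INSERT INTO emp_sal VALUES (?, ?, ?, ?)", [
        (1, 30000, "2008-01-01", "2010-01-01"),
        (1, 35000, "2010-01-01", "9999-12-31"),  # current version
    ])
    # What was employee 1's salary on 2009-06-01?
    row = conn.execute("SELECT salary FROM emp_sal WHERE emp_id = 1 "
                       "AND valid_start <= ? AND ? < valid_end",
                       ("2009-06-01", "2009-06-01")).fetchone()
    print(row)  # (30000,)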

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Databases:
Types of Spatial Data:

• Point data: points in a multidimensional space. E.g., raster data such as satellite imagery, where each pixel stores a measured value; e.g., feature vectors extracted from text.
• Region data: objects have a spatial extent, with location and boundary. The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries:
• Spatial range queries: find all cities within 50 miles of Madison. The query has an associated region (location, boundary); the answer includes overlapping or contained data regions.
• Nearest-neighbor queries: find the 10 cities nearest to Madison. Results must be ordered by proximity.
• Spatial join queries: find all cities near a lake. Expensive: the join condition involves regions and proximity.
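A brute-force sketch of a spatial range query over point data, in Python (the coordinates are made up; an index such as an R-tree avoids scanning every point):

    cities = {"A": (0.0, 0.0), "B": (30.0, 40.0), "C": (100.0, 5.0)}

    def within(center, radius):
        # Keep every city whose Euclidean distance from center is <= radius.
        cx, cy = center
        return [name for name, (x, y) in cities.items()
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2]

    print(within((0.0, 0.0), 50.0))  # ['A', 'B']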

Applications of Spatial Data:
• Geographic Information Systems (GIS): e.g., ESRI's ArcInfo, OpenGIS Consortium. Geospatial information; all classes of spatial queries and data are common.
• Computer-Aided Design/Manufacturing: store spatial objects, such as the surface of an airplane fuselage. Range queries and spatial join queries are common.
• Multimedia databases: images, video, text, etc., stored and retrieved by content. Data is first converted to feature-vector form (high dimensionality); nearest-neighbor queries are the most common.

Single-Dimensional Indexes:
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes:
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes:
• Spatial queries (GIS, CAD):
– Find all hotels within a radius of 5 miles from the conference venue.
– Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
– Find all cities that lie on the Nile in Egypt.
– Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval): given a face, find the five most similar faces.
• Multidimensional range queries: 50 < age < 55 AND 80K < sal < 90K.

Drawbacks:
• An index based on spatial location is needed.
– One-dimensional indexes don't support multidimensional searching efficiently.
– Hash indexes only support point queries; we want to support range queries as well.
– Must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes).
• The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties:
• A leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box; the box is the tightest bounding box for a data object.
• A non-leaf entry = <n-dim box, ptr to child node>. The box covers all boxes in the child node (in fact, in the subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root). We can choose a parameter m ≤ 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q:
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
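The overlap test at the heart of both steps can be sketched in a few lines of Python: two n-dimensional boxes, each given as a (lo, hi) pair per dimension, overlap iff they overlap in every dimension (the box representation is an assumption):

    def overlaps(a, b):
        return all(lo1 <= hi2 and lo2 <= hi1
                   for (lo1, hi1), (lo2, hi2) in zip(a, b))

    Q = [(0, 5), (0, 5)]   # query box
    E = [(3, 8), (4, 9)]   # entry box stored in a node
    print(overlaps(Q, E))  # True -> descend into this subtree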

Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
- This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>

Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
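A hedged sketch of the least-enlargement rule for choosing the child to descend into (helper names are hypothetical; boxes follow the same (xmin, ymin, xmax, ymax) convention as the search sketch above):

def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def enlarge(box, b):
    # Smallest box covering both box and b.
    return (min(box[0], b[0]), min(box[1], b[1]),
            max(box[2], b[2]), max(box[3], b[3]))

def choose_child(entries, b):
    # entries: list of (box, child); least enlargement wins, ties broken by smaller area.
    def cost(entry):
        box, _ = entry
        return (area(enlarge(box, b)) - area(box), area(box))
    return min(entries, key=cost)[1]

print(choose_child([((0, 0, 2, 2), "L1"), ((3, 3, 9, 9), "L2")], (1, 1, 2, 2)))  # L1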

Splitting a Node during Insertion

The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute the entries so as to minimize the area of L1 plus the area of L2.

R-Tree Variants

The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST

The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in the insert/delete/search and even concurrency-control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest-neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, the contrast (the ratio of the distances to the farthest and nearest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given dataset, it is advisable to empirically test the contrast.
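A small illustrative sketch of such an empirical test (my own example, assuming uniformly random points and Euclidean distance):

import math, random

def contrast(points, q):
    # Ratio of the farthest to the nearest distance from query point q.
    dists = [math.dist(q, p) for p in points]
    return max(dists) / min(dists)

for dim in (2, 10, 100):
    pts = [[random.random() for _ in range(dim)] for _ in range(1000)]
    q = [random.random() for _ in range(dim)]
    print(dim, round(contrast(pts, q), 2))   # contrast typically shrinks as dim grows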

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases

Introduction

Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory, and disks have dropped sharply.

Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism

Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.

Horizontal partitioning – the tuples of a relation are divided among many disks, such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.

Hash partitioning:
- Choose one or more attributes as the partitioning attributes.


- Choose a hash function h with range 0 ... n − 1.
- Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Range partitioning:
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn−2] is chosen.
- Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; and tuples with v ≥ vn−2 go to disk n − 1.
- E.g., with partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
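The three techniques, as a small illustrative Python sketch (function names are my own):

def round_robin(i, n):
    # The i-th tuple inserted in the relation goes to disk i mod n.
    return i % n

def hash_partition(v, n):
    # Hash function h with range 0 ... n-1, applied to the partitioning attribute.
    return hash(v) % n

def range_partition(v, vec):
    # vec is the partitioning vector, e.g. [5, 11] for n = 3 disks.
    for i, bound in enumerate(vec):
        if v < bound:
            return i
    return len(vec)

print(range_partition(2, [5, 11]))    # 0
print(range_partition(8, [5, 11]))    # 1
print(range_partition(20, [5, 11]))   # 2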

Comparison of Partitioning Techniques

Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 ≤ r.A < 25.

Round-robin:
- Advantages: best suited for a sequential scan of the entire relation on each query. All disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
- Range queries are difficult to process: there is no clustering, so tuples are scattered across all disks.

Hash partitioning:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced.
- Good for point queries on the partitioning attribute: the lookup touches a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if the result tuples come from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted – an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples that will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew

The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others have fewer.

Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning

To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms

A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
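An illustrative Python sketch of the sort-based construction described above (names are my own; it assumes the number of tuples is much larger than n):

def balanced_partition_vector(values, n):
    # values: partitioning-attribute values of all tuples; n: number of partitions.
    s = sorted(values)                      # sort on the partitioning attribute
    step = len(s) // n
    # After every 1/n-th of the relation, record the next attribute value.
    return [s[i * step] for i in range(1, n)]

print(balanced_partition_vector(range(100), 4))   # [25, 50, 75]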


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol

Example of a cache-coherency protocol for shared-disk systems:

- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
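A toy Python sketch of this protocol's lock/read/write-back sequence (dict-based stand-ins for the disk, lock table, and buffer; all names are my own):

# Stand-ins: "disk" holds page contents; "locks" tracks lock modes; "buffer" is the local cache.
disk, locks, buffer = {1: "page-1 data"}, {}, {}

def lock_page(pid, mode):
    locks[pid] = mode            # 'S' (shared) or 'X' (exclusive)
    buffer[pid] = disk[pid]      # on locking, the page must be read from disk

def unlock_page(pid, modified):
    if modified:
        disk[pid] = buffer[pid]  # write back to disk before unlocking
    del locks[pid]

lock_page(1, 'X')
buffer[1] = "updated data"
unlock_page(1, modified=True)
print(disk[1])                   # 'updated data'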

More complex protocols with fewer disk reads/writes exist.

Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism

Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.

Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.

The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically greater than the number of operations in a query.

Parallel Sort

Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn−1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

  - The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.
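A compact Python sketch of the range-partitioning sort (illustrative only; here each list partition plays the role of a processor/disk pair):

def parallel_range_sort(tuples, vec):
    # vec: range-partition vector, e.g. [5, 11] for three partitions.
    parts = [[] for _ in range(len(vec) + 1)]
    for t in tuples:                       # redistribute using range partitioning
        i = sum(t >= b for b in vec)       # index of the range t falls into
        parts[i].append(t)
    for p in parts:                        # each "processor" sorts locally
        p.sort()
    return [t for p in parts for t in p]   # final merge: simple concatenation

print(parallel_range_sort([8, 2, 20, 13, 1], [5, 11]))   # [1, 2, 8, 13, 20]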

Parallel Join
- The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si on the join attributes. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case – asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
  - E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor, based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.
- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case; e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
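A small Python sketch of the overall scheme (illustrative; each list index simulates a processor, hash() plays the role of h1, and a per-processor dict plays the role of the local h2 hash table):

def parallel_hash_join(r, s, n, key_r, key_s):
    s_parts = [[] for _ in range(n)]          # h1 distributes the build relation s
    for t in s:
        s_parts[hash(key_s(t)) % n].append(t)
    r_parts = [[] for _ in range(n)]          # then h1 redistributes the probe relation r
    for t in r:
        r_parts[hash(key_r(t)) % n].append(t)
    out = []
    for i in range(n):                        # each processor joins locally
        build = {}                            # local hash table (the role of h2)
        for t in s_parts[i]:
            build.setdefault(key_s(t), []).append(t)
        for t in r_parts[i]:                  # probe phase
            for m in build.get(key_r(t), []):
                out.append((t, m))
    return out

print(parallel_hash_join([(1, 'a'), (2, 'b')], [(1, 'x')], 2,
                         key_r=lambda t: t[0], key_s=lambda t: t[0]))
# [((1, 'a'), (1, 'x'))]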

Parallel Nested-Loop Join

Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism

Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.

  - We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem:
  - E.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases

A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.

Rule-based systems are specialized software that encapsulates "human-intelligence-like" knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems & artificial intelligence.

Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering – building an expert system.
- Knowledge engineers – the people who build the system.
- Knowledge representation – the symbols used to represent the knowledge.
- Factual knowledge – knowledge of a particular task domain that is widely shared.


- Heuristic knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solves problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications

A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.

Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.

Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction

A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle (a sketch of this cycle in code appears after this list):
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
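A minimal Python sketch of the match-resolve-act cycle (my own toy example; conflict resolution is simplistically "pick the first rule"):

def run_production_system(rules, wm):
    # rules: list of (condition, action) pairs over working memory wm (a set of facts).
    while True:
        # Match phase: the conflict set is every rule whose left-hand side is satisfied.
        conflict_set = [rule for rule in rules if rule[0](wm)]
        if not conflict_set:
            return wm                      # no production satisfied: the interpreter halts
        cond, action = conflict_set[0]     # Conflict-resolution phase (simplistic choice)
        action(wm)                         # Act phase: may change working memory

# Hypothetical rule: it fires once, because its LHS checks that the conclusion is absent.
rules = [(lambda wm: {"fever", "rash"} <= wm and "measles" not in wm,
          lambda wm: wm.add("measles"))]
print(run_production_system(rules, {"fever", "rash"}))   # {'fever', 'rash', 'measles'}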

Components of a Rule-Based System
- Set of rules – derived from the knowledge base, and used by the interpreter to evaluate the input data.
- Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward chaining – starts from a set of conditions and moves towards some conclusion.
- Backward chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human experts' intelligence.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers for competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Unable to recognize when no answer is available.

Knowledge Bases

Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic. (Heuristics in search vs. in KBS: general vs. domain-specific.)
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference
- Logic: resolution principle.
- Production rules: backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity-based.

KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - the simplest; example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex; IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful; support multiple KR paradigms & reasoning schemes;
  - a generic tool applicable to a wide range of problems.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:
  - requires more time and effort; no constraints, unlike shells; shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).

Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions. E.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information. E.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations. E.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem. E.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development, or financial planning. E.g., HRM.
(6) Monitoring – to check performance & flag exceptions. E.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence. E.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance. E.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions. E.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages
- Increase the availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimics the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared to conventional (centralized) databases:
- More complex to implement and more costly to maintain.
- Security and integrity control is harder; standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of a DB could be:

Time  Object1   Object2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1   Object2    Object3


t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"      –
t0    "Foo"     "Bar"      –

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
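A minimal Python sketch of the timestamp-ordered reads and writes described above (a simplification; version garbage collection and full transaction handling are omitted):

class MVObject:
    def __init__(self):
        self.versions = []    # list of (write_ts, value), kept sorted by write_ts
        self.read_ts = 0      # largest timestamp of any transaction that read this object

    def read(self, ts):
        # Return the most recent version preceding the transaction timestamp ts.
        self.read_ts = max(self.read_ts, ts)
        older = [v for w, v in self.versions if w <= ts]
        return older[-1] if older else None

    def write(self, ts, value):
        if ts < self.read_ts:                 # TS(Ti) < RTS(P): abort and restart
            raise RuntimeError("abort and restart transaction")
        self.versions.append((ts, value))
        self.versions.sort()

p = MVObject()
p.write(1, "Foo")
p.write(2, "Hello")
print(p.read(1))   # 'Foo'   - an old snapshot remains readable
print(p.read(5))   # 'Hello' - the latest version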


(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the mobility, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can move unrestricted throughout the cells of the domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

- In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments

The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.

The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients:

- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:

- Servers must keep track of client locations, in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues

From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query- and transaction-management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data-management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues, as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, available communication nodes, available resources, etc.

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
  - Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra operations, with a choice of algorithm for each operation.


  - Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered?
    - An algorithm to search the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
- Ideally: we want to find the best plan. Practically: we avoid the worst plans.
- We will study the System R approach.

Schema for Examples

Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database, in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed, steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.

- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause system buffers to overflow.
  - Synchronization among distinct data streams must be maintained:
    - video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - They must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval

Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different, as represented in the database, may be considered the same by a user.
  - E.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.
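A toy Python sketch of similarity-based retrieval over feature vectors (illustrative only; real systems use multidimensional indexes rather than a linear scan):

import math

def k_most_similar(db, query_vec, k):
    # db: list of (item_id, feature_vector); smaller distance = more similar.
    ranked = sorted(db, key=lambda e: math.dist(e[1], query_vec))
    return [item for item, _ in ranked[:k]]

db = [("img1", [0.9, 0.1]), ("img2", [0.2, 0.8]), ("img3", [0.85, 0.2])]
print(k_most_similar(db, [0.9, 0.15], k=2))   # ['img1', 'img3']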

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases

SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries?
- Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog

SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations

The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

(i) Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

(ii) Safe Datalog Programs


Consider the following program:

ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

(iii) The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers. (I.e., D is the set of all sets of integers.)
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

(iv) Least Model = Least Fixpoint

Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

(b) Recursive Queries with Negation

Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation

If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, then S is in a lower stratum than T.

Relational Algebra and Stratified Datalog

Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).

- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences

Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. The recursive Comp rule, for example, becomes:

Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
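A short Python sketch of seminaive evaluation for the Comp program (illustrative only; relations are modelled as sets of tuples):

def seminaive_comp(assembly):
    # assembly: set of (part, subpart, qty) facts.
    comp = {(p, s) for (p, s, q) in assembly}      # base case: direct subparts
    delta = set(comp)                              # delta_Comp: tuples from the last iteration
    while delta:
        # The rule body joins Assembly with delta_Comp only, so every new
        # inference uses at least one fact generated in the previous iteration.
        new = {(p, s2) for (p, s1, q) in assembly
                       for (s1b, s2) in delta if s1 == s1b} - comp
        comp |= new
        delta = new
    return comp

asm = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}
print(sorted(seminaive_comp(asm)))
# [('trike', 'spoke'), ('trike', 'tire'), ('trike', 'wheel'), ('wheel', 'spoke'), ('wheel', 'tire')]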


15.4.2 Pushing Selections to Avoid Irrelevant Inferences

SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm

- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.

Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program-rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



10 What are Deductive Databases (JUNE 2010) (NOV DEC 2010)
A deductive database is the combination of a conventional database containing facts, a knowledge base containing rules, and an inference engine which allows the derivation of information implied by the facts and rules.

A deductive database system specifies rules through a declarative language: a language in which we specify what to achieve, rather than how to achieve it. An inference engine within the system can deduce new facts from the database by interpreting these rules. The model used for deductive databases is related to the relational data model, and also to the field of logic programming and the Prolog language.


11 What is query processing (NOV DEC 2010)
Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the internal form of the query to obtain suitable execution strategies, and then executing the queries to get the results.

12 Give two features of object-oriented databases (NOV DEC 2010)
The features of object-oriented databases are:
- They provide persistent storage for objects.
- They may provide one or more of the following: a query language; indexing; transaction support with rollback and commit; the possibility of distributing objects transparently over many servers.

13 What is Data warehousing (NOV DEC 2010)Data warehousing It is the process that is used to integrate and combine data from multiple sources and format into a single unified schema So it provides the enterprise with a storage mechanism for its huge amount of data

14 What is Normalization (NOV DEC 2010)Normalization is a process followed for eliminating redundant data and establishes a meaningful relationship among tables based on rules and regulations in order to maintain integrity of data It is done for maintaining storage space and also for performance tuning

15 Mention two features of parallel databases (NOV DEC 2010)
- They provide speedup, where queries are executed faster because more resources, such as processors and disks, are provided.
- They provide scaleup, where increasing workloads are handled without increased response time, via an increase in the degree of parallelism.


Part B – (5 × 16 = 80 Marks) (JUNE 2010) & (DECEMBER 2010)

1 (a)Explain the architecture of Distributed Databases (16) (JUNE 2010) Or

(b) Discuss in detail the architecture of distributed database (16) (NOVDEC 2010)


(b)Write notes on the following (i) Query processing (8) (JUNE 2010)


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency, providing:
- concurrency transparency
- failure transparency

Example Transaction – SQL Version:
Begin_transaction Reservation
begin


    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC (FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end. {Reservation}
Properties of Transactions
ATOMICITY: all or nothing
CONSISTENCY: no violation of integrity constraints
ISOLATION: concurrent changes invisible (i.e., serializable)
DURABILITY: committed updates persist
These are the ACID properties of a transaction.
Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency: Internal consistency:
- A transaction which executes alone against a consistent database leaves it in a consistent state.
- Transactions do not violate database integrity constraints.
- Transactions are correct programs.
Isolation: Degrees of isolation:
Degree 0
- Transaction T does not overwrite dirty data of other transactions.
- Dirty data refers to data values that have been updated by a transaction prior to its commitment.
Degree 1
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT (end of transaction).
Degree 2
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.
Degree 3
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.


- Other transactions do not dirty any data read by T before T completes.
Isolation: Serializability
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- This is necessary to avoid cascading aborts.
Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This is ensured by database recovery.
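For illustration, a minimal Python sketch of atomic commit/rollback using the standard sqlite3 module; the schema mirrors the reservation example above, and the table/column names (FDATE instead of DATE) and sample values are illustrative assumptions.

import sqlite3

# Minimal sketch: the reservation either fully commits or is rolled back.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FLIGHT (FNO TEXT, FDATE TEXT, STSOLD INTEGER)")
conn.execute("CREATE TABLE FC (FNO TEXT, FDATE TEXT, CNAME TEXT, SPECIAL TEXT)")
conn.execute("INSERT INTO FLIGHT VALUES ('F101', '2010-06-01', 0)")

def reserve(flight_no, date, customer_name):
    try:
        conn.execute("UPDATE FLIGHT SET STSOLD = STSOLD + 1 "
                     "WHERE FNO = ? AND FDATE = ?", (flight_no, date))
        conn.execute("INSERT INTO FC VALUES (?, ?, ?, NULL)",
                     (flight_no, date, customer_name))
        conn.commit()            # both updates become durable together
        print("reservation completed")
    except Exception:
        conn.rollback()          # partial effects are undone (atomicity)
        raise

reserve("F101", "2010-06-01", "Hayes")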

Transaction transparency: Ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of sub-transactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of its sub-transactions.
Concurrency transparency: All transactions must execute independently, and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.
Failure transparency: Must ensure atomicity and durability of the global transaction.
- This means ensuring that the sub-transactions of the global transaction either all commit or all abort.
Classification: In IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a)Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)

Or
(b) Describe modeling and design approaches for object oriented database (16)

(NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important.

Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling: An object consists of three parts: structure (attributes and relationships to other objects, like aggregation and association), behavior (a set of operations), and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables, linked by keys to the main table.
Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.

Generalization/specialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
Message: The means by which objects communicate; a message is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...) is a request to execute the insert method on a Publisher object.
Method: Defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. An example of a method that responds to a message is the method insert defined in the Publisher class.
The main differences between relational database design and object-oriented database design include:
- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction (Ti) is allowed to read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:
Time  Object1  Object2
t0    "Foo"    "Bar"
t1    "Hello"  "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      -
t0    "Foo"    "Bar"      -
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
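The timestamp mechanics can be sketched with a toy in-memory multiversion store; this illustrative Python sketch (all names are assumptions, not any real DBMS API) keeps every version of a key and serves each reader the latest version not newer than its snapshot.

import itertools

class MVStore:
    # Toy multiversion store: one version list per key, never updated in place.
    def __init__(self):
        self.versions = {}                  # key -> [(write_ts, value), ...]
        self.clock = itertools.count(1)     # monotonically increasing txn IDs

    def begin(self):
        return next(self.clock)             # snapshot timestamp for a reader

    def write(self, key, value):
        ts = next(self.clock)               # old versions are kept, not overwritten
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, snapshot_ts):
        # Latest version whose write timestamp precedes the reader's snapshot.
        candidates = [(ts, v) for ts, v in self.versions.get(key, [])
                      if ts <= snapshot_ts]
        return max(candidates)[1] if candidates else None

db = MVStore()
db.write("Object1", "Foo")
reader_ts = db.begin()          # long-running read transaction starts here
db.write("Object1", "Hello")    # concurrent update creates a newer version
print(db.read("Object1", reader_ts))   # -> 'Foo': the reader's snapshot is unaffected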

Recovery

3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining (16)(NOVDEC 2010)

Data Warehouse:
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
- It greatly simplifies querying and permits study of historical trends.
- It shifts the decision support query load away from transaction processing systems.

When and how to gather data


- Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
- Destination-driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
- It is usually OK to have slightly out-of-date data at the warehouse.
- Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
- Schema integration
Data cleansing
- E.g., correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates.
- Keep only one address record per household ("householding").
How to propagate updates
- The warehouse schema may be a (materialized) view of the schema from data sources.
- Efficient techniques exist for update of materialized views.
What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.
Typically, warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
More complicated schema structures:
- Snowflake schema: multiple levels of dimension tables.
- Constellation: multiple fact tables.

Data Mining:
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, and post-processing to find novel patterns.

Applications of Data Mining:
Prediction based on past history
- Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
- Predict if a customer is likely to switch brand loyalty.
- Predict if a customer is likely to respond to "junk mail".
- Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification


- Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae
- Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations
- Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
- Other similar applications: camera accessories, clothes, etc.
- Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters
- E.g., typhoid cases were clustered in an area surrounding a contaminated well.
- Detection of clusters remains important in detecting epidemics.
Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.

∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general, different branches of the tree could grow to different levels.


- Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees
- Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
- The algorithm is greedy: the choice is made once, and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either: all (or most) of the items at the node belong to the same class, or all attributes have been considered and no further partitioning is possible. Such a node is a leaf node.
- Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
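A runnable Python rendering of the greedy procedure above; purity is taken to be the majority-class fraction, and the thresholds and toy records are illustrative assumptions.

from collections import Counter

DP, DS = 0.95, 2   # purity and size thresholds (the δp / δs of the pseudocode)

def purity(rows):
    classes = Counter(label for _, label in rows)
    return max(classes.values()) / len(rows)

def grow_tree(rows, attrs):
    # Stop: node pure enough, too small, or no attributes left -> leaf (majority label).
    if purity(rows) >= DP or len(rows) <= DS or not attrs:
        return Counter(label for _, label in rows).most_common(1)[0][0]
    def split(a):
        groups = {}
        for rec, label in rows:
            groups.setdefault(rec[a], []).append((rec, label))
        return groups
    # Greedy choice: the attribute whose split yields the highest average purity.
    best = max(attrs, key=lambda a: sum(purity(g) for g in split(a).values())
                                    / len(split(a)))
    # One child per value of the chosen attribute; recurse on each partition.
    return {(best, v): grow_tree(g, [a for a in attrs if a != best])
            for v, g in split(best).items()}

rows = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "mid"}, "good"),
        ({"degree": "bachelors", "income": "low"}, "average")]
print(grow_tree(rows, ["degree", "income"]))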

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; we won't cover them here.
- Bayesian classifiers use Bayes' theorem, which says
      p(cj | d) = p(d | cj) p(cj) / p(d)
  where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
      p(d | cj) = p(d1 | cj) p(d2 | cj) ... p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
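A minimal Python sketch of this histogram-based estimation for categorical attributes (the training data is made up for illustration).

from collections import Counter, defaultdict

# Training instances: (attribute tuple, class label); hypothetical data.
train = [(("masters", "high"), "excellent"),
         (("bachelors", "mid"), "good"),
         (("masters", "mid"), "excellent")]

class_counts = Counter(c for _, c in train)
# One histogram per (attribute position, class): counts of attribute values.
hist = defaultdict(Counter)
for attrs, c in train:
    for i, v in enumerate(attrs):
        hist[(i, c)][v] += 1

def classify(d):
    scores = {}
    for c, n in class_counts.items():
        p = n / len(train)                  # p(cj)
        for i, v in enumerate(d):           # p(d|cj) = product of the p(di|cj)
            p *= hist[(i, c)][v] / n
        scores[c] = p                       # p(d) omitted: same for all classes
    return max(scores, key=scores.get)

print(classify(("masters", "mid")))         # -> 'excellent'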

- Histograms on multiple attributes are more expensive to compute and store.
Regression
Regression deals with the prediction of a value, rather than a class.

- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
- One way is to infer coefficients a0, a1, ..., an such that
      Y = a0 + a1 X1 + a2 X2 + ... + an Xn
- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
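A least-squares fit in a few lines of NumPy; the sample points are illustrative (here Y = 1 + X1 + 2 X2 exactly, so the fit recovers those coefficients).

import numpy as np

# Fit Y = a0 + a1*X1 + a2*X2 by least squares (curve fitting with a linear form).
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # sample X1, X2
Y = np.array([6.0, 5.0, 12.0, 11.0])                              # target values

A = np.column_stack([np.ones(len(X)), X])        # prepend a column of 1s for a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)   # best-fit a0, a1, a2
a0, a1, a2 = coeffs
print(f"Y = {a0:.2f} + {a1:.2f}*X1 + {a2:.2f}*X2")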

Association Rules
- Retail shops are often interested in associations between the different items that people buy: someone who buys bread is quite likely also to buy milk; a person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Associations information can be used in several ways, e.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks


- Left hand side: antecedent; right hand side: consequent.
- An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
- Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naïve algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A − {b}).
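A direct, non-optimized Python rendering of the support and confidence definitions (the baskets are made-up data).

# Each transaction is the set of items in one purchase (hypothetical baskets).
baskets = [{"bread", "milk"}, {"bread", "milk", "screwdriver"},
           {"bread"}, {"milk", "cereal"}, {"bread", "milk", "cereal"}]

def support(itemset):
    # Fraction of transactions containing every item of the itemset.
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    # How often the consequent holds when the antecedent holds.
    return support(antecedent | consequent) / support(antecedent)

# Rule bread => milk: how often does milk appear when bread does?
print(support({"bread", "milk"}))          # 0.6  (3 of 5 baskets)
print(confidence({"bread"}, {"milk"}))     # 0.75 (3 of the 4 bread baskets)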

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
- We are interested in positive as well as negative correlations between sets of items: positive correlation, where co-occurrence is higher than predicted; negative correlation, where co-occurrence is lower than predicted.
- Sequence associations/correlations. E.g., whenever bonds go up, stock prices go down within 2 days.
- Deviations from temporal patterns. E.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. This is not surprising, as it is part of a known pattern; look for deviations from the value predicted using past patterns.
Clustering


- Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. (Centroid: the point defined by taking the average of the coordinates in each dimension.)
- Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm (more shortly). A minimal k-means sketch is shown below.
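The k-means sketch referenced above: a minimal Python illustration of the "k sets minimizing distance to centroid" formulation (the points and k are made-up inputs).

import math, random

def kmeans(points, k, iters=20):
    random.seed(0)
    centroids = random.sample(points, k)           # initial centroids: k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # assign each point to nearest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        for i, c in enumerate(clusters):           # recompute centroid of each cluster
            if c:
                centroids[i] = tuple(sum(x) / len(c) for x in zip(*c))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(pts, k=2)
print(centroids)   # two centroids, one near (1.3, 1.3) and one near (8.3, 8.3)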

Hierarchical Clustering
- Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, then repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.

In mobile computing, the problems are more difficult, mainly because of:
- the limited and intermittent connectivity afforded by wireless communications;
- the limited life of the power supply (battery);
- the changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers, configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications:
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
Client/Network Relationships:
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments:
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down), or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges. Servers must keep track of client locations in order to efficiently route messages to them. Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues:
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database, with DBMS-like functionality, plus additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security
Application: Intermittently Synchronized Databases


- Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
- a collection of entities
- relationships among entities
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Example: Hayes (customer entity) is linked to A-102 (account entity) via depositor (relationship set).
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues
- Use of entity sets vs. attributes: The choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: A possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: Although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship
- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.
One-To-Many and Many-To-One Relationships
- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.
Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.
Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan).
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.
Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).
Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets, which have attributes or participate in relationships that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).
Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial
Aggregation
- The relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - This allows relationships between relationships.
  - Abstraction of the relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table, with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions
- Time is considered an ordered sequence of points in some granularity.
- The term chronon is used instead of point to describe the minimum granularity.
- A calendar organizes time into different time units for convenience. Various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events


- A single time point event, e.g., a bank deposit.
- A series of point events can form time series data.
Duration events
- Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
- The time when the information from a certain transaction becomes valid.
Bitemporal database
- A database dealing with two time dimensions (valid time and transaction time).
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
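A small Python sketch of tuple versioning (the table, the "now" sentinel, and the dates are illustrative assumptions).

import datetime as dt

NOW = dt.date.max   # sentinel meaning "valid until further notice"

# Each row carries (value, valid_start, valid_end); history is never overwritten.
salary_history = [("E1", 40000, dt.date(2008, 1, 1), dt.date(2009, 12, 31)),
                  ("E1", 45000, dt.date(2010, 1, 1), NOW)]

def salary_on(emp, day):
    for e, sal, start, end in salary_history:
        if e == emp and start <= day <= end:
            return sal
    return None   # employee had no valid salary on that day

print(salary_on("E1", dt.date(2009, 6, 1)))   # 40000: the historical version
print(salary_on("E1", dt.date(2010, 6, 1)))   # 45000: the current version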

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.
Types of Spatial Queries
Spatial range queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-neighbor queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial join queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.
Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.
Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box. The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>. The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root). One can choose a parameter m that is <= 50% and ensure that every node is at least m full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root:
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
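These two steps translate almost directly into Python; a minimal sketch over axis-aligned boxes (the node layout and sample boxes are illustrative assumptions, not a real R-tree implementation).

def overlaps(a, b):
    # Boxes are ((xlo, ylo), (xhi, yhi)); they overlap unless separated on some axis.
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def search(node, q, hits):
    if node["leaf"]:
        for box, rid in node["entries"]:        # step 2: report candidate objects
            if overlaps(box, q):
                hits.append(rid)
    else:
        for box, child in node["entries"]:      # step 1: descend overlapping subtrees
            if overlaps(box, q):
                search(child, q, hits)
    return hits

leaf = {"leaf": True, "entries": [(((0, 0), (2, 2)), "r1"), (((5, 5), (7, 7)), "r2")]}
root = {"leaf": False, "entries": [(((0, 0), (7, 7)), leaf)]}
print(search(root, ((1, 1), (3, 3)), []))   # -> ['r1']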

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately? This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether. The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently; generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.
R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting: remove some (say, 30% of the) entries and reinsert them into the tree. This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary. Searches now take a single path to a leaf, at the cost of redundancy.
GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.
Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions. E.g., feature vectors in multimedia applications. Very sparse.
- Nearest-neighbor queries are common. The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, so "nearest neighbor" is not meaningful. In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)Parallel DatabasesIntroductionParallel machines are becoming quite common and affordable

Prices of microprocessors memory and disks have dropped sharplyDatabases are growing increasingly large

Large volumes of transaction data are collected and stored for later analysis multimedia objects like images are increasingly stored in databases

Large-scale parallel database systems increasingly used for storing large volumes of data processing time-consuming decision-support queries providing high throughput for transaction processing

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel.

Data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.

Different queries can be run in parallel with each other, and concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

- Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes.


  Choose a hash function h with range 0 … n-1. Let i denote the result of the hash function h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn-2] is chosen. Let v be the partitioning-attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, and a tuple with value 20 will go to disk 2.
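The three schemes map a tuple to a disk as in the following Python sketch (illustrative only; Python's built-in hash stands in for the chosen hash function h):

    import bisect

    def round_robin_disk(i, n):
        # Round-robin: the i-th inserted tuple goes to disk i mod n.
        return i % n

    def hash_disk(value, n):
        # Hash partitioning: h(value) picks one of the n disks.
        return hash(value) % n

    def range_disk(value, vector):
        # Range partitioning with vector [v0, ..., v(n-2)]:
        # v < v0 -> disk 0; vi <= v < vi+1 -> disk i+1; v >= v(n-2) -> disk n-1.
        return bisect.bisect_right(vector, value)

    # With vector [5, 11]: range_disk(2, [5, 11]) == 0,
    # range_disk(8, [5, 11]) == 1, range_disk(20, [5, 11]) == 2.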

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 ≤ r.A < 25.
Round-robin
- Best suited for a sequential scan of the entire relation on each query: all disks have an almost equal number of tuples, so retrieval work is well balanced between the disks.
- Range queries are difficult to process: there is no clustering, and tuples are scattered across all disks.
Hash partitioning
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between the disks, and retrieval work is then well balanced between them.
- Good for point queries on the partitioning attribute: the lookup touches a single disk, leaving the others available for answering other queries, and an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so range queries are difficult to answer.

Range partitioning
- Provides data clustering by partitioning-attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries.
- Good if the result tuples come from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted; this is an example of execution skew.

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks: if a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others have fewer.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples, and all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with both range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning if a good hash function is chosen.
Handling Skew in Range Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes. An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
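A minimal Python sketch of the sorted-scan construction (illustrative; it assumes the partitioning-attribute values fit in memory, which a real system avoids by scanning or sampling):

    def balanced_partition_vector(values, n):
        # Build an (n-1)-entry range-partitioning vector so that each of
        # the n partitions receives roughly len(values) / n tuples.
        values = sorted(values)        # scan in sorted order
        step = len(values) // n        # 1/n-th of the relation
        # after every 1/n-th of the tuples, record the next attribute value
        return [values[i * step] for i in range(1, n)]

    # e.g. balanced_partition_vector([1, 2, 5, 7, 8, 11, 15, 20, 21], 3)
    # returns [7, 15]: three partitions of three tuples each.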


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput, and is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols, with fewer disk reads/writes, exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, …, Pm, where m ≤ n - 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

- All tuples that lie in the i-th range are sent to processor Pi.
- Pi stores the tuples it receives temporarily on disk Di.
- This step requires I/O and communication overhead.
Each processor Pi then sorts its partition of the relation locally. Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, …, Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. The merging of the sorted runs is parallelized as follows:
- The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm-1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, …, Pm-1 are concatenated to get the final result.
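The data flow of a range-partitioning sort can be sketched as follows (an illustrative, single-process Python sketch; the per-partition lists simply stand in for the per-processor disks):

    import bisect

    def range_partition_sort(tuples, vector):
        # Redistribute by range, sort each partition locally, then
        # concatenate; range partitioning makes a final merge unnecessary.
        partitions = [[] for _ in range(len(vector) + 1)]
        for t in tuples:                 # redistribution step
            partitions[bisect.bisect_right(vector, t)].append(t)
        for p in partitions:             # local sorts, one per "processor"
            p.sort()
        result = []
        for p in partitions:             # trivial final concatenation
            result.extend(p)
        return result

    # range_partition_sort([8, 2, 20, 11, 5], [5, 11]) -> [2, 5, 8, 11, 20]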

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally, and in a final step the results from the processors are collected together to produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor. Let r and s be the input relations, and suppose we want to compute the join of r and s on attributes r.A and s.B. r and s are each partitioned into n partitions, denoted r0, r1, …, rn-1 and s0, s1, …, sn-1. Either range partitioning or hash partitioning can be used, but r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi, and each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique. A special case is asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. The technique usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, so s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the processors using the same hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash join.
Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
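The two-level use of h1 (to route tuples to processors) and h2 (to build the local hash table) can be sketched in Python as follows (illustrative only; relations are modeled as lists of (key, payload) pairs, and Python's dict hashing plays the role of h2):

    def parallel_hash_join(r, s, n):
        # Partitioned parallel hash join on key equality.
        h1 = lambda key: hash(key) % n
        s_parts = [[] for _ in range(n)]   # redistribute build relation s
        r_parts = [[] for _ in range(n)]   # redistribute probe relation r
        for key, val in s:
            s_parts[h1(key)].append((key, val))
        for key, val in r:
            r_parts[h1(key)].append((key, val))
        result = []
        for i in range(n):                 # each "processor" joins locally
            table = {}                     # build phase (h2 = dict hashing)
            for key, val in s_parts[i]:
                table.setdefault(key, []).append(val)
            for key, val in r_parts[i]:    # probe phase
                for sval in table.get(key, []):
                    result.append((key, val, sval))
        return result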

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r being used. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on disk Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 the computation of temp2 = temp1 ⋈ r3,
- and P3 the computation of temp2 ⋈ r4.
Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 the computation of temp2 = r3 ⋈ r4,
- and P3 the computation of temp1 ⋈ temp2.
P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2. The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism. Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention. When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in a repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:

- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.
- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules; this is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of rules: derived from the knowledge base and used by the interpreter to evaluate the input data.
- Knowledge engineer: decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward chaining: starts from a set of conditions and moves towards some conclusion.
- Backward chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
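As an illustration, forward chaining over simple if-then rules can be sketched in Python (the rule format and the example facts are assumptions for illustration, not taken from any particular expert-system shell):

    def forward_chain(facts, rules):
        # Repeatedly fire rules whose conditions are all in working memory
        # (match -> act) until no new facts can be derived.
        facts = set(facts)
        changed = True
        while changed:
            changed = False
            for conditions, conclusion in rules:
                if conditions <= facts and conclusion not in facts:
                    facts.add(conclusion)      # act: add the conclusion
                    changed = True
        return facts

    # Illustrative rules: IF fever AND rash THEN measles; IF measles THEN isolate.
    rules = [({"fever", "rash"}, "measles"), ({"measles"}, "isolate")]
    print(forward_chain({"fever", "rash"}, rules))
    # -> {'fever', 'rash', 'measles', 'isolate'}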

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine the intelligence of multiple human experts.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers for competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.
KBS = knowledge base + inference engine.

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
- Logic: resolution principle.
- Production rules: backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity-based.
KBS tools - Shells
- Consist of a KA (knowledge-acquisition) tool, a database, & a development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range of problems.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose the reasons for an engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increase the availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.

6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages compared with conventional databases:
- More complex to implement and more costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency-control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read those versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti may read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
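The timestamp rules above can be sketched as follows (a minimal, single-threaded Python sketch; the exact visibility boundary and abort policy vary between systems, so this is illustrative only):

    class MVCCObject:
        def __init__(self):
            self.versions = []   # (write_ts, value) pairs
            self.read_ts = 0     # RTS(P): largest timestamp of any reader

        def read(self, ts):
            # Return the most recent version written at or before ts.
            self.read_ts = max(self.read_ts, ts)
            visible = [(wts, val) for wts, val in self.versions if wts <= ts]
            if not visible:
                return None
            return max(visible, key=lambda v: v[0])[1]

        def write(self, ts, value):
            # TS(Ti) < RTS(P): a later transaction already read P, so abort.
            if ts < self.read_ts:
                raise RuntimeError("abort and restart transaction")
            self.versions.append((ts, value))   # create a new version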

At t1, the state of a DB could be:
Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Foo"   | "Bar"     |
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage it, the entire mobility domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units may move unrestricted throughout the cells of the domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

- In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle-information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.
The server may not be able to reach a client:


- A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data-management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data-management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query- and transaction-management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data-management responsibility is shared among base stations or fixed hosts and mobile units.
Data-management issues, as applied to mobile databases:
- data distribution and replication;
- transaction models;
- query processing;
- recovery and fault tolerance;
- mobile database design;
- location-based services;
- division of labor;
- security.

Application: Intermittently Synchronized Databases
- Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate with a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
- Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational-algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - An algorithm searches the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
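The space of left-deep join orders can be enumerated as in the following Python sketch (illustrative only; a real optimizer such as System R builds these plans bottom-up with dynamic programming and estimates the cost of each, and the relation names here are just examples):

    from itertools import permutations

    def left_deep_orders(relations):
        # Each permutation of the relations, joined one at a time,
        # with the new relation always as the inner input.
        for perm in permutations(relations):
            plan = perm[0]
            for inner in perm[1:]:
                plan = f"({plan} JOIN {inner})"
            yield plan

    # For 3 relations this yields 3! = 6 left-deep plans:
    for p in left_deep_orders(["Sailors", "Reserves", "Boats"]):
        print(p)   # e.g. ((Sailors JOIN Reserves) JOIN Boats)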

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed, steady retrieval rates for continuous-media data.
Multimedia Data Formats
Multimedia data is stored and transmitted in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- There are several alternatives for audio encoding,


  e.g., MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet the real-time response requirements.
- Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs when registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required by SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational-algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational-algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational-algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.
ii. Safe Datalog Programs
Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
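The definition can be checked directly, as in this small Python sketch (illustrative only; it tests whether a given finite set is a fixpoint of double+):

    def double_plus(s):
        # double+(S) = {2x : x in S} ∪ S
        return {2 * x for x in s} | s

    def is_fixpoint(f, v):
        return f(v) == v

    print(is_fixpoint(double_plus, {0}))        # True: doubling 0 adds nothing
    print(is_fixpoint(double_plus, {1, 2, 5}))  # False: 4 and 10 are missing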

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, then S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
- Selection: Result(Y) :- R(X, Y), X = c.
- Projection: Result(Y) :- R(X, Y).
- Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
- Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
- Union: Result(X, Y) :- R(X, Y).
         Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2 (Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2 (Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts (Part, SUM(<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < … > is the usual restriction used to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations. For example:
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
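A seminaive fixpoint loop for the Comp program might look like the following Python sketch (illustrative; Assembly is modeled as a set of (part, subpart, qty) triples):

    def seminaive_comp(assembly):
        # Compute Comp (transitive "contains") from Assembly, joining the
        # rule body only against facts derived in the previous iteration.
        comp = {(p, s) for (p, s, q) in assembly}   # base rule
        delta = set(comp)                           # delta_Comp
        while delta:
            new = {(p, s2)
                   for (p, s1, q) in assembly
                   for (d1, s2) in delta if s1 == d1} - comp
            comp |= new
            delta = new                             # only the newest facts
        return comp

    # e.g. seminaive_comp({("trike", "wheel", 3), ("wheel", "spoke", 2)})
    # -> {("trike", "wheel"), ("wheel", "spoke"), ("trike", "spoke")}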


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program-rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 3: Database Technology

11 What is query processing? (NOV DEC 2010)
Query processing is the set of activities involved in getting the result of a query expressed in a high-level language. These activities include parsing the queries and translating them into expressions that can be implemented at the physical level of the file system, optimizing the query in its internal form to get suitable execution strategies for processing, and then doing the actual execution of the queries to get the results.

12 Give two features of object-oriented databases (NOV DEC 2010)
The features of Object-Oriented Databases are:
- They provide persistent storage for objects.
- They may provide one or more of the following: a query language; indexing; transaction support with rollback and commit; the possibility of distributing objects transparently over many servers.

13 What is Data warehousing? (NOV DEC 2010)
Data warehousing is the process that is used to integrate and combine data from multiple sources and formats into a single unified schema. So it provides the enterprise with a storage mechanism for its huge amount of data.

14 What is Normalization? (NOV DEC 2010)
Normalization is a process followed for eliminating redundant data and establishing meaningful relationships among tables, based on rules and regulations, in order to maintain the integrity of the data. It is done to save storage space and also for performance tuning.

15 Mention two features of parallel Databases (NOV DEC 2010)
- They provide speedup, where queries are executed faster because more resources, such as processors and disks, are provided.
- They also provide scaleup, where increasing workloads are handled without increased response time, via an increase in the degree of parallelism.


Part B - (5 × 16 = 80 Marks) (JUNE 2010) & (DECEMBER 2010)

1 (a) Explain the architecture of Distributed Databases. (16) (JUNE 2010)
Or

(b) Discuss in detail the architecture of distributed database. (16) (NOV/DEC 2010)


(b) Write notes on the following:
(i) Query processing (8) (JUNE 2010)


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
- concurrency transparency;
- failure transparency.

Example Transaction - SQL Version
Begin_transaction Reservation
begin


  input(flight_no, date, customer_name);
  EXEC SQL UPDATE FLIGHT
    SET STSOLD = STSOLD + 1
    WHERE FNO = flight_no AND DATE = date;
  EXEC SQL INSERT
    INTO FC(FNO, DATE, CNAME, SPECIAL)
    VALUES (flight_no, date, customer_name, null);
  output("reservation completed")
end Reservation
Properties of Transactions
- ATOMICITY: all or nothing.
- CONSISTENCY: no violation of integrity constraints.
- ISOLATION: concurrent changes are invisible, i.e., execution is serializable.
- DURABILITY: committed updates persist.
These are the ACID properties of a transaction.
Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that, if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency:
- Internal consistency: a transaction which executes alone against a consistent database leaves it in a consistent state, and transactions do not violate database integrity constraints.
- Transactions are correct programs.
Isolation:
- Degree 0: transaction T does not overwrite dirty data of other transactions. Dirty data refers to data values that have been updated by a transaction prior to its commitment.
- Degree 2: T does not overwrite dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions.
- Degree 3: T does not overwrite dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions;


  and other transactions do not dirty any data read by T before T completes.
Isolation - Serializability:
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- This is necessary to avoid cascading aborts.
Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This requires database recovery.

Transaction transparency: ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.
Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.
Failure transparency: must ensure the atomicity and durability of the global transaction.
- This means ensuring that the subtransactions of the global transaction either all commit or all abort.
Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- remote request;
- remote unit of work;
- distributed unit of work;
- distributed request.

2. (a) Discuss the modeling and design approaches for Object Oriented Databases. (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database. (16) (NOV/DEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling: an object consists of three parts: structure (attributes, and relationships to other objects, such as aggregation and association), behavior (a set of operations), and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example the attributes publishedBy and writtenBy have complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually realized as separate tables linked by keys to the main (Book) table.

Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level. For example:

Generalization/specialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, …) is a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.

The main differences between relational database design and object-oriented database design include:
• Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model; operations are one of the main components of an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages. (DECEMBER 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes.


Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti may read the most recent version of an object which precedes the transaction's timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of the DB could be:

    Time  Object1   Object2
    t1    "Hello"   "Bar"
    t0    "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo"; that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

    Time  Object1   Object2    Object3
    t2    "Hello"   (deleted)  "Foo-Bar"
    t1    "Hello"   "Bar"
    t0    "Foo"     "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
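The timestamp rules above can be sketched in a few lines of Python. This is a toy illustration of the read/write rules only; the class and field names are invented for the sketch, not taken from any particular DBMS:

class VersionedObject:
    # Each version is a triple [value, write_ts, read_ts]
    def __init__(self, value):
        self.versions = [[value, 0, 0]]

    def _visible(self, ts):
        # Most recent version whose write timestamp precedes ts
        return max((v for v in self.versions if v[1] <= ts),
                   key=lambda v: v[1])

    def read(self, ts):
        v = self._visible(ts)
        v[2] = max(v[2], ts)          # advance the read timestamp
        return v[0]

    def write(self, ts, value):
        v = self._visible(ts)
        if ts < v[2]:                 # TS(Ti) < RTS(P): abort and restart
            raise RuntimeError("abort: version already read by a later transaction")
        self.versions.append([value, ts, ts])

p = VersionedObject("Foo")
p.write(1, "Hello")
print(p.read(1))   # "Hello" -- while the old version "Foo" stays visible at ts 0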

Recovery

3. (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• It greatly simplifies querying and permits the study of historical trends.
• It shifts the decision-support query load away from transaction processing systems.

When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive; it is usually OK to have slightly out-of-date data at the warehouse.
• Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
• Schema integration

Data cleansing
• E.g. correct mistakes in addresses (misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").

How to propagate updates
• The warehouse schema may be a (materialized) view of the schemas from the data sources.
• Efficient techniques exist for updating materialized views.

What data to summarize
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Typically, warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resulting schema is called a star schema.
• More complicated schema structures:
  – Snowflake schema: multiple levels of dimension tables.
  – Constellation: multiple fact tables.

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• It differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, and post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
• Predict whether a customer is likely to switch brand loyalty.
• Predict whether a customer is likely to respond to "junk mail".
• Predict whether a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter value.
Descriptive patterns
Associations

• Find books that are often bought by the same customers. If a new customer buys one such book, suggest the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
Clusters
• E.g. typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes, e.g.: given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:

    ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
    ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be shown compactly as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general, different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.

Greedy top-down generation of decision trees
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node (more on choosing the partitioning attribute/condition below).
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either
  – all (or most) of the items at the node belong to the same class, or
  – all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
• Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r: Partition(Si)

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI (we won't cover them here).
Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require
• computation of p(d | cj),
• precomputation of p(cj);
• p(d) can be ignored, since it is the same for all classes.

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)

Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
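A minimal sketch of such a histogram-based naive Bayesian classifier in Python (discrete attribute values are assumed, smoothing of zero counts is omitted, and the names are illustrative):

from collections import Counter, defaultdict

def train(instances):
    # instances: list of (attribute_tuple, class_label)
    class_counts = Counter(c for _, c in instances)
    hist = defaultdict(Counter)          # hist[(class, i)][value] = count
    for attrs, c in instances:
        for i, v in enumerate(attrs):
            hist[(c, i)][v] += 1
    return class_counts, hist, len(instances)

def classify(d, model):
    class_counts, hist, n = model
    def score(c):
        p = class_counts[c] / n          # p(cj); p(d) is ignored
        for i, v in enumerate(d):        # p(di | cj) from the histograms
            p *= hist[(c, i)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)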

Regression
Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.

One way is to infer coefficients a0, a1, a2, …, an such that

    Y = a0 + a1·X1 + a2·X2 + … + an·Xn

Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is called curve fitting. The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find the coefficients that give the best possible fit.
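Such coefficients can be fitted by least squares; a short Python sketch (numpy is assumed to be available):

import numpy as np

def linear_regression(X, y):
    # Fit Y = a0 + a1*X1 + ... + an*Xn by least squares.
    # X: (m, n) matrix of observations; y: length-m vector of results.
    A = np.column_stack([np.ones(len(X)), X])    # intercept column for a0
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs                                # [a0, a1, ..., an]

# Example: recover y ≈ 1 + 2x from noisy data
x = np.arange(10.0).reshape(-1, 1)
y = 1 + 2 * x[:, 0] + np.random.normal(scale=0.1, size=10)
print(linear_regression(x, y))                   # ≈ [1.0, 2.0]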

Association Rules
Retail shops are often interested in associations between the different items that people buy.
• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Association information can be used in several ways, e.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks


The left-hand side of a rule is the antecedent, the right-hand side the consequent. An association rule must have an associated population: the population consists of a set of instances, e.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population.

Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; then the support for the rule milk ⇒ screwdrivers is low. We usually want rules with reasonably high support: rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater). Naïve algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items in the set). Large itemsets are sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
   Support of rule = support(A).
   Confidence of rule = support(A) / support(A − {b}).
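The naïve algorithm translates directly into Python; a small sketch (the thresholds and the encoding of transactions as sets are illustrative, and the enumeration of all itemsets is intentionally naïve, i.e. exponential):

from itertools import combinations

def association_rules(transactions, min_support=0.02, min_conf=0.5):
    n = len(transactions)
    items = set().union(*transactions)
    support = {}
    # Steps 1-2: find the support of every itemset, keep the large ones
    for k in range(1, len(items) + 1):
        for s in combinations(sorted(items), k):
            cnt = sum(1 for t in transactions if set(s) <= t)
            if cnt / n >= min_support:
                support[frozenset(s)] = cnt / n
    # Step 3: from each large itemset A, generate rules A - {b} => b
    rules = []
    for A, sup_A in support.items():
        for b in A:
            lhs = A - {b}
            if lhs and sup_A / support[lhs] >= min_conf:
                rules.append((set(lhs), b, sup_A, sup_A / support[lhs]))
    return rules

print(association_rules([{"bread", "milk"}, {"bread", "milk", "eggs"},
                         {"bread"}, {"milk"}]))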

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
• Positive correlation: co-occurrence is higher than predicted.
• Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down within 2 days.
Deviations from temporal patterns: e.g., deviation from steady growth; sales of winter wear go down in summer (not surprising, part of a known pattern). Look for deviations from the value predicted using past patterns.

Clustering


Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster. This can be formalized using distance metrics in several ways, e.g. group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
• Centroid: the point defined by taking the average of the coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g. the BIRCH clustering algorithm. A k-means-style sketch of the centroid formulation is given below.
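A minimal in-memory k-means sketch of the centroid-based formulation (illustrative only; large-data algorithms such as BIRCH exist precisely to avoid this repeated full scan of the data):

import random

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(cluster):
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def kmeans(points, k, iters=20):
    # points: list of coordinate tuples; returns k centroids
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assign each point to nearest centroid
            i = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[i].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids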

Hierarchical Clustering
• Example: biological classification. Other examples: Internet directory systems (e.g. Yahoo).
• Agglomerative clustering algorithms: build small clusters, then cluster the small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases

• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly:
  – the limited and intermittent connectivity afforded by wireless communications;
  – the limited life of the power supply (battery);
  – the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.


• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage the mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• communication latency,
• intermittent connectivity,
• limited battery life,
• changing client location.
The server may not be able to reach a client: a client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  – servers must keep track of client locations in order to efficiently route messages to them;
  – client data should be stored in the network location that minimizes the traffic necessary to access it;
  – the act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations (or fixed hosts) and mobile units.
Data management issues as applied to mobile databases:
• data distribution and replication,
• transaction models,
• query processing,
• recovery and fault tolerance,
• mobile database design,
• location-based service,
• division of labor,
• security.

Application: Intermittently Synchronized Databases


• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  – A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  – A server cannot connect to a client at will.
  – Issues of wireless versus wired client connections and power conservation are generally immaterial.
  – A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  – A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set. Examples:
    customer = (customer-name, social-security, customer-street, customer-city)
    account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• simple and composite attributes,
• single-valued and multi-valued attributes,
• null attributes,
• derived attributes.

Relationship Sets
A relationship is an association among several entities. Example:

    Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – one to one,
  – one to many,
  – many to one,
  – many to many.
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationships
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  – y is a dominant entity (in the example below, loan),
  – x is a subordinate entity (in the example below, payment).
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e. savings-account "is an" account).

Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined,
  – user-defined.
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint,
  – overlapping.
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – total,
  – partial.

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – treat the relationship as an abstract entity,
  – this allows relationships between relationships,
  – abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – a customer takes out a loan,
  – an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – the aggregate entity set can be treated as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e. the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise

Representing Generalization as Tables
• Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: form a table for each entity set that is generalized, with all local and inherited attributes.

(b) Explain the features of Temporal and Spatial Databases in detail. (JUNE 2010)
Or
(a) Give the features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
• The term chronon is used instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point events


• Single time point event, e.g. a bank deposit.
• A series of point events can form time series data.

Duration events
• Associated with a specific time period; a time period is represented by a start time and an end time.

Transaction time
• The time when the information from a certain transaction becomes current in the database.

Bitemporal database
• A database dealing with two time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• a valid start time,
• a valid end time.

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g. age.
• Non-time-varying attribute: an attribute that does not change over time, e.g. date of birth.

Spatial Database
Types of Spatial Data
Point data
• Points in a multidimensional space.
• E.g. raster data, such as satellite imagery, where each pixel stores a measured value.
• E.g. feature vectors extracted from text.
Region data
• Objects have a spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-neighbor queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial join queries
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g. ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects, such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g. an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries, and maintaining a balanced index structure, presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.
Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• The index must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g. lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box, which is the tightest bounding box for a data object.
• Non-leaf entry = <n-dimensional box, ptr to child node>. The box covers all boxes in the child node (in fact, in the whole subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root): a parameter m ≤ 50% can be chosen, ensuring that every node is at least m full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q. (A small code sketch of this search follows.)
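A sketch of the overlap search in Python. The node layout and the 2-D box encoding (xmin, ymin, xmax, ymax) are illustrative choices for the sketch, not part of the R-tree definition:

def overlaps(a, b):
    # Do two boxes (xmin, ymin, xmax, ymax) intersect?
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, result):
    # Collect rids of leaf entries whose bounding box overlaps query box q
    if node["leaf"]:
        for box, rid in node["entries"]:
            if overlaps(box, q):
                result.append(rid)     # candidate: exact geometry test follows
    else:
        for box, child in node["entries"]:
            if overlaps(box, q):       # descend only into overlapping subtrees
                search(child, q, result)
    return result

leaf = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((5, 5, 6, 6), "r2")]}
root = {"leaf": False, "entries": [((0, 0, 6, 6), leaf)]}
print(search(root, (1, 1, 3, 3), []))  # ['r1']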

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately? This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether. The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently; generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  – adjust the entry for L in its parent so that the box now covers (only) L1;
  – add an entry (in the parent node of L) for L2 (this could cause the parent node to recursively split).

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes: when a node overflows, instead of splitting, remove some (say 30% of the) entries and reinsert them into the tree. This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary: searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions: e.g. feature vectors in multimedia applications; very sparse.
• Nearest-neighbor queries are common: the R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is then not meaningful. In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
• Reduce the time required to retrieve relations from disk, by partitioning the relations over multiple disks.
• Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):

Round-robin
• Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning
• Choose one or more attributes as the partitioning attributes.
• Choose a hash function h with range 0 … n − 1.
• Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, …, vn−2] is chosen.
• Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A code sketch of the three schemes follows.)
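The three schemes map naturally onto small routing functions; a Python sketch (the partitioning attribute value is passed in directly, and bisect provides the range lookup):

from bisect import bisect_right

N = 3                                 # number of disks

def round_robin(i):                   # i = insertion order of the tuple
    return i % N

def hash_partition(v):                # v = partitioning attribute value
    return hash(v) % N

def range_partition(v, vec=(5, 11)):  # vec = partitioning vector [v0, ..., vn-2]
    return bisect_right(vec, v)       # disk 0: v < 5; disk 1: 5 <= v < 11; disk 2: v >= 11

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2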

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g. 10 ≤ r.A < 25.

Round-robin
• Advantages: best suited for sequential scan of the entire relation on each query; all disks have almost an equal number of tuples, so retrieval work is well balanced between the disks.
• Range queries are difficult to process: no clustering, tuples are scattered across all disks.

Hash partitioning
• Good for sequential access:

  – assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is well balanced between the disks.
• Good for point queries on the partitioning attribute: the lookup involves a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so range queries are difficult to answer.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  – the remaining disks are available for other queries;
  – good if the result tuples are from one to a few blocks;


  – if many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer tuples. Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. This can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions, and too few to others. This is less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume a uniform distribution within each range of the histogram.
• A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
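The sort-based construction can be sketched in a few lines of Python (a histogram-based version would replace the sort with per-bucket counts; the input here may be the full relation or a sample):

def partition_vector(values, n):
    # Pick n-1 cut points so each of the n ranges gets ~len(values)/n tuples.
    # values: partitioning-attribute values of the relation (or a sample).
    s = sorted(values)
    step = len(s) // n
    return [s[(i + 1) * step] for i in range(n - 1)]

print(partition_vector([7, 3, 9, 1, 12, 5, 8, 2, 10], n=3))  # [5, 9]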


Interquery ParallelismQueriestransactions execute in parallel with one anotherIncreases transaction throughput used primarily to scale up a transaction processing system to support a larger number of transactions per secondEasiest form of parallelism to support particularly in a shared memory parallel database because even sequential database systems support concurrent processingMore complicated to implement on shared-disk or shared-nothing architectures

Locking and logging must be coordinated by passing messages between processors. Data in a local buffer may have been updated at another processor, so cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
An example of a cache coherency protocol for shared-disk systems:

Before reading/writing to a page, the page must be locked in shared/exclusive mode. On locking a page, the page must be read from disk. Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

All tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it receives temporarily on disk Di. This step requires I/O and communication overhead.

Each processor Pi then sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. The merging of the sorted runs can be parallelized as follows:

The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
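A compact Python simulation of parallel external sort-merge (sequential code stands in for the processors; heapq.merge performs each processor's merge of its incoming sorted streams):

    import heapq

    def parallel_external_sort_merge(disks, vector):
        n = len(vector) + 1
        runs = [sorted(d) for d in disks]          # each Pi sorts the data on Di locally
        # each sorted run is range-partitioned across processors P0, ..., Pm-1;
        # streams[i][j] is the (still sorted) stream processor Pi receives from run j
        streams = [[[] for _ in runs] for _ in range(n)]
        for j, run in enumerate(runs):
            for v in run:
                i = sum(1 for cut in vector if v >= cut)
                streams[i][j].append(v)
        out = []
        for i in range(n):                         # Pi merges its streams ...
            out.extend(heapq.merge(*streams[i]))   # ... and the runs are concatenated
        return out

    # e.g., parallel_external_sort_merge([[9, 1, 5], [2, 8, 4]], [4, 7])
    # returns [1, 2, 4, 5, 8, 9]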

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join

For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute the join of r and s on the join attributes r.A and s.B. Then r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used; r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.
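A minimal Python sketch of a partitioned join (hash partitioning on the join attributes, with a simple local join at each "processor"; key_r and key_s extract the join attributes and are assumptions of the example):

    def partitioned_join(r, s, n, key_r, key_s):
        # partition both relations with the same hash function on the join attributes
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in r:
            r_parts[hash(key_r(t)) % n].append(t)
        for t in s:
            s_parts[hash(key_s(t)) % n].append(t)
        # each processor Pi computes the join of ri and si locally;
        # any standard join method works (a simple nested loop is used here)
        return [(t, u)
                for i in range(n)
                for t in r_parts[i]
                for u in s_parts[i]
                if key_r(t) == key_s(u)]

Matching tuples necessarily hash to the same partition, which is why the join can be computed locally at each processor.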


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

A special case is asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

e.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join. Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
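A sketch of the two-level hashing just described, in Python: h1 redistributes the build relation s and then r across the processors, and h2 forms the local partitions used by each processor's build and probe phases (sequential stand-in code; the helper names are invented):

    def partitioned_parallel_hash_join(r, s, n, key, local_parts=4):
        h1 = lambda v: hash(('h1', v)) % n             # value -> processor
        h2 = lambda v: hash(('h2', v)) % local_parts   # value -> local partition
        s_at = [[] for _ in range(n)]
        r_at = [[] for _ in range(n)]
        for t in s:                                    # redistribute s (build relation)
            s_at[h1(key(t))].append(t)
        for t in r:                                    # then redistribute r
            r_at[h1(key(t))].append(t)
        out = []
        for i in range(n):                             # local hash join at each Pi
            build = [{} for _ in range(local_parts)]
            for t in s_at[i]:                          # build phase on si, split by h2
                build[h2(key(t))].setdefault(key(t), []).append(t)
            for t in r_at[i]:                          # probe phase on ri
                out.extend((t, u) for u in build[h2(key(t))].get(key(t), []))
        return out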

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r used as-is. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 be assigned the computation of temp2 = temp1 ⋈ r3,
- and P3 be assigned the computation of temp2 ⋈ r4.
Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 be assigned the computation of temp2 = r3 ⋈ r4,
- and P3 be assigned the computation of temp1 ⋈ temp2.
P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2. The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism. Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.
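Pipelining can be illustrated with Python generators: each stage yields result tuples as soon as it computes them, so the downstream join starts before the upstream join has finished (the relations, indexes, and key positions are invented for the example):

    def pipelined_join(left, right_index, key):
        # left: an iterator of tuples; right_index: join-key -> matching tuples
        for t in left:
            for u in right_index.get(key(t), []):
                yield t + u          # emit immediately: this is the pipelining

    r1 = [(1,), (2,)]
    idx_r2 = {1: [(1, 'a')], 2: [(2, 'b')]}            # r2 indexed on its join key
    idx_r3 = {'a': [('a', 10)], 'b': [('b', 20)]}      # r3 indexed on its join key
    temp1 = pipelined_join(iter(r1), idx_r2, key=lambda t: t[0])   # "P1"
    result = pipelined_join(temp1, idx_r3, key=lambda t: t[2])     # "P2"
    for row in result:
        print(row)     # (1, 1, 'a', 'a', 10) then (2, 2, 'b', 'b', 20)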

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide: how to parallelize each operation, and how many processors to use for it; what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other. Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in a repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence. Rule-based systems are:

- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering - building an expert system
Knowledge Engineers - the people who build the system
Knowledge Representation - the symbols used to represent the knowledge
Factual Knowledge - knowledge of a particular task domain that is widely shared


Heuristic Knowledge - more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

1. A list of rules or rule base, which is a specific type of knowledge base.
2. An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
- Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
- Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
- Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

3. Temporary working memory.
4. A user interface or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

Set of Rules - derived from the knowledge base and used by the interpreter to evaluate the inputted data.

Knowledge Engineer - decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.

Interpreter - interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
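A minimal forward-chaining sketch in Python (the rules and facts are invented for illustration); it loops through the match and act phases described earlier until working memory stops changing:

    def forward_chain(facts, rules):
        # facts: set of atoms; rules: list of (conditions, conclusion) pairs
        wm = set(facts)                               # working memory
        changed = True
        while changed:
            changed = False
            for conds, concl in rules:                # match phase
                if concl not in wm and all(c in wm for c in conds):
                    wm.add(concl)                     # act phase: fire the rule
                    changed = True
        return wm

    rules = [({'fever', 'rash'}, 'suspect_measles'),
             ({'suspect_measles'}, 'refer_to_specialist')]
    print(forward_chain({'fever', 'rash'}, rules))
    # {'fever', 'rash', 'suspect_measles', 'refer_to_specialist'}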

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine the intelligence of multiple human experts
- Reduce the number of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack the human common sense needed in some decision making
- Cannot give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Automating complex processes is challenging
- Lack of flexibility and ability to adapt to changing environments
- Unable to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.

- Heuristic rather than algorithmic
- Heuristics in search vs. in KBS: general vs. domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell. It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                        Inference
Logic                     Resolution principle
Production rules          backward (top-down, goal-directed);
                          forward (bottom-up, data-driven)
Semantic nets & Frames    Inheritance & advanced reasoning
Case-based Reasoning      Similarity-based

KBS tools - Shells
- Consist of a KA Tool, Database & Development Interface
- Inductive shells: simplest; example cases represented as a matrix of known data (premises) and resulting effects; the matrix is converted into a decision tree or IF-THEN statements; examples are selected for the tool

- Rule-based shells: simple to complex; IF-THEN rules
- Hybrid shells: sophisticated & powerful; support multiple KR paradigms & reasoning schemes; generic tools applicable to a wide range of problems
- Special-purpose shells: specifically designed for particular types of problems; restricted to specialised problems
- From scratch: requires more time and effort; no constraints like shells; shells should be investigated first

Some example KBSs
DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers)

Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions. e.g., diagnose reasons for engine failure.
(2) Interpretation - To provide an understanding of a situation from available information. e.g., DENDRAL.
(3) Prediction - To predict a future state from a set of data or observations. e.g., Drilling Advisor, PLANT.
(4) Design - To develop configurations that satisfy constraints of a design problem. e.g., XCON.
(5) Planning - Both short-term & long-term, in areas like project management, product development, or financial planning. e.g., HRM.
(6) Monitoring - To check performance & flag exceptions. e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control - To collect and evaluate evidence and form opinions on that evidence. e.g., control a patient's treatment.
(8) Instruction - To train students and correct their performance. e.g., give medical students experience diagnosing illness.
(9) Debugging - To identify and prescribe remedies for malfunctions. e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost-effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimics organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, which is less expensive and performs better
- Improved processing power: many machines handle the database, rather than a single server

Disadvantages:
- More complex to implement and more costly to maintain
- Security and integrity control are harder; standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object that precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp. If a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current state of this database (perhaps a key-value store) is Object1 = "Hello", Object2 = "Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery
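The timestamp rules above can be captured in a short Python sketch (a toy in-memory version store, not the mechanism of any particular DBMS):

    class MVCCStore:
        def __init__(self):
            self.versions = {}   # object -> list of (write_ts, value)
            self.rts = {}        # object -> largest read timestamp seen

        def read(self, obj, ts):
            # return the most recent version whose write timestamp precedes ts
            older = [(w, v) for (w, v) in self.versions.get(obj, []) if w <= ts]
            if not older:
                return None
            self.rts[obj] = max(self.rts.get(obj, 0), ts)
            return max(older, key=lambda p: p[0])[1]

        def write(self, obj, ts, value):
            if ts < self.rts.get(obj, 0):
                # TS(Ti) < RTS(P): abort and restart the transaction
                raise RuntimeError("abort: a later transaction already read this object")
            self.versions.setdefault(obj, []).append((ts, value))

A long-running reader that calls read(obj, 1) keeps seeing the t1 snapshot even after a writer appends versions at t2, which is exactly the isolated-read behavior shown in the tables above.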


(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing. Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time. There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized. Some of the software problems, which may involve data management, transaction management, and database recovery, have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network. Fixed hosts are general-purpose computers configured to manage mobile units; base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second. Other characteristics distinguish the wireless connectivity options: interference, locality of access, range, support for packet switching,


and seamless roaming throughout a geographical region. Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones. Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage the entire area, the mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station. Mobile units move unrestricted throughout the cells of the domain while maintaining contiguity of information access. The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case. Proxies for unreachable components are added to the architecture: for a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible; a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). There are two parts to optimizing a query:
1. Consider a set of alternative plans. The search space must be pruned; typically, only left-deep plans are considered.
2. Estimate the cost of each plan that is considered. This requires estimating the size of the result and the cost for each plan node. Key issues: statistics, indexes, operator implementations.
A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.


Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
- How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we want to avoid the worst plans. We will study the System R approach.
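The 'pull' interface can be sketched in a few lines of Python (the operator classes are invented for illustration): each operator's next() pulls tuples from its child on demand.

    class Scan:
        def __init__(self, tuples):
            self.it = iter(tuples)
        def next(self):
            return next(self.it, None)            # None signals end of stream

    class Select:
        def __init__(self, child, pred):
            self.child, self.pred = child, pred
        def next(self):
            t = self.child.next()                 # pull on the input
            while t is not None and not self.pred(t):
                t = self.child.next()
            return t

    # plan: select rating > 7 over a scan of Sailors
    plan = Select(Scan([(1, 'Ann', 9), (2, 'Bob', 5)]), lambda t: t[2] > 7)
    t = plan.next()
    while t is not None:
        print(t)                                  # (1, 'Ann', 9)
        t = plan.next()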

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)

Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.

The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- There are several alternatives for audio encoding:


MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates. Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data. Head-end terminals, used to view multimedia data, are PCs or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. e.g., identify similar designs when registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- One join takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
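The fixpoint idea can be made concrete with a tiny Python interpreter for the recursive Comp program (the Assembly facts are invented): the rules are applied repeatedly until no new tuples appear, something no fixed number of joins can guarantee.

    def naive_fixpoint(assembly):
        # Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
        # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
        comp = {(p, s) for (p, s, q) in assembly}
        while True:
            new = {(p, s2) for (p, s1, q) in assembly
                           for (s1b, s2) in comp if s1 == s1b}
            if new <= comp:
                return comp            # least fixpoint reached
            comp |= new

    assembly = {('trike', 'wheel', 3), ('wheel', 'spoke', 2), ('wheel', 'tire', 1)}
    print(naive_fixpoint(assembly))
    # includes ('trike', 'spoke') and ('trike', 'tire')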

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
    ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  e.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
Big2 (Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM(<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem; it is similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations:


Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
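A seminaive version of the earlier fixpoint sketch, following the delta rewriting above (same toy Assembly representation): each iteration joins Assembly only with the tuples produced in the previous iteration.

    def seminaive_fixpoint(assembly):
        comp = {(p, s) for (p, s, q) in assembly}
        delta = set(comp)              # tuples generated in the last iteration
        while delta:
            # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
            new = {(p, s2) for (p, p2, q) in assembly
                           for (p2b, s2) in delta if p2 == p2b}
            delta = new - comp         # keep only genuinely new inferences
            comp |= delta
        return comp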

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
    SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
    Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
    Magic_SL (spoke).
    SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
    SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 4: Database Technology

Part B - (5 x 16 = 80 Marks) (JUNE 2010) & (DECEMBER 2010)

1 (a) Explain the architecture of Distributed Databases (16) (JUNE 2010)

Or

(b) Discuss in detail the architecture of distributed database (16) (NOVDEC 2010)




(b) Write notes on the following: (i) Query processing (8) (JUNE 2010)





(ii) Transaction processing (8) (JUNE 2010)

A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
- concurrency transparency
- failure transparency

Example Transaction - SQL Version
Begin_transaction Reservation
begin


    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC(FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end Reservation

Properties of Transactions
ATOMICITY - all or nothing
CONSISTENCY - no violation of integrity constraints
ISOLATION - concurrent changes invisible, i.e., serializable
DURABILITY - committed updates persist
These are the ACID properties of a transaction.

Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.

Consistency: Internal consistency:
- A transaction which executes alone against a consistent database leaves it in a consistent state.
- Transactions do not violate database integrity constraints.
Transactions are correct programs.

Isolation:
Degree 0
- Transaction T does not overwrite dirty data of other transactions.
- Dirty data refers to data values that have been updated by a transaction prior to its commitment.
Degree 2
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.
Degree 3
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.


- Other transactions do not dirty any data read by T before T completes.

Isolation - Serializability:
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- This is necessary to avoid cascading aborts.

Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This requires database recovery.

Transaction transparency: ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.

Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.

Failure transparency: must ensure atomicity and durability of the global transaction.
- This means ensuring that the subtransactions of the global transaction either all commit or all abort.

Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a) Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)

Or

(b) Describe modeling and design approaches for object oriented database (16) (NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects such as aggregation and association), behavior (a set of operations), and characteristic of types (generalization/serialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables, linked by keys to the main table.

Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; the composed-of relationship is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types, and at the behavioral level.

Generalization/Serialization is the "is-a" relationship, which is supported in OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...), i.e., a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state, by modifying its attribute values, or to query the value of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.

The main differences between

relational database design and object oriented database design include

- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.
- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.
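A small Python sketch of the Book/Publisher structure discussed above (the Persistent base class is a hypothetical stand-in for the persistence API an OODBMS would supply):

    class Persistent:
        def save(self):
            pass    # an OODBMS would persist the object's state here

    class Publisher(Persistent):
        def insert(self, name, pid):
            # responds to a message such as publisher.insert("Rose", 123)
            self.name, self.pid = name, pid

    class Author(Persistent):
        def __init__(self, name):
            self.name = name

    class Book(Persistent):
        def __init__(self, title, published_by, written_by, chapters):
            self.title = title
            self.publishedBy = published_by   # association to a Publisher object
            self.writtenBy = written_by       # association to Author objects
            self.chapters = chapters          # aggregation: composed of chapters

    class ArtBook(Book):                      # generalization: an ArtBook is a Book
        def __init__(self, title, published_by, written_by, chapters, style):
            super().__init__(title, published_by, written_by, chapters)
            self.style = style

Note how the writtenBy relationship is held directly as an attribute, and how behavior (insert) lives on the class itself, in contrast to the relational design points listed above.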

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)

Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object that precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp. If a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current state of this database (perhaps a key-value store) is Object1 = "Hello", Object2 = "Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery

17

18

3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse
 Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
 Data sources often store only current data, not historical data.
 Corporate decision making requires a unified view of all organizational data, including historical data.
 A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
 Greatly simplifies querying; permits study of historical trends.
 Shifts decision support query load away from transaction processing systems.
When and how to gather data


 Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
 Destination-driven architecture: the warehouse periodically requests new information from data sources.
 Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
 Usually OK to have slightly out-of-date data at the warehouse. Data updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use

Schema integration
Data cleansing
 E.g., correct mistakes in addresses (e.g., misspellings, zip code errors).
 Merge address lists from different sources and purge duplicates. Keep only one address record per household ("householding").

How to propagate updates
 Warehouse schema may be a (materialized) view of schema from data sources.
 Efficient techniques for update of materialized views.
What data to summarize
 Raw data may be too large to store on-line.
 Aggregate values (totals/subtotals) often suffice.
 Queries on raw data can often be transformed by the query optimizer to use aggregate values.

 Typically, warehouse data is multidimensional, with very large fact tables.
 Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
 Examples of measures: number of items sold, price of items.


 Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
 The resultant schema is called a star schema.
 More complicated schema structures:
 Snowflake schema: multiple levels of dimension tables.
 Constellation: multiple fact tables.
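As a small illustration of the dimension-table encoding just described (all table contents and names below are made up), a fact table can hold small-integer keys that are resolved against dimension tables at query time:

# Star schema in miniature: a fact table referencing two dimension tables (illustrative).
item_dim = {1: "milk", 2: "bread"}            # item-id -> full value
store_dim = {10: "Madison", 11: "Chennai"}    # store-id -> full value
sales_fact = [                                # (item-id, store-id, measure: items sold)
    (1, 10, 3),
    (2, 11, 5),
]
for item_id, store_id, units in sales_fact:
    # Map the encoded dimension keys back to their full values.
    print(item_dim[item_id], store_dim[store_id], units)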

Data Mining
 Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
 Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
 Differs from machine learning in that it deals with large volumes of data stored primarily on disk.
 Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
 Other types of knowledge are represented by equations, or by prediction functions.
 Some manual intervention is usually required:
 Pre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns.

Applications of Data Mining
 Prediction based on past history:
 Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
 Predict if a customer is likely to switch brand loyalty.
 Predict if a customer is likely to respond to "junk mail".
 Predict if a pattern of phone calling card usage is likely to be fraudulent.
 Some examples of prediction mechanisms:
Classification


 Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae
 Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations
 Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
 Other similar applications: camera accessories, clothes, etc.
 Associations may also be used as a first step in detecting causation:
 E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.

Clusters
 E.g., typhoid cases were clustered in an area surrounding a contaminated well.
 Detection of clusters remains important in detecting epidemics.

Classification Rules
 Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
 Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
 ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
 ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
 Rules are not necessarily exact: there may be some misclassifications.
 Classification rules can be compactly shown as a decision tree.
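Read as code, the two rules above amount to a simple classifier. A minimal Python sketch (the function name and the fallback class are invented for illustration):

# Credit-risk classification rules expressed as a function (illustrative).
def credit_class(degree, income):
    if degree == "masters" and income > 75000:
        return "excellent"
    if degree == "bachelors" and 25000 <= income <= 75000:
        return "good"
    return "unclassified"  # the rule set is not exhaustive; misclassifications are possible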

Decision Tree
 Training set: a data sample in which the grouping for each tuple is already known.
 Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
 Since degree has a small number of possible values, one child is created for each value.
 At each child node of the root, further classification is done if required. Here, partitions are defined by income.
 Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
 Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
 In general: different branches of the tree could grow to different levels.


 Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees
 Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
 More on choosing partitioning attribute/condition shortly.
 Algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
 The data at a node is not partitioned further if either:
 all (or most) of the items at the node belong to the same class, or
 all attributes have been considered, and no further partitioning is possible. Such a node is a leaf node.
 Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
   Partition(S)

Procedure Partition(S)
   if (purity(S) > δp or |S| < δs) then return
   for each attribute A
      evaluate splits on attribute A
   use best split found (across all attributes) to partition S into S1, S2, …, Sr
   for i = 1, 2, …, r
      Partition(Si)
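A runnable Python approximation of the GrowTree/Partition scheme above; the majority-class purity measure and the default thresholds are illustrative stand-ins, not part of the original algorithm statement:

# Greedy top-down decision-tree construction (illustrative sketch).
def purity(S):
    # Fraction of tuples in S that belong to the majority class.
    labels = [label for _, label in S]
    return max(labels.count(c) for c in set(labels)) / len(S)

def partition(S, dp=0.9, ds=4):
    # S is a list of (attribute-tuple, class-label) pairs.
    if purity(S) > dp or len(S) < ds:
        return {"leaf": S}
    best = None
    for a in range(len(S[0][0])):           # evaluate splits on each attribute
        for t, _ in S:
            thr = t[a]
            lo = [x for x in S if x[0][a] <= thr]
            hi = [x for x in S if x[0][a] > thr]
            if lo and hi:
                score = purity(lo) * len(lo) + purity(hi) * len(hi)
                if best is None or score > best[0]:
                    best = (score, a, thr, lo, hi)
    if best is None:                        # no further partitioning is possible
        return {"leaf": S}
    _, a, thr, lo, hi = best
    # The greedy choice is made once and not revisited.
    return {"split": (a, thr), "children": [partition(lo, dp, ds), partition(hi, dp, ds)]}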

Other Types of Classifiers
Further types of classifiers:
 Neural net classifiers
 Bayesian classifiers
 Neural net classifiers use the training data to train artificial neural nets.


 Widely studied in AI; won't cover here.
 Bayesian classifiers use Bayes' theorem, which says
p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
 Bayesian classifiers require:
 computation of p(d | cj),
 precomputation of p(cj);
 p(d) can be ignored, since it is the same for all classes.
 To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
 Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
 Histograms on multiple attributes are more expensive to compute and store.
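A small illustrative sketch of a naïve Bayesian classifier built from per-class histograms, as described above (all names are invented):

# Naïve Bayes from per-class attribute histograms (illustrative).
from collections import Counter, defaultdict

def train(instances):
    # instances: list of (attribute-tuple, class) pairs
    prior = Counter(c for _, c in instances)          # estimates p(cj)
    hist = defaultdict(Counter)                       # histogram on di per class cj
    for attrs, c in instances:
        for i, a in enumerate(attrs):
            hist[(c, i)][a] += 1
    return prior, hist

def classify(prior, hist, attrs):
    total = sum(prior.values())
    best, best_p = None, -1.0
    for c, n in prior.items():
        p = n / total                                 # p(cj)
        for i, a in enumerate(attrs):
            p *= hist[(c, i)][a] / n                  # p(di | cj) from the histogram
        if p > best_p:                                # p(d) ignored: same for all classes
            best, best_p = c, p
    return best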

Regression
 Regression deals with the prediction of a value, rather than a class:
 Given values for a set of variables, X1, X2, …, Xn, we wish to predict the value of a variable Y.

 One way is to infer coefficients a0, a1, a2, …, an such that
Y = a0 + a1·X1 + a2·X2 + … + an·Xn

 Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
 The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
 Regression aims to find coefficients that give the best possible fit.
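For the single-variable case, the least-squares coefficients can be computed directly. An illustrative Python sketch (assuming one predictor X1; data made up):

# Fit Y = a0 + a1*X1 by least squares (illustrative).
def linear_regression(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a0 = my - a1 * mx
    return a0, a1

a0, a1 = linear_regression([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])  # slope a1 is close to 2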

Association Rules
 Retail shops are often interested in associations between different items that people buy:
 Someone who buys bread is quite likely also to buy milk.
 A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

 Association information can be used in several ways: e.g., when a customer buys a particular book, an online shop may suggest associated books.
 Association rules:
 bread ⇒ milk
 DB-Concepts, OS-Concepts ⇒ Networks


 Left hand side: antecedent; right hand side: consequent.
 An association rule must have an associated population: the population consists of a set of instances.
 E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
 Rules have an associated support, as well as an associated confidence.
 Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
 E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
 We usually want rules with a reasonably high support. Rules with low support are usually not very useful.
 Confidence is a measure of how often the consequent is true when the antecedent is true.
 E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
 We usually want rules with reasonably large confidence.
Finding Association Rules
 We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
 Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
 From itemset A, generate the rule A − b ⇒ b for each b ∈ A.
 Support of rule = support(A).
 Confidence of rule = support(A) / support(A − b).
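Support and confidence as defined above can be computed with a few lines of Python; the transactions below are made up for illustration:

# Support and confidence over a toy set of transactions (illustrative).
transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}, {"milk"}]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

# bread => milk: support 0.5, confidence 2/3
print(support({"bread", "milk"}), confidence({"bread"}, {"milk"}))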

Other Types of Associations
 Basic association rules have several limitations.
 Deviations from the expected probability are more interesting:
 E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
 We are interested in positive as well as negative correlations between sets of items:
 Positive correlation: co-occurrence is higher than predicted.
 Negative correlation: co-occurrence is lower than predicted.
 Sequence associations/correlations:
 E.g., whenever bonds go up, stock prices go down in 2 days.
 Deviations from temporal patterns:
 E.g., deviation from a steady growth.
 E.g., sales of winter wear go down in summer: not surprising, part of a known pattern; look for deviation from the value predicted using past patterns.
Clustering


 Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
 Can be formalized using distance metrics in several ways:
 E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
 Centroid: point defined by taking the average of coordinates in each dimension.
 Another metric: minimize the average distance between every pair of points in a cluster.
 Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the Birch clustering algorithm (more shortly).
Hierarchical Clustering
 Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
 Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
 Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.
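An illustrative sketch of the centroid-based formalization above, in the style of k-means (a standard algorithm, not named in the original text; all names invented):

# Centroid-based clustering in the style of k-means (illustrative).
import random

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                    # assign each point to its nearest centroid
            i = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):    # recompute centroid as per-dimension average
            if c:
                centroids[i] = tuple(sum(d) / len(c) for d in zip(*c))
    return centroids, clusters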

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases

 Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
 Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
 There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
 Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

 In mobile computing, the problems are more difficult, mainly:
 The limited and intermittent connectivity afforded by wireless communications.
 The limited life of the power supply (battery).
 The changing topology of the network.
 In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


 It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
 Fixed hosts are general-purpose computers configured to manage mobile units.
 Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
 The wireless medium has bandwidth significantly lower than that of a wired network.
 The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
 Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
 Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
 Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships –
 Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage:
 To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
 Mobile units can move freely throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

 Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

 Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
 The characteristics of mobile computing include:
 Communication latency
 Intermittent connectivity
 Limited battery life
 Changing client location

 The server may not be able to reach a client:
 A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
 In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


 Proxies for unreachable components are added to the architecture.
 For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
 Mobile computing poses challenges for servers as well as clients:
 The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
 One way servers relieve this problem is by broadcasting data whenever possible:
 A server can simply broadcast data periodically.
 Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
 Client mobility also poses many data management challenges:
 Servers must keep track of client locations in order to efficiently route messages to them.
 Client data should be stored in the network location that minimizes the traffic necessary to access it.
 The act of moving between cells must be transparent to the client:
 The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
 Client mobility also allows new applications that are location-based.

Data Management Issues
 From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
 1. The entire database is distributed mainly among the wired components, possibly with full or partial replication:
 A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
 2. The database is distributed among wired and wireless components:
 Data management responsibility is shared among base stations or fixed hosts and mobile units.
 Data management issues, as applied to mobile databases:
 Data distribution and replication
 Transaction models
 Query processing
 Recovery and fault tolerance
 Mobile database design
 Location-based service
 Division of labor
 Security

Application: Intermittently Synchronized Databases


 Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
 The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

 The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

 A client connects to the server when it wants to exchange updates:
 The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
 A server cannot connect to a client at will.
The characteristics of ISDBs (contd):

 Issues of wireless versus wired client connections and power conservation are generally immaterial.
 A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
 A client has multiple ways of connecting to a server, and in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
 a collection of entities,
 relationships among entities.
 An entity is an object that exists and is distinguishable from other objects. Example: specific person, company, event, plant.
 An entity set is a set of entities of the same type that share the same properties. Example: set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
 Domain – the set of permitted values for each attribute.
Attribute types:
 Simple and composite attributes
 Single-valued and multi-valued attributes
 Null attributes
 Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

 Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)
 A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship.
 Example: (Hayes, A-102) ∈ depositor

 An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Setsect Refers to number of entity sets that participate in a relationship setsect Relationship sets that involve two entity sets are binary (or degree two) Generally most

relationship sets in a database system are binarysect Relationship sets may involve more than two entity sets The entity sets customer loan

and branch may be linked by the ternary (degree three) relationship set CLB RolesEntity sets of a relationship need not be distinct

sect The labelsrdquomanagerrdquoandrdquoworkerrdquoare called roles they specify how employee entities interact via the works-for relationship set

sect Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles

sect Role labels are optional and are used to clarify semantics of the relationship Design Issues

 Use of entity sets vs. attributes: choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
 Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
 Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
 Express the number of entities to which another entity can be associated via a relationship set.
 Most useful in describing binary relationship sets.
 For a binary relationship set, the mapping cardinality must be one of the following types:
 One to one
 One to many
 Many to one


 Many to many
 We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

 A customer is associated with at most one loan via the relationship borrower.
 A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

 In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
 In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


 A customer is associated with several (possibly 0) loans via borrower.
 A loan is associated with several (possibly 0) customers via borrower.
Existence Dependencies
 If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
 y is a dominant entity (in the example below, loan)
 x is a subordinate entity (in the example below, payment)
 If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
 Rectangles represent entity sets.
 Ellipses represent attributes.
 Diamonds represent relationship sets.
 Lines link attributes to entity sets, and entity sets to relationship sets.
 Double ellipses represent multivalued attributes.
 Dashed ellipses denote derived attributes.
 Primary key attributes are underlined.

Weak Entity Sets
 An entity set that does not have a primary key is referred to as a weak entity set.
 The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
 The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
 The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
 We depict a weak entity set by double rectangles.
 We underline the discriminator of a weak entity set with a dashed line.
 payment-number – discriminator of the payment entity set.
 Primary key for payment – (loan-number, payment-number).

Specialization
 Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
 These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
 Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
 A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
 Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
 Attribute Inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
 Constraint on which entities can be members of a given lower-level entity set:
 condition-defined
 user-defined
 Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
 disjoint
 overlapping
 Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
 total
 partial

Aggregation
 Relationship sets borrower and loan-officer represent the same information.
 Eliminate this redundancy via aggregation:
 Treat the relationship as an abstract entity.
 Allows relationships between relationships.
 Abstraction of relationship into a new entity.
 Without introducing redundancy, the following diagram represents that:
 A customer takes out a loan.
 An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
 The use of an attribute or entity set to represent an object.
 Whether a real-world concept is best expressed by an entity set or a relationship set.
 The use of a ternary relationship versus a pair of binary relationships.
 The use of a strong or weak entity set.
 The use of generalization – contributes to modularity in the design.
 The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

 Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.
 A database which conforms to an E-R diagram can be represented by a collection of tables.
 For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
 Each table has a number of columns (generally corresponding to attributes), which have unique names.
 Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
 A strong entity set reduces to a table with the same attributes.
 A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
 A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
 The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
 Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
 Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give the features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

 Time is considered an ordered sequence of points in some granularity.
 Use the term chronon instead of point to describe minimum granularity.
 A calendar organizes time into different time units for convenience.
 Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events
 Single time point event, e.g., bank deposit.
 A series of point events can form time series data.
Duration events
 Associated with a specific time period. The time period is represented by start time and end time.
Transaction time
 The time when the information from a certain transaction becomes valid.
Bitemporal database
 Databases dealing with two time dimensions.
Incorporating Time in Relational Databases Using Tuple Versioning
 Add to every tuple:
 Valid start time
 Valid end time
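An illustrative Python sketch of tuple versioning: each version of a tuple carries a valid start time and a valid end time, and an "as of" lookup selects the version whose interval contains the query time (all data below is made up):

# Tuple versioning with valid start/end times (illustrative).
from datetime import date

employee_versions = [
    # (name, salary, valid_start, valid_end); None means still current
    ("Smith", 30000, date(2008, 1, 1), date(2009, 6, 30)),
    ("Smith", 34000, date(2009, 7, 1), None),
]

def as_of(versions, when):
    # Return the version whose validity interval contains `when`.
    for name, salary, start, end in versions:
        if start <= when and (end is None or when <= end):
            return name, salary
    return None

print(as_of(employee_versions, date(2009, 1, 1)))  # ('Smith', 30000)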

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


 A single complex object stores all temporal changes of the object.
Time-varying attribute
 An attribute that changes over time, e.g., age.
Non-time-varying attribute
 An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
 Points in a multidimensional space.
 E.g., raster data such as satellite imagery, where each pixel stores a measured value.
 E.g., feature vectors extracted from text.
Region Data
 Objects have spatial extent with location and boundary.
 DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
 Find all cities within 50 miles of Madison.
 Query has an associated region (location, boundary).
 Answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
 Find the 10 cities nearest to Madison.
 Results must be ordered by proximity.
Spatial Join Queries
 Find all cities near a lake.
 Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
 E.g., ESRI's ArcInfo; OpenGIS Consortium.
 Geospatial information.
 All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
 Store spatial objects such as the surface of an airplane fuselage.
 Range queries and spatial join queries are common.
Multimedia Databases
 Images, video, text, etc. stored and retrieved by content.
 First converted to feature vector form; high dimensionality.
 Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
 B+ trees are fundamentally single-dimensional indexes.
 When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.
 Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
 A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
 Keeping track of entries and maintaining a balanced index structure presents a challenge.
 Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
 Spatial queries (GIS, CAD):
 Find all hotels within a radius of 5 miles from the conference venue.
 Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
 Find all cities that lie on the Nile in Egypt.
 Find all parts that touch the fuselage (in a plane design).
 Similarity queries (content-based retrieval):
 Given a face, find the five most similar faces.
 Multidimensional range queries:
 50 < age < 55 AND 80K < sal < 90K
Drawbacks
 An index based on spatial location is needed:
 One-dimensional indexes don't support multidimensional searching efficiently.
 Hash indexes only support point queries; we want to support range queries as well.
 Must support inserts and deletes gracefully.
 Ideally, want to support non-point data as well (e.g., lines, shapes).
 The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
 Leaf entry = <n-dimensional box, rid>:
 This is Alternative (2), with the key value being a box.
 Box is the tightest bounding box for a data object.
 Non-leaf entry = <n-dim box, ptr to child node>:
 Box covers all boxes in the child node (in fact, subtree).
 All leaves at the same distance from the root.
 Nodes can be kept 50% full (except the root):
 Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
 Start at root.
 1. If the current node is non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


 2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
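The two steps above, as an illustrative Python sketch over a toy R-tree of nested dicts (the node representation is invented for this example):

# Overlap search on a toy R-tree (illustrative).
def overlaps(a, b):
    # Boxes as ((xlo, ylo), (xhi, yhi)); they overlap iff they intersect in every dimension.
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(2))

def search(node, q, results):
    if node["leaf"]:
        for box, rid in node["entries"]:    # step 2: rid may identify an overlapping object
            if overlaps(box, q):
                results.append(rid)
    else:
        for box, child in node["entries"]:  # step 1: descend only into overlapping subtrees
            if overlaps(box, q):
                search(child, q, results)
    return results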

Improving Search Using Constraints
 It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

 But why not use convex polygons to approximate query regions more accurately?
 Will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
 Cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
 Start at root and go down to the "best-fit" leaf L:
 Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest area child.
 If best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
 Adjust the entry for L in its parent so that the box now covers (only) L1.
 Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
 The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
 Goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
 Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
 The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
 Remove some (say, 30% of the) entries and reinsert them into the tree.
 Could result in all reinserted entries fitting on some existing pages, avoiding a split.
 R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
 Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
 Searches now take a single path to a leaf, at the cost of redundancy.

GiST
 The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants:
 Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
 B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
 GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data
 Typically, high-dimensional datasets are collections of points, not regions:
 E.g., feature vectors in multimedia applications.
 Very sparse.
 Nearest-neighbor queries are common:
 The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
 As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful:
 In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
 Parallel machines are becoming quite common and affordable:
 Prices of microprocessors, memory and disks have dropped sharply.
 Databases are growing increasingly large:
 Large volumes of transaction data are collected and stored for later analysis.
 Multimedia objects like images are increasingly stored in databases.
 Large-scale parallel database systems are increasingly used for:
 storing large volumes of data,
 processing time-consuming decision-support queries,
 providing high throughput for transaction processing.
Parallelism in Databases
 Data can be partitioned across multiple disks for parallel I/O.
 Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
 Data can be partitioned, and each processor can work independently on its own partition.
 Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
 Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
 Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
 Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
 Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
 Partitioning techniques (number of disks = n):
Round-robin
 Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning
 Choose one or more attributes as the partitioning attributes.


 Choose a hash function h with range 0…n−1.
 Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning
 Choose an attribute as the partitioning attribute.
 A partitioning vector [v0, v1, …, vn−2] is chosen.
 Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n−1.
 E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
 Evaluate how well partitioning techniques support the following types of data access:
 1. Scanning the entire relation.
 2. Locating a tuple associatively – point queries. E.g., r.A = 25.
 3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
Round-robin
Advantages

 Best suited for sequential scan of the entire relation on each query.
 All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
 Range queries are difficult to process:
 No clustering – tuples are scattered across all disks.
Hash partitioning
 Good for sequential access:
 Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
 Retrieval work is then well balanced between disks.
 Good for point queries on the partitioning attribute:
 Can look up a single disk, leaving others available for answering other queries.
 An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
 No clustering, so difficult to answer range queries.
Range partitioning
 Provides data clustering by partitioning attribute value.
 Good for sequential access.
 Good for point queries on the partitioning attribute: only one disk needs to be accessed.
 For range queries on the partitioning attribute, one to a few disks may need to be accessed:
 Remaining disks are available for other queries.
 Good if result tuples are from one to a few blocks.


 If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted (an example of execution skew).
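The three partitioning techniques compared above, as an illustrative Python sketch (Python's built-in hash stands in for the hash function h; all names invented):

# Round-robin, hash, and range partitioning (illustrative).
def round_robin(i, n):
    return i % n                 # i-th inserted tuple goes to disk i mod n

def hash_partition(value, n):
    return hash(value) % n       # stand-in for the hash function h

def range_partition(v, vector):
    # vector = [v0, v1, ..., vn-2]; tuples with v < v0 go to disk 0, etc.
    for i, vi in enumerate(vector):
        if v < vi:
            return i
    return len(vector)

print(range_partition(2, [5, 11]), range_partition(8, [5, 11]), range_partition(20, [5, 11]))  # 0 1 2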

Partitioning a Relation across Disks
 If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
 Large relations are preferably partitioned across all the available disks.
 If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
 The distribution of tuples to disks may be skewed, that is, some disks have many tuples, while others may have fewer tuples.
 Types of skew:
 Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
 Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
 To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

 Sort the relation on the partitioning attribute.
 Construct the partition vector by scanning the relation in sorted order, as follows:
 After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
 n denotes the number of partitions to be constructed.
 Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
 An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
 A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
 Assume uniform distribution within each range of the histogram.
 The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
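An illustrative sketch of the procedure above for building a balanced partition vector from a sorted relation (data made up):

# Build a balanced range-partition vector from sorted data (illustrative).
def balanced_vector(sorted_values, n):
    # After every 1/n-th of the relation, record the partitioning attribute of the next tuple.
    step = len(sorted_values) // n
    return [sorted_values[i * step] for i in range(1, n)]

print(balanced_vector(sorted([7, 3, 9, 1, 5, 11, 2, 8, 6, 10, 4, 12]), 3))  # [5, 9]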


Interquery Parallelism
 Queries/transactions execute in parallel with one another.
 Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
 Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
 More complicated to implement on shared-disk or shared-nothing architectures:
 Locking and logging must be coordinated by passing messages between processors.
 Data in a local buffer may have been updated at another processor.
 Cache-coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
 Example of a cache coherency protocol for shared-disk systems:
 Before reading/writing to a page, the page must be locked in shared/exclusive mode.
 On locking a page, the page must be read from disk.
 Before unlocking a page, the page must be written to disk if it was modified.
 More complex protocols with fewer disk reads/writes exist.
 Cache coherency protocols for shared-nothing systems are similar. Each database page is assigned a home processor. Requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
 Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
 Two complementary forms of intraquery parallelism:
 Intraoperation Parallelism – parallelize the execution of each individual operation in the query.
 Interoperation Parallelism – execute the different operations in a query expression in parallel.
 The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
 Choose processors P0, …, Pm, where m ≤ n − 1, to do sorting.
 Create a range-partition vector with m entries, on the sorting attributes.
 Redistribute the relation using range partitioning:

 All tuples that lie in the i-th range are sent to processor Pi.
 Pi stores the tuples it received temporarily on disk Di.
 This step requires I/O and communication overhead.
 Each processor Pi sorts its partition of the relation locally.


 Each processor executes the same operation (sort) in parallel with other processors, without any interaction with the others (data parallelism).
 The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
 Assume the relation has already been partitioned among disks D0, …, Dn−1.
 Each processor Pi locally sorts the data on disk Di.
 The sorted runs on each processor are then merged to get the final sorted output.
 Parallelize the merging of sorted runs as follows:
 The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
 Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
 The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.

Parallel Join
 The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
 Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
 In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
 For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
 Let r and s be the input relations, and we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1.
 Can use either range partitioning or hash partitioning.
 r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
 Partitions ri and si are sent to processor Pi.
 Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
 Partitioning is not possible for some join conditions:
 E.g., non-equijoin conditions, such as r.A > s.B.
 For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique:
 Depicted on next slide.
 Special case – asymmetric fragment-and-replicate:
 One of the relations, say r, is partitioned; any partitioning technique can be used.
 The other relation, s, is replicated across all the processors.
 Processor Pi then locally computes the join of ri with all of s, using any join technique.
 Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
 Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
 Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

 E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
 Assume s is smaller than r, and therefore s is chosen as the build relation.
 A hash function h1 takes the join attribute value of each tuple in s, and maps this tuple to one of the n processors.
 Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
 As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
 Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.
 As the r tuples are received at the destination processors, they are repartitioned using the function h2.
 Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
 Hash-join optimizations can be applied to the parallel case:
 E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
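The data flow above, simulated sequentially in Python for illustration: lists stand in for processors, and a Python dict plays the role of the local hash table built with h2 (all names invented):

# Partitioned parallel hash-join, simulated sequentially (illustrative).
def partitioned_hash_join(r, s, n):
    h1 = lambda key: key % n                    # distributes tuples to "processors"
    s_parts = [[] for _ in range(n)]
    r_parts = [[] for _ in range(n)]
    for key, val in s:
        s_parts[h1(key)].append((key, val))     # distribute build relation s by h1
    for key, val in r:
        r_parts[h1(key)].append((key, val))     # redistribute larger relation r by h1
    result = []
    for i in range(n):                          # each "processor" joins locally
        table = {}
        for key, val in s_parts[i]:             # build phase (h2 folded into the dict)
            table.setdefault(key, []).append(val)
        for key, val in r_parts[i]:             # probe phase
            for sval in table.get(key, []):
                result.append((key, val, sval))
    return result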

Parallel Nested-Loop Join
 Assume that:
 relation s is much smaller than relation r, and that r is stored by partitioning;
 there is an index on a join attribute of relation r at each of the partitions of relation r.
 Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
 Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi:
 At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
 Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism

 Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
 Set up a pipeline that computes the three joins in parallel:
 Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
 P2 be assigned the computation of temp2 = temp1 ⋈ r3,
 and P3 be assigned the computation of temp2 ⋈ r4.
 Each of these operations can execute in parallel, sending result tuples it computes to the next operation, even as it is computing further results:
 Provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
 Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
 Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
 P2 be assigned the computation of temp2 = r3 ⋈ r4,
 and P3 be assigned the computation of temp1 ⋈ temp2.
 P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2:
 Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
 Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.

Query Optimization
 Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
 Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.
 When scheduling an execution tree in a parallel system, we must decide:
 How to parallelize each operation, and how many processors to use for it.
 What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
 Determining the amount of resources to allocate for each operation is a problem:
 E.g., allocating more processors than optimal can result in high communication overhead.
 Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF (Portable Document Format) file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems & artificial intelligence.

Rule-based systems are:
- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

- Knowledge Engineering – building an expert system
- Knowledge Engineers – the people who build the system
- Knowledge Representation – the symbols used to represent the knowledge
- Factual Knowledge – knowledge of a particular task domain that is widely shared


- Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members
- Solve problems that would normally be tackled by a medical or other professional
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle (a toy interpreter illustrating this cycle is sketched after this component list):
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory
- A user interface or other connection to the outside world through which input and output signals are received and sent
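To make the match-resolve-act cycle concrete, here is a toy forward-chaining interpreter in Python; the two medical-flavoured rules and the working-memory facts are invented for illustration, and conflict resolution is simplified to "fire the first satisfied production whose action would add a new fact".

rules = [
    # (name, condition over working memory, fact the action adds)
    ("R1", lambda wm: "fever" in wm and "rash" in wm, "measles-suspected"),
    ("R2", lambda wm: "measles-suspected" in wm, "refer-to-doctor"),
]

working_memory = {"fever", "rash"}

while True:
    # Match: find all satisfied productions (the conflict set).
    conflict_set = [(n, fact) for n, cond, fact in rules
                    if cond(working_memory) and fact not in working_memory]
    if not conflict_set:
        break                     # interpreter halts: nothing left to fire
    # Conflict resolution: choose one instantiation (first found here).
    name, fact = conflict_set[0]
    # Act: execute the action, changing working memory, then re-match.
    working_memory.add(fact)
    print(f"fired {name}, added {fact}")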

Components of a Rule-Based System
- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals

Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise
- Heuristic rather than algorithmic (heuristics in search vs. in KBS: general vs. domain-specific)
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
- Logic: Resolution principle
- Production rules: backward (top-down, goal-directed), forward (bottom-up, data-driven)
- Semantic nets & Frames: Inheritance & advanced reasoning
- Case-based Reasoning: Similarity-based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool, applicable to a wide range
- Special-purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems
- From scratch
  - requires more time and effort
  - no constraints, unlike shells
  - shells should be investigated first

Some example KBSs
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)

Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure
(2) Interpretation – To provide an understanding of a situation from available information, e.g. DENDRAL
(3) Prediction – To predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT
(4) Design – To develop configurations that satisfy constraints of a design problem, e.g. XCON
(5) Planning – Both short term & long term, in areas like project management, product development or financial planning, e.g. HRM
(6) Monitoring – To check performance & flag exceptions, e.g. a KBS that monitors radar data and estimates the position of the space shuttle
(7) Control – To collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment
(8) Instruction – To train students and correct their performance, e.g. give medical students experience diagnosing illness
(9) Debugging – To identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages

- Increase availability of expert knowledge / expertise not otherwise accessible / training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimic organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, which is less expensive and performs better
- Improved processing power: many machines handle the database, rather than a single server

Disadvantages:
- More complex to implement
- More costly to maintain
- Security and integrity control are harder; standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID. In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with; any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction timestamp TS(Ti). If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp. Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
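A minimal Python sketch of the version and timestamp rules just described, assuming a simple key-value store; transaction management, restarts after abort, and garbage collection of obsolete versions are omitted, and all names are illustrative.

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (write_ts, value)
        self.read_ts = {}    # key -> largest timestamp that has read the key

    def read(self, key, ts):
        # Return the most recent version preceding the transaction timestamp.
        candidates = [(w, v) for w, v in self.versions.get(key, []) if w <= ts]
        if not candidates:
            return None
        w, v = max(candidates)
        self.read_ts[key] = max(self.read_ts.get(key, 0), ts)
        return v

    def write(self, key, value, ts):
        # Abort if a younger transaction has already read this object.
        if ts < self.read_ts.get(key, 0):
            raise RuntimeError("abort: write behind a later read")
        self.versions.setdefault(key, []).append((ts, value))

db = MVCCStore()
db.write("Object1", "Foo", 0)
db.write("Object1", "Hello", 1)
print(db.read("Object1", 1))   # "Hello" -- snapshot as of t1
print(db.read("Object1", 0))   # "Foo"   -- the older version is still visible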

At t1, the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded; it is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Hello"  "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases. (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)


  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications
  - The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the mobility, the entire domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  - Mobile units must be unrestricted throughout the cells of a domain while maintaining information access contiguity.

- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically only left-deep plans are considered.
- Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
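The following toy Python sketch illustrates what searching the plan space means: it enumerates join orders, estimates each order's cost from invented cardinalities and a fixed join selectivity, and keeps the cheapest. A real System R style optimizer uses dynamic programming over left-deep plans and catalog statistics rather than this brute-force enumeration; all numbers below are made-up stand-ins.

from itertools import permutations

cardinality = {"Sailors": 40000, "Reserves": 100000, "Boats": 500}
SELECTIVITY = 0.1   # assumed fraction of the cross product that joins

def estimate(order):
    size, cost = cardinality[order[0]], 0
    for relation in order[1:]:          # inner relation of each join level
        size = size * cardinality[relation] * SELECTIVITY
        cost += size                    # pay for each intermediate result
    return cost

best = min(permutations(cardinality), key=estimate)
print("cheapest left-deep order:", " join ".join(best))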

8 (a) Discuss multimedia databases in detail. (8) (NOVDEC 2010)

Multimedia databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.
- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries?
- Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

151 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: The answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
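By contrast, the recursive query that RA and SQL-92 cannot express can be written in SQL:1999 style. The sketch below uses Python's built-in sqlite3 module (whose SQL dialect supports WITH RECURSIVE) and an invented trike instance of Assembly.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Assembly (Part TEXT, Subpart TEXT, Qty INTEGER)")
conn.executemany("INSERT INTO Assembly VALUES (?, ?, ?)",
                 [("trike", "wheel", 3), ("trike", "frame", 1),
                  ("wheel", "spoke", 2), ("wheel", "tire", 1),
                  ("tire", "rim", 1)])

# Components at every level of the hierarchy, however deep.
rows = conn.execute("""
    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT Part, Subpart FROM Assembly
        UNION
        SELECT A.Part, C.Subpart
        FROM Assembly A JOIN Comp C ON A.Subpart = C.Part
    )
    SELECT * FROM Comp WHERE Part = 'trike'
""").fetchall()
print(rows)   # includes spoke, tire and rim, not just direct subparts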

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:

    ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation

    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
- Selection: Result(Y) :- R(X, Y), X = c.
- Projection: Result(Y) :- R(X, Y).
- Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
- Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
- Union: Result(X, Y) :- R(X, Y).
         Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < … > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: When recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For example:

    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
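A minimal Python sketch of seminaive fixpoint evaluation for the Comp program, ignoring the Qty field; each round joins Assembly only against the delta (the tuples generated in the previous iteration), so no inference is repeated. The Assembly instance is invented.

assembly = {("trike", "wheel"), ("wheel", "spoke"), ("wheel", "tire")}

comp = set(assembly)         # base case: direct subparts
delta = set(assembly)
while delta:
    new = {(p, s2) for (p, s1) in assembly
                   for (d1, s2) in delta if s1 == d1}
    delta = new - comp       # only genuinely new tuples feed the next round
    comp |= delta

print(sorted(comp))          # transitive closure of the part hierarchy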

15.4.2 Pushing Selections to Avoid Irrelevant Inferences

    SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

    SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.

    Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
    Magic_SL(spoke).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 5: Database Technology



(b) Write notes on the following: (i) Query processing (8) (JUNE 2010)





(ii) Transaction processing (8) (JUNE 2010)

A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
- concurrency transparency
- failure transparency

Example Transaction – SQL Version

Begin_transaction Reservation
begin
    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC(FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end Reservation

Properties of Transactions
- ATOMICITY: all or nothing
- CONSISTENCY: no violation of integrity constraints
- ISOLATION: concurrent changes invisible, i.e. serializable
- DURABILITY: committed updates persist

These are the ACID properties of transactions.

Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.

Consistency:
- Internal consistency
  - A transaction which executes alone against a consistent database leaves it in a consistent state.
  - Transactions do not violate database integrity constraints.
- Transactions are correct programs.

Isolation degrees:
- Degree 0
  - Transaction T does not overwrite dirty data of other transactions.
  - Dirty data refers to data values that have been updated by a transaction prior to its commitment.
- Degree 2
  - T does not overwrite dirty data of other transactions.
  - T does not commit any writes before EOT.
  - T does not read dirty data from other transactions.
- Degree 3
  - T does not overwrite dirty data of other transactions.
  - T does not commit any writes before EOT.
  - T does not read dirty data from other transactions.


  - Other transactions do not dirty any data read by T before T completes.

Isolation
- Serializability: If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
- Incomplete results: An incomplete transaction cannot reveal its results to other transactions before its commitment. This is necessary to avoid cascading aborts.

Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This requires database recovery.

Transaction transparency: Ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.

Concurrency transparency: All transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.

Failure transparency: Must ensure atomicity and durability of the global transaction.
- Means ensuring that the subtransactions of the global transaction either all commit or all abort.

Classification transparency: In IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a) Discuss the Modeling and design approaches for Object Oriented Databases. (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database. (16) (NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects like aggregation and association), behavior (a set of operations), and characteristics of types (generalization/serialization). An object is similar to an entity in the ER model; therefore, we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables linked by keys to the main table.

Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types, and at the behavioral level.

Generalization/Serialization is the "is a" relationship, which is supported in OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, …), i.e., a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state by modifying its attribute values, or to query the value of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.

The main differences between relational database design and object-oriented database design include:
- Many-to-many relationships must be removed before entities can be translated into relations. Many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model. Operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys. In the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages. (DECEMBER 2010)

Multi-Version Locks: the discussion is the same as for Question 7(a) above (multiversion concurrency control, its timestamp rules, and the snapshot example).

Recovery



3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining. (16) (NOVDEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying; permits study of historical trends.
  - Shifts decision support query load away from transaction processing systems.

When and how to gather data


- Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
- Destination-driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
  - It is usually OK to have slightly out-of-date data at the warehouse.
  - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
- Schema integration
- Data cleansing
  - E.g., correct mistakes in addresses (misspellings, zip code errors).
  - Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").

How to propagate updates
- The warehouse schema may be a (materialized) view of the schema from data sources.
- Efficient techniques exist for update of materialized views.

What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Typically, warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
- More complicated schema structures:
  - Snowflake schema: multiple levels of dimension tables.
  - Constellation: multiple fact tables.

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
- Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
- Predict if a customer is likely to switch brand loyalty.
- Predict if a customer is likely to respond to "junk mail".
- Predict if a pattern of phone calling card usage is likely to be fraudulent.

Some examples of prediction mechanisms:
- Classification:


  - Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
- Regression formulae:
  - Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.

Descriptive Patterns
- Associations:
  - Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
  - Other similar applications: camera accessories, clothes, etc.
- Associations may also be used as a first step in detecting causation:
  - E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
- Clusters:
  - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
  - Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.

    ∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
    ∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good

- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.


  - Different nodes at the same level may use different partitioning attributes.
- Greedy top-down generation of decision trees:
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
  - The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
- Otherwise, the data at the node is partitioned further by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r
        Partition(Si)
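A runnable Python version of this greedy scheme, under simplifying assumptions: categorical attributes only, purity measured as the majority-class fraction, and invented thresholds standing in for δp and δs. The tiny training set is made up for illustration.

from collections import Counter

def purity(rows):
    # Fraction of rows in the majority class; 1.0 means the node is pure.
    counts = Counter(label for _, label in rows)
    return max(counts.values()) / len(rows)

def grow_tree(rows, attrs, dp=0.95, ds=2):
    if purity(rows) >= dp or len(rows) <= ds or not attrs:
        return Counter(label for _, label in rows).most_common(1)[0][0]
    # Greedily pick the attribute whose split yields the purest children.
    def split_score(a):
        groups = {}
        for features, label in rows:
            groups.setdefault(features[a], []).append((features, label))
        return sum(purity(g) * len(g) for g in groups.values()) / len(rows)
    best = max(attrs, key=split_score)
    children = {}
    for features, label in rows:
        children.setdefault(features[best], []).append((features, label))
    return (best, {v: grow_tree(g, [a for a in attrs if a != best], dp, ds)
                   for v, g in children.items()})

rows = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "medium"}, "good"),
        ({"degree": "bachelors", "income": "low"}, "average")]
print(grow_tree(rows, ["degree", "income"]))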

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers

Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; we won't cover them here.

Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)

Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
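A small Python sketch of such a classifier: one histogram per (attribute, class) pair plus class priors, with add-one smoothing (an addition not mentioned above) so that unseen attribute values do not zero out the product. The training tuples are invented.

from collections import Counter, defaultdict

def train(rows):
    priors = Counter(label for _, label in rows)
    hists = defaultdict(Counter)        # (attr_index, class) -> value counts
    for features, label in rows:
        for i, v in enumerate(features):
            hists[(i, label)][v] += 1
    return priors, hists

def classify(features, priors, hists):
    total = sum(priors.values())
    def score(c):
        p = priors[c] / total
        for i, v in enumerate(features):
            h = hists[(i, c)]
            p *= (h[v] + 1) / (sum(h.values()) + len(h) + 1)  # smoothed p(di|c)
        return p
    return max(priors, key=score)

rows = [(("masters", "high"), "excellent"), (("bachelors", "high"), "good"),
        (("bachelors", "low"), "good")]
priors, hists = train(rows)
print(classify(("masters", "high"), priors, hists))   # -> 'excellent'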

Regression
Regression deals with the prediction of a value, rather than a class:
- Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
- One way is to infer coefficients a0, a1, …, an such that

    Y = a0 + a1 X1 + a2 X2 + … + an Xn

- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

Regression aims to find coefficients that give the best possible fit.
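A minimal sketch of fitting the coefficients by least squares (assuming NumPy is available; the data values are made up):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # X1, X2
Y = np.array([8.0, 7.0, 18.0, 17.0])

A = np.hstack([np.ones((len(X), 1)), X])          # leading 1-column gives a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)    # best-fit a0, a1, a2
print(coeffs)   # Y ≈ a0 + a1·X1 + a2·X2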

Association Rules
Retail shops are often interested in associations between the different items that people buy.

Someone who buys bread is quite likely also to buy milk. A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.

Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks


Left-hand side: antecedent; right-hand side: consequent.

An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.

Rules have an associated support, as well as an associated confidence.

Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support: rules with low support are usually not very useful.

Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).

Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − b ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − b).
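A Python sketch of this naïve algorithm (the thresholds are assumed values; real systems use smarter itemset enumeration):

from itertools import combinations

MIN_SUPPORT, MIN_CONF = 0.02, 0.5                    # assumed thresholds

def association_rules(transactions):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    support = {}
    # steps 1-2: all possible itemsets, with their support
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            s = sum(1 for t in transactions if set(itemset) <= t) / n
            if s >= MIN_SUPPORT:
                support[frozenset(itemset)] = s      # large itemset
    rules = []
    # step 3: from each large itemset A, the rule A - b => b for b in A
    for A, sA in support.items():
        for b in A:
            lhs = A - {b}
            if lhs in support:
                conf = sA / support[lhs]             # step 4
                if conf >= MIN_CONF:
                    rules.append((set(lhs), b, sA, conf))
    return rules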

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:

Positive correlation: co-occurrence is higher than predicted.
Negative correlation: co-occurrence is lower than predicted.

Sequence associations/correlations. E.g., whenever bonds go up, stock prices go down in 2 days.

Deviations from temporal patterns. E.g., deviation from a steady growth. E.g., sales of winter wear go down in summer: not surprising, part of a known pattern; look for deviation from the value predicted using past patterns.

Clustering


Intuitively, clustering means finding clusters of points in the given data, such that similar points lie in the same cluster.

Clustering can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.

Centroid: point defined by taking the average of coordinates in each dimension. Another metric: minimize the average distance between every pair of points in a cluster.

Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the BIRCH clustering algorithm (more shortly).
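An illustrative k-means loop in Python for the first formalization (group points into k sets, minimizing distance to centroids); the random initialization and fixed iteration count are simplifying assumptions:

import random

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to nearest centroid
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        for i, cl in enumerate(clusters):        # centroid = average per dimension
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids, clusters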

Hierarchical Clustering
Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).

Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

In mobile computing, the problems are more difficult, mainly:

The limited and intermittent connectivity afforded by wireless communications.
The limited life of the power supply (battery).
The changing topology of the network.

In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –

The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).

Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –

Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.

Mobile units may move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:

Communication latency
Intermittent connectivity
Limited battery life
Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.

One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges.

Servers must keep track of client locations in order to efficiently route messages to them.

Client data should be stored in the network location that minimizes the traffic necessary to access it.

The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.

Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:

1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.

2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases include:

Data distribution and replication; transaction models; query processing; recovery and fault tolerance; mobile database design; location-based service; division of labor; security.

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.

A server cannot connect to a client at will.

The characteristics of ISDBs (contd.):

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.

A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold, listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
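As a sketch of this idea (the items and search_log tables and their columns are hypothetical, not from the text), a Web application could both answer the search and record it:

import sqlite3

conn = sqlite3.connect("inventory.db")   # assumes the tables already exist

def search_inventory(term):
    # answer the customer's pricing/availability question
    rows = conn.execute(
        "SELECT name, price, quantity FROM items WHERE name LIKE ?",
        ("%" + term + "%",)).fetchall()
    # record the search so unmet demand can be analyzed later
    conn.execute("INSERT INTO search_log(term, found) VALUES (?, ?)",
                 (term, len(rows) > 0))
    conn.commit()
    return rows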

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as

a collection of entities, and relationships among entities.

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
Simple and composite attributes
Single-valued and multi-valued attributes
Null attributes
Derived attributes

Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:

{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
§ Refers to the number of entity sets that participate in a relationship set.
§ Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
§ Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.

§ The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
§ Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
§ Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

§ Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
§ Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
§ Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities

§ Express the number of entities to which another entity can be associated via a relationship set.
§ Most useful in describing binary relationship sets.
§ For a binary relationship set, the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one
– Many to many
§ We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

§ A customer is associated with at most one loan via the relationship borrower.
§ A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

§ In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
§ In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


§ A customer is associated with several (possibly 0) loans via borrower.
§ A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
§ If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
– y is a dominant entity (in the example below, loan)
– x is a subordinate entity (in the example below, payment)
§ If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
§ Rectangles represent entity sets.
§ Ellipses represent attributes.
§ Diamonds represent relationship sets.
§ Lines link attributes to entity sets, and entity sets to relationship sets.
§ Double ellipses represent multivalued attributes.
§ Dashed ellipses denote derived attributes.
§ Primary key attributes are underlined.

Weak Entity Sets
§ An entity set that does not have a primary key is referred to as a weak entity set.
§ The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
§ The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
§ The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
§ We depict a weak entity set by double rectangles.
§ We underline the discriminator of a weak entity set with a dashed line.
§ payment-number – discriminator of the payment entity set.
§ Primary key for payment – (loan-number, payment-number).

Specialization
§ Top-down design process: we designate subgroupings within an entity set that are distinctive from the other entities in the set.


§ These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
§ Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization

§ A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
§ Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
§ Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
§ Constraint on which entities can be members of a given lower-level entity set:
– condition-defined
– user-defined
§ Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
– disjoint
– overlapping
§ Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
– total
– partial

Aggregation
§ Relationship sets borrower and loan-officer represent the same information.
§ Eliminate this redundancy via aggregation:
– treat the relationship as an abstract entity;
– allows relationships between relationships;
– abstraction of a relationship into a new entity.
§ Without introducing redundancy, the following diagram represents that:
– a customer takes out a loan;
– an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
§ The use of an attribute or entity set to represent an object.
§ Whether a real-world concept is best expressed by an entity set or a relationship set.
§ The use of a ternary relationship versus a pair of binary relationships.
§ The use of a strong or weak entity set.
§ The use of generalization – contributes to modularity in the design.
§ The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
§ Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


§ A database which conforms to an E-R diagram can be represented by a collection of tables.
§ For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
§ Each table has a number of columns (generally corresponding to attributes), which have unique names.
§ Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
§ A strong entity set reduces to a table with the same attributes.
§ A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
§ A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
§ The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
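A minimal sketch of this reduction for the customer/account/depositor example above (SQL executed through Python's sqlite3 purely for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- strong entity sets reduce to tables with the same attributes
CREATE TABLE customer (
    social_security TEXT PRIMARY KEY,
    customer_name   TEXT,
    customer_street TEXT,
    customer_city   TEXT);
CREATE TABLE account (
    account_number  TEXT PRIMARY KEY,
    balance         REAL);
-- the many-to-many relationship set depositor gets the primary keys
-- of both participating entity sets plus its descriptive attribute
CREATE TABLE depositor (
    social_security TEXT REFERENCES customer,
    account_number  TEXT REFERENCES account,
    access_date     TEXT,
    PRIMARY KEY (social_security, account_number));
""")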

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
§ Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
§ Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
• Use the term chronon, instead of point, to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
• Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events


• Single time-point event, e.g., a bank deposit.
• A series of point events can form time-series data.

Duration events
• Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time
• The time when the information from a certain transaction becomes valid.

Bitemporal database
• Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
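A small sketch of tuple versioning (the emp_salary table is hypothetical; sqlite3 is used only for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp_salary (
    emp_id      INTEGER,
    salary      REAL,
    valid_start TEXT,    -- valid start time
    valid_end   TEXT);   -- valid end time ('9999-12-31' = still current)
INSERT INTO emp_salary VALUES
    (1, 40000, '2008-01-01', '2009-06-30'),
    (1, 45000, '2009-07-01', '9999-12-31');
""")
# which salary was valid for employee 1 on 2009-01-15?
row = conn.execute(
    "SELECT salary FROM emp_salary "
    "WHERE emp_id = 1 AND valid_start <= ? AND ? <= valid_end",
    ("2009-01-15", "2009-01-15")).fetchone()
print(row[0])   # 40000.0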

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.

Time-varying attribute
• An attribute that changes over time, e.g., age.

Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.

Spatial Join Queries
• Find all cities near a lake.
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.

Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.

Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.

Non-leaf entry = <n-dim box, ptr to child node>
• The box covers all boxes in the child node (in fact, subtree).

All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
• Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
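A compact Python sketch of this recursive overlap search (the node layout is an assumed in-memory representation, not a real R-tree implementation):

def overlaps(a, b):
    # boxes are ((lo0, lo1, ...), (hi0, hi1, ...)); overlap in every dimension
    return all(a[0][i] <= b[1][i] and b[0][i] <= a[1][i]
               for i in range(len(a[0])))

def search(node, q, results):
    if node["leaf"]:
        # step 2: rid identifies an object that might overlap Q
        results.extend(rid for box, rid in node["entries"] if overlaps(box, q))
    else:
        # step 1: descend only into subtrees whose box overlaps Q
        for box, child in node["entries"]:
            if overlaps(box, q):
                search(child, q, results)
    return results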

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
• This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
• The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
• Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
• Adjust the entry for L in its parent so that the box now covers (only) L1.
• Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
• Remove some (say, 30% of the) entries and reinsert them into the tree.
• This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
• Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency-control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data
Typically, high-dimensional datasets are collections of points, not regions.
• E.g., feature vectors in multimedia applications.
• Very sparse.

Nearest-neighbor queries are common.
• The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
• In any given data set, it is advisable to empirically test contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

prices of microprocessors, memory, and disks have dropped sharply.

Databases are growing increasingly large:

large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:

data can be partitioned, and each processor can work independently on its own partition.

Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.

Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.

Hash partitioning: choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0 … n − 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn−2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; and tuples with v ≥ vn−2 go to disk n − 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 ≤ r.A < 25.
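An illustrative Python sketch (not from the text) of the three partitioning techniques, with disks simulated as lists:

def round_robin(tuples, n):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)               # i-th tuple to disk i mod n
    return disks

def hash_partition(tuples, n, attr):
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[hash(t[attr]) % n].append(t)   # h with range 0..n-1
    return disks

def range_partition(tuples, vector, attr):
    # vector = [v0, ..., v_{n-2}] defines n = len(vector) + 1 disks
    disks = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        i = sum(1 for v in vector if t[attr] >= v)
        disks[i].append(t)
    return disks

# range_partition([{"A": 2}, {"A": 8}, {"A": 20}], [5, 11], "A")
# sends value 2 to disk 0, 8 to disk 1, and 20 to disk 2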

Round-robin
Advantages:
• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Range queries are difficult to process:
• No clustering – tuples are scattered across all disks.

Hash partitioning
• Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:

− The remaining disks are available for other queries.
− Good if result tuples are from one to a few blocks.


− If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks: if a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others may have fewer tuples.

Types of skew:
Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
• n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.

An alternative technique, based on histograms, is used in practice.

Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.

• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
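A minimal sketch of the sort-based construction of a balanced partitioning vector (assuming the partitioning attribute is a key and there are at least n values):

def partition_vector(values, n):
    # sort on the partitioning attribute, then record the value seen
    # after every 1/n-th of the relation
    s = sorted(values)
    step = len(s) // n
    return [s[i * step] for i in range(1, n)]

# partition_vector([7, 3, 11, 5, 2, 20], 3) -> [5, 11]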


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput, and is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:

Intraoperation parallelism – parallelize the execution of each individual operation in the query.
Interoperation parallelism – execute the different operations in a query expression in parallel.

The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, …, Pm, where m ≤ n − 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

• All tuples that lie in the i-th range are sent to processor Pi.
• Pi stores the tuples it receives temporarily on disk Di.
• This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, …, Dn−1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:

• The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
• Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
• The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1. Either range partitioning or hash partitioning can be used. r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case – asymmetric fragment-and-replicate:
• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used.

E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.

Hash-join optimizations can be applied to the parallel case, e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
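A single-process Python sketch of the partitioned parallel hash-join idea, with the n processors simulated as partitions and plain hashing standing in for h1 (the further h2 split is noted but omitted):

N = 4   # number of simulated processors

def partitioned_hash_join(r, s, r_attr, s_attr):
    r_parts = [[] for _ in range(N)]
    s_parts = [[] for _ in range(N)]
    for t in s:                                  # redistribute s using h1
        s_parts[hash(t[s_attr]) % N].append(t)
    for t in r:                                  # then redistribute r using h1
        r_parts[hash(t[r_attr]) % N].append(t)
    out = []
    for i in range(N):                           # each "processor" joins locally
        build = {}                               # build phase on s_i
        for t in s_parts[i]:
            build.setdefault(t[s_attr], []).append(t)
        for t in r_parts[i]:                     # probe phase with r_i
            for m in build.get(t[r_attr], []):
                out.append({**t, **m})
        # (a real system would partition further with h2 to fit memory)
    return out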

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperation Parallelism
Pipelined parallelism:

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
• P2 the computation of temp2 = temp1 ⋈ r3,
• and P3 the computation of temp2 ⋈ r4.
Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
• P2 the computation of temp2 = r3 ⋈ r4,
• and P3 the computation of temp1 ⋈ temp2.
P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

One can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• how to parallelize each operation, and how many processors to use for it;
• what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.

Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence", like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.

Rule-based systems are:
– knowledge-based systems;
– part of the artificial intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

Knowledge engineering – building an expert system. Knowledge engineers – the people who build the system. Knowledge representation – the symbols used to represent the knowledge. Factual knowledge – knowledge of a particular task domain that is widely shared.


Heuristic knowledge – more judgmental knowledge of performance in a task domain.

Uses of rule-based systems:
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.

Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
– Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
– Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
– Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System
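An illustrative Python sketch of the match-resolve-act cycle (the rule format and the trivial conflict-resolution strategy are assumptions made for the example):

def run(rules, working_memory):
    # rules: list of (lhs_facts, rhs_fact); working_memory: set of facts
    while True:
        # Match: instantiations whose left-hand side is satisfied
        conflict_set = [(lhs, rhs) for lhs, rhs in rules
                        if lhs <= working_memory and rhs not in working_memory]
        if not conflict_set:
            return working_memory          # no production satisfied: halt
        # Conflict resolution: here, simply pick the first instantiation
        lhs, rhs = conflict_set[0]
        # Act: fire the production, changing working memory
        working_memory = working_memory | {rhs}

facts = run([({"fever", "rash"}, "suspect-measles")], {"fever", "rash"})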

• Set of rules – derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
• Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward-chaining – starts from a set of conditions and moves towards some conclusion.
• Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Cannot give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems
Definition: a system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.

• Heuristic, rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:

KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference
• Logic – resolution principle.
• Production rules – backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven).
• Semantic nets & frames – inheritance & advanced reasoning.
• Case-based reasoning – similarity-based.

KBS tools – shells
– Consist of a KA tool, database & development interface.
• Inductive shells:
– the simplest;
– example cases represented as a matrix of known data (premises) and resulting effects;
– the matrix is converted into a decision tree or IF-THEN statements;
– examples selected for the tool.
• Rule-based shells:
– simple to complex;
– IF-THEN rules.
• Hybrid shells:
– sophisticated & powerful;
– support multiple KR paradigms & reasoning schemes;
– generic tools applicable to a wide range.
• Special-purpose shells:
– specifically designed for particular types of problems;
– restricted to specialised problems.
• From scratch:
– requires more time and effort;
– no constraints like shells;
– shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).

Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimics organisational structure with data
- local access and autonomy without exclusion
- cheaper to create and easier to expand
- improved availability/reliability/performance by removing reliance on a central site
- reduced communication overhead: most data access is local, which is less expensive and performs better
- improved processing power: many machines handle the database rather than a single server

Disadvantages:
- more complex to implement
- more costly to maintain
- security and integrity control is harder
- standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version would have a write timestamp, and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
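A minimal Python sketch of this timestamp-based version selection (hypothetical structures, not any particular DBMS's implementation):

# Each object keeps a list of versions: [write_ts, read_ts, value].
# A transaction reads the newest version no later than its own timestamp.
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of [write_ts, read_ts, value]

    def read(self, key, ts):
        older = [v for v in self.versions[key] if v[0] <= ts]
        v = max(older, key=lambda v: v[0])   # most recent version before ts
        v[1] = max(v[1], ts)                 # bump the read timestamp
        return v[2]

    def write(self, key, ts, value):
        vs = self.versions.setdefault(key, [])
        older = [v for v in vs if v[0] <= ts]
        if older and max(older, key=lambda v: v[0])[1] > ts:
            raise Exception("abort: TS(Ti) < RTS(P), restart transaction")
        vs.append([ts, ts, value])           # new version, RTS = WTS = ts

store = MVCCStore()
store.write("Object1", 1, "Foo")
store.write("Object1", 2, "Hello")
print(store.read("Object1", 1))   # 'Foo'  -- old snapshot still readable
print(store.read("Object1", 3))   # 'Hello'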

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)


  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications –
  - The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates that range from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

- Client/Network Relationships –
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  - Mobile units can be unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
- The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients.
  - The latency involved in wireless communication makes scalability a problem.
  - Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
  - Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges.
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client.
  - The server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  - The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  - The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance


  - Mobile database design
  - Location-based service
  - Division of labor
  - Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically left-deep plans only.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: tree of RA ops, with a choice of algorithm for each op.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
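A minimal Python sketch of this pull (iterator) interface, with hypothetical operators rather than a real DBMS's API:

# Each operator 'pulls' tuples from its input on demand (iterator model).
def scan(table):
    for tup in table:            # leaf operator: yields stored tuples
        yield tup

def select(pred, child):
    for tup in child:            # pulls on its input and filters
        if pred(tup):
            yield tup

def project(cols, child):
    for tup in child:
        yield tuple(tup[c] for c in cols)

sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5)]
# SELECT sname FROM Sailors WHERE rating > 7
plan = project([1], select(lambda t: t[2] > 7, scan(sailors)))
print(list(plan))                # [('lubber',)]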

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods). A sketch of this enumeration follows.
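A minimal sketch of enumerating left-deep join orders in Python, with a toy cost function standing in for the real System R cost model (the page counts come from the example schema; the selectivity figure is an assumption for illustration):

from itertools import permutations

pages = {"Reserves": 1000, "Sailors": 500}   # relation -> number of pages

def cost(order):
    # Toy cost model: page-nested-loops cost, assuming each intermediate
    # result shrinks to 10% of the left input (an illustrative guess).
    total, left_pages = 0, pages[order[0]]
    for rel in order[1:]:
        total += left_pages * pages[rel]
        left_pages = max(1, left_pages // 10)
    return total

# Each permutation corresponds to one left-deep tree:
# ((R1 join R2) join R3) join ...
best = min(permutations(pages.keys()), key=cost)
print(list(best), cost(best))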

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)

Multimedia databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - Must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. e.g. identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. e.g. test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries?
  - Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

151 Introduction to recursive queries


1511 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e. be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g. rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

152 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.
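A minimal Python sketch of computing such a least fixpoint by naive iteration, for the usual transitive-closure (Comp) program over an Assembly relation (toy data assumed):

# Naive fixpoint evaluation of:
#   Comp(P, S) :- Assembly(P, S, Q)
#   Comp(P, S) :- Assembly(P, P2, Q), Comp(P2, S)
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "rim", 1)}

comp = set()
while True:
    new = {(p, s) for (p, s, q) in assembly}
    new |= {(p, s) for (p, p2, q) in assembly
                   for (p2b, s) in comp if p2 == p2b}
    if new <= comp:        # no new tuples: least fixpoint reached
        break
    comp |= new

print(sorted(comp))        # includes ('trike', 'spoke') after two rounds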

ii Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e. D is the set of all sets of integers).
  - E.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'If the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

1531 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

1532 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

1533 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

154 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

1541 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

1542 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
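A minimal Python sketch of seminaive (delta) evaluation of the Comp program, to compare with the naive loop shown earlier (same toy Assembly data assumed):

# Seminaive evaluation: join only against tuples derived in the previous
# iteration (the delta), so no inference is ever repeated.
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "rim", 1)}

comp = {(p, s) for (p, s, q) in assembly}   # base rule, iteration 0
delta = set(comp)
while delta:
    # Comp(P, S) :- Assembly(P, P2, Q), delta_Comp(P2, S)
    derived = {(p, s) for (p, p2, q) in assembly
                      for (p2b, s) in delta if p2 == p2b}
    delta = derived - comp                  # keep only genuinely new tuples
    comp |= delta

print(sorted(comp))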


Page 6: Database Technology


(b) Write notes on the following: (i) Query processing (8) (JUNE 2010)


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency.
- concurrency transparency
- failure transparency

Example Transaction – SQL Version
Begin_transaction Reservation
begin


input(flight_no, date, customer_name);
EXEC SQL UPDATE FLIGHT
  SET STSOLD = STSOLD + 1
  WHERE FNO = flight_no AND DATE = date;
EXEC SQL INSERT
  INTO FC(FNO, DATE, CNAME, SPECIAL)
  VALUES (flight_no, date, customer_name, null);
output("reservation completed")
end Reservation

Properties of Transactions
ATOMICITY
- all or nothing
CONSISTENCY
- no violation of integrity constraints
ISOLATION
- concurrent changes invisible (serializable)
DURABILITY
- committed updates persist
These are the ACID properties of a transaction.

Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.

Consistency:
Internal consistency
- A transaction which executes alone against a consistent database leaves it in a consistent state.
- Transactions do not violate database integrity constraints.
Transactions are correct programs.

Isolation:
Degree 0
- Transaction T does not overwrite dirty data of other transactions.
- Dirty data refers to data values that have been updated by a transaction prior to its commitment.
Degree 2
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.
Degree 3
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.


- Other transactions do not dirty any data read by T before T completes.

Isolation: Serializability
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- This is necessary to avoid cascading aborts.

Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures.
- Database recovery

Transaction transparency: Ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of sub-transactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of its sub-transactions.

Concurrency transparency: All transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.

Failure transparency: Must ensure atomicity and durability of the global transaction.
- Means ensuring that the sub-transactions of the global transaction either all commit or all abort.

Classification transparency: In IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a) Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database (16) (NOV/DEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface or Database API that provides DBMS functionality, effectively as an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions and long transactions are extremely important.
Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution and dynamic generation of new types.
Object data modeling: An object consists of three parts: structure (attributes, and relationships to other objects such as aggregation and association), behavior (a set of operations) and characteristics of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects, in an RDBMS, are usually other tables linked by keys to the main table.
Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.
Generalization/Specialization is the 'is-a' relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...) is a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.
The main differences between relational database design and object-oriented database design include:
- Many-to-many relationships must be removed before entities can be translated into relations. Many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys. In the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.
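A minimal Python sketch of these ideas, loosely following the Book example (the class and attribute names are illustrative; persistence itself is not shown):

# Structure (attributes), behavior (methods) and an is-a hierarchy.
class Publisher:
    def __init__(self, name, pub_id):
        self.name, self.pub_id = name, pub_id   # attributes (structure)

    def insert(self, name, pub_id):             # method invoked by a message
        self.name, self.pub_id = name, pub_id

class Book:
    def __init__(self, title, published_by):
        self.title = title
        self.publishedBy = published_by          # complex-typed attribute

class ArtBook(Book):                             # ArtBook 'is a' Book
    def __init__(self, title, published_by, style):
        super().__init__(title, published_by)
        self.style = style

p = Publisher("Rose", 123)
b = ArtBook("Impressionists", p, "art")          # inherits Book's structure
p.insert("Rose Press", 124)                      # message -> method execution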

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version would have a write timestamp, and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining (16) (NOV/DEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying; permits study of historical trends.
  - Shifts decision support query load away from transaction processing systems.

When and how to gather data


- Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night).
- Destination-driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive.
  - It is usually OK to have slightly out-of-date data at the warehouse.
  - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
- Schema integration

Data cleansing
- E.g. correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates. Keep only one address record per household ("householding").

How to propagate updates
- The warehouse schema may be a (materialized) view of the schema from data sources.
- Efficient techniques exist for update of materialized views.

What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Typically, warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
- More complicated schema structures:
  - Snowflake schema: multiple levels of dimension tables.
  - Constellation: multiple fact tables.

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required:
  - Pre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns.

Applications of Data Mining
- Prediction based on past history:
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict if a customer is likely to switch brand loyalty.
  - Predict if a customer is likely to respond to "junk mail".
  - Predict if a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification:


    - Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
  - Regression formulae:
    - Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
- Descriptive patterns:
  - Associations:
    - Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
    - Other similar applications: camera accessories, clothes, etc.
  - Associations may also be used as a first step in detecting causation:
    - E.g. association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
  - Clusters:
    - E.g. typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.
  - ∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.


  - Different nodes at the same level may use different partitioning attributes.
- Greedy top-down generation of decision trees:
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  - (More on choosing the partitioning attribute/condition shortly.)
  - The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node. Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
  Partition(S)

Procedure Partition(S)
  if (purity(S) > dp or |S| < ds) then return
  for each attribute A
    evaluate splits on attribute A
  use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
  for i = 1, 2, ..., r
    Partition(Si)
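A minimal Python sketch of this greedy, recursive construction (toy purity test and categorical splits only; the attribute choice and thresholds below are illustrative, not a real split-scoring algorithm):

# Greedy top-down decision-tree construction, following GrowTree/Partition.
def purity(S):
    labels = [label for _, label in S]
    return labels.count(max(set(labels), key=labels.count)) / len(labels)

def partition(S, attrs, dp=0.9, ds=2):
    labels = [label for _, label in S]
    if purity(S) > dp or len(S) < ds or not attrs:
        return ("leaf", max(set(labels), key=labels.count))  # majority class
    A = attrs[0]             # toy choice; a real algorithm scores every split
    groups = {}
    for row, label in S:
        groups.setdefault(row[A], []).append((row, label))
    return ("node", A, {v: partition(rows, attrs[1:], dp, ds)
                        for v, rows in groups.items()})

# Toy training set: ({attribute: value}, class)
data = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "masters", "income": "low"}, "good"),
        ({"degree": "bachelors", "income": "low"}, "good")]
print(partition(data, ["degree", "income"]))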

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; won't cover here.
- Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
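A minimal sketch of such a naive Bayesian classifier in Python (toy categorical data; per-class histogram counts stand in for the estimated probabilities):

from collections import Counter, defaultdict

def train(rows):
    # rows: list of (attribute-tuple, class); returns priors and histograms
    priors = Counter(c for _, c in rows)
    hists = defaultdict(Counter)       # (class, attr-index) -> value counts
    for attrs, c in rows:
        for i, v in enumerate(attrs):
            hists[(c, i)][v] += 1
    return priors, hists

def classify(attrs, priors, hists):
    def score(c):
        p = priors[c]                  # proportional to p(cj)
        for i, v in enumerate(attrs):  # p(d|cj) = product of the p(di|cj)
            p *= hists[(c, i)][v] / priors[c]
        return p                       # p(d) ignored: same for all classes
    return max(priors, key=score)

rows = [(("masters", "high"), "excellent"), (("masters", "low"), "good"),
        (("bachelors", "low"), "good")]
priors, hists = train(rows)
print(classify(("masters", "high"), priors, hists))   # 'excellent'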

Regression
Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

- One way is to infer coefficients a0, a1, ..., an such that
    Y = a0 + a1 X1 + a2 X2 + ... + an Xn
- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.

- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

- Regression aims to find coefficients that give the best possible fit.
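A minimal sketch of fitting such a linear polynomial by least squares, assuming numpy is available (the data is toy data generated from Y = 1 + 2*X1 + 3*X2 with noise):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 2.0]])
Y = np.array([9.1, 7.9, 16.2, 14.8])

A = np.hstack([np.ones((len(X), 1)), X])        # prepend a column for a0
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
print(coeffs)                                    # approximately [1, 2, 3]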

Association Rules
- Retail shops are often interested in associations between different items that people buy:
  - Someone who buys bread is quite likely also to buy milk.
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

- Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks


- Left hand side: antecedent; right hand side: consequent.
- An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
- Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater).
Naive algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A - b ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A - b).
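A minimal Python sketch of this naive algorithm (toy transactions; a 50% support threshold is assumed for illustration):

from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "cereal"},
                {"bread", "screwdriver"}, {"milk", "cereal"}]
min_support, items = 0.5, set().union(*transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Steps 1-2: all itemsets with sufficiently high support ("large itemsets")
large = [set(c) for n in range(1, len(items) + 1)
                for c in combinations(items, n)
                if support(set(c)) >= min_support]

# Steps 3-4: from each large itemset A, rules A - b => b with confidence
for A in large:
    if len(A) > 1:
        for b in A:
            conf = support(A) / support(A - {b})
            print(sorted(A - {b}), "=>", b, "confidence", round(conf, 2))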

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting.
  - E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2).
  - We are interested in positive as well as negative correlations between sets of items:
    - Positive correlation: co-occurrence is higher than predicted.
    - Negative correlation: co-occurrence is lower than predicted.
- Sequence associations/correlations:
  - E.g., whenever bonds go up, stock prices go down in 2 days.
- Deviations from temporal patterns:
  - E.g., deviation from a steady growth.
  - E.g., sales of winter wear go down in summer: not surprising, part of a known pattern. Look for deviation from the value predicted using past patterns.

Clustering


- Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways:
  - E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
    - Centroid: the point defined by taking the average of the coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the Birch clustering algorithm (more shortly).
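A minimal sketch of the first metric above (centroid-based k-means clustering) in Python, with toy one-dimensional points and k = 2 assumed:

import random

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # assign each point to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # recompute each centroid as the average of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
print(kmeans(points, 2))   # two centroids, near 1 and near 9.5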

Hierarchical Clustering
- Example from biological classification. Other examples: Internet directory systems (e.g. Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster the small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)

Mobile databases

- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.


- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications –
  - The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates that range from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges.

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

• A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate to a set of receivers or update a group of clients.
• A server cannot connect to a client at will.

• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.

• A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
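The following is a minimal sketch of such an application, using Python's standard sqlite3 module. The schema, table names, and item names (inventory, searches, widget, etc.) are invented for illustration, not taken from any real system.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (item TEXT, price REAL, quantity INTEGER)")
    conn.execute("CREATE TABLE searches (term TEXT, found INTEGER)")
    conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                     [("widget", 9.99, 42), ("gadget", 19.99, 0)])

    def lookup(term):
        # Answer the customer's price/availability question ...
        row = conn.execute("SELECT price, quantity FROM inventory WHERE item = ?",
                           (term,)).fetchone()
        # ... and record the search, so unmet demand can be analyzed later.
        conn.execute("INSERT INTO searches VALUES (?, ?)", (term, int(row is not None)))
        return row

    print(lookup("widget"))    # (9.99, 42) -> in stock
    print(lookup("sprocket"))  # None -> a request you could not satisfy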

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example explain E-R Model in detail. (JUNE 2010)
(Or)
(b) (i) Explain E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain -- the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
• We distinguish among these types by drawing either a directed line (arrow), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number -- discriminator of the payment entity set.
• Primary key for payment -- (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.

• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

• A bottom-up design process -- combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute Inheritance -- a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
• Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization -- contributes to modularity in the design.
• The use of aggregation -- can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.
• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized; each such table includes all the attributes of that entity set plus the attributes of the higher-level entity set.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
(Or)
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions

• Time is considered an ordered sequence of points in some granularity.
• The term chronon is used instead of point to describe the minimum granularity.

• A calendar organizes time into different time units for convenience. Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
• The time when the information from a certain transaction becomes valid.
Bitemporal database
• Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region Data
• Objects have spatial extent with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial Join Queries
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on ⟨age, sal⟩, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries ⟨11, 80⟩, ⟨12, 10⟩, ⟨12, 20⟩, ⟨13, 75⟩.
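The effect can be seen directly in a small Python sketch using the four entries above:

    # The four entries as (age, sal) pairs.
    entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

    # A composite-key B+ tree stores them in this one-dimensional order:
    print(sorted(entries))   # [(11, 80), (12, 10), (12, 20), (13, 75)]

    # Entries close in 2-D space, such as (11, 80) and (13, 75), are
    # separated in the sort order by low-sal entries; a box query like
    # 11 <= age <= 13 AND 70 <= sal <= 85 must scan the whole age range
    # and discard the low-sal entries along the way.
    boxed = [(a, s) for (a, s) in sorted(entries) if 11 <= a <= 13 and 70 <= s <= 85]
    print(boxed)             # [(11, 80), (13, 75)]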

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries ⟨11, 80⟩, ⟨12, 10⟩, ⟨12, 20⟩, ⟨13, 75⟩.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.
Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = ⟨n-dimensional box, rid⟩
• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.
Non-leaf entry = ⟨n-dim box, ptr to child node⟩
• The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root):
• Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is non-leaf: for each entry ⟨E, ptr⟩, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry ⟨E, rid⟩, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  - Will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry ⟨B, ptr⟩
• Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary: searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
• Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to empirically test contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable.

  - Prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large.
  - Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data;
  - processing time-consuming decision-support queries;
  - providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):
  - Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
  - Hash partitioning: choose one or more attributes as the partitioning attributes.


    Choose a hash function h with range 0...n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
  - Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
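A sketch of the three partitioning functions in Python (Python's built-in hash merely stands in for whatever hash function the system actually uses):

    def round_robin(i, n):
        """Disk for the i-th tuple inserted (i counted from 0)."""
        return i % n

    def hash_partition(value, n):
        return hash(value) % n

    def range_partition(v, vector):
        """vector is the partitioning vector, e.g. [5, 11] for 3 disks."""
        for i, boundary in enumerate(vector):
            if v < boundary:
                return i
        return len(vector)

    for v in (2, 8, 20):
        print(v, "-> disk", range_partition(v, [5, 11]))
    # 2 -> disk 0, 8 -> disk 1, 20 -> disk 2   (matches the example above)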

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.
Round-robin
• Best suited for sequential scan of the entire relation on each query: all disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
• Range queries are difficult to process: no clustering, tuples are scattered across all disks.
Hash partitioning
• Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is well balanced.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so range queries are difficult to answer.
Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique based on histograms is used in practice.
Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume uniform distribution within each range of the histogram.
• A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
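A sketch of the sorted-scan construction in Python (the sample values are invented; a histogram-based variant would derive the same cut points from bucket counts instead of a full sort):

    def partition_vector(values, n):
        """Cut points splitting the sorted values into n roughly equal partitions."""
        s = sorted(values)
        step = len(s) // n
        return [s[(i + 1) * step] for i in range(n - 1)]

    vals = [1, 2, 2, 3, 5, 7, 8, 11, 13, 17, 19, 23]
    print(partition_vector(vals, 3))   # [5, 13] -> partitions of 4, 4, and 4 tuples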


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

• More complex protocols with fewer disk reads/writes exist.
• Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  - Interoperation parallelism: execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
• Create a range-partition vector with m entries on the sorting attributes.
• Redistribute the relation using range partitioning: all tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di. This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
• The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn-1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
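A sequential simulation of range-partitioning sort in Python (a sketch only; in a real system the partitions would be sorted by separate processors in parallel):

    def range_partition_sort(relation, vector):
        parts = [[] for _ in range(len(vector) + 1)]
        for v in relation:                      # redistribution step
            i = 0
            while i < len(vector) and v >= vector[i]:
                i += 1
            parts[i].append(v)
        local = [sorted(p) for p in parts]      # local sorts (in parallel)
        return [v for p in local for v in p]    # trivial final concatenation

    print(range_partition_sort([8, 1, 20, 5, 13, 2], [5, 11]))
    # [1, 2, 5, 8, 13, 20]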

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute the join r ⋈ s with condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
• Either range partitioning or hash partitioning can be used.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case: asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used.

  - E.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s to produce a partition of the final result of the hash-join.
• Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them and reading them back in.

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r:
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
• Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = r3 ⋈ r4,
  - and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence", like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
• Knowledge-based systems.
• Part of the Artificial Intelligence field.
• Computer programs that contain some subject-specific knowledge of one or more human experts.
• Made up of a set of rules that analyze user-supplied information about a specific class of problems.
• Systems that utilize reasoning capabilities and draw conclusions.
Knowledge Engineering -- building an expert system.
Knowledge Engineers -- the people who build the system.
Knowledge Representation -- the symbols used to represent the knowledge.
Factual Knowledge -- knowledge of a particular task domain that is widely shared.


Heuristic Knowledge -- more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

• A list of rules or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle (a minimal code sketch of this cycle appears after this list):
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface or other connection to the outside world through which input and output signals are received and sent.
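Here is the promised sketch of the match-resolve-act cycle, as a toy forward-chaining production system in Python. The two medical rules are invented for illustration, and the conflict-resolution strategy (fire the first not-yet-fired match) is deliberately trivial.

    # Each production: (name, LHS condition on working memory, RHS action).
    rules = [
        ("fever+rash->measles", lambda wm: {"fever", "rash"} <= wm,
         lambda wm: wm.add("measles")),
        ("measles->isolate",    lambda wm: "measles" in wm,
         lambda wm: wm.add("isolate")),
    ]

    wm = {"fever", "rash"}              # temporary working memory
    fired = set()
    while True:
        # Match: productions whose left-hand side is satisfied.
        conflict_set = [r for r in rules if r[0] not in fired and r[1](wm)]
        if not conflict_set:            # nothing satisfied: the interpreter halts
            break
        name, _, action = conflict_set[0]   # trivial conflict resolution
        action(wm)                      # act: may change working memory
        fired.add(name)

    print(wm)   # {'fever', 'rash', 'measles', 'isolate'}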

Components of a Rule-Based System
• Set of Rules -- derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge Engineer -- decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain.
• Interpreter -- interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward-chaining -- starts from a set of conditions and moves towards some conclusion.
• Backward-chaining -- starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.
Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the amount of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.
Disadvantages
• Lack the human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used.
KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

forward (bottom-up data-driven)Semantic nets amp Frames Inheritance amp advanced reasoningCase-based Reasoning Similarity basedKBS tools ndash Shells- Consist of KA Tool Database amp Development Interface- Inductive Shells

- simplest - example cases represented as matrix of known data(premises) and resulting effects - matrix converted into decision tree or IF-THEN statements- examples selected for the tool

Rule-based shells - simple to complex - IF-THEN rules

Hybrid shells - sophisticate amp powerful - support multiple KR paradigms amp reasoning schemes - generic tool applicable to a wide range

Special purpose shells - specifically designed for particular types of problems


  - restricted to specialised problems.
• Scratch (building from scratch):
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis -- To identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation -- To provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction -- To predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design -- To develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning -- Both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring -- To check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control -- To collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction -- To train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging -- To identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

• Increase availability of expert knowledge:
  - expertise not otherwise accessible;
  - training future experts.
• Efficient and cost-effective.
• Consistency of answers.
• Explanation of solution.
• Deal with uncertainty.
Limitations
• Lack of common sense.
• Inflexible; difficult to modify.
• Restricted domain of expertise.
• Lack of learning ability.
• Not always reliable.


6. (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)
DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES
Advantages of distributed databases:
• Mimic the organisational structure with data.
• Local access and autonomy without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database rather than a single server.
Disadvantages of distributed databases:
• More complex to implement.
• More costly to maintain.
• Security and integrity control standards and experience are lacking.
• Design issues are more complex.

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction's timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time | Object1 | Object2 | Object3


t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Foo"   | "Bar"     |
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
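A sketch of the version-visibility rule in Python, mirroring the tables above (the data structures are an assumption, not any particular system's implementation):

    from dataclasses import dataclass

    @dataclass
    class Version:
        value: object
        wts: int                 # write timestamp (creating transaction ID)
        deleted: bool = False

    db = {
        "Object1": [Version("Foo", wts=0), Version("Hello", wts=1)],
        "Object2": [Version("Bar", wts=0), Version(None, wts=2, deleted=True)],
        "Object3": [Version("Foo-Bar", wts=2)],
    }

    def read(obj, ts):
        """Newest version written at or before snapshot timestamp ts."""
        visible = [v for v in db.get(obj, []) if v.wts <= ts]
        if not visible:
            return None
        v = max(visible, key=lambda v: v.wts)
        return None if v.deleted else v.value

    print(read("Object2", ts=1))   # 'Bar'     (the delete at t2 is invisible)
    print(read("Object3", ts=1))   # None      (created later)
    print(read("Object2", ts=2))   # None      (deleted as of t2)
    print(read("Object3", ts=2))   # 'Foo-Bar'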

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems - which may involve data management, transaction management, and database recovery - have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).
  - The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications -
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching


and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
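To make the synchronization step concrete, here is a small, illustrative Python sketch (the names such as pull_updates and the in-memory structures are hypothetical, not from any product): the client presents the timestamp of its last sync, and the server returns the batch of updates made since then.

```python
# Hypothetical sketch of ISDB-style synchronization: the server keeps an
# update log; a mostly-disconnected client pulls a batch when it connects.

server_log = []  # list of (ts, key, value) updates, in commit order

def server_commit(ts, key, value):
    server_log.append((ts, key, value))

def pull_updates(last_sync_ts):
    """Batch of updates the client has not yet seen (unicast case)."""
    return [u for u in server_log if u[0] > last_sync_ts]

# Client side: apply the batch to the local database, then advance the cursor.
client_db, client_ts = {}, 0
server_commit(1, "price:widget", 10)
server_commit(2, "price:gadget", 25)

for ts, key, value in pull_updates(client_ts):
    client_db[key] = value
    client_ts = max(client_ts, ts)

print(client_db, client_ts)   # {'price:widget': 10, 'price:gadget': 25} 2
```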

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically only left-deep plans are considered.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: we want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
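As a rough illustration of why plan choice matters, this sketch (my own example, not from the System R paper) estimates I/O costs for joining Reserves and Sailors with two textbook nested-loops variants, using the page counts above.

```python
# Illustrative I/O cost estimates for joining Reserves and Sailors, using
# the catalog numbers above (1000 and 500 pages, 100 tuples per page).
M, pM = 1000, 100   # Reserves: pages, tuples per page
N = 500             # Sailors: pages

# Simple nested loops (tuple-at-a-time, Reserves as outer):
# scan Reserves once, then scan all of Sailors for every Reserves tuple.
simple_nl = M + (pM * M) * N

# Page-oriented nested loops: scan Sailors once per Reserves *page*.
page_nl = M + M * N

print(simple_nl)    # 50001000 page I/Os
print(page_nl)      # 501000 page I/Os -- two orders of magnitude cheaper
```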

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
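The space of left-deep trees can be enumerated directly; the short sketch below (an illustration, not System R's dynamic-programming algorithm; Boats is a hypothetical third relation) lists every left-deep join order for a set of relations.

```python
# Enumerate left-deep join orders: each plan is a permutation of the
# relations, joined one at a time with the new relation as the inner.
from itertools import permutations

def left_deep_orders(relations):
    for perm in permutations(relations):
        plan = perm[0]
        for inner in perm[1:]:
            plan = f"({plan} JOIN {inner})"
        yield plan

for plan in left_deep_orders(["Sailors", "Reserves", "Boats"]):
    print(plan)
# 3 relations give 3! = 6 left-deep orders,
# e.g. ((Sailors JOIN Reserves) JOIN Boats)
```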

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF: the most widely used formats for image data.
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
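These figures follow from the streams' bit rates. Assuming MPEG-1's nominal rate of about 1.5 Mbit/s (a widely quoted figure, not stated in the notes), a quick sanity check:

```python
# Sanity check on the MPEG-1 storage figure above.
rate_mbit_per_s = 1.5                   # nominal MPEG-1 system stream rate
seconds = 60
mb = rate_mbit_per_s * seconds / 8      # megabits -> megabytes
print(mb)                               # 11.25 MB, in line with the ~12.5 MB quoted
```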

Several alternatives for audio encoding:


- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - They must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - E.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15. Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92:
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
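This is exactly where recursion helps. As a sketch, the transitive Comp (components) relation from this example can be computed by iterating the rules to a fixpoint in Python (the Assembly tuples below are a made-up instance for illustration):

```python
# Naive evaluation of Comp(Part, Subpart): iterate the rules until no new
# tuples appear. The Assembly instance is illustrative.
assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

comp = set(assembly)                    # base rule: direct subparts
while True:
    derived = {(p, s2) for (p, s1) in comp
               for (s1b, s2) in assembly if s1 == s1b}
    if derived <= comp:                 # no new facts: least fixpoint reached
        break
    comp |= derived

print(("trike", "rim") in comp)        # True -- needs two joins in plain RA
```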

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
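Concretely, the fixpoint test can be sketched for finite sets (we cannot, of course, represent the set of all integers in a program):

```python
# double+ maps a set of integers S to {2x | x in S} union S.
def double_plus(s):
    return {2 * x for x in s} | s

def is_fixpoint(f, v):
    return f(v) == v

print(double_plus({1, 2, 5}))                 # {1, 2, 4, 5, 10}
print(is_fixpoint(double_plus, {0}))          # True: doubling 0 adds nothing
print(is_fixpoint(double_plus, {1, 2, 5}))    # False: 4 and 10 are new
```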

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog:
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2 (Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2 (Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
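The delta rewrite can be executed directly. Below is an illustrative Python sketch (reusing the toy Assembly instance from the earlier naive-evaluation sketch) in which each iteration joins Assembly only with the tuples produced in the previous iteration, so no inference is repeated:

```python
# Seminaive evaluation of Comp: each round joins Assembly with delta_comp,
# the tuples discovered in the previous round.
assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

comp = set(assembly)
delta_comp = set(assembly)              # round 0: the base facts are new
while delta_comp:
    new = {(p, s2)
           for (p, mid) in assembly     # Comp(P,S2) :- Assembly(P,M,Q), delta_Comp(M,S2)
           for (mid2, s2) in delta_comp if mid == mid2}
    delta_comp = new - comp             # keep only genuinely new tuples
    comp |= delta_comp

print(sorted(comp))
```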

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



(b) Write notes on the following: (i) Query processing (8) (JUNE 2010)


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency.
- Concurrency transparency
- Failure transparency

Example Transaction – SQL Version
Begin_transaction Reservation
begin


input(flight_no, date, customer_name);
EXEC SQL UPDATE FLIGHT
  SET STSOLD = STSOLD + 1
  WHERE FNO = flight_no AND DATE = date;
EXEC SQL INSERT
  INTO FC(FNO, DATE, CNAME, SPECIAL)
  VALUES (flight_no, date, customer_name, null);
output("reservation completed")
end Reservation

Properties of Transactions
ATOMICITY: all or nothing.
CONSISTENCY: no violation of integrity constraints.
ISOLATION: concurrent changes invisible, i.e., serializable.
DURABILITY: committed updates persist.
These are the ACID properties of a transaction.
Atomicity: either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency
Internal consistency:
- A transaction which executes alone against a consistent database leaves it in a consistent state.
- Transactions do not violate database integrity constraints.
Transactions are correct programs.
Isolation
Degree 0:
- Transaction T does not overwrite dirty data of other transactions.
- Dirty data refers to data values that have been updated by a transaction prior to its commitment.
Degree 2:
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.
Degree 3:
- T does not overwrite dirty data of other transactions.
- T does not commit any writes before EOT.
- T does not read dirty data from other transactions.


- Other transactions do not dirty any data read by T before T completes.
Isolation: Serializability
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- Necessary to avoid cascading aborts.
Durability: once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This is ensured by database recovery.

Transaction transparency: ensures all distributed transactions maintain the distributed database's integrity and consistency.

- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.

Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.

- Replication makes concurrency more complex.
Failure transparency: must ensure atomicity and durability of the global transaction.

- This means ensuring that the subtransactions of the global transaction either all commit or all abort.
Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a) Discuss the modeling and design approaches for Object Oriented Databases. (JUNE 2010)

Or
(b) Describe modeling and design approaches for object oriented databases. (16) (NOV/DEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important.
Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.
Object data modeling: an object consists of three parts: structure (attributes, and relationships to other objects such as aggregation and association), behavior (a set of operations), and characteristic of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables linked by keys to the main table. Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed-of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level. For example:

Generalization/specialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore, the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example: Publisher_object.insert("Rose", 123...) (i.e., a request to execute the insert method on a Publisher object). Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the value of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.
The main differences between relational database design and object oriented database design include:

- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys. In the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages. (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk—when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current state of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying; permits study of historical trends.
  - Shifts the decision-support query load away from transaction processing systems.
When and how to gather data:


- Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
- Destination-driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
  - It is usually OK to have slightly out-of-date data at the warehouse.
  - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use:
- Schema integration.
Data cleansing:
- E.g., correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates. Keep only one address record per household ("householding").
How to propagate updates:
- The warehouse schema may be a (materialized) view of the schema from the data sources.
- Efficient techniques for update of materialized views.
What data to summarize:
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.
Typically warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
More complicated schema structures:
- Snowflake schema: multiple levels of dimension tables.
- Constellation: multiple fact tables.

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required:
  - Pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.
Applications of Data Mining
Prediction based on past history:
- Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
- Predict if a customer is likely to switch brand loyalty.
- Predict if a customer is likely to respond to "junk mail".
- Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification:


- Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
- Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations:

- Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
- Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation:
- E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
- E.g., typhoid cases were clustered in an area surrounding a contaminated well.
- Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
In general:
- Different branches of the tree could grow to different levels.


- Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
- Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  - More on choosing the partitioning attribute/condition shortly.
- The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
- Otherwise, the data at the node is partitioned further by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
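A minimal runnable rendering of this procedure, assuming a simple majority-class purity measure and binary splits on numeric attributes (the helper names and the toy data are my own, not from the notes):

```python
# Sketch of greedy top-down partitioning. Each item is (features_dict, label).
from collections import Counter

def purity(items):
    """Fraction of items in the majority class."""
    counts = Counter(label for _, label in items)
    return max(counts.values()) / len(items)

def partition(items, min_purity=0.9, min_size=2, depth=0):
    indent = "  " * depth
    if purity(items) >= min_purity or len(items) < min_size:
        majority = Counter(l for _, l in items).most_common(1)[0][0]
        print(f"{indent}leaf -> {majority}")
        return
    # Evaluate binary splits "attr < threshold" on every attribute; pick the
    # split whose two groups have the highest weighted purity.
    best = None
    for attr in items[0][0]:
        for f, _ in items:
            t = f[attr]
            lo = [it for it in items if it[0][attr] < t]
            hi = [it for it in items if it[0][attr] >= t]
            if not lo or not hi:
                continue
            score = (len(lo) * purity(lo) + len(hi) * purity(hi)) / len(items)
            if best is None or score > best[0]:
                best = (score, attr, t, lo, hi)
    _, attr, t, lo, hi = best
    print(f"{indent}split on {attr} < {t}")
    partition(lo, min_purity, min_size, depth + 1)
    partition(hi, min_purity, min_size, depth + 1)

data = [({"income": 20}, "good"), ({"income": 30}, "good"),
        ({"income": 80}, "excellent"), ({"income": 90}, "excellent")]
partition(data)
```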

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) · p(d2 | cj) · ... · p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
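A small sketch of this estimation (illustrative code and data of my own; a real system would also smooth the histograms):

```python
# Naive Bayes from histograms: p(c|d) is proportional to p(c) * prod_i p(d_i|c).
from collections import Counter, defaultdict

train = [(("rainy", "windy"), "stay"), (("rainy", "calm"), "stay"),
         (("sunny", "calm"), "go"), (("sunny", "windy"), "go")]

class_counts = Counter(c for _, c in train)
hist = defaultdict(Counter)          # hist[(i, c)][value] = count of attr i
for attrs, c in train:
    for i, v in enumerate(attrs):
        hist[(i, c)][v] += 1

def classify(attrs):
    def score(c):
        p = class_counts[c] / len(train)                 # p(c)
        for i, v in enumerate(attrs):                    # product of p(d_i | c)
            p *= hist[(i, c)][v] / class_counts[c]
        return p
    return max(class_counts, key=score)

print(classify(("sunny", "windy")))   # 'go'
```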

- Histograms on multiple attributes are more expensive to compute and store.
Regression
Regression deals with the prediction of a value, rather than a class:

- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
- One way is to infer coefficients a0, a1, a2, ..., an such that
    Y = a0 + a1 X1 + a2 X2 + ... + an Xn
- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
- Regression aims to find coefficients that give the best possible fit.
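For the one-variable case, the least-squares coefficients have a closed form; a quick sketch with illustrative data:

```python
# Least-squares fit of Y = a0 + a1*X for one predictor variable.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]            # roughly Y = 2X

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print(a0, a1)                        # close to 0 and 2
```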

Association Rules
Retail shops are often interested in associations between different items that people buy:
- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


Left-hand side: antecedent; right-hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
   - From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A - {b}).
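The naive algorithm above, sketched in Python for a tiny basket set (illustrative data only; real miners use Apriori-style pruning instead of enumerating all subsets):

```python
# Naive association-rule mining: enumerate itemsets, keep the large ones,
# then emit rules A - {b} => b with their support and confidence.
from itertools import combinations

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
           {"bread", "butter"}, {"milk", "butter"}]
min_support = 0.5

def support(itemset):
    return sum(itemset <= b for b in baskets) / len(baskets)

items = sorted(set().union(*baskets))
large = [set(s) for k in range(1, len(items) + 1)
         for s in combinations(items, k) if support(set(s)) >= min_support]

for A in large:
    if len(A) < 2:
        continue
    for b in A:
        lhs = A - {b}
        conf = support(A) / support(lhs)
        print(sorted(lhs), "=>", b,
              f"support={support(A):.2f}", f"confidence={conf:.2f}")
```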

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting:
- E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
- We are interested in positive as well as negative correlations between sets of items:
  - Positive correlation: co-occurrence is higher than predicted.
  - Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations:
- E.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns:
- E.g., deviation from a steady growth.
- E.g., sales of winter wear go down in summer: not surprising, part of a known pattern; look for deviation from the value predicted using past patterns.
Clustering


Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
Can be formalized using distance metrics in several ways:
- E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  - Centroid: the point defined by taking the average of the coordinates in each dimension.
- Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm (more shortly).
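The centroid-based formulation above is what the classic k-means heuristic optimizes; a compact sketch with toy one-dimensional data and k = 2 (my own illustration, not the Birch algorithm):

```python
# Toy k-means: assign each point to its nearest centroid, recompute
# centroids as cluster means, repeat until assignments stabilize.
points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
centroids = [points[0], points[-1]]          # naive initialization

while True:
    clusters = [[], []]
    for p in points:
        i = min((abs(p - c), i) for i, c in enumerate(centroids))[1]
        clusters[i].append(p)
    new = [sum(c) / len(c) for c in clusters]
    if new == centroids:
        break
    centroids = new

print(centroids)    # roughly [1.0, 8.03]
```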

Hierarchical Clustering
- Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.


- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:
- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases


- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that Web-based inventory search requests can provide information not only to your customers but to you as well The queries tell you what your customers are looking for and the query results tell you whether or not youre able to satisfy their requests To the extent you dont have what they want youre probably losing business So it makes sense to record information about inventory searches what customers were looking for and whether or not you had it in stock Then you can use this information to adjust your inventory and provide better service to your customers

Another recent application for databases is to serve up banner advertisements on Web pages We dont like them any better than you do but the fact remains that they are a popular application for Web databases which can be used to store advertisements and retrieve them for display by a Web server

Three-Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
- a collection of entities,
- relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.

- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  - y is a dominant entity (in the example below, loan);
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number -- discriminator of the payment entity set.
- Primary key for payment -- (loan-number, payment-number).

Specialization
- A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
- A bottom-up design process -- combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance -- a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization -- contributes to modularity in the design.
- The use of aggregation -- can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant; the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
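A rough sketch of this reduction in Python with SQLite (the table and column names follow the banking example above; the non-key payment column is an illustrative assumption):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Strong entity sets reduce to tables with the same attributes.
cur.execute("""CREATE TABLE customer (
    customer_name TEXT, social_security TEXT PRIMARY KEY,
    customer_street TEXT, customer_city TEXT)""")
cur.execute("CREATE TABLE account (account_number TEXT PRIMARY KEY, balance REAL)")
cur.execute("CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL)")

# A many-to-many relationship set becomes a table with the primary keys
# of the participating entity sets plus descriptive attributes (access_date).
cur.execute("""CREATE TABLE depositor (
    social_security TEXT REFERENCES customer,
    account_number TEXT REFERENCES account,
    access_date TEXT,
    PRIMARY KEY (social_security, account_number))""")

# A weak entity set (payment) includes the primary key of its identifying
# strong entity set (loan) plus its discriminator (payment_number); no
# separate loan-payment table is needed.
cur.execute("""CREATE TABLE payment (
    loan_number TEXT REFERENCES loan,
    payment_number INTEGER,
    payment_amount REAL,  -- illustrative descriptive attribute
    PRIMARY KEY (loan_number, payment_number))""")
conn.commit()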

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized (higher-level) entity set account; also form a table for each lower-level entity set, including the primary key of the generalized entity set.
- Method 2: Form a table for each lower-level entity set that is generalized, including all inherited attributes; no table is needed for the higher-level entity set.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010) Or

(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
Time is considered an ordered sequence of points in some granularity.
- The term chronon is used instead of "point" to describe the minimum granularity.
A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events


- Single time point event, e.g., a bank deposit.
- A series of point events can form time series data.
Duration events
- Associated with a specific time period; a time period is represented by a start time and an end time.
Transaction time
- The time when the information from a certain transaction becomes valid.
Bitemporal database
- Databases dealing with two time dimensions.
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
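A minimal sketch of tuple versioning (SQLite; the emp_salary table and its columns are illustrative assumptions, not from the original text):

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Each tuple carries its valid-time interval; '9999-12-31' means "until changed".
cur.execute("""CREATE TABLE emp_salary (
    emp_id INTEGER, salary REAL, valid_start TEXT, valid_end TEXT)""")
cur.executemany("INSERT INTO emp_salary VALUES (?,?,?,?)", [
    (1, 30000, "2008-01-01", "2009-06-30"),
    (1, 35000, "2009-07-01", "9999-12-31"),
])
# Read the salary valid on a given date: the version whose interval contains it.
cur.execute("""SELECT salary FROM emp_salary
               WHERE emp_id = 1
                 AND valid_start <= '2009-01-15' AND '2009-01-15' <= valid_end""")
print(cur.fetchone())  # (30000.0,)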

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data
Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- "Find all cities within 50 miles of Madison."
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- "Find the 10 cities nearest to Madison."
- Results must be ordered by proximity.
Spatial Join Queries
- "Find all cities near a lake."
- Expensive; the join condition involves regions and proximity.
Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K.
Drawbacks
An index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dimensional box, ptr to child node>
- The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
- Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
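A minimal sketch of this search in Python (the node layout and the overlaps helper are illustrative assumptions, not a particular DBMS's implementation):

# A node is (is_leaf, entries); each entry is (box, child_or_rid).
# A box is ((lo, hi), ...) with one (lo, hi) interval per dimension.

def overlaps(a, b):
    # Two boxes overlap iff their intervals intersect in every dimension.
    return all(alo <= bhi and blo <= ahi
               for (alo, ahi), (blo, bhi) in zip(a, b))

def rtree_search(node, q):
    is_leaf, entries = node
    hits = []
    for box, payload in entries:
        if overlaps(box, q):
            if is_leaf:
                hits.append(payload)                    # candidate rid
            else:
                hits.extend(rtree_search(payload, q))   # descend into child
    return hits

leaf = (True, [(((0, 2), (0, 2)), "rid1"), (((5, 6), (5, 6)), "rid2")])
root = (False, [(((0, 6), (0, 6)), leaf)])
print(rtree_search(root, ((1, 5), (1, 5))))  # ['rid1', 'rid2']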

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
But why not use convex polygons to approximate query regions more accurately?
- Doing so will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- Goal: reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.
Nearest-neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for storing large volumes of data, for processing time-consuming decision-support queries, and for providing high throughput for transaction processing.
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
- Data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra):
- This makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
Horizontal partitioning -- the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes. Choose a hash function h with range 0 ... n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning:
Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
(A sketch of these three partitioning functions appears below.)
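A minimal sketch of the three partitioning functions (plain Python; disks are abstracted to integer indices, and a real system would use a fixed hash function rather than Python's built-in hash):

def round_robin_disk(i, n):
    # The i-th tuple inserted in the relation goes to disk i mod n.
    return i % n

def hash_disk(value, n):
    # Hash the partitioning attribute value into 0 .. n-1.
    return hash(value) % n

def range_disk(value, vector):
    # vector is the partitioning vector, e.g. [5, 11] for 3 disks:
    # v < 5 -> disk 0, 5 <= v < 11 -> disk 1, v >= 11 -> disk 2.
    for i, boundary in enumerate(vector):
        if value < boundary:
            return i
    return len(vector)

print(range_disk(2, [5, 11]), range_disk(8, [5, 11]), range_disk(20, [5, 11]))
# 0 1 2 -- matching the example above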

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively -- point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range -- range queries, e.g., 10 <= r.A < 25.
Round-robin
- Advantages:
  - Best suited for sequential scan of the entire relation on each query.
  - All disks have almost an equal number of tuples; retrieval work is thus well balanced between the disks.
- Disadvantages:
  - Range queries are difficult to process.
  - No clustering -- tuples are scattered across all disks.
Hash partitioning
- Good for sequential access:
  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.
Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed -- that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
(A sketch of the vector construction appears below.)
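A sketch of building a balanced partition vector by scanning in sorted order (illustrative Python; tuples are reduced to their partitioning-attribute values, and len(values) is assumed to be at least n):

def balanced_partition_vector(values, n):
    # values: the partitioning-attribute values of all tuples.
    # Returns n-1 boundaries so each of the n ranges holds ~1/n of the tuples.
    ordered = sorted(values)
    step = len(ordered) // n
    return [ordered[i * step] for i in range(1, n)]

vals = [1, 3, 4, 7, 9, 12, 15, 20, 22]
print(balanced_partition_vector(vals, 3))  # [7, 15]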


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained -- reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism -- parallelize the execution of each individual operation in the query.
- Interoperation parallelism -- execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n-1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the ith range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
(A single-process simulation of the range-partitioning sort appears below.)
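A minimal single-process simulation of the range-partitioning sort (illustrative Python; a real system ships tuples between processors instead of appending to lists):

from bisect import bisect_right

def parallel_range_sort(tuples, vector):
    # Redistribute: a tuple with key v goes to "processor" bisect_right(vector, v).
    partitions = [[] for _ in range(len(vector) + 1)]
    for v in tuples:
        partitions[bisect_right(vector, v)].append(v)
    # Each processor sorts its partition locally (data parallelism).
    for p in partitions:
        p.sort()
    # The final merge is trivial: concatenate the partitions in order.
    return [v for p in partitions for v in p]

print(parallel_range_sort([9, 2, 14, 5, 11, 1], [5, 11]))
# [1, 2, 5, 9, 11, 14]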

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and to compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.

Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
- E.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case -- asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
- E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
(A single-process simulation of the two-level hashing appears below.)
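A single-process simulation of the two-level hashing (illustrative Python; h1 routes tuples to processors, h2 builds the local hash table; the toy relations are assumptions):

N = 3  # number of "processors"

def h1(key):          # routes a tuple to a processor
    return key % N

def h2(key):          # used for the local hash-join at each processor
    return key % 7

def partitioned_hash_join(r, s):
    # Redistribute both relations on the join key using h1.
    r_parts = [[] for _ in range(N)]
    s_parts = [[] for _ in range(N)]
    for key, val in s:
        s_parts[h1(key)].append((key, val))
    for key, val in r:
        r_parts[h1(key)].append((key, val))
    out = []
    for i in range(N):
        # Build phase on the smaller relation s, bucketed by h2.
        table = {}
        for key, val in s_parts[i]:
            table.setdefault(h2(key), []).append((key, val))
        # Probe phase with the local partition of r.
        for key, val in r_parts[i]:
            for skey, sval in table.get(h2(key), []):
                if skey == key:
                    out.append((key, val, sval))
    return out

r = [(1, "r1"), (2, "r2"), (4, "r4")]
s = [(1, "s1"), (4, "s4")]
print(partitioned_hash_join(r, s))  # [(1, 'r1', 's1'), (4, 'r4', 's4')]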

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
- E.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge in order to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering -- building an expert system.
- Knowledge Engineers -- the people who build the system.
- Knowledge Representation -- the symbols used to represent the knowledge.
- Factual Knowledge -- knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge -- more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System
- Set of Rules -- derived from the knowledge base and used by the interpreter to evaluate the input data.
- Knowledge Engineer -- decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter -- interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining -- starts from a set of conditions and moves towards some conclusion.
- Backward-chaining -- starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures (a minimal forward-chaining sketch follows).
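An illustrative forward-chaining match-resolve-act loop in Python (the rules and facts are toy assumptions, not from a particular expert-system shell):

# Each rule: (conditions that must all be in working memory, fact to add).
rules = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "short_of_breath"}, "refer_to_doctor"),
]
working_memory = {"fever", "cough", "short_of_breath"}

while True:
    # Match: find rules whose conditions hold and whose action is new.
    conflict_set = [r for r in rules
                    if r[0] <= working_memory and r[1] not in working_memory]
    if not conflict_set:
        break                      # no satisfied productions: halt
    # Conflict resolution: trivially pick the first instantiation.
    conditions, action = conflict_set[0]
    # Act: the action changes working memory; then match again.
    working_memory.add(action)

print(working_memory)  # includes 'flu_suspected' and 'refer_to_doctor'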

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems
Definition:
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
- KR: Logic -- Inference: resolution principle.
- KR: Production rules -- Inference: backward (top-down, goal-directed) or forward (bottom-up, data-driven).
- KR: Semantic nets & frames -- Inference: inheritance & advanced reasoning.
- KR: Case-based reasoning -- Inference: similarity based.
KBS tools -- Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- Building from scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs:
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)
Typical tasks of KBS:
(1) Diagnosis -- to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation -- to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction -- to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design -- to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning -- both short term & long term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring -- to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control -- to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction -- to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging -- to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages
- Increased availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

- Mimics organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database rather than a single server.
However, distributed databases are more complex to implement, more costly to maintain, security and integrity control standards and experience are lacking, and design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database, such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.


MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object by maintaining several versions of the object. Each version has a write timestamp, and a transaction (Ti) reads the most recent version of an object which precedes the transaction's timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), then the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t2   | "Foo"   | "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "foo-bar", the database will look like:

Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Hello" | "Bar"     |

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is thus able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
(A toy sketch of such a versioned store follows.)
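A toy sketch of a multiversion store in Python (illustrative; a real system adds write locking, read-timestamp checks, and garbage collection of obsolete versions):

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (write_ts, value), ascending

    def write(self, key, value, ts):
        # A write never overwrites: it appends a new version at timestamp ts.
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, ts):
        # A reader at timestamp ts sees the latest version with write_ts <= ts,
        # so it never blocks on concurrent writers.
        visible = [v for wts, v in self.versions.get(key, []) if wts <= ts]
        return visible[-1] if visible else None

db = MVCCStore()
db.write("Object1", "Hello", ts=1)
db.write("Object3", "Foo-Bar", ts=2)   # concurrent update at t2
print(db.read("Object1", ts=1))  # 'Hello' -- snapshot at t1
print(db.read("Object3", ts=1))  # None   -- the t2 write is invisible at t1
print(db.read("Object3", ts=2))  # 'Foo-Bar'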

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems -- which may involve data management, transaction management and database recovery -- have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).


  - The changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications:
  - The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.


  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
- Client/Network Relationships:
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units may move freely throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


- A client may be unreachable because it is dozing -- in an energy-conserving state in which many subsystems are shut down -- or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case:
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication: a base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components: data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect -- through a process known in industry as synchronization of a client with a server -- they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast -- one-on-one communication between the server and the client -- or multicast -- one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans:
  - Must prune the search space; typically, left-deep plans only.
- Estimate the cost of each plan that is considered:
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: we want to find the best plan. Practically: we want to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
(A back-of-the-envelope cost comparison using these numbers follows.)

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
An example on the Sailors/Reserves schema is sketched below.
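The following nested query (a sketch; the bid value is illustrative) is parsed into two blocks – an outer block over Sailors and an inner block over Reserves – and each block is optimized separately, with the nested block conceptually re-evaluated once per outer tuple:

    SELECT S.sname
    FROM Sailors S
    WHERE S.rating > 5
      AND S.sid IN (SELECT R.sid
                    FROM Reserves R
                    WHERE R.bid = 103);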

8. (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must provide guaranteed steady retrieval rates for continuous-media data.
Multimedia Data Formats
Store and transmit multimedia data in compressed form:
• JPEG and GIF are the most widely used formats for image data.
• The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

Several alternatives for audio encoding:


• MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained:
  – e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  – They must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data:
  – PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  – E.g., identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  – E.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15. Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. A recursive SQL:1999 formulation is sketched below.
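Assuming an Assembly(Part, Subpart, Qty) relation, the query below computes every subpart of trike, however deep the hierarchy (a sketch in SQL:1999 syntax):

    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT A.Part, A.Subpart FROM Assembly A
      UNION
        SELECT C.Part, A.Subpart
        FROM Comp C, Assembly A
        WHERE C.Subpart = A.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike';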

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
    Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  – E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  – The set of all integers is a fixpoint of double+.
  – The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "If the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


• Stratum 0: all database tables.
• Stratum I: tables defined in terms of tables in Stratum I and lower strata.
• If T depends on not S, then S is in a lower stratum than T.
Relational Algebra and Stratified Datalog:
    Selection:      Result(Y) :- R(X, Y), X = c.
    Projection:     Result(Y) :- R(X, Y).
    Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
    Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
    Union:          Result(X, Y) :- R(X, Y).
                    Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
    WITH
      Big2(Part) AS
        (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
      Small2(Part) AS
        ((SELECT A2.Part FROM Assembly A2)
         EXCEPT
         (SELECT B1.Part FROM Big2 B1))
    SELECT * FROM Big2 B2;
15.3.3 Aggregate Operations
    SELECT A.Part, SUM(A.Qty)
    FROM Assembly A
    GROUP BY A.Part
The corresponding Datalog rule is:
    NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of <...> is the usual restriction used to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  – For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  – Rewrite the program to use the delta tables, and update the delta tables between iterations. For example:


    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
    SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
    Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
    Magic_SL(spoke).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add "Magic" filters: modify each rule in the program by adding a "Magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the "Magic" relations: we must create new rules to define the "Magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 8: Database Technology


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
• concurrency transparency
• failure transparency

Example Transaction – SQL Version
    Begin_transaction Reservation
    begin
        input(flight_no, date, customer_name);
        EXEC SQL UPDATE FLIGHT
            SET STSOLD = STSOLD + 1
            WHERE FNO = flight_no AND DATE = date;
        EXEC SQL INSERT
            INTO FC(FNO, DATE, CNAME, SPECIAL)
            VALUES (flight_no, date, customer_name, null);
        output("reservation completed")
    end Reservation
Properties of Transactions
ATOMICITY
• all or nothing
CONSISTENCY
• no violation of integrity constraints
ISOLATION
• concurrent changes invisible, i.e., serializable
DURABILITY
• committed updates persist
These are the ACID properties of transactions.
Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency:
Internal consistency:
• A transaction which executes alone against a consistent database leaves it in a consistent state.
• Transactions do not violate database integrity constraints.
Transactions are correct programs.
Isolation:
Degree 0:
• Transaction T does not overwrite dirty data of other transactions.
• Dirty data refers to data values that have been updated by a transaction prior to its commitment.
Degree 2:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.
Degree 3:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.
• Other transactions do not dirty any data read by T before T completes.
Isolation – Serializability:
• If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
• An incomplete transaction cannot reveal its results to other transactions before its commitment.
• This is necessary to avoid cascading aborts.
Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures.
Database recovery

Transaction transparency: ensures that all distributed transactions maintain the distributed database's integrity and consistency.
• A distributed transaction accesses data stored at more than one location.
• Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
• The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.
Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
• Replication makes concurrency more complex.
Failure transparency: the DDBMS must ensure atomicity and durability of the global transaction.
• This means ensuring that subtransactions of the global transaction either all commit or all abort.
Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
• Remote request
• Remote unit of work
• Distributed unit of work
• Distributed request

2. (a) Discuss the modeling and design approaches for Object Oriented Databases (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented databases (16)
(NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.
Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects such as aggregation and association), behavior (a set of operations), and characteristic of types (generalization/specialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example the attributes publishedBy and writtenBy have complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables linked by keys to the main table.
Relationships: publishedBy and writtenBy are associations with 1:N and 1:1 relationships; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes, through complex types, and at the behavioral level.
Generalization/Specialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...) is a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.
The main differences between relational database design and object oriented database design include:
• Many-to-many relationships must be removed before entities can be translated into relations. Many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model. Operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys. In the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.
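A hedged, SQL:1999-style object-relational sketch of the Book/Publisher example (the type and table names are illustrative, and exact syntax varies by DBMS):

    CREATE TYPE Publisher_t AS (
        name    VARCHAR(40),
        address VARCHAR(80)
    ) NOT FINAL;

    CREATE TYPE Book_t AS (
        title       VARCHAR(60),
        publishedBy REF(Publisher_t)   -- association realized as a reference attribute
    ) NOT FINAL;

    CREATE TABLE Publishers OF Publisher_t;
    CREATE TABLE Books OF Book_t;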

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC) is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but generally requires the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:
    Time   Object1   Object2
    t1     "Hello"   "Bar"
    t0     "Foo"     "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
    Time   Object1   Object2     Object3
    t2     "Hello"   (deleted)   "Foo-Bar"
    t1     "Hello"   "Bar"
    t0     "Foo"     "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
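A hedged sketch of this behavior in PostgreSQL-style SQL (assuming an objects(name, value) table; REPEATABLE READ gives the transaction a snapshot):

    -- Session A: start a snapshot-based read transaction
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT value FROM objects WHERE name = 'Object1';   -- sees 'Hello'

    -- Session B: a concurrent writer creates a new version and commits,
    -- without waiting for Session A
    BEGIN;
    UPDATE objects SET value = 'Foo-Bar' WHERE name = 'Object1';
    COMMIT;

    -- Session A: still reads from its original snapshot; no locks were needed
    SELECT value FROM objects WHERE name = 'Object1';   -- still sees 'Hello'
    COMMIT;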

Recovery


3. (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining (16) (NOVDEC 2010)

Data Warehouse:
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  – Greatly simplifies querying; permits study of historical trends.
  – Shifts the decision support query load away from transaction processing systems.
When and how to gather data:


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
  – Usually OK to have slightly out-of-date data at the warehouse.
  – Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use:
• Schema integration
Data cleansing:
• E.g., correct mistakes in addresses (misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates.
  – Keep only one address record per household ("householding").
How to propagate updates:
• The warehouse schema may be a (materialized) view of the schema from data sources.
• Efficient techniques exist for update of materialized views.
What data to summarize:
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.
Typically warehouse data is multidimensional, with very large fact tables:
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


Dimension values are usually encoded using small integers and mapped to full values via dimension tables. The resultant schema is called a star schema, with a central fact table referencing the dimension tables (a sketch follows below). More complicated schema structures:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.
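A minimal, hedged star schema sketch in SQL (table and column names are illustrative):

    CREATE TABLE item_dim  (item_id  INT PRIMARY KEY, item_name VARCHAR(40), category VARCHAR(20));
    CREATE TABLE store_dim (store_id INT PRIMARY KEY, city VARCHAR(30));
    CREATE TABLE date_dim  (date_id  INT PRIMARY KEY, full_date DATE, month INT, year INT);

    -- Central fact table: dimension keys plus measures
    CREATE TABLE sales_fact (
        item_id     INT REFERENCES item_dim,
        store_id    INT REFERENCES store_dim,
        date_id     INT REFERENCES date_dim,
        number_sold INT,
        price       DECIMAL(8,2)
    );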

Data Mining:
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required:
  – Pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.
Applications of Data Mining:
Prediction based on past history:
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification:


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns:
Associations:
• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation:
  – E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.
Classification Rules:
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:
  ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
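Such rules translate naturally into SQL; as a hedged sketch (assuming a person table with name, degree, and income columns):

    SELECT P.name,
           CASE
             WHEN P.degree = 'masters'   AND P.income > 75000 THEN 'excellent'
             WHEN P.degree = 'bachelors' AND P.income BETWEEN 25000 AND 75000 THEN 'good'
             ELSE 'unclassified'
           END AS credit
    FROM person P;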

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree:
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  – Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  – Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
In general:
• Different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  – More on choosing the partitioning attribute/condition shortly.
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either:
  – all (or most) of the items at the node belong to the same class, or
  – all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
• Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.
Decision-Tree Construction Algorithm:
    Procedure GrowTree(S)
        Partition(S)

    Procedure Partition(S)
        if (purity(S) > δp or |S| < δs) then return
        for each attribute A
            evaluate splits on attribute A
        Use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
        for i = 1, 2, ..., r
            Partition(Si)

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj),
• precomputation of p(cj);
• p(d) can be ignored, since it is the same for all classes.

the histogram is computed from the training instances Histograms on multiple attributes are more expensive to compute and storeRegressionRegression deals with the prediction of a value rather than a class

• Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, a2, ..., an such that
      Y = a0 + a1 X1 + a2 X2 + ... + an Xn
• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
• Regression aims to find coefficients that give the best possible fit.
Association Rules
Retail shops are often interested in associations between different items that people buy:

• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Associations information can be used in several ways; e.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
• bread ⇒ milk
• DB-Concepts, OS-Concepts ⇒ Networks


Left-hand side: antecedent; right-hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is then low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   – Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
   – From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A); confidence of rule = support(A) / support(A - {b}).
A SQL sketch of computing these two measures is given below.
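A hedged SQL sketch of support and confidence for a single rule, bread ⇒ milk, assuming a Purchases(transid, item) table:

    -- COUNT(b.transid) counts transactions containing bread;
    -- COUNT(m.transid) counts only matched rows, i.e., transactions
    -- containing both bread and milk.
    SELECT COUNT(m.transid) * 1.0 /
               (SELECT COUNT(DISTINCT transid) FROM Purchases) AS support,
           COUNT(m.transid) * 1.0 / COUNT(b.transid)           AS confidence
    FROM (SELECT DISTINCT transid FROM Purchases WHERE item = 'bread') b
    LEFT JOIN (SELECT DISTINCT transid FROM Purchases WHERE item = 'milk') m
           ON m.transid = b.transid;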

Other Types of Associations
Basic association rules have several limitations.
Deviations from the expected probability are more interesting:
• E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2).
• We are interested in positive as well as negative correlations between sets of items:
  – Positive correlation: co-occurrence is higher than predicted.
  – Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations:
• E.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns:
• E.g., deviation from a steady growth.
• E.g., sales of winter wear go down in summer.
  – Not surprising; part of a known pattern.
• Look for deviations from values predicted using past patterns.
Clustering


Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
Can be formalized using distance metrics in several ways:
• E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  – Centroid: the point defined by taking the average of the coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the BIRCH clustering algorithm (more shortly).
Hierarchical Clustering:
• Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
  – the limited and intermittent connectivity afforded by wireless communications,
  – the limited life of the power supply (battery),
  – the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.


• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  – Fixed hosts are general purpose computers configured to manage mobile units.
  – Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications:
• The wireless medium has bandwidth significantly lower than that of a wired network.
• The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.
Client/Network Relationships:
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  – To manage mobility, the entire mobility domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  – Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  – Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  – Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
  – Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture.
  – For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  – The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  – One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  – Servers must keep track of client locations in order to efficiently route messages to them.
  – Client data should be stored in the network location that minimizes the traffic necessary to access it.
  – The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
  – Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects.
• Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties.
• Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
    customer = (customer-name, social-security, customer-street, customer-city)
    account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities.
Example: Hayes (customer entity) – depositor (relationship set) – A-102 (account entity).
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
• Example: (Hayes, A-102) ∈ depositor
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one


  – Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.
One-To-One Relationship
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.
One-To-Many and Many-To-One Relationship
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.
Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.
Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  – y is a dominant entity (in the example below, loan).
  – x is a subordinate entity (in the example below, payment).
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.
Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
• payment-number – the discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).
Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute Inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined
  – user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint
  – overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – Total
  – Partial
Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – Treat the relationship as an abstract entity.
  – Allows relationships between relationships.
  – Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – A customer takes out a loan.
  – An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: an aggregate entity set can be treated as a single unit, without concern for the details of its internal structure.
Reduction of an E-R schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational-database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that also includes columns for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, plus any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
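To make the mapping concrete, here is a minimal Python sketch (illustrative only; the entity descriptions and the to_table helper are assumptions, not part of the notes) that derives table definitions for the banking example:

# Deriving table definitions from a toy E-R description.
entity_sets = {
    "loan":    {"attrs": ["loan_number", "amount"], "pk": ["loan_number"]},
    # Weak entity set: its table also carries the identifying strong set's key.
    "payment": {"attrs": ["payment_number", "payment_date", "payment_amount"],
                "pk": ["payment_number"], "identifying": "loan"},
}

def to_table(name, spec):
    cols = list(spec["attrs"])
    pk = list(spec["pk"])
    strong = spec.get("identifying")
    if strong:                        # weak entity set: prepend the strong key
        strong_pk = entity_sets[strong]["pk"]
        cols = strong_pk + cols
        pk = strong_pk + pk           # primary key = strong key + discriminator
    return f"CREATE TABLE {name} ({', '.join(cols)}, PRIMARY KEY ({', '.join(pk)}))"

for name, spec in entity_sets.items():
    print(to_table(name, spec))
# payment's table gets key (loan_number, payment_number), as described above.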

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized (higher-level) entity set, e.g., account. Form a table for each lower-level entity set, including the primary key of the generalized entity set plus its local attributes.
• Method 2: Form a table for each lower-level entity set with all local and inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Databases
Time Representation, Calendars, and Time Dimensions

• Time is considered an ordered sequence of points at some granularity.
  - The term chronon is used instead of "point" to describe the minimum granularity.
• A calendar organizes time into different time units for convenience, and various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point events


• A single time-point event, e.g., a bank deposit.
• A series of point events can form time-series data.

Duration events
• Associated with a specific time period; a time period is represented by a start time and an end time.

Valid time
• The time period during which the information is true in the modeled reality.

Transaction time
• The time when the information was stored in the database.

Bitemporal database
• A database dealing with both time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
• Add to every tuple:
  - a valid start time,
  - a valid end time.
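A minimal sketch of tuple versioning, assuming a toy salary history and hypothetical helper names (not from the notes):

NOW = float("inf")          # conventional "until changed" end time

salary_history = [          # (emp_id, salary, valid_start, valid_end)
    ("E1", 30000, 2005, NOW),
]

def update_salary(history, emp_id, new_salary, at_time):
    for i, (eid, sal, start, end) in enumerate(history):
        if eid == emp_id and end == NOW:             # current version
            history[i] = (eid, sal, start, at_time)  # close it, don't overwrite
            break
    history.append((emp_id, new_salary, at_time, NOW))

update_salary(salary_history, "E1", 35000, 2008)
# "Salary as of 2006": scan for the version whose valid period covers 2006.
print([r for r in salary_history if r[2] <= 2006 < r[3]])
# [('E1', 30000, 2005, 2008)]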

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.

Time-varying attribute
• An attribute that changes over time, e.g., age.

Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.

Spatial Databases
Types of Spatial Data

Point data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.

Nearest-neighbor queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.

Spatial join queries
• Find all cities near a lake.
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
• Store spatial objects, such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.

Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>: they are stored in that order, so entries that are close in sal alone need not be stored near one another.
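A quick illustration of this linearization, using the four entries above (the code and variable names are mine, for illustration only):

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]
entries.sort()                      # B+ tree key order: by age, then sal
print(entries)                      # [(11, 80), (12, 10), (12, 20), (13, 75)]

# "sal >= 75" matches the 1st and 4th entries -- not adjacent in key order,
# so the one-dimensional index cannot answer this range query efficiently.
print([e for e in entries if e[1] >= 75])   # [(11, 80), (13, 75)]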

Multidimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the same entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.

Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks: an index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• The index must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes).
The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dimensional box, pointer to child node>
  - The box covers all boxes in the child node (in fact, in the whole subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  - We can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
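A minimal sketch of this overlap search, assuming 2-D boxes stored as (xlo, ylo, xhi, yhi) tuples and a simple Node class (all names are illustrative):

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf        # leaf entries: (box, rid)
        self.entries = entries        # non-leaf entries: (box, child Node)

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, results):
    for box, item in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                results.append(item)          # rid of a candidate object
            else:
                search(item, q, results)      # descend into the child subtree
    return results

leaf = Node(True, [((0, 0, 2, 2), "rid1"), ((5, 5, 6, 6), "rid2")])
root = Node(False, [((0, 0, 6, 6), leaf)])
print(search(root, (1, 1, 3, 3), []))          # ['rid1']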

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently; generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.
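A small sketch of the least-enlargement descent used during insertion (illustrative only; a complete implementation would also perform the node split just described):

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def enlarge(b, new):
    # Bounding box of b extended to also cover new.
    return (min(b[0], new[0]), min(b[1], new[1]),
            max(b[2], new[2]), max(b[3], new[3]))

def choose_child(entries, b):
    # Pick the child box needing least enlargement to cover b;
    # break ties by smallest area, as described above.
    def cost(entry):
        box, _child = entry
        return (area(enlarge(box, b)) - area(box), area(box))
    return min(entries, key=cost)

children = [((0, 0, 4, 4), "child-A"), ((3, 3, 9, 9), "child-B")]
print(choose_child(children, (1, 1, 2, 2)))    # child-A: zero enlargement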

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes: when a node overflows, instead of splitting,
  - remove some (say, 30% of the) entries and reinsert them into the tree;
  - this could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search, and even concurrency-control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data
• Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
• Nearest-neighbor queries are common.
  - The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of the distances to the nearest and farthest points) usually decreases, and "nearest neighbor" is no longer meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: the tuples of a relation are divided among many disks such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

• Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
• Hash partitioning: choose one or more attributes as the partitioning attributes;


choose a hash function h with range 0…n-1. Let i denote the result of h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
• Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, …, vn-2]. Let v be the partitioning-attribute value of a tuple: tuples with vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
  - E.g., with the partitioning vector [5, 11], a tuple with partitioning-attribute value 2 goes to disk 0, a tuple with value 8 goes to disk 1, and a tuple with value 20 goes to disk 2.
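A compact sketch of the three partitioning techniques (illustrative code; n is the number of disks):

n = 3

def round_robin(i, _tuple):         # i = insertion order of the tuple
    return i % n

def hash_partition(t, attr=0):
    return hash(t[attr]) % n

vector = [5, 11]                    # range-partitioning vector from the text

def range_partition(t, attr=0):
    v = t[attr]
    for disk, boundary in enumerate(vector):
        if v < boundary:
            return disk
    return len(vector)              # v >= last boundary -> last disk

for value in (2, 8, 20):
    print(value, "-> disk", range_partition((value,)))
# 2 -> disk 0, 8 -> disk 1, 20 -> disk 2, matching the example above.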

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin
• Advantages:
  - Best suited for a sequential scan of the entire relation on each query.
  - All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
• Range queries are difficult to process: no clustering, so tuples are scattered across all disks.

Hash partitioning

• Good for sequential access:
  - Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, so retrieval work is well balanced between disks.
• Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
• The distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer.
• Types of skew:
  - Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
  - Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume a uniform distribution within each range of the histogram (see the sketch below).
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
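A minimal sketch of building a balanced partition vector by cutting sorted values into equal-count pieces (names and data are illustrative):

def partition_vector(sorted_values, n):
    size = len(sorted_values)
    # one boundary after every 1/n-th of the relation; n-1 boundaries for n parts
    return [sorted_values[(i * size) // n] for i in range(1, n)]

values = sorted([2, 3, 5, 7, 8, 9, 11, 13, 17, 19, 23, 29])
print(partition_vector(values, 3))   # [8, 17] -> three ranges of 4 tuples each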


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:
• Before reading/writing a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  - Interoperation parallelism: execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
• Choose processors P0, …, Pm, where m <= n - 1, to do the sorting.
• Create a range-partition vector with m entries on the sorting attributes.
• Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di.
  - This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, …, Dn-1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows (sketched below):
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, …, Pm-1 are concatenated to get the final result.
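A toy, sequential simulation of this parallel external sort-merge (heapq.merge stands in for each processor's merge of incoming sorted streams; all data are illustrative):

import heapq

partitions = [[9, 2, 7], [4, 11, 1], [8, 3, 10]]   # data on disks D0..D2
sorted_runs = [sorted(p) for p in partitions]      # local sort at each Pi

vector = [5]                                       # range vector: 2 receivers

def route(run, dest_count):
    # Range-partition one sorted run into per-processor sorted streams.
    streams = [[] for _ in range(dest_count)]
    for v in run:
        d = 0
        while d < len(vector) and v >= vector[d]:
            d += 1
        streams[d].append(v)
    return streams

streams_per_proc = [[] for _ in range(2)]
for run in sorted_runs:
    for d, s in enumerate(route(run, 2)):
        streams_per_proc[d].append(s)

# Each receiving processor merges its incoming sorted streams; concatenating
# the processors' runs gives the final sorted result.
result = []
for s in streams_per_proc:
    result.extend(heapq.merge(*s))
print(result)    # [1, 2, 3, 4, 7, 8, 9, 10, 11]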

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute their join r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn-1 and s0, s1, …, sn-1.
• Either range partitioning or hash partitioning can be used; r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case: asymmetric fragment-and-replicate.
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• It usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

  - E.g., say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, so s is chosen as the build relation.
• A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor, based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the m processors using the hash function h1.
• Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
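A sequential sketch of the partitioned parallel hash join described above, with h1 distributing tuples to simulated processors and a Python dict playing the role of the local build phase (names and data are illustrative):

n = 2
h1 = lambda key: hash(key) % n

r = [(1, "r-a"), (2, "r-b"), (3, "r-c")]       # (join key, payload)
s = [(1, "s-x"), (3, "s-y")]                   # smaller: the build relation

r_parts = [[] for _ in range(n)]
s_parts = [[] for _ in range(n)]
for t in r:
    r_parts[h1(t[0])].append(t)
for t in s:
    s_parts[h1(t[0])].append(t)

result = []
for i in range(n):                             # each "processor" Pi
    build = {}                                 # local build phase on s_i
    for key, payload in s_parts[i]:
        build.setdefault(key, []).append(payload)
    for key, payload in r_parts[i]:            # local probe phase on r_i
        for match in build.get(key, []):
            result.append((key, payload, match))
print(sorted(result))   # [(1, 'r-a', 's-x'), (3, 'r-c', 's-y')]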

Parallel Nested-Loop Join
• Assume that relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on the join attribute of relation r at each of the partitions of relation r.
• Use asymmetric fragment-and-replicate, with relation s being replicated, and use the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism

• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = r3 ⋈ r4,
  - and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
  - We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial-intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
• knowledge-based systems;
• part of the artificial-intelligence field;
• computer programs that contain some subject-specific knowledge of one or more human experts;
• made up of a set of rules that analyze user-supplied information about a specific class of problems;
• systems that utilize reasoning capabilities and draw conclusions.

Definitions:
• Knowledge engineering: building an expert system.
• Knowledge engineers: the people who build the system.
• Knowledge representation: the symbols used to represent the knowledge.
• Factual knowledge: knowledge of a particular task domain that is widely shared.


• Heuristic knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural-language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle (see the sketch after this list):
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
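A tiny forward-chaining production system sketching the match-resolve-act cycle described above (the rules and facts are illustrative):

rules = [
    # (name, condition: facts that must all hold, action: fact to assert)
    ("r1", {"has_fever", "has_rash"}, "suspect_measles"),
    ("r2", {"suspect_measles"}, "order_blood_test"),
]

working_memory = {"has_fever", "has_rash"}

while True:
    # Match: instantiations whose conditions hold and whose action is new.
    conflict_set = [r for r in rules
                    if r[1] <= working_memory and r[2] not in working_memory]
    if not conflict_set:          # no satisfied productions: interpreter halts
        break
    # Conflict resolution: trivially pick the first instantiation.
    name, _cond, action = conflict_set[0]
    # Act: execute the action (assert a fact), then return to the Match phase.
    working_memory.add(action)
    print(f"fired {name}, asserted {action}")

print(sorted(working_memory))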

Components of a Rule-Based System
• Set of rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward chaining: starts from a set of conditions and moves towards some conclusion.
• Backward chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human experts' intelligence.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Cannot give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-based systems: definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS; general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine.

KBS Architecture


• The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
• It is reasonable to produce a richer, more elaborate description of the typical expert system, one which still includes the components found in almost any real-world system.

Knowledge representation formalisms & inference:
• Logic: resolution principle.
• Production rules: backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
• Semantic nets & frames: inheritance & advanced reasoning.
• Case-based reasoning: similarity-based.

KBS tools (shells): consist of a knowledge-acquisition tool, a database, and a development interface.
• Inductive shells:

  - The simplest: example cases are represented as a matrix of known data (premises) and resulting effects; the matrix is converted into a decision tree or IF-THEN statements; examples are selected for the tool.
• Rule-based shells: simple to complex; IF-THEN rules.
• Hybrid shells: sophisticated and powerful; support multiple KR paradigms and reasoning schemes; generic tools applicable to a wide range of problems.
• Special-purpose shells: specifically designed for particular types of problems;


  - restricted to specialized problems.
• Building from scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical tasks of a KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short-term and long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance and flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions,

e.g., identify errors in an automated teller machine network and ways to correct them.

Advantages:
• Increase the availability of expert knowledge (expertise otherwise not accessible; training of future experts).
• Efficient and cost-effective.
• Consistency of answers.
• Explanation of the solution.
• Can deal with uncertainty.

Limitations:
• Lack of common sense.
• Inflexible; difficult to modify.
• Restricted domain of expertise.
• Lack of learning ability.
• Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
• Mimic the organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional databases:
• More complex to implement and more costly to maintain.
• Security and integrity control are harder; standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency-control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and they read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
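A minimal sketch of the read/write timestamp rules just described (the data structures and names are assumptions, not a real DBMS API):

versions = {}   # obj -> list of versions: {"value", "wts", "rts"}

def read(obj, ts):
    # Read the most recent version whose write timestamp precedes ts.
    candidates = [v for v in versions[obj] if v["wts"] <= ts]
    v = max(candidates, key=lambda v: v["wts"])
    v["rts"] = max(v["rts"], ts)          # remember the latest reader
    return v["value"]

def write(obj, value, ts):
    vs = versions.setdefault(obj, [])
    if vs:
        latest = max(vs, key=lambda v: v["wts"])
        if ts < latest["rts"]:            # TS(Ti) < RTS(P): abort and restart
            raise Exception("abort: a later transaction already read P")
    vs.append({"value": value, "wts": ts, "rts": ts})

write("P", "Hello", ts=1)
print(read("P", ts=2))                    # Hello (sets RTS(P) = 2)
write("P", "Foo-Bar", ts=3)               # ok: 3 >= RTS(P)
print(read("P", ts=2))                    # still Hello: a snapshot at ts 2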

At t1, the state of the database could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions"; it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3="Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      -
t0    "Foo"    "Bar"      -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
• In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

• The general architecture of a mobile platform is illustrated in Fig. 30.1.
• It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network.
• The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony), to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the mobility of units, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units must be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture just described is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle-information sharing.

Characteristics of Mobile Environments
• The characteristics of mobile computing include:
  - communication latency;
  - intermittent connectivity;
  - limited battery life;
  - changing client location.
• The server may not be able to reach a client:


  - A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down), or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture to compensate.
  - Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communication increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data-management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
• From a data-management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, plus additional functionality for locating mobile units, and additional query- and transaction-management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components. Data-management responsibility is shared among base stations or fixed hosts and mobile units.
• Data-management issues as applied to mobile databases:
  - data distribution and replication;
  - transaction models;
  - query processing;
  - recovery and fault tolerance;


  - mobile database design;
  - location-based services;
  - division of labor;
  - security.

Application: Intermittently Synchronized Databases
• Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
• Query optimization is an important task in a relational DBMS.
• One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• There are two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
  - Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
• Plan: a tree of relational-algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a "pull" interface: when an operator is "pulled" for the next output tuples, it "pulls" on its inputs and computes them.
• Two main issues:
  - For a given query, what plans are considered? An algorithm searches the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
• Ideally: find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
• For each block, the plans considered are:
  - all available access methods for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia databases
• To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
• The database must handle large-object representation.
• Similarity-based retrieval must be provided by special index structures.
• The database must provide guaranteed, steady retrieval rates for continuous-media data.

Multimedia Data Formats
• Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression:
    - MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
    - MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

• Several alternatives for audio encoding exist: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
• The most important types are video and audio data.
• Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: a video of a person speaking must show the lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs when registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
• SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
• Can we extend the query language to cover such queries? Yes, by adding recursion.
• Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
• SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer."
• Datalog is a query language that has the same if-then flavor.
• New: the answer table can appear in the FROM clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
• Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
• For any relational-algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
• The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational-algebra semantics.
• The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive-query evaluation in a DBMS.
• The fixpoint semantics is thus operational, and plays a role analogous to that of relational-algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers): e.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata.


• Stratum 0: all database tables.
• Stratum I: tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The <…> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of <…> is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
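A sketch of seminaive evaluation for the rewritten Comp rule, in Python: the delta table holds only the facts derived in the previous round, so each iteration joins Assembly against delta_Comp rather than against all of Comp (toy data; part names invented):

assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("spoke", "nipple", 1)}

comp = {(p, s) for (p, s, q) in assembly}    # base facts
delta = set(comp)                            # delta_Comp: last round's new facts
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    derived = {(p, s) for (p, p2, q) in assembly
                      for (d, s) in delta if d == p2}
    delta = derived - comp                   # keep only genuinely new facts
    comp |= delta
print(sorted(comp))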

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
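The effect of the rewriting can be mimicked directly in Python: first compute the magic (relevant-values) filter from the seed spoke, then run the fixpoint only for SameLev tuples whose first column passes the filter. The Assembly data here is invented for illustration:

assembly = {("wheel", "spoke", 2), ("frame", "seat", 1), ("frame", "pedal", 1),
            ("trike", "wheel", 3), ("trike", "frame", 1)}

# Magic_SL(spoke).  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
magic = {"spoke"}
while True:
    new = {p for (p, s, q) in assembly if s in magic}
    if new <= magic:
        break
    magic |= new

# SameLev rules, each filtered by Magic_SL on the first column.
same_lev = {(s1, s2) for (p1, s1, q1) in assembly if s1 in magic
                     for (p2, s2, q2) in assembly if p2 == p1}
while True:
    new = {(s1, s2) for (p1, s1, q1) in assembly if s1 in magic
                    for (u, v) in same_lev if u == p1
                    for (p2, s2, q2) in assembly if p2 == v}
    if new <= same_lev:
        break
    same_lev |= new
print(sorted(magic), sorted(same_lev))   # only tuples relevant to spoke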



(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency.
• concurrency transparency
• failure transparency

Example Transaction – SQL Version
Begin_transaction Reservation
begin
    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC (FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end. {Reservation}

Properties of Transactions
ATOMICITY: all or nothing.
CONSISTENCY: no violation of integrity constraints.
ISOLATION: concurrent changes invisible, i.e., serializable.
DURABILITY: committed updates persist.
These are the ACID properties of a transaction.
Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency – internal consistency:
• A transaction which executes alone against a consistent database leaves it in a consistent state.
• Transactions do not violate database integrity constraints.
• Transactions are correct programs.
Isolation – degrees of isolation:
Degree 0:
• Transaction T does not overwrite dirty data of other transactions. (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
Degree 1:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT (end of transaction).
Degree 2:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.
Degree 3:
• T does not overwrite dirty data of other transactions.
• T does not commit any writes before EOT.
• T does not read dirty data from other transactions.
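The all-or-nothing behavior can be demonstrated with any transactional engine. This minimal sketch uses Python's built-in sqlite3 module; the table and values are invented stand-ins for the FLIGHT example above:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flight (fno TEXT, date TEXT, stsold INTEGER)")
con.execute("INSERT INTO flight VALUES ('F100', '2010-06-01', 0)")

try:
    with con:  # opens a transaction; commits on success, rolls back on error
        con.execute("UPDATE flight SET stsold = stsold + 1 "
                    "WHERE fno = 'F100' AND date = '2010-06-01'")
        raise RuntimeError("simulated failure before commit")
except RuntimeError:
    pass

# The partial update was rolled back: stsold is still 0.
print(con.execute("SELECT stsold FROM flight").fetchone())   # (0,)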


• Other transactions do not dirty any data read by T before T completes.
Isolation – serializability:
• If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
• An incomplete transaction cannot reveal its results to other transactions before its commitment.
• This is necessary to avoid cascading aborts.
Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures (database recovery).

Transaction transparency: ensures that all distributed transactions maintain the distributed database's integrity and consistency.
• A distributed transaction accesses data stored at more than one location.
• Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
• The DDBMS must ensure the indivisibility of both the global transaction and each of its subtransactions.
Concurrency transparency: all transactions must execute independently, and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
• Replication makes concurrency more complex.
Failure transparency: must ensure atomicity and durability of the global transaction.
• This means ensuring that the subtransactions of the global transaction either all commit or all abort.
Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
• Remote request
• Remote unit of work
• Distributed unit of work
• Distributed request

2. (a) Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database (16) (NOV/DEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.
Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects, like aggregation and association), behavior (a set of operations), and characteristic of types (generalization/serialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables, linked by keys to the main table.
Relationships: publish and writtenBy are associations, with 1:N and 1:1 relationships; composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes, through complex types, and at the behavioral level.
Generalization/Serialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; a message is a request from one object to another to execute one of its methods. For example: Publisher_object.insert("Rose", 123, …), i.e., a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.
The main differences between relational database design and object-oriented database design include:

• Many-to-many relationships must be removed before entities can be translated into relations. Many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model. Operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys. In the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version would have a write timestamp, and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object would also have a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery
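A bare-bones version store capturing the read-at-timestamp idea described above (an illustrative sketch only; real systems add locking, garbage collection of obsolete versions, and write validation):

class MVCCStore:
    def __init__(self):
        self.versions = {}          # key -> list of (txid, value), txid ascending

    def write(self, txid, key, value):
        # A new version is appended; old versions are kept, not overwritten.
        self.versions.setdefault(key, []).append((txid, value))

    def read(self, txid, key):
        # Return the most recent version whose txid precedes the reader's.
        candidates = [v for (t, v) in self.versions.get(key, []) if t <= txid]
        return candidates[-1] if candidates else None

db = MVCCStore()
db.write(0, "Object1", "Foo")
db.write(1, "Object1", "Hello")     # t1 supersedes t0 without deleting it
print(db.read(0, "Object1"))        # 'Foo'  -- a reader at t0 sees its snapshot
print(db.read(1, "Object1"))        # 'Hello'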


3. (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining (16) (NOV/DEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts decision support query load away from transaction processing systems.

When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive. It is usually OK to have slightly out-of-date data at the warehouse; data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use:

• Schema integration.
Data cleansing:
• E.g., correct mistakes in addresses (misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").
How to propagate updates:
• The warehouse schema may be a (materialized) view of the schema from the data sources; use efficient techniques for update of materialized views.
What data to summarize:
• Raw data may be too large to store on-line; aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.
• Typically, warehouse data is multidimensional, with very large fact tables. Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier. Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema.
More complicated schema structures:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification:


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive patterns – Associations:
• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:
∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example: suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general: different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either: all (or most) of the items at the node belong to the same class, or all attributes have been considered and no further partitioning is possible. Such a node is a leaf node.
• Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);
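A compact Python rendering of the GrowTree/Partition skeleton above. Purity is taken here as the fraction of the majority class, and the "best split" search is simplified to one categorical attribute per level; the thresholds dp, ds and the training rows are invented for the example:

from collections import Counter

def purity(rows):
    # Fraction of rows in the majority class (the label is the last field).
    labels = [r[-1] for r in rows]
    return Counter(labels).most_common(1)[0][1] / len(labels)

def grow_tree(rows, attrs, dp=0.9, ds=2):
    if purity(rows) >= dp or len(rows) < ds or not attrs:
        return Counter(r[-1] for r in rows).most_common(1)[0][0]  # leaf label
    a = attrs[0]                                  # simplified split choice
    tree = {}
    for val in {r[a] for r in rows}:              # one child per attribute value
        part = [r for r in rows if r[a] == val]
        tree[(a, val)] = grow_tree(part, attrs[1:], dp, ds)
    return tree

# (degree, income_band, credit) -- toy training set
data = [("masters", "high", "excellent"), ("masters", "low", "good"),
        ("bachelors", "high", "good"), ("bachelors", "low", "average")]
print(grow_tree(data, attrs=[0, 1]))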

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
• Bayesian classifiers use Bayes' theorem, which says
p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj);
• precomputation of p(cj);
• p(d) can be ignored, since it is the same for all classes.
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
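A sketch of a naive Bayesian classifier over categorical attributes, estimating each p(di | cj) by counting (the histogram) over a toy training set. Laplace smoothing is added so unseen values do not zero out the product; that smoothing, and the data, are assumptions beyond the text:

from collections import Counter, defaultdict

def train(rows):
    # rows: (attr1, ..., attrn, class). Build per-class attribute histograms.
    class_counts = Counter(r[-1] for r in rows)
    hist = defaultdict(Counter)               # (attr_index, class) -> value counts
    for r in rows:
        for i, val in enumerate(r[:-1]):
            hist[(i, r[-1])][val] += 1
    return class_counts, hist

def classify(d, class_counts, hist):
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for cj, cnt in class_counts.items():
        p = cnt / n                                # p(cj)
        for i, val in enumerate(d):                # p(d|cj) ~ product of p(di|cj)
            h = hist[(i, cj)]
            p *= (h[val] + 1) / (cnt + len(h) + 1)  # smoothed histogram estimate
        if p > best_p:
            best, best_p = cj, p
    return best

rows = [("masters", "high", "excellent"), ("bachelors", "high", "good"),
        ("bachelors", "low", "good"), ("masters", "low", "good")]
cc, h = train(rows)
print(classify(("masters", "high"), cc, h))        # 'excellent' on this toy data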

Regression
Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.

One way is to infer coefficients a0, a1, a2, …, an such that Y = a0 + a1·X1 + a2·X2 + … + an·Xn.

Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting. The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
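For a single variable, the least-squares coefficients have a closed form; a small worked sketch with made-up points:

# Fit Y = a0 + a1*X by least squares (one-variable case of the formula above).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.0, 6.2, 7.9]          # roughly Y = 2X, with noise

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
a0 = mean_y - a1 * mean_x
print(a0, a1)                       # a1 close to 2, a0 close to 0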

Association Rules
Retail shops are often interested in associations between the different items that people buy. Someone who buys bread is quite likely also to buy milk; a person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Associations information can be used in several ways, e.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
bread ⇒ milk
DB-Concepts, OS-Concepts ⇒ Networks


Left hand side: antecedent; right hand side: consequent.
An association rule must have an associated population: the population consists of a set of instances, e.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − {b}).
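A direct rendering of the naïve algorithm over a toy set of transactions; the thresholds and items are invented for the example:

from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "butter"}, {"milk", "butter"}]
min_support, min_conf = 0.5, 0.6

items = sorted(set().union(*transactions))
def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Steps 1-2: all candidate itemsets, keep the "large" (frequent) ones.
large = [set(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k) if support(set(c)) >= min_support]

# Steps 3-4: from each large itemset A, rules (A - {b}) => b with enough confidence.
for a in large:
    for b in a:
        if len(a) > 1 and support(a) / support(a - {b}) >= min_conf:
            print(sorted(a - {b}), "=>", b)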

Other Types of Associations
• Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
• We are interested in positive as well as negative correlations between sets of items: positive correlation – co-occurrence is higher than predicted; negative correlation – co-occurrence is lower than predicted.
• Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down within 2 days.
• Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer – not surprising, part of a known pattern; look for deviations from the value predicted using past patterns.
Clustering


• Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways: e.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. (Centroid: the point defined by taking the average of the coordinates in each dimension.)
• Another metric: minimize the average distance between every pair of points in a cluster.
• Clustering has been studied extensively in statistics, but on small data sets; data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm (more shortly).
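The "minimize average distance to the assigned centroid" formulation is what the classic k-means procedure implements; a minimal sketch on 2-D points (the data and k are chosen arbitrarily for illustration):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Recompute each centroid as the coordinate-wise average of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

pts = [(1, 1), (1.5, 2), (8, 8), (9, 9), (0.5, 1.5), (8.5, 9.5)]
cents, cls = kmeans(pts, k=2)
print(cents)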

Hierarchical Clustering
• Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, and repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of: the limited and intermittent connectivity afforded by wireless communications; the limited life of the power supply (battery); and the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.


• It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications:
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships:
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units may move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client: a client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges: servers must keep track of client locations in order to efficiently route messages to them, and client data should be stored in the network location that minimizes the traffic necessary to access it.
• The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database, with DBMS-like functionality, plus additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected; it can also perform its own recovery, to some extent.

• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course; but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one


– Many to many
• We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
– y is a dominant entity (in the example below, loan)
– x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
• These subgroupings become lower-level entity sets, which have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
– condition-defined
– user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
– disjoint
– overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
– total
– partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
– Treat the relationship as an abstract entity.
– Allows relationships between relationships.
– Abstraction of a relationship into a new entity.
• Without introducing redundancy, the resulting diagram represents that:
– A customer takes out a loan.
– An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
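As a concrete illustration of this reduction, the customer/account/depositor fragment of the banking schema can be realized as tables. This is a sketch using Python's sqlite3; the column types and sample values are invented for the example:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_name TEXT PRIMARY KEY,
                       customer_street TEXT, customer_city TEXT);
CREATE TABLE account  (account_number TEXT PRIMARY KEY, balance REAL);
-- The many-to-many relationship set 'depositor' becomes its own table,
-- holding the primary keys of both participating entity sets.
CREATE TABLE depositor (customer_name TEXT REFERENCES customer,
                        account_number TEXT REFERENCES account,
                        access_date TEXT,
                        PRIMARY KEY (customer_name, account_number));
""")
con.execute("INSERT INTO customer VALUES ('Hayes', 'Main St', 'Harrison')")
con.execute("INSERT INTO account VALUES ('A-102', 400.0)")
con.execute("INSERT INTO depositor VALUES ('Hayes', 'A-102', '2010-06-01')")
print(con.execute("SELECT * FROM depositor").fetchone())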

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
• Use the term chronon, instead of point, to describe the minimum granularity.
• A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events:
• Associated with a specific time period; a time period is represented by a start time and an end time.
Transaction time:
• The time when the information from a certain transaction becomes valid (i.e., is recorded in the database).
Bitemporal database:
• Databases dealing with two time dimensions.
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• valid start time
• valid end time
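Tuple versioning can be mimicked by adding these two time columns to each tuple and filtering on them when querying. A small sketch, with an invented salary history:

# Each tuple carries [valid_start, valid_end); None marks "until now".
emp_sal = [
    ("Smith", 2500, "2002-06-01", "2003-06-01"),
    ("Smith", 3000, "2003-06-01", None),
]

def salary_as_of(name, day):
    # Return the version of the tuple that was valid on the given day.
    for n, sal, start, end in emp_sal:
        if n == name and start <= day and (end is None or day < end):
            return sal

print(salary_as_of("Smith", "2002-12-31"))   # 2500
print(salary_as_of("Smith", "2004-01-01"))   # 3000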

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data:
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data:
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries:
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-neighbor queries:
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial join queries:
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS):
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing:
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases:
• Images, video, text, etc., stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.
Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box; the box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>. The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root). We can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf node: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf node: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
Improving Search Using Constraints

• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately? This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether. The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.
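A sketch of the overlap search on a hand-built two-level tree; boxes are (xmin, ymin, xmax, ymax) tuples, and the tree layout is invented for the example:

def overlaps(a, b):
    # Two boxes overlap unless one lies strictly to one side of the other.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, results):
    if node["leaf"]:
        for box, rid in node["entries"]:
            if overlaps(box, q):
                results.append(rid)        # rid identifies a candidate object
    else:
        for box, child in node["entries"]:
            if overlaps(box, q):           # descend only into overlapping subtrees
                search(child, q, results)

leaf1 = {"leaf": True,  "entries": [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")]}
leaf2 = {"leaf": True,  "entries": [((8, 8, 9, 9), "r3")]}
root  = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((8, 8, 9, 9), leaf2)]}

out = []
search(root, (1, 1, 3, 3), out)
print(out)                                  # ['r1', 'r2']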

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
– Adjust the entry for L in its parent so that the box now covers (only) L1.
– Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes: when a node overflows, instead of splitting, remove some (say, 30% of the) entries and reinsert them into the tree. This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary. Searches now take a single path to a leaf, at the cost of redundancy.

GiST: The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is no longer meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.

- Prices of microprocessors, memory and disks have dropped sharply.
- Databases are growing increasingly large.

- Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data,
- processing time-consuming decision-support queries,
- providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
- Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 ... n - 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
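A small sketch of the three schemes in Python follows; the helper names are illustrative, and disk numbering matches the description above.

import bisect

def round_robin(i, n):
    # The i-th tuple inserted in the relation goes to disk i mod n.
    return i % n

def hash_partition(value, n):
    return hash(value) % n

def range_partition(value, vector):
    # vector = [v0, ..., vn-2]: value < v0 -> disk 0,
    # vi <= value < vi+1 -> disk i+1, value >= vn-2 -> disk n-1.
    return bisect.bisect_right(vector, value)

vector = [5, 11]
print([range_partition(v, vector) for v in (2, 8, 20)])   # [0, 1, 2]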

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively - point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range - range queries, e.g., 10 <= r.A < 25.
Round-robin
- Advantages:
  - Best suited for sequential scan of the entire relation on each query.
  - All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
- Disadvantages:
  - Range queries are difficult to process: there is no clustering - tuples are scattered across all disks.
Hash partitioning
- Advantages:
  - Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, so retrieval work is well balanced.
  - Good for point queries on the partitioning attribute: a single disk can be looked up, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- Disadvantages:
  - No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted - an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed - that is, some disks have many tuples while others have fewer.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
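A minimal sketch of building such a balanced vector by scanning sorted values (the histogram variant would read bucket boundaries instead of raw tuples; the function and variable names are illustrative):

def balanced_vector(values, n):
    values = sorted(values)            # sort on the partitioning attribute
    step = len(values) // n
    # After every 1/n-th of the relation, record the partitioning-attribute
    # value of the next tuple as a partition-vector entry.
    return [values[i * step] for i in range(1, n)]

print(balanced_vector(range(100), 4))  # [25, 50, 75]: four equal partitions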


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained - reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
- Intraoperation parallelism - parallelize the execution of each individual operation in the query.
- Interoperation parallelism - execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
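A toy simulation of the range-partitioning sort in Python; lists stand in for disks and processors, and the redistribution, local sort, and trivial final concatenation mirror the steps above.

import bisect

def parallel_range_sort(tuples, vector):
    m = len(vector) + 1
    partitions = [[] for _ in range(m)]   # one partition per processor P0..Pm-1
    for t in tuples:
        # Redistribution: tuples in the i-th range go to processor Pi.
        partitions[bisect.bisect_right(vector, t)].append(t)
    for p in partitions:
        p.sort()                          # each Pi sorts its partition locally
    # The final merge is trivial: partitions are already ordered by range.
    return [t for p in partitions for t in p]

print(parallel_range_sort([9, 1, 14, 7, 3, 12], [5, 11]))   # [1, 3, 7, 9, 12, 14]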

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case - asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. The technique usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
- E.g., say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, so s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

  - Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:
  - e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
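A toy simulation of the partitioned parallel hash join in Python; h1 is modelled by Python's hash, and the local build/probe (standing in for the h2 phases) uses an in-memory dictionary. Relation layouts and names are assumptions of this sketch.

def parallel_hash_join(r, s, n):
    # Phase 1: redistribute both relations with h1 on the join attribute
    # (the first field of each tuple).
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in r:
        r_parts[hash(t[0]) % n].append(t)
    for t in s:
        s_parts[hash(t[0]) % n].append(t)
    # Phase 2: each processor Pi joins its local partitions ri and si.
    result = []
    for ri, si in zip(r_parts, s_parts):
        build = {}
        for t in si:                      # build on the smaller relation s
            build.setdefault(t[0], []).append(t)
        for t in ri:                      # probe with the r tuples
            for match in build.get(t[0], []):
                result.append(t + match[1:])
    return result

r = [(1, 'r1'), (2, 'r2'), (3, 'r3')]
s = [(2, 'sx'), (3, 'sy')]
print(parallel_hash_join(r, s, 4))        # [(2, 'r2', 'sx'), (3, 'r3', 'sy')]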

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Then:
- Use asymmetric fragment-and-replicate, with relation s being replicated and relation r using its existing partitioning.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperation Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = r3 ⋈ r4,
  - and P3 the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human-intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering - building an expert system.
- Knowledge Engineers - the people who build the system.
- Knowledge Representation - the symbols used to represent the knowledge.
- Factual Knowledge - knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge - more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules - derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer - decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter - interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining - starts from a set of conditions and moves towards some conclusion.
- Backward-chaining - starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures; a small forward-chaining sketch follows.
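A minimal forward-chaining sketch of the match-resolve-act loop in Python, with facts as strings and rules as (conditions, conclusion) pairs; the medical rule contents are purely illustrative.

rules = [
    ({'has_fever', 'has_rash'}, 'suspect_measles'),   # IF conditions THEN conclusion
    ({'suspect_measles'}, 'refer_to_specialist'),
]

def forward_chain(facts):
    facts = set(facts)
    while True:
        # Match: rules whose left-hand sides are satisfied and whose
        # conclusions are new form the conflict set.
        conflict_set = [c for lhs, c in rules if lhs <= facts and c not in facts]
        if not conflict_set:
            return facts
        # Resolve: trivially pick the first instantiation.
        # Act: add its conclusion to working memory, then repeat.
        facts.add(conflict_set[0])

print(sorted(forward_chain({'has_fever', 'has_rash'})))
# ['has_fever', 'has_rash', 'refer_to_specialist', 'suspect_measles']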

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
  KR                       | Inference
  Logic                    | Resolution principle
  Production rules         | backward (top-down, goal-directed); forward (bottom-up, data-driven)
  Semantic nets & Frames   | Inheritance & advanced reasoning
  Case-based Reasoning     | Similarity based
KBS tools - Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- Scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis - to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation - to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction - to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design - to develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning - both short term & long term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring - to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control - to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction - to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging - to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages:
- Increase availability of expert knowledge:
  - expertise otherwise not accessible;
  - training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.
Limitations:
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimics the organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database rather than a single server.
Disadvantages compared with conventional databases:
- More complex to implement.
- More costly to maintain.
- Security and integrity control are harder.
- Standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk - when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

  Time  Object1   Object2
  t1    "Hello"   "Bar"
  t0    "Foo"     "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
  Time  Object1   Object2     Object3
  t2    "Hello"   (deleted)   "Foo-Bar"
  t1    "Hello"   "Bar"
  t0    "Foo"     "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
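A minimal sketch of these versioned reads in Python: each object keeps a list of (write-timestamp, value) versions, a deletion is a tombstone version, and a reader at snapshot t sees the newest version with a timestamp <= t. The class and data are assumptions of the sketch, chosen to mirror the tables above.

class MVStore:
    def __init__(self):
        self.versions = {}        # object name -> list of (write_ts, value)

    def write(self, t, name, value):
        self.versions.setdefault(name, []).append((t, value))

    def read(self, t, name):
        # Snapshot read: newest version whose write timestamp precedes t.
        visible = [(ts, v) for ts, v in self.versions.get(name, []) if ts <= t]
        return max(visible)[1] if visible else None

db = MVStore()
db.write(0, 'Object1', 'Foo');  db.write(0, 'Object2', 'Bar')
db.write(1, 'Object1', 'Hello')
db.write(2, 'Object2', None)              # tombstone: deleted as of t2
db.write(2, 'Object3', 'Foo-Bar')
print(db.read(1, 'Object1'), db.read(1, 'Object2'))   # Hello Bar (snapshot at t1)
print(db.read(2, 'Object3'))                          # Foo-Bar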

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
  - The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units may move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  - Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
  - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


- A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication.
   - A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components.
   - Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:

- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
- Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operations, with a choice of algorithm for each operation.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.

Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
  Sailors(sid: integer, sname: string, rating: integer, age: real)
  Reserves(sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
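A toy sketch of the two halves - enumerate candidate left-deep orders, estimate each one's cost - over the example relations. The third relation (Boats, 50 pages) and the crude nested-loop cost formula are assumptions of this sketch, not System R's actual cost model.

from itertools import permutations

pages = {'Sailors': 500, 'Reserves': 1000, 'Boats': 50}

def plan_cost(order):
    # Pretend cost of a left-deep pipeline of simple nested-loop joins:
    # each join scans the inner relation once per outer page.
    cost, outer_pages = 0, pages[order[0]]
    for rel in order[1:]:
        cost += outer_pages * pages[rel]
        outer_pages = max(outer_pages, pages[rel])   # crude result-size guess
    return cost

best = min(permutations(pages), key=plan_cost)
print(best, plan_cost(best))   # the cheapest estimated left-deep order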

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.

- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:

- JPEG and GIF are the most widely used formats for image data.
- The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.

- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.

- Several alternatives for audio encoding:
  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained:
    - e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals:
  - they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data:
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs when registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92: Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:
  ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive Queries with Negation
  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
  Selection:      Result(Y) :- R(X, Y), X = c.
  Projection:     Result(Y) :- R(X, Y).
  Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
  Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
  Union:          Result(X, Y) :- R(X, Y).
                  Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
  WITH
    Big2(Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
    Small2(Part) AS
      ((SELECT A2.Part FROM Assembly A2)
       EXCEPT
       (SELECT B1.Part FROM Big2 B1))
  SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
  SELECT A.Part, SUM(A.Qty)
  FROM Assembly A
  GROUP BY A.Part
  NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of <...> is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


  Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
  SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:
  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  Magic_SL(spoke).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
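A small sketch of seminaive evaluation of the standard two-rule Comp program (base rule Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty), plus the recursive rule above) in Python; the Assembly instance is loosely based on the trike example in these notes, and the set-based encoding is an assumption of the sketch.

assembly = {('trike', 'wheel', 3), ('trike', 'frame', 1),
            ('wheel', 'spoke', 2), ('wheel', 'tire', 1),
            ('tire', 'rim', 1)}

def components(assembly):
    comp = {(p, s) for (p, s, _q) in assembly}    # base facts
    delta = set(comp)
    while delta:
        # Apply the recursive rule using only last iteration's new facts
        # (delta_Comp), so no inference is repeated across iterations.
        new = {(p, s2) for (p, p2, _q) in assembly
                       for (p2b, s2) in delta if p2 == p2b}
        delta = new - comp
        comp |= delta
    return comp

print(sorted(components(assembly)))
# includes ('trike', 'spoke'), ('trike', 'tire'), ('trike', 'rim'), ...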


Page 10: Database Technology


(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
- concurrency transparency
- failure transparency

Example Transaction - SQL Version
Begin_transaction Reservation
begin


    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC(FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end Reservation

Properties of Transactions
ATOMICITY
- all or nothing
CONSISTENCY
- no violation of integrity constraints
ISOLATION
- concurrent changes invisible (i.e., serializable)
DURABILITY
- committed updates persist
These are the ACID properties of a transaction.
Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency:
- Internal consistency:
  - A transaction which executes alone against a consistent database leaves it in a consistent state.
  - Transactions do not violate database integrity constraints.
- Transactions are correct programs.
Isolation:
- Degree 0:
  - Transaction T does not overwrite dirty data of other transactions.
  - Dirty data refers to data values that have been updated by a transaction prior to its commitment.
- Degree 1:
  - T does not overwrite dirty data of other transactions.
  - T does not commit any writes before EOT.
- Degree 2:
  - T does not overwrite dirty data of other transactions.
  - T does not commit any writes before EOT.
  - T does not read dirty data from other transactions.
- Degree 3:
  - T does not overwrite dirty data of other transactions.
  - T does not commit any writes before EOT.
  - T does not read dirty data from other transactions.


  - Other transactions do not dirty any data read by T before T completes.
Isolation - Serializability:
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
Incomplete results:
- An incomplete transaction cannot reveal its results to other transactions before its commitment.
- This is necessary to avoid cascading aborts.
Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures.
Database recovery
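A minimal sketch of atomicity in practice, using Python's sqlite3: either both the seat update and the booking insert commit, or neither does. The table layout mirrors the FLIGHT/FC example above, but is an assumption of this sketch.

import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE FLIGHT (FNO TEXT, DATE TEXT, STSOLD INTEGER)")
con.execute("CREATE TABLE FC (FNO TEXT, DATE TEXT, CNAME TEXT, SPECIAL TEXT)")
con.execute("INSERT INTO FLIGHT VALUES ('F100', '2010-06-01', 0)")

def reserve(flight_no, date, customer):
    try:
        with con:   # opens a transaction; commits on success, rolls back on error
            con.execute("UPDATE FLIGHT SET STSOLD = STSOLD + 1 "
                        "WHERE FNO = ? AND DATE = ?", (flight_no, date))
            con.execute("INSERT INTO FC VALUES (?, ?, ?, NULL)",
                        (flight_no, date, customer))
        print("reservation completed")
    except sqlite3.Error:
        print("reservation aborted; partial effects rolled back")

reserve('F100', '2010-06-01', 'Smith')
print(con.execute("SELECT STSOLD FROM FLIGHT").fetchone())   # (1,)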

Transaction transparency Ensures all distributed Ts maintain distributed databasersquos integrity and consistency

bull Distributed T accesses data stored at more than one location bull Each T is divided into no of subTs one for each site that has to be accessedbull DDBMS must ensure the indivisibility of both the global T and each of the subTs

Concurrency transparency All Ts must execute independently and be logically consistent with results obtained if Ts executed in some arbitrary serial order

bull Replication makes concurrency more complex Failure transparency must ensure atomicity and durability of global T

bull Means ensuring that subTs of global T either all commit or all abort bull Classification transparency In IBMrsquos Distributed Relational Database Architecture

(DRDA), there are four types of transactions:
• Remote request
• Remote unit of work
• Distributed unit of work
• Distributed request

2 (a)Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)

Or
(b) Describe modeling and design approaches for object oriented database (16)

(NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++, or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.
Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects like aggregation and association), behavior (a set of operations), and characteristic of types (generalization/serialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationship.


Attributes are like the fields in a relational model. However, in the Book example the attributes publishedBy and writtenBy have complex types, Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually represented as other tables linked by keys to the

main table. Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed-of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized

as attributes through complex types, and at the behavioral level. For example:

Generalization/Serialization is the "is-a" relationship, which is supported in OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A

subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, …), i.e., a request to

execute the insert method on a Publisher object. Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the value of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class. The main differences between

relational database design and object oriented database design include

• Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of the DB could be:
Time   Object1   Object2
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"
This indicates that the current state of this database (perhaps a key-value store) is Object1="Hello", Object2="Bar". Previously (at t0) Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions; it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction time t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time   Object1   Object2     Object3
t2     "Hello"   (deleted)   "Foo-Bar"
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery
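A minimal sketch of the versioning idea described above (the class and its structure are invented for illustration, not a production design): every write appends a version tagged with its write timestamp, and a read at timestamp ts returns the newest version no later than ts.

# Toy timestamp-based multiversion store (assumes writes arrive in ts order).
class MVStore:
    def __init__(self):
        self.versions = {}             # key -> list of (write_ts, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, ts):
        # most recent version whose write timestamp <= ts
        older = [v for (wts, v) in self.versions.get(key, []) if wts <= ts]
        return older[-1] if older else None

db = MVStore()
db.write("Object1", "Foo", ts=0)
db.write("Object2", "Bar", ts=0)
db.write("Object1", "Hello", ts=1)     # supersedes "Foo" but keeps it

print(db.read("Object1", ts=0))        # Foo   -- consistent snapshot at t0
print(db.read("Object1", ts=1))        # Hello -- snapshot at t1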


3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining (16)(NOVDEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts decision-support query load away from transaction processing systems.
When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
• Usually OK to have slightly out-of-date data at the warehouse.
• Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
• Schema integration
Data cleansing
• E.g., correct mistakes in addresses (misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates.
• Keep only one address record per household ("householding").
How to propagate updates
• Warehouse schema may be a (materialized) view of the schema from data sources.
• Efficient techniques for update of materialized views.
What data to summarize
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.
• Typically warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema.
More complicated schema structures:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.
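A small illustrative star schema (all table and column names are invented for this example), with one fact table, two dimension tables, and a typical aggregate query:

# Illustrative star schema in SQLite, plus an aggregate query over it.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE item_dim  (item_id  INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city      TEXT);
CREATE TABLE sales_fact (                 -- fact table: one row per sale
    item_id  INTEGER REFERENCES item_dim,
    store_id INTEGER REFERENCES store_dim,
    qty      INTEGER,
    price    REAL);
INSERT INTO item_dim  VALUES (1, 'pen'), (2, 'book');
INSERT INTO store_dim VALUES (1, 'Chennai'), (2, 'Madurai');
INSERT INTO sales_fact VALUES (1, 1, 10, 5.0), (2, 1, 2, 80.0), (1, 2, 4, 5.0);
""")

# Aggregate a measure (total revenue) grouped by a dimension attribute (city).
for row in con.execute("""
    SELECT s.city, SUM(f.qty * f.price) AS revenue
    FROM sales_fact f JOIN store_dim s ON f.store_id = s.store_id
    GROUP BY s.city"""):
    print(row)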

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns – Associations
• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.:
  ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root; since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income; since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general, different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
• The algorithm is greedy: the choice is made once, and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either
  – all (or most) of the items at the node belong to the same class, or
  – all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
• Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S);

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return;
    for each attribute A
        evaluate splits on attribute A;
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr;
    for i = 1, 2, …, r
        Partition(Si);
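A runnable toy version of the GrowTree/Partition scheme above (illustrative only: the data, the thresholds, and the crude single-branch "purity" split measure are simplifying assumptions, not the algorithm from any particular system):

# Toy recursive decision-tree builder over categorical attributes.
from collections import Counter

def purity(rows):                      # fraction of rows in the majority class
    labels = [r["class"] for r in rows]
    return Counter(labels).most_common(1)[0][1] / len(labels)

def grow_tree(rows, attrs, dp=0.9, ds=2):
    if purity(rows) >= dp or len(rows) < ds or not attrs:
        return Counter(r["class"] for r in rows).most_common(1)[0][0]  # leaf
    best = max(attrs, key=lambda a: max(          # crude split evaluation
        purity([r for r in rows if r[a] == v])
        for v in {r[a] for r in rows}))
    tree = {"attr": best, "children": {}}
    for v in {r[best] for r in rows}:             # one child per value
        subset = [r for r in rows if r[best] == v]
        tree["children"][v] = grow_tree(subset, [a for a in attrs if a != best], dp, ds)
    return tree

rows = [
    {"degree": "masters",   "income": "high", "class": "excellent"},
    {"degree": "masters",   "income": "low",  "class": "good"},
    {"degree": "bachelors", "income": "high", "class": "good"},
    {"degree": "bachelors", "income": "low",  "class": "average"},
]
print(grow_tree(rows, ["degree", "income"]))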

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; won't cover here.
• Bayesian classifiers use Bayes' theorem, which says
      p(cj | d) = p(d | cj) p(cj) / p(d)
  where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj)
• precomputation of p(cj)
• p(d) can be ignored, since it is the same for all classes.
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
      p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
Regression
Regression deals with the prediction of a value, rather than a class.

• Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, …, an such that
      Y = a0 + a1·X1 + a2·X2 + … + an·Xn
• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
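A minimal least-squares sketch for the single-predictor case Y = a0 + a1·X (the data values are invented; the closed-form normal equations give the coefficients that minimize squared error):

# Fit Y = a0 + a1*X by ordinary least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]           # roughly Y = 2*X

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print("Y = %.2f + %.2f * X" % (a0, a1))   # best-fit coefficients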

• Regression aims to find coefficients that give the best possible fit.
Association Rules
• Retail shops are often interested in associations between different items that people buy:

  – Someone who buys bread is quite likely also to buy milk.
  – A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
• Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


• Left hand side: antecedent; right hand side: consequent.
• An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers: the support for the rule milk ⇒ screwdrivers is low. We usually want rules with reasonably high support; rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   – Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − {b}).
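A toy version of the naive algorithm above (the transactions and the support threshold are invented for illustration): enumerate itemsets, keep the "large" ones, and emit rules A − {b} ⇒ b with their support and confidence.

# Naive association-rule mining over a tiny set of transactions.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk", "jam"}]
min_support = 0.5                      # fraction of transactions (assumed)

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
large = [set(c) for n in range(1, len(items) + 1)
         for c in combinations(items, n) if support(set(c)) >= min_support]

for A in large:
    if len(A) < 2:
        continue
    for b in A:
        lhs = sorted(A - {b})
        conf = support(A) / support(A - {b})
        print(lhs, "=>", b, " support=%.2f conf=%.2f" % (support(A), conf))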

Other Types of Associations
• Basic association rules have several limitations.
• Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
• We are interested in positive as well as negative correlations between sets of items:
  – Positive correlation: co-occurrence is higher than predicted.
  – Negative correlation: co-occurrence is lower than predicted.
• Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
• Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. Not surprising: part of a known pattern; look for deviation from the value predicted using past patterns.
Clustering


• Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  – Centroid: the point defined by taking the average of coordinates in each dimension.
  – Another metric: minimize the average distance between every pair of points in a cluster.
• Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm.
Hierarchical Clustering
• Example from biological classification; other examples: Internet directory systems (e.g., Yahoo).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

In mobile computing, the problems are more difficult, mainly:
• The limited and intermittent connectivity afforded by wireless communications.
• The limited life of the power supply (battery).
• The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and

challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications –
• The wireless medium has bandwidth significantly lower than that of a wired network.
• The current generation of wireless technology has data rates ranging from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the frequency spectrum which may cause interference with other appliances such as cordless telephones


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships –
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage it, the entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units can move unrestricted throughout the cells of the domain, while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

• The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  – The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  – One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  – Servers must keep track of client locations, in order to efficiently route messages to them.
  – Client data should be stored in the network location that minimizes the traffic necessary to access it.
  – The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:

• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.

• A server cannot connect to a client at will.
The characteristics of ISDBs (contd):

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

• A client has multiple ways of connecting to a server, and in case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities

• An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
• An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities. Example:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one


  – Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (–), signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  – y is a dominant entity (in the example below, loan)
  – x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.
Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must

relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).
Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined
  – user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint
  – overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – total
  – partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – Treat the relationship as an abstract entity.
  – Allows relationships between relationships.
  – Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – A customer takes out a loan.
  – An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
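A small sketch of this reduction using SQLite (column and table names follow the banking example above; the data values are invented). Note how payment, a weak entity set, takes the strong entity's key plus its discriminator as primary key:

# Reducing E-R sets to tables: strong entity, many-to-many set, weak entity.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customer (customer_name TEXT PRIMARY KEY, customer_city TEXT);
CREATE TABLE loan     (loan_number   TEXT PRIMARY KEY, amount REAL);

-- many-to-many relationship set: primary keys of both participants
CREATE TABLE borrower (
    customer_name TEXT REFERENCES customer,
    loan_number   TEXT REFERENCES loan,
    PRIMARY KEY (customer_name, loan_number));

-- weak entity set: PK = strong entity's PK + discriminator
CREATE TABLE payment (
    loan_number    TEXT REFERENCES loan,
    payment_number INTEGER,           -- discriminator
    payment_amount REAL,
    PRIMARY KEY (loan_number, payment_number));
""")
con.execute("INSERT INTO loan VALUES ('L-17', 1000.0)")
con.execute("INSERT INTO payment VALUES ('L-17', 1, 250.0)")
print(con.execute("SELECT * FROM payment").fetchall())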

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
• Use the term chronon instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience. Various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events


• A single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period. The time period is represented by a start time and an end time.

Transaction time
• The time when the information is recorded and current in the database.
Valid time
• The time period during which the fact is true in the modeled reality.
Bitemporal database
• Databases dealing with both time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
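A sketch of tuple versioning in SQLite (the schema and the '9999-12-31' sentinel are assumptions for illustration): each update closes the current version's valid end time and inserts a new version, instead of overwriting.

# Tuple versioning: updates append new versions; history is retained.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE emp_sal (
    emp_id      INTEGER,
    salary      REAL,
    valid_start TEXT,
    valid_end   TEXT)""")              # '9999-12-31' marks the current version

def update_salary(emp_id, new_salary, when):
    # close the current version, then insert the new one
    con.execute("UPDATE emp_sal SET valid_end = ? "
                "WHERE emp_id = ? AND valid_end = '9999-12-31'", (when, emp_id))
    con.execute("INSERT INTO emp_sal VALUES (?, ?, ?, '9999-12-31')",
                (emp_id, new_salary, when))

con.execute("INSERT INTO emp_sal VALUES (1, 30000, '2009-01-01', '9999-12-31')")
update_salary(1, 35000, '2010-06-01')
for row in con.execute("SELECT * FROM emp_sal ORDER BY valid_start"):
    print(row)                          # both versions are retained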

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial Join Queries
• Find all cities near a lake.
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc., stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.
Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  – This is Alternative (2), with the key value being a box.
  – The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>
  – The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  – Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root:
1. If the current node is non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Improving Search Using Constraints

• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  – Will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  – The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.
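A toy, in-memory sketch of the basic overlap search described above (node layout and data are invented: boxes are (xlo, ylo, xhi, yhi) tuples and nodes are plain dicts):

# R-tree overlap search: prune every entry whose box does not overlap Q.
def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, hits):
    for box, child in node["entries"]:
        if not overlaps(box, q):
            continue                    # prune this entry / subtree
        if node["leaf"]:
            hits.append(child)          # child is a rid here
        else:
            search(child, q, hits)      # child is a lower node
    return hits

leaf1 = {"leaf": True,  "entries": [((0, 0, 2, 2), "rid1"), ((3, 3, 4, 4), "rid2")]}
leaf2 = {"leaf": True,  "entries": [((8, 8, 9, 9), "rid3")]}
root  = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((8, 8, 9, 9), leaf2)]}

print(search(root, (1, 1, 3, 3), []))   # ['rid1', 'rid2']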

Insert Entry <B, ptr>
• Start at the root, and go down to the "best-fit" leaf L:
  – Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  – Adjust the entry for L in its parent so that the box now covers (only) L1.
  – Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  – Remove some (say, 30% of the) entries and reinsert them into the tree.
  – Could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  – Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions.
  – E.g., feature vectors in multimedia applications.
  – Very sparse.
• Nearest-neighbor queries are common.
  – The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  – In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
• Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
• Large volumes of transaction data are collected and stored for later analysis.
• Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data
• processing time-consuming decision-support queries
• providing high throughput for transaction processing

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk, by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin
• Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning
• Choose one or more attributes as the partitioning attributes.


• Choose a hash function h with range 0…n−1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, …, vn−2] is chosen.
• Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n−1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 ≤ r.A < 25.
Round-robin – Advantages:

• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering – tuples are scattered across all disks.
Hash partitioning
• Good for sequential access:

  – Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  – Retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  – Can look up a single disk, leaving others available for answering other queries.
  – An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  – Remaining disks are available for other queries.
  – Good if result tuples are from one to a few blocks.
  – If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted. (Example of execution skew.)
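A minimal sketch of the three partitioning strategies just compared (disk count, vector, and data are invented), mapping a tuple's partitioning-attribute value to one of n disks:

# Round-robin, hash, and range partitioning as simple routing functions.
n = 3
partition_vector = [5, 11]             # for range partitioning (n - 1 entries)

def round_robin(i):                    # i = insertion position of the tuple
    return i % n

def hash_partition(value):
    return hash(value) % n

def range_partition(value):
    for disk, boundary in enumerate(partition_vector):
        if value < boundary:
            return disk
    return len(partition_vector)       # last disk

# The example from the text: values 2, 8, 20 with vector [5, 11]
print([range_partition(v) for v in (2, 8, 20)])   # [0, 1, 2]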

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.
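A sketch of the balanced partition-vector construction described above (data values invented; assumes the partitioning attribute is a key, as stated):

# Build a balanced range-partition vector by sorting and sampling boundaries.
def balanced_vector(values, n_parts):
    s = sorted(values)
    step = len(s) // n_parts
    # after every 1/n-th of the relation, record the next attribute value
    return [s[i * step] for i in range(1, n_parts)]

values = [7, 3, 19, 2, 11, 5, 23, 13, 17]
print(balanced_vector(values, 3))      # [7, 17] -> three equal partitions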

Assume uniform distribution within each range of the histogram Histogram can be constructed by scanning relation or sampling (blocks containing) tuples of the relation

46

Interquery ParallelismQueriestransactions execute in parallel with one anotherIncreases transaction throughput used primarily to scale up a transaction processing system to support a larger number of transactions per secondEasiest form of parallelism to support particularly in a shared memory parallel database because even sequential database systems support concurrent processingMore complicated to implement on shared-disk or shared-nothing architectures

Locking and logging must be coordinated by passing messages between processors Data in a local buffer may have been updated at another processor Cache-coherency has to be maintained mdash reads and writes of data in buffer must find

latest version of dataCache Coherency ProtocolExample of a cache coherency protocol for shared disk systems

Before readingwriting to a page the page must be locked in sharedexclusive mode On locking a page the page must be read from disk Before unlocking a page the page must be written to disk if it was modified

More complex protocols with fewer disk reads/writes exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
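The whole scheme fits in a few lines of Python; the sketch below (illustrative only, with lists standing in for per-disk partitions) follows the three phases just described.

import heapq

def parallel_external_sort_merge(partitions, vector):
    # Phase 1: each processor Pi locally sorts its partition (disk Di)
    runs = [sorted(p) for p in partitions]

    # Phase 2: each sorted run is range-partitioned across processors
    # P0..Pm-1, so Pi receives one sorted stream per sending processor
    m = len(vector) + 1
    streams = [[[] for _ in runs] for _ in range(m)]
    for r, run in enumerate(runs):
        for key in run:
            i = sum(1 for v in vector if key >= v)
            streams[i][r].append(key)

    # Phase 3: each Pi merges its incoming sorted streams as they arrive,
    # and the per-processor outputs are concatenated for the final result
    return [k for i in range(m) for k in heapq.merge(*streams[i])]

print(parallel_external_sort_merge([[9, 2, 5], [7, 1, 8]], [5]))
# -> [1, 2, 5, 7, 8, 9]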

Parallel Join
The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s, and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si, to produce a partition of the final result of the hash join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
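The following Python sketch condenses the algorithm above (an illustration only; tuples are modeled as dicts, and the two hash functions are arbitrary stand-ins for h1 and h2):

N_PROCS = 4
h1 = lambda v: hash(v) % N_PROCS          # redistributes tuples to processors
h2 = lambda v: hash(("local", v))         # partitions locally for the join

def partitioned_parallel_hash_join(r, s, r_attr, s_attr):
    # Redistribute build relation s with h1: si goes to processor Pi
    s_parts = [[] for _ in range(N_PROCS)]
    for t in s:
        s_parts[h1(t[s_attr])].append(t)
    # Redistribute probe relation r with the same h1
    r_parts = [[] for _ in range(N_PROCS)]
    for t in r:
        r_parts[h1(t[r_attr])].append(t)

    result = []
    for i in range(N_PROCS):              # each Pi works independently
        build = {}                        # local hash table, keyed by h2
        for t in s_parts[i]:
            build.setdefault(h2(t[s_attr]), []).append(t)
        for t in r_parts[i]:              # probe phase
            for m in build.get(h2(t[r_attr]), []):
                if m[s_attr] == t[r_attr]:
                    result.append({**t, **m})
    return result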

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations; a pipeline can be set up that computes the three joins in parallel.
- Let P1 be assigned the computation of the first join, P2 the computation of its result joined with the third relation, and P3 the computation of that result joined with the fourth relation.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations.
- Let P1 be assigned the computation of the join of the first two relations, and P2 the computation of the join of the other two; P3 is assigned the computation of the join of the two intermediate results.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a portable document format (PDF) file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates human intelligence, like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the Artificial Intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.
Key terms:
- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle (a code sketch of the cycle appears after this list):
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
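As promised above, here is a minimal Python sketch of the match-resolve-act cycle (an illustration, not from the original answer; the rule format and the trivial pick-the-first conflict-resolution strategy are assumptions):

rules = [
    {"name": "r1", "if": lambda wm: "duck" in wm and "quacks" not in wm,
     "then": lambda wm: wm.add("quacks")},
    {"name": "r2", "if": lambda wm: "quacks" in wm and "bird" not in wm,
     "then": lambda wm: wm.add("bird")},
]

def run(working_memory):
    while True:
        # Match: collect all rules whose left-hand side is satisfied
        conflict_set = [r for r in rules if r["if"](working_memory)]
        if not conflict_set:
            break                         # no productions satisfied: halt
        # Conflict resolution: choose one instantiation (here, the first)
        chosen = conflict_set[0]
        # Act: execute the action, which may change working memory
        chosen["then"](working_memory)
    return working_memory

print(run({"duck"}))                      # -> {'duck', 'quacks', 'bird'}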

Components of a Rule-Based System
- Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the input data.
- Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the input data and draws a conclusion based on the user's responses.


Problem-Solving Models
- Forward chaining: starts from a set of conditions and moves towards some conclusion.
- Backward chaining: starts with a list of goals and then works backwards, to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.
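To complement the forward-chaining interpreter sketched earlier, here is a minimal backward-chaining goal solver in Python (illustrative only; the rule and fact contents are invented, and acyclic rules are assumed):

rules = {
    "bird":   [["quacks"], ["flies", "lays_eggs"]],   # two ways to prove "bird"
    "quacks": [["duck"]],
}
facts = {"duck"}

def prove(goal):
    if goal in facts:                     # the goal is directly known
        return True
    for body in rules.get(goal, []):      # try each rule with this head
        if all(prove(subgoal) for subgoal in body):
            return True                   # all subgoals proved
    return False

print(prove("bird"))                      # True: bird <- quacks <- duck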

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic (heuristics in search vs. in KBS; general vs. domain-specific).
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.
Knowledge representation formalisms and inference:
  KR formalism              Inference
  Logic                     Resolution principle
  Production rules          Backward (top-down, goal-directed) or forward (bottom-up, data-driven)
  Semantic nets and frames  Inheritance and advanced reasoning
  Case-based reasoning      Similarity-based
KBS tools: shells
- Consist of a KA tool, database and development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree, or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated and powerful;
  - support multiple KR paradigms and reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch (building from scratch):
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term and long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance and flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.
Advantages
- Increase availability of expert knowledge: expertise that is otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralised) databases:
- they mimic the organisational structure with data;
- local access and autonomy, without exclusion;
- cheaper to create and easier to expand;
- improved availability, reliability and performance, by removing reliance on a central site;
- reduced communication overhead: most data access is local, which is less expensive and performs better;
- improved processing power: many machines handle the database, rather than a single server.
Disadvantages compared with conventional databases:
- more complex to implement;
- more costly to maintain;
- security and integrity control standards and experience are lacking;
- design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out, or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1   Object2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1   Object2    Object3
t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
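The versioned-store idea can be made concrete with a few lines of Python (a minimal sketch for illustration, not any particular system's implementation; the class and method names are invented):

class MVStore:
    def __init__(self):
        self.versions = {}                # key -> list of (write_ts, value)
        self.ts = 0

    def write(self, key, value):
        self.ts += 1                      # increasing transaction ID
        self.versions.setdefault(key, []).append((self.ts, value))

    def read(self, key, snapshot_ts):
        # most recent version whose write timestamp precedes the snapshot
        older = [(t, v) for t, v in self.versions.get(key, [])
                 if t <= snapshot_ts]
        return max(older)[1] if older else None

db = MVStore()
db.write("Object1", "Foo")                # ts 1
db.write("Object1", "Hello")              # ts 2
snapshot = db.ts                          # long-running reader starts here
db.write("Object3", "Foo-Bar")            # ts 3: concurrent update
print(db.read("Object1", snapshot))       # "Hello" - reader is unaffected
print(db.read("Object3", snapshot))       # None - written after the snapshot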

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the mobile units.

- Wireless Communications:
  - The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony), to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
- Client/Network Relationships:
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
  - The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  - Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
  - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
  - Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.
The server may not be able to reach a client:


- A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down), or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture: for a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations, in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
  - Client mobility also allows new applications that are location-based.
Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication: a base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features, to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components: data management responsibility is shared among base stations or fixed hosts, and mobile units.
Data management issues, as applied to mobile databases:
- data distribution and replication;
- transaction models;
- query processing;
- recovery and fault tolerance;


- mobile database design;
- location-based services;
- division of labor;
- security.

Application: Intermittently Synchronized Databases
- Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client), or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
- Consider a set of alternative plans:
  - The search space must be pruned; typically, only left-deep plans are considered.
- Estimate the cost of each plan that is considered:
  - Must estimate the size of the result, and the cost, for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational-algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
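A minimal Python sketch of this pull (iterator) interface follows (an illustration only; the operator classes and the sample data are invented for the example):

class Scan:
    def __init__(self, table):            # leaf operator over a stored table
        self.table = table
    def __iter__(self):
        yield from self.table             # pulls tuples straight from storage

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def __iter__(self):
        for t in self.child:              # pull on the input ...
            if self.predicate(t):         # ... and compute output tuples
                yield t

sailors = [{"sid": 1, "rating": 9}, {"sid": 2, "rating": 3}]
plan = Select(Scan(sailors), lambda t: t["rating"] > 5)
print(list(plan))                         # [{'sid': 1, 'rating': 9}]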

Two main issues:
- For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
- How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we want to avoid the worst plans. We will study the System R approach.
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

This is similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- all available access methods, for each relation in the FROM clause;
- all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
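For illustration, a left-deep plan space can be enumerated with a few lines of Python (a sketch only; the relation names, including Boats, are assumptions for the example, and join methods are elided):

from itertools import permutations

def left_deep_plans(relations):
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            plan = ("join", plan, rel)    # prior result is always the outer input
        yield plan

for p in left_deep_plans(["Sailors", "Reserves", "Boats"]):
    print(p)
# e.g. ('join', ('join', 'Sailors', 'Reserves'), 'Boats')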

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed, steady retrieval rates for continuous-media data.
Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data exploit commonalities among a sequence of frames, to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different, as represented in the database, may be considered the same by a user; e.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command, or identify a data item, by speaking; e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries, as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
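For instance, the Comp program referred to in the sections below (which computes all direct and indirect subparts of a part from the Assembly relation) can be written as a pair of rules, the second of which is recursive:

Comp (Part, Subpt) :- Assembly (Part, Subpt, Qty).
Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), Comp (Part2, Subpt).

Reading them in the if-then style: if Assembly records Subpt as a direct subpart of Part, or as a subpart of some intermediate Part2, then (Part, Subpt) is in the answer.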

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts (Part) :- Assembly (Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers):
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics, and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation
Big (Part) :- Assembly (Part, Subpt, Qty), Qty > 2, not Small (Part).
Small (Part) :- Assembly (Part, Subpt, Qty), not Big (Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small, in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended, in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or on not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, then S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2 (Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2 (Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
The equivalent Datalog rule is:
NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations):
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


The rewritten recursive rule of the Comp program then becomes:
Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:
Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
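A compact Python sketch of seminaive evaluation for the Comp program follows (illustrative only; the small Assembly instance is invented for the example):

assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

def seminaive_comp(assembly):
    comp = {(p, s) for p, s, _ in assembly}   # base case: direct subparts
    delta = set(comp)                         # tuples from the last iteration
    while delta:
        # apply the recursive rule, requiring one body fact from delta_Comp
        new = {(p, sub) for p, s, _ in assembly
                        for s2, sub in delta if s2 == s}
        delta = new - comp                    # keep only genuinely new tuples
        comp |= delta
    return comp

print(sorted(seminaive_comp(assembly)))
# includes ('trike', 'spoke') and ('trike', 'tire'), derived via wheel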


Page 11: Database Technology

(ii) Transaction processing (8) (JUNE 2010)
A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency:
- concurrency transparency;
- failure transparency.
Example Transaction (SQL Version):
Begin_transaction Reservation
begin


    input(flight_no, date, customer_name);
    EXEC SQL UPDATE FLIGHT
        SET STSOLD = STSOLD + 1
        WHERE FNO = flight_no AND DATE = date;
    EXEC SQL INSERT
        INTO FC (FNO, DATE, CNAME, SPECIAL)
        VALUES (flight_no, date, customer_name, null);
    output("reservation completed")
end Reservation
Properties of Transactions
- ATOMICITY: all or nothing.
- CONSISTENCY: no violation of integrity constraints.
- ISOLATION: concurrent changes are invisible, i.e., the execution is serializable.
- DURABILITY: committed updates persist.
These are the ACID properties of transactions.
Atomicity: either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts, due to input errors, system overloads, or deadlocks, is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.
Consistency:
- Internal consistency: a transaction which executes alone against a consistent database leaves it in a consistent state; transactions do not violate database integrity constraints.
- Transactions are correct programs.
Isolation degrees:
- Degree 0: transaction T does not overwrite the dirty data of other transactions. (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
- Degree 2: T does not overwrite the dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions.
- Degree 3: T does not overwrite the dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions;


  and, in addition, other transactions do not dirty any data read by T before T completes.
Isolation also covers:
- Serializability: if several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
- Incomplete results: an incomplete transaction cannot reveal its results to other transactions before its commitment; this is necessary to avoid cascading aborts.
Durability: once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures; this relies on database recovery.
Transaction transparency: ensures that all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.
Concurrency transparency: all transactions must execute independently, and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.
Failure transparency: must ensure atomicity and durability of the global transaction.
- This means ensuring that the subtransactions of the global transaction either all commit or all abort.
Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- remote request;
- remote unit of work;
- distributed unit of work;
- distributed request.

2 (a) Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)
Or
(b) Describe modeling and design approaches for object oriented database (16) (NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects, and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important.
Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.
Object data modeling
An object consists of three parts: structure (attributes, and relationships to other objects, like aggregation and association), behavior (a set of operations), and characteristic of types (generalization/serialization). An object is similar to an entity in the ER model; therefore, we begin with an example to demonstrate the structure and relationship.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects, in an RDBMS, are usually other tables linked by keys to the main table.
Relationships: publish and writtenBy are associations, with 1:N and 1:1 relationships; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes, through complex types, and at the behavioral level. For example:
Generalization/Serialization is the "is a" relationship, which is supported in OODB through the class hierarchy. An ArtBook is a Book; therefore, the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123...), i.e., a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state, by modifying its attribute values, or to query the values of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.
The main differences between

relational database design and object oriented database design include

- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out, or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:
Time  Object1   Object2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time  Object1   Object2    Object3
t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery

Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse:
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction-processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts decision-support query load away from transaction-processing systems.

When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
• It is usually OK to have slightly out-of-date data at the warehouse.
• Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use:

• Schema integration.
Data cleansing:
• E.g., correct mistakes in addresses (e.g., misspellings, zip code errors).
• Merge address lists from different sources, and purge duplicates. Keep only one address record per household ("householding").

How to propagate updates:
• The warehouse schema may be a (materialized) view of the schema from the data sources.
• Efficient techniques for update of materialized views are needed.

What data to summarize:
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.

• Typically, warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema.
• More complicated schema structures:
  – Snowflake schema: multiple levels of dimension tables.
  – Constellation: multiple fact tables.
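As a small illustration of the star-schema idea (table contents invented; in a real warehouse these would be relational tables):

# Dimension tables map small integer codes to full values.
item_dim = {1: "laptop", 2: "phone"}
store_dim = {10: "Chennai", 11: "Madurai"}

# Fact table: one row per sale, holding dimension codes plus measures.
fact_sales = [
    # (item_id, store_id, date, number_sold, price)
    (1, 10, "2010-06-01", 2, 45000.0),
    (2, 11, "2010-06-01", 5, 8000.0),
]

# Queries join facts to dimensions via the codes, e.g. items sold per store.
totals = {}
for item_id, store_id, date, qty, price in fact_sales:
    store = store_dim[store_id]
    totals[store] = totals.get(store, 0) + qty
print(totals)  # {'Chennai': 2, 'Madurai': 5}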

Data Mining:
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.

Applications of Data Mining:
• Prediction based on past history:
  – Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  – Predict if a customer is likely to switch brand loyalty.
  – Predict if a customer is likely to respond to "junk mail".
  – Predict if a pattern of phone calling card usage is likely to be fraudulent.

Some examples of prediction mechanisms:
Classification:


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations:

• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation:
• E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules:
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
• Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.

  ∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
  ∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree:
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general:
  – Different branches of the tree could grow to different levels.


  – Different nodes at the same level may use different partitioning attributes.
• Greedy top-down generation of decision trees:
  – Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
  – The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either:
  – all (or most) of the items at the node belong to the same class, or
  – all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
• Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
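A minimal runnable rendering of this greedy procedure in Python (purity taken as the fraction of the majority class; the thresholds dp and ds correspond to δp and δs above; for brevity, each call picks one best binary split on a numeric attribute rather than multiway interval splits):

from collections import Counter

def purity(S):
    # S is a list of (features-dict, label) pairs.
    counts = Counter(label for _, label in S)
    return max(counts.values()) / len(S)

def majority(S):
    return Counter(label for _, label in S).most_common(1)[0][0]

def grow_tree(S, dp=0.9, ds=5):
    if purity(S) > dp or len(S) < ds:   # node is pure enough or too small
        return majority(S)              # -> leaf node
    best = None
    for a in S[0][0]:                   # evaluate splits on each attribute a
        for v in {f[a] for f, _ in S}:  # candidate split values
            left = [t for t in S if t[0][a] <= v]
            right = [t for t in S if t[0][a] > v]
            if not left or not right:
                continue
            # Greedy score: weighted purity of the two partitions.
            score = (len(left) * purity(left) +
                     len(right) * purity(right)) / len(S)
            if best is None or score > best[0]:
                best = (score, a, v, left, right)
    if best is None:
        return majority(S)
    _, a, v, left, right = best
    return {"attr": a, "split": v,
            "le": grow_tree(left, dp, ds), "gt": grow_tree(right, dp, ds)}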

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets:


• Widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says

  p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj),
• precomputation of p(cj);
• p(d) can be ignored, since it is the same for all classes.
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

  p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
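A naïve Bayesian classifier along these lines can be sketched as follows (categorical attributes, with plain value counts standing in for the histograms; illustrative only):

from collections import Counter, defaultdict

def train(data):
    # data: list of (attribute-tuple d, class label c)
    class_counts = Counter(c for _, c in data)
    # attr_counts[c][i][v] = number of class-c instances with d[i] == v
    attr_counts = defaultdict(lambda: defaultdict(Counter))
    for d, c in data:
        for i, v in enumerate(d):
            attr_counts[c][i][v] += 1
    return class_counts, attr_counts

def classify(d, class_counts, attr_counts):
    n = sum(class_counts.values())
    best_c, best_p = None, -1.0
    for c, nc in class_counts.items():
        p = nc / n                        # p(cj)
        for i, v in enumerate(d):         # p(d|cj) = product of the p(di|cj)
            p *= attr_counts[c][i][v] / nc
        if p > best_p:                    # p(d) ignored: same for all classes
            best_c, best_p = c, p
    return best_c

model = train([(("masters", "high"), "excellent"),
               (("bachelors", "mid"), "good"),
               (("masters", "mid"), "good")])
print(classify(("masters", "high"), *model))  # 'excellent'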

Regression
Regression deals with the prediction of a value, rather than a class:
• Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

• One way is to infer coefficients a0, a1, ..., an such that Y = a0 + a1·X1 + a2·X2 + ... + an·Xn.

• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
• Regression aims to find coefficients that give the best possible fit.
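A least-squares fit of such coefficients, sketched with NumPy (assumes NumPy is installed; the data values are invented for illustration):

import numpy as np

# Training data: rows of (X1, X2) and the observed Y values.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 5.0]])
Y = np.array([6.1, 5.2, 10.9, 16.8])

# Prepend a column of ones so that a0 acts as the intercept.
A = np.hstack([np.ones((len(X), 1)), X])
coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)  # [a0, a1, a2]

# Predict Y for a new parameter value (X1=2.5, X2=2.5).
print(np.array([1.0, 2.5, 2.5]) @ coeffs)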

Association Rules
Retail shops are often interested in associations between different items that people buy.
• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.

Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


• Left-hand side: antecedent; right-hand side: consequent.
• An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   – Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   – From itemset A, generate the rule A − b ⇒ b for each b ∈ A.
4. Support of rule = support(A).
   Confidence of rule = support(A) / support(A − b).
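The support/confidence computation of the naïve algorithm is direct to write down (a toy sketch over in-memory transactions; real miners such as Apriori prune the candidate itemsets):

from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "jam"}, {"milk", "screwdriver"}]

def support(itemset):
    # Fraction of transactions containing every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.25
items = sorted(set().union(*transactions))
large = [set(c) for n in (1, 2) for c in combinations(items, n)
         if support(set(c)) >= min_support]    # the "large itemsets"

for A in large:
    if len(A) < 2:
        continue
    for b in A:                                # rule (A - b) => b
        conf = support(A) / support(A - {b})
        print(sorted(A - {b}), "=>", b,
              "support=%.2f confidence=%.2f" % (support(A), conf))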

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting.
• E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
• We are interested in positive as well as negative correlations between sets of items:
  – Positive correlation: co-occurrence is higher than predicted.
  – Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations:
• E.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns:
• E.g., deviation from a steady growth.
• E.g., sales of winter wear go down in summer. This is not surprising, as it is part of a known pattern; look for deviations from the value predicted using past patterns.


Clustering
• Intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  – Centroid: the point defined by taking the average of the coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster.
• Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the Birch clustering algorithm (more shortly).
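The centroid formulation above is what the classical k-means procedure optimizes; a compact sketch (pure Python, 2-D points, a fixed iteration count for brevity):

import random

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                            (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        # Update step: recompute each centroid as the average per dimension.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

Large-data algorithms such as Birch avoid these repeated full scans by maintaining compact cluster summaries in a single pass over the data.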

Hierarchical Clustering
• Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster the small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, and repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases

• Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.

• In mobile computing, the problems are more difficult, mainly due to:
  – the limited and intermittent connectivity afforded by wireless communications;
  – the limited life of the power supply (battery);
  – the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.


• It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications:
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

• Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships:
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units may be unrestricted throughout the cells of a domain, while maintaining information access contiguity.

• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client: a client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  – The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  – One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  – Servers must keep track of client locations, in order to efficiently route messages to them.
  – Client data should be stored in the network location that minimizes the traffic necessary to access it.
  – The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

• A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate with a set of receivers, or update a group of clients).
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application that searches the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities

• An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
• An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
  customer = (customer-name, social-security, customer-street, customer-city)
  account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes

Relationship Sets
A relationship is an association among several entities. Example: customer entity Hayes is related to account entity A-102 via the depositor relationship set.

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:

  {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one
  – Many to many
• We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  – y is a dominant entity (in the example below, loan).
  – x is a subordinate entity (in the example below, payment).
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number: discriminator of the payment entity set.
• Primary key for payment: (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from the other entities in the set.


• These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined
  – user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint
  – overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – total
  – partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – Treat the relationship as an abstract entity.
  – Allows relationships between relationships.
  – Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – A customer takes out a loan.
  – An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table, with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
  – Use the term chronon instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience.
  – Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events:


• A single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events:
• Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time:
• The time when the information from a certain transaction becomes valid.
Bitemporal database:
• Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• a valid start time
• a valid end time
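A sketch of tuple versioning over a plain list of tuples (a hypothetical employee-salary history; None as the valid end time marks the current version):

from datetime import date

# Each tuple carries its valid time: (key, value, valid_start, valid_end).
emp_salary = [
    ("E1", 30000, date(2008, 1, 1), date(2009, 6, 30)),
    ("E1", 35000, date(2009, 7, 1), None),  # current version
]

def salary_on(emp, day):
    # Return the value whose valid period covers the given day.
    for key, value, start, end in emp_salary:
        if key == emp and start <= day and (end is None or day <= end):
            return value

print(salary_on("E1", date(2009, 3, 1)))  # 30000
print(salary_on("E1", date(2010, 1, 1)))  # 35000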

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data:
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data:
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
• Spatial range queries:
  – Find all cities within 50 miles of Madison.
  – The query has an associated region (location, boundary).
  – The answer includes overlapping or contained data regions.
• Nearest-neighbor queries:
  – Find the 10 cities nearest to Madison.
  – Results must be ordered by proximity.
• Spatial join queries:
  – Find all cities near a lake.
  – Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
• Geographic Information Systems (GIS):
  – E.g., ESRI's ArcInfo; OpenGIS Consortium.
  – Geospatial information.
  – All classes of spatial queries and data are common.
• Computer-Aided Design/Manufacturing:
  – Store spatial objects, such as the surface of an airplane fuselage.
  – Range queries and spatial join queries are common.
• Multimedia Databases:
  – Images, video, text, etc. stored and retrieved by content.
  – First converted to feature-vector form; high dimensionality.
  – Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
• Spatial queries (GIS, CAD):
  – Find all hotels within a radius of 5 miles from the conference venue.
  – Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  – Find all cities that lie on the Nile in Egypt.
  – Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval):
  – Given a face, find the five most similar faces.
• Multidimensional range queries:
  – 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
• An index based on spatial location is needed:
  – One-dimensional indexes don't support multidimensional searching efficiently.
  – Hash indexes only support point queries; we want to support range queries as well.
  – Must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes).
• The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>:
  – This is Alternative (2), with the key value being a box.
  – The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>:
  – The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root):
  – Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
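This two-case recursion translates directly into code; a minimal sketch with boxes as (xlo, ylo, xhi, yhi) tuples and an invented Node class (not any particular library's API):

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries  # list of (box, rid) or (box, child Node)

def overlaps(a, b):
    # Axis-aligned boxes overlap iff they intersect on every axis.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, results):
    for box, item in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                results.append(item)      # rid of an object that might overlap Q
            else:
                search(item, q, results)  # descend into the child subtree
    return results

leaf = Node(True, [((0, 0, 1, 1), "rid1"), ((2.5, 2.5, 3, 3), "rid2")])
root = Node(False, [((0, 0, 3, 3), leaf)])
print(search(root, (0, 0, 1.5, 1.5), []))  # ['rid1']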

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  – This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  – The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L:
  – Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  – Adjust the entry for L in its parent so that the box now covers (only) L1.
  – Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  – Remove some (say, 30% of the) entries and reinsert them into the tree.
  – This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
  – Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants:
  – Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  – B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  – GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions:
  – E.g., feature vectors in multimedia applications.
  – Very sparse.
• Nearest-neighbor queries are common:
  – The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful:
  – In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for:
  – storing large volumes of data;
  – processing time-consuming decision-support queries;
  – providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):
  – Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
  – Hash partitioning: choose one or more attributes as the partitioning attributes; choose a hash function h with range 0 ... n − 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
  – Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, ..., vn−2]. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 ≤ r.A < 25.

Round-robin:
Advantages:

• Best suited for sequential scan of the entire relation on each query.
• All disks have an almost equal number of tuples; retrieval work is thus well balanced between disks.
• Range queries are difficult to process: no clustering, tuples are scattered across all disks.

Hash partitioning:
• Good for sequential access:

  – Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  – Can look up a single disk, leaving the others available for answering other queries.
  – An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning:
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  – The remaining disks are available for other queries.
  – Good if the result tuples are from one to a few blocks.


  – If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
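The three partitioning functions compared above can be sketched in a few lines (n disks numbered 0 to n−1; illustrative only, with Python's built-in hash standing in for a real partitioning hash function):

def round_robin(i, n):
    # The i-th tuple inserted goes to disk i mod n.
    return i % n

def hash_partition(key, n):
    # Hash the partitioning attribute value into 0 ... n-1.
    return hash(key) % n

def range_partition(v, vector):
    # vector = [v0, v1, ..., v_{n-2}]; the first entry greater than v
    # decides the disk; values >= the last entry go to disk n-1.
    for i, vi in enumerate(vector):
        if v < vi:
            return i
    return len(vector)

# With partitioning vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
print([range_partition(v, [5, 11]) for v in (2, 8, 20)])  # [0, 1, 2]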

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
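The sort-based construction of a balanced partition vector is a one-liner over the sorted attribute values (a minimal sketch, assuming the partitioning attribute is a key, so there are no duplicates):

def balanced_partition_vector(values, n):
    # values: the partitioning-attribute values of all tuples.
    s = sorted(values)
    step = len(s) // n
    # After every 1/n-th of the relation, the next value becomes a boundary.
    return [s[i * step] for i in range(1, n)]

print(balanced_partition_vector(range(100), 4))  # [25, 50, 75]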


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  – Locking and logging must be coordinated by passing messages between processors.
  – Data in a local buffer may have been updated at another processor.
  – Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

• Before reading/writing a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
  – Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  – Interoperation parallelism: execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries, on the sorting attributes.
• Redistribute the relation using range partitioning:

  – All tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di.
  – This step requires I/O and communication overhead.

• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn−1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:

  – The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
  – Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  – The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1.
• Either range partitioning or hash partitioning can be used.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case: asymmetric fragment-and-replicate:
  – One of the relations, say r, is partitioned; any partitioning technique can be used.
  – The other relation, s, is replicated across all the processors.
  – Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used: e.g., say s is small and r is large, and already partitioned; it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Eg say s is small and r is large and already partitioned It may becheaper to replicate s across all processors rather than repartition r and s on the join attributesPartitioned Parallel Hash-JoinParallelizing partitioned hash joinAssume s is smaller than r and therefore s is chosen as the build relationA hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processorsEach processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1 Let si denote the tuples of relation s that aresent to processor PiAs tuples of relation s are received at the destination processors they are partitioned further using another hash function h2 which is used to compute the hash-join locallyOnce the tuples of s have been distributed the larger relation r is redistributed across the m processors using the hash function h1

Let ri denote the tuples of relation r that are sent to processor Pi


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
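The two levels of hashing (h1 to route tuples to processors, h2 for the local join) can be simulated in a single process (a toy sketch; the join attribute is assumed to be the first element of each tuple):

from collections import defaultdict

def h1(key, n):  # route a tuple to one of n "processors"
    return hash(key) % n

def parallel_hash_join(r, s, n):
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r: r_parts[h1(t[0], n)].append(t)   # redistribute r
    for t in s: s_parts[h1(t[0], n)].append(t)   # redistribute s
    out = []
    for i in range(n):                           # each processor Pi:
        build = defaultdict(list)                # build phase on si
        for t in s_parts[i]:
            build[t[0]].append(t)                # (local h2 hash table)
        for t in r_parts[i]:                     # probe phase with ri
            for m in build[t[0]]:
                out.append(t + m[1:])            # matching tuples joined
    return out

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (3, "y")]
print(parallel_hash_join(r, s, n=2))  # [(1, 'a', 'x')]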

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Then:
• Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
• Set up a pipeline that computes the three joins in parallel:
  – Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  – P2 the computation of temp2 = temp1 ⋈ r3,
  – and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  – Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  – P2 the computation of temp2 = r3 ⋈ r4,
  – and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  – Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
  – how to parallelize each operation, and how many processors to use for it;
  – what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.

Rule based systems are:
– Knowledge based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some subject-specific knowledge of one or more human experts
– Made up of a set of rules that analyze user-supplied information about a specific class of problems
– Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the knowledge
Factual Knowledge – knowledge of a particular task domain that is widely shared


Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members
- Solve problems that would normally be tackled by a medical or other professional
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources

Applications

A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction

A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
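A minimal Python sketch of this match-resolve-act cycle (the rule encoding and the pick-the-first conflict-resolution strategy are illustrative assumptions, not a standard interpreter):

# Minimal forward-chaining production system.
# Working memory is a set of facts; each rule is (LHS facts, RHS facts).
rules = [
    ({"fever", "rash"}, {"suspect_measles"}),               # illustrative rules
    ({"suspect_measles", "koplik_spots"}, {"diagnose_measles"}),
]
working_memory = {"fever", "rash", "koplik_spots"}

while True:
    # Match: find all rules whose LHS is satisfied and whose RHS is new.
    conflict_set = [(lhs, rhs) for lhs, rhs in rules
                    if lhs <= working_memory and not rhs <= working_memory]
    if not conflict_set:
        break                     # no satisfied productions: interpreter halts
    # Conflict-resolution: choose one instantiation (here, simply the first).
    lhs, rhs = conflict_set[0]
    # Act: execute the production's actions, changing working memory.
    working_memory |= rhs

print(working_memory)             # now contains 'diagnose_measles'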

Components of a Rule Based System

Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data

Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain

Interpreter – interprets the inputted data and draws a conclusion based on the user's responses


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals
Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases

Knowledge-based Systems

Definition: A system that draws upon the knowledge of human experts, captured in a knowledge-base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic
- Heuristics in search vs. in KBS: general vs. domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.

Knowledge representation formalisms & Inference (KR / Inference):
- Logic / Resolution principle
- Production rules / backward (top-down, goal directed) or forward (bottom-up, data-driven)
- Semantic nets & Frames / Inheritance & advanced reasoning
- Case-based Reasoning / Similarity based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive Shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool

Rule-based shells
  - simple to complex
  - IF-THEN rules

Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range

Special purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems

Scratch

  - require more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical tasks of KBS:
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure
(2) Interpretation – To provide an understanding of a situation from available information, e.g. DENDRAL
(3) Prediction – To predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT
(4) Design – To develop configurations that satisfy constraints of a design problem, e.g. XCON
(5) Planning – Both short term & long term, in areas like project management, product development or financial planning, e.g. HRM
(6) Monitoring – To check performance & flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle
(7) Control – To collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment
(8) Instruction – To train students and correct their performance, e.g. give medical students experience diagnosing illness
(9) Debugging – To identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages

- Increase availability of expert knowledge (expertise otherwise not accessible; training future experts)
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimic the organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, so it is less expensive and performs better
- Improved processing power: many machines handle the database, rather than a single server

Disadvantages compared with conventional (centralized) databases:
- More complex to implement
- More costly to maintain
- Security and integrity control are more difficult
- Standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3


t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
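A simplified Python sketch of these read/write timestamp rules (the class layout and names are illustrative assumptions, not any particular DBMS's API):

# Each object keeps a list of versions, each with a write timestamp;
# the object also tracks the largest timestamp of any reader.
class VersionedObject:
    def __init__(self):
        self.versions = []   # list of (write_ts, value)
        self.read_ts = 0     # RTS(P)

    def read(self, ts):
        # Return the most recent version whose write timestamp precedes ts.
        candidates = [(wts, v) for wts, v in self.versions if wts <= ts]
        if not candidates:
            raise KeyError("no version visible at this timestamp")
        self.read_ts = max(self.read_ts, ts)
        return max(candidates)[1]

    def write(self, ts, value):
        # If TS(Ti) < RTS(P), the writer is aborted and must restart.
        if ts < self.read_ts:
            raise RuntimeError("abort: object was read at a later timestamp")
        self.versions.append((ts, value))

obj = VersionedObject()
obj.write(1, "Foo")
obj.write(2, "Hello")
print(obj.read(2))   # "Hello" -- reader at ts 2 sees the latest version <= 2
print(obj.read(1))   # "Foo"   -- older snapshot still readable, no locks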

Recovery

(b) Discuss the client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly due to:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)


  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
- The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem.
  - Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
  - Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client.
  - The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance


  - Mobile database design
  - Location-based services
  - Division of labor
  - Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans; the search space must be pruned (typically to left-deep plans only).
  - Estimate the cost of each plan that is considered; this requires estimating the size of the result and the cost of each plan node.
  - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
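As a toy illustration of this plan-space search (not the actual System R algorithm), the sketch below enumerates left-deep join orders under a made-up nested-loops cost model; the page counts come from the example schema above, and the result-size rule is an arbitrary assumption:

from itertools import permutations

# Illustrative relation sizes in pages (from the example schema above).
pages = {"Sailors": 500, "Reserves": 1000}

def join_cost(outer_pages, inner_pages):
    # Toy cost model: simple nested-loops join, cost in page I/Os.
    return outer_pages + outer_pages * inner_pages

def best_left_deep_plan(relations):
    best = None
    for order in permutations(relations):        # all relation permutations
        cost, result_pages = 0, pages[order[0]]
        for inner in order[1:]:                  # join one relation at a time
            cost += join_cost(result_pages, pages[inner])
            result_pages = max(1, result_pages * pages[inner] // 100)  # crude size estimate
        if best is None or cost < best[0]:
            best = (cost, order)
    return best

print(best_left_deep_plan(["Sailors", "Reserves"]))
# -> the cheaper order keeps the smaller relation (Sailors) as the outer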

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)

Multimedia databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: Two pictures or images that are slightly different, as represented in the database, may be considered the same by a user; e.g., identify similar designs for registering a new trademark.
- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking; e.g., test user input against stored commands.
- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries? Yes, by adding recursion.
- Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum i: tables defined in terms of tables in Stratum i and lower strata.

  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
- Selection: Result(Y) :- R(X, Y), X = c.
- Projection: Result(Y) :- R(X, Y).
- Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
- Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
- Union: Result(X, Y) :- R(X, Y).
         Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
- Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For example, the recursive rule of the Comp program is rewritten as:
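A small Python sketch of seminaive evaluation of this Comp program (the Assembly facts are illustrative; quantities are omitted since they do not affect reachability):

# Seminaive evaluation: each round joins Assembly only with the delta
# (facts new in the last round), so no inference is repeated.
assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

comp = set(assembly)          # base case: direct subparts
delta = set(comp)             # facts generated in the previous iteration

while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, _), delta_Comp(Part2, Subpt)
    new = {(part, subpt)
           for part, part2 in assembly
           for p2, subpt in delta
           if part2 == p2}
    delta = new - comp        # keep only genuinely new facts
    comp |= delta

print(sorted(comp))           # includes ('trike', 'rim'), ('trike', 'spoke'), ...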


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add "Magic" filters: modify each rule in the program by adding a "Magic" condition to the body, which acts as a filter on the set of tuples generated by this rule.
- Define the "Magic" relations: we must create new rules to define the "Magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 12: Database Technology

input(flight_no, date, customer_name);
EXEC SQL UPDATE FLIGHT
  SET STSOLD = STSOLD + 1
  WHERE FNO = flight_no AND DATE = date;
EXEC SQL INSERT
  INTO FC(FNO, DATE, CNAME, SPECIAL)
  VALUES (flight_no, date, customer_name, null);
output("reservation completed");
end Reservation

Properties of Transactions
- ATOMICITY: all or nothing
- CONSISTENCY: no violation of integrity constraints
- ISOLATION: concurrent changes invisible, i.e. serializable
- DURABILITY: committed updates persist

These are the ACID properties of a transaction.

Atomicity: Either all or none of the transaction's operations are performed. Atomicity requires that if a transaction is interrupted by a failure, its partial results must be undone. The activity of preserving the transaction's atomicity in the presence of transaction aborts due to input errors, system overloads, or deadlocks is called transaction recovery. The activity of ensuring atomicity in the presence of system crashes is called crash recovery.

Consistency:
- Internal consistency: a transaction which executes alone against a consistent database leaves it in a consistent state; transactions do not violate database integrity constraints.
- Transactions are correct programs.

Isolation:
- Degree 0: Transaction T does not overwrite dirty data of other transactions. (Dirty data refers to data values that have been updated by a transaction prior to its commitment.)
- Degree 2: T does not overwrite dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions.
- Degree 3: T does not overwrite dirty data of other transactions; T does not commit any writes before EOT; T does not read dirty data from other transactions;


  and other transactions do not dirty any data read by T before T completes.

Isolation – Serializability:
- If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order.
- Incomplete results: an incomplete transaction cannot reveal its results to other transactions before its commitment. This is necessary to avoid cascading aborts.

Durability: Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures. This relies on database recovery.

Transaction transparency: ensures all distributed transactions maintain the distributed database's integrity and consistency.
- A distributed transaction accesses data stored at more than one location.
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed.
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions.

Concurrency transparency: all transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order.
- Replication makes concurrency more complex.

Failure transparency: must ensure atomicity and durability of the global transaction.
- Means ensuring that subtransactions of the global transaction either all commit or all abort.

Classification transparency: in IBM's Distributed Relational Database Architecture (DRDA), there are four types of transactions:
- Remote request
- Remote unit of work
- Distributed unit of work
- Distributed request

2 (a) Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)

Or
(b) Describe modeling and design approaches for object oriented database (16)

(NOVDEC 2010)


MODELING AND DESIGN

Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects, and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design / computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications, with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling: An object consists of three parts: structure (attributes and relationships to other objects, like aggregation and association), behavior (a set of operations), and characteristics of types (generalization/serialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example, the attributes publishedBy and writtenBy have complex types, Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables, linked by keys.

Relationships: publish and writtenBy are associations, with 1:N and 1:1 relationships; composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes, through complex types, and at the behavioral level.

Generalization/Serialization is the "is a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123...), i.e., a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state, by modifying attribute values, or to query the value of selected attributes. The method that responds to the message example is the method insert, defined in the Publisher class.

The main differences between relational database design and object-oriented database design include:
- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
- In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.
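To make the Book example concrete, here is a minimal Python sketch of the object model; the class and attribute names follow the text above, and all persistence machinery is omitted:

# Structure (attributes, association, aggregation), behavior (methods),
# and generalization via subclassing, for the Book example.
class Publisher:
    def __init__(self, name):
        self.name = name
        self.books = []                 # 1:N association: publish

    def insert(self, title, isbn):
        # Behavior: responds to the message Publisher_object.insert(...)
        book = Book(title, isbn, published_by=self)
        self.books.append(book)
        return book

class Chapter:
    def __init__(self, title):
        self.title = title

class Book:
    def __init__(self, title, isbn, published_by):
        self.title = title
        self.isbn = isbn
        self.published_by = published_by  # complex-typed attribute (Publisher)
        self.chapters = []                # aggregation: composedOf chapters

class ArtBook(Book):                      # generalization: an ArtBook is a Book
    def __init__(self, title, isbn, published_by, artist):
        super().__init__(title, isbn, published_by)
        self.artist = artist

p = Publisher("Rose")
b = p.insert("Database Technology", "123")
b.chapters.append(Chapter("Object Data Modeling"))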

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery



3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining (16) (NOVDEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
- Greatly simplifies querying; permits study of historical trends.
- Shifts decision support query load away from transaction processing systems.

When and how to gather data


- Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
- Destination-driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
- It is usually OK to have slightly out-of-date data at the warehouse.
- Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
- Schema integration

Data cleansing
- E.g., correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates.
- Keep only one address record per household ("householding").

How to propagate updates
- The warehouse schema may be a (materialized) view of the schema from data sources.
- Efficient techniques for update of materialized views.

What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.
- Typically, warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema.
- More complicated schema structures:
  - Snowflake schema: multiple levels of dimension tables.
  - Constellation: multiple fact tables.

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required:
  - Pre-processing of data
  - Choice of which type of pattern to find
  - Postprocessing to find novel patterns

Applications of Data Mining
- Prediction based on past history:
  - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict if a customer is likely to switch brand loyalty.
  - Predict if a customer is likely to respond to "junk mail".
  - Predict if a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification:


    Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
  - Regression formulae:
    Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
- Descriptive patterns:
  - Associations:
    Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too. Other similar applications: camera accessories, clothes, etc.
    Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
  - Clusters:
    E.g., typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be shown compactly as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.


  - Different nodes at the same level may use different partitioning attributes.
- Greedy top-down generation of decision trees:
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
  - The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
- Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm

Procedure GrowTree(S)
  Partition(S)

Procedure Partition(S)
  if (purity(S) > δp or |S| < δs) then return
  for each attribute A
    evaluate splits on attribute A
  Use best split found (across all attributes) to partition S into S1, S2, ..., Sr
  for i = 1, 2, ..., r
    Partition(Si)
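A runnable Python rendering of this greedy procedure, under simplifying assumptions: binary splits on numeric attributes, purity measured as the majority-class fraction, and illustrative values for the δp and δs thresholds:

# Greedy top-down decision-tree construction (cf. GrowTree/Partition above).
# Rows are (attribute_dict, class_label) pairs.
def purity(rows):
    labels = [label for _, label in rows]
    return max(labels.count(l) for l in set(labels)) / len(labels)

def majority(rows):
    labels = [label for _, label in rows]
    return max(set(labels), key=labels.count)

def partition(rows, min_purity=0.9, min_size=2):
    if purity(rows) >= min_purity or len(rows) < min_size:
        return {"leaf": majority(rows)}
    best = None
    for attr in rows[0][0]:                       # evaluate splits on each attribute
        for threshold in {r[attr] for r, _ in rows}:
            left = [row for row in rows if row[0][attr] <= threshold]
            right = [row for row in rows if row[0][attr] > threshold]
            if not left or not right:
                continue
            # Score a split by the size-weighted purity of its children.
            score = (len(left) * purity(left) + len(right) * purity(right)) / len(rows)
            if best is None or score > best[0]:
                best = (score, attr, threshold, left, right)
    if best is None:
        return {"leaf": majority(rows)}
    _, attr, threshold, left, right = best
    return {"split": (attr, threshold),
            "le": partition(left, min_purity, min_size),
            "gt": partition(right, min_purity, min_size)}

rows = [({"income": 30}, "good"), ({"income": 80}, "excellent"),
        ({"income": 90}, "excellent"), ({"income": 20}, "good")]
print(partition(rows))   # splits on income, then yields 'good'/'excellent' leaves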

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers

Neural net classifiers use the training data to train artificial neural nets. They are widely studied in AI; we won't cover them here.

Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require:

- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes

To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | cj) = p(d1 | cj) p(d2 | cj) ... p(dn | cj)

Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
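A tiny naive Bayes sketch in Python over categorical attributes; per-class value counts stand in for the histograms, and the training data is illustrative:

from collections import Counter, defaultdict

# Training instances: (attribute tuple, class label).
training = [
    (("masters", "high"), "excellent"),
    (("bachelors", "medium"), "good"),
    (("masters", "high"), "excellent"),
    (("bachelors", "low"), "good"),
]

class_counts = Counter(label for _, label in training)
# hist[class][attribute_position][value] = count (the "histograms" above)
hist = defaultdict(lambda: defaultdict(Counter))
for attrs, label in training:
    for i, v in enumerate(attrs):
        hist[label][i][v] += 1

def classify(attrs):
    best, best_score = None, 0.0
    for c, n in class_counts.items():
        score = n / len(training)             # p(cj)
        for i, v in enumerate(attrs):         # p(di | cj), independence assumed
            score *= hist[c][i][v] / n
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(("masters", "high")))          # -> 'excellent'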

Regression

Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

- One way is to infer coefficients a0, a1, a2, ..., an such that

    Y = a0 + a1 X1 + a2 X2 + ... + an Xn

- Finding such a linear polynomial is called linear regression.
- In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
- Regression aims to find coefficients that give the best possible fit.
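A minimal sketch of one-variable linear regression (ordinary least squares) in Python, with illustrative data points:

# Fit Y = a0 + a1*X by least squares (one explanatory variable).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]          # roughly Y = 2X, with noise

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares coefficients.
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x

print(f"Y = {a0:.2f} + {a1:.2f} X")   # best-fit line; the fit is approximate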

Association Rules

Retail shops are often interested in associations between different items that people buy:
- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.

Association rules:
bread ⇒ milk
DB-Concepts, OS-Concepts ⇒ Networks


Left-hand side: antecedent; right-hand side: consequent. An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.

Rules have an associated support, as well as an associated confidence. Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.

Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater). Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − {b}).
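A small sketch of this naïve algorithm, assuming transactions are represented as Python sets of items; the support and confidence thresholds are illustrative.

from itertools import combinations

def association_rules(transactions, min_support=0.02, min_confidence=0.5):
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    rules = []
    # Step 1: consider all possible sets of relevant items (naive enumeration).
    for size in range(2, len(items) + 1):
        for A in map(frozenset, combinations(items, size)):
            sup_A = support(A)             # step 2: find each set's support
            if sup_A < min_support:
                continue                   # not a large itemset
            for b in A:                    # step 3: rules A - {b} => b
                conf = sup_A / support(A - {b})
                if conf >= min_confidence:
                    rules.append((set(A - {b}), b, sup_A, conf))
    return rules

txns = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}, {"milk"}]
print(association_rules(txns))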

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
- Positive correlation: co-occurrence is higher than predicted.
- Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days. Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer – not surprising, part of a known pattern. Look for deviation from the value predicted using past patterns.

Clustering


Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.

Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. Centroid: the point defined by taking the average of the coordinates in each dimension. Another metric: minimize the average distance between every pair of points in a cluster. Clustering has been studied extensively in statistics, but on small data sets; data mining systems aim at clustering techniques that can handle very large data sets. E.g., the BIRCH clustering algorithm (more shortly).
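A minimal k-means sketch of the centroid-based formulation above (an illustrative algorithm choice; the text itself points to BIRCH for very large data sets).

import random

def kmeans(points, k, iterations=20):
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: sum((a - b) ** 2
                    for a, b in zip(p, centroids[j])))
            clusters[i].append(p)
        # Recompute each centroid: per-dimension average of its cluster.
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centroids, clusters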

Hierarchical Clustering
Example from biological classification; other examples: Internet directory systems (e.g., Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

In mobile computing, the problems are more difficult, mainly:
- The limited and intermittent connectivity afforded by wireless communications.
- The limited life of the power supply (battery).
- The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility, the entire domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture. Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2. In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients. Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units. MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server. Transaction processing and data consistency control become more difficult, since there is no central control in this architecture. Resource discovery and data routing by mobile units make computing in a MANET even more complicated. Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server. Mobile computing poses challenges for servers as well as clients.

The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it. Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates.
- The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
- a collection of entities
- relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set For instance the depositor relationship set between entity sets customer and account may have the attribute access-date


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
- y is a dominant entity (in the example below, loan)
- x is a subordinate entity (in the example below, payment)
If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
- payment-number – the discriminator of the payment entity set.
- Primary key for payment – (loan-number, payment-number).

Specialization
- A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

- A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization – contributes to modularity in the design.
- The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- The term chronon is used, instead of point, to describe the minimum granularity.
A calendar organizes time into different time units for convenience, and accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events


- A single time-point event, e.g., a bank deposit.
- A series of point events can form time series data.
Duration events
- Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
- The time when the information from a certain transaction becomes valid.
Bitemporal database
- A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- a valid start time
- a valid end time
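A minimal sketch of tuple versioning, with an illustrative table layout: each row carries a valid-time interval, and a query at time t returns the version whose interval contains t.

END_OF_TIME = float("inf")

employees = [
    # (emp_id, salary, valid_start, valid_end)
    (1, 40000, 2018, 2020),
    (1, 45000, 2020, END_OF_TIME),   # current version of employee 1
]

def salary_at(emp_id, t):
    # Return the salary version valid at time t.
    for eid, salary, start, end in employees:
        if eid == emp_id and start <= t < end:
            return salary
    return None

print(salary_at(1, 2019))   # 40000
print(salary_at(1, 2023))   # 45000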

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is ≤ 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
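A small sketch of this search, assuming an illustrative in-memory node layout (dicts with a "leaf" flag and an "entries" list); boxes are (low-corner, high-corner) pairs.

def overlaps(a, b):
    # Two boxes overlap iff their intervals intersect in every dimension.
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def search(node, Q, results):
    if node["leaf"]:
        for box, rid in node["entries"]:
            if overlaps(box, Q):
                results.append(rid)       # candidate: might overlap Q
    else:
        for box, child in node["entries"]:
            if overlaps(box, Q):          # only descend into overlapping subtrees
                search(child, Q, results)
    return results

leaf = {"leaf": True, "entries": [(((0, 0), (2, 2)), "rid1"),
                                  (((5, 5), (6, 6)), "rid2")]}
print(search(leaf, ((1, 1), (3, 3)), []))   # ['rid1']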

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly. But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.

Prices of microprocessors, memory, and disks have dropped sharply. Databases are growing increasingly large:
- large volumes of transaction data are collected and stored for later analysis;
- multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other, and concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks. Horizontal partitioning – the tuples of a relation are divided among many disks, such that each tuple resides on one disk. Partitioning techniques (number of disks = n):
- Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 ... n − 1. Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn−2] is chosen. Let v be the partitioning-attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; and tuples with v ≥ vn−2 go to disk n − 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.

Round-robin
Advantages:

- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
- Range queries are difficult to process: no clustering – tuples are scattered across all disks.

Hash partitioning
- Good for sequential access:

  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks; retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: we can look up the single relevant disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning-attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
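A tiny sketch of the three partitioning rules above; Python's built-in hash() stands in for the hash function h, and the disk numbering follows the text.

def round_robin_disk(i, n):
    # i = insertion order of the tuple, n = number of disks
    return i % n

def hash_partition_disk(value, n):
    # h(value) with range 0 .. n-1
    return hash(value) % n

def range_partition_disk(value, vector):
    # vector = [v0, v1, ..., v(n-2)]; e.g. [5, 11]: 2 -> disk 0,
    # 8 -> disk 1, 20 -> disk 2
    for i, v in enumerate(vector):
        if value < v:
            return i
    return len(vector)

print(range_partition_disk(8, [5, 11]))   # 1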

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples. Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained – reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn−1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.

- Each processor Pi locally computes ri ⋈ si; any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique (depicted on the next slide). A special case is asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

- E.g., say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s, and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:

  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
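A compact single-process sketch of the scheme above: lists stand in for the n processors, Python's hash() stands in for h1, and a simple in-memory build/probe stands in for the h2-based local hash-join.

from collections import defaultdict

def partitioned_hash_join(r, s, n, r_key, s_key):
    # Phase 1: route tuples of both relations with h1(v) = hash(v) % n.
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[hash(t[r_key]) % n].append(t)
    for t in s:
        s_parts[hash(t[s_key]) % n].append(t)

    # Phase 2: each "processor" i joins r_i with s_i locally.
    result = []
    for i in range(n):
        build = defaultdict(list)
        for t in s_parts[i]:              # build on the smaller relation s
            build[t[s_key]].append(t)
        for t in r_parts[i]:              # probe with r
            for match in build[t[r_key]]:
                result.append((t, match))
    return result

r = [(1, "a"), (2, "b")]
s = [(1, "x"), (3, "y")]
print(partitioned_hash_join(r, s, n=2, r_key=0, s_key=0))   # [((1, 'a'), (1, 'x'))]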

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of r1 ⋈ r2,
  - P2 the computation of (r1 ⋈ r2) ⋈ r3,
  - and P3 the computation of ((r1 ⋈ r2) ⋈ r3) ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of r1 ⋈ r2,
  - P2 the computation of r3 ⋈ r4,
  - and P3 the computation of (r1 ⋈ r2) ⋈ (r3 ⋈ r4).
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
  - The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form; they are also known as expert systems, and belong to the field of artificial intelligence. Rule-based systems are:
- Knowledge-based systems.
- Part of the artificial intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering – building an expert system.
- Knowledge engineers – the people who build the system.
- Knowledge representation – the symbols used to represent the knowledge.
- Factual knowledge – knowledge of a particular task domain that is widely shared.


- Heuristic knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of rules – derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data, and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and works backwards, to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.
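A minimal forward-chaining sketch: known facts are expanded by firing rules until no new conclusions appear; the facts and rules are purely illustrative.

def forward_chain(facts, rules):
    # rules: list of (set-of-antecedents, consequent) pairs
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)     # the rule fires, adding a new fact
                changed = True
    return facts

rules = [({"fever", "cough"}, "flu-suspected"),
         ({"flu-suspected", "high-risk"}, "refer-to-doctor")]
print(forward_chain({"fever", "cough", "high-risk"}, rules))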

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems
Definition: a system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge Representation Formalisms & Inference
- Logic: resolution principle.
- Production rules: backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity based.

KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools, applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.
Advantages

- Increase availability of expert knowledge
  - expertise not otherwise accessible
  - training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations:
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimics organisational structure with data
- local access and autonomy without exclusion
- cheaper to create and easier to expand
- improved availability / reliability / performance by removing reliance on a central site
- reduced communication overhead: most data access is local, which is less expensive and performs better
- improved processing power: many machines handle the database, rather than a single server

Disadvantages:
- more complex to implement
- more costly to maintain
- security and integrity control more difficult
- standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages(NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.


MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
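The timestamp rules above can be made concrete in a few lines. The following Python fragment is a minimal sketch, with invented class and method names rather than any particular DBMS's API: each object keeps a list of versions, and the read/write-timestamp tests just described decide whether a writer must abort.

class Version:
    def __init__(self, value, wts, rts):
        self.value = value    # the data value of this version
        self.wts = wts        # write timestamp: the transaction that created it
        self.rts = rts        # read timestamp: most recent transaction that read it

class MVObject:
    def __init__(self, initial, ts=0):
        self.versions = [Version(initial, ts, ts)]

    def read(self, ts):
        # Read the most recent version whose write timestamp precedes ts.
        v = max((v for v in self.versions if v.wts <= ts), key=lambda v: v.wts)
        v.rts = max(v.rts, ts)       # record that a reader this recent saw it
        return v.value

    def write(self, ts, value):
        v = max((v for v in self.versions if v.wts <= ts), key=lambda v: v.wts)
        if ts < v.rts:
            # A younger transaction already read the version we would supersede:
            # abort and restart the writer (TS(Ti) < RTS(P)).
            raise Exception("abort: TS(Ti) < RTS(P)")
        # Otherwise create a new version stamped with the writer's timestamp.
        self.versions.append(Version(value, ts, ts))

p = MVObject("Foo", ts=0)
p.write(1, "Hello")      # creates a new version with WTS = 1
print(p.read(1))         # a reader at ts=1 sees "Hello"
print(p.read(0))         # an older reader still sees "Foo" - no locks needed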

At t1, the state of the DB could be:

Time   Object1    Object2
t1     "Hello"    "Bar"
t0     "Foo"      "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1    Object2     Object3
t2     "Hello"    (deleted)   "Foo-Bar"
t1     "Hello"    "Bar"
t0     "Foo"      "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology led to mobile computing, a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized. Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.

It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications -
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships -
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client


A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance


Mobile database design Location-based service Division of labor Security

Application: Intermittently Synchronized Databases
Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers or update a group of clients).

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:

- Consider a set of alternative plans
  - Must prune search space; typically left-deep plans only
- Must estimate cost of each plan that is considered
  - Must estimate size of result and cost for each plan node
  - Key issues: statistics, indexes, operator implementations
Plan: tree of RA operators, with a choice of algorithm for each operator


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search plan space for the cheapest (estimated) plan
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans.
We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

Similar to old schema; rname added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods for each relation in the FROM clause
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods)
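As a back-of-the-envelope illustration of plan costing (our own example, using the page counts given above), the following Python fragment estimates the I/O cost of a page-oriented nested-loops join of Sailors and Reserves under the two left-deep orders; cost = M + M*N is the standard estimate for this operator.

reserves_pages = 1000   # Reserves: 100 tuples/page, 1000 pages
sailors_pages = 500     # Sailors:  80 tuples/page,  500 pages

def page_nl_join_cost(outer, inner):
    # Scan the outer relation once; for each outer page, scan the inner.
    return outer + outer * inner

# The optimizer compares the alternative join orders by estimated cost:
print(page_nl_join_cost(reserves_pages, sailors_pages))  # 501000 page I/Os
print(page_nl_join_cost(sailors_pages, reserves_pages))  # 500500 page I/Os
# Even for one operator, the choice of outer/inner changes the estimate,
# which is why the optimizer enumerates plans and costs each one.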

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database
- rather than storing them outside the database, in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF: the most widely used formats for image data
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression
- MPEG-1: quality comparable to VHS video tape
  - stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality
  - compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding


- MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.
Continuous-Media Data

Most important types are video and audio data; characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video
- Data must be delivered at a rate that does not cause overflow of system buffers
- Synchronization among distinct data streams must be maintained:

video of a person speaking must show lips moving synchronously with the audio

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals
  - must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data
- PCs or TVs attached to a small, inexpensive computer called a set-top box
Similarity-Based Retrieval
Examples of similarity-based retrieval:

Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user
- e.g., identify similar designs for registering a new trademark
Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking
- e.g., test user input against stored commands

Handwritten data: identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively
- Prolog-style syntax is commonly used

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- takes us one level down the Assembly hierarchy
- to find components that are one level deeper (e.g., rim), we need another join
- to find all components, we need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other)
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs


Consider the following program:
    ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers)
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
  - the set of all integers is a fixpoint of double+
  - the set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first
- But we have Small(trike) if Rule 2 is applied first
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints)

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body
- Stratified program: if T depends on not S, then S cannot depend on T (or not T)
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables
- Stratum i: tables defined in terms of tables in stratum i and lower strata
- If T depends on not S, S is in a lower stratum than T

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields
- In order to apply such a rule, we must have all of the Assembly relation available
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Semi-naive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration
  - Rewrite the program to use the delta tables, and update the delta tables between iterations


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R
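The semi-naive strategy of 15.4.1 can be sketched directly. The following Python fragment (a toy Assembly instance and our own variable names, not textbook code) computes the fixpoint of Comp while joining only against the delta tuples of the previous iteration.

assembly = {                # Assembly(Part, Subpart, Qty)
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("frame", "seat", 1),
}

# Base case: every direct subpart is a component.
comp = {(p, s) for (p, s, q) in assembly}
delta = set(comp)           # tuples generated in the previous iteration

while delta:
    # Apply the recursive rule
    #   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    # using only the delta tuples, so every inference involves a new fact.
    new = {(p, s2) for (p, p2, q) in assembly
                   for (p2d, s2) in delta if p2 == p2d}
    delta = new - comp      # keep only genuinely new tuples
    comp |= delta

print(sorted(comp))         # includes ('trike', 'spoke'), ('trike', 'tire'), ...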



- Other transactions do not dirty any data read by T before T completes
Isolation
- Serializability
  - If several transactions are executed concurrently, the results must be the same as if they were executed serially in some order
- Incomplete results
  - An incomplete transaction cannot reveal its results to other transactions before its commitment
  - Necessary to avoid cascading aborts
Durability
- Once a transaction commits, the system must guarantee that the results of its operations will never be lost, in spite of subsequent failures
- Database recovery

Transaction transparency
- Ensures all distributed transactions maintain the distributed database's integrity and consistency
- A distributed transaction accesses data stored at more than one location
- Each transaction is divided into a number of subtransactions, one for each site that has to be accessed
- The DDBMS must ensure the indivisibility of both the global transaction and each of the subtransactions
Concurrency transparency
- All transactions must execute independently and be logically consistent with the results obtained if the transactions were executed in some arbitrary serial order
- Replication makes concurrency more complex
Failure transparency
- Must ensure atomicity and durability of the global transaction
- Means ensuring that subtransactions of the global transaction either all commit or all abort
Classification transparency
- In IBM's Distributed Relational Database Architecture (DRDA), four types of transactions:
  - Remote request
  - Remote unit of work
  - Distributed unit of work
  - Distributed request

2 (a)Discuss the Modeling and design approaches for Object Oriented Databases (JUNE 2010)

Or
(b) Describe modeling and design approaches for object oriented database (16)

(NOVDEC 2010)


MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface or Database API that provides DBMS functionality as, effectively, an extension of the OOPL.
Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions and long transactions are extremely important. Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution and dynamic generation of new types.
Object data modeling
An object consists of three parts: structure (attributes and relationships to other objects, like aggregation and association), behavior (a set of operations) and characteristics of types (generalization/serialization). An object is similar to an entity in the ER model; therefore we begin with an example to demonstrate the structure and relationships.


Attributes are like the fields in a relational model. However, in the Book example we have for attributes publishedBy and writtenBy the complex types Publisher and Author, which are also objects; attributes with complex objects in an RDBMS are usually other tables linked by keys to the main table. Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composedOf is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level.

Generalization/Serialization is the "is a" relationship, which is supported in OODB through class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.
Message: the means by which objects communicate; a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...) is a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert defined in the Publisher class.
The main differences between relational database design and object-oriented database design include:
- Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database


- Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database
- In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects
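The Book example above can be sketched in a class-based language. The following Python fragment (field and class names are our own, mirroring the example) shows attributes with complex types, the composed-of aggregation, and the ArtBook is-a Book inheritance.

class Publisher:
    def __init__(self, name):
        self.name = name

class Author:
    def __init__(self, name):
        self.name = name

class Chapter:
    def __init__(self, title):
        self.title = title

class Book:
    def __init__(self, title, published_by, written_by):
        self.title = title
        self.publishedBy = published_by   # attribute with complex type Publisher
        self.writtenBy = written_by       # association with an Author object
        self.chapters = []                # aggregation: a Book is composed of Chapters

    def add_chapter(self, chapter):       # behavior, invoked by sending a message
        self.chapters.append(chapter)

class ArtBook(Book):                       # an ArtBook *is a* Book: it inherits
    def __init__(self, title, published_by, written_by, art_style):
        super().__init__(title, published_by, written_by)
        self.art_style = art_style

b = ArtBook("Impressionism", Publisher("Rose"), Author("Ann"), "impressionist")
b.add_chapter(Chapter("Monet"))            # message: request to run add_chapter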

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes.


Writes affect future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time   Object1    Object2
t1     "Hello"    "Bar"
t0     "Foo"      "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1    Object2     Object3
t2     "Hello"    (deleted)   "Foo-Bar"
t1     "Hello"    "Bar"
t0     "Foo"      "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining (16) (NOVDEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas
- Data sources often store only current data, not historical data
- Corporate decision making requires a unified view of all organizational data, including historical data
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site
  - Greatly simplifies querying; permits study of historical trends
  - Shifts decision support query load away from transaction processing systems

When and how to gather data


- Source-driven architecture: data sources transmit new information to warehouse, either continuously or periodically (e.g., at night)
- Destination-driven architecture: warehouse periodically requests new information from data sources
- Keeping warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive

- Usually OK to have slightly out-of-date data at the warehouse
- Data/updates are periodically downloaded from online transaction processing (OLTP) systems
What schema to use
- Schema integration
Data cleansing

- E.g., correct mistakes in addresses (e.g., misspellings, zip code errors)
- Merge address lists from different sources and purge duplicates
- Keep only one address record per household ("householding")

How to propagate updates
- Warehouse schema may be a (materialized) view of schema from data sources
- Efficient techniques for update of materialized views
What data to summarize
- Raw data may be too large to store on-line
- Aggregate values (totals/subtotals) often suffice
- Queries on raw data can often be transformed by query optimizer to use aggregate values

Typically warehouse data is multidimensional, with very large fact tables
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier
- Examples of measures: number of items sold, price of items


Dimension values are usually encoded using small integers and mapped to full values via dimension tables
The resultant schema is called a star schema
More complicated schema structures:
- Snowflake schema: multiple levels of dimension tables
- Constellation: multiple fact tables

Data Mining
Broadly speaking, data mining is the process of semi-automatically analyzing large

databases to find useful patterns Like knowledge discovery in artificial intelligence data mining discovers statistical rules

and patterns Differs from machine learning in that it deals with large volumes of data stored primarily

on disk Some types of knowledge discovered from a database can be represented by a set of

rules, e.g., "Young women with annual incomes greater than $50000 are most likely to buy sports cars"

Other types of knowledge represented by equations or by prediction functions Some manual intervention is usually required

- Pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns

Applications of Data Mining
Prediction based on past history:

- Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history

- Predict if a customer is likely to respond to "junk mail"
- Predict if a pattern of phone calling card usage is likely to be fraudulent

Some examples of prediction mechanisms:
Classification


Given a training set consisting of items belonging to different classes and a new item whose class is unknown predict which class it belongs to

Regression formulae
- Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value
Descriptive Patterns
Associations

Find books that are often bought by the same customers If a new customer buys one such book suggest that he buys the others too

Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation

Eg association between exposure to chemical X and cancer or new medicine and cardiac problems

Clusters Eg typhoid cases were clustered in an area surrounding a contaminated well Detection of clusters remains important in detecting epidemics

Classification Rules Classification rules help assign new objects to a set of classes Eg given a new

automobile insurance applicant should he or she be classified as low risk medium risk or high risk

Classification rules for above example could use a variety of knowledge such as educational level of applicant salary of applicant age of applicant etc

∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent

∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good

Rules are not necessarily exact: there may be some misclassifications. Classification rules can be compactly shown as a decision tree.

Decision Tree
Training set: a data sample in which the grouping for each tuple is already known.
Consider the credit risk example. Suppose degree is chosen to partition the data at the root.

Since degree has a small number of possible values one child is created for each value

At each child node of the root further classification is done if required Here partitions are defined by income

Since income is a continuous attribute some number of intervals are chosen and one child created for each interval

Different classification algorithms use different ways of choosing which attribute to partition on at each node and what the intervals if any are

In general Different branches of the tree could grow to different levels


Different nodes at the same level may use different partitioning attributes
Greedy top-down generation of decision trees

Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node

More on choosing partioning attributecondition shortly Algorithm is greedy the choice is made once and not revisited as more of the tree

is constructedThe data at a node is not partitioned further if either

All (or most) of the items at the node belong to the same class or All attributes have been considered and no further partitioning is possible

Such a node is a leaf node Otherwise the data at the node is partitioned further by picking an attribute for partitioning data at the node

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers

Neural net classifiers use the training data to train artificial neural nets


Widely studied in AI; won't cover here.
Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
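A minimal sketch of the histogram-based estimate, on invented categorical data (names and values are our own):

from collections import Counter, defaultdict

training = [  # (attribute values, class)
    (("masters", "high"), "excellent"),
    (("masters", "high"), "excellent"),
    (("bachelors", "medium"), "good"),
    (("bachelors", "low"), "bad"),
]

class_counts = Counter(c for _, c in training)
hist = defaultdict(Counter)          # hist[(i, class)][value] = count
for attrs, c in training:
    for i, v in enumerate(attrs):
        hist[(i, c)][v] += 1

def classify(attrs):
    best, best_score = None, 0.0
    for c, n in class_counts.items():
        score = n / len(training)    # p(cj); p(d) is the same for all classes
        for i, v in enumerate(attrs):
            score *= hist[(i, c)][v] / n   # histogram estimate of p(di | cj)
        if score > best_score:
            best, best_score = c, score
    return best

print(classify(("masters", "high")))   # -> "excellent"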

Histograms on multiple attributes are more expensive to compute and store.
Regression
Regression deals with the prediction of a value, rather than a class.

Given values for a set of variables X1 X2 hellip Xn we wish to predict the value of a variable Y

One way is to infer coefficients a0, a1, a2, ..., an such that
    Y = a0 + a1 X1 + a2 X2 + ... + an Xn

Finding such a linear polynomial is called linear regression In general the process of finding a curve that fits the data is also called curve fitting

The fit may only be approximate because of noise in the data or because the relationship is not exactly a polynomial

Regression aims to find coefficients that give the best possible fit.
Association Rules
Retail shops are often interested in associations between different items that people buy:

- Someone who buys bread is quite likely also to buy milk
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts

Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.

Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks


Left hand side: antecedent; right hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A - b ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A - b).
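A minimal sketch of the support and confidence computations, on an invented toy population:

from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "milk", "cereal"},
    {"bread"}, {"milk"}, {"bread", "milk", "screwdriver"},
]

def support(itemset):
    # Fraction of the population containing every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread"}, {"milk"}
sup = support(antecedent | consequent)   # support of the rule bread => milk
conf = sup / support(antecedent)         # confidence of the rule
print(sup, conf)                         # 0.6 and 0.75

# Naive large-itemset enumeration: all subsets with support >= 2%.
items = sorted(set().union(*transactions))
large = [set(s) for k in range(1, len(items) + 1)
                for s in combinations(items, k) if support(set(s)) >= 0.02]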

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting. E.g., if many people purchase bread and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2). We are interested in positive as well as negative correlations between sets of items:
- Positive correlation: co-occurrence is higher than predicted
- Negative correlation: co-occurrence is lower than predicted
Sequence associations/correlations:
- E.g., whenever bonds go up, stock prices go down in 2 days
Deviations from temporal patterns:
- E.g., deviation from a steady growth
- E.g., sales of winter wear go down in summer; not surprising, part of a known pattern. Look for deviation from the value predicted using past patterns
Clustering


Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster

Can be formalized using distance metrics in several ways:
- E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized
  - Centroid: point defined by taking the average of coordinates in each dimension
- Another metric: minimize the average distance between every pair of points in a cluster
Has been studied extensively in statistics, but on small data sets; data mining systems aim at clustering techniques that can handle very large data sets
- E.g., the Birch clustering algorithm (more shortly)
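The centroid criterion above is what k-means-style algorithms optimise; a bare-bones sketch on toy 1-D data (our own example, not BIRCH itself):

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
k = 2
centroids = points[:k]                 # naive initialisation

for _ in range(10):                    # a few refinement passes
    clusters = [[] for _ in range(k)]
    for p in points:                   # assign each point to nearest centroid
        i = min(range(k), key=lambda i: abs(p - centroids[i]))
        clusters[i].append(p)
    # recompute each centroid as the average of its cluster's coordinates
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(centroids)                       # roughly [1.0, 8.07]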

Hierarchical Clustering
- Example from biological classification; other examples: Internet directory systems (e.g., Yahoo; more on this later)
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications -
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the frequency spectrum which may cause interference with other appliances such as cordless telephones


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships -
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client.
A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- data distribution and replication;
- transaction models;
- query processing;
- recovery and fault tolerance;
- mobile database design;
- location-based services;
- division of labor;
- security.

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates.
- The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.

- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail. (JUNE 2010) Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

- a collection of entities, and
- relationships among entities.

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
- simple and composite attributes;
- single-valued and multi-valued attributes;
- null attributes;
- derived attributes.
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set For instance the depositor relationship set between entity sets customer and account may have the attribute access-date


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
- We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  - y is a dominant entity (in the example below, loan);
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number – discriminator of the payment entity set.
- Primary key for payment – (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).
Generalization

- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined;
  - user-defined.
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint;
  - overlapping.
- Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total;
  - partial.

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - treat the relationship as an abstract entity;
  - allows relationships between relationships;
  - abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - a customer takes out a loan;
  - an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization – contributes to modularity in the design.
- The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized (higher-level) entity set, account. Form a table for each lower-level entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized, containing all local and inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010) Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

- Time is considered an ordered sequence of points in some granularity.
  - The term chronon is used instead of point to describe the minimum granularity.
- A calendar organizes time into different time units for convenience.
  - Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events:


- A single time point event, e.g. a bank deposit.
- A series of point events can form time series data.
Duration events:
- Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time:
- The time when the information from a certain transaction becomes valid.
Bitemporal database:
- Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- a valid start time;
- a valid end time.
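For concreteness, a minimal Python sketch of tuple versioning, assuming a toy employee relation (the table layout, attribute names and dates are illustrative):

from datetime import date

# Every row carries a valid start and valid end time; "now" is modeled
# as a far-future date so the current version is the one ending at NOW.
NOW = date.max

employees = [
    # (emp_id, salary, valid_start, valid_end)
    (1, 40000, date(2009, 1, 1), NOW),
]

def update_salary(rows, emp_id, new_salary, change_date):
    """Close the current version of the tuple and append a new version."""
    updated = []
    for (eid, sal, start, end) in rows:
        if eid == emp_id and end == NOW:
            updated.append((eid, sal, start, change_date))       # close old version
            updated.append((eid, new_salary, change_date, NOW))  # open new version
        else:
            updated.append((eid, sal, start, end))
    return updated

employees = update_salary(employees, 1, 45000, date(2010, 6, 1))
# Both versions are retained, so queries can ask for the salary as of any date.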

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute:
- An attribute that changes over time, e.g. age.
Non-time-varying attribute:
- An attribute that does not change over time, e.g. date of birth.
Spatial Database
Types of Spatial Data

Point data:
- Points in a multidimensional space.
- E.g. raster data, such as satellite imagery, where each pixel stores a measured value.
- E.g. feature vectors extracted from text.
Region data:
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries:
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-neighbor queries:
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial join queries:
- Find all cities near a lake.
- Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS):
- E.g. ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing:
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases:
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g. an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
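A small sketch of the linearization point, using the hypothetical <age, sal> entries above:

# Entries are ordered first by age, then by sal, as in a composite-key B+ tree.
entries = [(11, 80), (12, 10), (12, 20), (13, 75)]
entries.sort()

# A range query on age alone maps to one contiguous run of entries:
print([e for e in entries if 12 <= e[0] <= 13])  # [(12, 10), (12, 20), (13, 75)]

# But a range query on sal alone is scattered across the whole order,
# so the single-dimensional index degenerates to a full scan:
print([e for e in entries if 70 <= e[1] <= 90])  # [(11, 80), (13, 75)]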

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g. lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>:
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dim box, ptr to child node>:
- The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root):
- Can choose a parameter m that is ≤ 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
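A minimal sketch of this overlap search, assuming axis-aligned boxes (xmin, ymin, xmax, ymax) and toy dict-based nodes rather than a real R-tree implementation:

def overlaps(a, b):
    # Two axis-aligned boxes overlap iff they overlap on every axis.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, results):
    for box, child in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                results.append(child)      # at a leaf, child is a rid
            else:
                search(child, q, results)  # otherwise, descend into the subtree
    return results

leaf = {"leaf": True,
        "entries": [((1, 1, 2, 2), "rid-1"), ((5, 5, 6, 6), "rid-2")]}
root = {"leaf": False, "entries": [((0, 0, 8, 8), leaf)]}
print(search(root, (0, 0, 3, 3), []))      # ['rid-1']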

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L:
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - remove some (say, 30% of the) entries and reinsert them into the tree;
  - this could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
  - searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants:
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions:
  - e.g. feature vectors in multimedia applications;
  - very sparse.
- Nearest-neighbor queries are common:
  - the R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is then not meaningful:
  - in any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
- prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:
- large volumes of transaction data are collected and stored for later analysis;
- multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel:
- data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra):
- this makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
- Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes;

44

  choose a hash function h with range 0 … n−1; let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute; a partitioning vector [v0, v1, …, vn−2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n−1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g. 10 ≤ r.A < 25.
Round-robin
Advantages:

- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Range queries are difficult to process:
- No clustering – tuples are scattered across all disks.
Hash partitioning
Good for sequential access:

- Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.
- Retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
- Can look up a single disk, leaving the others available for answering other queries.
- An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - the remaining disks are available for other queries;
  - good if result tuples are from one to a few blocks;
  - if many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted (an example of execution skew).
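As an illustration, a minimal sketch of the three partitioning strategies, assuming disks modeled as Python lists and an illustrative relation and partitioning vector:

n = 3
disks = [[] for _ in range(n)]   # each call below assumes fresh, empty disks

def round_robin(tuples):
    # The i-th tuple inserted goes to disk i mod n.
    for i, t in enumerate(tuples):
        disks[i % n].append(t)

def hash_partition(tuples, key):
    # Tuple goes to the disk chosen by hashing its partitioning attribute.
    for t in tuples:
        disks[hash(t[key]) % n].append(t)

def range_partition(tuples, key, vector):   # e.g. vector = [5, 11]
    for t in tuples:
        v = t[key]
        for i, bound in enumerate(vector):
            if v < bound:
                disks[i].append(t)           # v < v0 -> disk 0, etc.
                break
        else:
            disks[len(vector)].append(t)     # v >= last bound -> last disk

range_partition([(2, "a"), (8, "b"), (20, "c")], 0, [5, 11])
print(disks)   # value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2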

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
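A small sketch of building a balanced range-partition vector from sorted keys (the key list shown is an illustrative sample):

def partition_vector(keys, n):
    # Sort on the partitioning attribute, then record a cut point after
    # every 1/n-th of the sorted relation: n-1 vector entries in total.
    keys = sorted(keys)
    step = len(keys) // n
    return [keys[i * step] for i in range(1, n)]

print(partition_vector([7, 2, 20, 8, 3, 15], 3))   # [7, 15]
# Resulting partitions: v < 7 -> {2, 3}; 7 <= v < 15 -> {7, 8}; v >= 15 -> {15, 20}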


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput, and is used primarily to scale up a transaction processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
- locking and logging must be coordinated by passing messages between processors;
- data in a local buffer may have been updated at another processor;
- cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
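A minimal sketch of this lock, read, write-back, unlock discipline; the page store and lock table here are simplified stand-ins, not a real buffer manager:

import threading

disk = {"page-1": "v0"}
locks = {"page-1": threading.Lock()}

def with_page(page_id, update=None):
    locks[page_id].acquire()      # lock the page (shared/exclusive simplified)
    value = disk[page_id]         # on locking, read the page from disk
    if update is not None:
        value = update(value)
        disk[page_id] = value     # before unlocking, write back if modified
    locks[page_id].release()
    return value

print(with_page("page-1"))                         # plain read
print(with_page("page-1", lambda v: v + "+new"))   # locked read-modify-write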

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks is important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, …, Pm, where m ≤ n − 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

- All tuples that lie in the i-th range are sent to processor Pi.
- Pi stores the tuples it received temporarily on disk Di.
- This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, …, Dn−1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:

- The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.
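A toy sketch of the range-partitioning sort just described; the data and partition vector are made up, and a thread pool stands in for the processors:

from multiprocessing.dummy import Pool   # thread-based pool, for illustration

def range_partition_sort(values, vector):
    # Redistribute tuples by range: parts[i] plays the role of processor Pi.
    parts = [[] for _ in range(len(vector) + 1)]
    for v in values:
        i = sum(v >= bound for bound in vector)   # which range v falls into
        parts[i].append(v)
    with Pool(len(parts)) as pool:
        sorted_parts = pool.map(sorted, parts)    # each "processor" sorts locally
    # Final merge is trivial: partitions are already in global key order.
    return [v for part in sorted_parts for v in part]

print(range_partition_sort([8, 2, 20, 5, 11, 3], [5, 11]))
# [2, 3, 5, 8, 11, 20]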

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor. Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes ri ⋈ si; any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
- e.g. non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
- e.g. say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join. Hash-join optimizations can be applied to the parallel case:
- e.g. the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and so avoid the cost of writing them and reading them back in.
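A compact sketch of the scheme on toy relations; for brevity, the local h2 repartitioning is folded into an in-memory hash table, as in a plain in-memory hash join:

def h1(key, n):                       # distribution hash function
    return hash(key) % n

def local_hash_join(r_part, s_part):
    build = {}                        # build phase on the smaller relation s
    for key, sval in s_part:
        build.setdefault(key, []).append(sval)
    return [(key, rval, sval)         # probe phase with the local r tuples
            for key, rval in r_part
            for sval in build.get(key, [])]

def partitioned_hash_join(r, s, n=3):
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:                       # distribute build tuples via h1
        s_parts[h1(t[0], n)].append(t)
    for t in r:                       # then redistribute the larger relation r
        r_parts[h1(t[0], n)].append(t)
    out = []
    for i in range(n):                # each iteration = one processor's local join
        out.extend(local_hash_join(r_parts[i], s_parts[i]))
    return out

r = [(1, "r1"), (2, "r2"), (2, "r2b")]
s = [(2, "s2"), (3, "s3")]
print(partitioned_hash_join(r, s))    # [(2, 'r2', 's2'), (2, 'r2b', 's2')]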

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2;
  - let P2 be assigned the computation of temp2 = temp1 ⋈ r3;
  - and let P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - let P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and let P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
  - Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
- e.g. allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering – building an expert system.
Knowledge Engineers – the people who build the system.
Knowledge Representation – the symbols used to represent the knowledge.
Factual Knowledge – knowledge of a particular task domain that is widely shared.


Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of rules – derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.
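As a concrete illustration, a tiny forward-chaining sketch; the rules and facts are made-up examples, not from any real expert system:

# Each rule is (set of premises, conclusion): fire any rule whose premises
# are all known facts, add its conclusion, and repeat until nothing new.
rules = [
    ({"fever", "rash"}, "measles-suspected"),
    ({"measles-suspected"}, "refer-to-doctor"),
]

def forward_chain(facts):
    facts = set(facts)
    changed = True
    while changed:                        # the match-act cycle
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)     # rule fires, working memory grows
                changed = True
    return facts

print(forward_chain({"fever", "rash"}))
# derives 'measles-suspected' and then 'refer-to-doctor' (set order may vary)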

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.
Knowledge representation formalisms & Inference:
- Logic – resolution principle
- Production rules – backward (top-down, goal-directed) or forward (bottom-up, data-driven)
- Semantic nets & frames – inheritance & advanced reasoning
- Case-based reasoning – similarity-based
KBS tools – Shells:
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- Scratch (building from scratch):
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).
Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g. DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem, e.g. XCON.
(5) Planning – both short term and long term, in areas like project management, product development or financial planning, e.g. HRM.
(6) Monitoring – to check performance and flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g. give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increased availability of expert knowledge: expertise that is otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

- Mimics organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
On the other hand, distributed databases are:
- more complex to implement;
- more costly to maintain;
- security and integrity control standards and experience are lacking;
- design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"

This indicates that the current set of this database (perhaps a key-value store) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions; it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Foo"   | "Bar"     |

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
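To make the versioning idea concrete, a minimal sketch of multiversion reads (an illustrative toy, not the mechanism of any particular engine): each write appends a version tagged with an increasing transaction ID, and a reader sees the latest version at or before its snapshot ID.

import itertools

class MVStore:
    def __init__(self):
        self.versions = {}             # key -> list of (txn_id, value or None)
        self.clock = itertools.count(1)

    def write(self, key, value):       # None marks a deletion
        txn = next(self.clock)
        self.versions.setdefault(key, []).append((txn, value))
        return txn

    def read(self, key, snapshot):     # read as of a snapshot transaction ID
        visible = [v for t, v in self.versions.get(key, []) if t <= snapshot]
        return visible[-1] if visible else None

db = MVStore()
db.write("Object2", "Bar")
db.write("Object1", "Foo")
snap = db.write("Object1", "Hello")    # snapshot taken as of this transaction
db.write("Object2", None)              # concurrent delete, after the snapshot
print(db.read("Object1", snap))        # 'Hello'
print(db.read("Object2", snap))        # 'Bar' -- the delete is invisible at snap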

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Figure 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
  - fixed hosts are general-purpose computers configured to manage mobile units;
  - base stations function as gateways to the fixed network for the mobile units.
- Wireless communications:
  - The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

- Client/Network Relationships:
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units may move unrestricted throughout the cells of a domain, while maintaining information access contiguity.

- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Figure 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.

- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.
The server may not be able to reach a client.


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.

One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:

• Servers must keep track of client locations in order to efficiently route messages to them.
• Client data should be stored in the network location that minimizes the traffic necessary to access it.
• The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance


• Mobile database design
• Location-based services
• Division of labor
• Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates.
• The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
• A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
• Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
• Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.

Two main issues:
• For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
• Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
• Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time (see the sketch below).
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods, for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
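As a small illustration (a hypothetical query over the Sailors/Reserves schema above, not part of the original notes), the outer query and the nested subquery below form two separate query blocks, each planned on its own:

    SELECT S.sname
    FROM Sailors S
    WHERE S.rating = (SELECT MAX(S2.rating)
                      FROM Sailors S2);

The optimizer treats the inner block as a subroutine; here it does not refer to the outer tuple, so it can be evaluated just once.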

8 (a) Discuss multimedia databases in detail. (8) (NOV DEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database,
• rather than storing them outside the database, in a file system.

• The database must handle large object representation.
• Similarity-based retrieval must be provided by special index structures.
• Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form.
• JPEG and GIF: the most widely used formats for image data.
• MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives for audio encoding:


• MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals.
• Must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals are used to view multimedia data.
• PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs when registering a new trademark).
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.
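As a sketch of such a recursive query (assuming the Assembly(Part, Subpart, Qty) relation used later in this answer; DB2 accepts the same shape without the RECURSIVE keyword):

    WITH RECURSIVE Comp (Part, Subpart) AS (
        SELECT A.Part, A.Subpart
        FROM Assembly A
      UNION ALL
        SELECT C.Part, A.Subpart
        FROM Comp C, Assembly A
        WHERE C.Subpart = A.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike';

This computes the transitive closure of the part hierarchy, which, as discussed below, no fixed number of joins in relational algebra or SQL-92 can express.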

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'If the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2 (Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2 (Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
• Seminaïve Fixpoint Evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 14: Database Technology

MODELING AND DESIGN
Basically, an OODBMS is an object database that provides DBMS capabilities to objects that have been created using an object-oriented programming language (OOPL). The basic principle is to add persistence to objects and to make objects persistent. Consequently, application programmers who use OODBMSs typically write programs in a native OOPL such as Java, C++ or Smalltalk, and the language has some kind of Persistent class, Database class, Database Interface, or Database API that provides DBMS functionality, effectively as an extension of the OOPL.

Object-oriented DBMSs, however, go much beyond simply adding persistence to any one object-oriented programming language. This is because, historically, many object-oriented DBMSs were built to serve the market for computer-aided design/computer-aided manufacturing (CAD/CAM) applications, in which features like fast navigational access, versions, and long transactions are extremely important.

Object-oriented DBMSs therefore support advanced object-oriented database applications with features like support for persistent objects from more than one programming language, distribution of data, advanced transaction models, versions, schema evolution, and dynamic generation of new types.

Object data modeling
An object consists of three parts: structure (attributes and relationships to other objects, like aggregation and association), behavior (a set of operations), and characteristic of types (generalization/serialization). An object is similar to an entity in the ER model; therefore, we begin with an example to demonstrate the structure and relationship.


Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually other tables linked by keys to the main table.

Relationships: publish and writtenBy are associations with 1:N and 1:1 relationships; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes through complex types and at the behavioral level. For example:

Generalization/Serialization is the "is-a" relationship, which is supported in OODB through class hierarchy. An ArtBook is a Book; therefore, the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, ...) is a request to execute the insert method on a Publisher object.

Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the value of selected attributes. The method that responds to the message example is the method insert defined in the Publisher class.

The main differences between relational database design and object-oriented database design include:

• Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages. (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages. (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect


future versions, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).

If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of the DB could be:

    Time   Object1    Object2
    t1     "Hello"    "Bar"
    t0     "Foo"      "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

    Time   Object1    Object2     Object3
    t2     "Hello"    (deleted)   "Foo-Bar"
    t1     "Hello"    "Bar"
    t0     "Foo"      "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated ACID reads without any locks.
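A small sketch of this behavior at the SQL level (a hypothetical kv(k, val) table; syntax as in PostgreSQL, whose REPEATABLE READ level reads from an MVCC snapshot fixed at the transaction's first read):

    -- Session A (long-running reader)
    BEGIN ISOLATION LEVEL REPEATABLE READ;
    SELECT val FROM kv WHERE k = 'Object1';   -- returns 'Hello'; snapshot fixed here

    -- Session B (concurrent writer) runs meanwhile, without blocking A:
    --   UPDATE kv SET val = 'Foo-Bar' WHERE k = 'Object1';
    --   COMMIT;                               -- installs a newer version

    -- Session A, later in the same transaction:
    SELECT val FROM kv WHERE k = 'Object1';   -- still 'Hello': the old version is read
    COMMIT;
    SELECT val FROM kv WHERE k = 'Object1';   -- a new snapshot now sees 'Foo-Bar'

Neither session waits for the other: the reader sees the old version while the writer installs a new one.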

Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining. (16) (NOV DEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts decision support query load away from transaction processing systems.

When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
• Usually OK to have slightly out-of-date data at the warehouse.
• Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
• Schema integration
Data cleansing
• E.g., correct mistakes in addresses (e.g., misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").
How to propagate updates
• The warehouse schema may be a (materialized) view of schemas from data sources.
• Efficient techniques for update of materialized views.
What data to summarize
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.
Typically, warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


Dimension values are usually encoded using small integers, and mapped to full values via dimension tables. The resultant schema is called a star schema (a sketch follows below).
More complicated schema structures:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.
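A minimal star-schema sketch in SQL (table and column names are illustrative assumptions, not from the original answer):

    -- Dimension tables: small integer keys mapped to full values
    CREATE TABLE item_dim  (item_id  INTEGER PRIMARY KEY, item_name VARCHAR(40), category VARCHAR(20));
    CREATE TABLE store_dim (store_id INTEGER PRIMARY KEY, city VARCHAR(30));
    CREATE TABLE date_dim  (date_id  INTEGER PRIMARY KEY, day DATE);

    -- Fact table: one row per sale; measures plus foreign keys to the dimensions
    CREATE TABLE sales_fact (
        item_id  INTEGER REFERENCES item_dim,
        store_id INTEGER REFERENCES store_dim,
        date_id  INTEGER REFERENCES date_dim,
        qty_sold INTEGER,
        price    DECIMAL(10,2)
    );

A typical decision-support query then aggregates measures over dimensions, e.g. SELECT category, SUM(qty_sold) FROM sales_fact NATURAL JOIN item_dim GROUP BY category.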

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required:
  - Pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.

Some examples of prediction mechanisms:
• Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.

• Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
• Associations: find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too. Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation: e.g., an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
• Clusters: e.g., typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
In general:
• Different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
• (More on choosing the partitioning attribute/condition shortly.)
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
The data at a node is not partitioned further if either:
• All (or most) of the items at the node belong to the same class, or
• All attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning data at the node.

Decision-Tree Construction Algorithm

    Procedure GrowTree(S)
        Partition(S)

    Procedure Partition(S)
        if (purity(S) > δp or |S| < δs) then return
        for each attribute A
            evaluate splits on attribute A
        Use best split found (across all attributes) to partition S into S1, S2, ..., Sr
        for i = 1, 2, ..., r
            Partition(Si)

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; won't cover here.
Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj)
• precomputation of p(cj)
• p(d) can be ignored, since it is the same for all classes.

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate

    p(d | cj) = p(d1 | cj) × p(d2 | cj) × ... × p(dn | cj)

Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
Regression
Regression deals with the prediction of a value, rather than a class.

• Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, ..., an such that

    Y = a0 + a1 X1 + a2 X2 + ... + an Xn

• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

Regression aims to find coefficients that give the best possible fit.
Association Rules
• Retail shops are often interested in associations between different items that people buy:
  - Someone who buys bread is quite likely also to buy milk.
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Associations information can be used in several ways: e.g., when a customer buys a particular book, an online shop may suggest associated books.
• Association rules:
    bread ⇒ milk
    DB-Concepts, OS-Concepts ⇒ Networks


Left-hand side: antecedent; right-hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
   - From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A).
   Confidence of rule = support(A) / support(A − {b}).
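As an illustrative sketch (assuming a hypothetical Purchases(tid, item) table with one row per item per transaction, not part of the original notes), support and confidence for the rule bread ⇒ milk can be computed directly in SQL:

    -- Support: fraction of all transactions containing both bread and milk
    SELECT COUNT(DISTINCT P1.tid) * 1.0 /
           (SELECT COUNT(DISTINCT tid) FROM Purchases) AS support
    FROM Purchases P1 JOIN Purchases P2 ON P1.tid = P2.tid
    WHERE P1.item = 'bread' AND P2.item = 'milk';

    -- Confidence: among transactions containing bread, the fraction also containing milk
    SELECT COUNT(DISTINCT P1.tid) * 1.0 /
           (SELECT COUNT(DISTINCT tid) FROM Purchases WHERE item = 'bread') AS confidence
    FROM Purchases P1 JOIN Purchases P2 ON P1.tid = P2.tid
    WHERE P1.item = 'bread' AND P2.item = 'milk';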

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
• Positive correlation: co-occurrence is higher than predicted.
• Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations:
• E.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns:
• E.g., deviation from a steady growth.
• E.g., sales of winter wear go down in summer. This is not surprising, being part of a known pattern; look for deviations from the value predicted using past patterns.
Clustering


• Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways: e.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  - Centroid: the point defined by taking the average of coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster.
• Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets: e.g., the BIRCH clustering algorithm (more shortly).
Hierarchical Clustering
• Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.

Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

In mobile computing the problems are more difficult, mainly because of:
• The limited and intermittent connectivity afforded by wireless communications
• The limited life of the power supply (battery)
• The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).

Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

Other characteristics that distinguish wireless connectivity options include interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets, as is done in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.

To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.

Mobile units can be unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.

Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.

MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.

Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.

Resource discovery and data routing by mobile units make computing in a MANET even more complicated.

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

The server may not be able to reach a client: a client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.

One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:

• Servers must keep track of client locations in order to efficiently route messages to them.
• Client data should be stored in the network location that minimizes the traffic necessary to access it.
• The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based services
• Division of labor
• Security

Application: Intermittently Synchronized Databases

Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates.
• The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
• A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
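The kind of query such a search page might issue is simple (a hypothetical inventory table; names are illustrative assumptions):

    SELECT item_name, unit_price, qty_on_hand
    FROM inventory
    WHERE item_name LIKE '%gasket%'
      AND qty_on_hand > 0
    ORDER BY unit_price;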

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain the E-R model with an example. (8) (NOV DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:

    {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute Inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
- condition-defined
- user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
- disjoint
- overlapping
• Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
- total
- partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
- Treat the relationship as an abstract entity.
- Allows relationships between relationships.
- Abstraction of a relationship into a new entity.
• Without introducing redundancy, an aggregation diagram represents that:
- A customer takes out a loan.
- An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization -- contributes to modularity in the design.
• The use of aggregation -- can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

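For example, the many-to-many relationship set borrower reduces to a table holding the primary keys of both participants (a sketch; it assumes customer and loan tables keyed on customer_id and loan_number):

    CREATE TABLE borrower (
        customer_id  VARCHAR(15) NOT NULL REFERENCES customer,
        loan_number  VARCHAR(15) NOT NULL REFERENCES loan,
        PRIMARY KEY (customer_id, loan_number)
    );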
E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the higher-level entity set (account), and form a table for each lower-level entity set, including the primary key of the higher-level entity set and the local attributes.
• Method 2: Form a table for each lower-level entity set, with all local and inherited attributes.

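A sketch of Method 1 for the account hierarchy (column names are assumed for illustration):

    CREATE TABLE account (
        account_number VARCHAR(15) PRIMARY KEY,
        balance        DECIMAL(12, 2)
    );

    CREATE TABLE savings_account (
        account_number VARCHAR(15) PRIMARY KEY REFERENCES account,
        interest_rate  DECIMAL(5, 2)
    );

    CREATE TABLE checking_account (
        account_number   VARCHAR(15) PRIMARY KEY REFERENCES account,
        overdraft_amount DECIMAL(12, 2)
    );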
(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

• Time is considered an ordered sequence of points in some granularity.
- Use the term chronon instead of point to describe the minimum granularity.

• A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time-series data.

Duration events
• Associated with a specific time period; the time period is represented by a start time and an end time.

Valid time
• The time period during which a fact is true in the real world.
Transaction time
• The time when the information was actually stored in the database.
Bitemporal database
• Databases dealing with both time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
• Add to every tuple:
- Valid start time
- Valid end time

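A minimal tuple-versioning sketch (table and column names are assumed, not from the notes): each row carries its valid-time interval, and an as-of query picks the version whose interval covers the requested time.

    CREATE TABLE emp_salary (
        emp_id       INTEGER,
        salary       DECIMAL(10, 2),
        valid_start  DATE NOT NULL,
        valid_end    DATE NOT NULL,  -- a far-future date marks the current version
        PRIMARY KEY (emp_id, valid_start)
    );

    -- What was employee 42's salary on 2009-06-15?
    SELECT salary
    FROM   emp_salary
    WHERE  emp_id = 42
      AND  DATE '2009-06-15' >= valid_start
      AND  DATE '2009-06-15' <  valid_end;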
Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial Join Queries
- Find all cities near a lake.
- Expensive: the join condition involves regions and proximity.

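A hedged sketch of the range query above, using a plain bounding-box predicate on point data rather than a vendor spatial extension (table and column names are assumed):

    -- Cities stored as point data; x and y are coordinates in miles.
    SELECT c.name
    FROM   cities c, cities m
    WHERE  m.name = 'Madison'
      AND  c.x BETWEEN m.x - 50 AND m.x + 50
      AND  c.y BETWEEN m.y - 50 AND m.y + 50;
    -- This tests the query's bounding box; a true 50-mile circle would add a
    -- distance predicate, and a spatial index (e.g., an R-tree) would avoid
    -- scanning every city.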
Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.
Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K.

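The multidimensional range query above, written against a hypothetical employees(age, sal) table; with only a composite <age, sal> B+ tree, the index narrows the scan by the leading attribute age but must still filter on sal -- the linearization problem just described:

    SELECT *
    FROM   employees
    WHERE  age > 50 AND age < 55
      AND  sal > 80000 AND sal < 90000;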
Drawbacks
An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dim box, ptr to child node>
- The box covers all boxes in the child node (in fact, in the subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
- Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints

- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in the insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms, which can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
- Parallel machines are becoming quite common and affordable:
  - Prices of microprocessors, memory, and disks have dropped sharply.
- Databases are growing increasingly large:
  - Large volumes of transaction data are collected and stored for later analysis.
  - Multimedia objects like images are increasingly stored in databases.
- Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data;
  - processing time-consuming decision-support queries;
  - providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
  - Data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra):
  - This makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
- Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
- Horizontal partitioning -- tuples of a relation are divided among many disks, such that each tuple resides on one disk.
- Partitioning techniques (number of disks = n):
  Round-robin:
  - Send the i-th tuple inserted in the relation to disk i mod n.
  Hash partitioning:
  - Choose one or more attributes as the partitioning attributes.


  - Choose a hash function h with range 0 ... n-1.
  - Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
  Range partitioning:
  - Choose an attribute as the partitioning attribute.
  - A partitioning vector [v0, v1, ..., vn-2] is chosen.
  - Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
  - E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
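As a concrete sketch, PostgreSQL-style declarative partitioning expresses these schemes directly (the syntax shown is PostgreSQL's; mapping partitions to physical disks would be handled separately, e.g. via tablespaces):

    -- Range partitioning on attribute a with partitioning vector [5, 11].
    CREATE TABLE r (a INT, b TEXT) PARTITION BY RANGE (a);
    CREATE TABLE r_disk0 PARTITION OF r FOR VALUES FROM (MINVALUE) TO (5);
    CREATE TABLE r_disk1 PARTITION OF r FOR VALUES FROM (5) TO (11);
    CREATE TABLE r_disk2 PARTITION OF r FOR VALUES FROM (11) TO (MAXVALUE);

    -- Hash partitioning of another relation across 3 partitions.
    CREATE TABLE s (a INT, b TEXT) PARTITION BY HASH (a);
    CREATE TABLE s_p0 PARTITION OF s FOR VALUES WITH (MODULUS 3, REMAINDER 0);
    CREATE TABLE s_p1 PARTITION OF s FOR VALUES WITH (MODULUS 3, REMAINDER 1);
    CREATE TABLE s_p2 PARTITION OF s FOR VALUES WITH (MODULUS 3, REMAINDER 2);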

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively -- point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range -- range queries, e.g., 10 <= r.A < 25.

Round-robin
- Advantages:
  - Best suited for a sequential scan of the entire relation on each query.
  - All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
- Range queries are difficult to process:
  - No clustering -- tuples are scattered across all disks.

Hash partitioning
- Good for sequential access:
  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted -- an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
- The distribution of tuples to disks may be skewed -- that is, some disks have many tuples, while others may have fewer tuples.
- Types of skew:
  - Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
  - Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained -- reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
- Example of a cache coherency protocol for shared-disk systems:
  - Before reading/writing to a page, the page must be locked in shared/exclusive mode.
  - On locking a page, the page must be read from disk.
  - Before unlocking a page, the page must be written to disk if it was modified.
- More complex protocols with fewer disk reads/writes exist.
- Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor; requests to fetch the page, or write it to disk, are sent to the home processor.

Intraquery Parallelism
- Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
- Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism -- parallelize the execution of each individual operation in the query.
  - Interoperation parallelism -- execute the different operations in a query expression in parallel.
- The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
- The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions:
  - E.g., non-equijoin conditions, such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case -- asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

  - E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s, and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:
  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them out and reading them back in.

Parallel Nested-Loop Join
- Assume that:
  - relation s is much smaller than relation r, and that r is stored by partitioning;
  - there is an index on a join attribute of relation r at each of the partitions of relation r.
- Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
  - At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - and P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results -- provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - and P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
  - P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem:
  - E.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.
Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering -- building an expert system.
- Knowledge Engineers -- the people who build the system.
- Knowledge Representation -- the symbols used to represent the knowledge.
- Factual Knowledge -- knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge -- more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of Rules -- derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer -- decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter -- interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining -- starts from a set of conditions and moves towards some conclusion.
- Backward-chaining -- starts with a list of goals, and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic.
- Heuristics in search vs. in KBS; general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR / Inference:
- Logic / Resolution principle.
- Production rules / backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven).
- Semantic nets & Frames / Inheritance & advanced reasoning.
- Case-based Reasoning / Similarity-based.

KBS tools -- Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch (building from scratch):
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs
- DENDRAL (chemical)
- MYCIN (medicine)
- XCON/R1 (computer)

Typical tasks of KBS
(1) Diagnosis -- to identify a problem, given a set of symptoms or malfunctions. E.g., diagnose reasons for engine failure.
(2) Interpretation -- to provide an understanding of a situation from available information. E.g., DENDRAL.
(3) Prediction -- to predict a future state from a set of data or observations. E.g., Drilling Advisor, PLANT.
(4) Design -- to develop configurations that satisfy the constraints of a design problem. E.g., XCON.
(5) Planning -- both short-term & long-term, in areas like project management, product development, or financial planning. E.g., HRM.
(6) Monitoring -- to check performance & flag exceptions. E.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control -- to collect and evaluate evidence, and form opinions on that evidence. E.g., control a patient's treatment.
(8) Instruction -- to train students and correct their performance. E.g., give medical students experience diagnosing illness.
(9) Debugging -- to identify and prescribe remedies for malfunctions. E.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages

- Increase availability of expert knowledge:
  - expertise not otherwise accessible;
  - training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimic organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages compared to conventional (centralized) databases:
- More complex to implement.
- More costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk -- when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps, or increasing transaction IDs, to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object, by maintaining several versions of the object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time  Object1  Object2    Object3


t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is thus able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
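The snapshot behaviour can be observed in any MVCC-based SQL system; the interleaving below is a hedged sketch in PostgreSQL-style SQL (the table name t is assumed), with the concurrent session shown as comments:

    -- Session A
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT val FROM t WHERE id = 1;   -- reads "Bar"

    -- Session B (concurrently):
    --   BEGIN;
    --   UPDATE t SET val = 'Foo-Bar' WHERE id = 1;  -- creates a new version
    --   COMMIT;                                     -- B's version is now latest

    -- Session A, continued: still sees its snapshot, not B's committed write
    SELECT val FROM t WHERE id = 1;   -- still reads "Bar", without blocking
    COMMIT;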

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems -- which may involve data management, transaction management, and database recovery -- have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications --
  - The wireless medium has bandwidth significantly lower than that of a wired network.
  - The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

- Client/Network Relationships --
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.

- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
- The server may not be able to reach a client:


  - A client may be unreachable because it is dozing -- in an energy-conserving state in which many subsystems are shut down -- or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem.
  - Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible.
  - A server can simply broadcast data periodically.
  - Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client.
  - The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication.
     - A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components.
     - Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance


  - Mobile database design
  - Location-based service
  - Division of labor
  - Security

Application: Intermittently Synchronized Databases
- Whenever clients connect -- through a process known in industry as synchronization of a client with a server -- they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast -- one-on-one communication between the server and the client -- or multicast -- one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server, and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  1. Consider a set of alternative plans.
     - Must prune the search space; typically, only left-deep plans are considered.
  2. Estimate the cost of each plan that is considered.
     - Must estimate the size of the result, and the cost, for each plan node.
     - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


  - Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered?
    - An algorithm to search the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans.
- We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

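For instance, the following query over this schema contains two query blocks -- the outer block and the nested subquery -- each optimized separately (an illustrative example, not from the original notes):

    SELECT S.sname
    FROM   Sailors S
    WHERE  S.rating = (SELECT MAX(S2.rating)
                       FROM   Sailors S2);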
8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression:
    - MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
    - MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained:
    - e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals:
  - Must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals -- used to view multimedia data:
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - E.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command, or identify a data item, by speaking.
  - E.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15. Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries?
  - Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor:
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

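A sketch of the all-components query in SQL:1999 recursive notation, over the Assembly(Part, Subpart, Qty) relation used in this chapter (exact syntax varies slightly between systems):

    WITH RECURSIVE Comp (Part, Subpart) AS (
        SELECT A.Part, A.Subpart
        FROM   Assembly A
      UNION
        SELECT C.Part, A.Subpart
        FROM   Comp C, Assembly A
        WHERE  C.Subpart = A.Part
    )
    SELECT * FROM Comp WHERE Part = 'trike';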
15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
    ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
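A quick sketch of double+ on finite sets (invented here for illustration) makes the fixpoint condition f(v) = v concrete:

def double_plus(s):
    # double+ returns the doubles of the members, together with the set itself
    return {2 * x for x in s} | s

print(double_plus({1, 2, 5}))              # {1, 2, 4, 5, 10}: not a fixpoint
print(double_plus(set()) == set())         # True: the empty set is a fixpoint
print(double_plus({0}) == {0})             # True: {0} is a (small) fixpoint too

Infinite fixpoints, such as the set of all even integers, cannot be checked this way, of course; the point is only what f(v) = v means.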

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

1531 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
1532 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


• Stratum 0: all database tables.
• Stratum I: tables defined in terms of tables in Stratum I and lower strata.
• If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
1533 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < … > is the usual restriction used to deal with this problem, similar to negation.
154 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
1541 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations; for example:
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
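A compact sketch of seminaive evaluation (Python, invented data; not part of the original answer): each round joins Assembly only with the delta from the previous round, so no inference is repeated:

# Seminaive evaluation of Comp using a delta table.
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2),
            ("wheel", "tire", 1), ("tire", "rim", 1)}

comp = {(p, s) for (p, s, q) in assembly}    # base facts
delta = set(comp)                            # tuples new in the last round
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
    new = {(p, s2) for (p, p2, q) in assembly
                   for (p2b, s2) in delta if p2 == p2b}
    delta = new - comp                       # keep only genuinely new tuples
    comp |= delta
print(sorted(comp))   # ("trike", "rim") is derived exactly once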

1542 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



Attributes are like the fields in a relational model. However, in the Book example we have, for the attributes publishedBy and writtenBy, complex types Publisher and Author, which are also objects. Attributes with complex objects in an RDBMS are usually represented as other tables, linked by keys to the main table. The relationships publish and writtenBy are associations with 1:N and 1:1 cardinalities; composed_of is an aggregation (a Book is composed of chapters). The 1:N relationship is usually realized as attributes, through complex types, and at the behavioral level.

Generalization/Specialization is the "is-a" relationship, which is supported in an OODB through the class hierarchy. An ArtBook is a Book; therefore the ArtBook class is a subclass of the Book class. A subclass inherits all the attributes and methods of its superclass.

Message: the means by which objects communicate; it is a request from one object to another to execute one of its methods. For example, Publisher_object.insert("Rose", 123, …), i.e., a request to execute the insert method on a Publisher object.
Method: defines the behavior of an object. Methods can be used to change state by modifying attribute values, or to query the values of selected attributes. The method that responds to the message in the example is the method insert, defined in the Publisher class.
The main differences between relational database design and object-oriented database design include:

• Many-to-many relationships must be removed before entities can be translated into relations; many-to-many relationships can be implemented directly in an object-oriented database.


• Operations are not represented in the relational data model; operations are one of the main components in an object-oriented database.
• In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or

(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)
Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      -
t0    "Foo"    "Bar"      -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
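A toy sketch of the timestamp rules above (Python, invented names; a real system is far more involved): each object keeps its versions with write timestamps, plus a read timestamp, and a read picks the newest version preceding the reader's timestamp:

# Minimal timestamp-ordered MVCC sketch.
class VersionedObject:
    def __init__(self, value, ts):
        self.versions = [(ts, value)]   # (write timestamp, value), ascending
        self.rts = ts                   # largest timestamp that has read us

    def read(self, ts):
        # Return the most recent version preceding (or at) the reader's ts.
        for wts, value in reversed(self.versions):
            if wts <= ts:
                self.rts = max(self.rts, ts)
                return value
        raise KeyError("no version visible at ts %s" % ts)

    def write(self, value, ts):
        # TS(Ti) < RTS(P): a later reader already saw an older version, so abort.
        if ts < self.rts:
            raise RuntimeError("abort and restart transaction")
        self.versions.append((ts, value))   # create a new version

obj = VersionedObject("Foo", ts=0)
obj.write("Hello", ts=1)
print(obj.read(ts=1))    # "Hello": reader at t1
obj.write("Hello2", ts=2)
print(obj.read(ts=1))    # still "Hello": the t1 snapshot is preserved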

Recovery


3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or

(a) Explain the features of data warehousing and data mining (16)(NOVDEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
• Greatly simplifies querying; permits study of historical trends.
• Shifts the decision support query load away from transaction processing systems.

When and how to gather data


• Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
• Destination-driven architecture: the warehouse periodically requests new information from data sources.
• Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive. It is usually OK to have slightly out-of-date data at the warehouse; data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use
• Schema integration
Data cleansing
• E.g., correct mistakes in addresses (misspellings, zip code errors).
• Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").
How to propagate updates
• The warehouse schema may be a (materialized) view of schemas from data sources.
• Efficient techniques exist for update of materialized views.
What data to summarize
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.

• Typically, warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables. The resultant schema is called a star schema.
• More complicated schema structures:
  Snowflake schema: multiple levels of dimension tables.
  Constellation: multiple fact tables.

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:


Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations: find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too. Other similar applications: camera accessories, clothes, etc. Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
Clusters: e.g., typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes, e.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:
∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good
• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general, different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
• (More on choosing the partitioning attribute/condition shortly.)
• The algorithm is greedy: the choice is made once, and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either all (or most) of the items at the node belong to the same class, or all attributes have been considered and no further partitioning is possible. Such a node is a leaf node.
• Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r: Partition(Si)

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj)
• precomputation of p(cj)
• p(d) can be ignored, since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) × p(d2 | cj) × … × p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
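A small sketch of a naive Bayesian classifier over categorical attributes (Python, invented training data), estimating each p(di | cj) by counting, as described above:

from collections import Counter, defaultdict

# training set: (degree, income band) -> credit class
train = [(("masters", "high"), "excellent"),
         (("masters", "high"), "excellent"),
         (("bachelors", "medium"), "good"),
         (("bachelors", "low"), "bad")]

prior = Counter(c for _, c in train)          # counts for p(cj)
hist = defaultdict(Counter)                   # per-class, per-attribute histograms
for attrs, c in train:
    for i, a in enumerate(attrs):
        hist[(c, i)][a] += 1

def classify(attrs):
    def score(c):
        s = prior[c] / len(train)             # p(cj)
        for i, a in enumerate(attrs):         # product of the p(di | cj)
            s *= hist[(c, i)][a] / prior[c]
        return s
    return max(prior, key=score)              # p(d) ignored: same for all classes

print(classify(("masters", "high")))          # "excellent"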

Regression
Regression deals with the prediction of a value, rather than a class.

• Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, …, an such that Y = a0 + a1 X1 + a2 X2 + … + an Xn.
• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
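For one variable, the best-fit coefficients can be computed with the usual least-squares formulas; a sketch with made-up points:

# Simple least-squares fit of Y = a0 + a1*X (one variable, invented data).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# a1 = cov(X, Y) / var(X); a0 makes the line pass through the means
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print("Y = %.2f + %.2f X" % (a0, a1))   # roughly Y = 0.15 + 1.94 X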

Association Rules
Retail shops are often interested in associations between different items that people buy:

• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Associations information can be used in several ways; e.g., when a customer buys a particular book, an online shop may suggest associated books.

Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


Left hand side: antecedent; right hand side: consequent.
An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − b ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − b).
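A sketch of the naive algorithm for the two-item case (Python, invented transactions), computing support and confidence by counting:

from itertools import combinations

# Each transaction is the set of items in one sale.
transactions = [{"bread", "milk"}, {"bread", "milk", "butter"},
                {"bread", "jam"}, {"milk", "jam"}, {"bread", "milk", "jam"}]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n   # fraction containing it

items = set().union(*transactions)
for a, b in combinations(sorted(items), 2):
    s = support({a, b})
    if s >= 0.4:                                  # minimum-support threshold
        conf = s / support({a})                   # confidence of a => b
        print("%s => %s  support %.2f, confidence %.2f" % (a, b, s, conf))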

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
• Positive correlation: co-occurrence is higher than predicted.
• Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer (not surprising, part of a known pattern). Look for deviations from the value predicted using past patterns.


Clustering
Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster. Clustering can be formalized using distance metrics in several ways:
• E.g., group points into k sets (for a given k), such that the average distance of points from the centroid of their assigned group is minimized. (Centroid: the point defined by taking the average of the coordinates in each dimension.)
• Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm (more shortly).
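The centroid objective above is what the classical k-means algorithm heuristically minimizes; a sketch (Python, invented points; k-means itself is not named in the original answer):

import random

def kmeans(points, k, rounds=10):
    # Pick k initial centroids, then alternate assignment and re-centering.
    centroids = random.sample(points, k)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                            (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        centroids = [(sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
                     if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, 2))   # roughly (1.33, 1.33) and (8.33, 8.33), in some order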

Hierarchical Clustering
Example: biological classification; other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.

• In mobile computing, the problems are more difficult, mainly because of:
  the limited and intermittent connectivity afforded by wireless communications;
  the limited life of the power supply (battery);
  the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.


• It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down), or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  Servers must keep track of client locations in order to efficiently route messages to them.
  Client data should be stored in the network location that minimizes the traffic necessary to access it.
  The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues, as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


• Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected; it can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server, and in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources; for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set. Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities. Example: Hayes (customer entity) is related to A-102 (account entity) via the depositor relationship set.

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one


  – Many to many
• We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.
One-To-Many and Many-To-One Relationship
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.
Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  – y is a dominant entity (in the example below, loan)
  – x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number: discriminator of the payment entity set.
• Primary key for payment: (loan-number, payment-number).

Specialization
• A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets, that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined
  – user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint
  – overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – total
  – partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – treat the relationship as an abstract entity;
  – allows relationships between relationships;
  – abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – a customer takes out a loan;
  – an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or an entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set; e.g., the depositor relationship set yields a table with the customer and account primary keys plus access-date.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal and Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
  • Use the term chronon, instead of point, to describe the minimum granularity.
• A calendar organizes time into different time units for convenience.
  • Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period; a time period is represented by a start time and an end time.
Transaction time
• The time when the information from a certain transaction becomes valid.
Bitemporal database
• A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
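A small sketch of tuple versioning (Python, invented schema): each row carries a valid-time interval, and a query can ask for the value that held at a given time:

# Tuple versioning: one row per version, with [valid_start, valid_end) times.
# None as valid_end marks the currently valid row.
emp_salary = [
    ("Smith", 30000, "2008-01-01", "2009-06-01"),
    ("Smith", 35000, "2009-06-01", None),
]

def salary_as_of(name, date):
    for n, sal, start, end in emp_salary:
        if n == name and start <= date and (end is None or date < end):
            return sal
    return None   # no version was valid at that time

print(salary_as_of("Smith", "2008-12-31"))   # 30000
print(salary_as_of("Smith", "2010-01-01"))   # 35000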

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data, such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• "Find all cities within 50 miles of Madison."
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• "Find the 10 cities nearest to Madison."
• Results must be ordered by proximity.
Spatial Join Queries
• "Find all cities near a lake."
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects, such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc., stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles of the conference venue.
• Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.
Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  • This is Alternative (2), with the key value being a box.
  • The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>
  • The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  • Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
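A recursive sketch of this overlap search (Python; the node layout is invented for illustration):

# R-tree overlap search over boxes given as (xlo, ylo, xhi, yhi).
def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, out):
    if node["leaf"]:
        for box, rid in node["entries"]:     # step 2: leaf entries are <box, rid>
            if overlaps(box, q):
                out.append(rid)
    else:
        for box, child in node["entries"]:   # step 1: descend overlapping subtrees
            if overlaps(box, q):
                search(child, q, out)
    return out

leaf = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((5, 5, 7, 7), "r2")]}
root = {"leaf": False, "entries": [((0, 0, 7, 7), leaf)]}
print(search(root, (1, 1, 3, 3), []))        # ['r1']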

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  • This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  • The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
  • Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  • Adjust the entry for L in its parent so that the box now covers (only) L1.
  • Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  • Remove some (say, 30% of the) entries and reinsert them into the tree.
  • This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  • Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  • Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  • B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  • GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions.
  • E.g., feature vectors in multimedia applications.
  • Very sparse.
• Nearest-neighbor queries are common.
  • The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
  • In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for:
  • storing large volumes of data;
  • processing time-consuming decision-support queries;
  • providing high throughput for transaction processing.
Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
• Choose one or more attributes as the partitioning attributes.
• Choose a hash function h with range 0 … n - 1.
• Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning:
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, …, vn-2] is chosen.
• Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
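A side-by-side sketch of the three placement functions (Python, invented tuples):

# Assigning tuples to n disks by round-robin, hash, and range partitioning.
n = 3
partition_vector = [5, 11]          # for range partitioning: [v0, v1]

def round_robin(i, n):              # i = insertion order of the tuple
    return i % n

def hash_part(key, n):
    return hash(key) % n            # any hash function with range 0 .. n-1

def range_part(v, vec):
    for i, cut in enumerate(vec):   # first cut point exceeding v decides the disk
        if v < cut:
            return i
    return len(vec)                 # v >= last cut point: the last disk

for v in (2, 8, 20):
    print(v, "-> disk", range_part(v, partition_vector))   # disks 0, 1, 2, as above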

Best suited for sequential scan of entire relation on each query All disks have almost an equal number of tuples retrieval work is thus well balanced

between disksRange queries are difficult to process

No clustering -- tuples are scattered across all disksHash partitioningGood for sequential access

Assuming hash function is good and partitioning attributes form a key tuples will be equally distributed between disks

Retrieval work is then well balanced between disksGood for point queries on partitioning attribute

Can lookup single disk leaving others available for answering other queries Index on partitioning attribute can be local to disk making lookup and update more

efficientNo clustering so difficult to answer range queries

Range partitioningProvides data clustering by partitioning attribute valueGood for sequential accessGood for point queries on partitioning attribute only one disk needs to be accessedFor range queries on partitioning attribute one to a few disks may need to be accessed

minus Remaining disks are available for other queriesminus Good if result tuples are from one to a few blocks

45

minus If many blocks are to be fetched they are still fetched from one to a few disks and potential parallelism in disk access is wasted Example of execution skew
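To make the three schemes concrete, here is a minimal sketch of round-robin, hash, and range partitioning as functions that map a tuple to a disk number. It reuses the example vector [5, 11] from above; the function names themselves are illustrative, not from the original answer.

from bisect import bisect_right

def round_robin_disk(i, n):
    # the i-th tuple inserted in the relation goes to disk i mod n
    return i % n

def hash_disk(value, n):
    # hash function h with range 0..n-1, applied to the
    # partitioning-attribute value of the tuple
    return hash(value) % n

def range_disk(value, vector):
    # partitioning vector [v0, ..., v(n-2)]: v < v0 -> disk 0,
    # vi <= v < vi+1 -> disk i+1, v >= v(n-2) -> disk n-1
    return bisect_right(vector, value)

vector = [5, 11]
print(range_disk(2, vector))    # 0
print(range_disk(8, vector))    # 1
print(range_disk(20, vector))   # 2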

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
Large relations are preferably partitioned across all the available disks.
If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples while others have few.
Types of skew:
Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation), as sketched in code below:
Sort the relation on the partitioning attribute.
Construct the partition vector by scanning the relation in sorted order, as follows:
After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
n denotes the number of partitions to be constructed.
Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique based on histograms is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
Assume a uniform distribution within each range of the histogram.
The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
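A minimal sketch of the sorted-scan technique above, assuming the relation is small enough to sort in memory; the function and variable names are illustrative only.

def balanced_partition_vector(values, n):
    # values: partitioning-attribute values of all tuples
    # n: number of partitions to construct
    values = sorted(values)        # sort on the partitioning attribute
    step = len(values) // n        # 1/n-th of the relation
    # after every 1/n-th of the relation, record the next value
    return [values[i * step] for i in range(1, n)]

# e.g. 12 key values split into 3 partitions -> 2 cut points
print(balanced_partition_vector([3, 8, 1, 15, 7, 22, 5, 9, 18, 2, 30, 11], 3))
# [7, 15]: each of the three ranges then holds 4 tuples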


Interquery Parallelism
Queries/transactions execute in parallel with one another.
This increases transaction throughput; it is used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
It is more complicated to implement on shared-disk or shared-nothing architectures:
Locking and logging must be coordinated by passing messages between processors.
Data in a local buffer may have been updated at another processor.
Cache coherency has to be maintained — reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
Before reading/writing a page, the page must be locked in shared/exclusive mode.
On locking a page, the page must be read from disk.
Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation parallelism – parallelize the execution of each individual operation in the query.
Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n – 1, to do the sorting.
Create a range-partition vector with m entries on the sorting attributes.
Redistribute the relation using range partitioning:
All tuples that lie in the i-th range are sent to processor Pi.
Pi stores the tuples it receives temporarily on disk Di.
This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1.
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows (a sketch of range-partitioning sort appears after this list):
The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
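A compact sketch of the range-partitioning sort described above, simulating the processors with plain lists; the two-entry partition vector mirrors the earlier example, and the names are illustrative assumptions.

from bisect import bisect_right

def range_partitioning_sort(tuples, vector):
    # redistribute: tuples in the i-th range go to "processor" Pi
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        partitions[bisect_right(vector, t)].append(t)
    # each Pi sorts its partition locally (in parallel in a real system)
    for p in partitions:
        p.sort()
    # the final merge is trivial: concatenate in partition order
    return [t for p in partitions for t in p]

print(range_partitioning_sort([9, 3, 14, 1, 20, 7, 11], [5, 11]))
# [1, 3, 7, 9, 11, 14, 20]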

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
Either range partitioning or hash partitioning can be used.
r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
One of the relations, say r, is partitioned; any partitioning technique can be used.
The other relation, s, is replicated across all the processors.
Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join (a code sketch follows this description):
Assume s is smaller than r, and therefore s is chosen as the build relation.
A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
Once the tuples of s have been distributed, the larger relation r is redistributed across the processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2.
Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case:
e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and so avoid the cost of writing them out and reading them back in.
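A minimal sketch of the partitioned parallel hash-join above, with the processor network simulated by lists; h1 and h2 follow the text, while the tuple shape and helper names are illustrative assumptions.

def parallel_hash_join(r, s, n):
    # r, s: lists of (join_key, payload) tuples; n: number of processors
    h1 = lambda k: hash(k) % n           # distributes tuples to processors
    h2 = lambda k: hash((k, 17)) % 8     # local partitioning function (illustrative)

    # redistribute the build relation s and the probe relation r using h1
    s_at = [[] for _ in range(n)]
    r_at = [[] for _ in range(n)]
    for t in s:
        s_at[h1(t[0])].append(t)
    for t in r:
        r_at[h1(t[0])].append(t)

    result = []
    for i in range(n):                   # each Pi works on its local si, ri
        # build phase: hash table on si, bucketed locally by h2
        buckets = [dict() for _ in range(8)]
        for k, v in s_at[i]:
            buckets[h2(k)].setdefault(k, []).append(v)
        # probe phase: each ri tuple looks up matching si tuples
        for k, v in r_at[i]:
            for sv in buckets[h2(k)].get(k, []):
                result.append((k, v, sv))
    return result

r = [(1, 'r1'), (2, 'r2'), (3, 'r3')]
s = [(2, 's2'), (3, 's3'), (4, 's4')]
print(parallel_hash_join(r, s, 2))       # keys 2 and 3 match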

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and with the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi.
At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:
Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
P2 the computation of temp2 = temp1 ⋈ r3,
and P3 the computation of temp2 ⋈ r4.
Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

Consider again a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4.
Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
P2 the computation of temp2 = r3 ⋈ r4,
and P3 the computation of temp1 ⋈ temp2.
P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.
Rule-based systems are:
– knowledge-based systems;
– part of the Artificial Intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.
Knowledge Engineering – building an expert system.
Knowledge Engineers – the people who build the system.
Knowledge Representation – the symbols used to represent the knowledge.
Factual Knowledge – knowledge of a particular task domain that is widely shared.


Heuristic Knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:
- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System
- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data.
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
Forward-chaining – starts from a set of conditions and moves towards some conclusion.
Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
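As a rough illustration of forward chaining and the match-resolve-act cycle described above — a sketch only, in which the rule format and the conflict-resolution policy (first match wins) are simplifying assumptions:

def forward_chain(facts, rules):
    # facts: set of known assertions; rules: list of (premises, conclusion)
    facts = set(facts)
    while True:
        # Match: rules whose premises all hold and whose conclusion
        # is not yet in working memory form the conflict set
        conflict_set = [(ps, c) for ps, c in rules
                        if set(ps) <= facts and c not in facts]
        if not conflict_set:
            return facts              # no production satisfied: halt
        # Conflict-resolution: pick the first instantiation
        _, conclusion = conflict_set[0]
        # Act: fire the rule, changing working memory
        facts.add(conclusion)

rules = [(['has_fever', 'has_rash'], 'suspect_measles'),
         (['suspect_measles'], 'refer_to_doctor')]
print(forward_chain(['has_fever', 'has_rash'], rules))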

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine the intelligence of multiple human experts.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems Definition:
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                        Inference
Logic                     Resolution principle
Production rules          backward (top-down, goal-directed),
                          forward (bottom-up, data-driven)
Semantic nets & Frames    Inheritance & advanced reasoning
Case-based Reasoning      Similarity-based

KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs
DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).

Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages
- Increased availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Can deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages:
- Mimics the organisational structure with its data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
- More complex to implement.
- More costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk — when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti may read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
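A minimal sketch of the timestamp rules above (version lists plus a read timestamp); the class and method names are illustrative assumptions, and a real system adds garbage collection and transaction restart around this core.

class MVCCObject:
    def __init__(self, value):
        # versions: list of (write_ts, value), oldest first
        self.versions = [(0, value)]
        self.rts = 0                 # largest timestamp that has read the object

    def read(self, ts):
        # read the most recent version whose write timestamp precedes TS(Ti)
        _, value = max((v for v in self.versions if v[0] <= ts),
                       key=lambda v: v[0])
        self.rts = max(self.rts, ts)
        return value

    def write(self, ts, value):
        # TS(Ti) < RTS(P): a later transaction already read the object,
        # so Ti must abort and restart
        if ts < self.rts:
            raise RuntimeError("abort and restart transaction")
        self.versions.append((ts, value))   # create a new version at TS(Ti)

obj = MVCCObject("Foo")
obj.write(1, "Hello")      # t1 creates a new version
print(obj.read(1))         # a reader at t1 sees "Hello"
obj.write(2, "Hello2")     # t2 writes; the reader at t1 still sees "Hello"
print(obj.read(1))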

Recovery

(b) Discuss the client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
In mobile computing the problems are more difficult, mainly because of:
- the limited and intermittent connectivity afforded by wireless communications;
- the limited life of the power supply (battery);


- the changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general-purpose computers configured to manage mobile units.
Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network.
The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


and seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage the mobility, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units may move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.

The server may not be able to reach a client


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
The latency involved in wireless communication makes scalability a problem.
Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
Servers must keep track of client locations in order to efficiently route messages to them.
Client data should be stored in the network location that minimizes the traffic necessary to access it.
The act of moving between cells must be transparent to the client.
The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
The characteristics of ISDBs (cont'd):
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS.
We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
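A toy sketch of this pull (iterator) interface, using two illustrative operators — a table scan and a selection; real systems add open/close calls, but the shape of the interaction is the same.

class Scan:
    # leaf operator: pulls tuples straight from a stored relation
    def __init__(self, rows):
        self.it = iter(rows)
    def next(self):
        return next(self.it, None)        # None signals end of stream

class Select:
    # when pulled, this operator pulls on its input until the
    # predicate is satisfied, then returns that tuple
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def next(self):
        while (t := self.child.next()) is not None:
            if self.pred(t):
                return t
        return None

plan = Select(Scan([(22, 'Dustin'), (58, 'Rusty')]), lambda t: t[0] > 30)
while (t := plan.next()) is not None:
    print(t)                              # (58, 'Rusty')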

Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans.
We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database,
- rather than storing them outside the database, in a file system.
The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
MPEG-1: quality comparable to VHS video tape.
- Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
- Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives for audio encoding:


- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data.
Characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained:
  a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals,
- and must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals – used to view multimedia data:
- PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user;
- e.g., identify similar designs when registering a new trademark.
Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking;
- e.g., test user input against stored commands.
Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
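A tiny sketch of double+ and of computing a least fixpoint by iterating from a starting set. Since the least fixpoint containing a seed would be infinite, the doubled values are capped at a bound — an illustrative assumption added purely so the iteration terminates.

def double_plus(s, bound=100):
    # double+ : S -> {2x | x in S} U S, with doubled values capped
    # at `bound` so the iteration below terminates (illustrative)
    return {2 * x for x in s if 2 * x < bound} | s

def least_fixpoint(f, start):
    # iterate f from `start` until the set stops growing: f(v) = v
    current = set(start)
    while True:
        nxt = f(current)
        if nxt == current:
            return current
        current = nxt

print(sorted(least_fixpoint(double_plus, {1, 2, 5})))
# [1, 2, 4, 5, 8, 10, 16, 20, 32, 40, 64, 80]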

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < … > is the usual restriction used to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations (a code sketch appears at the end of this section).


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
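Returning to the seminaive evaluation of 15.4.1, here is a minimal sketch of it for the Comp program over an Assembly instance; the sample part hierarchy is illustrative, and quantities are omitted for brevity.

def seminaive_components(assembly):
    # assembly: set of (part, subpart) pairs
    comp = set(assembly)          # base case: direct subparts
    delta = set(assembly)         # tuples generated in the last iteration
    while delta:
        # fire the recursive rule only with at least one delta fact:
        # Comp(Part, Subpt) :- Assembly(Part, Part2, _), delta_Comp(Part2, Subpt)
        new = {(p, s) for (p, p2) in assembly
               for (p2d, s) in delta if p2 == p2d} - comp
        comp |= new
        delta = new               # the next round uses only the new tuples
    return comp

assembly = {('trike', 'wheel'), ('trike', 'frame'),
            ('wheel', 'spoke'), ('wheel', 'tire'), ('tire', 'rim')}
print(sorted(seminaive_components(assembly)))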


Page 16: Database Technology

Operations are not represented in the relational data model, while operations are one of the main components in an object-oriented database. In the relational data model, relationships are implemented by primary and foreign keys; in the object model, objects communicate through their interfaces. The interface describes the data (attributes) and operations (methods) that are visible to other objects.

(b) Explain the Multi-Version Locks and Recovery in Query Languages (16) (JUNE 2010)
Or
(a) Explain the Multi-Version Locks and Recovery in Query Languages (DECEMBER 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk — when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.


In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti may read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery

3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining (16) (NOVDEC 2010)

Data Warehouse
Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction-processing) systems, under different schemas.
Data sources often store only current data, not historical data.
Corporate decision making requires a unified view of all organizational data, including historical data.
A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
It greatly simplifies querying and permits the study of historical trends.
It shifts the decision-support query load away from transaction-processing systems.

When and how to gather data


Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
Destination-driven architecture: the warehouse periodically requests new information from data sources.
Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
It is usually acceptable to have slightly out-of-date data at the warehouse.
Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use:
Schema integration.

Data cleansing:
E.g., correct mistakes in addresses (misspellings, zip code errors).
Merge address lists from different sources and purge duplicates.
Keep only one address record per household ("householding").

How to propagate updates:
The warehouse schema may be a (materialized) view of the schema from the data sources.
Efficient techniques exist for updating materialized views.

What data to summarize:
Raw data may be too large to store on-line.
Aggregate values (totals/subtotals) often suffice.
Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Typically, warehouse data is multidimensional, with very large fact tables.
Examples of dimensions: item-id, date/time of sale, store where the sale was made, customer identifier.
Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema.
• More complicated schema structures:
  – Snowflake schema: multiple levels of dimension tables.
  – Constellation: multiple fact tables.

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required:
  – Pre-processing of data, choice of which type of pattern to find, post-processing to find novel patterns.

Applications of Data Mining
• Prediction based on past history:
  – Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
  – Predict if a customer is likely to switch brand loyalty.
  – Predict if a customer is likely to respond to "junk mail".
  – Predict if a pattern of phone calling card usage is likely to be fraudulent.
• Some examples of prediction mechanisms:

Classification


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.

Regression formulae
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.

Descriptive Patterns

Associations
• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation:
  – E.g., association between exposure to chemical X and cancer, or a new medicine and cardiac problems.

Clusters
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, salary of the applicant, age of the applicant, etc.:

  ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  – Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  – Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
• In general:
  – Different branches of the tree could grow to different levels.


  – Different nodes at the same level may use different partitioning attributes.
• Greedy top-down generation of decision trees:
  – Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  – (More on choosing the partitioning attribute/condition shortly.)
  – The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either
  – all (or most) of the items at the node belong to the same class, or
  – all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
• Otherwise the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r
        Partition(Si)
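The greedy procedure can be rendered in a few lines of Python. In this sketch, purity is measured as the fraction of the majority class; the thresholds DP and DS (standing in for dp and ds), the split-scoring rule, and the sample data are all illustrative assumptions:

from collections import Counter

DP, DS = 0.9, 4   # purity and size thresholds (illustrative values for dp, ds)

def purity(rows):
    # Fraction of rows belonging to the majority class.
    counts = Counter(r["class"] for r in rows)
    return max(counts.values()) / len(rows)

def partition(rows, attrs, depth=0):
    indent = "  " * depth
    if purity(rows) > DP or len(rows) < DS:
        print(indent + "leaf:", dict(Counter(r["class"] for r in rows)))
        return
    # Greedily pick the attribute whose split gives the purest children
    # (weighted average purity); the choice is made once and never revisited.
    def score(a):
        groups = {}
        for r in rows:
            groups.setdefault(r[a], []).append(r)
        return sum(len(g) * purity(g) for g in groups.values()) / len(rows)
    best = max(attrs, key=score)
    groups = {}
    for r in rows:
        groups.setdefault(r[best], []).append(r)
    for value, subset in groups.items():
        print(indent + "split %s = %s" % (best, value))
        partition(subset, attrs, depth + 1)

rows = [
    {"degree": "masters", "income": "high", "class": "excellent"},
    {"degree": "masters", "income": "high", "class": "excellent"},
    {"degree": "bachelors", "income": "mid", "class": "good"},
    {"degree": "bachelors", "income": "low", "class": "average"},
    {"degree": "bachelors", "income": "low", "class": "average"},
]
partition(rows, ["degree", "income"])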

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers

Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
• Bayesian classifiers use Bayes' theorem, which says

  p(cj | d) = p(d | cj) p(cj) / p(d)

  where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
• Bayesian classifiers require
  – computation of p(d | cj),
  – precomputation of p(cj);
  – p(d) can be ignored, since it is the same for all classes.
• To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
• Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
• Histograms on multiple attributes are more expensive to compute and store.
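The estimation above translates directly into code. Here is a minimal naïve Bayesian classifier over categorical attributes (no smoothing; the attribute names and training data are invented for illustration):

from collections import Counter, defaultdict

def train(rows, attrs):
    # p(cj) counts, plus per-attribute histograms for p(di | cj).
    class_counts = Counter(r["class"] for r in rows)
    hist = defaultdict(Counter)   # (class, attribute) -> Counter of values
    for r in rows:
        for a in attrs:
            hist[(r["class"], a)][r[a]] += 1
    return class_counts, hist

def classify(instance, class_counts, hist, attrs):
    n = sum(class_counts.values())
    best, best_p = None, -1.0
    for c, cnt in class_counts.items():
        p = cnt / n   # p(cj); p(d) is ignored since it is the same for all classes
        for a in attrs:
            p *= hist[(c, a)][instance[a]] / cnt   # p(di | cj), independence assumed
        if p > best_p:
            best, best_p = c, p
    return best

rows = [
    {"degree": "masters", "income": "high", "class": "excellent"},
    {"degree": "masters", "income": "mid", "class": "good"},
    {"degree": "bachelors", "income": "mid", "class": "good"},
    {"degree": "bachelors", "income": "low", "class": "average"},
]
attrs = ["degree", "income"]
model = train(rows, attrs)
print(classify({"degree": "masters", "income": "high"}, *model, attrs))  # excellent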

Regression
• Regression deals with the prediction of a value, rather than a class.
• Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
• One way is to infer coefficients a0, a1, …, an such that
  Y = a0 + a1 X1 + a2 X2 + … + an Xn
• Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
• The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
• Regression aims to find coefficients that give the best possible fit.
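For the single-variable case (n = 1), the best-fit coefficients have a closed least-squares form. A small sketch with made-up data:

def linear_fit(xs, ys):
    # Least-squares fit of Y = a0 + a1*X (the n = 1 case of the formula above).
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
         / sum((x - mean_x) ** 2 for x in xs)
    a0 = mean_y - a1 * mean_x
    return a0, a1

xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]          # noisy data, roughly y = 2x
a0, a1 = linear_fit(xs, ys)
print(a0, a1)                      # close to 0 and 2
print(a0 + a1 * 5)                 # predicted Y for a new parameter value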

Association Rules
• Retail shops are often interested in associations between different items that people buy.
  – Someone who buys bread is quite likely also to buy milk.
  – A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
• Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
• Association rules:
  bread ⇒ milk
  DB-Concepts, OS-Concepts ⇒ Networks
• Left hand side: antecedent; right hand side: consequent.
• An association rule must have an associated population; the population consists of a set of instances.
  – E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule.
  – E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers. The support for the rule milk ⇒ screwdrivers is low.
  – We usually want rules with a reasonably high support; rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true.
  – E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk.
  – We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater). Naïve algorithm (a sketch in code follows the steps below):

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   – Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules:
   – From itemset A, generate the rule A − b ⇒ b for each b ∈ A.
4. Support of rule = support(A).
   Confidence of rule = support(A) / support(A − b).
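The naïve algorithm can be written down almost verbatim; the transactions and the support threshold below are invented for illustration:

from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "milk", "screwdriver"},
    {"bread", "butter"}, {"milk", "cereal"}, {"bread", "milk", "butter"},
]
MIN_SUPPORT = 0.4   # illustrative threshold

def support(itemset):
    # Fraction of transactions that contain every item in the set.
    return sum(itemset <= t for t in transactions) / len(transactions)

# Step 1-2: consider all possible sets and keep the large itemsets.
items = set().union(*transactions)
large = [frozenset(c) for size in range(1, len(items) + 1)
         for c in combinations(items, size)
         if support(frozenset(c)) >= MIN_SUPPORT]

# Steps 3-4: generate rules A - {b} => b with their support and confidence.
for A in large:
    for b in A:
        lhs = A - {b}
        if lhs:
            conf = support(A) / support(lhs)
            print(set(lhs), "=>", b,
                  "support %.2f confidence %.2f" % (support(A), conf))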

Other Types of Associations
• Basic association rules have several limitations.
• Deviations from the expected probability are more interesting.
  – E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
• We are interested in positive as well as negative correlations between sets of items:
  – Positive correlation: co-occurrence is higher than predicted.
  – Negative correlation: co-occurrence is lower than predicted.
• Sequence associations/correlations:
  – E.g., whenever bonds go up, stock prices go down in 2 days.
• Deviations from temporal patterns:
  – E.g., deviation from a steady growth.
  – E.g., sales of winter wear go down in summer: not surprising, part of a known pattern; look for deviation from the value predicted using past patterns.

Clustering


• Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways:
  – E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
    Centroid: the point defined by taking the average of the coordinates in each dimension.
  – Another metric: minimize the average distance between every pair of points in a cluster.
• Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets.
  – E.g., the Birch clustering algorithm.

Hierarchical Clustering
• Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, and repeatedly refine (break) clusters into smaller ones.
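The centroid-based formalization above is what the classic k-means procedure implements (this is a bare-bones in-memory sketch, not the Birch algorithm mentioned above, which is designed for very large disk-resident data; the 1-D points and iteration count are illustrative):

def kmeans(points, k, rounds=10):
    # Start with the first k points as centroids (a naive seeding choice).
    centroids = points[:k]
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to the nearest centroid.
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        # Recompute each centroid as the average of its assigned points.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

centroids, clusters = kmeans([1.0, 1.2, 0.8, 9.0, 9.5, 10.1], k=2)
print(centroids)   # roughly [1.0, 9.5]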

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)

Mobile Databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems ndash which may involve data management transaction management and database recovery ndash have their origins in distributed database systems

• In mobile computing, the problems are more difficult, mainly:
  – The limited and intermittent connectivity afforded by wireless communications.
  – The limited life of the power supply (battery).
  – The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  – Fixed hosts are general purpose computers configured to manage mobile units.
  – Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network.
  – The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  – Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  – To manage the mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  – Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
• The characteristics of mobile computing include:
  – Communication latency
  – Intermittent connectivity
  – Limited battery life
  – Changing client location
• The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
• In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture.
  – For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients.
  – The latency involved in wireless communication makes scalability a problem.
  – Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible.
  – A server can simply broadcast data periodically.
  – Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges.
  – Servers must keep track of client locations in order to efficiently route messages to them.
  – Client data should be stored in the network location that minimizes the traffic necessary to access it.
  – The act of moving between cells must be transparent to the client.
  – The server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
• From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication.
     – A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components.
     – Data management responsibility is shared among base stations or fixed hosts and mobile units.
• Data management issues as applied to mobile databases:
  – Data distribution and replication
  – Transaction models
  – Query processing
  – Recovery and fault tolerance
  – Mobile database design
  – Location-based service
  – Division of labor
  – Security

Application: Intermittently Synchronized Databases


• Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  – A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  – A server cannot connect to a client at will.
  – Issues of wireless versus wired client connections and power conservation are generally immaterial.
  – A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  – A client has multiple ways of connecting to a server, and in the case of many servers may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
• A database can be modeled as:
  – a collection of entities,
  – relationships among entities.
• An entity is an object that exists and is distinguishable from other objects.
  – Example: specific person, company, event, plant.
• An entity set is a set of entities of the same type that share the same properties.
  – Example: set of all persons, companies, trees, holidays.

Attributes
• An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
  Examples:
    customer = (customer-name, social-security, customer-street, customer-city)
    account = (account-number, balance)
• Domain: the set of permitted values for each attribute.
• Attribute types:
  – Simple and composite attributes
  – Single-valued and multi-valued attributes
  – Null attributes
  – Derived attributes

Relationship Sets
• A relationship is an association among several entities.
  – Example: Hayes (customer entity) is related via depositor (relationship set) to A-102 (account entity).
• A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
  where (e1, e2, …, en) is a relationship.
  – Example: (Hayes, A-102) ∈ depositor
• An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
• Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one


  – Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  – y is a dominant entity (in the example below, loan).
  – x is a subordinate entity (in the example below, payment).
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
  – payment-number: discriminator of the payment entity set.
  – Primary key for payment: (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from the other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  – condition-defined
  – user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  – disjoint
  – overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  – total
  – partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  – Treat the relationship as an abstract entity.
  – Allows relationships between relationships.
  – Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  – A customer takes out a loan.
  – An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal and Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
  – The term chronon is used instead of "point" to describe the minimum granularity.
• A calendar organizes time into different time units for convenience.
  – Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events


• A single time point event.
  – E.g., bank deposit.
• A series of point events can form time series data.

Duration events
• Associated with a specific time period.
• A time period is represented by a start time and an end time.

Transaction time
• The time when the information from a certain transaction becomes valid.

Bitemporal database
• A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
• Add to every tuple:
  – Valid start time
  – Valid end time
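Tuple versioning can be pictured as follows: each update closes the current version's valid end time and opens a new version. This is a minimal sketch; the sentinel date and field layout are assumptions for illustration:

from datetime import date

MAX_DATE = date(9999, 12, 31)   # sentinel meaning "valid until further notice"

history = []   # each versioned tuple: (value, valid_start, valid_end)

def update(value, on):
    # Close the currently valid version, if any, and open a new one.
    if history and history[-1][2] == MAX_DATE:
        v, start, _ = history.pop()
        history.append((v, start, on))
    history.append((value, on, MAX_DATE))

def value_at(when):
    # Temporal point query: which version was valid on the given date?
    for v, start, end in history:
        if start <= when < end:
            return v

update("assistant", date(2005, 6, 1))
update("manager", date(2009, 3, 15))
print(value_at(date(2007, 1, 1)))   # assistant
print(value_at(date(2010, 1, 1)))   # manager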

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute:
  – An attribute that changes over time. E.g., age.
• Non-time-varying attribute:
  – An attribute that does not change over time. E.g., date of birth.

Spatial Database
Types of Spatial Data

• Point data:
  – Points in a multidimensional space.
  – E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  – E.g., feature vectors extracted from text.
• Region data:
  – Objects have spatial extent, with location and boundary.
  – The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
• Spatial range queries:
  – Find all cities within 50 miles of Madison.
  – The query has an associated region (location, boundary).
  – The answer includes overlapping or contained data regions.
• Nearest-neighbor queries:
  – Find the 10 cities nearest to Madison.
  – Results must be ordered by proximity.
• Spatial join queries:
  – Find all cities near a lake.
  – Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
• Geographic Information Systems (GIS):
  – E.g., ESRI's ArcInfo; OpenGIS Consortium.
  – Geospatial information.
  – All classes of spatial queries and data are common.
• Computer-Aided Design/Manufacturing:
  – Store spatial objects such as the surface of an airplane fuselage.
  – Range queries and spatial join queries are common.
• Multimedia Databases:
  – Images, video, text, etc. stored and retrieved by content.
  – First converted to feature vector form; high dimensionality.
  – Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal.
  Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge.
  Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>

Motivation for Multidimensional Indexes
• Spatial queries (GIS, CAD):
  – Find all hotels within a radius of 5 miles from the conference venue.
  – Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  – Find all cities that lie on the Nile in Egypt.
  – Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval):
  – Given a face, find the five most similar faces.
• Multidimensional range queries:
  – 50 < age < 55 AND 80K < sal < 90K

Drawbacks
• An index based on spatial location is needed.
  – One-dimensional indexes don't support multidimensional searching efficiently.
  – Hash indexes only support point queries; we want to support range queries as well.
  – Must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes).
• The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  – This is Alternative (2), with the key value being a box.
  – The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>
  – The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  – Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
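The two steps translate directly into a short recursive routine. In this sketch the box representation ((xlo, ylo), (xhi, yhi)) and the dictionary node layout are assumptions made for illustration:

def overlaps(a, b):
    # Two boxes overlap unless they are separated along some axis.
    alo, ahi = a
    blo, bhi = b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(2))

def search(node, q, results):
    if node["leaf"]:
        # Step 2: each overlapping leaf entry names an object that might overlap Q.
        for box, rid in node["entries"]:
            if overlaps(box, q):
                results.append(rid)
    else:
        # Step 1: descend into every child whose bounding box overlaps Q.
        for box, child in node["entries"]:
            if overlaps(box, q):
                search(child, q, results)

leaf1 = {"leaf": True, "entries": [(((0, 0), (2, 2)), "r1"), (((3, 3), (4, 4)), "r2")]}
leaf2 = {"leaf": True, "entries": [(((8, 8), (9, 9)), "r3")]}
root = {"leaf": False, "entries": [(((0, 0), (4, 4)), leaf1), (((8, 8), (9, 9)), leaf2)]}

out = []
search(root, ((1, 1), (3, 3)), out)
print(out)   # ['r1', 'r2'] -- the subtree for leaf2 is never visited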

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  – This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  – The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
  – Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  – Adjust the entry for L in its parent so that the box now covers (only) L1.
  – Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  – Remove some (say, 30% of the) entries and reinsert them into the tree.
  – This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  – Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  – Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  – B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  – GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions.
  – E.g., feature vectors in multimedia applications.
  – Very sparse.
• Nearest-neighbor queries are common.
  – The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
  – In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)

Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable:
  – Prices of microprocessors, memory and disks have dropped sharply.
• Databases are growing increasingly large:
  – Large volumes of transaction data are collected and stored for later analysis.
  – Multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for:
  – storing large volumes of data,
  – processing time-consuming decision-support queries,
  – providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
  – Data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra):
  – Makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):
  – Round-robin: send the ith tuple inserted in the relation to disk i mod n.
  – Hash partitioning: choose one or more attributes as the partitioning attributes;


    choose a hash function h with range 0…n − 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
  – Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn−2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
    E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
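The three techniques can be written down in a few lines each. The disk numbering and the example vector follow the text; the use of Python's built-in hash is an assumption (any hash function with range 0…n − 1 would do):

N = 3                      # number of disks
VECTOR = [5, 11]           # range-partitioning vector from the example above

def round_robin(i):
    # The ith tuple inserted goes to disk i mod n.
    return i % N

def hash_partition(key):
    # Apply the hash function to the partitioning attribute value.
    return hash(key) % N

def range_partition(v):
    # v < 5 -> disk 0; 5 <= v < 11 -> disk 1; v >= 11 -> disk 2.
    for i, boundary in enumerate(VECTOR):
        if v < boundary:
            return i
    return len(VECTOR)

print(range_partition(2), range_partition(8), range_partition(20))   # 0 1 2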

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries). E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries). E.g., 10 ≤ r.A < 25.

Round-robin
• Advantages:
  – Best suited for sequential scan of the entire relation on each query.
  – All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
• Range queries are difficult to process:
  – No clustering; tuples are scattered across all disks.

Hash partitioning
• Good for sequential access:
  – Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  – Retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  – Can look up a single disk, leaving the others available for answering other queries.
  – An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  – The remaining disks are available for other queries.
  – Good if result tuples are from one to a few blocks.
  – If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
• The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
• Types of skew:
  – Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
  – Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
• To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
  – Sort the relation on the partitioning attribute.
  – Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
  – n denotes the number of partitions to be constructed.
  – Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
  – Assume uniform distribution within each range of the histogram.
• A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
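The sort-based construction above reads off a boundary value after every 1/nth of the relation; a compact sketch (sorting an in-memory sample stands in for scanning the sorted relation):

def balanced_vector(values, n):
    # Sort on the partitioning attribute, then take every (len/n)-th value
    # as a boundary; this yields n - 1 boundaries for n partitions.
    ordered = sorted(values)
    step = len(ordered) // n
    return [ordered[i * step] for i in range(1, n)]

values = [3, 18, 7, 42, 5, 23, 11, 9, 31, 2, 14, 27]
print(balanced_vector(values, 3))   # [9, 23]: splits the data into 3 even groups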


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  – Locking and logging must be coordinated by passing messages between processors.
  – Data in a local buffer may have been updated at another processor.
  – Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
• Example of a cache coherency protocol for shared-disk systems:
  – Before reading/writing a page, the page must be locked in shared/exclusive mode.
  – On locking a page, the page must be read from disk.
  – Before unlocking a page, the page must be written to disk if it was modified.
• More complex protocols with fewer disk reads/writes exist.
• Cache coherency protocols for shared-nothing systems are similar. Each database page is assigned a home processor. Requests to fetch the page, or write it to disk, are sent to the home processor.

Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
  – Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  – Interoperation parallelism: execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
• Choose processors P0, …, Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries, on the sorting attributes.
• Redistribute the relation using range partitioning:

  – All tuples that lie in the ith range are sent to processor Pi.
  – Pi stores the tuples it received temporarily on disk Di.
  – This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.
• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
• The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, …, Dn−1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
  – The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
  – Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  – The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute the join r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1.
• Either range partitioning or hash partitioning can be used.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions:
  – E.g., non-equijoin conditions, such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case: asymmetric fragment-and-replicate.
  – One of the relations, say r, is partitioned; any partitioning technique can be used.
  – The other relation, s, is replicated across all the processors.
  – Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used.
  – E.g., say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Eg say s is small and r is large and already partitioned It may becheaper to replicate s across all processors rather than repartition r and s on the join attributesPartitioned Parallel Hash-JoinParallelizing partitioned hash joinAssume s is smaller than r and therefore s is chosen as the build relationA hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processorsEach processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1 Let si denote the tuples of relation s that aresent to processor PiAs tuples of relation s are received at the destination processors they are partitioned further using another hash function h2 which is used to compute the hash-join locallyOnce the tuples of s have been distributed the larger relation r is redistributed across the m processors using the hash function h1

Let ri denote the tuples of relation r that are sent to processor Pi

49

As the r tuples are received at the destination processors they are repartitioned using the function h2Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and s of r and s to produce a partition of the final result of the hash-joinHash-join optimizations can be applied to the parallel case

eg the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in
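A toy simulation of the two-level hashing described above, with processors modeled as list indices (the relations, hash functions, and processor count are invented for illustration):

from collections import defaultdict

M = 3                                   # number of processors
h1 = lambda key: key % M                # routes a tuple to a processor
h2 = lambda key: key                    # key for the local in-memory hash table

def partitioned_hash_join(r, s):
    # Redistribute both relations on the join attribute with h1.
    r_parts = [[] for _ in range(M)]
    s_parts = [[] for _ in range(M)]
    for key, payload in s: s_parts[h1(key)].append((key, payload))
    for key, payload in r: r_parts[h1(key)].append((key, payload))
    result = []
    for i in range(M):                  # each iteration = one processor's local join
        build = defaultdict(list)       # build phase on the smaller relation s
        for key, payload in s_parts[i]:
            build[h2(key)].append(payload)
        for key, payload in r_parts[i]: # probe phase with the local r partition
            for s_payload in build[h2(key)]:
                result.append((key, payload, s_payload))
    return result

r = [(1, "r-a"), (2, "r-b"), (4, "r-c"), (7, "r-d")]
s = [(1, "s-x"), (4, "s-y"), (5, "s-z")]
print(partitioned_hash_join(r, s))     # tuples with keys 1 and 4 join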

Parallel Nested-Loop Join
• Assume that
  – relation s is much smaller than relation r, and that r is stored by partitioning;
  – there is an index on a join attribute of relation r at each of the partitions of relation r.
• Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
  – At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
• Set up a pipeline that computes the three joins in parallel:
  – Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  – P2 the computation of temp2 = temp1 ⋈ r3,
  – and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results.
  – Provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  – Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  – P2 the computation of temp2 = r3 ⋈ r4,
  – and P3 the computation of temp1 ⋈ temp2.
  – P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
    Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Does not provide a high degree of parallelism:
  – useful with a lower degree of parallelism,
  – less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.
• When scheduling an execution tree in a parallel system, we must decide:
  – How to parallelize each operation, and how many processors to use for it.
  – What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem.
  – E.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.

Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.

Rule-based systems are:
• Knowledge-based systems.
• Part of the Artificial Intelligence field.
• Computer programs that contain some subject-specific knowledge of one or more human experts.
• Made up of a set of rules that analyze user-supplied information about a specific class of problems.
• Systems that utilize reasoning capabilities and draw conclusions.

• Knowledge Engineering: building an expert system.
• Knowledge Engineers: the people who build the system.
• Knowledge Representation: the symbols used to represent the knowledge.
• Factual Knowledge: knowledge of a particular task domain that is widely shared.


Heuristic Knowledge ndash more judgmental knowledge of performance in a task domain Uses of Rule based Systems

Very useful to companies with a high-level of experience and expertise that cannot easily be transferred to other members

- Solves problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.

Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase. (A small sketch of this cycle follows the component list below.)

- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
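As a concrete illustration, here is a minimal Python sketch of the match-resolve-act cycle described above. The facts and rules are invented for illustration, and the conflict-resolution strategy (simply firing the first satisfied instantiation) is deliberately the crudest possible one.

rules = [
    # (premises, conclusion) pairs; hypothetical rules for illustration
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "refer_to_doctor"),
]

def run(working_memory, rules):
    while True:
        # Match: collect instantiations whose premises hold in working
        # memory and whose conclusion has not yet been asserted.
        conflict_set = [r for r in rules
                        if r[0] <= working_memory and r[1] not in working_memory]
        if not conflict_set:              # no satisfied productions: halt
            return working_memory
        # Conflict-resolution: here, simply pick the first instantiation.
        premises, conclusion = conflict_set[0]
        working_memory.add(conclusion)    # Act: fire the rule, then re-match

print(run({"has_fever", "has_rash"}, rules))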

Components of a Rule Based System
Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.

Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.

Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
Forward-chaining – starts from a set of conditions and moves towards some conclusion.
Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
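A backward-chaining counterpart can be sketched just as briefly: to prove a goal, find a rule that concludes it and recursively prove that rule's premises. The rules and facts below are assumed toy data, and cycles in the rule set are not handled.

rules = [({"has_fever", "has_rash"}, "suspect_measles"),
         ({"suspect_measles"}, "refer_to_doctor")]
facts = {"has_fever", "has_rash"}

def prove(goal):
    if goal in facts:
        return True
    # Goal-driven: try every rule that concludes the goal.
    return any(all(prove(p) for p in premises)
               for premises, conclusion in rules if conclusion == goal)

print(prove("refer_to_doctor"))   # True: every premise is ultimately a fact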

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

- Heuristic rather than algorithmic
- Heuristics in search vs in KBS: general vs domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms and Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal directed); forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells
  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated and powerful
  - support multiple KR paradigms and reasoning schemes
  - generic tool applicable to a wide range
- Special purpose shells
  - specifically designed for particular types of problems
  - restricted to specialised problems
- Scratch
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration)

Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions. e.g. diagnose reasons for engine failure
(2) Interpretation – To provide an understanding of a situation from available information. e.g. DENDRAL
(3) Prediction – To predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT
(4) Design – To develop configurations that satisfy constraints of a design problem. e.g. XCON
(5) Planning – Both short term and long term, in areas like project management, product development or financial planning. e.g. HRM
(6) Monitoring – To check performance and flag exceptions. e.g. a KBS monitors radar data and estimates the position of the space shuttle
(7) Control – To collect and evaluate evidence, and form opinions on that evidence. e.g. control a patient's treatment
(8) Instruction – To train students and correct their performance. e.g. give medical students experience diagnosing illness
(9) Debugging – To identify and prescribe remedies for malfunctions. e.g. identify errors in an automated teller machine network, and ways to correct the errors

Advantages

- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimics organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, less expensive, and performs better
- Improved processing power: many machines handling the database, rather than a single server

Disadvantages of distributed databases:
- More complex to implement
- More costly to maintain
- Security and integrity control is harder
- Standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read those versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps, or increasing transaction IDs, to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction (Ti) may read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
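The read-timestamp rule above can be sketched in a few lines of Python. This is a simplified illustration (single-attribute objects, integer timestamps, and only the RTS abort check), not a real DBMS implementation:

class Aborted(Exception):
    pass

class Store:
    def __init__(self):
        self.versions = {}   # object -> list of (write_ts, value), ascending
        self.rts = {}        # object -> largest timestamp that has read it

    def read(self, obj, ts):
        # Read the most recent version preceding the transaction timestamp.
        older = [(w, v) for w, v in self.versions.get(obj, []) if w <= ts]
        self.rts[obj] = max(self.rts.get(obj, 0), ts)
        return older[-1][1] if older else None

    def write(self, obj, value, ts):
        # TS(Ti) < RTS(P): a later transaction already read P, so abort.
        if self.rts.get(obj, 0) > ts:
            raise Aborted("restart transaction %d" % ts)
        self.versions.setdefault(obj, []).append((ts, value))
        self.versions[obj].sort()

s = Store()
s.write("Object1", "Foo", 0)      # version at t0
s.write("Object1", "Hello", 1)    # version at t1
print(s.read("Object1", 1))       # a reader at t1 sees "Hello"
s.write("Object1", "World", 2)    # a t2 write does not disturb it
print(s.read("Object1", 1))       # still "Hello"; s.write(..., 0) would abort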

At t1 the state of the database could be:

Time  Object1   Object2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1   Object2    Object3


t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

- The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans
  - Must prune search space; typically left-deep plans only
- Must estimate cost of each plan that is considered
  - Must estimate size of result and cost for each plan node
  - Key issues: statistics, indexes, operator implementations
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods for each relation in the FROM clause
- All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods)
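The following toy Python sketch enumerates join orders over assumed relation sizes and costs each one with a deliberately crude nested-loops model. Real System R-style optimizers use dynamic programming over left-deep plans with much richer statistics; this only makes the search-plus-cost-estimation idea concrete:

from itertools import permutations

pages = {"Reserves": 1000, "Sailors": 500, "Boats": 50}   # assumed sizes

def cost(order):
    # Crude model: cost of each join = pages(outer) * pages(inner);
    # the running result then shrinks by an assumed 1% join selectivity.
    total, outer = 0, pages[order[0]]
    for r in order[1:]:
        total += outer * pages[r]
        outer = max(1, outer * pages[r] // 100)
    return total

best = min(permutations(pages), key=cost)   # each permutation = a left-deep order
print(best, cost(best))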

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)

Multimedia databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- Most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. e.g. identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. e.g. test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the From clause, i.e. be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g. rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs

Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e. D is the set of all sets of integers).
  - E.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "If the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables.
- Stratum i: tables defined in terms of tables in Stratum i and lower strata.
- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.
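A minimal Python sketch of seminaive evaluation for the Comp program; the Assembly instance is assumed, and the Qty attribute is dropped for brevity:

assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

comp = set(assembly)      # base case: direct subparts
delta = set(assembly)     # delta_Comp: tuples generated last iteration
while delta:
    # Recursive rule, with the Comp body fact required to be new:
    # Comp(p, s) :- Assembly(p, p2, _), delta_Comp(p2, s)
    derived = {(p, s) for (p, p2) in assembly
                      for (q, s) in delta if p2 == q}
    delta = derived - comp          # keep only genuinely new inferences
    comp |= delta
print(sorted(comp))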


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body, that acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining (16) (NOV/DEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying; permits study of historical trends.
  - Shifts decision support query load away from transaction processing systems.

When and how to gather data


- Source driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night).
- Destination driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive.
  - Usually OK to have slightly out-of-date data at the warehouse.
  - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
- Schema integration

Data cleansing
- E.g. correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates. Keep only one address record per household ("householding").

How to propagate updates
- The warehouse schema may be a (materialized) view of the schema from data sources.
- Efficient techniques exist for update of materialized views.

What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.

Typically warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.


Dimension values are usually encoded using small integers, and mapped to full values via dimension tables. The resultant schema is called a star schema.
More complicated schema structures:
- Snowflake schema: multiple levels of dimension tables
- Constellation: multiple fact tables

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g. "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, postprocessing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
- Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
- Predict if a customer is likely to switch brand loyalty.
- Predict if a customer is likely to respond to "junk mail".
- Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification


- Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
- Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.

Descriptive Patterns
Associations:

- Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
- Other similar applications: camera accessories, clothes, etc.
- Associations may also be used as a first step in detecting causation. E.g. association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
- E.g. typhoid cases were clustered in an area surrounding a contaminated well.
- Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
  - For all persons P: P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
  - For all persons P: P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.


  - Different nodes at the same level may use different partitioning attributes.
- Greedy top-down generation of decision trees:
  - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node. (More on choosing the partitioning attribute/condition shortly.)
  - The algorithm is greedy: the choice is made once, and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - All (or most) of the items at the node belong to the same class, or
  - All attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > dp or |S| < ds) then return
    for each attribute A
        evaluate splits on attribute A
    Use best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
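A hedged, runnable Python rendering of the Partition procedure, using Gini impurity as the purity measure; the thresholds and the toy training set are assumptions, and a real implementation would also handle continuous attributes by choosing intervals:

from collections import Counter

def gini(rows):
    counts = Counter(label for _, label in rows)
    return 1 - sum((n / len(rows)) ** 2 for n in counts.values())

def partition(rows, attrs, max_impurity=0.1, min_size=2):
    # Make a leaf when the node is pure enough, too small, or unsplittable.
    if len(rows) < min_size or not attrs or gini(rows) <= max_impurity:
        return Counter(l for _, l in rows).most_common(1)[0][0]
    def split_score(a):
        groups = {}
        for row in rows:                 # evaluate the split on attribute a
            groups.setdefault(row[0][a], []).append(row)
        return sum(len(g) * gini(g) for g in groups.values()) / len(rows), groups
    best = min(attrs, key=lambda a: split_score(a)[0])   # greedy choice
    return {best: {v: partition(g, attrs - {best})
                   for v, g in split_score(best)[1].items()}}

rows = [({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "low"}, "good"),
        ({"degree": "masters", "income": "high"}, "excellent"),
        ({"degree": "bachelors", "income": "high"}, "good")]
print(partition(rows, {"degree", "income"}))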

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; we won't cover them here.
- Bayesian classifiers use Bayes theorem, which says
      p(cj | d) = p(d | cj) p(cj) / p(d)
  where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj)
- precomputation of p(cj)
- p(d) can be ignored, since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
      p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
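A small naive Bayesian classifier along the lines just described, with histogram (counting) estimates; the toy training data are assumed, and a real system would smooth the counts to avoid zero probabilities:

from collections import Counter, defaultdict

train = [({"degree": "masters", "income": "high"}, "excellent"),
         ({"degree": "bachelors", "income": "medium"}, "good"),
         ({"degree": "masters", "income": "high"}, "excellent")]

prior = Counter(c for _, c in train)        # counts for p(cj)
cond = defaultdict(Counter)                 # (class, attribute) -> value counts
for d, c in train:
    for a, v in d.items():
        cond[(c, a)][v] += 1

def classify(d):
    def score(c):                           # p(cj) * product of p(di | cj)
        p = prior[c] / len(train)
        for a, v in d.items():
            p *= cond[(c, a)][v] / prior[c]
        return p
    return max(prior, key=score)            # p(d) is the same for all classes

print(classify({"degree": "masters", "income": "high"}))   # -> excellent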

Regression
Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.
- One way is to infer coefficients a0, a1, ..., an such that
      Y = a0 + a1 X1 + a2 X2 + ... + an Xn
- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

- Regression aims to find coefficients that give the best possible fit.
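For linear regression, the coefficients can be fitted by least squares; a small sketch with invented data points (numpy's lstsq does the actual fitting):

import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])   # X1, X2
Y = np.array([8.0, 7.0, 17.0, 16.0])            # observed (noisy) values
A = np.column_stack([np.ones(len(X)), X])       # column of ones gives a0
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)    # best-fit a0, a1, a2
print(coef)                                     # Y ~ a0 + a1*X1 + a2*X2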

Association Rules
Retail shops are often interested in associations between different items that people buy.
- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

- Association information can be used in several ways. E.g. when a customer buys a particular book, an online shop may suggest associated books.
- Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks


Left hand side: antecedent; right hand side: consequent.
An association rule must have an associated population: the population consists of a set of instances. E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers: the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g. the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A - {b}).
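The naive algorithm can be sketched directly in Python; the transactions and thresholds below are assumed toy values, and itemsets are capped at size 3 to keep the enumeration small:

from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "butter"}, {"milk", "butter"}]
min_support, min_conf = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
large = [set(c) for n in (1, 2, 3) for c in combinations(items, n)
         if support(set(c)) >= min_support]          # step 2: large itemsets

for A in large:                                      # step 3: generate rules
    for b in A:
        if len(A) > 1 and support(A) / support(A - {b}) >= min_conf:
            print(sorted(A - {b}), "=>", b,
                  "support=%.2f" % support(A),
                  "confidence=%.2f" % (support(A) / support(A - {b})))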

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting. E.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2).
- We are interested in positive as well as negative correlations between sets of items:
  - Positive correlation: co-occurrence is higher than predicted.
  - Negative correlation: co-occurrence is lower than predicted.
- Sequence associations/correlations. E.g. whenever bonds go up, stock prices go down in 2 days.
- Deviations from temporal patterns. E.g. deviation from a steady growth; e.g. sales of winter wear go down in summer. Not surprising, part of a known pattern; look for deviation from the value predicted using past patterns.

Clustering


- Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways. E.g. group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
  - Centroid: point defined by taking the average of coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g. the Birch clustering algorithm (more shortly).
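The first formalization (minimize average distance to the assigned centroid) is essentially k-means; a short sketch with made-up points, and with none of the scalability tricks a system like Birch adds for very large data sets:

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to nearest centroid
            i = min(range(k), key=lambda i: sum((a - b) ** 2
                    for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        for i, c in enumerate(clusters):       # recompute centroids
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, 2)[0])                       # two well-separated centroids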

Hierarchical Clustering
- Example from biological classification. Other examples: Internet directory systems (e.g. Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)

Mobile databases

- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.


- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

- The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates.
• The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, updating a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It is not difficult to think of situations in which the use of a database can be beneficial, because you need not have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you are able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture
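A minimal sketch of the three tiers for the inventory-search example above (Python, with SQLite standing in for the database tier; the item table, column names, and port are invented for illustration, and a real deployment would use a full DBMS and web framework):

import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs

db = sqlite3.connect("inventory.db", check_same_thread=False)  # bottom tier: data
db.execute("CREATE TABLE IF NOT EXISTS item (name TEXT, price REAL, qty INTEGER)")

def search_items(keyword):                     # middle tier: application logic
    cur = db.execute(
        "SELECT name, price, qty FROM item WHERE name LIKE ?", (f"%{keyword}%",))
    return cur.fetchall()

class Handler(BaseHTTPRequestHandler):         # top tier: presentation (HTTP)
    def do_GET(self):
        q = parse_qs(urlparse(self.path).query).get("q", [""])[0]
        rows = search_items(q)                 # delegate to the middle tier
        body = "\n".join(f"{n}: ${p} ({qty} in stock)" for n, p, qty in rows)
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Handler).serve_forever()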


4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
(Or)

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, or plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, or holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes

Relationship Sets
A relationship is an association among several entities.
Example:
Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. For example, the entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one


– Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
– y is a dominant entity (in the example below, loan)
– x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• A top-down design process: we designate subgroupings within an entity set that are distinctive from the other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).

Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
– condition-defined
– user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
– disjoint
– overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within the generalization:
– total
– partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
– treat the relationship as an abstract entity;
– allows relationships between relationships;
– abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
– a customer takes out a loan;
– an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or an entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
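As an illustration, a sketch of reducing part of the banking E-R diagram to tables (Python with SQLite; table and column names follow the examples above):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE loan (                        -- strong entity set
    loan_number TEXT PRIMARY KEY,
    amount      REAL
);
CREATE TABLE customer (                    -- strong entity set
    customer_name TEXT PRIMARY KEY,
    customer_city TEXT
);
CREATE TABLE borrower (                    -- many-to-many relationship set
    customer_name TEXT REFERENCES customer,
    loan_number   TEXT REFERENCES loan,
    PRIMARY KEY (customer_name, loan_number)
);
CREATE TABLE payment (                     -- weak entity set
    loan_number    TEXT REFERENCES loan,   -- key of identifying strong set
    payment_number INTEGER,                -- discriminator
    amount         REAL,
    PRIMARY KEY (loan_number, payment_number)
);
""")

Note that no separate loan-payment table is created: as stated above, it would be redundant, since payment already carries loan_number.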

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
(Or)
(a) Give the features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity. The term chronon is used instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience. Various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point events


• A single-time-point event, e.g., a bank deposit.
• A series of point events can form time-series data.

Duration events
• Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time
• The time when the information from a certain transaction becomes valid.

Bitemporal database
• A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
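A minimal sketch of tuple versioning, with illustrative employee data: each tuple carries a valid-time interval, and a query reconstructs the relation's state at a given time.

INF = float("inf")   # open-ended valid end time ("until changed")

# (name, salary, valid_start, valid_end)
employee = [
    ("Smith", 30000, 2002, 2005),
    ("Smith", 36000, 2005, INF),
]

def as_of(relation, t):
    """Return the tuples whose valid-time interval contains time t."""
    return [(name, sal) for (name, sal, vs, ve) in relation if vs <= t < ve]

print(as_of(employee, 2003))  # [('Smith', 30000)]
print(as_of(employee, 2010))  # [('Smith', 36000)]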

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region Data
• Objects have spatial extent, with a location and a boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• E.g., find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
• E.g., find the 10 cities nearest to Madison.
• Results must be ordered by proximity.

Spatial Join Queries
• E.g., find all cities near a lake.
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; the OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
• Store spatial objects, such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.

Multimedia Databases
• Images, video, text, etc., stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multidimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles of the conference venue.
• Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.

Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• The index must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
– This is Alternative (2), with the key value being a box.
– The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dimensional box, ptr to child node>
– The box covers all boxes in the child node (in fact, in the subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept at least 50% full (except the root).
– We can choose a parameter m that is ≤ 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
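A sketch of this overlap search on a toy two-level R-tree (boxes are (xmin, ymin, xmax, ymax) tuples; the Node/Leaf classes are illustrative):

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:                     # non-leaf: entries are (box, child Node/Leaf)
    def __init__(self, entries): self.entries = entries

class Leaf:                     # leaf: entries are (box, rid)
    def __init__(self, entries): self.entries = entries

def search(node, q, result):
    for box, item in node.entries:
        if overlaps(box, q):
            if isinstance(item, (Node, Leaf)):
                search(item, q, result)      # step 1: descend into the subtree
            else:
                result.append(item)          # step 2: rid may overlap q
    return result

leaf = Leaf([((0, 0, 2, 2), "rid1"), ((5, 5, 7, 7), "rid2")])
root = Node([((0, 0, 7, 7), leaf)])
print(search(root, (1, 1, 6, 6), []))        # ['rid1', 'rid2']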

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

• But why not use convex polygons to approximate query regions more accurately?
– This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
– The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
– Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
– Adjust the entry for L in its parent so that the box now covers (only) L1.
– Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
– Remove some (say, 30% of the) entries and reinsert them into the tree.
– This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
– Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
– Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
– B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
– GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions.
– E.g., feature vectors in multimedia applications.
– Very sparse.
• Nearest-neighbor queries are common.
– The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
– In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for storing large volumes of data, for processing time-consuming decision-support queries, and for providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
• Horizontal partitioning – the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):

• Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
• Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0…n−1. Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
• Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn−2] is chosen. Let v be the partitioning-attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n−1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.

Round-robin
Advantages:
• Best suited for a sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process: no clustering – tuples are scattered across all disks.

Hash partitioning
• Good for sequential access.

• Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
– Can look up a single disk, leaving the others available for answering other queries.
– An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
– The remaining disks are available for other queries.
– Good if the result tuples are from one to a few blocks.


– If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted (an example of execution skew).
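A sketch of the three partitioning techniques described above, mapping each tuple to one of n disks (the partitioning vector [5, 11] reproduces the earlier example):

import bisect

def round_robin(i, n):               # i = insertion order of the tuple
    return i % n

def hash_partition(value, n):
    return hash(value) % n

def range_partition(value, vector):  # vector = [v0, v1, ..., v(n-2)], sorted
    return bisect.bisect_right(vector, value)

vector = [5, 11]
print(range_partition(2, vector))    # 0  -> disk 0
print(range_partition(8, vector))    # 1  -> disk 1
print(range_partition(20, vector))   # 2  -> disk 2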

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples, and all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.

Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
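A sketch of constructing a balanced partition vector by the sort-and-sample method above (the input values are illustrative; as noted, duplicates in the attribute can still cause imbalance):

def balanced_partition_vector(values, n):
    """Return n-1 cut points that split `values` into n near-equal ranges."""
    s = sorted(values)                 # sort on the partitioning attribute
    step = len(s) // n                 # read 1/n-th of the relation per cut
    return [s[step * i] for i in range(1, n)]

values = [1, 3, 3, 4, 7, 8, 12, 15, 19, 25, 31, 40]
print(balanced_partition_vector(values, 4))   # [4, 12, 25]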


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
– Locking and logging must be coordinated by passing messages between processors.
– Data in a local buffer may have been updated at another processor.
– Cache coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:

• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
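A minimal sketch of this shared-disk discipline (the lock and disk callables are illustrative placeholders, not a real DBMS API):

class PageGuard:
    def __init__(self, page_id, lock, disk_read, disk_write):
        self.page_id, self.lock = page_id, lock
        self.disk_read, self.disk_write = disk_read, disk_write

    def __enter__(self):
        self.lock.acquire()                        # shared/exclusive lock first
        self.data = self.disk_read(self.page_id)   # then read the page from disk
        self.dirty = False
        return self

    def __exit__(self, *exc):
        if self.dirty:
            self.disk_write(self.page_id, self.data)  # flush before unlocking
        self.lock.release()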

• More complex protocols with fewer disk reads/writes exist.
• Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
– Intraoperation parallelism – parallelize the execution of each individual operation in the query.
– Interoperation parallelism – execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically greater than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
• Choose processors P0, …, Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries on the sorting attributes.
• Redistribute the relation using range partitioning:
– All tuples that lie in the i-th range are sent to processor Pi.
– Pi stores the tuples it receives temporarily on disk Di.
– This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, …, Dn−1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
– The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
– Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
– The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute r ⋈ s with join condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1.
• Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case – asymmetric fragment-and-replicate:
– One of the relations, say r, is partitioned; any partitioning technique can be used.
– The other relation, s, is replicated across all the processors.
– Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r; therefore s is chosen as the build relation.
• A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
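A sketch of the partitioned parallel hash join described above, with the n "processors" simulated sequentially (h2, the local hash, is Python's built-in dict hashing):

def partitioned_hash_join(r, s, n, r_key, s_key):
    h1 = lambda v: hash(v) % n
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:                      # redistribute the build relation s via h1
        s_parts[h1(s_key(t))].append(t)
    for t in r:                      # redistribute the probe relation r via h1
        r_parts[h1(r_key(t))].append(t)
    out = []
    for i in range(n):               # each "processor" joins its partitions locally
        table = {}
        for t in s_parts[i]:         # build phase
            table.setdefault(s_key(t), []).append(t)
        for t in r_parts[i]:         # probe phase
            out.extend((t, m) for m in table.get(r_key(t), []))
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(1, "x"), (3, "y")]
print(partitioned_hash_join(r, s, 2, lambda t: t[0], lambda t: t[0]))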

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and use the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism

Pipelined parallelism
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
• Set up a pipeline that computes the three joins in parallel:
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 the computation of temp2 = temp1 ⋈ r3,
– and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider again a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 the computation of temp2 = r3 ⋈ r4,
– and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
– We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
– how to parallelize each operation, and how many processors to use for it;
– which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence", i.e., expert knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as expert systems, and fall under artificial intelligence.
Rule-based systems are:
– Knowledge-based systems.
– Part of the Artificial Intelligence field.
– Computer programs that contain some subject-specific knowledge of one or more human experts.
– Made up of a set of rules that analyze user-supplied information about a specific class of problems.
– Systems that utilize reasoning capabilities and draw conclusions.

• Knowledge Engineering – building an expert system.
• Knowledge Engineers – the people who build the system.
• Knowledge Representation – the symbols used to represent the knowledge.
• Factual Knowledge – knowledge of a particular task domain that is widely shared.


• Heuristic Knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
– Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
– Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
– Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

• Set of rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward chaining – starts from a set of conditions and moves towards some conclusion.
• Backward chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.
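A minimal sketch of forward chaining to a fixpoint (the rules and facts are invented for illustration):

rules = [
    ({"fever", "rash"}, "measles_suspected"),
    ({"measles_suspected", "unvaccinated"}, "refer_to_specialist"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:                       # repeat the match-resolve-act cycle
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)    # "act": assert the new conclusion
                changed = True
    return facts

print(forward_chain({"fever", "rash", "unvaccinated"}, rules))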

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine the intelligence of multiple human experts.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Cannot give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Automating complex processes is challenging.
• Lack of flexibility and ability to adapt to changing environments.
• Unable to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS; general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:
KBS = knowledge base + inference engine

KBS Architecture


• The inference engine and knowledge base are separated because:
– the reasoning mechanism needs to be as stable as possible;
– the knowledge base must be able to grow and change as knowledge is added;
– this arrangement enables the system to be built from, or converted to, a shell.
• It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
• Logic – resolution principle.
• Production rules – backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven).
• Semantic nets & frames – inheritance & advanced reasoning.
• Case-based reasoning – similarity-based.

KBS tools – Shells
• Consist of a knowledge-acquisition (KA) tool, a database, and a development interface.
• Inductive shells:
– the simplest;
– example cases are represented as a matrix of known data (premises) and resulting effects;
– the matrix is converted into a decision tree or IF-THEN statements;
– examples are selected for the tool.
• Rule-based shells:
– simple to complex;
– IF-THEN rules.
• Hybrid shells:
– sophisticated & powerful;
– support multiple KR paradigms & reasoning schemes;
– generic tools applicable to a wide range of problems.
• Special-purpose shells:
– specifically designed for particular types of problems;


– restricted to specialised problems.
• From scratch:
– requires more time and effort;
– no constraints, unlike shells;
– shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages
– Increased availability of expert knowledge:
• expertise that is otherwise not accessible;
• training future experts.
– Efficient and cost-effective.
– Consistency of answers.
– Explanation of solutions.
– Can deal with uncertainty.

Limitations
– Lack of common sense.
– Inflexible; difficult to modify.
– Restricted domain of expertise.
– Lack of learning ability.
– Not always reliable.


6. (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
• Mimic the organisational structure with their data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability, reliability, and performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional databases:
• More complex to implement.
• More costly to maintain.
• Security and integrity control are harder.
• Standards and experience are lacking.
• Design issues are more complex.

7. (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
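A minimal sketch of the timestamp-based version selection just described (illustrative only, not any particular DBMS's implementation): each write appends a new version, and a reader at timestamp ts sees the latest version written at or before ts, with no locks taken for reads.

class MVStore:
    def __init__(self):
        self.versions = {}              # key -> list of (write_ts, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, ts):
        candidates = [(w, v) for w, v in self.versions.get(key, []) if w <= ts]
        return max(candidates)[1] if candidates else None

db = MVStore()
db.write("Object1", "Foo", ts=0)
db.write("Object1", "Hello", ts=1)
print(db.read("Object1", ts=0))   # 'Foo'   - snapshot at t0
print(db.read("Object1", ts=1))   # 'Hello' - latest version at t1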

At t1, the state of the database could be:

Time  Object1  Object2
t0    "Foo"    "Bar"
t1    "Hello"  "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted immediately, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t0    "Foo"    "Bar"
t1    "Hello"  "Bar"
t2    "Hello"  (deleted)  "Foo-Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
– The limited and intermittent connectivity afforded by wireless communications.
– The limited life of the power supply (battery).


– The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
• The general architecture of a mobile platform is illustrated in Figure 30.1.

• It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
– Fixed hosts are general-purpose computers configured to manage mobile units.
– Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network.
– The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
– Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
– To manage mobility, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
– Mobile units must be able to move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

• The communication architecture just described is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Figure 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server, as in the sketch below.
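A minimal Python sketch of the proxy idea (the class and method names are hypothetical, not from the source): updates destined for a dozing client are buffered and flushed when the client becomes reachable again.

    class ProxyCache:
        """Buffers updates for an unreachable (dozing) mobile client."""
        def __init__(self):
            self.pending = []              # updates not yet delivered
            self.client_reachable = False

        def send(self, update):
            if self.client_reachable:
                self.deliver(update)         # normal path: client is in range
            else:
                self.pending.append(update)  # cache until the client wakes up

        def on_client_reconnect(self):
            self.client_reachable = True
            while self.pending:              # flush cached updates in order
                self.deliver(self.pending.pop(0))

        def deliver(self, update):
            print("delivering", update)     # stand-in for the wireless send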

Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
- Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates.
- The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
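A minimal Python sketch of one synchronization session, assuming a simple key-value local database; the update format and function name are made up for illustration.

    def synchronize(local_db, server_updates, client_updates):
        """One ISDB sync session: exchange update batches, then disconnect."""
        outgoing = list(client_updates)    # push updates made while disconnected
        for key, value in server_updates:  # install the batch received from the server
            local_db[key] = value
        return outgoing

    db = {"price:widget": 10}              # client's local database
    sent = synchronize(db, [("price:widget", 12)], [("stock:widget", 42)])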

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, only left-deep plans are considered.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result, and the cost, for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.

- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
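The pull interface maps naturally onto Python generators; a minimal sketch using the Sailors schema from the examples later in this answer (the relation contents are made up):

    def scan(table):                  # leaf operator: pull tuples from a stored relation
        for row in table:
            yield row

    def select(pred, child):          # selection: pulls on its input and filters
        for row in child:
            if pred(row):
                yield row

    def project(cols, child):         # projection: pulls on its input and trims columns
        for row in child:
            yield tuple(row[c] for c in cols)

    # Plan for: SELECT sname FROM Sailors WHERE rating > 7
    sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5)]
    plan = project([1], select(lambda r: r[2] > 7, scan(sailors)))
    print(list(plan))                 # pulling on the root pulls on the whole tree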

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
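To make the plan-space search concrete, here is a toy cost model in Python over the page counts above; the I/O formula is the standard page-oriented nested loops estimate, and the code is only an illustration:

    RESERVES_PAGES, SAILORS_PAGES = 1000, 500

    def page_nested_loops(outer_pages, inner_pages):
        """I/O cost: read the outer once; scan the inner once per outer page."""
        return outer_pages + outer_pages * inner_pages

    # Two left-deep alternatives for joining Reserves with Sailors:
    cost_r_outer = page_nested_loops(RESERVES_PAGES, SAILORS_PAGES)  # 501,000 I/Os
    cost_s_outer = page_nested_loops(SAILORS_PAGES, RESERVES_PAGES)  # 500,500 I/Os
    best = min(cost_r_outer, cost_s_outer)   # the optimizer keeps the cheaper plan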

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF: the most widely used formats for image data.
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
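Those per-minute figures translate directly into the delivery rates a server must sustain; a quick check in Python:

    def mb_per_min_to_mbit_per_s(mb):
        return mb * 8 / 60          # 1 MB = 8 megabits; 60 seconds per minute

    print(mb_per_min_to_mbit_per_s(12.5))   # MPEG-1: about 1.7 Mbit/s
    print(mb_per_min_to_mbit_per_s(17))     # MPEG-2: about 2.3 Mbit/s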

Several alternatives exist for audio encoding:
- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. E.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. E.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is larger than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
Consider the following program:
  Complex_Parts (Part) :- Assembly (Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
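A quick Python sketch of double+, matching the sets above:

    def double_plus(s):
        """double+: the doubles of the elements, unioned with the set itself."""
        return {2 * x for x in s} | s

    s = {1, 2, 5}
    print(double_plus(s))           # {1, 2, 4, 5, 10}
    print(double_plus(s) == s)      # False: {1, 2, 5} is not a fixpoint
    # The set of all integers (or of all even integers) satisfies f(v) == v,
    # but being infinite it can only be reasoned about, not materialized here.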

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation
  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:

  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
  Selection:      Result(Y) :- R(X, Y), X = c.
  Projection:     Result(Y) :- R(X, Y).
  Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
  Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
  Union:          Result(X, Y) :- R(X, Y).
                  Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
  WITH
    Big2 (Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
    Small2 (Part) AS
      ((SELECT A2.Part FROM Assembly A2)
       EXCEPT
       (SELECT B1.Part FROM Big2 B1))
  SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
  SELECT A.Part, SUM (A.Qty)
  FROM Assembly A
  GROUP BY A.Part

  NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations. The recursive Comp rule becomes:
  Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
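A minimal Python sketch of seminaive evaluation for the Comp program, using sets; the Assembly instance (a trike fragment) and the base rule Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty) are assumptions made for illustration:

    assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

    comp = {(p, s) for (p, s, q) in assembly}   # base rule
    delta = set(comp)                           # tuples new in the last iteration
    while delta:
        # Recursive rule, requiring one body fact from delta_Comp.
        new = {(p, s2) for (p, s1, q) in assembly
                       for (d1, s2) in delta if s1 == d1}
        delta = new - comp                      # keep only genuinely new tuples
        comp |= delta

    print(sorted(comp))   # trike transitively contains spoke and tire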

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
  SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P2, S2, Q2).
  SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
  Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
  Magic_SL (spoke).
  SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P2, S2, Q2).
  SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


3 (a) Discuss in detail Data Warehousing and Data Mining. (JUNE 2010)
Or
(a) Explain the features of data warehousing and data mining. (16) (NOV/DEC 2010)

Data Warehouse
- Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
- Data sources often store only current data, not historical data.
- Corporate decision making requires a unified view of all organizational data, including historical data.
- A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
  - Greatly simplifies querying; permits study of historical trends.
  - Shifts the decision support query load away from transaction processing systems.

When and how to gather data

- Source driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g., at night).
- Destination driven architecture: the warehouse periodically requests new information from data sources.
- Keeping the warehouse exactly synchronized with data sources (e.g., using two-phase commit) is too expensive.
  - It is usually OK to have slightly out-of-date data at the warehouse.
  - Data/updates are periodically downloaded from online transaction processing (OLTP) systems.

What schema to use
- Schema integration.

Data cleansing
- E.g., correct mistakes in addresses (misspellings, zip code errors).
- Merge address lists from different sources and purge duplicates; keep only one address record per household ("householding").

How to propagate updates
- The warehouse schema may be a (materialized) view of schemas from the data sources.
- Efficient techniques exist for update of materialized views.

What data to summarize
- Raw data may be too large to store on-line.
- Aggregate values (totals/subtotals) often suffice.
- Queries on raw data can often be transformed by the query optimizer to use aggregate values.

- Typically, warehouse data is multidimensional, with very large fact tables.
- Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
- Examples of measures: number of items sold, price of items.
- Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
- The resultant schema is called a star schema. More complicated schema structures:
  - Snowflake schema: multiple levels of dimension tables.
  - Constellation: multiple fact tables.

Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, or by prediction functions.
- Some manual intervention is usually required:
  - pre-processing of data, choice of which type of pattern to find, and postprocessing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
- Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
- Predict whether a customer is likely to switch brand loyalty.
- Predict whether a customer is likely to respond to "junk mail".
- Predict whether a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
- Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
- Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.

Descriptive Patterns
Associations:

- Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
- Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation:
- E.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
Clusters:
- E.g., typhoid cases were clustered in an area surrounding a contaminated well.
- Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level of the applicant, the salary of the applicant, the age of the applicant, etc.
  ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.
  - Different nodes at the same level may use different partitioning attributes.

Greedy top-down generation of decision trees:

- Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
- The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered and no further partitioning is possible.
  Such a node is a leaf node.
- Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
  Procedure GrowTree(S)
      Partition(S)

  Procedure Partition(S)
      if (purity(S) > dp or |S| < ds) then return
      for each attribute A
          evaluate splits on attribute A
      use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
      for i = 1, 2, ..., r
          Partition(Si)
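A compact Python sketch of the same greedy procedure, assuming numeric attributes, a majority-class purity measure, and binary splits; dp and ds play the same roles as in the pseudocode above:

    from collections import Counter

    def purity(rows):
        """Fraction of rows in the majority class; rows are (features, label)."""
        counts = Counter(label for _, label in rows)
        return max(counts.values()) / len(rows)

    def split_purity(rows, attr, threshold):
        sides = [[r for r in rows if r[0][attr] <= threshold],
                 [r for r in rows if r[0][attr] > threshold]]
        return sum(purity(s) * len(s) for s in sides if s) / len(rows)

    def grow_tree(rows, dp=0.95, ds=5):
        if purity(rows) > dp or len(rows) < ds:            # stop: make a leaf node
            return Counter(l for _, l in rows).most_common(1)[0][0]
        # Evaluate candidate splits on every attribute; keep the best one.
        attr, threshold = max(((a, f[a]) for f, _ in rows for a in range(len(f))),
                              key=lambda s: split_purity(rows, *s))
        left = [r for r in rows if r[0][attr] <= threshold]
        right = [r for r in rows if r[0][attr] > threshold]
        if not left or not right:                          # no useful split remains
            return Counter(l for _, l in rows).most_common(1)[0][0]
        return (attr, threshold, grow_tree(left, dp, ds), grow_tree(right, dp, ds))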

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets; widely studied in AI, not covered here.
Bayesian classifiers use Bayes' theorem, which says

  p(cj | d) = p(d | cj) p(cj) / p(d)

where
- p(cj | d) = probability of instance d being in class cj,
- p(d | cj) = probability of generating instance d given class cj,
- p(cj) = probability of occurrence of class cj, and
- p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj), and
- precomputation of p(cj);
- p(d) can be ignored, since it is the same for all classes.
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

the histogram is computed from the training instances.
- Histograms on multiple attributes are more expensive to compute and store.
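A minimal naive Bayes sketch in Python over categorical attributes, estimating each p(di | cj) from per-class histograms built on a made-up training set:

    from collections import Counter, defaultdict

    def train(instances):
        """instances: list of (attribute-tuple, class); builds p(cj) and histograms."""
        class_counts = Counter(c for _, c in instances)
        hist = defaultdict(Counter)            # (class, attr index) -> value counts
        for attrs, c in instances:
            for i, v in enumerate(attrs):
                hist[(c, i)][v] += 1
        return class_counts, hist, len(instances)

    def classify(d, model):
        class_counts, hist, n = model
        def score(c):
            p = class_counts[c] / n            # p(cj)
            for i, v in enumerate(d):          # p(d|cj) = product of the p(di|cj)
                p *= hist[(c, i)][v] / class_counts[c]
            return p                           # p(d) ignored: same for all classes
        return max(class_counts, key=score)

    data = [(("masters", "high"), "excellent"), (("bachelors", "mid"), "good"),
            (("bachelors", "mid"), "good"), (("masters", "high"), "excellent")]
    print(classify(("masters", "high"), train(data)))   # -> excellent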

Regression
Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

- One way is to infer coefficients a0, a1, a2, ..., an such that
    Y = a0 + a1 X1 + a2 X2 + ... + an Xn
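A least-squares fit of such a linear polynomial, sketched in Python with numpy on made-up sample points:

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])  # X1, X2
    Y = np.array([8.1, 6.9, 15.2, 13.8])

    A = np.column_stack([np.ones(len(X)), X])    # column of 1s for a0
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    a0, a1, a2 = coeffs                          # Y ~ a0 + a1*X1 + a2*X2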

- Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
- The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.
- Regression aims to find coefficients that give the best possible fit.

Association Rules
Retail shops are often interested in associations between the different items that people buy:
- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
  bread ⇒ milk
  DB-Concepts, OS-Concepts ⇒ Networks

- Left hand side: antecedent; right hand side: consequent.
- An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
- Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
     - Support of rule = support(A).
     - Confidence of rule = support(A) / support(A - {b}).
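A direct Python sketch of the naive algorithm on a made-up transaction list; the 50% support threshold is arbitrary:

    from itertools import combinations

    transactions = [{"bread", "milk"}, {"bread", "milk", "cereal"},
                    {"bread", "butter"}, {"milk", "screwdriver"}]

    def support(itemset):
        """Fraction of transactions containing every item in the set."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Steps 1-2: all itemsets of size >= 2 with sufficiently high support.
    items = set().union(*transactions)
    large = [set(c) for n in range(2, len(items) + 1)
                    for c in combinations(items, n) if support(set(c)) >= 0.5]

    # Step 3: from each large itemset A, the rule A - {b} => b, with its confidence.
    for A in large:
        for b in A:
            conf = support(A) / support(A - {b})
            print(sorted(A - {b}), "=>", b, "confidence:", conf)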

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2).
- We are interested in positive as well as negative correlations between sets of items:
  - Positive correlation: co-occurrence is higher than predicted.
  - Negative correlation: co-occurrence is lower than predicted.
- Sequence associations/correlations. E.g., whenever bonds go up, stock prices go down within 2 days.
- Deviations from temporal patterns. E.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. This is not surprising, as it is part of a known pattern; look for deviations from the value predicted using past patterns.

Clustering

- Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k), such that the average distance of points from the centroid of their assigned group is minimized.
  - Centroid: the point defined by taking the average of coordinates in each dimension.
- Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm.
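A minimal k-means sketch in Python of the centroid formulation above, on made-up 2-D points:

    import random

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def kmeans(points, k, iters=20):
        centroids = random.sample(points, k)
        for _ in range(iters):
            clusters = [[] for _ in range(k)]
            for p in points:   # assign each point to its nearest centroid
                clusters[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
            # Recompute each centroid as the per-dimension average of its cluster.
            centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
                         for i, cl in enumerate(clusters)]
        return centroids, clusters

    print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)], 2))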

Hierarchical Clustering
- Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, and repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.

- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).
  - The changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology offers data rates ranging from tens of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics that distinguish wireless connectivity options include interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely within a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:
- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
- Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

- A client connects to the server when it wants to exchange updates.
- The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture

4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
- a collection of entities,
- relationships among entities.
An entity is an object that exists and is distinguishable from other objects.
- Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties.
- Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
  customer = (customer-name, social-security, customer-street, customer-city)
  account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes

Relationship Sets
A relationship is an association among several entities.
Example:
  Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
  {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
- Example: (Hayes, A-102) ∈ depositor
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.

Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationships
- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship
- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan).
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of the relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.

- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise

Representing Generalization as Tables
- Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal and Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

- Time is considered an ordered sequence of points in some granularity.
  - Use the term chronon, instead of point, to describe the minimum granularity.
- A calendar organizes time into different time units for convenience. It accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
- Point events:
  - A single time point event, e.g., a bank deposit.
  - A series of point events can form time series data.
- Duration events:
  - Associated with a specific time period. A time period is represented by a start time and an end time.
- Transaction time:
  - The time when the information from a certain transaction becomes valid.
- Bitemporal database:
  - A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- a valid start time
- a valid end time
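A minimal tuple-versioning sketch in Python: each tuple version carries a [valid start, valid end) interval, and an as-of query filters on it (the data is made up):

    # Each version: (key, value, valid_start, valid_end); 9999 marks "until changed".
    salary_history = [("smith", 30000, 2002, 2005),
                      ("smith", 36000, 2005, 9999)]

    def as_of(rows, t):
        """Return the tuple versions valid at time t."""
        return [r for r in rows if r[2] <= t < r[3]]

    print(as_of(salary_history, 2004))   # [("smith", 30000, 2002, 2005)]
    print(as_of(salary_history, 2010))   # [("smith", 36000, 2005, 9999)]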

Incorporating Time in Object-Oriented Databases Using Attribute Versioning
- A single complex object stores all temporal changes of the object.
- Time-varying attribute: an attribute that changes over time, e.g., age.
- Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data
- Point data:
  - Points in a multidimensional space.
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  - E.g., feature vectors extracted from text.
- Region data:
  - Objects have spatial extent, with location and boundary.
  - The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
- Spatial range queries:
  - Find all cities within 50 miles of Madison.
  - The query has an associated region (location, boundary).
  - The answer includes overlapping or contained data regions.
- Nearest-neighbor queries:
  - Find the 10 cities nearest to Madison.
  - Results must be ordered by proximity.
- Spatial join queries:
  - Find all cities near a lake.
  - Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
- Geographic Information Systems (GIS):
  - E.g., ESRI's ArcInfo; OpenGIS Consortium.
  - Geospatial information.
  - All classes of spatial queries and data are common.
- Computer-Aided Design/Manufacturing:
  - Store spatial objects such as the surface of an airplane fuselage.
  - Range queries and spatial join queries are common.
- Multimedia Databases:
  - Images, video, text, etc., stored and retrieved by content.
  - First converted to feature vector form; high dimensionality.
  - Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD):
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval):
  - Given a face, find the five most similar faces.
- Multidimensional range queries:
  - 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
- An index based on spatial location is needed:
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
- The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>:
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>:
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root):
  - Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root:
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
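A recursive Python sketch of this search over a toy in-memory R-tree; the node representation (dicts, with 2-D boxes as (xlo, ylo, xhi, yhi)) is a simplification for illustration:

    def overlaps(a, b):
        """True if 2-D boxes a and b intersect; box = (xlo, ylo, xhi, yhi)."""
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def search(node, q, hits):
        if node["leaf"]:
            for box, rid in node["entries"]:      # step 2: candidate objects
                if overlaps(box, q):
                    hits.append(rid)
        else:
            for box, child in node["entries"]:    # step 1: prune subtrees
                if overlaps(box, q):
                    search(child, q, hits)
        return hits

    leaf = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((5, 5, 6, 6), "r2")]}
    root = {"leaf": False, "entries": [((0, 0, 6, 6), leaf)]}
    print(search(root, (1, 1, 3, 3), []))         # ["r1"]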

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>

Start at the root and go down to the "best-fit" leaf L.
• Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with the smallest area.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
• Adjust the entry for L in its parent so that the box now covers (only) L1.
• Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion

The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants

The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
• Remove some (say, 30% of the) entries and reinsert them into the tree.
• This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas, during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
• Searches now take a single path to a leaf, at the cost of redundancy.

GiST

The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
• E.g., feature vectors in multimedia applications.
• Very sparse.

Nearest-neighbor queries are common.
• The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
• In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)

Parallel Databases

Introduction

Parallel machines are becoming quite common and affordable:
• prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
• large volumes of transaction data are collected and stored for later analysis;
• multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data;
• processing time-consuming decision-support queries;
• providing high throughput for transaction processing.

Parallelism in Databases

Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
• data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism

Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.

Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 ... n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Partitioning techniques (cont.):

Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i+1; tuples with v < v0 go to disk 0; and tuples with v ≥ vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A code sketch of all three techniques appears after the comparison below.)

Comparison of Partitioning Techniques

Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 ≤ r.A < 25.

Round-robin

Advantages:

• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.

Disadvantages:
• Range queries are difficult to process: no clustering, tuples are scattered across all disks.

Hash partitioning
• Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
– Remaining disks are available for other queries.
– Good if result tuples are from one to a few blocks.


– If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.
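The following sketch (an illustration under assumed inputs, not part of the original notes) shows the three partitioning schemes side by side; the partition vector [5, 11] is the example used above:

    import bisect

    n = 3                                   # number of disks

    def round_robin(i):                     # i = insertion order of the tuple
        return i % n

    def hash_partition(value):              # value = partitioning attribute
        return hash(value) % n

    partition_vector = [5, 11]

    def range_partition(value):
        # bisect_right gives: value < 5 -> 0, 5 <= value < 11 -> 1, value >= 11 -> 2
        return bisect.bisect_right(partition_vector, value)

    assert [range_partition(v) for v in (2, 8, 20)] == [0, 1, 2]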

Partitioning a Relation across Disks

If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew

The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.

Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning

To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows:
– After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
– n denotes the number of partitions to be constructed.
– Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms

A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism

Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol

Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism

Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
• Intraoperation parallelism: parallelize the execution of each individual operation in the query.
• Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort

Range-Partitioning Sort

Choose processors P0, ..., Pm, where m ≤ n-1, to do the sorting. Create a range-partition vector with m entries, on the sorting attributes. Redistribute the relation using range partitioning:

• All tuples that lie in the i-th range are sent to processor Pi.
• Pi stores the tuples it received temporarily on disk Di.
• This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge

Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:

• The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
• Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
• The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
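As a sequential simulation (processors replaced by plain lists), the range-partitioning sort described above can be sketched as follows; no final merge is needed, because every key in partition i is smaller than every key in partition i+1:

    import bisect

    def range_partition_sort(tuples, vec):
        parts = [[] for _ in range(len(vec) + 1)]          # one list per "processor"
        for t in tuples:
            parts[bisect.bisect_right(vec, t)].append(t)   # redistribution step
        for p in parts:
            p.sort()                                       # local sort at each Pi
        return [t for p in parts for t in p]               # trivial concatenation

    print(range_partition_sort([9, 1, 20, 5, 11, 2], vec=[5, 11]))
    # [1, 2, 5, 9, 11, 20]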

Parallel Join

The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join

For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute the join r ⋈ s with join condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join

Partitioning is not possible for some join conditions:
• e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

A special case is asymmetric fragment-and-replicate:
• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
• E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join: assume s is smaller than r, so s is chosen as the build relation. A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi. As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.

Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join. Hash-join optimizations can be applied to the parallel case:
• e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them out and reading them back in.
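A sequential sketch of the partitioned parallel hash-join (processors simulated by loop iterations; the local hash function h2 is played by Python's dict):

    def partitioned_hash_join(r, s, n, r_key, s_key):
        h1 = lambda v: hash(v) % n
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in s:                                    # distribute the build relation s
            s_parts[h1(s_key(t))].append(t)
        for t in r:                                    # then redistribute the larger r
            r_parts[h1(r_key(t))].append(t)
        out = []
        for i in range(n):                             # each pass = work of processor Pi
            build = {}
            for t in s_parts[i]:                       # local build phase on si
                build.setdefault(s_key(t), []).append(t)
            for t in r_parts[i]:                       # local probe phase with ri
                out.extend((t, m) for m in build.get(r_key(t), []))
        return out

    # E.g., joining Reserves and Sailors tuples on sid (first field of each):
    # partitioned_hash_join(reserves, sailors, n=4,
    #                       r_key=lambda t: t[0], s_key=lambda t: t[0])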

Parallel Nested-Loop Join

Assume that:
• relation s is much smaller than relation r, and r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism

Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– and P2 be assigned the computation of temp2 = temp1 ⋈ r3,
– and P3 be assigned the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results,
– provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– and P2 be assigned the computation of temp2 = r3 ⋈ r4,
– and P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

• Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization

Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• how to parallelize each operation, and how many processors to use for it;
• what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
• e.g., allocating more processors than optimal can result in high communication overhead;
• long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases

A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.

Rule-based systems are:
– knowledge-based systems;
– part of the artificial intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering: building an expert system.
Knowledge Engineers: the people who build the system.
Knowledge Representation: the symbols used to represent the knowledge.
Factual Knowledge: knowledge of a particular task domain that is widely shared.


Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems:
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications

A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction

A typical rule-based system has four basic components:
• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
– Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
– Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
– Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System
• Set of rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward-chaining: starts from a set of conditions and moves towards some conclusion.
• Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
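A minimal forward-chaining sketch (an illustration, with rules represented simply as (premises, conclusion) pairs): the engine repeatedly fires any rule whose premises are all present in working memory, until nothing new can be concluded:

    def forward_chain(facts, rules):
        wm = set(facts)                           # temporary working memory
        changed = True
        while changed:                            # repeat the match-resolve-act cycle
            changed = False
            for premises, conclusion in rules:
                if premises <= wm and conclusion not in wm:
                    wm.add(conclusion)            # "act": assert the conclusion
                    changed = True
        return wm

    rules = [({"fever", "rash"}, "suspect_measles"),
             ({"suspect_measles"}, "order_blood_test")]
    print(forward_chain({"fever", "rash"}, rules))
    # {'fever', 'rash', 'suspect_measles', 'order_blood_test'}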

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the amount of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases

Knowledge-based Systems: Definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS; general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference (KR / Inference):
• Logic: resolution principle.
• Production rules: backward (top-down, goal-directed); forward (bottom-up, data-driven).
• Semantic nets & frames: inheritance & advanced reasoning.
• Case-based reasoning: similarity-based.

KBS tools (shells) consist of a KA tool, database & development interface:
• Inductive shells:
– simplest;
– example cases represented as a matrix of known data (premises) and resulting effects;
– matrix converted into a decision tree or IF-THEN statements;
– examples selected for the tool.
• Rule-based shells:
– simple to complex;
– IF-THEN rules.
• Hybrid shells:
– sophisticated & powerful;
– support multiple KR paradigms & reasoning schemes;
– generic tools applicable to a wide range.
• Special purpose shells:
– specifically designed for particular types of problems;


– restricted to specialised problems.
• Scratch (building a KBS from scratch):
– requires more time and effort;
– no constraints, unlike shells;
– shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).

Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages:
– Increase availability of expert knowledge: expertise otherwise not accessible; training of future experts.
– Efficient and cost-effective.
– Consistency of answers.
– Explanation of solution.
– Deal with uncertainty.

Limitations:
– Lack of common sense.
– Inflexible; difficult to modify.
– Restricted domain of expertise.
– Lack of learning ability.
– Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
• Mimics the organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional (centralized) databases:
• More complex to implement.
• More costly to maintain.
• Security and integrity control standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti may read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of the DB could be:

Time   Object1   Object2
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1   Object2     Object3
t2     "Hello"   (deleted)   "Foo-Bar"
t1     "Hello"   "Bar"
t0     "Hello"   "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
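A toy sketch of the timestamp rules above (it keeps every version and serves each reader the newest version that precedes the reader's timestamp; the abort-on-late-write rule is omitted for brevity):

    class MVStore:
        def __init__(self):
            self.versions = {}                    # key -> list of (write_ts, value)

        def read(self, key, ts):
            # Most recent version whose write timestamp precedes the reader's ts.
            older = [(w, v) for w, v in self.versions.get(key, []) if w <= ts]
            return max(older)[1] if older else None

        def write(self, key, value, ts):
            # Old versions are marked obsolete only logically; nothing is overwritten.
            self.versions.setdefault(key, []).append((ts, value))

    db = MVStore()
    db.write("Object1", "Foo", ts=0)
    db.write("Object1", "Hello", ts=2)
    print(db.read("Object1", ts=1))   # 'Foo'  : a t=1 snapshot never sees the t=2 write
    print(db.read("Object1", ts=3))   # 'Hello': a later reader sees the newest version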
Recovery

(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems, which may involve data management, transaction management, and database recovery, have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
– the limited and intermittent connectivity afforded by wireless communications;
– the limited life of the power supply (battery);


– the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
– The wireless medium has bandwidth significantly lower than that of a wired network.
– The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
– Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
– Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


seamless roaming throughout a geographical region.
– Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
– Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
– Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
– To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
– Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
– The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
– Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
– In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.
– In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
– Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
– MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
– Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
– Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
– Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments

The characteristics of mobile computing include:
– communication latency;
– intermittent connectivity;
– limited battery life;
– changing client location.

The server may not be able to reach a client:


– A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
– In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
– Proxies for unreachable components are added to the architecture.
– For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients:
– The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
– One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
– Servers must keep track of client locations in order to efficiently route messages to them.
– Client data should be stored in the network location that minimizes the traffic necessary to access it.
– The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
– Client mobility also allows new applications that are location-based.

Data Management Issues

From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues, as applied to mobile databases:
– data distribution and replication;
– transaction models;
– query processing;
– recovery and fault tolerance;


– mobile database design;
– location-based services;
– division of labor;
– security.

Application: Intermittently Synchronized Databases

Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
– A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
– A server cannot connect to a client at will.
– Issues of wireless versus wired client connections and power conservation are generally immaterial.
– A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
– A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization

Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). There are two parts to optimizing a query:
1. Consider a set of alternative plans.
• Must prune the search space; typically, only left-deep plans are considered.
2. Estimate the cost of each plan that is considered.
• Must estimate the size of the result and the cost for each plan node.
• Key issues: statistics, indexes, operator implementations.

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
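Python generators give a compact sketch of this pull interface (the operators and the tiny Sailors sample below are illustrative, not a real engine):

    def scan(table):
        for t in table:
            yield t                           # produce tuples one at a time

    def select(pred, child):
        for t in child:
            if pred(t):
                yield t                       # computed only when the parent pulls

    def project(cols, child):
        for t in child:
            yield tuple(t[c] for c in cols)

    sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5)]
    plan = project([1], select(lambda t: t[2] > 7, scan(sailors)))
    print(list(plan))                         # [('lubber',)]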

Two main issues:
1. For a given query, what plans are considered?
• An algorithm to search the plan space for the cheapest (estimated) plan.
2. How is the cost of a plan estimated?

Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.

Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)

Similar to the old schema; rname is added for variations.
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods, for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)

Multimedia Databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database:
• rather than storing them outside the database, in a file system.
The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats

Store and transmit multimedia data in compressed form:
• JPEG and GIF are the most widely used formats for image data.
• MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape.
– Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
– Compresses 1 minute of audio-video to approximately 17 MB.
• Several alternatives for audio encoding:


– MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained:
– video of a person speaking must show lips moving synchronously with the audio.

Video Servers

Video-on-demand systems deliver video from central video servers, across a network, to terminals:
• must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data:
– PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval

Examples of similarity-based retrieval:
• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
– E.g., identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
– E.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases

SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog

SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
• this takes us one level down the Assembly hierarchy;
• to find components that are one level deeper (e.g., rim), we need another join;
• to find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations

The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:

    ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
– E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
– The set of all integers is a fixpoint of double+.
– The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint

Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

15.3 Recursive queries with negation

    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation

If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


– Stratum 0: all database tables.
– Stratum I: tables defined in terms of tables in Stratum I and lower strata.
– If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
• Selection: Result(Y) :- R(X, Y), X = c.
• Projection: Result(Y) :- R(X, Y).
• Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
• Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
• Union: Result(X, Y) :- R(X, Y).
  Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

    WITH
      Big2(Part) AS
        (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
      Small2(Part) AS
        ((SELECT A2.Part FROM Assembly A2)
         EXCEPT
         (SELECT B1.Part FROM Big2 B1))
    SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

    SELECT A.Part, SUM(A.Qty)
    FROM Assembly A
    GROUP BY A.Part

    NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).

• The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of <...> is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences

Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
– For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
– Rewrite the program to use the delta tables, and update the delta tables between iterations.

For the Comp program, the rewritten recursive rule is:

    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
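A small sketch of seminaive evaluation for Comp (relations modeled as Python sets of tuples; each round joins Assembly only against the delta from the previous round, and only genuinely new tuples survive into the next delta):

    def seminaive_comp(assembly):                 # assembly: set of (part, subpart, qty)
        comp = {(p, s) for (p, s, q) in assembly} # base rule: direct subparts
        delta = set(comp)
        while delta:
            new = {(p, s2) for (p, s1, q) in assembly
                           for (d1, s2) in delta if s1 == d1}
            delta = new - comp                    # discard repeated inferences
            comp |= delta
        return comp

    asm = {("trike", "wheel", 3), ("wheel", "spoke", 40), ("wheel", "tire", 1)}
    print(sorted(seminaive_comp(asm)))
    # [('trike', 'spoke'), ('trike', 'tire'), ('trike', 'wheel'),
    #  ('wheel', 'spoke'), ('wheel', 'tire')]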
15.4.2 Pushing Selections to Avoid Irrelevant Inferences

    SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

    SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:

    Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
    Magic_SL(spoke).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
• Add "magic" filters: modify each rule in the program by adding a "magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the "magic" relations: we must create new rules to define the "magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 19: Database Technology

3 (a) Discuss in detail Data Warehousing and Data Mining (JUNE 2010)

Or

(a) Explain the features of data warehousing and data mining (16) (NOV/DEC 2010)

Data Warehouse
• Large organizations have complex internal organizations, and have data stored at different locations, on different operational (transaction processing) systems, under different schemas.
• Data sources often store only current data, not historical data.
• Corporate decision making requires a unified view of all organizational data, including historical data.
• A data warehouse is a repository (archive) of information gathered from multiple sources, stored under a unified schema, at a single site.
– Greatly simplifies querying; permits study of historical trends.
– Shifts the decision-support query load away from transaction processing systems.

When and how to gather data:


Source driven architecture data sources transmit new information to warehouse either continuously or periodically (eg at night)

Destination driven architecture warehouse periodically requests new information from data sources

Keeping warehouse exactly synchronized with data sources (eg using two-phase commit) is too expensive

Usually OK to have slightly out-of-date data at warehouse Dataupdates are periodically downloaded form online transaction processing (OLTP)

systemsWhat schema to use

Schema integrationData cleansing

Eg correct mistakes in addresses Eg misspellings zip code errors

Merge address lists from different sources and purge duplicates Keep only one address record per household (ldquohouseholdingrdquo)

How to propagate updates Warehouse schema may be a (materialized) view of schema from data sources Efficient techniques for update of materialized views

What data to summarize
• Raw data may be too large to store on-line.
• Aggregate values (totals/subtotals) often suffice.
• Queries on raw data can often be transformed by the query optimizer to use aggregate values.

• Typically, warehouse data is multidimensional, with very large fact tables.
• Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
• Examples of measures: number of items sold, price of items.


• Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
• The resultant schema is called a star schema.
More complicated schema structures:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.
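As a concrete illustration, a minimal star schema can be created and queried as below, using Python's standard sqlite3 module; the table and column names (sale, item, store) are hypothetical, chosen only to mirror the dimensions and measures listed above.

import sqlite3

# A minimal star schema sketch: one fact table, two dimension tables.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE item  (item_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE store (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale  (                -- fact table: dimension keys + measures
    item_id   INTEGER REFERENCES item(item_id),
    store_id  INTEGER REFERENCES store(store_id),
    sale_date TEXT,
    qty       INTEGER,              -- measure: number of items sold
    price     REAL                  -- measure: price of items
);
INSERT INTO item  VALUES (1, 'bread'), (2, 'milk');
INSERT INTO store VALUES (10, 'Chennai');
INSERT INTO sale  VALUES (1, 10, '2010-06-01', 5, 20.0),
                         (2, 10, '2010-06-01', 3, 15.0);
""")
# Aggregate query over the fact table, joining out to a dimension table.
for row in db.execute("""
    SELECT i.name, SUM(s.qty) AS total_qty
    FROM sale s JOIN item i ON s.item_id = i.item_id
    GROUP BY i.name"""):
    print(row)   # ('bread', 5), ('milk', 3)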

Data Mining
• Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
• Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
• It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
• Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
• Other types of knowledge are represented by equations, or by prediction functions.
• Some manual intervention is usually required: pre-processing of data, choice of which type of pattern to find, and post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
• Predict if a customer is likely to switch brand loyalty.
• Predict if a customer is likely to respond to "junk mail".
• Predict if a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification


• Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae
• Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations

• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
• Associations may also be used as a first step in detecting causation. E.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
Clusters
• E.g., typhoid cases were clustered in an area surrounding a contaminated well. Detection of clusters remains important in detecting epidemics.

Classification Rules
• Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
• Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

• Rules are not necessarily exact: there may be some misclassifications.
• Classification rules can be compactly shown as a decision tree.

Decision Tree
• Training set: a data sample in which the grouping for each tuple is already known.
• Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
• Since degree has a small number of possible values, one child is created for each value.
• At each child node of the root, further classification is done if required. Here, partitions are defined by income.
• Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
• Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.

In general:
• Different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
• More on choosing the partitioning attribute/condition shortly.
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
• The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
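A compact runnable rendering of this greedy procedure is sketched below; the purity measure, thresholds, and the toy training data are all assumptions made for illustration, not part of the original algorithm.

# A minimal greedy decision-tree sketch following GrowTree/Partition above.
# Rows are (attribute dict, class label); purity/thresholds are illustrative.
from collections import Counter

DP, DS = 0.95, 2   # purity threshold (delta_p) and size threshold (delta_s)

def purity(rows):
    """Fraction of rows in the most common class."""
    counts = Counter(label for _, label in rows)
    return max(counts.values()) / len(rows)

def partition(rows, attrs):
    if purity(rows) > DP or len(rows) < DS or not attrs:
        return Counter(label for _, label in rows).most_common(1)[0][0]  # leaf
    # Greedy choice: pick the attribute whose split gives the purest children.
    def split_score(a):
        groups = {}
        for row in rows:
            groups.setdefault(row[0][a], []).append(row)
        return sum(purity(g) * len(g) for g in groups.values()) / len(rows)
    best = max(attrs, key=split_score)
    groups = {}
    for row in rows:
        groups.setdefault(row[0][best], []).append(row)
    # Recurse on each child; the choice of `best` is never revisited.
    return {best: {v: partition(g, [a for a in attrs if a != best])
                   for v, g in groups.items()}}

train = [({"degree": "masters",   "income": "high"}, "excellent"),
         ({"degree": "bachelors", "income": "mid"},  "good"),
         ({"degree": "bachelors", "income": "low"},  "average")]
print(partition(train, ["degree", "income"]))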

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


• Widely studied in AI; we won't cover them here.
• Bayesian classifiers use Bayes' theorem, which says

    p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj)
• precomputation of p(cj)
• p(d) can be ignored, since it is the same for all classes
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) p(d2 | cj) ... p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store. A minimal sketch of such a classifier follows.
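The sketch below estimates p(cj) and the per-attribute histograms p(di | cj) from a toy training set; the data and the add-one smoothing choice are illustrative assumptions, not part of the original notes.

# Naive Bayesian classifier sketch: estimate p(cj) and p(di|cj) by counting.
from collections import Counter, defaultdict

train = [(("masters", "high"), "excellent"),    # (attribute values, class)
         (("bachelors", "mid"), "good"),
         (("masters", "mid"), "excellent")]

class_counts = Counter(c for _, c in train)
# hist[j][i][v] counts value v of attribute i within class j (the histograms)
hist = defaultdict(lambda: defaultdict(Counter))
for attrs, c in train:
    for i, v in enumerate(attrs):
        hist[c][i][v] += 1

def classify(attrs):
    best_class, best_score = None, -1.0
    for c, n in class_counts.items():
        score = n / len(train)                      # p(cj); p(d) is ignored
        for i, v in enumerate(attrs):
            score *= (hist[c][i][v] + 1) / (n + 2)  # p(di|cj), add-one smoothed
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(("masters", "mid")))  # -> 'excellent'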

Regression
Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

One way is to infer coefficients a0, a1, ..., an such that
    Y = a0 + a1 X1 + a2 X2 + ... + an Xn

Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.

The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

Regression aims to find coefficients that give the best possible fit. A least-squares sketch is given below.
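A minimal least-squares fit, assuming numpy is available; the sample points are made up for illustration.

# Linear regression sketch: fit Y = a0 + a1*X1 via least squares.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])        # illustrative parameter values
y = np.array([2.1, 3.9, 6.2, 7.8])        # observed (noisy) function results

A = np.column_stack([np.ones_like(x), x])  # design matrix: [1, X1]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
a0, a1 = coeffs
print(f"Y = {a0:.2f} + {a1:.2f} * X1")     # best fit; approximate due to noise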

Association Rules
• Retail shops are often interested in associations between different items that people buy.
• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

• Association information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
• Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


• Left hand side: antecedent; right hand side: consequent.
• An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
• Rules have an associated support, as well as an associated confidence.
• Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
• Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − {b}).
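These support/confidence computations are easy to make concrete; the transactions below are invented for illustration.

# Support and confidence for itemset rules, following the naive algorithm.
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread"}, {"milk", "screwdriver"}]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# From each large itemset A, generate rules (A - {b}) => b.
min_support = 0.5
items = set().union(*transactions)
for r in range(2, len(items) + 1):
    for A in map(set, combinations(items, r)):
        if support(A) >= min_support:
            for b in A:
                lhs = A - {b}
                conf = support(A) / support(lhs)
                print(sorted(lhs), "=>", b,
                      f"support={support(A):.2f} confidence={conf:.2f}")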

Other Types of Associations
• Basic association rules have several limitations.
• Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2).
• We are interested in positive as well as negative correlations between sets of items:
  - Positive correlation: co-occurrence is higher than predicted.
  - Negative correlation: co-occurrence is lower than predicted.
• Sequence associations/correlations. E.g., whenever bonds go up, stock prices go down in 2 days.
• Deviations from temporal patterns. E.g., deviation from a steady growth; e.g., sales of winter wear go down in summer (not surprising, part of a known pattern). Look for deviations from values predicted using past patterns.


Clustering
• Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
• Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
• Centroid: the point defined by taking the average of coordinates in each dimension.
• Another metric: minimize the average distance between every pair of points in a cluster. A k-means-style sketch of the centroid formulation is given below.
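The centroid-based formulation above is what the classical k-means procedure approximates; this is a toy sketch with made-up one-dimensional points, not the Birch algorithm mentioned below.

# k-means sketch for the centroid-based clustering formulation (1-D points).
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
k = 2
centroids = points[:k]                       # naive initialization

for _ in range(10):                          # a few refinement iterations
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[] for _ in range(k)]
    for p in points:
        i = min(range(k), key=lambda j: abs(p - centroids[j]))
        clusters[i].append(p)
    # Update step: each centroid becomes the average of its cluster.
    centroids = [sum(c) / len(c) if c else centroids[i]
                 for i, c in enumerate(clusters)]

print(centroids)   # roughly [1.0, 8.07]: two well-separated clusters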

• Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the Birch clustering algorithm (more shortly).

Hierarchical Clustering
• Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.

In mobile computing the problems are more difficult, mainly:
• The limited and intermittent connectivity afforded by wireless communications.
• The limited life of the power supply (battery).
• The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
• Fixed hosts are general purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units can be unrestricted throughout the cells of a domain, while maintaining information access contiguity.

• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based services
• Division of labor
• Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities
• relationships among entities
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes – depositor – A-102 (customer entity, relationship set, account entity)
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
• We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
• These subgroupings become lower-level entity sets, which have attributes or participate in relationships that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
• The term chronon is used instead of point, to describe the minimum granularity.
• A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
• The time when the information from a certain transaction becomes valid.
Bitemporal database
• Databases dealing with two time dimensions (valid time and transaction time).
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
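Tuple versioning can be made concrete with a small sketch; the employee relation and timestamps below are invented for illustration.

# Tuple versioning sketch: each row carries [valid_start, valid_end).
# A far-future end time marks the currently valid version.
NOW = "9999-12-31"

emp_salary = [  # (name, salary, valid_start, valid_end) -- illustrative data
    ("Smith", 30000, "2008-01-01", "2009-06-01"),
    ("Smith", 34000, "2009-06-01", NOW),
]

def valid_at(rows, t):
    """Return the tuple versions whose valid-time period contains t."""
    return [r for r in rows if r[2] <= t < r[3]]

print(valid_at(emp_salary, "2009-01-15"))  # the 30000 version
print(valid_at(emp_salary, "2010-01-01"))  # the current 34000 version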

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


• A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• "Find all cities within 50 miles of Madison."
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• "Find the 10 cities nearest to Madison."
• Results must be ordered by proximity.
Spatial Join Queries
• "Find all cities near a lake."
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects, such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.
Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks: an index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes).
The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
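The recursive overlap search can be sketched directly from these two steps; the node representation below (plain tuples for boxes, nested dicts for nodes) is an illustrative assumption, not a real R-tree implementation.

# Sketch of R-tree search: report rids of leaf boxes overlapping query box Q.
# A box is ((xlo, ylo), (xhi, yhi)); nodes are dicts with "leaf" and "entries".

def overlaps(a, b):
    """True if boxes a and b intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def search(node, q):
    hits = []
    for box, payload in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                hits.append(payload)          # step 2: rid of a candidate
            else:
                hits += search(payload, q)    # step 1: descend into subtree
    return hits

leaf1 = {"leaf": True, "entries": [(((0, 0), (2, 2)), "rid1"),
                                   (((5, 5), (6, 6)), "rid2")]}
root = {"leaf": False, "entries": [(((0, 0), (6, 6)), leaf1)]}
print(search(root, ((1, 1), (3, 3))))   # ['rid1']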

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


• Typically, high-dimensional datasets are collections of points, not regions. E.g., feature vectors in multimedia applications. Very sparse.
• Nearest neighbor queries are common. The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful. In any given data set, it is advisable to empirically test contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable:

prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
• Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):

Round-robin:
• Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
• Choose one or more attributes as the partitioning attributes.


• Choose a hash function h with range 0...n − 1.
• Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning:
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, ..., vn−2] is chosen.
• Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
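The three partitioning rules can be stated in a few lines; this sketch uses invented tuples and n = 3 disks purely for illustration.

# I/O parallelism sketch: round-robin, hash, and range partitioning rules.
from bisect import bisect_right

n = 3                              # number of disks (illustrative)
values = [2, 8, 20, 5, 11, 7]      # partitioning-attribute values of tuples

def round_robin(i, _v):
    return i % n                   # i-th inserted tuple goes to disk i mod n

def hash_part(_i, v):
    return hash(v) % n             # h(v) in 0..n-1 picks the disk

vec = [5, 11]                      # range-partitioning vector [v0, v1]
def range_part(_i, v):
    # v < v0 -> disk 0; vi <= v < vi+1 -> disk i+1; v >= v(n-2) -> disk n-1
    return bisect_right(vec, v)

for name, f in [("round-robin", round_robin),
                ("hash", hash_part), ("range", range_part)]:
    print(name, [f(i, v) for i, v in enumerate(values)])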

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
Round-robin:
• Advantages: best suited for sequential scan of the entire relation on each query; all disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
• Range queries are difficult to process: no clustering – tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access:

  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning:
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
• n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache-coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

• More complex protocols with fewer disk reads/writes exist.
• Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism – parallelize the execution of each individual operation in the query.
  - Interoperation parallelism – execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries, on the sorting attributes.
• Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
• The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn−1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.

Parallel Join
• The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1.
• Can use either range partitioning or hash partitioning: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case – asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

• E.g., say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s, and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor, based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
• Hash-join optimizations can be applied to the parallel case, e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
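A single-process sketch of the partition/build/probe structure is given below; h1 spreads tuples over hypothetical processors, the dict acts as the local hash h2, and the toy relations stand in for r and s.

# Partitioned parallel hash-join sketch (simulated on one machine).
# r and s are lists of (join_key, payload); s is the smaller build relation.
n = 3                                    # number of "processors"
r = [(1, "r1"), (2, "r2"), (4, "r4"), (7, "r7")]
s = [(1, "s1"), (4, "s4"), (5, "s5")]

h1 = lambda k: k % n                     # redistribution hash function

# Redistribute both relations on the join attribute using h1.
r_part = [[t for t in r if h1(t[0]) == i] for i in range(n)]
s_part = [[t for t in s if h1(t[0]) == i] for i in range(n)]

result = []
for i in range(n):                       # each iteration = one processor Pi
    # Build phase: local hash table on si (the dict plays the role of h2).
    table = {}
    for k, v in s_part[i]:
        table.setdefault(k, []).append(v)
    # Probe phase: each local r tuple probes the local table.
    for k, v in r_part[i]:
        for sv in table.get(k, []):
            result.append((k, v, sv))

print(result)   # [(1, 'r1', 's1'), (4, 'r4', 's4')]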

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r:
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - And P2 be assigned the computation of temp2 = temp1 ⋈ r3.
  - And P3 be assigned the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent parallelism:
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - And P2 be assigned the computation of temp2 = r3 ⋈ r4.
  - And P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
• Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

• Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
  - How to parallelize each operation, and how many processors to use for it.
  - What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem. E.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss Rules, Knowledge Bases, and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. Also known as Rule-Based Systems, Expert Systems & Artificial Intelligence.
Rule-based systems are:
– Knowledge-based systems.
– Part of the Artificial Intelligence field.
– Computer programs that contain some subject-specific knowledge of one or more human experts.
– Made up of a set of rules that analyze user-supplied information about a specific class of problems.
– Systems that utilize reasoning capabilities and draw conclusions.

• Knowledge Engineering – building an expert system.
• Knowledge Engineers – the people who build the system.
• Knowledge Representation – the symbols used to represent the knowledge.
• Factual Knowledge – knowledge of a particular task domain that is widely shared.


Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems

 Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members
 Solve problems that would normally be tackled by a medical or other professional
 Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

 A list of rules or rule base, which is a specific type of knowledge base
 An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
   Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
   Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
   Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

 Temporary working memory
 A user interface or other connection to the outside world, through which input and output signals are received and sent
An illustrative sketch of the match-resolve-act cycle is given below.
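As a hedged illustration of the cycle, here is a minimal forward-chaining production system in Python. The rules, facts, and the trivial "first match wins" conflict-resolution strategy are invented for the example, not part of the original answer.

    # Minimal sketch of a production system interpreter (match-resolve-act).
    # Working memory is a set of facts; each rule has a left-hand side (facts
    # that must all be present) and an action that adds new facts.
    rules = [
        {"lhs": {"has_fever", "has_rash"}, "add": {"suspect_measles"}},
        {"lhs": {"suspect_measles"},       "add": {"refer_specialist"}},
    ]
    working_memory = {"has_fever", "has_rash"}

    while True:
        # Match: collect instantiations whose LHS is satisfied and whose
        # action would still change working memory (avoids refiring).
        conflict_set = [r for r in rules
                        if r["lhs"] <= working_memory
                        and not r["add"] <= working_memory]
        if not conflict_set:       # halt when no production is satisfied
            break
        chosen = conflict_set[0]   # conflict-resolution: first satisfied rule
        working_memory |= chosen["add"]   # act: may change working memory

    print(working_memory)   # all four facts are derived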

Components of a Rule-Based System
 Set of Rules – derived from the knowledge base, and used by the interpreter to evaluate the input data
 Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain
 Interpreter – interprets the input data and draws a conclusion based on the user's responses


Problem-solving Models
 Forward-chaining – starts from a set of conditions and moves towards some conclusion
 Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals
 Both problem-solving methods are built into inference engines or inference procedures
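For contrast with the forward-chaining sketch above, here is a minimal backward-chaining sketch over the same illustrative rule format; the rule and goal names are again invented.

    # Backward chaining: prove a goal by finding a rule whose action
    # concludes it, and recursively proving that rule's premises.
    rules = [
        {"lhs": {"has_fever", "has_rash"}, "add": {"suspect_measles"}},
        {"lhs": {"suspect_measles"},       "add": {"refer_specialist"}},
    ]
    facts = {"has_fever", "has_rash"}

    def prove(goal, seen=frozenset()):
        if goal in facts:
            return True
        if goal in seen:               # avoid looping on circular rules
            return False
        return any(all(prove(p, seen | {goal}) for p in r["lhs"])
                   for r in rules if goal in r["add"])

    print(prove("refer_specialist"))   # True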

Advantages
 Provide consistent answers for repetitive decisions, processes, and tasks
 Hold and maintain significant levels of information
 Reduce employee training costs
 Centralize the decision-making process
 Create efficiencies and reduce the time needed to solve problems
 Combine multiple human expert intelligences
 Reduce the amount of human errors
 Give strategic and comparative advantages, creating entry barriers to competitors
 Review transactions that human experts may overlook

Disadvantages
 Lack human common sense needed in some decision making
 Will not be able to give the creative responses that human experts can give in unusual circumstances
 Domain experts cannot always clearly explain their logic and reasoning
 Challenges of automating complex processes
 Lack of flexibility and ability to adapt to changing environments
 Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

 Heuristic rather than algorithmic (heuristics in search vs in KBS: general vs domain-specific)
 Highly specific domain knowledge
 Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
 the reasoning mechanism needs to be as stable as possible;
 the knowledge base must be able to grow and change, as knowledge is added;
 this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
 KR                       Inference
 Logic                    Resolution principle
 Production rules         backward (top-down, goal-directed); forward (bottom-up, data-driven)
 Semantic nets & Frames   Inheritance & advanced reasoning
 Case-based Reasoning     Similarity-based

KBS tools – Shells
 - Consist of KA Tool, Database & Development Interface
 - Inductive Shells
   - simplest
   - example cases represented as a matrix of known data (premises) and resulting effects
   - matrix converted into a decision tree or IF-THEN statements
   - examples selected for the tool
 - Rule-based shells
   - simple to complex
   - IF-THEN rules
 - Hybrid shells
   - sophisticated & powerful
   - support multiple KR paradigms & reasoning schemes
   - generic tool, applicable to a wide range
 - Special purpose shells
   - specifically designed for particular types of problems


   - restricted to specialised problems
 - Scratch (building from scratch)
   - requires more time and effort
   - no constraints, unlike shells
   - shells should be investigated first

Some example KBSs
 DENDRAL (chemical)
 MYCIN (medicine)
 XCON/R1 (computer)

Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure
(2) Interpretation – to provide an understanding of a situation from available information, e.g. DENDRAL
(3) Prediction – to predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT
(4) Design – to develop configurations that satisfy constraints of a design problem, e.g. XCON
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning, e.g. HRM
(6) Monitoring – to check performance & flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment
(8) Instruction – to train students and correct their performance, e.g. give medical students experience diagnosing illness
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages

- Increase availability of expert knowledge / expertise not otherwise accessible / training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Distributed databases:
 mimic organisational structure, with data kept where it is used
 local access and autonomy, without exclusion
 cheaper to create and easier to expand
 improved availability/reliability/performance, by removing reliance on a central site
 reduced communication overhead: most data access is local, less expensive, and performs better
 improved processing power: many machines handling the database, rather than a single server
However, compared with conventional (centralised) databases, they are:
 more complex to implement
 more costly to maintain
 harder to secure; security and integrity control, standards, and experience are lacking
 more complex in their design issues

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
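The timestamp rules above can be made concrete with a small sketch. This is a minimal illustration of multiversion timestamp ordering for a single object, not the implementation of any particular DBMS; the class and method names are invented.

    # Minimal sketch of multiversion timestamp ordering for one object.
    # Each version carries a write timestamp (wts) and a read timestamp (rts).
    class MVObject:
        def __init__(self):
            self.versions = [{"value": None, "wts": 0, "rts": 0}]

        def read(self, ts):
            # Read the most recent version whose write timestamp precedes ts.
            v = max((v for v in self.versions if v["wts"] <= ts),
                    key=lambda v: v["wts"])
            v["rts"] = max(v["rts"], ts)     # remember the latest reader
            return v["value"]

        def write(self, ts, value):
            v = max((v for v in self.versions if v["wts"] <= ts),
                    key=lambda v: v["wts"])
            if ts < v["rts"]:                # a later reader already saw v
                raise RuntimeError("abort and restart transaction")
            self.versions.append({"value": value, "wts": ts, "rts": ts})

    p = MVObject()
    p.write(1, "Foo")      # T1 creates a version with wts = 1
    p.write(2, "Hello")    # T2 creates a newer version
    print(p.read(1))       # a reader at TS 1 still sees "Foo"
    print(p.read(3))       # a reader at TS 3 sees "Hello"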

At t1, the state of the DB could be:

 Time  Object1  Object2
 t1    "Hello"  "Bar"
 t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

 Time  Object1  Object2    Object3
 t2    "Hello"  (deleted)  "Foo-Bar"
 t1    "Hello"  "Bar"
 t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases. (16) (NOVDEC 2010)


Mobile Databases
 Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
 Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
 There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
 Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
 In mobile computing, the problems are more difficult, mainly:
   The limited and intermittent connectivity afforded by wireless communications
   The limited life of the power supply (battery)
   The changing topology of the network
 In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.

 It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
 Fixed hosts are general purpose computers, configured to manage mobile units.
 Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
 The wireless medium has bandwidth significantly lower than that of a wired network.
 The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
 Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
 Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
 Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
 Modern wireless networks can transfer data in units called packets, that are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
 Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
 To manage the mobility of units, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
 Mobile units can be unrestricted throughout the cells of a domain, while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
 The characteristics of mobile computing include:
   Communication latency
   Intermittent connectivity
   Limited battery life
   Changing client location
 The server may not be able to reach a client.


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

 Proxies for unreachable components are added to the architecture.
 For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
 Mobile computing poses challenges for servers as well as clients.

 The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
 One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
 Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

 The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
 Client mobility also allows new applications that are location-based.

Data Management Issues
 From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
 1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
 2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
 Data management issues as applied to mobile databases:
   Data distribution and replication
   Transaction models
   Query processing
   Recovery and fault tolerance


   Mobile database design
   Location-based service
   Division of labor
   Security

Application: Intermittently Synchronized Databases
 Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
 The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

 The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
   A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
   A server cannot connect to a client at will.
   Issues of wireless versus wired client connections and power conservation are generally immaterial.
   A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
   A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
A sketch of such a synchronization step is given below.
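As a hedged illustration of the synchronization step (the update format, sequence numbers, and names are invented for this sketch, not taken from any particular product):

    # Minimal sketch of ISDB synchronization: a mostly-disconnected client
    # uploads updates made offline, then installs the server's batch.
    class Server:
        def __init__(self):
            self.log = []                        # (seq, key, value) updates

        def apply(self, updates):
            for key, value in updates.items():
                self.log.append((len(self.log), key, value))

        def updates_since(self, seq):
            return [(s, k, v) for (s, k, v) in self.log if s >= seq]

    def synchronize(client_db, pending, server):
        server.apply(pending)                    # push offline updates
        for seq, key, value in server.updates_since(client_db["_seq"]):
            client_db[key] = value               # install the batch
            client_db["_seq"] = seq + 1
        pending.clear()

    server = Server()
    server.apply({"price:bread": 2})             # update made by another client
    client = {"_seq": 0}
    synchronize(client, {"price:milk": 1}, server)
    print(client)   # {'_seq': 2, 'price:bread': 2, 'price:milk': 1}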

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
 Query optimization is an important task in a relational DBMS.
 One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
 Two parts to optimizing a query:
 1. Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
 2. Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
 Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


 Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
 Two main issues:
 - For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
 - How is the cost of a plan estimated?
 Ideally: find the best plan. Practically: avoid the worst plans.
 We will study the System R approach.

Schema for Examples
 Sailors (sid: integer, sname: string, rating: integer, age: real)
 Reserves (sid: integer, bid: integer, day: dates, rname: string)
 Similar to the old schema; rname added for variations.
 Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
 Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

 An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
 Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
 For each block, the plans considered are:
 - All available access methods, for each relation in the FROM clause.
 - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
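To make the idea concrete, here is a hedged sketch of exhaustively costing left-deep join orders with a toy cost model. The relations, cardinalities, selectivity, and cost formula are invented for illustration; a real System R style optimizer uses dynamic programming and much richer cost estimates.

    # Enumerate all left-deep join orders for a set of relations and pick
    # the cheapest under a toy cost model (cost = sum of intermediate sizes).
    from itertools import permutations

    card = {"Sailors": 40000, "Reserves": 100000, "Boats": 100}
    selectivity = 0.0001      # assumed selectivity of each join predicate

    def plan_cost(order):
        size, cost = card[order[0]], 0
        for rel in order[1:]:         # inner relation joined one-at-a-time
            size = size * card[rel] * selectivity
            cost += size              # charge for each intermediate result
        return cost

    best = min(permutations(card), key=plan_cost)
    print(best, plan_cost(best))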

8 (a) Discuss multimedia databases in detail. (8) (NOVDEC 2010)
Multimedia databases

 To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
 The database must handle large object representation.
 Similarity-based retrieval must be provided by special index structures.
 Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
 Store and transmit multimedia data in compressed form.
 - JPEG and GIF: the most widely used formats for image data.
 - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
 - MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
 - MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

 Several alternatives for audio encoding:
 - MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data

 The most important types are video and audio data.
 Characterized by high data volumes and real-time information-delivery requirements:
 - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
 - Data must be delivered at a rate that does not cause overflow of system buffers.
 - Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
 Video-on-demand systems deliver video from central video servers, across a network, to terminals.
 - Must guarantee end-to-end delivery rates.
 Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
 Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
 Head-end terminals – used to view multimedia data.
 - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

 Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g. identify similar designs for registering a new trademark.
 Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g. test user input against stored commands.
 Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOVDEC 2010)

15 Deductive Databases
 SQL-92 cannot express some queries:
 - Are we running low on any parts needed to build a ZX600 sports car?
 - What is the total component and assembly cost to build a ZX600 at today's part prices?
 Can we extend the query language to cover such queries?
 - Yes, by adding recursion.
 Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
 SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
 Datalog is a query language that has the same if-then flavor.
 - New: the answer table can appear in the From clause, i.e. be defined recursively.
 - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
 Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
 - This takes us one level down the Assembly hierarchy.
 - To find components that are one level deeper (e.g. rim), we need another join.
 - To find all components, we need as many joins as there are levels in the given instance.
 For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
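A hedged sketch of why recursion fixes this: the transitive closure of a toy Assembly instance can be computed by iterating the recursive rule until no new tuples appear, with no fixed number of joins. The data is invented for illustration.

    # Naive fixpoint computation of Comp(Part, Subpart): iterate the
    # recursive rule until no new tuples are produced.
    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

    comp = set(assembly)                 # base case: direct subparts
    while True:
        new = {(p, s2) for (p, s1) in comp
                       for (s1b, s2) in assembly if s1 == s1b}
        if new <= comp:                  # fixpoint reached
            break
        comp |= new

    print(sorted(comp))   # includes ("trike", "rim"), three levels deep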

15.2 Theoretical Foundations
 The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
 The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
 The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

(i) Least Model Semantics
 - The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
 - In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
 - If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

(ii) Safe Datalog Programs


 Consider the following program:
   Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
 According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part.
 Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

(iii) The Fixpoint Operator
 - Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
 - Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
   E.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
 - The set of all integers is a fixpoint of double+.
 - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

(iv) Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

(b) Recursive queries with negation
   Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
   Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
 - If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be, if we apply Rule 1 first.
 - But we have Small(trike) if Rule 2 is applied first.
 - There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended, in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
 - T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
 - Stratified program: if T depends on not S, then S cannot depend on T (or not T).
 - If a program is stratified, the tables in the program can be partitioned into strata:


 - Stratum 0: all database tables.
 - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
 - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
 Selection:      Result(Y) :- R(X, Y), X = c.
 Projection:     Result(Y) :- R(X, Y).
 Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
 Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
 Union:          Result(X, Y) :- R(X, Y).
                 Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
 Big2(Part) AS
   (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
 Small2(Part) AS
   ((SELECT A2.Part FROM Assembly A2)
    EXCEPT
    (SELECT B1.Part FROM Big2 B1))
 SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
 SELECT A.Part, SUM(A.Qty)
 FROM Assembly A
 GROUP BY A.Part

 NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
 - The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
 - In order to apply such a rule, we must have all of the Assembly relation available.
 - Stratification with respect to use of < … > is the usual restriction to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
 - Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
 - Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
 - Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
 - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
 - Rewrite the program to use the delta tables, and update the delta tables between iterations:
   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
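A hedged sketch of seminaive evaluation for the Comp program (the Assembly data is invented; note that each round joins Assembly only with the previous round's delta, rather than with all of Comp, unlike the naive loop shown earlier):

    # Seminaive evaluation of Comp(Part, Subpart).
    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

    comp = set(assembly)      # base case
    delta = set(assembly)     # tuples generated in the previous iteration
    while delta:
        # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
        new = {(p, s) for (p, p2) in assembly
                      for (p2b, s) in delta if p2 == p2b}
        delta = new - comp    # keep only genuinely new tuples
        comp |= delta

    print(sorted(comp))       # same answer as naive evaluation, fewer joins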


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
 SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
 SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
 - There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
 - Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
 - But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
   SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
 - Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
   Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
   Magic_SL(spoke).
   SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
   SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
 - Add "Magic" filters: modify each rule in the program by adding a "Magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
 - Define the "Magic" relations: we must create new rules to define the "Magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R



 Source-driven architecture: data sources transmit new information to the warehouse, either continuously or periodically (e.g. at night).
 Destination-driven architecture: the warehouse periodically requests new information from data sources.
 Keeping the warehouse exactly synchronized with data sources (e.g. using two-phase commit) is too expensive.
 Usually it is OK to have slightly out-of-date data at the warehouse. Data/updates are periodically downloaded from online transaction processing (OLTP) systems.
What schema to use?

 Schema integration
Data cleansing
 E.g. correct mistakes in addresses (misspellings, zip code errors).
 Merge address lists from different sources and purge duplicates. Keep only one address record per household ("householding").
How to propagate updates?
 The warehouse schema may be a (materialized) view of the schema from data sources.
 Efficient techniques for update of materialized views.

What data to summarize?
 Raw data may be too large to store on-line.
 Aggregate values (totals/subtotals) often suffice.
 Queries on raw data can often be transformed by the query optimizer to use aggregate values.
 Typically, warehouse data is multidimensional, with very large fact tables.
 Examples of dimensions: item-id, date/time of sale, store where sale was made, customer identifier.
 Examples of measures: number of items sold, price of items.


 Dimension values are usually encoded using small integers, and mapped to full values via dimension tables.
 The resultant schema is called a star schema.
 More complicated schema structures:
 - Snowflake schema: multiple levels of dimension tables.
 - Constellation: multiple fact tables.

Data Mining
 Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
 Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
 Differs from machine learning in that it deals with large volumes of data, stored primarily on disk.
 Some types of knowledge discovered from a database can be represented by a set of rules, e.g. "Young women with annual incomes greater than $50,000 are most likely to buy sports cars."
 Other types of knowledge are represented by equations, or by prediction functions.
 Some manual intervention is usually required:
 - pre-processing of data,
 - choice of which type of pattern to find,
 - postprocessing to find novel patterns.

Applications of Data Mining
 Prediction based on past history:
 - Predict if a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
 - Predict if a customer is likely to switch brand loyalty.
 - Predict if a customer is likely to respond to "junk mail".
 - Predict if a pattern of phone calling card usage is likely to be fraudulent.
 Some examples of prediction mechanisms:
 Classification:


 Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
 Regression formulae:
 Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations:

 Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
 Other similar applications: camera accessories, clothes, etc.
 Associations may also be used as a first step in detecting causation, e.g. an association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
 E.g. typhoid cases were clustered in an area surrounding a contaminated well.
 Detection of clusters remains important in detecting epidemics.

Classification Rules
 Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
 Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.

   ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
   ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good

 Rules are not necessarily exact: there may be some misclassifications.
 Classification rules can be compactly shown as a decision tree; an executable rendering of the two rules above is given below.
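In this small hedged sketch, the attribute names follow the rules above; the default returned when no rule fires is an assumption, not part of the original answer.

    # Classification rules rendered as a function; returns a credit class.
    def classify(degree, income):
        if degree == "masters" and income > 75_000:
            return "excellent"
        if degree == "bachelors" and 25_000 <= income <= 75_000:
            return "good"
        return "unknown"   # assumed default when no rule applies

    print(classify("masters", 80_000))    # excellent
    print(classify("bachelors", 50_000))  # good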

Decision Tree
 Training set: a data sample in which the grouping for each tuple is already known.
 Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value.
 At each child node of the root, further classification is done if required. Here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
 Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
 In general:
 - Different branches of the tree could grow to different levels.


 - Different nodes at the same level may use different partitioning attributes.
 Greedy top-down generation of decision trees:
 - Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
 - (More on choosing the partitioning attribute/condition shortly.)
 - The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
 The data at a node is not partitioned further if either:
 - all (or most) of the items at the node belong to the same class, or
 - all attributes have been considered, and no further partitioning is possible.
 Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
 Procedure GrowTree(S)
   Partition(S)

 Procedure Partition(S)
   if (purity(S) > δp or |S| < δs) then return   // pure enough, or too small
   for each attribute A
     evaluate splits on attribute A
   use the best split found (across all attributes) to partition S into S1, S2, …, Sr
   for i = 1, 2, …, r
     Partition(Si)

Other Types of Classifiers
 Further types of classifiers:
 - Neural net classifiers
 - Bayesian classifiers
 Neural net classifiers use the training data to train artificial neural nets.


 Widely studied in AI; won't cover here.
 Bayesian classifiers use Bayes theorem, which says

   p(cj | d) = p(d | cj) p(cj) / p(d)

 where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.

Naive Bayesian Classifiers
 Bayesian classifiers require:
 - computation of p(d | cj)
 - precomputation of p(cj)
 - p(d) can be ignored, since it is the same for all classes
 To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
   p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)
 Each of the p(di | cj) can be estimated from a histogram on di values for each class cj;

 the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
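A hedged sketch of a naive Bayesian classifier over categorical attributes, estimating each p(di | cj) by counting; the training data is invented and no smoothing is applied.

    # Naive Bayes: p(cj | d) is proportional to p(cj) * product of p(di | cj).
    from collections import Counter, defaultdict

    train = [(("masters", "high"), "excellent"),
             (("bachelors", "mid"), "good"),
             (("bachelors", "mid"), "good"),
             (("masters", "mid"), "good")]

    class_counts = Counter(c for _, c in train)
    attr_counts = defaultdict(Counter)   # (attr index, class) -> value counts
    for attrs, c in train:
        for i, v in enumerate(attrs):
            attr_counts[(i, c)][v] += 1

    def predict(attrs):
        def score(c):
            p = class_counts[c] / len(train)         # p(cj)
            for i, v in enumerate(attrs):            # product of p(di | cj)
                p *= attr_counts[(i, c)][v] / class_counts[c]
            return p
        return max(class_counts, key=score)

    print(predict(("masters", "high")))   # excellent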

Regression
 Regression deals with the prediction of a value, rather than a class.
 Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.

 One way is to infer coefficients a0, a1, …, an such that
   Y = a0 + a1 X1 + a2 X2 + … + an Xn
 Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
 The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial.

 Regression aims to find coefficients that give the best possible fit.
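A hedged sketch of fitting such coefficients by least squares on synthetic data (numpy's lstsq is one standard solver; the true coefficients below are invented so that the fit can be checked):

    # Fit Y = a0 + a1*X1 + a2*X2 by least squares on synthetic data.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    Y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

    A = np.column_stack([np.ones(len(X)), X])   # prepend the intercept column
    coeffs, *_ = np.linalg.lstsq(A, Y, rcond=None)
    print(coeffs)   # approximately [3.0, 2.0, -1.0]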

Association Rules
 Retail shops are often interested in associations between different items that people buy:
 - Someone who buys bread is quite likely also to buy milk.
 - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

 Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
 Association rules:
   bread ⇒ milk
   DB-Concepts, OS-Concepts ⇒ Networks


 Left-hand side: antecedent; right-hand side: consequent.
 An association rule must have an associated population: the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
 Rules have an associated support, as well as an associated confidence.
 Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
 Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
 We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
 Naive algorithm:

 1. Consider all possible sets of relevant items.
 2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
    - Large itemsets: sets with sufficiently high support.
 3. Use large itemsets to generate association rules:
    - From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
      Support of rule = support(A).
      Confidence of rule = support(A) / support(A − {b}).
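A hedged sketch of this naive algorithm on a toy set of transactions; itemsets are enumerated by brute force, whereas real systems prune using the a priori property.

    # Naive association-rule mining: compute support for every itemset,
    # keep the "large" ones, then emit rules A - {b} => b with confidence.
    from itertools import combinations

    transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                    {"bread", "jam"}, {"milk", "eggs"}]
    min_support, min_confidence = 0.5, 0.75

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    items = set().union(*transactions)
    large = [set(c) for n in range(1, len(items) + 1)
                    for c in combinations(items, n)
                    if support(set(c)) >= min_support]

    for A in large:
        for b in A:
            if len(A) > 1 and support(A) / support(A - {b}) >= min_confidence:
                print(A - {b}, "=>", b)   # e.g. {'eggs'} => milk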

Other Types of Associations
 Basic association rules have several limitations.
 Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
 - Positive correlation: co-occurrence is higher than predicted.
 - Negative correlation: co-occurrence is lower than predicted.
 Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
 Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer.
 - Not surprising: part of a known pattern.
 - Look for deviation from the value predicted using past patterns.


Clustering
 Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
 Can be formalized using distance metrics in several ways:
 - E.g., group points into k sets (for a given k), such that the average distance of points from the centroid of their assigned group is minimized.
   Centroid: the point defined by taking the average of coordinates in each dimension.
 - Another metric: minimize the average distance between every pair of points in a cluster.
 Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the Birch clustering algorithm (more shortly).
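A hedged sketch of the first (centroid-based) formulation, as the classic k-means iteration; the one-dimensional data, initial centroids, and fixed iteration count are assumptions made for brevity.

    # k-means (Lloyd's algorithm): assign points to the nearest centroid,
    # then recompute each centroid as the mean of its assigned points.
    points = [1.0, 1.2, 0.8, 8.0, 8.2, 7.9]
    centroids = [0.0, 5.0]                   # arbitrary initial guesses

    for _ in range(10):                      # fixed number of iterations
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]

    print(centroids)   # roughly [1.0, 8.03]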

Hierarchical Clustering
 Example from biological classification; other examples: Internet directory systems (e.g., Yahoo; more on this later).
 Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
 Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

 Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
 In mobile computing, the problems are more difficult, mainly:
   The limited and intermittent connectivity afforded by wireless communications
   The limited life of the power supply (battery)
   The changing topology of the network
 In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


 It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
 Fixed hosts are general purpose computers, configured to manage mobile units.
 Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
 The wireless medium has bandwidth significantly lower than that of a wired network.
 The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
 Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
 Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
 Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships –
 Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
 To manage the mobility of units, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
 Mobile units can be unrestricted throughout the cells of a domain, while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

 Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
 The characteristics of mobile computing include:
   Communication latency
   Intermittent connectivity
   Limited battery life
   Changing client location
 The server may not be able to reach a client.
 A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


 Proxies for unreachable components are added to the architecture.
 For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
 Mobile computing poses challenges for servers as well as clients.

 The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
 One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
 Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

 The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
 Client mobility also allows new applications that are location-based.

Data Management Issues
 From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
 1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
 2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
 Data management issues as applied to mobile databases:
   Data distribution and replication
   Transaction models
   Query processing
   Recovery and fault tolerance
   Mobile database design
   Location-based service
   Division of labor
   Security

Application: Intermittently Synchronized Databases


 Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

 A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
 A server cannot connect to a client at will.

Issues of wireless versus wired client connections and power conservation are generally immaterial

 A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
 A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

 Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

 This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
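The sketch below shows what such an application might look like in Python, using the standard sqlite3 module. The table and column names (inventory, search_log, etc.) are invented for illustration; a real site would sit behind a web server, but the database logic is the same: answer the pricing/availability query, and log what was asked for and whether it was in stock.

import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the real inventory database
conn.executescript("""
    CREATE TABLE inventory (item TEXT PRIMARY KEY, price REAL, qty INTEGER);
    CREATE TABLE search_log (item TEXT, in_stock INTEGER);
    INSERT INTO inventory VALUES ('widget', 9.99, 12), ('gadget', 24.50, 0);
""")

def search_inventory(item):
    # answer the customer's pricing/availability query
    row = conn.execute(
        "SELECT price, qty FROM inventory WHERE item = ?", (item,)
    ).fetchone()
    in_stock = bool(row and row[1] > 0)
    # record the search, so unmet demand can be analyzed later
    conn.execute("INSERT INTO search_log VALUES (?, ?)", (item, in_stock))
    conn.commit()
    return row, in_stock

print(search_inventory("widget"))    # found, and in stock
print(search_inventory("sprocket"))  # not found: a lost sale worth noticing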

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

- a collection of entities
- relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (a customer entity) is associated with account A-102 (an account entity) via the depositor relationship set.

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.
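This formalism translates directly into code: a relationship set is literally a set of tuples drawn from the entity sets. A toy Python illustration, with made-up sample data, attaching the access-date attribute to each relationship instance:

from datetime import date

customers = {"Hayes", "Jones"}      # entity set E1
accounts = {"A-102", "A-217"}       # entity set E2

# depositor is a subset of customers x accounts,
# with the descriptive attribute access-date per instance
depositor = {
    ("Hayes", "A-102"): date(2010, 6, 1),
    ("Jones", "A-217"): date(2010, 6, 3),
}

# every relationship instance draws its components from the entity sets
assert all(c in customers and a in accounts for (c, a) in depositor)
print(depositor[("Hayes", "A-102")])   # access-date of this relationship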


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.

- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.

- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number). The DDL sketch below makes this concrete.
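A minimal sketch in Python's sqlite3 (the column names are assumptions) showing how the weak entity set's primary key combines the strong entity's key with the discriminator:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE loan (
        loan_number TEXT PRIMARY KEY,
        amount      REAL
    );
    -- weak entity set: key = strong entity's key + discriminator
    CREATE TABLE payment (
        loan_number    TEXT REFERENCES loan(loan_number),
        payment_number INTEGER,          -- discriminator
        payment_amount REAL,
        PRIMARY KEY (loan_number, payment_number)
    );
""")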

Specialization
- A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.

- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.

- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.

- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables

- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.

- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized; each such table also includes the attributes of the generalized entity set. A sketch of Method 1 as table definitions follows.
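A sketch of Method 1 for the account generalization, again in sqlite3 with assumed column names: one table for the higher-level entity set carries the common attributes, and each lower-level table repeats the primary key plus its own attributes.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Method 1: a table for the generalized entity set...
    CREATE TABLE account (
        account_number TEXT PRIMARY KEY,
        balance        REAL
    );
    -- ...and one per lower-level entity set, keyed by the same primary key
    CREATE TABLE savings_account (
        account_number TEXT PRIMARY KEY REFERENCES account(account_number),
        interest_rate  REAL
    );
    CREATE TABLE checking_account (
        account_number   TEXT PRIMARY KEY REFERENCES account(account_number),
        overdraft_amount REAL
    );
""")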

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- The term chronon is used instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars:

Gregorian (western), Chinese, Islamic, etc.
Point events


- A single time-point event, e.g., a bank deposit.

- A series of point events can form time series data.
Duration events

- Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time
- The time when the information from a certain transaction becomes valid.

Bitemporal database
- Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
A sketch of this scheme follows.
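A minimal sketch of tuple versioning in plain Python (the attribute names are assumptions): an update never overwrites a tuple; it closes the old version's valid-time interval and appends a new version.

from datetime import date

INFINITY = date.max   # stands for "valid until further notice"

rows = [{"name": "Smith", "salary": 25000,
         "valid_start": date(2009, 1, 1), "valid_end": INFINITY}]

def update_salary(name, new_salary, when):
    for row in rows:
        if row["name"] == name and row["valid_end"] == INFINITY:
            row["valid_end"] = when      # close the current version
    rows.append({"name": name, "salary": new_salary,
                 "valid_start": when, "valid_end": INFINITY})

update_salary("Smith", 30000, date(2010, 6, 1))
# both versions are retained, so queries "as of" any date are answerable
for r in rows:
    print(r)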

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries

- "Find all cities within 50 miles of Madison."
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
- "Find the 10 cities nearest to Madison."
- Results must be ordered by proximity.

Spatial Join Queries
- "Find all cities near a lake."
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)

- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.

Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75> (see the sketch below).
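Sorting those four entries the way a composite-key B+ tree would makes the point concrete; the sketch below is plain Python:

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

# a B+ tree on <age, sal> orders entries lexicographically
linearized = sorted(entries)
print(linearized)   # [(11, 80), (12, 10), (12, 20), (13, 75)]

# (11, 80) and (13, 75) are near neighbors in (age, sal) space, yet they
# sit at opposite ends of the linear order: 2-D nearness is lost.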

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):

- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.

Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:

- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>

- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.

Non-leaf entry = <n-dim box, ptr to child node>
- The box covers all boxes in the child node (in fact, subtree).

All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).

- Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions because boxes can be represented compactly

But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.
A compact sketch of the basic search procedure appears below.
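A compact sketch of the search procedure in plain Python; boxes are (xmin, ymin, xmax, ymax) tuples, and the node layout is an assumption made for illustration:

def overlaps(a, b):
    # two axis-aligned boxes intersect iff they overlap on every axis
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, q, results):
    if node["leaf"]:
        for box, rid in node["entries"]:
            if overlaps(box, q):        # candidate object that might overlap Q
                results.append(rid)
    else:
        for box, child in node["entries"]:
            if overlaps(box, q):        # only descend into overlapping subtrees
                rtree_search(child, q, results)
    return results

leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")]}
leaf2 = {"leaf": True, "entries": [((8, 8, 9, 9), "r3")]}
root = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((8, 8, 9, 9), leaf2)]}
print(rtree_search(root, (1, 1, 3, 3), []))   # ['r1', 'r2']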

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.

- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- Goal: reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.

- Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large.

- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
- data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin:
- Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
- Choose one or more attributes as the partitioning attributes.


- Choose a hash function h with range 0 ... n-1.
- Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
Range partitioning:
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn-2] is chosen.
- Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
- E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.
Round-robin
Advantages:

- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process.

- No clustering: tuples are scattered across all disks.
Hash partitioning
- Good for sequential access:

  - Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.

  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:

  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to the disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew. (The three partitioning functions are sketched below.)
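The three partitioning functions are easy to state in Python; the sketch below uses the example vector [5, 11] from above (Python's built-in hash stands in for a real hash function):

from bisect import bisect_right

N_DISKS = 3

def round_robin(i):              # i = insertion order of the tuple
    return i % N_DISKS

def hash_partition(key):
    return hash(key) % N_DISKS

PARTITION_VECTOR = [5, 11]       # range boundaries for 3 disks

def range_partition(v):
    # disk 0: v < 5, disk 1: 5 <= v < 11, disk 2: v >= 11
    return bisect_right(PARTITION_VECTOR, v)

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2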

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows:

  - After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.

- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique based on histograms is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation. A sketch of the sort-and-scan construction follows.
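A sketch of the sort-and-scan method in Python (n and the data are illustrative): after sorting, the partitioning-attribute value found after every 1/n-th of the relation becomes a vector entry, so each range holds roughly the same number of tuples.

def balanced_partition_vector(values, n):
    # build an (n-1)-entry range-partitioning vector for n partitions
    s = sorted(values)               # sort on the partitioning attribute
    step = len(s) // n
    return [s[i * step] for i in range(1, n)]

skewed = [1] * 50 + list(range(2, 52))        # half the tuples share value 1
print(balanced_partition_vector(skewed, 4))   # [1, 2, 27]
# duplicates in the partitioning attribute can still cause
# imbalance at the boundaries, as noted above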


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:

  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.

- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
- The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

- Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Can use either range partitioning or hash partitioning.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.

- Each processor Pi locally computes the join of ri and si; any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case: asymmetric fragment-and-replicate.

  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.

- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

  - E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:

  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in. A small simulation of the partitioned hash join is sketched below.
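A single-process simulation of the partitioned parallel hash join in Python: h1 stands in for the first hash function, and the "processors" are just lists (a real system would further partition each bucket with h2, as described above).

N_PROC = 3
h1 = lambda k: hash(k) % N_PROC      # routes tuples to processors

def partitioned_hash_join(r, s, r_key, s_key):
    # phase 1: redistribute both relations by h1 on the join attribute
    r_parts = [[] for _ in range(N_PROC)]
    s_parts = [[] for _ in range(N_PROC)]
    for t in s:
        s_parts[h1(t[s_key])].append(t)
    for t in r:
        r_parts[h1(t[r_key])].append(t)

    # phase 2: each "processor" i builds on s_parts[i], probes with r_parts[i]
    out = []
    for i in range(N_PROC):
        build = {}
        for t in s_parts[i]:                 # build on the smaller relation
            build.setdefault(t[s_key], []).append(t)
        for t in r_parts[i]:                 # probe
            for m in build.get(t[r_key], []):
                out.append(t + m)
    return out

r = [(1, "a"), (2, "b"), (3, "c")]
s = [(1, "x"), (3, "y")]
print(partitioned_hash_join(r, s, 0, 0))   # joins the tuples with keys 1 and 3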

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi.
- At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

- Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem:

  - E.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.
Rule-based systems are:

- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solves problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the input data.
- Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures. A toy forward-chaining interpreter is sketched below.
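A toy forward-chaining interpreter in Python, illustrating the match-resolve-act cycle: rules fire when all their premises are in working memory, adding their conclusions, until nothing new can be derived. The rules and facts are invented for illustration.

# rule = (premises, conclusion)
rules = [
    ({"fever", "rash"}, "measles_suspected"),
    ({"measles_suspected"}, "refer_to_doctor"),
]

def forward_chain(facts):
    working_memory = set(facts)
    fired = True
    while fired:                     # repeat the match-resolve-act cycle
        fired = False
        for premises, conclusion in rules:
            # match: all premises satisfied; act: add the conclusion (once)
            if premises <= working_memory and conclusion not in working_memory:
                working_memory.add(conclusion)
                fired = True
    return working_memory

# derives measles_suspected, and from it refer_to_doctor
print(forward_chain({"fever", "rash"}))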

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference (KR / Inference)
- Logic: resolution principle.
- Production rules: backward (top-down, goal-directed) or forward (bottom-up, data-driven) chaining.
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity-based.
KBS tools - Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.

- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.

- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.

- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)
Typical tasks of KBS
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimics organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages compared with conventional (centralized) databases:
- More complex to implement.
- More costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects.


For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
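A minimal sketch of the idea in Python: each object keeps an append-only list of (transaction ID, value) versions; writers append, and a reader at snapshot t sees the latest version with ID <= t, without any locking. The class and method names are invented for illustration.

class MVCCStore:
    def __init__(self):
        self.versions = {}       # key -> list of (txn_id, value), append-only

    def write(self, txn_id, key, value):
        self.versions.setdefault(key, []).append((txn_id, value))

    def read(self, snapshot_txn, key):
        # return the newest version written at or before the snapshot
        candidates = [(t, v) for t, v in self.versions.get(key, [])
                      if t <= snapshot_txn]
        return max(candidates)[1] if candidates else None

db = MVCCStore()
db.write(0, "Object1", "Foo")
db.write(1, "Object1", "Hello")     # t1 supersedes t0, but does not overwrite
db.write(2, "Object3", "Foo-Bar")   # a later, concurrent update

print(db.read(1, "Object1"))  # 'Hello': the long-running reader's t1 snapshot
print(db.read(1, "Object3"))  # None: t2's insert is invisible at t1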

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client


A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:

- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
- Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.

- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).

- A server cannot connect to a client at will.

- Issues of wireless versus wired client connections and power conservation are generally immaterial.

- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc. A sketch of this client-driven synchronization pattern follows.
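A sketch of the client side of this pattern in Python (all names are illustrative): while disconnected, the client queues its local updates; on synchronization it pushes the queue to the server and installs the returned batch of updates.

class Server:
    def __init__(self):
        self.db = {}

    def exchange(self, client_updates):
        for key, value in client_updates:
            self.db[key] = value
        # unicast reply; a multicast server would send the same
        # batch of updates to a whole group of clients
        return list(self.db.items())

class IntermittentClient:
    def __init__(self):
        self.local_db = {}
        self.pending = []                 # updates made while disconnected

    def local_write(self, key, value):
        self.local_db[key] = value        # client manages its own data offline
        self.pending.append((key, value))

    def synchronize(self, server):
        # the client initiates; the server cannot connect to it at will
        for key, value in server.exchange(self.pending):
            self.local_db[key] = value    # install the batch of updates
        self.pending.clear()

srv = Server()
c = IntermittentClient()
c.local_write("order-1", "placed")        # disconnected work
c.synchronize(srv)                        # exchange updates on connection
print(c.local_db)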

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
• Consider a set of alternative plans.
  - Must prune the search space; typically only left-deep plans are considered.
• Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
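As an illustrative sketch (not part of the original answer), the pull interface can be modeled with Python generators; the sample tuples and the predicate below are invented, using the Sailors schema that appears later in this answer:

# A minimal sketch of the pull-based (iterator) interface: each operator
# pulls tuples from its input(s) one at a time, as it is itself pulled on.
def scan(table):                      # leaf operator: scans a stored relation
    for tup in table:
        yield tup

def select(pred, child):              # pulls from child, emits matching tuples
    for tup in child:
        if pred(tup):
            yield tup

def project(cols, child):             # pulls from child, emits chosen columns
    for tup in child:
        yield tuple(tup[c] for c in cols)

# Example: SELECT sname FROM Sailors WHERE rating > 7
sailors = [(22, 'dustin', 7, 45.0), (31, 'lubber', 8, 55.5)]
plan = project([1], select(lambda t: t[2] > 7, scan(sailors)))
print(list(plan))                     # pulls tuples through the whole plan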

Two main issues:
• For a given query, what plans are considered?
  - An algorithm searches the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Similar to the old schema; rname is added for variations.
Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
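A minimal sketch of this plan-space search, assuming a toy cost model: the Sailors and Reserves page counts come from the schema above, the Boats relation and all cost formulas are invented (a real System R-style optimizer also estimates result sizes from statistics and chooses join methods):

# Enumerate left-deep join orders and pick the cheapest by a crude model.
from itertools import permutations

sizes = {'Sailors': 500, 'Reserves': 1000, 'Boats': 50}   # pages per relation

def plan_cost(order):
    # toy model: each join costs pages(outer so far) * pages(inner),
    # as in simple nested loops; result size estimate is also crude
    cost, outer = 0, sizes[order[0]]
    for inner in order[1:]:
        cost += outer * sizes[inner]
        outer = max(1, outer * sizes[inner] // 1000)
    return cost

best = min(permutations(sizes), key=plan_cost)            # left-deep orders
print(best, plan_cost(best))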

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system. The database must handle large object representation. Similarity-based retrieval must be provided by special index structures, and the database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form:
• JPEG and GIF are the most widely used formats for image data.
• The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
• Several alternatives exist for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained, e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers across a network to terminals, and must guarantee end-to-end delivery rates. Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data. Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2(Part) AS (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS ((SELECT A2.Part FROM Assembly A2)
                   EXCEPT
                   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part;
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < … > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Semi-naive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add "magic" filters: modify each rule in the program by adding a "magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the "magic" relations: we must create new rules to define the "magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
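A minimal sketch of the semi-naive evaluation described in 15.4.1, in Python, assuming a small hand-built Assembly instance (the tuples are invented; only facts derived in the previous iteration feed the next one):

# Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
# Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
assembly = {('trike', 'wheel', 3), ('trike', 'frame', 1),
            ('wheel', 'spoke', 2), ('wheel', 'tire', 1)}

comp = {(p, s) for (p, s, q) in assembly}        # base case
delta = set(comp)
while delta:
    new = {(p, s2) for (p, s1, q) in assembly
                   for (s1b, s2) in delta if s1 == s1b} - comp
    comp |= new
    delta = new                                  # next round uses only new facts

print(sorted(comp))   # includes ('trike', 'spoke'), ('trike', 'tire'), etc.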


Dimension values are usually encoded using small integers, and mapped to full values via dimension tables. The resultant schema is called a star schema. More complicated schema structures exist:
• Snowflake schema: multiple levels of dimension tables.
• Constellation: multiple fact tables.

Data Mining
Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns. Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns. It differs from machine learning in that it deals with large volumes of data stored primarily on disk. Some types of knowledge discovered from a database can be represented by a set of rules, e.g.: "Young women with annual incomes greater than $50,000 are most likely to buy sports cars." Other types of knowledge are represented by equations, or by prediction functions. Some manual intervention is usually required:
• pre-processing of data,
• choice of which type of pattern to find,
• post-processing to find novel patterns.

Applications of Data Mining
Prediction based on past history:
• Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, …) and past history.
• Predict whether a customer is likely to switch brand loyalty.
• Predict whether a customer is likely to respond to "junk mail".
• Predict whether a pattern of phone calling card usage is likely to be fraudulent.
Some examples of prediction mechanisms:
Classification:


Given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae:
Given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations:
• Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
• Other similar applications: camera accessories, clothes, etc.
Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
Clusters:
• E.g., typhoid cases were clustered in an area surrounding a contaminated well.
• Detection of clusters remains important in detecting epidemics.

Classification Rules
Classification rules help assign new objects to a set of classes, e.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk, or high risk?
Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.:
• ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
• ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
Rules are not necessarily exact: there may be some misclassifications. Classification rules can be compactly shown as a decision tree.

Decision Tree
Training set: a data sample in which the grouping for each tuple is already known. Consider the credit risk example. Suppose degree is chosen to partition the data at the root. Since degree has a small number of possible values, one child is created for each value. At each child node of the root, further classification is done if required; here, partitions are defined by income. Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval. Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
In general:
• Different branches of the tree could grow to different levels.


• Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
• Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
• (More on choosing the partitioning attribute/condition shortly.)
• The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
The data at a node is not partitioned further if either:
• all (or most) of the items at the node belong to the same class, or
• all attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node. Otherwise, the data at the node is partitioned further, by picking an attribute for partitioning the data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r
        Partition(Si)

Other Types of Classifiers
Further types of classifiers:
• Neural net classifiers
• Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets. These are widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says:
p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
• computation of p(d | cj),
• precomputation of p(cj);
• p(d) can be ignored, since it is the same for all classes.
To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
p(d | cj) = p(d1 | cj) · p(d2 | cj) · … · p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
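A small sketch of a naive Bayesian classifier over categorical attributes; the training tuples and the add-one smoothing in the histogram estimates are invented for illustration:

from collections import Counter, defaultdict

train = [(('masters', 'high'), 'excellent'), (('bachelors', 'mid'), 'good'),
         (('masters', 'mid'), 'excellent'), (('bachelors', 'low'), 'good')]

class_counts = Counter(label for _, label in train)
hist = defaultdict(Counter)           # per-class, per-attribute value histograms
for attrs, label in train:
    for i, v in enumerate(attrs):
        hist[(label, i)][v] += 1

def classify(attrs):
    def score(c):
        p = class_counts[c] / len(train)                 # p(cj)
        for i, v in enumerate(attrs):                    # p(di | cj), smoothed
            p *= (hist[(c, i)][v] + 1) / (class_counts[c] + 1)
        return p                                         # p(d) ignored: same for all
    return max(class_counts, key=score)

print(classify(('masters', 'low')))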

Regression
Regression deals with the prediction of a value, rather than a class. Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y. One way is to infer coefficients a0, a1, …, an such that
Y = a0 + a1 X1 + a2 X2 + … + an Xn
Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting. The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
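A minimal sketch of least-squares linear regression for one variable (Y = a0 + a1 X1), on made-up points; with many variables one would solve the normal equations or call a library routine:

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
      / sum((x - mx) ** 2 for x in xs))
a0 = my - a1 * mx
print(a0, a1)          # coefficients of the best-fit line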

Association Rules
Retail shops are often interested in associations between different items that people buy:
• Someone who buys bread is quite likely also to buy milk.
• A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Associations information can be used in several ways, e.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules:
• bread ⇒ milk
• DB-Concepts, OS-Concepts ⇒ Networks


Left hand side: antecedent; right hand side: consequent. An association rule must have an associated population; the population consists of a set of instances, e.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2 percent or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set). Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules: from itemset A, generate the rule A − b ⇒ b for each b ∈ A.
4. Support of rule = support(A). Confidence of rule = support(A) / support(A − b).
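A minimal sketch of the support and confidence computation for the rule bread ⇒ milk, over a toy set of transactions (each transaction is the set of items purchased):

transactions = [{'bread', 'milk'}, {'bread', 'milk', 'eggs'},
                {'bread', 'jam'}, {'milk', 'cereal'}, {'bread', 'milk'}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

rule_support = support({'bread', 'milk'})              # support(A)
confidence = rule_support / support({'bread'})         # support(A) / support(A - b)
print(rule_support, confidence)                        # 0.6 and 0.75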

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting, e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
• Positive correlation: co-occurrence is higher than predicted.
• Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations, e.g., whenever bonds go up, stock prices go down within 2 days.
Deviations from temporal patterns, e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. This is not surprising, since it is part of a known pattern; look instead for deviations from the value predicted using past patterns.


Clustering
Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster. This can be formalized using distance metrics in several ways, e.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. (The centroid is the point defined by taking the average of the coordinates in each dimension.) Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm (more shortly).
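A minimal sketch of the k-sets formalization above, as k-means on one-dimensional points; the points, k, and the naive initialization are invented, and scalable systems (e.g., BIRCH) build on this idea:

def kmeans(points, k, rounds=10):
    centroids = points[:k]                       # naive initialization
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                         # assign to nearest centroid
            i = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[i].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

print(kmeans([1.0, 1.2, 0.8, 5.0, 5.3, 4.9], 2))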

Hierarchical Clustering
Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
• Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
• Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing. Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized. Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems. In mobile computing, the problems are more difficult, mainly:
• The limited and intermittent connectivity afforded by wireless communications.
• The limited life of the power supply (battery).
• The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network. Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications –
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
• To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
• Mobile units may move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture. Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients. Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server. Transaction processing and data consistency control become more difficult, since there is no central control in this architecture. Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
• Servers must keep track of client locations in order to efficiently route messages to them.
• Client data should be stored in the network location that minimizes the traffic necessary to access it.
• The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand. Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three-Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or
(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
• a collection of entities,
• relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.
Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set. Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
• Simple and composite attributes
• Single-valued and multi-valued attributes
• Null attributes
• Derived attributes
Relationship Sets
A relationship is an association among several entities. Example:
Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.
• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues
• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities
• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
• We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.
One-To-Many and Many-To-One Relationship
• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.
Many-To-Many Relationship
• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
• y is a dominant entity (in the example below, loan),
• x is a subordinate entity (in the example below, payment).
If a loan entity is deleted, then all its associated payment entities must be deleted also.
E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.
Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).
Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial
Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  - treat the relationship as an abstract entity,
  - allows relationships between relationships,
  - abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  - a customer takes out a loan,
  - an employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized (each table includes its local attributes and the attributes of the generalized entity set).

(b) Explain the features of Temporal and Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions

• Time is considered an ordered sequence of points in some granularity.
  - The term chronon is used instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience.
  - Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
• Point events:
  - Single time point event, e.g., a bank deposit.
  - A series of point events can form time series data.
• Duration events:
  - Associated with a specific time period. A time period is represented by a start time and an end time.
• Transaction time:
  - The time when the information from a certain transaction becomes valid.
• Bitemporal database:
  - A database dealing with two time dimensions.
Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
  - Valid start time
  - Valid end time
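A minimal sketch of tuple versioning: each tuple carries a valid-time interval, and a snapshot query returns the tuples valid at a given time. The account values and dates are invented:

from datetime import date

# (customer, balance, valid_start, valid_end); end=None means "until now"
account_versions = [
    ('Hayes', 400, date(2009, 1, 1), date(2009, 6, 30)),
    ('Hayes', 550, date(2009, 7, 1), None),
]

def snapshot(versions, at):
    return [(name, bal) for (name, bal, start, end) in versions
            if start <= at and (end is None or at <= end)]

print(snapshot(account_versions, date(2009, 3, 15)))   # [('Hayes', 400)]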

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

• Point data
  - Points in a multidimensional space.
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  - E.g., feature vectors extracted from text.
• Region data
  - Objects have spatial extent with location and boundary.
  - The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
• Spatial range queries
  - Find all cities within 50 miles of Madison.
  - The query has an associated region (location, boundary).
  - The answer includes overlapping or contained data regions.
• Nearest-neighbor queries
  - Find the 10 cities nearest to Madison.
  - Results must be ordered by proximity.
• Spatial join queries
  - Find all cities near a lake.
  - Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
• Geographic Information Systems (GIS)
  - E.g., ESRI's ArcInfo; OpenGIS Consortium.
  - Geospatial information.
  - All classes of spatial queries and data are common.
• Computer-Aided Design/Manufacturing
  - Store spatial objects such as the surface of an airplane fuselage.
  - Range queries and spatial join queries are common.
• Multimedia Databases
  - Images, video, text, etc. stored and retrieved by content.
  - First converted to feature vector form; high dimensionality.
  - Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
• Spatial queries (GIS, CAD)
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
• Similarity queries (content-based retrieval)
  - Given a face, find the five most similar faces.
• Multidimensional range queries
  - 50 < age < 55 AND 80K < sal < 90K
Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
• Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dimensional box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
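A minimal sketch of this overlap search on a tiny hand-built R-tree; boxes are (xmin, ymin, xmax, ymax) tuples, and the tree layout and rids below are invented:

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, out):
    if node['leaf']:
        for box, rid in node['entries']:
            if overlaps(box, q):
                out.append(rid)            # candidate; may still need exact test
    else:
        for box, child in node['entries']:
            if overlaps(box, q):
                search(child, q, out)      # descend only into overlapping boxes
    return out

leaf1 = {'leaf': True, 'entries': [((0, 0, 2, 2), 'r1'), ((3, 3, 4, 4), 'r2')]}
leaf2 = {'leaf': True, 'entries': [((5, 5, 7, 7), 'r3')]}
root = {'leaf': False, 'entries': [((0, 0, 4, 4), leaf1), ((5, 5, 7, 7), leaf2)]}
print(search(root, (1, 1, 3, 3), []))      # ['r1', 'r2']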

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with smallest area.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
• The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data
• Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
• Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
  - In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

• Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
• Large volumes of transaction data are collected and stored for later analysis.
• Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data,
• processing time-consuming decision-support queries,
• providing high throughput for transaction processing.
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
• Data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
• Choose one or more attributes as the partitioning attributes.
• Choose a hash function h with range 0 … n − 1.
• Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning:
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, …, vn−2] is chosen.
• Let v be the partitioning attribute value of a tuple. Tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access (a small code sketch of the three techniques follows this comparison):
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 ≤ r.A < 25.
Round-robin:
Advantages:

• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering – tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access:

- Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks; retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: can look up a single disk, leaving others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so difficult to answer range queries.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted – an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. (n denotes the number of partitions to be constructed.)
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique based on histograms is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
- Assume uniform distribution within each range of the histogram.
- A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
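The vector-construction step above can be sketched as follows (a hypothetical illustration, assuming the partitioning attribute forms a key):

    def balanced_partition_vector(sorted_values, n):
        # scan in sorted order; after every 1/n-th of the relation,
        # record the partitioning attribute value of the next tuple
        step = len(sorted_values) // n
        return [sorted_values[i * step] for i in range(1, n)]

    values = sorted([17, 3, 9, 25, 11, 7, 21, 5])
    print(balanced_partition_vector(values, 4))   # -> [7, 11, 21]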


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:

  - all tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it received temporarily on disk Di. This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn−1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.
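A single-process simulation of the range-partitioning sort described above (a hypothetical sketch; a real system runs the local sorts on separate processors):

    import bisect

    def range_partition_sort(tuples, vector):
        parts = [[] for _ in range(len(vector) + 1)]
        for t in tuples:                        # redistribution step
            parts[bisect.bisect_right(vector, t)].append(t)
        for p in parts:                         # each "processor" sorts locally
            p.sort()
        return [t for p in parts for t in p]    # final merge is concatenation

    print(range_partition_sort([8, 2, 20, 5, 13], [5, 11]))  # [2, 5, 8, 13, 20]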

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

For equi-joins and natural joins it is possible to partition the two input relations across the processors and compute the join locally at each processor

Let r and s be the input relations, and suppose we want to compute their join on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
- e.g. non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used.

E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:

eg the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in
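A single-machine simulation of partitioned parallel hash join (a hypothetical sketch; h1 is modeled with Python's built-in hash, and h2 is folded into the per-processor dictionary):

    def partitioned_hash_join(r, s, n, key_r, key_s):
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in s:                                  # distribute build relation s via h1
            s_parts[hash(key_s(t)) % n].append(t)
        for t in r:                                  # redistribute probe relation r via h1
            r_parts[hash(key_r(t)) % n].append(t)
        result = []
        for i in range(n):                           # local build and probe at "processor" Pi
            table = {}
            for t in s_parts[i]:
                table.setdefault(key_s(t), []).append(t)
            for t in r_parts[i]:
                result += [(t, m) for m in table.get(key_r(t), [])]
        return result

    # e.g. joining r(sid, bid) with s(sid, name) on sid:
    print(partitioned_hash_join([(1, 101), (2, 102)], [(1, "a")], 4,
                                key_r=lambda t: t[0], key_s=lambda t: t[0]))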

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - and P2 the computation of temp2 = r3 ⋈ r4,
  - and P3 the computation of temp1 ⋈ temp2.
  - P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem.
- E.g. allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in a repeatable form. Also known as Rule-Based Systems, Expert Systems & Artificial Intelligence.
Rule-based systems are:
– Knowledge-based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some subject-specific knowledge of one or more human experts
– Made up of a set of rules that analyze user-supplied information about a specific class of problems
– Systems that utilize reasoning capabilities and draw conclusions

- Knowledge Engineering – building an expert system
- Knowledge Engineers – the people who build the system
- Knowledge Representation – the symbols used to represent the knowledge
- Factual Knowledge – knowledge of a particular task domain that is widely shared


- Heuristic Knowledge – more judgmental knowledge of performance in a task domain
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solves problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data
- Knowledge Engineer – decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain
- Interpreter – interprets the input data and draws a conclusion based on the user's responses


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
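A minimal forward-chaining sketch (the rules and facts are hypothetical illustrations, not from the source):

    rules = [({"fever", "rash"}, "measles_suspected"),
             ({"measles_suspected"}, "refer_to_specialist")]

    def forward_chain(facts, rules):
        facts = set(facts)
        changed = True
        while changed:                     # repeat the match-resolve-act cycle
            changed = False
            for conditions, conclusion in rules:
                if conditions <= facts and conclusion not in facts:
                    facts.add(conclusion)  # act: add the rule's conclusion
                    changed = True
        return facts

    print(forward_chain({"fever", "rash"}, rules))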

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook
Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems
Definition:
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic. Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
KR                      | Inference
Logic                   | Resolution principle
Production rules        | backward (top-down, goal directed); forward (bottom-up, data-driven)
Semantic nets & Frames  | Inheritance & advanced reasoning
Case-based Reasoning    | Similarity based
KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells
  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems
- Scratch
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first
Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/RI (computer).
Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions. E.g. diagnose reasons for engine failure.
(2) Interpretation – To provide an understanding of a situation from available information. E.g. DENDRAL.
(3) Prediction – To predict a future state from a set of data or observations. E.g. Drilling Advisor, PLANT.
(4) Design – To develop configurations that satisfy constraints of a design problem. E.g. XCON.
(5) Planning – Both short term & long term, in areas like project management, product development or financial planning. E.g. HRM.
(6) Monitoring – To check performance & flag exceptions. E.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – To collect and evaluate evidence and form opinions on that evidence. E.g. control a patient's treatment.
(8) Instruction – To train students and correct their performance. E.g. give medical students experience diagnosing illness.
(9) Debugging – To identify and prescribe remedies for malfunctions. E.g. identify errors in an automated teller machine network and ways to correct the errors.
Advantages
- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimic organisational structure with data
- local access and autonomy, without exclusion
- cheaper to create and easier to expand
- improved availability/reliability/performance, by removing reliance on a central site
- reduced communication overhead: most data access is local, less expensive, and performs better
- improved processing power: many machines handling the database, rather than a single server
Disadvantages:
- more complex to implement
- more costly to maintain
- security and integrity control standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version would have a write timestamp, and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
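A minimal sketch of the version-selection rule just described (hypothetical data, not from the source): each object keeps (write-timestamp, value) versions, and a reader at TS(Ti) sees the newest version that precedes its timestamp.

    versions = {"Object1": [(0, "Foo"), (2, "Hello")]}   # (write_ts, value)

    def mvcc_read(obj, ts):
        visible = [v for v in versions[obj] if v[0] <= ts]
        return max(visible)[1] if visible else None

    print(mvcc_read("Object1", 1))   # -> "Foo": the later write is invisible
    print(mvcc_read("Object1", 3))   # -> "Hello"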

At t1 the state of a DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t2   | "Foo"   | "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time | Object1 | Object2   | Object3


t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Hello" | "Bar"     |

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.

It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage mobility, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units may move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database, with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically left-deep plans only.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
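A small illustration (not from the source) of why plan choice matters, using the catalog numbers above; the buffer size B and the two cost formulas are standard textbook assumptions, not taken from this excerpt:

    M, pR = 1000, 100     # Reserves: pages, tuples per page
    N = 500               # Sailors: pages
    B = 100               # assumed buffer pool size, in pages

    simple_nl = M + pR * M * N                   # tuple-at-a-time nested loops
    block_nl = M + ((M + B - 3) // (B - 2)) * N  # block nested loops: ceil(M/(B-2)) outer passes
    print(simple_nl, block_nl)                   # 50001000 vs 6500 page I/Os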

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained – video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - Must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different, as represented in the database, may be considered the same by a user.
  - E.g. identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g. test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g. rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

(i) Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

(ii) Safe Datalog Programs


Consider the following program:
    Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

(iii) The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

(iv) Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
(b) Recursive queries with negation
    Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum i: tables defined in terms of tables in Stratum i and lower strata.
  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
    Selection:      Result(Y) :- R(X, Y), X = c.
    Projection:     Result(Y) :- R(X, Y).
    Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
    Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
    Union:          Result(X, Y) :- R(X, Y).
                    Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
    WITH
      Big2(Part) AS
        (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
      Small2(Part) AS
        ((SELECT A2.Part FROM Assembly A2)
         EXCEPT
         (SELECT B1.Part FROM Big2 B1))
    SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
    SELECT A.Part, SUM(A.Qty)
    FROM Assembly A
    GROUP BY A.Part

    NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The <…> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to use of <…> is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
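A minimal sketch of seminaive evaluation of the Comp program (the Assembly instance is hypothetical, and the Qty attribute is dropped for brevity):

    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire")}

    comp = set(assembly)          # nonrecursive rule seeds Comp
    delta = set(comp)             # delta_Comp: tuples new in the last iteration
    while delta:
        # recursive rule, restricted so one body fact comes from delta_Comp
        new = {(p, s2) for (p, s1) in assembly
                       for (d1, s2) in delta if d1 == s1}
        delta = new - comp        # keep only genuinely new inferences
        comp |= delta
    print(sorted(comp))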

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
    SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
    Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
    Magic_SL(spoke).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add "magic" filters: modify each rule in the program by adding a "magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the "magic" relations: we must create new rules to define the "magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 22: Database Technology

Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
Descriptive Patterns
Associations:
- Find books that are often bought by the same customers. If a new customer buys one such book, suggest that he buys the others too.
- Other similar applications: camera accessories, clothes, etc.
- Associations may also be used as a first step in detecting causation.

  - E.g. association between exposure to chemical X and cancer, or a new medicine and cardiac problems.
Clusters:
- E.g. typhoid cases were clustered in an area surrounding a contaminated well.
- Detection of clusters remains important in detecting epidemics.

Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of knowledge, such as educational level of applicant, salary of applicant, age of applicant, etc.
  - ∀ person P, P.degree = masters and P.income > 75000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25000 and P.income ≤ 75000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.

Decision Tree
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general:
  - Different branches of the tree could grow to different levels.


  - Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
- Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
  - More on choosing the partitioning attribute/condition shortly.
  - The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either:
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  Such a node is a leaf node.
- Otherwise the data at the node is partitioned further, by picking an attribute for partitioning data at the node.

Decision-Tree Construction Algorithm
    Procedure GrowTree(S)
        Partition(S)

    Procedure Partition(S)
        if (purity(S) > dp or |S| < ds) then return
        for each attribute A
            evaluate splits on attribute A
        use best split found (across all attributes) to partition S into S1, S2, ..., Sr
        for i = 1, 2, ..., r
            Partition(Si)
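A runnable Python rendering of the pseudocode above (a hypothetical sketch: purity is taken as the majority-class fraction, and splits are on attribute-value equality):

    from collections import Counter

    def majority(S):
        return Counter(c for _, c in S).most_common(1)[0][0]

    def purity(S):
        return Counter(c for _, c in S).most_common(1)[0][1] / len(S)

    def split(S, i):
        groups = {}
        for attrs, c in S:
            groups.setdefault(attrs[i], []).append((attrs, c))
        return groups

    def grow_tree(S, dp=0.9, ds=4):
        # S: list of (attributes, class). Returns a leaf label or (attr, subtrees).
        if purity(S) > dp or len(S) < ds:
            return majority(S)                          # leaf node
        best = max(range(len(S[0][0])), key=lambda i: sum(
            len(g) / len(S) * purity(g) for g in split(S, i).values()))
        groups = split(S, best)
        if len(groups) == 1:
            return majority(S)                          # no useful split remains
        return (best, {v: grow_tree(g, dp, ds) for v, g in groups.items()})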

Other Types of Classifiers
Further types of classifiers:
- Neural net classifiers
- Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


- Widely studied in AI; won't cover here.
Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
- computation of p(d | cj);
- precomputation of p(cj);
- p(d) can be ignored, since it is the same for all classes.
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) p(d2 | cj) … p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj.

- The histogram is computed from the training instances.
- Histograms on multiple attributes are more expensive to compute and store.
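A hypothetical sketch of a naive Bayesian classifier built from exactly such per-attribute histograms (the training data is illustrative, not from the source):

    from collections import Counter, defaultdict

    def train(instances):                     # instances: list of (attrs, cls)
        prior = Counter(c for _, c in instances)          # counts for p(cj)
        hist = defaultdict(Counter)                       # histograms for p(di|cj)
        for attrs, c in instances:
            for i, v in enumerate(attrs):
                hist[(i, v)][c] += 1
        return prior, hist

    def classify(attrs, prior, hist):
        def score(c):                         # p(cj) * prod_i p(di|cj); p(d) ignored
            p = prior[c]
            for i, v in enumerate(attrs):
                p *= hist[(i, v)][c] / prior[c]
            return p
        return max(prior, key=score)

    data = [(("masters", "high"), "excellent"), (("bachelors", "mid"), "good"),
            (("bachelors", "high"), "good"), (("masters", "high"), "excellent")]
    prior, hist = train(data)
    print(classify(("masters", "high"), prior, hist))    # -> "excellent"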

Regression
Regression deals with the prediction of a value, rather than a class.
- Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.

- One way is to infer coefficients a0, a1, a2, …, an such that Y = a0 + a1 X1 + a2 X2 + … + an Xn.

Finding such a linear polynomial is called linear regression In general the process of finding a curve that fits the data is also called curve fitting

The fit may only be approximate because of noise in the data or because the relationship is not exactly a polynomial

- Regression aims to find coefficients that give the best possible fit.
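A least-squares sketch for inferring the coefficients a0, a1, …, an (the data is hypothetical, and numpy is assumed to be available):

    import numpy as np

    X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0], [4.0, 3.0]])  # X1, X2
    y = np.array([5.0, 4.5, 6.0, 10.0])                             # observed Y
    A = np.column_stack([np.ones(len(X)), X])     # constant column for a0
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    print(coeffs)                                 # [a0, a1, a2], best-fit values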

Association Rules
Retail shops are often interested in associations between different items that people buy.
- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Association information can be used in several ways. E.g. when a customer buys a particular book, an online shop may suggest associated books.
Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


Left-hand side: antecedent; right-hand side: consequent.
- An association rule must have an associated population: the population consists of a set of instances. E.g. each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
- Rules have an associated support, as well as an associated confidence.
- Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g. suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
- Confidence is a measure of how often the consequent is true when the antecedent is true. E.g. the rule bread ⇒ milk has a confidence of 80 percent, if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g. support of 2% or greater).
Naive algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e. count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A − b ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A − b).
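A direct rendering of the naive algorithm (a hypothetical sketch: fine for tiny examples, far too slow for real data, which is why Apriori-style algorithms exist):

    from itertools import combinations

    def association_rules(transactions, min_support, min_confidence):
        def support(items):
            return sum(items <= t for t in transactions) / len(transactions)
        all_items = sorted(set().union(*transactions))
        rules = []
        for k in range(2, len(all_items) + 1):        # step 1: all item sets
            for A in map(frozenset, combinations(all_items, k)):
                if support(A) >= min_support:         # step 2: large itemsets
                    for b in A:                       # step 3: rules A - b => b
                        conf = support(A) / support(A - {b})
                        if conf >= min_confidence:
                            rules.append((set(A - {b}), b, conf))
        return rules

    baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}]
    print(association_rules(baskets, 0.5, 0.7))   # [({'milk'}, 'bread', 1.0)]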

Other Types of Associations
- Basic association rules have several limitations.
- Deviations from the expected probability are more interesting. E.g. if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2).
- We are interested in positive as well as negative correlations between sets of items:
  - positive correlation: co-occurrence is higher than predicted;
  - negative correlation: co-occurrence is lower than predicted.
- Sequence associations/correlations. E.g. whenever bonds go up, stock prices go down in 2 days.
- Deviations from temporal patterns. E.g. deviation from a steady growth; e.g. sales of winter wear go down in summer.
  - Not surprising: part of a known pattern. Look for deviation from the value predicted using past patterns.
Clustering

25

Clustering Intuitively finding clusters of points in the given data such that similar points lie in the same cluster

Can be formalized using distance metrics in several waysEg Group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized

Centroid point defined by taking average of coordinates in each dimensionAnother metric minimize average distance between every pair of points in a cluster

Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm.
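A minimal k-means-style sketch of centroid clustering, assuming 2-D points; the data, k, and fixed iteration count are illustrative assumptions.

import random

def kmeans(points, k, iterations=10):
    """Assign each point to the nearest centroid, then recompute centroids."""
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # nearest centroid by squared Euclidean distance
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        for i, cluster in enumerate(clusters):
            if cluster:  # centroid = average of coordinates in each dimension
                centroids[i] = (sum(p[0] for p in cluster) / len(cluster),
                                sum(p[1] for p in cluster) / len(cluster))
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)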

Hierarchical Clustering

Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster; repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)

Mobile Databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.

Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.

Some of the software problems - which may involve data management, transaction management, and database recovery - have their origins in distributed database systems.

In mobile computing the problems are more difficult, mainly:
- The limited and intermittent connectivity afforded by wireless communications.
- The limited life of the power supply (battery).
- The changing topology of the network.

In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.

Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.

MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.

Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.

Resource discovery and data routing by mobile units make computing in a MANET even more complicated.

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments

The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem.
- Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically.
- Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client.
- The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues

From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases

Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web Databases

A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications

Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much

it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question - automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
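As a rough sketch of the lookup behind such a search page (the table name, columns, data, and logging are illustrative assumptions):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT, price REAL, quantity INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                 [("widget", 9.99, 42), ("gadget", 24.50, 0)])

def search_inventory(keyword):
    """Return (item, price, in_stock) rows matching the customer's keyword;
    also record the search so unmet demand can be analyzed later."""
    rows = conn.execute(
        "SELECT item, price, quantity > 0 FROM inventory WHERE item LIKE ?",
        (f"%{keyword}%",)).fetchall()
    print(f"search log: keyword={keyword!r}, hits={len(rows)}")  # stand-in for real logging
    return rows

print(search_inventory("gadget"))  # -> [('gadget', 24.5, 0)]: known item, out of stock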

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three-Tier Architecture

4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)

Or

(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model

The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets

A database can be modeled as:

- a collection of entities,
- relationships among entities.

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes

An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.

Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)

Domain: the set of permitted values for each attribute.

Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes

Relationship Sets

A relationship is an association among several entities. Example:

Hayes (customer entity) - depositor (relationship set) - A-102 (account entity)

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:

{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.

Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets: the entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles

Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship

- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan).
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.

- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization

- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within the generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.

- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
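As a small illustration of this reduction, using the banking example above (exact column names and types are assumptions), a sketch with Python's sqlite3:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Strong entity set: a table with the same attributes
CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL);

-- Weak entity set: includes the identifying strong entity's primary key;
-- its own key is (strong key, discriminator)
CREATE TABLE payment (
    loan_number    TEXT REFERENCES loan,
    payment_number INTEGER,          -- discriminator
    payment_amount REAL,
    PRIMARY KEY (loan_number, payment_number)
);

-- Many-to-many relationship set: primary keys of both participating entity sets
CREATE TABLE customer (customer_name TEXT PRIMARY KEY, customer_city TEXT);
CREATE TABLE borrower (
    customer_name TEXT REFERENCES customer,
    loan_number   TEXT REFERENCES loan,
    PRIMARY KEY (customer_name, loan_number)
);
""")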

E-R Diagram for Banking Enterprise

Representing Generalization as Tables
- Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)

Or

(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database

Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- Use the term chronon, instead of point, to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events

- A single time-point event, e.g., a bank deposit.
- A series of point events can form time series data.

Duration events
- Associated with a specific time period. The time period is represented by a start time and an end time.

Transaction time
- The time when the information from a certain transaction becomes valid.

Bitemporal database
- A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning

Add to every tuple:
- a valid start time,
- a valid end time.
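A minimal sketch of tuple versioning, assuming an employee-salary table and a sentinel "end of time" value marking currently valid tuples (both assumptions):

import sqlite3

END_OF_TIME = "9999-12-31"  # sentinel for "currently valid" (assumption)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emp_salary (
    emp_id INTEGER, salary REAL,
    valid_start TEXT, valid_end TEXT)""")

def update_salary(emp_id, new_salary, today):
    """Close the current version and insert a new one, keeping history."""
    conn.execute("""UPDATE emp_salary SET valid_end = ?
                    WHERE emp_id = ? AND valid_end = ?""",
                 (today, emp_id, END_OF_TIME))
    conn.execute("INSERT INTO emp_salary VALUES (?, ?, ?, ?)",
                 (emp_id, new_salary, today, END_OF_TIME))

conn.execute("INSERT INTO emp_salary VALUES (1, 50000, '2009-01-01', ?)", (END_OF_TIME,))
update_salary(1, 55000, "2010-06-01")

# Temporal query: what was employee 1's salary on 2009-07-15?
print(conn.execute("""SELECT salary FROM emp_salary
                      WHERE emp_id = 1 AND valid_start <= ? AND ? < valid_end""",
                   ("2009-07-15", "2009-07-15")).fetchall())  # -> [(50000.0,)]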

Incorporating Time in Object-Oriented Databases Using Attribute Versioning

A single complex object stores all temporal changes of the object.

Time-varying attribute
- An attribute that changes over time, e.g., age.

Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.

Spatial Database

Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries

Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.

Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data

Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.

Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes

Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles of the conference venue.
- Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.

Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks

An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Inserts and deletes must be supported gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.

R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, ptr to child node>
  - The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q

Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.

2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
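A rough in-memory sketch of this search; the node layout (dicts with bounding boxes) is an illustrative assumption, not how a real disk-based R-tree is stored:

# A box is ((xlo, ylo), (xhi, yhi)); a node is a dict with "leaf" and "entries".
def overlaps(a, b):
    """True if boxes a and b intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def search(node, q, results):
    for box, child in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                results.append(child)       # child is a rid here
            else:
                search(child, q, results)   # child is a subtree node
    return results

leaf = {"leaf": True, "entries": [(((1, 1), (2, 2)), "rid-7"),
                                  (((5, 5), (6, 6)), "rid-9")]}
root = {"leaf": False, "entries": [(((1, 1), (6, 6)), leaf)]}
print(search(root, ((0, 0), (3, 3)), []))  # -> ['rid-7']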

Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data

- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases

Introduction

Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory, and disks have dropped sharply.

Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data,
- processing time-consuming decision-support queries,
- providing high throughput for transaction processing.

Parallelism in Databases

Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
- Data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.

Thus, databases naturally lend themselves to parallelism.

I/O Parallelism

Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.

Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):
- Round-robin: send the ith tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes, and choose a hash function h with range 0 … n - 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, …, vn-2]. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques

Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin

Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples, so retrieval work is well balanced between disks.

Range queries are difficult to process:
- No clustering: tuples are scattered across all disks.

Hash partitioning
- Good for sequential access:
  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
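A minimal sketch of the three partitioning functions; the disk count, hash function, and partition vector are illustrative assumptions.

from bisect import bisect_right

N_DISKS = 3

def round_robin(i_th_tuple_index):
    """Send the i-th inserted tuple to disk i mod n."""
    return i_th_tuple_index % N_DISKS

def hash_partition(attr_value):
    """Hash the partitioning attribute into the range 0 .. n-1."""
    return hash(attr_value) % N_DISKS

PARTITION_VECTOR = [5, 11]  # as in the text: v < 5 -> disk 0, 5..10 -> disk 1, v >= 11 -> disk 2

def range_partition(attr_value):
    """The number of vector entries <= the value gives the disk index."""
    return bisect_right(PARTITION_VECTOR, attr_value)

print(range_partition(2), range_partition(8), range_partition(20))  # -> 0 1 2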

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew

The distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer.

Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions, and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range Partitioning

To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.

An alternative technique, based on histograms, is used in practice.

Handling Skew Using Histograms

A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
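A sketch of building a balanced partition vector from sorted attribute values; the data and partition count are illustrative assumptions.

def build_partition_vector(values, n_partitions):
    """Scan the sorted values; after every 1/n-th of the relation,
    record the next value as a partition-vector entry."""
    values = sorted(values)
    step = len(values) // n_partitions
    return [values[i * step] for i in range(1, n_partitions)]

values = [1, 3, 3, 4, 7, 8, 9, 12, 15, 15, 18, 20]
print(build_partition_vector(values, 3))  # -> [7, 15]: splits the sorted data into thirds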

Interquery Parallelism

Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system, to support a larger number of transactions per second.

This is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol

An example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
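A rough single-machine sketch of that protocol, with a threading.Lock standing in for a shared/exclusive lock manager and a dict standing in for the disk (both are simplifying assumptions):

import threading

class SharedDiskPage:
    """Wraps one page with the lock -> read-from-disk -> write-back protocol."""
    def __init__(self, disk, page_id):
        self.disk, self.page_id = disk, page_id
        self.lock = threading.Lock()  # stand-in for a shared/exclusive lock manager

    def with_page(self, fn, exclusive=False):
        with self.lock:                            # 1. lock before reading/writing
            data = self.disk.read(self.page_id)    # 2. read from disk on locking
            new_data = fn(data)
            if exclusive and new_data is not None:
                self.disk.write(self.page_id, new_data)  # 3. write back before unlocking
            return new_data if new_data is not None else data

class Disk:
    def __init__(self): self.pages = {}
    def read(self, pid): return self.pages.get(pid, b"")
    def write(self, pid, data): self.pages[pid] = data

disk = Disk()
page = SharedDiskPage(disk, 1)
page.with_page(lambda d: b"hello", exclusive=True)
print(page.with_page(lambda d: None))  # -> b'hello'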

More complex protocols, with fewer disk reads/writes, exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism

Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.

The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

Parallel Sort

Range-Partitioning Sort
- Choose processors P0, …, Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the ith range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.
- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, …, Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, …, Pm-1 are concatenated to get the final result.

Parallel Join

The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, …, rn-1 and s0, s1, …, sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.

Fragment-and-Replicate Join

Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case: asymmetric fragment-and-replicate.
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
- E.g., say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s, and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor, based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further, using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.

- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si, to produce a partition of the final result of the hash join.
- Hash-join optimizations can be applied to the parallel case:

  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them out and reading them back in.
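A sketch that simulates the partitioned hash join sequentially, with lists standing in for per-processor partitions; the relation contents and hash functions are illustrative assumptions (Python's dict lookup plays the role of h2):

N_PROC = 3

def h1(key):            # distributes tuples across "processors"
    return hash(key) % N_PROC

# r(a, c) and s(a, b) join on attribute a (position 0 in each tuple)
r = [(1, "r1"), (2, "r2"), (4, "r4"), (2, "r2b")]
s = [(1, "s1"), (2, "s2"), (3, "s3")]

# Redistribute both relations with the same hash function h1
r_parts = [[t for t in r if h1(t[0]) == i] for i in range(N_PROC)]
s_parts = [[t for t in s if h1(t[0]) == i] for i in range(N_PROC)]

result = []
for i in range(N_PROC):           # each iteration = the work of one processor Pi
    build = {}                    # local build phase on si
    for a, b in s_parts[i]:
        build.setdefault(a, []).append(b)
    for a, c in r_parts[i]:       # local probe phase on ri
        for b in build.get(a, []):
            result.append((a, b, c))

print(sorted(result))  # -> [(1, 's1', 'r1'), (2, 's2', 'r2'), (2, 's2', 'r2b')]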

Parallel Nested-Loop Join

Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.

Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism

Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
  - The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization

Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.

When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.

Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases

A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.

Rule-based systems are specialized software that encapsulates "human intelligence", like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.

Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

Knowledge engineering: building an expert system.
Knowledge engineers: the people who build the system.
Knowledge representation: the symbols used to represent the knowledge.
Factual knowledge: knowledge of a particular task domain that is widely shared.

Heuristic knowledge: more judgmental knowledge of performance in a task domain.

Uses of rule-based systems:
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications

A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.

Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.

Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction

A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
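A rough sketch of a forward-chaining interpreter running this match-resolve-act cycle; the rules and facts are illustrative assumptions:

# Each rule: (left-hand side = set of required facts, right-hand side = fact to add)
rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "recommend_isolation"),
]

working_memory = {"has_fever", "has_rash"}

while True:
    # Match: find rules whose LHS is satisfied and whose RHS is new
    conflict_set = [r for r in rules
                    if r[0] <= working_memory and r[1] not in working_memory]
    if not conflict_set:
        break                      # no productions satisfied: the interpreter halts
    lhs, rhs = conflict_set[0]     # Conflict resolution: pick the first instantiation
    working_memory.add(rhs)        # Act: change the contents of working memory
    print(f"fired: {sorted(lhs)} -> {rhs}")

print(working_memory)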

Components of a Rule-Based System
- Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.

Problem-Solving Models
- Forward chaining: starts from a set of conditions and moves towards some conclusion.
- Backward chaining: starts with a list of goals and works backwards, to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases

Knowledge-based systems: definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge base + inference engine

KBS Architecture

The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.

Knowledge representation formalisms and their inference methods:
- Logic: resolution principle.
- Production rules: backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity-based.

KBS tools - shells
- Consist of a knowledge-acquisition (KA) tool, a database, and a development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;

  - restricted to specialized problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs:
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)

Typical tasks of KBS:
1. Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
2. Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
3. Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
4. Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
5. Planning: both short-term and long-term, in areas like project management, product development, or financial planning, e.g., HRM.
6. Monitoring: to check performance and flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
7. Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
8. Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
9. Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages
- Increase the availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.

6. (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organizational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability, reliability, and performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages:
- More complex to implement.
- More costly to maintain.
- Security and integrity control is harder.
- Standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding a newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows

the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
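A minimal, illustrative Python sketch of the timestamp-ordering checks described above follows (the MVStore class and its data layout are assumptions made for this example, not part of any particular DBMS):

# Minimal sketch of MVCC with timestamp ordering (illustrative only).
# Each object keeps a list of versions; each version carries the value,
# a write timestamp (the writer's transaction ID) and a read timestamp
# (the largest transaction ID that has read this version).
class AbortTransaction(Exception):
    pass

class MVStore:
    def __init__(self):
        self.versions = {}   # key -> list of [value, write_ts, read_ts]

    def read(self, key, ts):
        # A transaction reads the most recent version preceding its timestamp.
        visible = [v for v in self.versions.get(key, []) if v[1] <= ts]
        if not visible:
            return None
        version = max(visible, key=lambda v: v[1])
        version[2] = max(version[2], ts)       # advance the read timestamp
        return version[0]

    def write(self, key, value, ts):
        visible = [v for v in self.versions.get(key, []) if v[1] <= ts]
        if visible:
            version = max(visible, key=lambda v: v[1])
            # A later transaction already read the version this write would
            # supersede, so the writer must abort and restart (TS(Ti) < RTS(P)).
            if ts < version[2]:
                raise AbortTransaction(f"T{ts} aborted: read by T{version[2]}")
            if version[1] == ts:               # overwrite own version
                version[0] = value
                return
        # Otherwise create a new version with read/write timestamps = ts.
        self.versions.setdefault(key, []).append([value, ts, ts])

store = MVStore()
store.write("Object1", "Foo", 0)     # t0
store.write("Object1", "Hello", 1)   # t1
print(store.read("Object1", 1))      # "Hello" -- snapshot at t1
print(store.read("Object1", 0))      # "Foo"   -- older reader still sees t0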


Recovery

(b) Discuss client/server model and mobile databases. (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
The limited and intermittent connectivity afforded by wireless communications
The limited life of the power supply (battery)


The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates that range from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location

The server may not be able to reach a client


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients.

The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
Data distribution and replication
Transaction models
Query processing
Recovery and fault tolerance


Mobile database design
Location-based service
Division of labor
Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
A server cannot connect to a client at will.
Issues of wireless versus wired client connections and power conservation are generally immaterial.
A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
• Consider a set of alternative plans.
  - Must prune the search space; typically, only left-deep plans are considered.
• Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
• For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
• Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
• Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods, for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
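As a toy illustration of searching the space of left-deep plans, the Python sketch below enumerates all join orders and picks the cheapest one. The relations, page counts, and the crude cost/selectivity model are assumptions made for the example; this is not the System R cost model itself.

# Enumerate all left-deep join orders and pick the cheapest (illustrative).
from itertools import permutations

stats = {"Sailors": 500, "Reserves": 1000, "Boats": 100}  # hypothetical pages

def plan_cost(order):
    # Toy cost model: left-deep pipeline of nested-loops joins; the result
    # size of each join is crudely shrunk by a fixed selectivity factor.
    cost, left_pages = 0, stats[order[0]]
    for inner in order[1:]:
        cost += left_pages * stats[inner]
        left_pages = max(1, (left_pages * stats[inner]) // 10)
    return cost

best = min(permutations(stats), key=plan_cost)
print("cheapest left-deep order:", " JOIN ".join(best), "- cost:", plan_cost(best))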

8 (a) Discuss multimedia databases in detail. (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database
• rather than storing them outside the database, in a file system.

The database must handle large object representation.
Similarity-based retrieval must be provided by special index structures.
The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
• JPEG and GIF are the most widely used formats for image data.
• The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
MPEG-1: quality comparable to VHS video tape.
• Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
• Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives for audio encoding:


• MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

The most important types are video and audio data.
Characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained:
  - video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals
• and must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals – used to view multimedia data:
• PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user.
• E.g., identify similar designs for registering a new trademark.
Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
• E.g., test user input against stored commands.
Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.
• New: The answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: If T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


• Stratum 0: All database tables.
• Stratum I: Tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < … > is the usual restriction to deal with this problem; it is similar to negation.

15.4 Efficient evaluation of recursive queries
• Repeated inferences: When recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
• Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations. For example, the recursive Comp rule becomes:
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
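A minimal Python sketch of seminaive evaluation of the Comp program follows (the toy Assembly instance is an assumption made for the example):

# Seminaive fixpoint evaluation of Comp(Part, Subpt) over
# Assembly(Part, Subpart, Qty): each round joins only the delta
# (newly derived) tuples, so no inference is ever repeated.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}    # base case: direct subparts
delta = set(comp)                            # tuples from the last iteration
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
    new = {(p, s2) for (p, p2, q) in assembly
                   for (d1, s2) in delta if d1 == p2}
    delta = new - comp                       # keep only genuinely new tuples
    comp |= delta

print(sorted(comp))   # includes ('trike', 'spoke') and ('trike', 'tire')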


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add 'Magic' filters: Modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the 'Magic' relations: We must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



Different nodes at the same level may use different partitioning attributes.
Greedy top-down generation of decision trees:
Each internal node of the tree partitions the data into groups, based on a partitioning attribute and a partitioning condition for the node.
(More on choosing the partitioning attribute/condition shortly.)
The algorithm is greedy: the choice is made once and not revisited as more of the tree is constructed.
The data at a node is not partitioned further if either:
All (or most) of the items at the node belong to the same class, or
All attributes have been considered, and no further partitioning is possible.
Such a node is a leaf node. Otherwise, the data at the node is partitioned further by picking an attribute for partitioning data at the node.

Decision-Tree Construction Algorithm
Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    Use best split found (across all attributes) to partition S into S1, S2, …, Sr
    for i = 1, 2, …, r
        Partition(Si)
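The pseudocode above can be rendered as a small, illustrative Python sketch. The purity measure (majority-class fraction), the thresholds, and the restriction to binary splits on numeric attributes are all assumptions made for this example:

# Illustrative rendering of GrowTree/Partition in Python.
PURITY_THRESHOLD, MIN_SIZE = 0.95, 2         # play the roles of δp and δs

def purity(rows):
    labels = [c for _, c in rows]
    return max(labels.count(l) for l in set(labels)) / len(labels)

def majority(rows):
    labels = [c for _, c in rows]
    return max(set(labels), key=labels.count)

def best_split(rows, attr):
    best = None
    for v in {r[0][attr] for r in rows}:     # candidate thresholds
        left = [r for r in rows if r[0][attr] <= v]
        right = [r for r in rows if r[0][attr] > v]
        if not left or not right:
            continue
        score = (len(left) * purity(left) + len(right) * purity(right)) / len(rows)
        if best is None or score > best[0]:
            best = (score, attr, v, left, right)
    return best

def partition(rows, attrs):
    if purity(rows) >= PURITY_THRESHOLD or len(rows) < MIN_SIZE:
        return {"leaf": majority(rows)}      # stop: node becomes a leaf
    splits = [s for s in (best_split(rows, a) for a in attrs) if s]
    if not splits:
        return {"leaf": majority(rows)}
    _, attr, v, left, right = max(splits, key=lambda s: s[0])
    return {"attr": attr, "value": v,
            "le": partition(left, attrs), "gt": partition(right, attrs)}

data = [({"age": 25, "sal": 30}, "low"), ({"age": 30, "sal": 35}, "low"),
        ({"age": 45, "sal": 80}, "high"), ({"age": 50, "sal": 90}, "high")]
print(partition(data, ["age", "sal"]))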

Other Types of Classifiers
Further types of classifiers:
Neural net classifiers
Bayesian classifiers
Neural net classifiers use the training data to train artificial neural nets.


Widely studied in AI; we won't cover them here.
Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:
computation of p(d | cj)
precomputation of p(cj)
p(d) can be ignored, since it is the same for all classes

To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * … * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances.
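A minimal sketch of such a classifier over categorical attributes follows (the toy relation and attribute names are assumptions; simple frequency counts stand in for histograms, with no smoothing):

# Minimal naive Bayesian classifier (illustrative only).
from collections import Counter, defaultdict

train = [({"outlook": "sunny", "windy": "no"}, "play"),
         ({"outlook": "rain",  "windy": "yes"}, "stay"),
         ({"outlook": "sunny", "windy": "yes"}, "play"),
         ({"outlook": "rain",  "windy": "no"}, "stay")]

class_counts = Counter(c for _, c in train)              # for p(cj)
value_counts = defaultdict(Counter)                      # for p(di | cj)
for features, c in train:
    for attr, val in features.items():
        value_counts[(c, attr)][val] += 1

def classify(features):
    def score(c):
        p = class_counts[c] / len(train)                 # p(cj)
        for attr, val in features.items():               # product of p(di | cj)
            p *= value_counts[(c, attr)][val] / class_counts[c]
        return p
    return max(class_counts, key=score)                  # p(d) is ignored

print(classify({"outlook": "sunny", "windy": "no"}))     # -> "play"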

Histograms on multiple attributes are more expensive to compute and store.

Regression
Regression deals with the prediction of a value, rather than a class.

Given values for a set of variables X1, X2, …, Xn, we wish to predict the value of a variable Y.
One way is to infer coefficients a0, a1, a2, …, an such that
    Y = a0 + a1 X1 + a2 X2 + … + an Xn
Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.
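A minimal least-squares sketch for the single-variable case (the toy data below are assumptions for the example):

# Fit Y = a0 + a1*X by solving the normal equations directly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.2, 5.9, 8.1]            # roughly Y = 2X

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
     / sum((x - mean_x) ** 2 for x in xs)
a0 = mean_y - a1 * mean_x
print(f"Y = {a0:.2f} + {a1:.2f} X")  # the fitted coefficients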

The fit may only be approximate because of noise in the data or because the relationship is not exactly a polynomial

Regression aims to find coefficients that give the best possible fit.

Association Rules
Retail shops are often interested in associations between different items that people buy.

Someone who buys bread is quite likely also to buy milk. A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
Associations information can be used in several ways. E.g., when a customer buys a particular book, an online shop may suggest associated books.
Association rules: bread ⇒ milk; DB-Concepts, OS-Concepts ⇒ Networks.


Left-hand side: antecedent; right-hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.
Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2 percent or greater).
Naïve algorithm:

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A − {b} ⇒ b for each b ∈ A.
   - Support of rule = support(A).
   - Confidence of rule = support(A) / support(A − {b}).
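A naive sketch of this algorithm follows (the transactions and thresholds are assumptions made for the example):

# Enumerate itemsets, keep those with high support, then emit rules
# A - {b} => b whose confidence clears a threshold (illustrative only).
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "milk", "eggs"},
                {"bread", "butter"}, {"milk", "screwdriver"}]
min_support, min_confidence = 0.5, 0.6

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = set().union(*transactions)
large = [set(c) for k in range(1, len(items) + 1)
                for c in combinations(sorted(items), k)
                if support(set(c)) >= min_support]

for A in large:
    if len(A) < 2:
        continue
    for b in sorted(A):
        conf = support(A) / support(A - {b})
        if conf >= min_confidence:
            print(sorted(A - {b}), "=>", b,
                  f"(support {support(A):.2f}, confidence {conf:.2f})")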

Other Types of Associations
Basic association rules have several limitations.
Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
Positive correlation: co-occurrence is higher than predicted.
Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: E.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns: E.g., deviation from a steady growth; e.g., sales of winter wear go down in summer – not surprising, part of a known pattern. Look for deviation from the value predicted using past patterns.

Clustering


Clustering: Intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
Can be formalized using distance metrics in several ways. E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.
Centroid: the point defined by taking the average of coordinates in each dimension.
Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g., the Birch clustering algorithm (more shortly).
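A minimal k-means-style sketch of the centroid formalization above (the toy 2-D points are assumptions; this is an illustration, not the Birch algorithm):

# Assign points to the nearest of k centroids, recompute centroids, repeat.
import random

def kmeans(points, k, rounds=10):
    centroids = random.sample(points, k)
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign to nearest centroid
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                            (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        for i, c in enumerate(clusters):      # recompute centroids
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids, clusters

pts = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, 2))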

Hierarchical Clustering
Example from biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
Agglomerative clustering algorithms: Build small clusters, then cluster small clusters into bigger clusters, and so on.
Divisive clustering algorithms: Start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases. (16) (JUNE 2010)
Mobile databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems ndash which may involve data management transaction management and database recovery ndash have their origins in distributed database systems

In mobile computing the problems are more difficult, mainly because of:
The limited and intermittent connectivity afforded by wireless communications
The limited life of the power supply (battery)
The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.


It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates that range from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships –
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
Communication latency
Intermittent connectivity
Limited battery life
Changing client location

The server may not be able to reach a client.
A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


Proxies for unreachable components are added to the architecture.
For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients.
The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
Data distribution and replication
Transaction models
Query processing
Recovery and fault tolerance
Mobile database design
Location-based service
Division of labor
Security

Application: Intermittently Synchronized Databases


Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
A server cannot connect to a client at will.
Issues of wireless versus wired client connections and power conservation are generally immaterial.
A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain E-R Model in detail. (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOVDEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

a collection of entities
relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain – the set of permitted values for each attribute.
Attribute types:
Simple and composite attributes
Single-valued and multi-valued attributes
Null attributes
Derived attributes
Relationship Sets
A relationship is an association among several entities.
Example:

Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}
where (e1, e2, …, en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set For instance the depositor relationship set between entity sets customer and account may have the attribute access-date


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
• Role labels are optional, and are used to clarify semantics of the relationship.
Design Issues

• Use of entity sets vs. attributes: Choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: A possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: Although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
• We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (—), signifying "many", between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets, and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number – discriminator of the payment entity set.
• Primary key for payment – (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute Inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
• Use the term chronon instead of point to describe minimum granularity.

A calendar organizes time into different time units for convenience.
• Accommodates various calendars:
  Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
• The time when the information from a certain transaction becomes valid.
Bitemporal database
• Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
• Valid start time
• Valid end time
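A minimal sketch of tuple versioning follows (the employee relation and dates are assumptions made for the example):

# An update closes the current version (sets its valid end time)
# and appends a new version, so history is never overwritten.
import datetime

FOREVER = datetime.date.max
employees = [
    # (name, salary, valid_start, valid_end)
    ("Smith", 30000, datetime.date(2009, 1, 1), FOREVER),
]

def update_salary(name, new_salary, today):
    for i, (n, sal, start, end) in enumerate(employees):
        if n == name and end == FOREVER:           # the current version
            employees[i] = (n, sal, start, today)  # close it
            employees.append((n, new_salary, today, FOREVER))
            return

update_salary("Smith", 35000, datetime.date(2010, 6, 1))
for row in employees:
    print(row)   # both versions retained, each with its valid time period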

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial Join Queries
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.
Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dim box, ptr to child node>
• The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
• Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at root:
1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
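A minimal sketch of this overlap search (the 2-D boxes as (xmin, ymin, xmax, ymax) tuples and the dict-based node layout are assumptions made for the example):

def overlaps(a, b):
    # Two axis-aligned boxes overlap iff they overlap on every axis.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, results):
    for box, payload in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                results.append(payload)      # rid of a candidate object
            else:
                search(payload, q, results)  # payload is a child node
    return results

leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "rid1"), ((3, 3, 4, 4), "rid2")]}
leaf2 = {"leaf": True, "entries": [((6, 6, 9, 9), "rid3")]}
root = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((5, 5, 9, 9), leaf2)]}
print(search(root, (1, 1, 3, 3), []))        # -> ['rid1', 'rid2']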

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
• This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
• The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to split recursively.)

Splitting a Node during Insertion
The entries in node L plus the newly inserted entry must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- e.g. feature vectors in multimedia applications
- Very sparse

Nearest neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is no longer meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.
- Prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large.
- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel:
- data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – the tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
- Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes.


  Choose a hash function h with range 0 ... n − 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
- Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n − 1.
  e.g. with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g. 10 <= r.A < 25.
Round-robin
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process.
- No clustering – tuples are scattered across all disks.
Hash partitioning
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.
Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - the remaining disks are available for other queries;
  - good if the result tuples are from one to a few blocks;


  - if many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted. This is an example of execution skew.
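A minimal Python sketch of the three partitioning schemes described above (the function names and the test data are illustrative, not from any particular system):

def round_robin(i, n):
    """Send the i-th inserted tuple to disk i mod n."""
    return i % n

def hash_partition(value, n):
    """Hash the partitioning-attribute value into 0..n-1."""
    return hash(value) % n

def range_partition(value, vector):
    """vector = [v0, ..., v_{n-2}]; tuples with value < v0 go to disk 0,
    values in [v_i, v_{i+1}) go to disk i+1, values >= v_{n-2} to disk n-1."""
    for i, v in enumerate(vector):
        if value < v:
            return i
    return len(vector)

# With vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
assert [range_partition(v, [5, 11]) for v in (2, 8, 20)] == [0, 1, 2]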

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique based on histograms is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
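A small sketch of the cut-point construction, assuming the values are already sorted (or come from a sorted sample, as a histogram-based variant would use):

def balanced_vector(sorted_values, n):
    """Return n-1 cut points that split sorted_values into n near-equal parts."""
    size = len(sorted_values)
    return [sorted_values[(i * size) // n] for i in range(1, n)]

# e.g. 12 key values into n = 3 partitions -> 2 cut points
print(balanced_vector(sorted([7, 3, 9, 1, 4, 8, 2, 6, 5, 10, 12, 11]), 3))
# -> [5, 9]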


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - locking and logging must be coordinated by passing messages between processors;
  - data in a local buffer may have been updated at another processor;
  - cache coherency has to be maintained – reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n − 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes, and redistribute the relation using range partitioning:

- All tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it receives temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
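The scheme can be illustrated with a small Python simulation, where each "processor" is just a list of keys and the range vector is given (an illustrative sketch, not a real parallel implementation):

import heapq

def parallel_sort_merge(partitions, vector):
    """partitions: one list of keys per processor; vector: range cut points."""
    n_out = len(vector) + 1
    # Step 1: each processor sorts its local partition (data parallelism).
    runs = [sorted(p) for p in partitions]
    # Step 2: each sorted run is range-partitioned across the processors,
    # so every destination receives one sorted stream per source.
    streams = [[[] for _ in runs] for _ in range(n_out)]
    for src, run in enumerate(runs):
        for key in run:
            dest = sum(key >= v for v in vector)
            streams[dest][src].append(key)
    # Step 3: each processor merges its incoming sorted streams.
    merged = [list(heapq.merge(*streams[dest])) for dest in range(n_out)]
    # Step 4: concatenating P0..Pm-1 gives the final result, because
    # range partitioning orders the processors.
    return [k for part in merged for k in part]

assert parallel_sort_merge([[9, 1, 5], [4, 8, 2]], [3, 7]) == [1, 2, 4, 5, 8, 9]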

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used; r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes the join of ri and si; any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g. non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

- e.g. if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
Hash-join optimizations can be applied to the parallel case, e.g. the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them and reading them back in.

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 the computation of temp2 = r3 ⋈ r4,
  - and P3 the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g. allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.
Knowledge Engineering – building an expert system.
Knowledge Engineers – the people who build the system.
Knowledge Representation – the symbols used to represent the knowledge.
Factual Knowledge – knowledge of a particular task domain that is widely shared.


Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System
- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer – decides how to represent the experts' knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures. A small forward-chaining sketch follows.
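A minimal forward-chaining sketch in Python over simple string facts (an illustration of the match-resolve-act idea, not a production shell; the rules shown are hypothetical):

def forward_chain(facts, rules):
    """rules: list of (antecedents, consequent) pairs over string facts."""
    facts = set(facts)
    changed = True
    while changed:                       # repeat the match-resolve-act cycle
        changed = False
        for antecedents, consequent in rules:
            # Match: all left-hand-side conditions hold in working memory.
            if set(antecedents) <= facts and consequent not in facts:
                facts.add(consequent)    # Act: fire the rule
                changed = True
    return facts

rules = [(("fever", "rash"), "suspect-measles"),
         (("suspect-measles",), "order-blood-test")]
print(forward_chain({"fever", "rash"}, rules))
# -> {'fever', 'rash', 'suspect-measles', 'order-blood-test'}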

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine
KBS Architecture


- The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
- It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
- Logic: resolution principle.
- Production rules: backward (top-down, goal-directed) and forward (bottom-up, data-driven) chaining.
- Semantic nets & Frames: inheritance & advanced reasoning.
- Case-based Reasoning: similarity-based.
KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.
Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions. e.g. diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information. e.g. DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem. e.g. XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development or financial planning. e.g. HRM.
(6) Monitoring – to check performance & flag exceptions. e.g. a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence. e.g. control a patient's treatment.
(8) Instruction – to train students and correct their performance. e.g. give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions. e.g. identify errors in an automated teller machine network and ways to correct the errors.
Advantages:
- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.
Limitations:
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional databases:
- More complex to implement and more costly to maintain.
- Security and integrity control: standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

  Time  Object1   Object2
  t1    "Hello"   "Bar"
  t0    "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


  Time  Object1    Object2    Object3
  t2    "Hello"    (deleted)  "Foo-Bar"
  t1    "Hello"    "Bar"
  t0    "Foo"      "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
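A toy Python sketch of timestamp-based version selection (illustrative only; real systems also maintain read timestamps, abort/restart writers, and garbage-collect obsolete versions, as described above):

class MVStore:
    """Each key maps to a list of (write_ts, value) versions, oldest first."""
    def __init__(self):
        self.versions = {}

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, ts):
        # A reader at timestamp ts sees the most recent version whose
        # write timestamp does not exceed ts -- no locks needed.
        for wts, value in reversed(self.versions.get(key, [])):
            if wts <= ts:
                return value
        return None

db = MVStore()
db.write("Object1", "Foo", ts=0)
db.write("Object1", "Hello", ts=1)
print(db.read("Object1", ts=0))   # a reader at t0 still sees 'Foo'
print(db.read("Object1", ts=2))   # a later reader sees 'Hello'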

Recovery

(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
- Wireless Communications
  - The wireless medium has bandwidth significantly lower than that of a wired network.
  - The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.
- Client/Network Relationships
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
    - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
    - Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
  - The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
  - Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
    - Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
    - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
    - Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
- The server may not be able to reach a client:


  - A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
  - Client mobility also allows new applications that are location-based.
Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  - The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  - The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance


  - Mobile database design
  - Location-based service
  - Division of labor
  - Security
Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server, and in the case of many servers may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
1. Consider a set of alternative plans.
   - Must prune the search space; typically, left-deep plans only.
2. Estimate the cost of each plan that is considered.
   - Must estimate the size of the result and the cost for each plan node.
   - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
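Python generators give a compact sketch of this iterator (`pull') interface (the operator names and sample data are illustrative):

def scan(table):                       # leaf operator: produce base tuples
    for row in table:
        yield row

def select(pred, child):               # selection pulls from its child
    for row in child:
        if pred(row):
            yield row

def project(cols, child):              # projection pulls from its child
    for row in child:
        yield tuple(row[c] for c in cols)

sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5)]
plan = project((1,), select(lambda r: r[2] > 7, scan(sailors)))
print(list(plan))                      # [('lubber',)]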

Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: we want to find the best plan. Practically: we avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e. all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia Databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must also provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding:


  MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
The most important types are video and audio data. They are characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - They must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. e.g. identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. e.g. test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e. be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g. rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.
i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.
ii. Safe Datalog Programs


Consider the following program:
  ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.
iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e. D is the set of all sets of integers).
  - e.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
  Selection:      Result(Y) :- R(X, Y), X = c.
  Projection:     Result(Y) :- R(X, Y).
  Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
  Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
  Union:          Result(X, Y) :- R(X, Y).
                  Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
  WITH
    Big2(Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
    Small2(Part) AS
      ((SELECT A2.Part FROM Assembly A2)
       EXCEPT
       (SELECT B1.Part FROM Big2 B1))
  SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
  SELECT A.Part, SUM(A.Qty)
  FROM Assembly A
  GROUP BY A.Part
  NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


  Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
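A small Python sketch of seminaive evaluation for this Comp program (the base-case rule is Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty); the evaluation loop and data are illustrative):

def seminaive_comp(assembly):
    """assembly: list of (Part, Subpart, Qty) facts; returns Comp tuples."""
    edges = {(p, s) for (p, s, q) in assembly}   # base case of Comp
    comp, delta = set(edges), set(edges)
    while delta:
        # Join Assembly only with tuples derived in the previous iteration,
        # so no inference is ever repeated.
        new = {(p, s2) for (p, s1) in edges
                       for (s1b, s2) in delta if s1 == s1b}
        delta = new - comp          # keep only genuinely new inferences
        comp |= delta
    return comp

assembly = [("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)]
print(seminaive_comp(assembly))
# includes ('trike', 'spoke') and ('trike', 'tire') after one extra iteration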

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
  SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
  SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  Magic_SL(spoke).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 24: Database Technology

Widely studied in AI; we won't cover it here. Bayesian classifiers use Bayes' theorem, which says

  p(cj | d) = p(d | cj) p(cj) / p(d)

where p(cj | d) = probability of instance d being in class cj, p(d | cj) = probability of generating instance d given class cj, p(cj) = probability of occurrence of class cj, and p(d) = probability of instance d occurring.
Naive Bayesian Classifiers
Bayesian classifiers require:

- computation of p(d | cj);
- precomputation of p(cj);
- p(d) can be ignored, since it is the same for all classes.
To simplify the task, naive Bayesian classifiers assume attributes have independent distributions, and thereby estimate
  p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
Each of the p(di | cj) can be estimated from a histogram on di values for each class cj; the histogram is computed from the training instances. Histograms on multiple attributes are more expensive to compute and store.
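A minimal Python sketch of a naive Bayesian classifier over categorical attributes (Laplace smoothing is added so unseen values do not zero out the product; the data is made up):

from collections import Counter, defaultdict

def train(instances, labels):
    classes = Counter(labels)                 # counts for p(cj)
    hist = defaultdict(Counter)               # (class, attr index) -> value counts
    for d, c in zip(instances, labels):
        for i, v in enumerate(d):
            hist[(c, i)][v] += 1
    return classes, hist

def classify(d, classes, hist):
    n = sum(classes.values())
    def score(c):
        p = classes[c] / n                    # p(cj)
        for i, v in enumerate(d):
            # smoothed estimate of p(di | cj) from the per-class histogram
            p *= (hist[(c, i)][v] + 1) / (classes[c] + 2)
        return p
    return max(classes, key=score)            # p(d) cancels out

X = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool")]
y = ["play", "stay", "play"]
model = train(X, y)
print(classify(("sunny", "mild"), *model))    # -> 'play'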

Regression
Regression deals with the prediction of a value, rather than a class:

- Given values for a set of variables X1, X2, ..., Xn, we wish to predict the value of a variable Y.

One way is to infer coefficients a0, a1, ..., an such that
  Y = a0 + a1 X1 + a2 X2 + ... + an Xn

Finding such a linear polynomial is called linear regression. In general, the process of finding a curve that fits the data is also called curve fitting.

The fit may only be approximate, because of noise in the data, or because the relationship is not exactly a polynomial. Regression aims to find coefficients that give the best possible fit.
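For the one-variable case, the least-squares coefficients can be computed directly; a small sketch (the data is made up, and the same normal-equation idea generalizes to more variables):

def linear_fit(xs, ys):
    """Least-squares fit of Y = a0 + a1 * X1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

a0, a1 = linear_fit([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(f"Y = {a0:.2f} + {a1:.2f} X1")   # roughly Y = 0.15 + 1.94 X1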

Association Rules
Retail shops are often interested in associations between different items that people buy:

- Someone who buys bread is quite likely also to buy milk.
- A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.

Associations information can be used in several ways. e.g. when a customer buys a particular book, an online shop may suggest associated books.

Association rules:
  bread => milk
  DB-Concepts, OS-Concepts => Networks


Left hand side antecedent right hand side consequentAn association rule must have an associated population the population consists of a set of instances Eg each transaction (sale) at a shop is an instance and the set of all transactions is the populationRules have an associated support as well as an associated confidence Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the ruleEg suppose only 0001 percent of all purchases include milk and screwdrivers The support for the rule is milk THORN screwdrivers is lowWe usually want rules with a reasonably high supportRules with low support are usually not very usefulConfidence is a measure of how often the consequent is true when the antecedent is true Eg the rule bread THORN milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milkUsually want rules with reasonably large confidenceFinding Association RuleWe are generally only interested in association rules with reasonably high support (eg support of 2 or greater)Naiumlve algorithm

1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items

in the set). Large itemsets: sets with sufficiently high support.

3. Use large itemsets to generate association rules: from itemset A, generate the rule A − b ⇒ b for each b ∈ A.

4. Support of rule = support(A). Confidence of rule = support(A) / support(A − b).
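The naïve algorithm above can be sketched directly in Python (an illustrative toy implementation, exponential in the number of items, so only usable on tiny examples):

    from itertools import combinations

    def association_rules(transactions, min_support, min_confidence):
        """Enumerate all itemsets, keep the large ones, then emit rules
        A - {b} => b along with their support and confidence."""
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})

        def support(itemset):
            return sum(1 for t in transactions if itemset <= t) / n

        # Steps 1-2: find large itemsets
        large = {}
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                s = support(frozenset(combo))
                if s >= min_support:
                    large[frozenset(combo)] = s

        # Steps 3-4: from each large itemset A, generate A - {b} => b
        rules = []
        for A, supp_A in large.items():
            if len(A) < 2:
                continue
            for b in A:
                antecedent = A - {b}
                conf = supp_A / large[antecedent]  # support(A) / support(A - b)
                if conf >= min_confidence:
                    rules.append((set(antecedent), b, supp_A, conf))
        return rules

    txns = [frozenset(t) for t in
            [{"bread", "milk"}, {"bread", "milk", "eggs"}, {"bread"}, {"milk"}]]
    print(association_rules(txns, min_support=0.5, min_confidence=0.6))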

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting: e.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 × prob2). We are interested in positive as well as negative correlations between sets of items:
- Positive correlation: co-occurrence is higher than predicted.
- Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. The latter is not surprising, since it is part of a known pattern; look instead for deviation from the value predicted using past patterns.


Clustering
Intuitively, clustering means finding clusters of points in the given data such that similar points lie in the same cluster.

This can be formalized using distance metrics in several ways, e.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized.

Centroid: the point defined by taking the average of the coordinates in each dimension. Another metric: minimize the average distance between every pair of points in a cluster.
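For concreteness, a minimal k-means-style Python sketch of this centroid-based formalization (illustrative only; BIRCH and the other large-scale algorithms mentioned below work quite differently):

    import random

    def kmeans(points, k, iterations=20, seed=0):
        """Assign each point to its nearest centroid, then recompute each
        centroid as the average coordinate of its cluster."""
        random.seed(seed)
        dims = len(points[0])
        centroids = random.sample(points, k)
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                # squared Euclidean distance to each centroid
                i = min(range(k), key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                                    for d in range(dims)))
                clusters[i].append(p)
            for i, cl in enumerate(clusters):
                if cl:   # keep the old centroid if a cluster went empty
                    centroids[i] = tuple(sum(p[d] for p in cl) / len(cl)
                                         for d in range(dims))
        return centroids, clusters

    pts = [(1, 1), (1, 2), (9, 9), (8, 9), (9, 8), (2, 1)]
    centroids, clusters = kmeans(pts, k=2)
    print(centroids)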

Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm (more shortly).

Hierarchical Clustering
Example: biological classification. Other examples: Internet directory systems (e.g., Yahoo; more on this later).
Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile Databases

Recent advances in portable and wireless technology led to mobile computing a new dimension in data communication and processing

Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized

Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.

In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).

- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

- Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.


- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.

- Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients.

The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing. Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.

2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:

- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases


Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).

- A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):

- Issues of wireless versus wired client connections and power conservation are generally immaterial.

- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application that searches the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

- a collection of entities
- relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
    customer = (customer-name, social-security, customer-street, customer-city)
    account = (account-number, balance)
Domain -- the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity), depositor (relationship set), A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
    {(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set For instance the depositor relationship set between entity sets customer and account may have the attribute access-date


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one


  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number -- discriminator of the payment entity set.
- Primary key for payment -- (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).
Generalization

- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - treat the relationship as an abstract entity;
  - allows relationships between relationships;
  - abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - a customer takes out a loan;
  - an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization -- contributes to modularity in the design.
- The use of aggregation -- can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
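A small illustration of these reduction rules, using SQLite DDL from Python; the banking schema follows the notes, while the column types and exact names are assumptions for the example:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Strong entity set: a table with the same attributes.
    cur.execute("""CREATE TABLE loan (
        loan_number TEXT PRIMARY KEY,
        amount      REAL)""")

    # Weak entity set: includes the primary key of the identifying strong
    # entity set; its key is (loan_number, payment_number), so no separate
    # loan-payment table is needed.
    cur.execute("""CREATE TABLE payment (
        loan_number    TEXT REFERENCES loan(loan_number),
        payment_number INTEGER,
        payment_date   TEXT,
        payment_amount REAL,
        PRIMARY KEY (loan_number, payment_number))""")

    # Many-to-many relationship set: primary keys of both participating
    # entity sets.
    cur.execute("""CREATE TABLE customer (
        customer_name TEXT PRIMARY KEY,
        customer_city TEXT)""")
    cur.execute("""CREATE TABLE borrower (
        customer_name TEXT REFERENCES customer(customer_name),
        loan_number   TEXT REFERENCES loan(loan_number),
        PRIMARY KEY (customer_name, loan_number))""")
    conn.commit()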

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- The term chronon is used instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars:

Gregorian (western), Chinese, Islamic, etc.
Point events


- Single time point event, e.g., bank deposit.

- A series of point events can form time series data.
Duration events

- Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time
- The time when the information from a certain transaction becomes valid.

Bitemporal database
- Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:

- Valid start time
- Valid end time
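A minimal sketch of tuple versioning, assuming an SQLite table emp_vt with valid-time columns; the schema and helper function are illustrative, not a standard API:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Each row carries a [valid_start, valid_end) period; valid_end = NULL
    # marks the version that is currently valid.
    cur.execute("""CREATE TABLE emp_vt (
        emp_id      INTEGER,
        salary      REAL,
        valid_start TEXT,
        valid_end   TEXT)""")

    def update_salary(emp_id, new_salary, when):
        """Close the current version at time `when` and open a new one."""
        cur.execute("UPDATE emp_vt SET valid_end = ? "
                    "WHERE emp_id = ? AND valid_end IS NULL", (when, emp_id))
        cur.execute("INSERT INTO emp_vt VALUES (?, ?, ?, NULL)",
                    (emp_id, new_salary, when))

    cur.execute("INSERT INTO emp_vt VALUES (1, 50000, '2010-01-01', NULL)")
    update_salary(1, 55000, '2010-06-01')
    # Both versions are retained, so past states remain queryable:
    print(cur.execute("SELECT * FROM emp_vt WHERE emp_id = 1").fetchall())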

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations, constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries

- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.

Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)

- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.

Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
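A hedged Python sketch of this search procedure, representing nodes as plain dictionaries; the node layout is an assumption for illustration (a real R-tree also maintains balance and fill factors on insert):

    def overlaps(a, b):
        """True if n-dimensional boxes a and b intersect.
        A box is a list of (lo, hi) intervals, one per dimension."""
        return all(lo1 <= hi2 and lo2 <= hi1
                   for (lo1, hi1), (lo2, hi2) in zip(a, b))

    def search(node, q, results):
        """Recursive R-tree search for objects whose boxes overlap query box q."""
        if node["leaf"]:
            for box, rid in node["entries"]:
                if overlaps(box, q):     # rid might identify an overlapping object
                    results.append(rid)
        else:
            for box, child in node["entries"]:
                if overlaps(box, q):     # only descend into subtrees whose box overlaps q
                    search(child, q, results)
        return results

    # Two leaf nodes under one root; boxes are [(xlo, xhi), (ylo, yhi)].
    leaf1 = {"leaf": True, "entries": [([(0, 2), (0, 2)], "r1"),
                                       ([(3, 4), (3, 4)], "r2")]}
    leaf2 = {"leaf": True, "entries": [([(8, 9), (8, 9)], "r3")]}
    root = {"leaf": False, "entries": [([(0, 4), (0, 4)], leaf1),
                                       ([(8, 9), (8, 9)], leaf2)]}
    print(search(root, [(1, 3), (1, 3)], []))   # -> ['r1', 'r2']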

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - remove some (say 30% of the) entries and reinsert them into the tree;
  - this could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary: searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

- Prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:

- Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data
- processing time-consuming decision-support queries
- providing high throughput for transaction processing

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:

- data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra):

- this makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning: choose one or more attributes as the partitioning attributes;


choose a hash function h with range 0 … n − 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn−2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n − 1. E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 ≤ r.A < 25.
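Before evaluating them, here is an illustrative Python sketch of the three placement schemes, using in-memory lists to stand in for disks; the function names are assumptions:

    import bisect

    def round_robin(tuples, n):
        """Send the ith inserted tuple to disk i mod n."""
        disks = [[] for _ in range(n)]
        for i, t in enumerate(tuples):
            disks[i % n].append(t)
        return disks

    def hash_partition(tuples, n, attr=0):
        """Send each tuple to disk h(partitioning attribute) mod n."""
        disks = [[] for _ in range(n)]
        for t in tuples:
            disks[hash(t[attr]) % n].append(t)
        return disks

    def range_partition(tuples, vector, attr=0):
        """vector = [v0, ..., v_{n-2}]; v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1."""
        disks = [[] for _ in range(len(vector) + 1)]
        for t in tuples:
            disks[bisect.bisect_right(vector, t[attr])].append(t)
        return disks

    rows = [(2, "a"), (8, "b"), (20, "c")]
    # Matches the example above: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2
    print(range_partition(rows, [5, 11]))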

Round-robin
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process.
- No clustering: tuples are scattered across all disks.
Hash partitioning
- Good for sequential access:

  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted (an example of execution skew).
Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
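A small sketch of deriving a balanced partition vector from such a histogram, assuming a uniform distribution within each bucket; the function name and the equi-width histogram layout are illustrative:

    def partition_vector_from_histogram(bucket_bounds, bucket_counts, n):
        """bucket_bounds[i] is the upper bound of (equi-width) bucket i,
        bucket_counts[i] the number of tuples in it; returns n-1 cut points."""
        total = sum(bucket_counts)
        per_partition = total / n
        vector, filled, acc = [], 1, 0
        lower = bucket_bounds[0] - (bucket_bounds[1] - bucket_bounds[0])
        for upper, count in zip(bucket_bounds, bucket_counts):
            # interpolate cut points inside this bucket (uniformity assumption)
            while count and acc + count >= filled * per_partition \
                    and len(vector) < n - 1:
                needed = filled * per_partition - acc
                vector.append(lower + (upper - lower) * needed / count)
                filled += 1
            acc += count
            lower = upper
        return vector

    # 100 tuples in 4 buckets with upper bounds 10, 20, 30, 40
    print(partition_vector_from_histogram([10, 20, 30, 40], [50, 10, 10, 30], n=2))
    # -> [10.0]: half the tuples fall below 10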


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system to support a larger number of transactions per second.
It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks is important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, …, Pm, where m ≤ n − 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the ith range are sent to processor Pi, which stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.

- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, …, Dn−1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:

- The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm−1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, …, Pm−1 are concatenated to get the final result.
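An illustrative Python sketch of the local-sort-then-merge idea, with processes standing in for processors and lists for the disks D0, …, Dn−1; for brevity the final merge is done sequentially here, rather than being range-partitioned as described above:

    import heapq
    from concurrent.futures import ProcessPoolExecutor

    def parallel_sort_merge(partitions):
        """Sort each partition in parallel, then merge the sorted runs."""
        with ProcessPoolExecutor() as pool:
            sorted_runs = list(pool.map(sorted, partitions))  # local sorts in parallel
        return list(heapq.merge(*sorted_runs))                # merge of sorted runs

    if __name__ == "__main__":
        disks = [[9, 3, 7], [4, 1, 8], [6, 2, 5]]
        print(parallel_sort_merge(disks))   # [1, 2, ..., 9]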

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join r ⋈ s with join condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, …, rn−1 and s0, s1, …, sn−1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used: e.g., say s is small and r is large, and already partitioned; it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
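A toy single-process sketch of the partitioned hash-join: partitions are processed in a loop where a real system would use one processor per partition, and the join attribute is assumed to be in column 0:

    from collections import defaultdict

    def partitioned_hash_join(r, s, n):
        """Partition both relations on the join attribute with h1, then
        hash-join each partition pair locally (the dict plays the role of h2)."""
        h1 = lambda key: hash(key) % n
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in r:
            r_parts[h1(t[0])].append(t)
        for t in s:
            s_parts[h1(t[0])].append(t)

        result = []
        for ri, si in zip(r_parts, s_parts):  # each pair would run on one processor
            # build phase on the smaller relation s
            table = defaultdict(list)
            for t in si:
                table[t[0]].append(t)
            # probe phase with r
            for t in ri:
                for match in table[t[0]]:
                    result.append(t + match[1:])
        return result

    r = [(1, "r-a"), (2, "r-b"), (3, "r-c")]
    s = [(1, "s-x"), (3, "s-y")]
    print(partitioned_hash_join(r, s, n=2))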

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent parallelism:
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

- Knowledge Engineering: building an expert system.
- Knowledge Engineers: the people who build the system.
- Knowledge Representation: the symbols used to represent the knowledge.
- Factual Knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
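One possible reading of the match-resolve-act cycle as a Python sketch; the rule format and the first-match, fire-once conflict-resolution policy are simplifying assumptions (real production systems track instantiations and allow re-firing):

    def run_production_system(rules, working_memory):
        """Each rule is (name, condition, action); condition and action
        operate on the working-memory set."""
        fired = set()
        while True:
            # Match: collect all satisfied, not-yet-fired productions
            conflict_set = [r for r in rules
                            if r[0] not in fired and r[1](working_memory)]
            if not conflict_set:        # no productions satisfied: halt
                break
            # Conflict resolution: pick one instantiation (here: first match)
            name, _, action = conflict_set[0]
            # Act: may change the contents of working memory
            action(working_memory)
            fired.add(name)
        return working_memory

    rules = [
        ("fever+rash->measles?", lambda wm: {"fever", "rash"} <= wm,
         lambda wm: wm.add("suspect-measles")),
        ("measles?->refer", lambda wm: "suspect-measles" in wm,
         lambda wm: wm.add("refer-to-doctor")),
    ]
    print(run_production_system(rules, {"fever", "rash"}))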

Components of a Rule-Based System
- Set of Rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.

- Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.

- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems
Definition:
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic. Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
    KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & inference:
- Logic: resolution principle
- Production rules: backward (top-down, goal-directed) or forward (bottom-up, data-driven)
- Semantic nets & frames: inheritance & advanced reasoning
- Case-based reasoning: similarity based
KBS tools (shells)
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.

- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.

- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.

- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:

  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solutions.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimics the organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
- More complex to implement and more costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version would have a write timestamp, and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction (Ti) wants to write to an object, and if there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object would also have a read timestamp, and if a transaction Ti wanted to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID, reads without any locks.
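To make the snapshot behaviour concrete, the following is a minimal sketch assuming a PostgreSQL-style MVCC engine and a hypothetical accounts table (both are assumptions for illustration, not part of the question):

-- Session A: start a snapshot-based (repeatable read) transaction
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- suppose this reads 100

-- Session B, concurrently: write a new version and commit
BEGIN;
UPDATE accounts SET balance = 200 WHERE id = 1;  -- old version kept, marked obsolete
COMMIT;

-- Session A, again: its snapshot still shows the old version, with no read locks
SELECT balance FROM accounts WHERE id = 1;  -- still reads 100
COMMIT;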

Recovery

(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

- The server may not be able to reach a client. A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients.
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges.
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server, and in case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, only left-deep plans are considered.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra (RA) operators, with a choice of algorithm for each operator.
  - Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans!
- We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods), as the example below illustrates.
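For instance, a typical single-block query over this schema (a hypothetical example in the spirit of the System R discussion) is shown below; for it, the optimizer would consider the available access methods on Sailors and Reserves and both left-deep join orders:

SELECT S.sname
FROM   Sailors S, Reserves R
WHERE  S.sid = R.sid
  AND  R.bid = 100
  AND  S.rating > 5;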

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements.
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals, used to view multimedia data, are PCs or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g. identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g. test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries

15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance!
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. The recursive sketch below shows how SQL:1999 sidesteps this.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program! These results provide the basis for Datalog query processing. Users can understand a program in terms of "If the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics, and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first!
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;

15.3.3 Aggregate Operations
SELECT   A.Part, SUM(A.Qty)
FROM     Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 25: Database Technology

Left-hand side: antecedent; right-hand side: consequent.
An association rule must have an associated population; the population consists of a set of instances. E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
Rules have an associated support, as well as an associated confidence.
Support is a measure of what fraction of the population satisfies both the antecedent and the consequent of the rule. E.g., suppose only 0.001 percent of all purchases include milk and screwdrivers; the support for the rule milk ⇒ screwdrivers is low. We usually want rules with a reasonably high support; rules with low support are usually not very useful.
Confidence is a measure of how often the consequent is true when the antecedent is true. E.g., the rule bread ⇒ milk has a confidence of 80 percent if 80 percent of the purchases that include bread also include milk. We usually want rules with reasonably large confidence.

Finding Association Rules
We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).

Naïve algorithm:
1. Consider all possible sets of relevant items.
2. For each set, find its support (i.e., count how many transactions purchase all items in the set).
   - Large itemsets: sets with sufficiently high support.
3. Use large itemsets to generate association rules.
   - From itemset A, generate the rule A - b ⇒ b for each b ∈ A.
     - Support of rule = support(A).
     - Confidence of rule = support(A) / support(A - b).
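To ground these definitions, here is a hedged SQL sketch that computes support and confidence for the rule bread ⇒ milk, assuming a hypothetical Purchases(tid, item) table of transaction line items:

WITH
  total      AS (SELECT COUNT(DISTINCT tid) AS n FROM Purchases),
  bread      AS (SELECT DISTINCT tid FROM Purchases WHERE item = 'bread'),
  both_items AS (SELECT DISTINCT p.tid
                 FROM Purchases p JOIN bread b ON p.tid = b.tid
                 WHERE p.item = 'milk')
SELECT (SELECT COUNT(*) FROM both_items) * 1.0 / (SELECT n FROM total)         AS support,
       (SELECT COUNT(*) FROM both_items) * 1.0 / (SELECT COUNT(*) FROM bread)  AS confidence;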

Other Types of Associations
Basic association rules have several limitations. Deviations from the expected probability are more interesting. E.g., if many people purchase bread, and many people purchase cereal, quite a few would be expected to purchase both (prob1 * prob2). We are interested in positive as well as negative correlations between sets of items:
- Positive correlation: co-occurrence is higher than predicted.
- Negative correlation: co-occurrence is lower than predicted.
Sequence associations/correlations: e.g., whenever bonds go up, stock prices go down in 2 days.
Deviations from temporal patterns: e.g., deviation from a steady growth; e.g., sales of winter wear go down in summer. This is not surprising, as it is part of a known pattern; look for deviations from the value predicted using past patterns.

Clustering


Clustering: intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.
Can be formalized using distance metrics in several ways:
- E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized. (Centroid: the point defined by taking the average of coordinates in each dimension.)
- Another metric: minimize the average distance between every pair of points in a cluster.
Clustering has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets, e.g., the BIRCH clustering algorithm (more shortly).

Hierarchical Clustering
- Example from biological classification; other examples: Internet directory systems (e.g., Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)

Mobile databases

- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.


- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can move unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

- The server may not be able to reach a client. A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients.
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges.
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases


- Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server, and in case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

Web databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question - automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain E-R Model in detail (JUNE 2010)
Or
(b) (i) Explain E-R model with an example (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
- a collection of entities,
- relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes

Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify semantics of the relationship.

Design Issues
- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, a n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one

  - Many to many
- We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationships
- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship
- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
- These subgroupings become lower-level entity sets, that have attributes or participate in relationships that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.
- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
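As a hedged illustration of this reduction, using the customer, account and depositor sets from earlier in this answer (the column types are assumptions):

CREATE TABLE customer (
  customer_name   VARCHAR(30),
  social_security CHAR(11) PRIMARY KEY,
  customer_street VARCHAR(30),
  customer_city   VARCHAR(30)
);

CREATE TABLE account (
  account_number CHAR(10) PRIMARY KEY,
  balance        NUMERIC(12,2)
);

-- Many-to-many relationship set, with its descriptive attribute access-date
CREATE TABLE depositor (
  social_security CHAR(11) REFERENCES customer,
  account_number  CHAR(10) REFERENCES account,
  access_date     DATE,
  PRIMARY KEY (social_security, account_number)
);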

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars and Time Dimensions

- Time is considered an ordered sequence of points in some granularity.
  - Use the term chronon instead of point to describe minimum granularity.
- A calendar organizes time into different time units for convenience.
  - Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
- Point events
  - Single time point event, e.g., bank deposit.
  - A series of point events can form time series data.
- Duration events
  - Associated with a specific time period.
  - A time period is represented by a start time and an end time.
- Transaction time
  - The time when the information from a certain transaction becomes valid.
- Bitemporal database
  - Databases dealing with two time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
- Add to every tuple:
  - Valid start time
  - Valid end time
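A minimal sketch of tuple versioning, assuming a hypothetical emp_salary table (the far-future end date standing for "currently valid" is a common convention, not a standard requirement):

CREATE TABLE emp_salary (
  emp_id      INTEGER,
  salary      NUMERIC(10,2),
  valid_start DATE,
  valid_end   DATE  -- e.g. DATE '9999-12-31' while this version is current
);

-- Salary of employee 7 as it was valid on 2010-06-01
SELECT salary
FROM   emp_salary
WHERE  emp_id = 7
  AND  DATE '2010-06-01' BETWEEN valid_start AND valid_end;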

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


- A single complex object stores all temporal changes of the object.
- Time-varying attribute
  - An attribute that changes over time, e.g., age.
- Non-time-varying attribute
  - An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data
- Point Data
  - Points in a multidimensional space.
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  - E.g., feature vectors extracted from text.
- Region Data
  - Objects have spatial extent, with location and boundary.
  - The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
- Spatial Range Queries
  - Find all cities within 50 miles of Madison.
  - The query has an associated region (location, boundary).
  - The answer includes overlapping or contained data regions.
- Nearest-Neighbor Queries
  - Find the 10 cities nearest to Madison.
  - Results must be ordered by proximity.
- Spatial Join Queries
  - Find all cities near a lake.
  - Expensive; the join condition involves regions and proximity.
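Without a spatial index or extension, a range query is often approximated by a bounding box; a hedged sketch assuming a hypothetical cities(name, lat, lon) table (the coordinates are rough illustrative values; real systems use proper distance predicates):

-- Approximate 50-mile box around Madison (about 43.07 N, 89.40 W)
SELECT name
FROM   cities
WHERE  lat BETWEEN 42.3 AND 43.8
  AND  lon BETWEEN -90.4 AND -88.4;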

Applications of Spatial Data
- Geographic Information Systems (GIS)
  - E.g., ESRI's ArcInfo; OpenGIS Consortium.
  - Geospatial information.
  - All classes of spatial queries and data are common.
- Computer-Aided Design/Manufacturing
  - Store spatial objects, such as the surface of an airplane fuselage.
  - Range queries and spatial join queries are common.
- Multimedia Databases
  - Images, video, text, etc. stored and retrieved by content.
  - First converted to feature vector form; high dimensionality.
  - Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.
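The linearization above corresponds to an ordinary composite B+ tree index; a hedged sketch assuming a hypothetical employees(age, sal) table:

CREATE INDEX emp_age_sal ON employees (age, sal);
-- Index entries are sorted by age, then sal: <11,80>, <12,10>, <12,20>, <13,75>
SELECT * FROM employees WHERE age = 12 AND sal > 15;  -- served well: the age prefix is fixed
SELECT * FROM employees WHERE sal > 15;               -- served poorly: sal values are scattered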

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD)
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval)
  - Given a face, find the five most similar faces.
- Multidimensional range queries
  - 50 < age < 55 AND 80K < sal < 90K

Drawbacks
- An index based on spatial location is needed.
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
- The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q: start at the root.
1. If the current node is non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints

- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion: The entries in node L plus the newly inserted entry must be distributed between L1 and L2. Goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.
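The two heuristics just described - descend into the child needing least enlargement, and split so as to minimize total area - can be sketched as follows (Python; the exhaustive split is for illustration only, real R-trees use cheaper quadratic or linear split heuristics):

from itertools import combinations
from functools import reduce

def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def enlarge(b, n):  # bounding box of b extended to cover n
    return (min(b[0], n[0]), min(b[1], n[1]), max(b[2], n[2]), max(b[3], n[3]))

def best_child(children, new_box):
    # least enlargement; ties broken by smallest area
    return min(children, key=lambda c: (area(enlarge(c, new_box)) - area(c), area(c)))

def split(entries):
    # naive exhaustive split: minimize area(L1) + area(L2) over all 2-partitions
    best = None
    for k in range(1, len(entries) // 2 + 1):
        for left in combinations(entries, k):
            right = [e for e in entries if e not in left]
            cost = area(reduce(enlarge, left)) + area(reduce(enlarge, right))
            if best is None or cost < best[0]:
                best = (cost, list(left), right)
    return best[1], best[2]

print(split([(0, 0, 1, 1), (0, 0, 2, 2), (5, 5, 6, 6), (6, 6, 7, 7)]))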

R-Tree Variants: The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- Could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST: The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes,

including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g. feature vectors in multimedia applications.
- Very sparse.

Nearest neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
- In any given data set, it is advisable to empirically test contrast.
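A quick way to test the contrast claim empirically (a Python sketch on uniform random data; real feature vectors may behave differently):

import random, math

def contrast(dim, n=1000):
    pts = [[random.random() for _ in range(dim)] for _ in range(n)]
    q = [random.random() for _ in range(dim)]
    d = [math.dist(q, p) for p in pts]
    return max(d) / min(d)   # farthest / nearest distance ratio

for dim in (2, 10, 100, 1000):
    print(dim, round(contrast(dim), 2))   # ratio shrinks toward 1 as dim grows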

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

prices of microprocessors, memory and disks have dropped sharply. Databases are growing increasingly large:

large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel:

data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra),

which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning - tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning: Choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0...n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g. with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively - point queries. E.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range - range queries. E.g. 10 <= r.A < 25.
Round robin
Advantages:

Best suited for sequential scan of the entire relation on each query: all disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
Disadvantages:
Range queries are difficult to process - no clustering, tuples are scattered across all disks.
Hash partitioning
Good for sequential access:

assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
can look up a single disk, leaving others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so difficult to answer range queries.

Range partitioning
Provides data clustering by partitioning attribute value.
Good for sequential access.
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one to a few disks may need to be accessed:
- Remaining disks are available for other queries.
- Good if result tuples are from one to a few blocks.


- If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.
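The three partitioning techniques can be sketched as follows (Python; tuples are represented as dicts and each "disk" is just a list, which are simplifying assumptions):

import bisect

n = 3
disks = {i: [] for i in range(n)}

def round_robin(tuples):
    for i, t in enumerate(tuples):
        disks[i % n].append(t)            # ith insert goes to disk i mod n

def hash_part(tuples, attr):
    for t in tuples:
        disks[hash(t[attr]) % n].append(t)

def range_part(tuples, attr, vec):        # e.g. vec = [5, 11] for n = 3 disks
    for t in tuples:
        disks[bisect.bisect_right(vec, t[attr])].append(t)

range_part([{"A": 2}, {"A": 8}, {"A": 20}], "A", [5, 11])
print(disks)   # {0: [{'A': 2}], 1: [{'A': 8}], 2: [{'A': 20}]}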

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed, that is, some disks have many tuples while others may have fewer tuples.
Types of skew:
Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

Sort the relation on the partitioning attribute. Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. Here n denotes the number of partitions to be constructed. Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes. An alternative technique based on histograms is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume uniform distribution within each range of the histogram. The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
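A sketch of the sorted-scan approach described above (Python; assumes all attribute values fit in memory, which a real system avoids by scanning the sorted relation on disk):

def balanced_vector(values, n):
    s = sorted(values)
    step = len(s) // n
    return [s[i * step] for i in range(1, n)]   # n - 1 cut points

vals = [1, 3, 3, 4, 7, 8, 9, 15, 21, 30, 31, 40]
print(balanced_vector(vals, 3))   # [7, 21]: ~4 tuples land in each partition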


Interquery Parallelism
Queries/transactions execute in parallel with one another. Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second. Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. More complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
Intraoperation Parallelism - parallelize the execution of each individual operation in the query.
Interoperation Parallelism - execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

- All tuples that lie in the ith range are sent to processor Pi.
- Pi stores the tuples it received temporarily on disk Di.
- This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:

- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
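A toy simulation of range-partitioning sort (Python; the partitions are plain lists and the "processors" sort sequentially here, whereas a real system sorts them in parallel):

import bisect

def parallel_sort(tuples, vec):
    parts = [[] for _ in range(len(vec) + 1)]
    for t in tuples:                        # redistribution: Pi receives the ith range
        parts[bisect.bisect_right(vec, t)].append(t)
    for p in parts:                         # each Pi sorts its partition locally
        p.sort()
    return [t for p in parts for t in p]    # trivial final "merge": concatenation

print(parallel_sort([9, 1, 7, 22, 4, 15], vec=[5, 11]))   # [1, 4, 7, 9, 15, 22]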

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

For equi-joins and natural joins it is possible to partition the two input relations across the processors and compute the join locally at each processor

Let r and s be the input relations, and suppose we want to compute the join of r and s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g. non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case - asymmetric fragment-and-replicate:

- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. It usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

e.g. say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join: Assume s is smaller than r, and therefore s is chosen as the build relation. A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi. As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally. Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join. Hash-join optimizations can be applied to the parallel case:

e.g. the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
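A sketch of the partitioned parallel hash join (Python; the "processors" are simulated by lists, h1 is Python's built-in hash, and a plain dict stands in for the local h2 hash table):

def parallel_hash_join(r, s, key_r, key_s, n):
    h1 = lambda v: hash(v) % n
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:                              # distribute the (smaller) build relation s
        s_parts[h1(t[key_s])].append(t)
    for t in r:                              # then redistribute the larger relation r
        r_parts[h1(t[key_r])].append(t)
    out = []
    for i in range(n):                       # local build-and-probe at each Pi
        table = {}
        for t in s_parts[i]:
            table.setdefault(t[key_s], []).append(t)
        for t in r_parts[i]:
            for m in table.get(t[key_r], []):
                out.append((t, m))
    return out

r = [{"sid": 1, "bid": 10}, {"sid": 2, "bid": 20}]
s = [{"sid": 1, "name": "a"}, {"sid": 3, "name": "b"}]
print(parallel_hash_join(r, s, "sid", "sid", n=2))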

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperation Parallelism
Pipelined parallelism:

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
P2 be assigned the computation of temp2 = temp1 ⋈ r3,
and P3 be assigned the computation of temp2 ⋈ r4.
Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4.
Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
P2 be assigned the computation of temp2 = r3 ⋈ r4,
and P3 be assigned the computation of temp1 ⋈ temp2.
P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

- Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

- Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g. allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:

- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering - building an expert system
Knowledge Engineers - the people who build the system
Knowledge Representation - the symbols used to represent the knowledge
Factual Knowledge - knowledge of a particular task domain that is widely shared


Heuristic Knowledge - more judgmental knowledge of performance in a task domain
Uses of Rule-based Systems:
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System:

- Set of Rules - derived from the knowledge base and used by the interpreter to evaluate the inputted data.

- Knowledge Engineer - decides how to represent the experts' knowledge, and how to build the inference engine appropriately for the domain.

- Interpreter - interprets the inputted data and draws a conclusion based on the users' responses.


Problem-solving Models:
- Forward-chaining - starts from a set of conditions and moves towards some conclusion.
- Backward-chaining - starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
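A tiny forward-chaining sketch of the match-act loop (Python; the rules and facts are invented for illustration, and conflict resolution is simply "first match wins"):

rules = [
    ({"fever", "cough"}, "flu_suspected"),
    ({"flu_suspected", "high_risk"}, "refer_to_doctor"),
]

def forward_chain(facts):
    facts = set(facts)
    changed = True
    while changed:                     # an act phase may enable new matches
        changed = False
        for body, head in rules:       # match: rule bodies against working memory
            if body <= facts and head not in facts:
                facts.add(head)        # act: add the rule's conclusion
                changed = True
    return facts

print(forward_chain({"fever", "cough", "high_risk"}))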

Advantages:
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages:
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.

- Heuristic rather than algorithmic
- Heuristics in search vs in KBS; general vs domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
- Logic: Resolution principle
- Production rules: backward (top-down, goal directed); forward (bottom-up, data-driven)
- Semantic nets & Frames: Inheritance & advanced reasoning
- Case-based Reasoning: Similarity based
KBS tools - Shells: consist of KA Tool, Database & Development Interface
Inductive shells:

- simplest
- example cases represented as a matrix of known data (premises) and resulting effects
- matrix converted into a decision tree or IF-THEN statements
- examples selected for the tool

Rule-based shells:
- simple to complex
- IF-THEN rules

Hybrid shells:
- sophisticated & powerful
- support multiple KR paradigms & reasoning schemes
- generic tool, applicable to a wide range

Special purpose shells:
- specifically designed for particular types of problems


- restricted to specialised problems
Scratch (building from scratch):
- requires more time and effort
- no constraints like shells
- shells should be investigated first

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer)
Typical tasks of a KBS:
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions. E.g. diagnose reasons for engine failure.
(2) Interpretation - To provide an understanding of a situation from available information. E.g. DENDRAL.
(3) Prediction - To predict a future state from a set of data or observations. E.g. Drilling Advisor, PLANT.
(4) Design - To develop configurations that satisfy constraints of a design problem. E.g. XCON.
(5) Planning - Both short term & long term, in areas like project management, product development or financial planning. E.g. HRM.
(6) Monitoring - To check performance & flag exceptions. E.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control - To collect and evaluate evidence and form opinions on that evidence. E.g. control a patient's treatment.
(8) Instruction - To train students and correct their performance. E.g. give medical students experience diagnosing illness.
(9) Debugging - To identify and prescribe remedies for malfunctions. E.g. identify errors in an automated teller machine network and ways to correct the errors.
Advantages:

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations:
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimics organisational structure with data
- local access and autonomy without exclusion
- cheaper to create and easier to expand
- improved availability/reliability/performance by removing reliance on a central site
- reduced communication overhead: most data access is local, which is less expensive and performs better
- improved processing power: many machines handling the database, rather than a single server
Disadvantages compared with conventional (centralized) databases:
- more complex to implement
- more costly to maintain
- security and integrity control are harder
- standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously (at t0) Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      -
t0    "Foo"    "Bar"      -
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
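A minimal sketch of MVCC reads over the versions shown above (Python; the version lists and integer timestamps are simplified, and writes/aborts are omitted):

versions = {  # object -> list of (write_ts, value); None marks a delete
    "Object1": [(0, "Foo"), (1, "Hello")],
    "Object2": [(0, "Bar"), (2, None)],
    "Object3": [(2, "Foo-Bar")],
}

def read(obj, ts):
    # a reader at timestamp ts sees the newest version written at or before ts
    visible = [(w, v) for (w, v) in versions.get(obj, []) if w <= ts]
    return max(visible, key=lambda wv: wv[0])[1] if visible else None

print(read("Object2", ts=1))   # 'Bar'  (snapshot at t1, lock-free)
print(read("Object2", ts=2))   # None   (deleted as of t2)

Recovery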


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications
  - the limited life of the power supply (battery)
  - the changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,

seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:

- A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem.
- Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible.
- A server can simply broadcast data periodically.
- Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client.
- The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server, and in the case of many servers may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1. Consider a set of alternative plans.
   - Must prune the search space; typically left-deep plans only.
2. Must estimate the cost of each plan that is considered.
   - Must estimate size of result and cost for each plan node.
   - Key issues: statistics, indexes, operator implementations.
Plan: tree of RA ops, with a choice of algorithm for each op.


- Each operator is typically implemented using a "pull" interface: when an operator is "pulled" for the next output tuples, it "pulls" on its inputs and computes them.

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname added for variations.
Reserves:

- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:

- All available access methods, for each reln in the FROM clause.
- All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner reln in the FROM clause, considering all reln permutations and join methods).
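A toy sketch of this search space: enumerate left-deep join orders and keep the cheapest (Python; the cost function and the extra Boats relation are placeholder assumptions, not the System R estimator):

from itertools import permutations

def cheapest_left_deep(relations, join_cost):
    best = None
    for order in permutations(relations):     # all reln permutations
        plan, cost = order[0], 0
        for r in order[1:]:                   # inner reln joined one-at-a-time
            cost += join_cost(plan, r)
            plan = (plan, r)
        if best is None or cost < best[0]:
            best = (cost, plan)
    return best

sizes = {"Sailors": 500, "Reserves": 1000, "Boats": 10}
cost = lambda left, right: sizes.get(right, 100)   # stand-in per-join cost
print(cheapest_left_deep(list(sizes), cost))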

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database:
- rather than storing them outside the database, in a file system.

- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:

- JPEG and GIF: the most widely used formats for image data.
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.

- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.

- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data. Characterized by high data volumes and real-time information-delivery requirements:

- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained:

  - video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals:
- must guarantee end-to-end delivery rates.

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data:
- PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - E.g. identify similar designs for registering a new trademark.

- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g. test user input against stored commands.

- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: The answer table can appear in the From clause, i.e. be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92: Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:

- takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g. rim), need another join.
- To find all components, need as many joins as there are levels in the given instance.

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex Parts (Part) :- Assembly (Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g. double+({1, 2, 5}) = {2, 4, 10} UNION {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "If the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: If T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: All database tables.
  - Stratum I: Tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog:
Selection: Result(Y) :- R(X, Y), X=c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2 (Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part
The equivalent Datalog rule:
NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, must have all of the Assembly relation available.
- Stratification with respect to use of <...> is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
- Repeated inferences: When recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding Repeated Inferences:
- Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (Which means this inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).
The Magic Sets Algorithm
- Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' Filters: Modify each rule in the program by adding a `Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
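A sketch of seminaive evaluation for the Comp program (Python; the Qty attribute is dropped for brevity, and in-memory sets stand in for the delta tables):

assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire")}   # (Part, Subpart) pairs

comp = set(assembly)          # base case: direct subparts
delta = set(assembly)
while delta:
    # apply the recursive rule, joining Assembly only with the previous delta
    new = {(p, s2) for (p, s1) in assembly
                   for (d1, s2) in delta if s1 == d1}
    delta = new - comp        # only genuinely new facts feed the next round
    comp |= delta

print(sorted(comp))           # includes ('trike', 'spoke') and ('trike', 'tire')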


Page 26: Database Technology

Clustering: Intuitively, finding clusters of points in the given data, such that similar points lie in the same cluster.

Can be formalized using distance metrics in several ways. E.g. group points into k sets (for a given k), such that the average distance of points from the centroid of their assigned group is minimized.

- Centroid: the point defined by taking the average of coordinates in each dimension.
- Another metric: minimize the average distance between every pair of points in a cluster.

Has been studied extensively in statistics, but on small data sets. Data mining systems aim at clustering techniques that can handle very large data sets. E.g. the Birch clustering algorithm (more shortly).
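A minimal k-means sketch of the centroid-based formalization above (Python; a toy in-memory version, whereas Birch-style algorithms are designed to avoid holding all points in memory):

import random

def kmeans(points, k, iters=10):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                  # assign each point to nearest centroid
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        for i, cl in enumerate(clusters): # recompute centroid per dimension
            if cl:
                centroids[i] = tuple(sum(x) / len(cl) for x in zip(*cl))
    return centroids

pts = [(1, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
print(kmeans(pts, 2))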

Hierarchical Clustering
- Example from biological classification. Other examples: Internet directory systems (e.g. Yahoo; more on this later).
- Agglomerative clustering algorithms: build small clusters, then cluster small clusters into bigger clusters, and so on.
- Divisive clustering algorithms: start with all items in a single cluster, repeatedly refine (break) clusters into smaller ones.

(b) Discuss the features of Web databases and Mobile databases (16) (JUNE 2010)
Mobile databases

- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications
  - the limited life of the power supply (battery)
  - the changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.


It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications:
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.

Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the frequency spectrum which may cause interference with other appliances such as cordless telephones


Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships:
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage it, the entire mobility domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station. Mobile units may move unrestricted throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case


Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server. Mobile computing poses challenges for servers as well as clients.

The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients. One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it. Client mobility also poses many data management challenges.

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing. Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:

- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases


Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).

A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It is not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold, listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or

(b) (i) Explain E-R model with an example (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

- a collection of entities
- relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Examples:

Hayes (customer entity), depositor (relationship set), A-102 (account entity)

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, …, en) | e1 ∈ E1, e2 ∈ E2, …, en ∈ En}

where (e1, e2, …, en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

- Role labels are optional, and are used to clarify the semantics of the relationship.
Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets:

Possible guideline is to designate a relationship set to describe an action that occurs between entities

- Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many


- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization

- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables

- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.

- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
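To make the reduction concrete, here is a minimal sketch in Python/sqlite3 of the banking examples above (borrower as a many-to-many relationship set, payment as a weak entity set); the exact column types and the payment attributes are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Strong entity sets become tables with the same attributes.
CREATE TABLE customer (customer_name TEXT PRIMARY KEY,
                       customer_street TEXT, customer_city TEXT);
CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount REAL);

-- A many-to-many relationship set: primary keys of both participants,
-- plus any descriptive attributes.
CREATE TABLE borrower (
    customer_name TEXT REFERENCES customer,
    loan_number   TEXT REFERENCES loan,
    PRIMARY KEY (customer_name, loan_number)
);

-- A weak entity set: the strong set's primary key plus its discriminator.
CREATE TABLE payment (
    loan_number    TEXT REFERENCES loan,
    payment_number INTEGER,
    payment_amount REAL,
    PRIMARY KEY (loan_number, payment_number)
);
""")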

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the higher-level (generalized) entity set, account. Form a table for each lower-level entity set, including the primary key of the generalized entity set.
- Method 2: Form a table for each entity set that is generalized, with all local and inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- The term chronon is used instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
Point events


- A single time point event, e.g., a bank deposit.
- A series of point events can form time series data.
Duration events
- Associated with a specific time period; a time period is represented by its start time and end time.

Transaction time
- The time when the information from a certain transaction becomes current in the database.
Bitemporal database
- A database dealing with both time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
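A minimal sketch of tuple versioning with valid time, using Python's sqlite3; the emp_salary table and its columns are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emp_salary (
    emp_id      INTEGER,
    salary      REAL,
    valid_start TEXT,   -- valid start time
    valid_end   TEXT)   -- valid end time; '9999-12-31' means 'until changed'
""")
rows = [
    (7, 50000, "2008-01-01", "2009-06-30"),   # old version of the tuple
    (7, 55000, "2009-07-01", "9999-12-31"),   # current version
]
conn.executemany("INSERT INTO emp_salary VALUES (?,?,?,?)", rows)

# Valid-time point query: what was employee 7's salary on 2009-01-15?
cur = conn.execute("""SELECT salary FROM emp_salary
    WHERE emp_id = ? AND valid_start <= ? AND ? <= valid_end""",
    (7, "2009-01-15", "2009-01-15"))
print(cur.fetchone())   # (50000.0,)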

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial Join Queries
- Find all cities near a lake.
- Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
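A small sketch of this linearization effect, using the entries above; it only illustrates why a condition on sal alone cannot use the <age, sal> order.

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]
entries.sort()                      # the order an <age, sal> B+ tree stores them in

# Range query on the leading attribute: one contiguous run of entries.
print([e for e in entries if 12 <= e[0] <= 13])   # [(12, 10), (12, 20), (13, 75)]

# Query on sal alone: matches are scattered across the sorted order,
# so the whole index must be scanned.
print([e for e in entries if e[1] >= 70])         # [(11, 80), (13, 75)]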

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles of the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks: an index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dimensional box, ptr to child node>
- The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
- Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
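A minimal sketch of this overlap search in Python; the node representation (plain dicts, with 2-D boxes as (xlo, ylo, xhi, yhi)) is an assumption of the example, not a prescribed R-tree layout.

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, hits):
    if node["leaf"]:
        for box, rid in node["entries"]:      # step 2: candidate objects
            if overlaps(box, q):
                hits.append(rid)
    else:
        for box, child in node["entries"]:    # step 1: descend overlapping subtrees
            if overlaps(box, q):
                search(child, q, hits)
    return hits

leaf1 = {"leaf": True,  "entries": [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")]}
leaf2 = {"leaf": True,  "entries": [((8, 8, 9, 9), "r3")]}
root  = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((8, 8, 9, 9), leaf2)]}
print(search(root, (1, 1, 3, 3), []))         # ['r1', 'r2']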

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

- But why not use convex polygons to approximate query regions more accurately?
  - This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- Goal: reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.
Nearest-neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.

- Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
- large volumes of transaction data are collected and stored for later analysis;
- multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0…n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
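A minimal sketch of the three partitioning functions in Python; the number of disks and the partitioning vector follow the example above, and bisect provides the range lookup.

import bisect

n_disks = 3

def round_robin(i_th):                   # i-th tuple inserted into the relation
    return i_th % n_disks

def hash_partition(key):
    return hash(key) % n_disks

def range_partition(key, vector=(5, 11)):
    # vector [5, 11]: key < 5 -> disk 0, 5 <= key < 11 -> disk 1, key >= 11 -> disk 2
    return bisect.bisect_right(vector, key)

print(range_partition(2), range_partition(8), range_partition(20))   # 0 1 2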

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.
Round-robin
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.

Range queries are difficult to process:
- No clustering: tuples are scattered across all disks.
Hash partitioning
Good for sequential access:

- Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.
- Retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
- Can look up a single disk, leaving the others available for answering other queries.
- An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so it is difficult to answer range queries.
Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed, that is, some disks have many tuples while others have fewer.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning

attributes.
- An alternative technique based on histograms is used in practice.
Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, …, Pm, where m <= n-1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it receives temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, …, Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, …, Pm-1 are concatenated to get the final result.
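A single-machine sketch of the range-partitioning sort, with multiprocessing standing in for the processors; the data and partition vector are made up.

from multiprocessing import Pool

data = [27, 3, 55, 14, 91, 8, 40, 62, 5, 73]
vector = [20, 50]                 # range-partition vector, m = 2 entries

def partition_of(v):
    # processor 0: v < 20, processor 1: 20 <= v < 50, processor 2: v >= 50
    return sum(v >= cut for cut in vector)

if __name__ == "__main__":
    partitions = [[], [], []]
    for v in data:
        partitions[partition_of(v)].append(v)
    with Pool(3) as pool:         # each "processor" sorts its own partition
        runs = pool.map(sorted, partitions)
    # The final merge is trivial: concatenate the sorted runs in range order.
    print([v for run in runs for v in run])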

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor are collected together to produce the final result.
Partitioned Join

- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

- Let r and s be the input relations, and suppose we want to compute their join on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, …, rn-1 and s0, s1, …, sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.

- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate.

- One of the relations, say r, is partitioned; any partitioning technique can be used. The other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case, e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them and reading them back in.
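A single-machine sketch of the partitioned parallel hash-join; h1 routes tuples to hypothetical processors, the sample relations are made up, and the local h2 repartitioning is collapsed into one build table for brevity.

from collections import defaultdict

N = 3                                      # number of processors
h1 = lambda k: hash(k) % N                 # routes a join-key to a processor

r = [(1, "r1"), (2, "r2"), (4, "r3"), (2, "r4")]   # join attribute r.A = t[0]
s = [(2, "s1"), (4, "s2"), (5, "s3")]              # join attribute s.B = t[0]

# Redistribute both relations on their join attributes with h1.
r_parts = defaultdict(list); s_parts = defaultdict(list)
for t in r: r_parts[h1(t[0])].append(t)
for t in s: s_parts[h1(t[0])].append(t)

# Each processor joins its local partitions: build on s, probe with r.
result = []
for i in range(N):
    build = defaultdict(list)
    for t in s_parts[i]:
        build[t[0]].append(t)
    for t in r_parts[i]:
        for m in build.get(t[0], []):
            result.append((t, m))
print(result)                              # pairs with matching keys 2 and 4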

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - Let P2 be assigned the computation of temp2 = temp1 ⋈ r3.
  - Let P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - Let P2 be assigned the computation of temp2 = r3 ⋈ r4.
  - Let P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling the execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-Based Systems
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence", like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:

- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

- Knowledge Engineering: building an expert system
- Knowledge Engineers: the people who build the system
- Knowledge Representation: the symbols used to represent the knowledge
- Factual Knowledge: knowledge of a particular task domain that is widely shared


- Heuristic Knowledge: more judgmental knowledge of performance in a task domain
Uses of Rule-Based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules: derived from the knowledge base and used by the interpreter to evaluate the input data.
- Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the input data and draws a conclusion based on the user's responses.
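A minimal sketch of the match-resolve-act cycle as a forward-chaining loop in Python; the toy diagnosis rules and facts are invented for the example.

rules = [
    ({"fever", "cough"},           "flu-suspected"),
    ({"flu-suspected", "fatigue"}, "rest-advised"),
]

working_memory = {"fever", "cough", "fatigue"}

while True:
    # Match: collect all rules whose left-hand side is satisfied
    # and whose conclusion is not yet in working memory.
    conflict_set = [(lhs, rhs) for lhs, rhs in rules
                    if lhs <= working_memory and rhs not in working_memory]
    if not conflict_set:            # no productions satisfied: halt
        break
    # Conflict-resolution: here, simply pick the first instantiation.
    lhs, rhs = conflict_set[0]
    # Act: fire the production, changing working memory.
    working_memory.add(rhs)

print(working_memory)   # includes 'flu-suspected' and 'rest-advised'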


Problem-Solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack the human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-Based Systems: Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

- Heuristic rather than algorithmic
- Heuristics in search vs. in KBS: general vs. domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.
Knowledge Representation Formalisms & Inference
- Logic: resolution principle
- Production rules: backward (top-down, goal-directed) or forward (bottom-up, data-driven) chaining
- Semantic nets & frames: inheritance & advanced reasoning
- Case-based reasoning: similarity-based
KBS Tools: Shells
- Consist of a KA tool, database & development interface
- Inductive shells:

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells:
  - simple to complex
  - IF-THEN rules
- Hybrid shells:
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special-purpose shells:
  - specifically designed for particular types of problems


  - restricted to specialised problems
- From scratch:
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimics organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, less expensive, and performs better
- Improved processing power: many machines handling the database, rather than a single server
Disadvantages:
- More complex to implement; more costly to maintain
- Security and integrity control standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time | Object1 | Object2   | Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).
  - The changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
- Mobile units must be unrestricted throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
- The server may not be able to reach a client. A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  - The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  - The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance
  - Mobile database design
  - Location-based service
  - Division of labor
  - Security

Application: Intermittently Synchronized Databases
- Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database (a small sketch of this exchange follows this list).
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.
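The exchange-updates pattern above can be sketched in a few lines. This is a hedged illustration: all names (SyncClient, SyncServer, exchange) are invented for the example, and real products add conflict resolution on top:

class SyncServer:
    def __init__(self):
        self.db = {}
        self.log = []                       # committed (key, value) updates

    def exchange(self, client_updates, since):
        # install the client's pending updates
        for key, value in client_updates:
            self.db[key] = value
            self.log.append((key, value))
        # return the batch the client has not seen yet, plus a new cursor
        return self.log[since:], len(self.log)

class SyncClient:
    def __init__(self):
        self.local_db = {}
        self.pending = []                   # updates made while disconnected
        self.seen = 0                       # position in the server's log

    def update(self, key, value):           # works offline
        self.local_db[key] = value
        self.pending.append((key, value))

    def synchronize(self, server):          # the client connects at will
        batch, self.seen = server.exchange(self.pending, self.seen)
        for key, value in batch:            # install the received batch
            self.local_db[key] = value
        self.pending.clear()

server, client = SyncServer(), SyncClient()
client.update("order#7", "shipped")
client.synchronize(server)                  # pushes the update, pulls others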

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, only left-deep plans are considered.
  - Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.
  - Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: find the best plan. Practically: avoid the worst plans.
- We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods, for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
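To make the search concrete, here is a toy sketch of left-deep plan enumeration over the example schema, in the spirit of System R. The nested-loops cost formula and the 10% join selectivity are assumptions for illustration only:

from itertools import permutations

pages = {"Reserves": 1000, "Sailors": 500}   # from the example schema above

def nl_join_cost(outer_pages, inner_pages):
    # simple page-oriented nested-loops estimate
    return outer_pages + outer_pages * inner_pages

def best_left_deep_plan(relations):
    best_cost, best_order = None, None
    for order in permutations(relations):    # all left-deep join orders
        cost, result_pages = 0, pages[order[0]]
        for inner in order[1:]:
            cost += nl_join_cost(result_pages, pages[inner])
            result_pages = max(1, result_pages // 10)  # assumed selectivity
        if best_cost is None or cost < best_cost:
            best_cost, best_order = cost, order
    return best_cost, best_order

print(best_left_deep_plan(["Reserves", "Sailors"]))
# picks Sailors as the outer relation: 500 + 500*1000 < 1000 + 1000*500

A real optimizer prunes with dynamic programming rather than enumerating all permutations, and estimates sizes from catalog statistics; this sketch only illustrates the shape of the search.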

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed, steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF: the most widely used formats for image data.
  - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:
  - MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - They must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals, used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries?
  - Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.
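As a sketch of the SQL:1999-style recursion just mentioned, the following runs with Python's built-in sqlite3 module (whose SQL dialect accepts WITH RECURSIVE); the sample Assembly rows are assumptions based on the trike example used later in this section:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Assembly(Part TEXT, Subpart TEXT, Qty INTEGER);
    INSERT INTO Assembly VALUES
        ('trike', 'wheel', 3), ('trike', 'frame', 1),
        ('wheel', 'spoke', 2), ('wheel', 'tire', 1),
        ('tire',  'rim',   1), ('tire',  'tube', 1);
""")

# Comp(Part, Subpart): all direct and transitive components of a part.
rows = conn.execute("""
    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT Part, Subpart FROM Assembly
        UNION
        SELECT A.Part, C.Subpart
        FROM Assembly A JOIN Comp C ON A.Subpart = C.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike';
""").fetchall()
print(rows)   # wheel, frame, spoke, tire, rim, tube: arbitrarily deep

No fixed number of self-joins could compute this for every instance, which is exactly the limitation of SQL-92 discussed below.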

15.1 Introduction to Recursive Queries


15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - That takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


- Consider the following program:
  Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
  According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or on not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means the inference could not have been carried out in earlier iterations). A sketch appears at the end of this section.
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:
  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  Magic_SL(spoke).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- The Magic Sets program rewriting algorithm can be summarized as follows:
  - Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
  - Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
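Here is a minimal sketch of the seminaive strategy for the Comp program above, with the delta table made explicit; plain Python sets stand in for relations, and the sample facts are assumptions:

assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}   # base (nonrecursive) rule
delta = set(comp)                           # delta_Comp after iteration 0
while delta:
    # join Assembly only with the facts derived in the previous iteration
    derived = {(p, s)
               for (p, p2, q) in assembly
               for (p2d, s) in delta
               if p2 == p2d}
    delta = derived - comp                  # keep only genuinely new facts
    comp |= delta                           # update Comp; delta is swapped

print(sorted(comp))   # trike reaches spoke and tire transitively

Because every recursive application uses at least one delta fact, no inference is repeated across iterations, which is the whole point of the rewrite.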


Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.

The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.

Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock, and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three-Tier Architecture


4. (a) With an example, explain the E-R model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
- a collection of entities,
- relationships among entities.
An entity is an object that exists and is distinguishable from other objects.
- Example: a specific person, company, event, or plant.
An entity set is a set of entities of the same type that share the same properties.
- Example: the set of all persons, companies, trees, or holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes

Relationship Sets
A relationship is an association among several entities.
Example: the customer entity Hayes is related to the account entity A-102 via the relationship set depositor.
A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
- Example: (Hayes, A-102) ∈ depositor
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-to-One Relationship
- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-to-Many and Many-to-One Relationships
- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-to-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan).
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: the discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).

Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- The relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or an entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: the aggregate entity set can be treated as a single unit without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
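As an illustration of these reduction rules, the following sketch creates the corresponding tables with Python's sqlite3 module; the attribute names are assumptions based on the banking examples above:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- strong entity sets: tables with the same attributes
    CREATE TABLE loan(loan_number INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE customer(customer_id INTEGER PRIMARY KEY,
                          customer_name TEXT);

    -- weak entity set: includes the identifying strong entity's primary key;
    -- payment_number is the discriminator (partial key)
    CREATE TABLE payment(
        loan_number    INTEGER REFERENCES loan,
        payment_number INTEGER,
        payment_amount REAL,
        PRIMARY KEY (loan_number, payment_number));

    -- many-to-many relationship set: primary keys of both participants
    CREATE TABLE borrower(
        customer_id INTEGER REFERENCES customer,
        loan_number INTEGER REFERENCES loan,
        PRIMARY KEY (customer_id, loan_number));
""")

Note that no separate loan-payment table is created: as stated above, it would be redundant, since payment already carries loan_number.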

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give the features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

- Time is considered an ordered sequence of points at some granularity.
  - The term chronon is used instead of "point" to describe the minimum granularity.
- A calendar organizes time into different time units for convenience.
  - Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
- Point events:
  - A single time-point event, e.g., a bank deposit.
  - A series of point events can form time series data.
- Duration events:
  - Associated with a specific time period.
  - A time period is represented by a start time and an end time.
- Transaction time:
  - The time when the information from a certain transaction becomes valid.
- Bitemporal database:
  - A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
- Add to every tuple:
  - a valid start time
  - a valid end time
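A small sketch of tuple versioning using sqlite3; the table layout and the 9999-12-31 "until changed" convention are assumptions for the example:

import sqlite3

HIGH = "9999-12-31"          # conventional "until changed" valid end time
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE emp_salary(
    emp_id INTEGER, salary REAL, valid_start TEXT, valid_end TEXT)""")

def set_salary(emp_id, salary, today):
    # close the current version's valid interval ...
    conn.execute("UPDATE emp_salary SET valid_end = ? "
                 "WHERE emp_id = ? AND valid_end = ?", (today, emp_id, HIGH))
    # ... and insert the new version
    conn.execute("INSERT INTO emp_salary VALUES (?, ?, ?, ?)",
                 (emp_id, salary, today, HIGH))

set_salary(1, 50000, "2010-01-01")
set_salary(1, 55000, "2010-06-01")
# valid-time query: what was the salary on 2010-03-15?
print(conn.execute(
    "SELECT salary FROM emp_salary "
    "WHERE emp_id = 1 AND valid_start <= ? AND ? < valid_end",
    ("2010-03-15", "2010-03-15")).fetchone())   # (50000.0,)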

Incorporating Time in Object-Oriented Databases Using Attribute Versioning
- A single complex object stores all temporal changes of the object.
- Time-varying attribute: an attribute that changes over time (e.g., age).
- Non-time-varying attribute: an attribute that does not change over time (e.g., date of birth).

Spatial Database
Types of Spatial Data

- Point data:
  - Points in a multidimensional space.
  - E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  - E.g., feature vectors extracted from text.
- Region data:
  - Objects have spatial extent, with location and boundary.
  - The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
- Spatial range queries:
  - Find all cities within 50 miles of Madison.
  - The query has an associated region (location, boundary).
  - The answer includes overlapping or contained data regions.
- Nearest-neighbor queries:
  - Find the 10 cities nearest to Madison.
  - Results must be ordered by proximity.
- Spatial join queries:
  - Find all cities near a lake.
  - Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
- Geographic Information Systems (GIS):
  - E.g., ESRI's ArcInfo; OpenGIS Consortium.
  - Geospatial information.
  - All classes of spatial queries and data are common.
- Computer-Aided Design/Manufacturing:
  - Store spatial objects such as the surface of an airplane fuselage.
  - Range queries and spatial join queries are common.
- Multimedia Databases:
  - Images, video, text, etc. stored and retrieved by content.
  - First converted to feature-vector form; high dimensionality.
  - Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD):
  - Find all hotels within a radius of 5 miles of the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval):
  - Given a face, find the five most similar faces.
- Multidimensional range queries:
  - 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
- An index based on spatial location is needed.
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
- The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, ptr to child node>
  - The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
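The two-case search above translates almost directly to code. In this sketch, boxes are (xmin, ymin, xmax, ymax) tuples, and the node layout is an assumption:

def overlaps(a, b):
    return (a[0] <= b[2] and b[0] <= a[2] and
            a[1] <= b[3] and b[1] <= a[3])

class Node:
    def __init__(self, leaf, entries):
        self.leaf = leaf          # True for leaf nodes
        self.entries = entries    # leaf: (box, rid); non-leaf: (box, child)

def search(node, q, results):
    for box, payload in node.entries:
        if overlaps(box, q):
            if node.leaf:
                results.append(payload)        # rid of a candidate object
            else:
                search(payload, q, results)    # descend into the subtree
    return results

leaf = Node(True, [((0, 0, 1, 1), "rid1"), ((5, 5, 6, 6), "rid2")])
root = Node(False, [((0, 0, 6, 6), leaf)])
print(search(root, (0, 0, 2, 2), []))          # ['rid1']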

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - Doing so will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to split recursively.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in the insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data
- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to test the contrast empirically.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
- Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
- Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
- Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data,
  - processing time-consuming decision-support queries,
  - providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
- Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
- Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
- Partitioning techniques (number of disks = n):
  - Round-robin: send the i-th tuple inserted into the relation to disk i mod n.
  - Hash partitioning: choose one or more attributes as the partitioning attributes; choose a hash function h with range 0 ... n - 1; let i denote the result of hash function h applied to the partitioning attribute value of a tuple, and send the tuple to disk i.
  - Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, ..., vn-2]. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
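The three techniques can be sketched as simple placement functions; Python's built-in hash stands in for the chosen hash function h, and the vector values are the [5, 11] example above:

n = 3

def round_robin(i):                    # i = insertion order of the tuple
    return i % n

def hash_partition(value):             # h with range 0 .. n-1
    return hash(value) % n

def range_partition(value, vector=(5, 11)):
    # [5, 11] splits values into v < 5, 5 <= v < 11, and v >= 11
    for disk, boundary in enumerate(vector):
        if value < boundary:
            return disk
    return len(vector)                 # values >= the last boundary

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2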

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin:
- Advantages:
  - Best suited for a sequential scan of the entire relation on each query.
  - All disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
- Range queries are difficult to process: there is no clustering, and tuples are scattered across all disks.

Hash partitioning:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is well balanced between disks.
- Good for point queries on the partitioning attribute: a single disk can be looked up, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so range queries are difficult to answer.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples, and all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
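A sketch of the balanced-vector construction described above, under the stated assumption that the partitioning attribute forms a key:

def balanced_vector(values, n):
    ordered = sorted(values)          # sort on the partitioning attribute
    step = len(ordered) // n
    # after every 1/n-th of the relation, record the next attribute value
    return [ordered[i * step] for i in range(1, n)]

print(balanced_vector([7, 3, 9, 1, 12, 5, 8, 2, 10], 3))   # [5, 9]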


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput, and is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

Locking and logging must be coordinated by passing messages between processors. Data in a local buffer may have been updated at another processor, so cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
An example cache-coherency protocol for shared-disk systems:
Before reading/writing a page, the page must be locked in shared/exclusive mode. On locking a page, the page must be read from disk. Before unlocking a page, the page must be written to disk if it was modified.
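A toy rendering of this protocol, with a dict-backed stand-in for the shared disk and the distributed lock manager elided:

class Disk:
    def __init__(self):
        self.pages = {}
    def read(self, pid):
        return self.pages.get(pid)
    def write(self, pid, data):
        self.pages[pid] = data

class Page:
    def __init__(self, disk, pid):
        self.disk, self.pid = disk, pid
        self.data, self.dirty = None, False
    def lock(self, mode="shared"):
        # rules 1 and 2: acquire the lock, then always re-read from disk,
        # because another node may have written a newer version
        self.data = self.disk.read(self.pid)
    def unlock(self):
        # rule 3: flush before releasing the lock if the page was modified
        if self.dirty:
            self.disk.write(self.pid, self.data)
            self.dirty = False

disk = Disk()
p = Page(disk, 7)
p.lock("exclusive"); p.data, p.dirty = "new value", True; p.unlock()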

More complex protocols with fewer disk reads/writes exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically much larger than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes, and redistribute the relation using range partitioning:

all tuples that lie in the ith range are sent to processor Pi, which temporarily stores the tuples it receives on disk Di. This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally

Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. The merging of the sorted runs is parallelized as follows:

The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

Each processor Pi performs a merge on the streams as they are received to get a single sorted run

The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
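Both algorithms rely on range partitioning to make the final merge step trivial. A minimal single-process sketch of the range-partitioning variant, simulating the processors with lists:

from bisect import bisect_right

def parallel_range_sort(tuples, vector):
    # vector holds the range boundaries, as in the partitioning sketch above
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in tuples:
        partitions[bisect_right(vector, t)].append(t)  # redistribution step
    for p in partitions:
        p.sort()  # each processor Pi sorts its partition locally
    # concatenation is trivial: partition i holds only keys below partition i+1
    return [t for p in partitions for t in p]

assert parallel_range_sort([9, 1, 20, 7, 13], [5, 11]) == [1, 7, 9, 13, 20]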

Parallel Join
The join operation requires pairs of tuples to be tested to see whether they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors; each processor then computes part of the join locally. In a final step, the results from each processor are collected together to produce the final result.

Partitioned Join

For equi-joins and natural joins it is possible to partition the two input relations across the processors and compute the join locally at each processor

Let r and s be the input relations, and suppose we want to compute the join r ⋈ s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used, but r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

Each processor Pi then locally computes ri ⋈ si; any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g. non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique. A special case is asymmetric fragment-and-replicate:

One of the relations, say r, is partitioned (any partitioning technique can be used). The other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes, asymmetric fragment-and-replicate is preferable even though partitioning could be used:

E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join: assume s is smaller than r, so s is chosen as the build relation. A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1; let si denote the tuples of relation s that are sent to processor Pi. As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally. Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi

As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi then executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash join. Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
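A compact simulation of this algorithm, with h1 routing tuples of both relations to one of n simulated processors and a local build/probe standing in for h2; the relation layouts and helper names are invented:

from collections import defaultdict

def parallel_hash_join(r, s, key_r, key_s, n):
    h1 = lambda v: hash(v) % n
    s_parts, r_parts = defaultdict(list), defaultdict(list)
    for t in s:                      # distribute the build relation s first
        s_parts[h1(key_s(t))].append(t)
    for t in r:                      # then redistribute the larger relation r
        r_parts[h1(key_r(t))].append(t)
    out = []
    for i in range(n):               # each "processor" joins its partitions
        table = defaultdict(list)    # local build phase (conceptually via h2)
        for t in s_parts[i]:
            table[key_s(t)].append(t)
        for t in r_parts[i]:         # local probe phase
            out.extend((t, m) for m in table.get(key_r(t), []))
    return out

# r(a, b) join s(b, c) on the shared attribute b
r = [(1, 'x'), (2, 'y'), (3, 'x')]
s = [('x', 10), ('y', 20)]
print(parallel_hash_join(r, s, key_r=lambda t: t[1], key_s=lambda t: t[0], n=2))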

Parallel Nested-Loop Join
Assume that relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r retained. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. A pipeline can be set up that computes the three joins in parallel: let P1 be assigned the computation of temp1 = r1 ⋈ r2, P2 the computation of temp2 = temp1 ⋈ r3, and P3 the computation of temp2 ⋈ r4. Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

Consider again a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Let P1 be assigned the computation of temp1 = r1 ⋈ r2, P2 the computation of temp2 = r3 ⋈ r4, and P3 the computation of temp1 ⋈ temp2. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2. The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism. Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.

When scheduling an execution tree in a parallel system, we must decide how to parallelize each operation and how many processors to use for it, which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other. Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge in order to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form; they are also known as expert systems. Rule-based systems are:

- Knowledge-based systems.
- Part of the artificial intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering: building an expert system. Knowledge Engineers: the people who build the system. Knowledge Representation: the symbols used to represent the knowledge. Factual Knowledge: knowledge of a particular task domain that is widely shared.

Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules; this is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge engineer: decides how to represent the experts' knowledge and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.

Problem-Solving Models
- Forward chaining: starts from a set of conditions and moves towards some conclusion.
- Backward chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
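A minimal forward-chaining sketch of the match-resolve-act cycle described above, with invented facts and rules; conflict resolution here is simply "first satisfied rule wins":

rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "order_blood_test"),
]

def forward_chain(facts, rules):
    facts = set(facts)
    while True:
        # Match: instantiations whose premises hold but whose conclusion
        # is not yet in working memory
        conflict_set = [c for prem, c in rules
                        if prem <= facts and c not in facts]
        if not conflict_set:        # no production satisfied: halt
            return facts
        facts.add(conflict_set[0])  # Resolve + Act: fire one rule

print(forward_chain({"has_fever", "has_rash"}, rules))
# the result includes suspect_measles and order_blood_test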

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human experts' intelligence.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Automating complex processes is challenging.
- Lack of flexibility and of the ability to adapt to changing environments.
- Cannot recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture

The inference engine and knowledge base are separated because the reasoning mechanism needs to be as stable as possible, while the knowledge base must be able to grow and change as knowledge is added; this arrangement enables the system to be built from, or converted to, a shell. It is reasonable to produce a richer, more elaborate description of the typical expert system; a more elaborate description, which still includes the components found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference
- Logic: resolution principle.
- Production rules: backward (top-down, goal-directed) and forward (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity based.

KBS tools: shells
- Consist of a KA tool, database & development interface.
- Inductive shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tools applicable to a wide range
- Special-purpose shells
  - specifically designed for particular types of problems

  - restricted to specialised problems
- Built from scratch
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g. DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g. XCON.
(5) Planning: both short term and long term, in areas like project management, product development, or financial planning, e.g. HRM.
(6) Monitoring: to check performance and flag exceptions, e.g. a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g. give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible: difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOV DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mirror the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability, reliability, and performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional (centralized) databases:
- More complex to implement and more costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows

the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read those versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes occur at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency, and ensures that a transaction never has to wait for a database object by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object-write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp. If a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
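The read rule can be sketched in a few lines of Python; the version store mirrors the Hello/Foo example below, and a None value is used here (an invented convention) to mark a deletion:

versions = {
    "Object1": [(0, "Foo"), (1, "Hello")],   # (write_ts, value)
    "Object2": [(0, "Bar"), (2, None)],      # None marks the deletion at t2
    "Object3": [(2, "Foo-Bar")],
}

def read(obj, ts):
    # a reader at timestamp ts sees the most recent version whose
    # write timestamp does not exceed ts
    visible = [(wts, val) for wts, val in versions.get(obj, []) if wts <= ts]
    return max(visible)[1] if visible else None

assert read("Object2", ts=1) == "Bar"   # the long-running reader at t1
assert read("Object2", ts=2) is None    # a reader at t2 sees the delete
assert read("Object3", ts=1) is None    # the t2 insert is invisible at t1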

At t1 the state of the DB could be:

  Time  Object1  Object2
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

This indicates that the current state of this database (perhaps a key-value store) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

  Time  Object1  Object2    Object3

  t2    "Hello"  (deleted)  "Foo-Bar"
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases (16) (NOV DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely within a geographic mobility domain, an area circumscribed by wireless network coverage:
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units move unrestricted throughout the cells of the domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2:
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  - Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
  - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
  - Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include: communication latency, intermittent connectivity, limited battery life, and changing client location.
- The server may not be able to reach a client:

- A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture to compensate: proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data-management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
  - Client mobility also allows new applications that are location-based.

Data Management Issues
From a data-management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication: a base station or fixed host manages its own database with DBMS-like functionality, extended with features for locating mobile units and with additional query- and transaction-management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components: data-management responsibility is shared among base stations or fixed hosts and mobile units.
Data-management issues, as applied to mobile databases:
- data distribution and replication
- transaction models
- query processing
- recovery and fault tolerance

- mobile database design
- location-based services
- division of labor
- security

Application: Intermittently Synchronized Databases
Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. It is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate with a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and of power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected; it can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  1. Consider a set of alternative plans: the search space must be pruned, typically to left-deep plans only.
  2. Estimate the cost of each plan that is considered: this requires estimating the size of the result and the cost of each plan node. Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational-algebra operations, with a choice of algorithm for each operation.

- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  1. For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
  2. How is the cost of a plan estimated?
- Ideally, we want to find the best plan; practically, we want to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  1. all available access methods, for each relation in the FROM clause;
  2. all left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
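To make the left-deep search space concrete, here is a small Python sketch that enumerates left-deep join trees for the relations in a FROM clause; a real System R-style optimizer prunes this space with dynamic programming and cost estimates rather than enumerating it exhaustively.

from itertools import permutations

def left_deep_trees(relations):
    # each permutation yields one left-deep tree, with the next relation
    # always joined as the inner input
    for perm in permutations(relations):
        tree = perm[0]
        for inner in perm[1:]:
            tree = (tree, inner)   # (outer-so-far, inner relation)
        yield tree

for plan in left_deep_trees(["Sailors", "Reserves"]):
    print(plan)   # ('Sailors', 'Reserves') and ('Reserves', 'Sailors')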

8 (a) Discuss multimedia databases in detail (8) (NOV DEC 2010)
Multimedia Databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed, steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding:

  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained (video of a person speaking must show lips moving synchronously with the audio).

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet the real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV DEC 2010)

15. Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries? Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries

15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor:
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
  - this takes us one level down the Assembly hierarchy;
  - to find components that are one level deeper (e.g., rim), we need another join;
  - to find all components, we need as many joins as there are levels in the given instance.
- For any relational-algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational-algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive-query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational-algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is larger than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs

- Consider the following program:
  Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
  According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part.
- Database systems disallow unsafe programs by requiring that every variable in the head of a rule also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence; relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:

  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.

Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program-rewriting algorithm can be summarized as follows:
- Add "magic" filters: modify each rule in the program by adding a "magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the "magic" relations: we must create new rules to define the "magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.

Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources: for example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand. Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
Database systems are now used to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service-desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
This gives your customers the information they want, and the way you provide it is by supplying a database application that searches the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service-desk payroll. (Perhaps the Web site pays for itself this way.)
But you can put the database to even better use than that. Web-based inventory-search requests can provide information not only to your customers but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture
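The three tiers are the client (browser), a web/application server, and the database server. As a minimal sketch of the middle tier for the inventory-search example above (table and column names are invented for illustration):

import sqlite3

def search_inventory(conn, keyword):
    # application tier: turn the customer's search into a query on the
    # data tier and return rows for the presentation tier to render
    cur = conn.execute(
        "SELECT item, price, qty_in_stock FROM inventory WHERE item LIKE ?",
        (f"%{keyword}%",))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (item TEXT, price REAL, qty_in_stock INT)")
conn.execute("INSERT INTO inventory VALUES ('widget', 9.99, 42)")
print(search_inventory(conn, "widg"))   # [('widget', 9.99, 42)]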


4 (a) With an example explain E-R Model in detail (JUNE 2010)
Or
(b) (i) Explain E-R model with an example (8) (NOV DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
- a collection of entities, and
- relationships among entities.
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, or plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, or holidays.

AttributesAn entity is represented by a set of attributes that is descriptive properties possessed by all members of an entity setExamplescustomer = (customer-name social-security customer-street customer-city)account = (account-number balance) Domain -- the set of permitted values for each attributeAttribute typesSimple and composite attributesSingle-valued and multi-valued attributesNull attributesDerived attributesRelationship SetsA relationship is an association among several entitiesExamples

Hayes depositor A-102customer-entity relationship set account entity

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
• Refers to the number of entity sets that participate in a relationship set.
• Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
• Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.

• The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

• Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

• Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

• Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
• Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
• Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities

• Express the number of entities to which another entity can be associated via a relationship set.
• Most useful in describing binary relationship sets.
• For a binary relationship set, the mapping cardinality must be one of the following types:
– One to one
– One to many
– Many to one


– Many to many
• We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

• A customer is associated with at most one loan via the relationship borrower.
• A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

• In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.

• In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


• A customer is associated with several (possibly 0) loans via borrower.
• A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
• If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
– y is a dominant entity (in the example below, loan)
– x is a subordinate entity (in the example below, payment)
• If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
• Rectangles represent entity sets.
• Ellipses represent attributes.
• Diamonds represent relationship sets.
• Lines link attributes to entity sets and entity sets to relationship sets.
• Double ellipses represent multivalued attributes.
• Dashed ellipses denote derived attributes.
• Primary key attributes are underlined.

Weak Entity Sets
• An entity set that does not have a primary key is referred to as a weak entity set.
• The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
• The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
• The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
• We depict a weak entity set by double rectangles.
• We underline the discriminator of a weak entity set with a dashed line.
• payment-number: discriminator of the payment entity set.
• Primary key for payment: (loan-number, payment-number).

Specialization
• A top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.

• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization

• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.

• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.

• Attribute Inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
– condition-defined
– user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
– disjoint
– overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
– total
– partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
– treat the relationship as an abstract entity;
– allows relationships between relationships;
– abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
– a customer takes out a loan;
– an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables

• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.

• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.

• Each table has a number of columns (generally corresponding to attributes), which have unique names.

• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables

• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.

• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized, with all local and inherited attributes.
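The reduction rules above are mechanical enough to sketch in code. The following Python fragment (illustrative only; the helper functions are not from the notes) derives table schemas for a strong entity set, a weak entity set, and a many-to-many relationship set, using the banking example:

# Sketch: reducing E-R constructs to table schemas (banking example).
def strong_entity_table(name, attrs):
    # A strong entity set reduces to a table with the same attributes.
    return {"table": name, "columns": list(attrs), "primary_key": [attrs[0]]}

def weak_entity_table(name, discriminator, owner_pk, attrs):
    # A weak entity set adds the primary key of its identifying strong set;
    # its primary key = owner's primary key + discriminator.
    return {"table": name,
            "columns": owner_pk + discriminator + list(attrs),
            "primary_key": owner_pk + discriminator}

def many_to_many_table(name, pk_left, pk_right, attrs=()):
    # Columns: primary keys of both participants plus descriptive attributes.
    return {"table": name, "columns": pk_left + pk_right + list(attrs)}

loan = strong_entity_table("loan", ["loan-number", "amount"])
payment = weak_entity_table("payment", ["payment-number"], ["loan-number"],
                            ["payment-date", "payment-amount"])
borrower = many_to_many_table("borrower", ["customer-id"], ["loan-number"])
print(loan); print(payment); print(borrower)

Note that payment's primary key comes out as (loan-number, payment-number), matching the weak entity set rule stated earlier.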

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
• The term chronon is used instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
• Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.

Duration events
• Associated with a specific time period. A time period is represented by its start time and end time.

Transaction time
• The time when the information from a certain transaction becomes valid.

Bitemporal database
• Databases dealing with two time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:

• Valid start time
• Valid end time
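As a concrete illustration (added here; not part of the original notes), the following Python sketch stores tuple versions with valid start and end times and answers an "as of" query; the relation and attribute names are hypothetical:

from datetime import date

# Tuple versioning: each version of a row carries a valid-time interval.
# end=None marks the currently valid version.
salary_history = [
    {"emp": "Smith", "salary": 30000,
     "start": date(2019, 1, 1), "end": date(2020, 6, 30)},
    {"emp": "Smith", "salary": 35000,
     "start": date(2020, 7, 1), "end": None},
]

def valid_as_of(rows, when):
    # Return the versions whose valid-time interval contains `when`.
    return [r for r in rows
            if r["start"] <= when and (r["end"] is None or when <= r["end"])]

print(valid_as_of(salary_history, date(2020, 1, 15)))   # the 30000 version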

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
• An attribute that changes over time, e.g., age.
Non-time-varying attribute
• An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region Data
• Objects have spatial extent, with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
• "Find all cities within 50 miles of Madison."
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
• "Find the 10 cities nearest to Madison."
• Results must be ordered by proximity.

Spatial Join Queries
• "Find all cities near a lake."
• Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.

Multimedia Databases
• Images, video, text, etc., stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
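A small sketch (added for illustration; not in the original notes) of how a composite-key B+ tree linearizes the entries above. Sorting by (age, sal) keeps entries with equal age together, but a condition on sal alone is scattered across the whole order:

entries = [(11, 80), (12, 10), (12, 20), (13, 75)]   # (age, sal) pairs

# A composite-key B+ tree stores entries in lexicographic order:
linearized = sorted(entries)                  # by age, then by sal
print(linearized)

# A range query on the leading attribute maps to a contiguous run:
print([e for e in linearized if 12 <= e[0] <= 13])

# A condition on sal alone gains nothing from the ordering;
# the matching entries (11, 80) and (13, 75) are far apart:
print([e for e in linearized if e[1] >= 75])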

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.

Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dimensional box, ptr to child node>
• The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root):
• Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
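The two steps above are compact enough to show directly in code. This is an illustrative Python version (not from the original notes), with axis-aligned boxes represented as (xlo, ylo, xhi, yhi):

# Minimal R-tree search sketch. A node is either
#   ("leaf", [(box, rid), ...]) or ("inner", [(box, child_node), ...]).
def overlaps(a, b):
    # Boxes overlap unless one lies strictly to one side of the other.
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, out):
    kind, entries = node
    for box, payload in entries:
        if overlaps(box, q):
            if kind == "leaf":
                out.append(payload)        # rid of a candidate object
            else:
                search(payload, q, out)    # descend into the subtree

leaf1 = ("leaf", [((0, 0, 2, 2), "r1"), ((3, 3, 5, 5), "r2")])
leaf2 = ("leaf", [((6, 6, 8, 8), "r3")])
root = ("inner", [((0, 0, 5, 5), leaf1), ((6, 6, 8, 8), leaf2)])

hits = []
search(root, (1, 1, 4, 4), hits)
print(hits)   # ['r1', 'r2'] -- candidates that might overlap the query box

Improving Search Using Constraints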

It is convenient to store boxes in the R-tree as approximations of arbitrary regions because boxes can be represented compactly

But why not use convex polygons to approximate query regions more accurately?
• This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
• The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L:
• Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
• Adjust the entry for L in its parent so that the box now covers (only) L1.
• Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
• Remove some (say, 30% of the) entries and reinsert them into the tree.
• This could result in all reinserted entries fitting on existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
• Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions:
• e.g., feature vectors in multimedia applications;
• very sparse.
Nearest-neighbor queries are common:
• The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
• In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
• prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
• large volumes of transaction data are collected and stored for later analysis;
• multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data;
• processing time-consuming decision-support queries;
• providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
• data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra):
• this makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin: send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0, ..., n – 1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n – 1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin
Advantages:

• Best suited for a sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
• Range queries are difficult to process: no clustering, so tuples are scattered across all disks.

Hash partitioning
• Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: we can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
– The remaining disks are available for other queries.
– Good if the result tuples are from one to a few blocks.
– If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted; this is an example of execution skew.
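A minimal sketch of the three partitioning schemes described above (added for illustration; not from the original notes). The [5, 11] range vector reproduces the example given earlier:

from bisect import bisect_right

def round_robin(i, n):
    # The ith inserted tuple goes to disk i mod n.
    return i % n

def hash_partition(key, n):
    # Send the tuple to disk h(key); Python's hash() stands in for a real
    # hash function on the partitioning attribute.
    return hash(key) % n

def range_partition(v, vector):
    # Partitioning vector [v0, v1, ...]: v < v0 -> disk 0,
    # v0 <= v < v1 -> disk 1, ..., v >= last entry -> last disk.
    return bisect_right(vector, v)

vector = [5, 11]
for v in (2, 8, 20):
    print(v, "-> disk", range_partition(v, vector))   # 0, 1, 2 as in the notes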

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique based on histograms is used in practice.

Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
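The vector construction just described is easy to sketch (illustrative code, not from the notes); it assumes the partitioning attribute is close to a key, as stated above:

def balanced_partition_vector(sorted_values, n):
    # After every 1/n-th of the relation, record the next attribute value
    # as a cut point; n is the number of partitions to construct.
    step = len(sorted_values) // n
    return [sorted_values[i * step] for i in range(1, n)]

values = sorted([3, 9, 1, 14, 7, 22, 5, 18, 11, 2, 16, 20])
print(balanced_partition_vector(values, 3))   # two cut points -> 3 partitions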


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
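A toy illustration of those three rules (added here; not from the original notes). The single-process lock set and dictionary "disk" are stand-ins for a real lock manager and shared disk:

# Toy shared-disk cache coherency: lock -> read, write-back -> unlock.
disk = {}          # page_id -> contents (the shared disk)
locks = set()      # page_ids currently locked

class Processor:
    def __init__(self):
        self.cache = {}   # local buffer: page_id -> (contents, dirty)

    def lock_and_read(self, pid):
        assert pid not in locks
        locks.add(pid)
        # Rule: on locking, (re)read from disk -- a cached copy may be stale.
        self.cache[pid] = (disk.get(pid), False)

    def write(self, pid, value):
        self.cache[pid] = (value, True)

    def unlock(self, pid):
        contents, dirty = self.cache[pid]
        if dirty:
            disk[pid] = contents        # rule: write back before unlocking
        locks.remove(pid)

p1, p2 = Processor(), Processor()
p1.lock_and_read("A"); p1.write("A", "v1"); p1.unlock("A")
p2.lock_and_read("A")
print(p2.cache["A"])   # ('v1', False): p2 sees p1's write via the disk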

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. Two complementary forms of intraquery parallelism:
• Intraoperation parallelism: parallelize the execution of each individual operation in the query.
• Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n – 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

• All tuples that lie in the ith range are sent to processor Pi.
• Pi stores the tuples it received temporarily on disk Di.
• This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:
• The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
• Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
• The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
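An illustrative sequential simulation of the range-partitioning sort (not from the original notes); with real processors, each partition would be sorted in parallel:

from bisect import bisect_right

def range_partition_sort(tuples, vector):
    # Redistribute: a tuple with key v goes to partition bisect_right(vector, v).
    partitions = [[] for _ in range(len(vector) + 1)]
    for v in tuples:
        partitions[bisect_right(vector, v)].append(v)
    # Each "processor" sorts its partition locally (here: one after another).
    for p in partitions:
        p.sort()
    # The final merge is trivial: concatenation, since the ranges are ordered.
    return [v for p in partitions for v in p]

print(range_partition_sort([14, 3, 22, 9, 1, 18, 7], vector=[5, 11]))
# [1, 3, 7, 9, 14, 18, 22]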

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.
Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
• e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate:
• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
• E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case:
• e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
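A sequential sketch of the partitioned hash-join idea (illustrative only; the original notes give no code). Python's hash() plays the role of h1, routing tuples of both relations to "processors"; the dict used in the local build phase plays the role of h2:

from collections import defaultdict

def partitioned_hash_join(r, s, n):
    # Phase 1: route tuples of both relations by hashing the join key (h1).
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for key, payload in s:
        s_parts[hash(key) % n].append((key, payload))
    for key, payload in r:
        r_parts[hash(key) % n].append((key, payload))
    # Phase 2: each processor i builds on s_i (the smaller relation)
    # and probes with r_i, producing its partition of the result.
    out = []
    for i in range(n):
        build = defaultdict(list)
        for key, payload in s_parts[i]:
            build[key].append(payload)
        for key, payload in r_parts[i]:
            for s_payload in build.get(key, ()):
                out.append((key, payload, s_payload))
    return out

r = [(1, "r1"), (2, "r2"), (2, "r3")]
s = [(2, "s1"), (3, "s2")]
print(partitioned_hash_join(r, s, n=4))   # matches on key 2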

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 be assigned the computation of temp2 = temp1 ⋈ r3,
– and P3 be assigned the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 be assigned the computation of temp2 = r3 ⋈ r4,
– and P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
• We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• how to parallelize each operation, and how many processors to use for it;
• what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
• e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
– knowledge-based systems;
– part of the Artificial Intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

• Knowledge Engineering: building an expert system.
• Knowledge Engineers: the people who build the system.
• Knowledge Representation: the symbols used to represent the knowledge.
• Factual Knowledge: knowledge of a particular task domain that is widely shared.
• Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems

• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.

• Solves problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle (a sketch follows this list):
– Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
– Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
– Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
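As promised above, here is a toy sketch of the match-resolve-act cycle (illustrative only; the two rules and the "first instantiation wins" conflict-resolution strategy are assumptions, not from the notes):

# Toy forward-chaining interpreter: match -> resolve -> act, until quiescent.
rules = [
    # (name, condition on working memory, facts the action adds)
    ("r1", lambda wm: "fever" in wm and "rash" in wm, {"measles?"}),
    ("r2", lambda wm: "measles?" in wm,               {"refer-to-doctor"}),
]

working_memory = {"fever", "rash"}
while True:
    # Match: collect satisfied productions whose action still changes memory.
    conflict_set = [(name, add) for name, cond, add in rules
                    if cond(working_memory) and not add <= working_memory]
    if not conflict_set:                  # resolve: halt if none satisfied
        break
    name, additions = conflict_set[0]     # resolve: pick one instantiation
    working_memory |= additions           # act: change working memory
    print("fired", name, "->", sorted(working_memory))

Components of a Rule-Based System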

• Set of Rules: derived from the knowledge base, and used by the interpreter to evaluate the input data.

• Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.

• Interpreter: interprets the input data and draws a conclusion based on the user's responses.


Problem-Solving Models
• Forward-chaining: starts from a set of conditions and moves towards some conclusion.
• Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the amount of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this.

Knowledge Representation Formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & frames   inheritance & advanced reasoning
Case-based reasoning     similarity-based

KBS Tools: Shells
– Consist of a KA tool, database, and development interface.
– Inductive shells:

  – simplest;
  – example cases represented as a matrix of known data (premises) and resulting effects;
  – matrix converted into a decision tree or IF-THEN statements;
  – examples selected for the tool.
– Rule-based shells:
  – simple to complex;
  – IF-THEN rules.
– Hybrid shells:
  – sophisticated & powerful;
  – support multiple KR paradigms & reasoning schemes;
  – generic tools applicable to a wide range of problems.
– Special-purpose shells:
  – specifically designed for particular types of problems;


  – restricted to specialised problems.
– From scratch:
  – requires more time and effort;
  – no constraints, unlike shells;
  – shells should be investigated first.

Some example KBSs
• DENDRAL (chemistry)
• MYCIN (medicine)
• XCON/R1 (computers)

Typical Tasks of a KBS
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages

– Increase availability of expert knowledge:
  – expertise not otherwise accessible;
  – training future experts.
– Efficient and cost-effective.
– Consistency of answers.
– Explanation of the solution.
– Deal with uncertainty.

Limitations
– Lack of common sense.
– Inflexible; difficult to modify.
– Restricted domain of expertise.
– Lack of learning ability.
– Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages:
• Mimics organisational structure with data.
• Local access and autonomy without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
• More complex to implement.
• More costly to maintain.
• Security and integrity control is harder.
• Standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time   Object1   Object2
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time   Object1   Object2     Object3
t2     "Hello"   (deleted)   "Foo-Bar"
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
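A toy version of this idea (illustrative only; not from the notes): each object keeps a list of (write-timestamp, value) versions, and a read at timestamp ts returns the newest version at or before ts:

# Toy MVCC store: versions are (write_ts, value); None marks a deletion.
store = {
    "Object1": [(0, "Foo"), (1, "Hello")],
    "Object2": [(0, "Bar"), (2, None)],
    "Object3": [(2, "Foo-Bar")],
}

def read(obj, ts):
    # Return the most recent version written at or before timestamp ts.
    versions = [v for t, v in store.get(obj, []) if t <= ts]
    return versions[-1] if versions else None

# The long-running reader at t1 still sees the t1 snapshot:
print(read("Object1", 1), read("Object2", 1), read("Object3", 1))
# -> Hello Bar None
# A reader at t2 sees the newer state (Object2 deleted, Object3 added):
print(read("Object1", 2), read("Object2", 2), read("Object3", 2))
# -> Hello None Foo-Bar

Recovery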


(b) Discuss the client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly due to:

– The limited and intermittent connectivity afforded by wireless communications.
– The limited life of the power supply (battery).
– The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
– The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

– Other characteristics that distinguish wireless connectivity options include interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
– Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships
– Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
– To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
– Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

The server may not be able to reach a client


A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
• The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

• The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication.
– A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components.
– Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues, as applied to mobile databases:

• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based services
• Division of labor
• Security

Application: Intermittently Synchronized Databases
Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

• A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
• A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):

Issues of wireless versus wired client connections and power conservation are generally immaterial

• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries). There are two parts to optimizing a query:
• Consider a set of alternative plans.
– Must prune the search space; typically, left-deep plans only.
• Must estimate the cost of each plan that is considered.
– Must estimate the size of the result and the cost for each plan node.
– Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra (RA) operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
• For a given query, what plans are considered?
– Algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
• Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
• Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods, for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must provide guaranteed steady retrieval rates for continuous-media data.
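For instance, a minimal sketch using SQL:1999 large-object types (the table and column names are hypothetical):

    CREATE TABLE video_clip (
        clip_id    INTEGER PRIMARY KEY,
        title      VARCHAR(100),
        content    BLOB,   -- binary large object holding the compressed media stream
        transcript CLOB    -- character large object for subtitles or transcript text
    );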

Multimedia Data Formats
Store and transmit multimedia data in compressed form:

- JPEG and GIF are the most widely used formats for image data.
- The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.

- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

There are several alternatives for audio encoding:


- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data. They are characterized by high data volumes and real-time information-delivery requirements:

- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained:

  - e.g., a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers across a network to terminals:
- must guarantee end-to-end delivery rates.

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals are used to view multimedia data:
- PCs or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.

- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.

- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.
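For example, the transitive closure of the Assembly(Part, Subpart, Qty) relation discussed below can be written as a recursive query in SQL:1999 (a sketch; the exact syntax varies slightly across DBMSs):

    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT A.Part, A.Subpart
        FROM Assembly A
      UNION
        SELECT A.Part, C.Subpart
        FROM Assembly A, Comp C
        WHERE A.Subpart = C.Part
    )
    SELECT * FROM Comp WHERE Part = 'trike';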

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor.

- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:

- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
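To see the limitation concretely, here is a sketch in SQL-92 that retrieves components only one and two levels below 'trike'; each additional level would require yet another self-join (an illustrative query, assuming the Assembly schema above):

    SELECT A1.Subpart
    FROM Assembly A1
    WHERE A1.Part = 'trike'
    UNION
    SELECT A2.Subpart
    FROM Assembly A1, Assembly A2
    WHERE A1.Part = 'trike' AND A2.Part = A1.Subpart;
    -- No fixed number of self-joins covers instances of arbitrary depth.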

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables.
- Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
- Selection: Result(Y) :- R(X, Y), X = c.
- Projection: Result(Y) :- R(X, Y).
- Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
- Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
- Union: Result(X, Y) :- R(X, Y). Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

    WITH
    Big2(Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
    Small2(Part) AS
      ((SELECT A2.Part FROM Assembly A2)
       EXCEPT
       (SELECT B1.Part FROM Big2 B1))
    SELECT * FROM Big2 B2;

15.3.3 Aggregate Operations

    SELECT A.Part, SUM(A.Qty)
    FROM Assembly A
    GROUP BY A.Part;

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of <...> is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add "magic" filters: modify each rule in the program by adding a "magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the "magic" relations: we must create new rules to define the "magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:

- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:

- Servers must keep track of client locations in order to efficiently route messages to them.

- Client data should be stored in the network location that minimizes the traffic necessary to access it.

- The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:

- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases


Whenever clients connect, through a process known in industry as synchronization of a client with a server, they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:

A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate with a set of receivers, or update a group of clients).

A server cannot connect to a client at will.

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, available communication nodes, available resources, etc.
Web Databases
A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports, or student grades.
The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.
Web Database Applications
Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much


it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold, listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent that you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
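A minimal sketch of both ideas, assuming hypothetical inventory and search_log tables:

    -- Customer-facing search for pricing and availability
    SELECT item_name, unit_price, qty_on_hand
    FROM inventory
    WHERE item_name LIKE '%widget%';

    -- Record the search term and how many matching items were in stock
    INSERT INTO search_log (searched_at, search_term, in_stock_hits)
    VALUES (CURRENT_TIMESTAMP, 'widget',
            (SELECT COUNT(*) FROM inventory
             WHERE item_name LIKE '%widget%' AND qty_on_hand > 0));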

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4. (a) With an example, explain the E-R model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:
- a collection of entities
- relationships among entities

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain -- the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes
Relationship Sets
A relationship is an association among several entities.
Example: Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity).
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or of degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan, and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

- Role labels are optional and are used to clarify the semantics of the relationship.
Design Issues
- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a non-binary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.
Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line (signifying "one") or an undirected line (signifying "many") between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number -- discriminator of the payment entity set.
- Primary key for payment -- (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes or participate in relationships that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
- A bottom-up design process -- combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance -- a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization -- contributes to modularity in the design.
- The use of aggregation -- can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity set (account), and a table for each entity set that specializes it, including the primary key of the generalized entity set.
- Method 2: Form a self-contained table for each specialized entity set, including the inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points at some granularity.
- The term chronon is used instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (Western), Chinese, Islamic, etc.
Point events


- A single time point event, e.g., a bank deposit.
- A series of point events can form time series data.
Duration events
- Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time
- The time when the information from a certain transaction becomes valid.
Bitemporal database
- A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute
- An attribute that changes over time, e.g., age.
Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region data
- Objects have spatial extent with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-neighbor queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial join queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
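A short sketch of this linearization with a hypothetical employees table:

    -- Composite B+ tree: entries sorted by age first, then sal
    CREATE INDEX emp_age_sal ON employees (age, sal);

    -- Served well by the index: a range on the leading attribute
    SELECT * FROM employees WHERE age BETWEEN 11 AND 12;

    -- Served poorly: a condition on sal alone spans all age values
    SELECT * FROM employees WHERE sal < 20;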

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.
Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
An index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dimensional box, ptr to child node>
- The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
- We can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
But why not use convex polygons to approximate query regions more accurately?
- This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.
Nearest-neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

- Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
- Data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
Horizontal partitioning -- tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
- Choose one or more attributes as the partitioning attributes.


- Choose a hash function h with range 0...n-1.
- Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning:
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn-2] is chosen.
- Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
- E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, and a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively -- point queries, e.g., r.A = 25.
3. Locating all tuples whose value for a given attribute lies within a specified range -- range queries, e.g., 10 <= r.A < 25.
Round-robin
Advantages:

- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Range queries are difficult to process:
- No clustering -- tuples are scattered across all disks.
Hash partitioning
Good for sequential access:

- Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
- Can look up a single disk, leaving the others available for answering other queries.
- An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so it is difficult to answer range queries.
Range partitioning
Provides data clustering by partitioning attribute value:
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - The remaining disks are available for other queries.
  - Good if the result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted -- an example of execution skew.
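Declarative partitioning in some DBMSs mirrors these schemes; a MySQL-style sketch (table names are hypothetical, and syntax varies by system):

    -- Hash partitioning on the partitioning attribute cust_id
    CREATE TABLE orders_h (
        order_id INT,
        cust_id  INT
    )
    PARTITION BY HASH (cust_id) PARTITIONS 4;

    -- Range partitioning with the vector [5, 11] from the example above
    CREATE TABLE orders_r (
        order_id INT,
        amount   INT
    )
    PARTITION BY RANGE (amount) (
        PARTITION p0 VALUES LESS THAN (5),
        PARTITION p1 VALUES LESS THAN (11),
        PARTITION p2 VALUES LESS THAN MAXVALUE
    );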

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed -- that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique based on histograms is used in practice.
Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system to support a larger number of transactions per second.
It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained -- reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism -- parallelize the execution of each individual operation in the query.
- Interoperation parallelism -- execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
- e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case -- asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
- E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
Hash-join optimizations can be applied to the parallel case:
- e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and with the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - let P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and let P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - let P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and let P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
- e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

- Knowledge Engineering -- building an expert system.
- Knowledge Engineers -- the people who build the system.
- Knowledge Representation -- the symbols used to represent the knowledge.
- Factual Knowledge -- knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge -- more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of rules -- derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge engineer -- decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain.
- Interpreter -- interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
 Forward-chaining – starts from a set of conditions and moves towards some conclusion
 Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals (see the sketch below)
 Both problem-solving methods are built into inference engines or inference procedures
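For contrast with the forward-chaining sketch above, here is a minimal backward-chaining sketch in the same style; the goal, rules and facts are again invented for illustration.

# Minimal backward chaining: prove a goal by finding a rule whose
# conclusion matches it and recursively proving the rule's conditions.
facts = {"has_fever", "has_rash"}
rules = {  # goal -> list of alternative condition sets
    "suspect_measles": [{"has_fever", "has_rash"}],
    "order_blood_test": [{"suspect_measles"}],
}

def prove(goal):
    if goal in facts:
        return True
    for conditions in rules.get(goal, []):
        if all(prove(c) for c in conditions):
            return True
    return False

print(prove("order_blood_test"))  # True: both subgoals reduce to stored facts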

Advantages
 Provide consistent answers for repetitive decisions, processes and tasks
 Hold and maintain significant levels of information
 Reduce employee training costs
 Centralize the decision-making process
 Create efficiencies and reduce the time needed to solve problems
 Combine the intelligence of multiple human experts
 Reduce the number of human errors
 Give strategic and comparative advantages, creating entry barriers to competitors
 Review transactions that human experts may overlook

Disadvantages
 Lack the human common sense needed in some decision making
 Cannot give the creative responses that human experts can give in unusual circumstances
 Domain experts cannot always clearly explain their logic and reasoning
 Challenges of automating complex processes
 Lack of flexibility and ability to adapt to changing environments
 Not being able to recognize when no answer is available

Knowledge Bases / Knowledge-based Systems
Definition
 A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
 Heuristic rather than algorithmic
 Heuristics in search vs in KBS: general vs domain-specific
 Highly specific domain knowledge
 Knowledge is separated from how it is used

KBS = knowledge base + inference engine

KBS Architecture


 The inference engine and knowledge base are separated because:
  the reasoning mechanism needs to be as stable as possible;
  the knowledge base must be able to grow and change, as knowledge is added;
  this arrangement enables the system to be built from, or converted to, a shell.
 It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
 KR                       Inference
 Logic                    Resolution principle
 Production rules         backward (top-down, goal-directed) / forward (bottom-up, data-driven)
 Semantic nets & Frames   Inheritance & advanced reasoning
 Case-based Reasoning     Similarity-based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special-purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems
- From scratch
  - requires more time and effort
  - no constraints, unlike shells
  - shells should be investigated first

Some example KBSs
 DENDRAL (chemistry)
 MYCIN (medicine)
 XCON/R1 (computer configuration)

Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions. e.g. diagnose reasons for engine failure
(2) Interpretation - To provide an understanding of a situation from available information. e.g. DENDRAL
(3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT
(4) Design - To develop configurations that satisfy constraints of a design problem. e.g. XCON
(5) Planning - Both short term & long term, in areas like project management, product development or financial planning. e.g. HRM
(6) Monitoring - To check performance & flag exceptions. e.g. a KBS that monitors radar data and estimates the position of the space shuttle
(7) Control - To collect and evaluate evidence and form opinions on that evidence. e.g. control a patient's treatment
(8) Instruction - To train students and correct their performance. e.g. give medical students experience diagnosing illness
(9) Debugging - To identify and prescribe remedies for malfunctions. e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages

- Increase availability of expert knowledge / expertise not otherwise accessible / training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases. (16) (NOV DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
 mimic organisational structure with data
 local access and autonomy without exclusion
 cheaper to create and easier to expand
 improved availability/reliability/performance by removing reliance on a central site
 reduced communication overhead – most data access is local, which is less expensive and performs better
 improved processing power – many machines handle the database, rather than a single server

Disadvantages:
 more complex to implement
 more costly to maintain
 security and integrity control are harder
 standards and experience are lacking
 design issues are more complex

7 (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV DEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

 Time  Object1  Object2
 t1    "Hello"  "Bar"
 t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

 Time  Object1  Object2    Object3
 t2    "Hello"  (deleted)  "Foo-Bar"
 t1    "Hello"  "Bar"
 t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
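The timestamp rules above can be condensed into a toy sketch. This is a single-threaded Python illustration under assumed names (MVStore, read_ts); a real DBMS adds logging, transaction abort/restart, and garbage collection of obsolete versions.

# Toy MVCC store: each object keeps a list of versions with write
# timestamps; a reader with timestamp ts sees the newest version
# whose write timestamp precedes (or equals) ts.
class MVStore:
    def __init__(self):
        self.versions = {}   # key -> list of (write_ts, value), ascending
        self.read_ts = {}    # key -> largest timestamp that has read the key

    def read(self, key, ts):
        older = [(wts, v) for (wts, v) in self.versions.get(key, []) if wts <= ts]
        if not older:
            return None
        self.read_ts[key] = max(self.read_ts.get(key, 0), ts)
        return older[-1][1]  # most recent version preceding ts

    def write(self, key, value, ts):
        # Rule TS(Ti) < RTS(P): a later transaction already read this
        # object, so the writer must abort and restart with a new ts.
        if ts < self.read_ts.get(key, 0):
            return False
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()
        return True

db = MVStore()
db.write("Object1", "Foo", 0); db.write("Object2", "Bar", 0)
db.write("Object1", "Hello", 1)
print(db.read("Object1", 1))                         # 'Hello': snapshot at t1
db.write("Object3", "Foo-Bar", 2)
print(db.read("Object1", 1), db.read("Object3", 1))  # 'Hello' None: t2 invisible at t1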

58

(b) Discuss the client/server model and mobile databases. (16) (NOV DEC 2010)


Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

 The general architecture of a mobile platform is illustrated in Fig. 30.1.
 It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  Fixed hosts are general-purpose computers configured to manage mobile units.
  Base stations function as gateways to the fixed network for the Mobile Units.
 Wireless Communications –
  The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
  Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
 Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
 The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
 Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
  In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
  In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
  Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
  Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
 The characteristics of mobile computing include:
  Communication latency
  Intermittent connectivity
  Limited battery life
  Changing client location
 The server may not be able to reach a client.
  A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
  In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  Proxies for unreachable components are added to the architecture.
  For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
 Mobile computing poses challenges for servers as well as clients.
  The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
 Client mobility also poses many data management challenges.
  Servers must keep track of client locations in order to efficiently route messages to them.
  Client data should be stored in the network location that minimizes the traffic necessary to access it.
  The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
 Client mobility also allows new applications that are location-based.

Data Management Issues
 From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
 Data management issues as applied to mobile databases:
  Data distribution and replication
  Transaction models
  Query processing
  Recovery and fault tolerance
  Mobile database design
  Location-based services
  Division of labor
  Security

Application: Intermittently Synchronized Databases
 Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
 The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
 This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
 This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
 The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  A server cannot connect to a client at will.
  Issues of wireless versus wired client connections and power conservation are generally immaterial.
  A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b) (ii) Discuss optimization and research issues. (8) (NOV DEC 2010)

Optimization
 Query optimization is an important task in a relational DBMS.
 One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
 Two parts to optimizing a query:
  • Consider a set of alternative plans.
   - Must prune the search space; typically left-deep plans only.
  • Must estimate the cost of each plan that is considered.
   - Must estimate the size of the result and the cost for each plan node.
   - Key issues: statistics, indexes, operator implementations.
 Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
  • Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
 Two main issues:
  • For a given query, what plans are considered?
   - Algorithm to search the plan space for the cheapest (estimated) plan.
  • How is the cost of a plan estimated?
 Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
 Sailors (sid: integer, sname: string, rating: integer, age: real)
 Reserves (sid: integer, bid: integer, day: date, rname: string)
 Similar to the old schema; rname is added for variations.
 Reserves:
  • Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
 Sailors:
  • Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
 An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
 Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
 For each block, the plans considered are:
  • All available access methods, for each relation in the FROM clause.
  • All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods). A small enumeration sketch follows.
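The left-deep enumeration idea can be sketched as follows. The relation sizes and the cost formula are invented placeholders; System R's real model estimates costs from statistics, access paths and join methods, and prunes with dynamic programming rather than brute force.

# Enumerate left-deep join orders for the relations in a FROM clause and
# pick the cheapest under a crude page-based nested-loops cost model.
from itertools import permutations

pages = {"Sailors": 500, "Reserves": 1000, "Boats": 50}  # assumed sizes

def plan_cost(order):
    # Cost sketch: scan the outer, then for each subsequent (inner)
    # relation charge outer_pages * inner_pages (simple nested loops).
    cost = outer = pages[order[0]]
    for inner in order[1:]:
        cost += outer * pages[inner]
        outer = max(1, (outer * pages[inner]) // 100)  # fake result-size estimate
    return cost

best = min(permutations(pages), key=plan_cost)
print(best, plan_cost(best))  # cheapest left-deep order and its estimated cost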

8 (a) Discuss multimedia databases in detail. (8) (NOV DEC 2010)

Multimedia databases

 To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database,
  • rather than storing them outside the database, in a file system.
 The database must handle large object representation.
 Similarity-based retrieval must be provided by special index structures.
 Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
 Store and transmit multimedia data in compressed form.
  • JPEG and GIF: the most widely used formats for image data.
  • MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
 MPEG-1: quality comparable to VHS video tape.
  • Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
 MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  • Compresses 1 minute of audio-video to approximately 17 MB.
 Several alternatives for audio encoding:
  • MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
 The most important types are video and audio data.
 Characterized by high data volumes and real-time information-delivery requirements:
  • Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  • Data must be delivered at a rate that does not cause overflow of system buffers.
  • Synchronization among distinct data streams must be maintained:
   video of a person speaking must show lips moving synchronously with the audio.

Video Servers
 Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  • Must guarantee end-to-end delivery rates.
 Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
 Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
 Head-end terminals – used to view multimedia data.
  • PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
 Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  • e.g., identify similar designs for registering a new trademark.
 Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  • e.g., test user input against stored commands.
 Handwritten data: Identify a handwritten data item or command stored in the database.
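Similarity-based retrieval ultimately reduces to nearest-neighbor search over feature vectors. Here is a linear-scan sketch with made-up vectors; real systems use index structures rather than scanning.

# Nearest-neighbor retrieval over feature vectors by linear scan.
import math

features = {  # item -> feature vector (illustrative values)
    "trademark_a": (0.9, 0.1, 0.3),
    "trademark_b": (0.8, 0.2, 0.4),
    "trademark_c": (0.1, 0.9, 0.7),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(query, k=2):
    # Rank stored items by distance to the query vector; return the k closest.
    return sorted(features, key=lambda item: euclidean(features[item], query))[:k]

print(nearest((0.85, 0.15, 0.35)))  # the two most similar stored designs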

(b) Explain the features of active and deductive databases in detail. (8) (NOV DEC 2010)

15 Deductive Databases
 SQL-92 cannot express some queries:
  • Are we running low on any parts needed to build a ZX600 sports car?
  • What is the total component and assembly cost to build a ZX600 at today's part prices?
 Can we extend the query language to cover such queries?
  • Yes, by adding recursion.
 Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
 SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
 Datalog is a query language that has the same if-then flavor:
  • New: the answer table can appear in the From clause, i.e., be defined recursively.
  • Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
 Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  • This takes us one level down the Assembly hierarchy.
  • To find components that are one level deeper (e.g., rim), we need another join.
  • To find all components, we need as many joins as there are levels in the given instance.
 For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
 The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
 The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
 The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
 • The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
 • In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
 • If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
 Consider the following program:
  ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
 According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
 • Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
 • Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
 • The set of all integers is a fixpoint of double+.
 • The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
 Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'If the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
 • If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
 • But we have Small(trike) if Rule 2 is applied first.
 • There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
 If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
 • T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
 • Stratified program: If T depends on not S, then S cannot depend on T (or not T).
 • If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
 Selection:      Result(Y) :- R(X, Y), X = c.
 Projection:     Result(Y) :- R(X, Y).
 Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
 Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
 Union:          Result(X, Y) :- R(X, Y).
                 Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

 WITH
  Big2(Part) AS
   (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
   ((SELECT A2.Part FROM Assembly A2)
    EXCEPT
    (SELECT B1.Part FROM Big2 B1))
 SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
 SELECT A.Part, SUM(A.Qty)
 FROM Assembly A
 GROUP BY A.Part

 NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
 • The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
 • In order to apply such a rule, we must have all of the Assembly relation available.
 • Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
 • Repeated inferences: When recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
 • Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
 Avoiding Repeated Inferences:
 • Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For example:
   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
 A small seminaive-evaluation sketch of this idea follows.

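The sketch below evaluates the Comp program seminaively in Python; the Assembly instance is invented, and only facts derived in the previous iteration are joined against.

# Seminaive fixpoint evaluation of:
#   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
#   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}   # base rule
delta = set(comp)                           # delta_Comp: facts from last round
while delta:
    # Recursive rule, joining Assembly only against the delta table.
    new = {(p, s) for (p, p2, q) in assembly
                  for (p2d, s) in delta if p2 == p2d} - comp
    comp |= new
    delta = new  # next round only uses freshly derived facts

print(sorted(comp))  # includes ('trike', 'spoke'), ('trike', 'tire'), etc.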

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
 SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
 SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
 • There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

 • Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
 • But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
 • Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  Magic_SL(spoke).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
 The Magic Sets program rewriting algorithm can be summarized as follows:
  Add 'Magic' filters: Modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
  Define the 'Magic' relations: We must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Web databases
 A database system is essentially just a way to manage lists of information. The information can come from a variety of sources. For example, it can represent research data, business records, customer requests, sports statistics, sales reports, personal hobby information, personnel records, bug reports or student grades.
 The power of a database system comes in when the information you want to organize and manage becomes voluminous or complex, so that your records become more burdensome than you care to deal with by hand.
 Databases can be used by large corporations processing millions of transactions a day, of course, but even small-scale operations involving a single person maintaining information of personal interest may require a database. It's not difficult to think of situations in which the use of a database can be beneficial, because you needn't have huge amounts of information before that information becomes difficult to manage.

Web Database Applications
 Database systems are used now to provide services in ways that were not possible until relatively recently. The manner in which many organizations use a database in conjunction with a Web site is a good example.

 Suppose your company has an inventory database that is used by the service desk staff when customers call to find out whether or not you have an item in stock and how much it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.
 This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question – automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)
 But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.
 Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
 a collection of entities
 relationships among entities
An entity is an object that exists and is distinguishable from other objects.
 Example: specific person, company, event, plant
An entity set is a set of entities of the same type that share the same properties.
 Example: set of all persons, companies, trees, holidays

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
 customer = (customer-name, social-security, customer-street, customer-city)
 account = (account-number, balance)
Domain -- the set of permitted values for each attribute
Attribute types:
 Simple and composite attributes
 Single-valued and multi-valued attributes
 Null attributes
 Derived attributes

Relationship Sets
A relationship is an association among several entities.
Example:
 Hayes (customer entity) – depositor (relationship set) – A-102 (account entity)
A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
 {(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship.
 Example: (Hayes, A-102) ∈ depositor
An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
 Refers to the number of entity sets that participate in a relationship set.
 Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
 Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.
 The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
 Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
 Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues
 Use of entity sets vs attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
 Use of entity sets vs relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
 Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
 Express the number of entities to which another entity can be associated via a relationship set.
 Most useful in describing binary relationship sets.
 For a binary relationship set, the mapping cardinality must be one of the following types:
  – One to one
  – One to many
  – Many to one

  – Many to many
 We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship

 A customer is associated with at most one loan via the relationship borrower.
 A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

 In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
 In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship

 A customer is associated with several (possibly 0) loans via borrower.
 A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
 If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  – y is a dominant entity (in the example below, loan)
  – x is a subordinate entity (in the example below, payment)
 If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
 Rectangles represent entity sets.
 Ellipses represent attributes.
 Diamonds represent relationship sets.
 Lines link attributes to entity sets, and entity sets to relationship sets.
 Double ellipses represent multivalued attributes.
 Dashed ellipses denote derived attributes.
 Primary key attributes are underlined.

Weak Entity Sets
 An entity set that does not have a primary key is referred to as a weak entity set.
 The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
 The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
 The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
 We depict a weak entity set by double rectangles.
 We underline the discriminator of a weak entity set with a dashed line.
 payment-number -- discriminator of the payment entity set
 Primary key for payment -- (loan-number, payment-number)

Specialization
 Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


 These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
 Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization
 A bottom-up design process -- combine a number of entity sets that share the same features into a higher-level entity set.
 Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
 Attribute Inheritance -- a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
 Constraint on which entities can be members of a given lower-level entity set:
  condition-defined
  user-defined
 Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  disjoint
  overlapping
 Completeness constraint -- specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  total
  partial

Aggregation
 Relationship sets borrower and loan-officer represent the same information.
 Eliminate this redundancy via aggregation:
  Treat the relationship as an abstract entity.
  Allows relationships between relationships.
  Abstraction of a relationship into a new entity.
 Without introducing redundancy, the following diagram represents that:
  A customer takes out a loan.
  An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
 The use of an attribute or entity set to represent an object.
 Whether a real-world concept is best expressed by an entity set or a relationship set.
 The use of a ternary relationship versus a pair of binary relationships.
 The use of a strong or weak entity set.
 The use of generalization -- contributes to modularity in the design.
 The use of aggregation -- can treat the aggregate entity set as a single unit without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
 Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


 A database which conforms to an E-R diagram can be represented by a collection of tables.
 For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
 Each table has a number of columns (generally corresponding to attributes), which have unique names.
 Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
 A strong entity set reduces to a table with the same attributes.
 A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
 A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
 The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
 Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
 Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars and Time Dimensions
 Time is considered an ordered sequence of points in some granularity.
  • Use the term chronon instead of point to describe the minimum granularity.
 A calendar organizes time into different time units for convenience.
  • Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.
 Point events
  • Single time point event, e.g., a bank deposit.
  • A series of point events can form time series data.
 Duration events
  • Associated with a specific time period. A time period is represented by a start time and an end time.
 Transaction time
  • The time when the information from a certain transaction becomes valid.
 Bitemporal database
  • Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
 Add to every tuple:
  • Valid start time
  • Valid end time
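A sketch of tuple versioning follows; the table layout and the None-for-"now" end-time convention are assumptions made for the example.

# Tuple versioning: updates close the current version's valid interval
# and append a new version, instead of overwriting in place.
from datetime import date

emp_salary = []  # rows: (emp_id, salary, valid_start, valid_end); end=None means current

def update_salary(emp_id, new_salary, when):
    for i, (eid, sal, start, end) in enumerate(emp_salary):
        if eid == emp_id and end is None:            # find the current version
            emp_salary[i] = (eid, sal, start, when)  # close its valid interval
    emp_salary.append((emp_id, new_salary, when, None))

update_salary(7, 30000, date(2009, 1, 1))
update_salary(7, 35000, date(2010, 6, 1))
# "As of" query: the version whose valid interval contains the date.
asof = date(2009, 12, 31)
print([r for r in emp_salary
       if r[2] <= asof and (r[3] is None or asof < r[3])])  # the 30000 version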

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


 A single complex object stores all temporal changes of the object.
 Time-varying attribute
  • An attribute that changes over time, e.g., age.
 Non-time-varying attribute
  • An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

 Point Data
  • Points in a multidimensional space.
  • E.g., raster data such as satellite imagery, where each pixel stores a measured value.
  • E.g., feature vectors extracted from text.
 Region Data
  • Objects have spatial extent, with location and boundary.
  • The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
 Spatial Range Queries
  • Find all cities within 50 miles of Madison.
  • The query has an associated region (location, boundary).
  • The answer includes overlapping or contained data regions.
 Nearest-Neighbor Queries
  • Find the 10 cities nearest to Madison.
  • Results must be ordered by proximity.
 Spatial Join Queries
  • Find all cities near a lake.
  • Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
 Geographic Information Systems (GIS)
  • E.g., ESRI's ArcInfo; OpenGIS Consortium.
  • Geospatial information.
  • All classes of spatial queries and data are common.
 Computer-Aided Design/Manufacturing
  • Store spatial objects such as the surface of an airplane fuselage.
  • Range queries and spatial join queries are common.
 Multimedia Databases
  • Images, video, text, etc. stored and retrieved by content.
  • First converted to feature vector form; high dimensionality.
  • Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
 • B+ trees are fundamentally single-dimensional indexes.
 • When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
 • A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
 • Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries: <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
 Spatial queries (GIS, CAD)
  • Find all hotels within a radius of 5 miles from the conference venue.
  • Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  • Find all cities that lie on the Nile in Egypt.
  • Find all parts that touch the fuselage (in a plane design).
 Similarity queries (content-based retrieval)
  • Given a face, find the five most similar faces.
 Multidimensional range queries
  • 50 < age < 55 AND 80K < sal < 90K

Drawbacks
 An index based on spatial location is needed.
  • One-dimensional indexes don't support multidimensional searching efficiently.
  • Hash indexes only support point queries; we want to support range queries as well.
  • Must support inserts and deletes gracefully.
 Ideally, we want to support non-point data as well (e.g., lines, shapes).
 The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
 Leaf entry = <n-dimensional box, rid>
  • This is Alternative (2), with the key value being a box.
  • The box is the tightest bounding box for a data object.
 Non-leaf entry = <n-dim box, ptr to child node>
  • The box covers all boxes in the child node (in fact, subtree).
 All leaves are at the same distance from the root.
 Nodes can be kept 50% full (except the root).
  • Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
 Start at the root.
 1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
 2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
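The search can be sketched recursively. The node layout (entries as (box, child-or-rid) pairs, boxes as (xlo, ylo, xhi, yhi) tuples) is an assumption for the example.

# R-tree search: descend every subtree whose bounding box overlaps Q.
def overlaps(a, b):  # boxes as (xlo, ylo, xhi, yhi)
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, result):
    if node["leaf"]:
        # Leaf entries (box, rid): rid identifies an object that MIGHT
        # overlap q; the exact geometry must still be checked.
        result += [rid for (box, rid) in node["entries"] if overlaps(box, q)]
    else:
        # Non-leaf entries (box, child): search each overlapping subtree.
        for (box, child) in node["entries"]:
            if overlaps(box, q):
                search(child, q, result)
    return result

leaf1 = {"leaf": True, "entries": [((0, 0, 2, 2), "r1"), ((3, 3, 4, 4), "r2")]}
leaf2 = {"leaf": True, "entries": [((5, 5, 7, 7), "r3")]}
root = {"leaf": False, "entries": [((0, 0, 4, 4), leaf1), ((5, 5, 7, 7), leaf2)]}
print(search(root, (1, 1, 3, 3), []))  # ['r1', 'r2']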

Improving Search Using Constraints
 It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
 But why not use convex polygons to approximate query regions more accurately?
  • This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  • The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
 Start at the root and go down to the "best-fit" leaf L.
  • Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
 If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  • Adjust the entry for L in its parent so that the box now covers (only) L1.
  • Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
 The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
 The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
 Idea: redistribute so as to minimize the area of L1 plus the area of L2. A small sketch of the least-enlargement child choice follows.
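The least-enlargement rule for descending to the best-fit leaf, with the smallest-area tie-break from the text, can be sketched as (box layout as in the search sketch above):

# Choose the child whose bounding box needs the least enlargement to
# cover the new entry's box; break ties by smallest area.
def area(b):
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

def enlarge(b, new):  # smallest box covering both
    return (min(b[0], new[0]), min(b[1], new[1]),
            max(b[2], new[2]), max(b[3], new[3]))

def choose_child(children, new_box):
    return min(children,
               key=lambda c: (area(enlarge(c, new_box)) - area(c), area(c)))

children = [(0, 0, 4, 4), (5, 5, 7, 7)]
print(choose_child(children, (1, 1, 2, 2)))  # (0, 0, 4, 4): zero enlargement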

R-Tree Variants
 The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  • Remove some (say, 30% of the) entries and reinsert them into the tree.
  • This could result in all reinserted entries fitting on some existing pages, avoiding a split.
 R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
 Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  • Searches now take a single path to a leaf, at the cost of redundancy.

GiST
 The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  • Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  • B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  • GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data

 Typically, high-dimensional datasets are collections of points, not regions.
  • E.g., feature vectors in multimedia applications.
  • Very sparse.
 Nearest-neighbor queries are common.
  • The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
 As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  • In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
 Parallel machines are becoming quite common and affordable:
  prices of microprocessors, memory and disks have dropped sharply.
 Databases are growing increasingly large:
  large volumes of transaction data are collected and stored for later analysis;
  multimedia objects like images are increasingly stored in databases.
 Large-scale parallel database systems are increasingly used for:
  storing large volumes of data;
  processing time-consuming decision-support queries;
  providing high throughput for transaction processing.

Parallelism in Databases
 Data can be partitioned across multiple disks for parallel IO.
 Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
  data can be partitioned, and each processor can work independently on its own partition.
 Queries are expressed in a high-level language (SQL, translated to relational algebra):
  makes parallelization easier.
 Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
 Thus, databases naturally lend themselves to parallelism.

IO Parallelism
 Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
 Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
 Partitioning techniques (number of disks = n):

 Round-robin:
  Send the i-th tuple inserted in the relation to disk i mod n.
 Hash partitioning:
  Choose one or more attributes as the partitioning attributes.
  Choose a hash function h with range 0 ... n – 1.
  Let i denote the result of hash function h applied to the partitioning attribute value of a tuple. Send the tuple to disk i.
 Range partitioning:
  Choose an attribute as the partitioning attribute.
  A partitioning vector [v0, v1, ..., vn-2] is chosen.
  Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
  E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
 A small sketch of the three placement functions follows.
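The sketch below reproduces the three placement rules; the bisect-based range lookup is an implementation choice for the example.

# Disk assignment under the three partitioning techniques (n disks).
from bisect import bisect_right

n = 3

def round_robin(i):          # i = insertion order of the tuple
    return i % n

def hash_partition(value):   # value of the partitioning attribute
    return hash(value) % n

vector = [5, 11]             # range-partitioning vector [v0, v1]
def range_partition(value):
    # v < 5 -> disk 0, 5 <= v < 11 -> disk 1, v >= 11 -> disk 2
    return bisect_right(vector, value)

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2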

- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
Disadvantages:
- Range queries are difficult to process.
- No clustering – tuples are scattered across all disks.
Hash partitioning:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: the lookup can go to a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.
Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - remaining disks are available for other queries;
  - good if result tuples are from one to a few blocks;
  - if many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted – an example of execution skew.
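The sketch referred to above: minimal Python routing functions for the three techniques, assuming n numbered disks and orderable partitioning-attribute values (the helper names are invented for illustration):

    import bisect

    def route_round_robin(i, n):
        # the i-th tuple inserted goes to disk i mod n
        return i % n

    def route_hash(value, n):
        # hash partitioning on the partitioning attribute value
        return hash(value) % n

    def route_range(value, vector):
        # vector = [v0, ..., vn-2]; v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1
        return bisect.bisect_right(vector, value)

    print(route_range(2, [5, 11]), route_range(8, [5, 11]), route_range(20, [5, 11]))
    # prints: 0 1 2, matching the example above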

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- assume a uniform distribution within each range of the histogram;
- the histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
A minimal sketch of the sorted-scan construction follows.
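The sketch below builds a balanced range-partition vector by reading off every 1/n-th value of the sorted relation, as described above (illustrative Python; assumes the partitioning attribute forms a key):

    def balanced_partition_vector(values, n):
        # values: partitioning-attribute values; n: number of partitions wanted
        ordered = sorted(values)        # sort the relation on the partitioning attribute
        step = len(ordered) // n
        # after every 1/n-th of the relation, take the next value as a vector entry
        return [ordered[i * step] for i in range(1, n)]

    print(balanced_partition_vector(range(100), 4))   # [25, 50, 75]
    # duplicates in the attribute would repeat entries and unbalance the vector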


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - locking and logging must be coordinated by passing messages between processors;
  - data in a local buffer may have been updated at another processor;
  - cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
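A toy, single-process Python sketch of this shared-disk protocol (the "disk", lock table and buffer pool here are simulated dictionaries invented for the illustration; a real implementation would involve a distributed lock manager):

    disk = {"P": "v0"}                 # page -> contents "on disk"
    locks, buffer_pool, dirty = {}, {}, set()

    def lock(page, mode):              # mode: "S" (shared) or "X" (exclusive)
        locks[page] = mode
        buffer_pool[page] = disk[page] # on locking, the page is read from disk

    def write(page, value):
        assert locks.get(page) == "X"
        buffer_pool[page] = value
        dirty.add(page)

    def unlock(page):
        if page in dirty:              # before unlocking, flush modified pages
            disk[page] = buffer_pool[page]
            dirty.discard(page)
        del locks[page], buffer_pool[page]

    lock("P", "X"); write("P", "v1"); unlock("P")
    print(disk["P"])                   # v1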

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically greater than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - all tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di;
  - this step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.
- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - the sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1;
  - each processor Pi performs a merge on the streams as they are received, to get a single sorted run;
  - the sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
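A compact Python sketch of range-partitioning sort, modelling each processor's partition as a list (sequential code standing in for the parallel steps; all names are illustrative):

    import bisect

    def range_partition_sort(relation, vector):
        # step 1: redistribute tuples by range partitioning on the sort key
        parts = [[] for _ in range(len(vector) + 1)]
        for t in relation:
            parts[bisect.bisect_right(vector, t)].append(t)
        # step 2: each "processor" sorts its partition locally (data parallelism)
        for p in parts:
            p.sort()
        # step 3: the final merge is trivial -- concatenate the partitions
        return [t for p in parts for t in p]

    print(range_partition_sort([8, 2, 20, 11, 5, 3], [5, 11]))
    # [2, 3, 5, 8, 11, 20]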

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used.
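A minimal sketch of asymmetric fragment-and-replicate for an arbitrary join condition (illustrative Python; r is given already partitioned, and theta is any predicate):

    def asym_fragment_replicate_join(r_partitions, s, theta):
        out = []
        for ri in r_partitions:          # each processor Pi holds one partition of r
            for tr in ri:
                for ts in s:             # s is replicated at every processor
                    if theta(tr, ts):
                        out.append((tr, ts))
        return out

    # non-equijoin r.A > s.B, which partitioned join cannot handle
    print(asym_fragment_replicate_join([[1, 7], [9]], [4, 8], lambda a, b: a > b))
    # [(7, 4), (9, 4), (9, 8)]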

- E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; s is therefore chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
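A sequential Python sketch of this partitioned parallel hash join; lists stand in for processors, h1 redistributes both relations, and a per-processor dict plays the role of the h2-partitioned local hash table (all names invented for the illustration):

    def parallel_hash_join(r, s, n):
        h1 = lambda key: hash(key) % n
        # phase 1: redistribute build relation s, then probe relation r, by h1
        s_parts = [[] for _ in range(n)]
        for key, val in s:
            s_parts[h1(key)].append((key, val))
        r_parts = [[] for _ in range(n)]
        for key, val in r:
            r_parts[h1(key)].append((key, val))
        # phase 2: each processor runs a local build/probe hash join
        out = []
        for i in range(n):
            table = {}
            for key, val in s_parts[i]:          # build on the smaller relation
                table.setdefault(key, []).append(val)
            for key, val in r_parts[i]:          # probe with the larger relation
                for sval in table.get(key, []):
                    out.append((key, val, sval))
        return out

    print(parallel_hash_join([(1, "r1"), (2, "r2")], [(1, "s1")], n=2))
    # [(1, 'r1', 's1')]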

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of result = temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2, and P2 be assigned the computation of temp2 = r3 ⋈ r4.
- P3 is assigned the computation of temp1 ⋈ temp2. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.

- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering – building an expert system.
- Knowledge Engineers – the people who build the system.
- Knowledge Representation – the symbols used to represent the knowledge.
- Factual Knowledge – knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules – derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.
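A tiny forward-chaining interpreter in Python, sketching the match-resolve-act cycle described above (the facts and rules are invented for the illustration; conflict resolution here is simply "first satisfied rule"):

    # each rule: (set of facts that must all hold, fact to conclude)
    rules = [
        ({"fever", "rash"}, "suspect_measles"),
        ({"suspect_measles"}, "refer_specialist"),
    ]

    def forward_chain(working_memory, rules):
        while True:
            # Match: conflict set = rules whose LHS holds and whose RHS is new
            conflict = [(lhs, rhs) for lhs, rhs in rules
                        if lhs <= working_memory and rhs not in working_memory]
            if not conflict:
                return working_memory    # no rule fires: halt
            lhs, rhs = conflict[0]       # Conflict resolution: pick the first
            working_memory.add(rhs)      # Act: update working memory, re-match

    print(forward_chain({"fever", "rash"}, rules))
    # contains suspect_measles and refer_specialist (set order may vary)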


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.
Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge base + inference engine

KBS Architecture


- The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
- It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & inference:
  KR                       Inference
  Logic                    Resolution principle
  Production rules         backward (top-down, goal-directed); forward (bottom-up, data-driven)
  Semantic nets & frames   Inheritance & advanced reasoning
  Case-based reasoning     Similarity-based
KBS tools – shells:
- consist of a KA tool, database & development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages:
- increase availability of expert knowledge: expertise otherwise not accessible; training future experts;
- efficient and cost-effective;
- consistency of answers;
- explanation of solution;
- deal with uncertainty.
Limitations:
- lack of common sense;
- inflexible, difficult to modify;
- restricted domain of expertise;
- lack of learning ability;
- not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimic the organisational structure with data;
- local access and autonomy without exclusion;
- cheaper to create and easier to expand;
- improved availability, reliability and performance, by removing reliance on a central site;
- reduced communication overhead: most data access is local, which is less expensive and performs better;
- improved processing power: many machines handle the database, rather than a single server.
Compared with conventional (centralized) databases, distributed databases are:
- more complex to implement;
- more costly to maintain;
- harder to control for security and integrity;
- lacking in standards and experience;
- more complex in their design issues.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P while the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
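A minimal Python sketch of the read/write timestamp rules just described (an illustration only, not a production design; the class name MVStore and its methods are invented):

    class MVStore:
        def __init__(self):
            self.versions = {}   # object -> list of (write_ts, value)
            self.rts = {}        # object -> largest timestamp that has read it

        def read(self, obj, ts):
            # read the most recent version preceding the transaction timestamp
            older = [(w, v) for w, v in self.versions.get(obj, []) if w <= ts]
            self.rts[obj] = max(self.rts.get(obj, 0), ts)
            return max(older)[1] if older else None

        def write(self, obj, ts, value):
            if ts < self.rts.get(obj, 0):
                raise RuntimeError("abort: a later transaction already read " + obj)
            self.versions.setdefault(obj, []).append((ts, value))

    db = MVStore()
    db.write("Object1", 0, "Foo"); db.write("Object1", 1, "Hello")
    print(db.read("Object1", 1))   # "Hello": the reader never blocks on writers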

At t1, the state of the database could be:

  Time  Object1  Object2
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

  Time  Object1  Object2    Object3
  t2    "Hello"  (deleted)  "Foo-Bar"
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units; base stations function as gateways to the fixed network for the mobile units.

Wireless Communications –
The wireless medium has bandwidth significantly lower than that of a wired

network. The current generation of wireless technology has data rates that range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,

seamless roaming throughout a geographical region. Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage. To manage it, the entire mobility domain is divided into one or more smaller

domains, called cells, each of which is supported by at least one base station.

Mobile units may move unrestricted throughout the cells of the domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments
The characteristics of mobile computing include:

- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.

The server may not be able to reach a client


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance


Mobile database design Location-based service Division of labor Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a

client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans. The search space must be pruned; typically only left-deep plans are considered.
- Estimate the cost of each plan that is considered: estimate the size of the result and the cost for each plan node. Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered? An algorithm searches the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
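To make the left-deep search concrete, here is a toy Python enumeration of join orders under an invented cost model (cost = sum of intermediate result sizes, with per-pair join selectivities). It is only a sketch of the idea: System R uses dynamic programming rather than this brute force, and the Boats relation and all numbers below are assumptions added for the example:

    from itertools import permutations

    def best_left_deep_plan(card, sel):
        # card: relation -> cardinality; sel: frozenset({a, b}) -> join selectivity
        best = (float("inf"), None)
        for order in permutations(card):
            joined, size, cost = [order[0]], card[order[0]], 0.0
            for rel in order[1:]:
                f = 1.0
                for prev in joined:     # apply predicates linking rel to the prefix
                    f *= sel.get(frozenset((prev, rel)), 1.0)
                size *= card[rel] * f
                cost += size            # charge each intermediate result
                joined.append(rel)
            best = min(best, (cost, order))
        return best

    card = {"Sailors": 40000, "Reserves": 100000, "Boats": 100}
    sel = {frozenset(("Sailors", "Reserves")): 1 / 40000,
           frozenset(("Reserves", "Boats")): 1 / 100}
    print(best_left_deep_plan(card, sel))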

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.

- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:

- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression:

  - MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

  - MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

Several alternatives for audio encoding exist: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:

- data must be delivered sufficiently fast that there are no gaps in the audio or video;
- data must be delivered at a rate that does not cause overflow of system buffers;
- synchronization among distinct data streams must be maintained:

  video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.

- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.

- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy

the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:

- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:

- this takes us one level down the Assembly hierarchy;
- to find components that are one level deeper (e.g., rim), we need another join;
- to find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum i: tables defined in terms of tables in stratum i and lower strata.
  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, all of the Assembly relation must be available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
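A small Python sketch of the seminaive evaluation described above, for the Comp program (Qty is dropped for brevity, so the relation is a set of (part, subpart) pairs; delta holds only the facts that are new in each iteration):

    def seminaive_comp(assembly):
        comp = set(assembly)         # base case: every Assembly fact is a Comp fact
        delta = set(assembly)
        while delta:
            # Comp(Part, Subpt) :- Assembly(Part, Part2), delta_Comp(Part2, Subpt)
            derived = {(p, s) for (p, p2) in assembly for (d, s) in delta if p2 == d}
            delta = derived - comp   # keep only genuinely new facts for the next round
            comp |= delta
        return comp

    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire")}
    print(sorted(seminaive_comp(assembly)))
    # adds ('trike', 'spoke') and ('trike', 'tire') without re-deriving old facts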


Page 31: Database Technology

it costs. That's a relatively traditional use for a database. However, if your company puts up a Web site for customers to visit, you can provide an additional service: a search engine that allows customers to determine item pricing and availability themselves.

This gives your customers the information they want, and the way you provide it is by supplying a database application to search the inventory information stored in your database for the items in question, automatically. The customer gets the information immediately, without being put on hold listening to annoying canned music, or being limited by the hours your service desk is open. And for every customer who uses your Web site, that's one less phone call that needs to be handled by a person on the service desk payroll. (Perhaps the Web site pays for itself this way.)

But you can put the database to even better use than that. Web-based inventory search requests can provide information not only to your customers, but to you as well. The queries tell you what your customers are looking for, and the query results tell you whether or not you're able to satisfy their requests. To the extent you don't have what they want, you're probably losing business. So it makes sense to record information about inventory searches: what customers were looking for, and whether or not you had it in stock. Then you can use this information to adjust your inventory and provide better service to your customers.

Another recent application for databases is to serve up banner advertisements on Web pages. We don't like them any better than you do, but the fact remains that they are a popular application for Web databases, which can be used to store advertisements and retrieve them for display by a Web server.

Three Tier Architecture


4 (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.
Entity Sets
A database can be modeled as:

- a collection of entities;
- relationships among entities.

An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.

An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain -- the set of permitted values for each attribute.
Attribute types:
- simple and composite attributes;
- single-valued and multi-valued attributes;
- null attributes;
- derived attributes.
Relationship Sets
A relationship is an association among several entities.
Example:

Hayes (customer entity) -- depositor (relationship set) -- A-102 (account entity)

A relationship set is a mathematical relation among n ≥ 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}

where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.
Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.

- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.

- Role labels are optional, and are used to clarify the semantics of the relationship.

sect Role labels are optional and are used to clarify semantics of the relationship Design Issues

sect Use of entity sets vs attributesChoice mainly depends on the structure of the enterprise being modeled and on the

semantics associated with the attribute in questionsect Use of entity sets vs relationship sets

Possible guideline is to designate a relationship set to describe an action that occurs between entities

sect Binary versus n-ary relationship setsAlthough it is possible to replace a nonbinary (n-ary for n gt 2) relationship set by a

number of distinct binary relationship sets a n-ary relationship set shows more clearly that several entities participate in a single relationshipMapping Cardinalities

sect Express the number of entities to which another entity can be associated via a relationship set

sect Most useful in describing binary relationship setssect For a binary relationship set the mapping cardinality must be one of the following typesndash One to onendash One to manyndash Many to one

33

ndash Many to manysect We distinguish among these types by drawing either a directed line( )signifying

ldquoonerdquoor an undirected line( )signifying ldquomanyrdquo between the relationship set and the entity set

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a): a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.

- In the many-to-one relationship (b): a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y:
  - y is a dominant entity (in the example below, loan);
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number -- discriminator of the payment entity set.
- Primary key for payment -- (loan-number, payment-number).

Specialization
• Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).
Generalization
• A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.
Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
- condition-defined
- user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
- disjoint
- overlapping
• Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
- total
- partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
- Treat the relationship as an abstract entity.
- Allows relationships between relationships.
- Abstraction of a relationship into a new entity.
• Without introducing redundancy, aggregation represents that:
- A customer takes out a loan.
- An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization: contributes to modularity in the design.
• The use of aggregation: can treat the aggregate entity set as a single unit without concern for the details of its internal structure.
Reduction of an E-R Schema to Tables
• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.

• A database that conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.
Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
E-R Diagram for the Banking Enterprise
Representing Generalization as Tables
• Method 1: form a table for the generalized entity account; form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)
Temporal Database
Time Representation, Calendars, and Time Dimensions
• Time is considered an ordered sequence of points in some granularity.
• The term chronon is used instead of point to describe the minimum granularity.
• A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.
Point events
• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events
• Associated with a specific time period; a time period is represented by a start time and an end time.
Transaction time
• The time when the information from a certain transaction becomes valid.
Bitemporal database
• A database dealing with both time dimensions (valid time and transaction time).
Incorporating Time in Relational Databases Using Tuple Versioning
• Add to every tuple:
- Valid start time
- Valid end time
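A minimal sketch of tuple versioning in Python (the relation, attribute names, and the NOW sentinel are illustrative, not from the source): each update closes the current version by setting its valid end time and appends a new version, so past states remain queryable.

    from datetime import date

    NOW = date.max  # sentinel meaning "valid until further notice"

    # Each tuple carries a valid-time interval [valid_start, valid_end).
    employees = [
        {"name": "Ali", "salary": 30000,
         "valid_start": date(2009, 1, 1), "valid_end": NOW},
    ]

    def update_salary(rows, name, new_salary, when):
        """Close the current version and append a new one (tuple versioning)."""
        for row in rows:
            if row["name"] == name and row["valid_end"] == NOW:
                row["valid_end"] = when          # old version now ends
        rows.append({"name": name, "salary": new_salary,
                     "valid_start": when, "valid_end": NOW})

    update_salary(employees, "Ali", 35000, date(2010, 6, 1))
    # Both versions are retained, so "as of" queries on any past date still work.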

Incorporating Time in Object-Oriented Databases Using Attribute Versioning
• A single complex object stores all temporal changes of the object.
• Time-varying attribute: an attribute that changes over time, e.g., age.
• Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.
Region Data
• Objects have spatial extent with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.
Types of Spatial Queries
Spatial Range Queries
• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.
Spatial Join Queries
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.
Applications of Spatial Data
Geographic Information Systems (GIS)
• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.
Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature-vector form; high dimensionality.
• Nearest-neighbor queries are the most common.
Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.
• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
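A small demonstration of this linearization, assuming the four entries above: sorting on the composite key <age, sal> keeps entries grouped by age, so a range query on sal alone matches entries scattered across the whole sequence and gains nothing from the index.

    entries = [(11, 80), (12, 10), (12, 20), (13, 75)]

    # A composite-key B+ tree stores entries in this one-dimensional order:
    print(sorted(entries))   # [(11, 80), (12, 10), (12, 20), (13, 75)]

    # A range query on age alone maps to a contiguous run of entries, but a
    # query on sal alone (e.g. sal >= 70) matches entries scattered across
    # the sequence: (11, 80) and (13, 75).
    print([e for e in sorted(entries) if e[1] >= 70])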

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.
Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K
Drawbacks: an index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• The index must support inserts and deletes gracefully.
• Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.

R-Tree
R-Tree Properties
• Leaf entry = <n-dimensional box, rid>.
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.
• Non-leaf entry = <n-dimensional box, pointer to child node>.
- The box covers all boxes in the child node (in fact, the whole subtree).
• All leaves are at the same distance from the root.
• Nodes can be kept 50% full (except the root).
- Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.
Example of R-Tree
Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
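A minimal recursive sketch of this search in Python (the dict-based node layout and the overlaps test are illustrative assumptions; a real R-tree stores entries in disk pages):

    def overlaps(a, b):
        """Axis-aligned boxes ((xlo, ylo), (xhi, yhi)) overlap iff they
        intersect on every dimension."""
        (axlo, aylo), (axhi, ayhi) = a
        (bxlo, bylo), (bxhi, byhi) = b
        return axlo <= bxhi and bxlo <= axhi and aylo <= byhi and bylo <= ayhi

    def search(node, q, results):
        """Collect rids of leaf entries whose bounding box overlaps query box q."""
        if node["leaf"]:
            for box, rid in node["entries"]:
                if overlaps(box, q):
                    results.append(rid)   # candidate; exact geometry checked later
        else:
            for box, child in node["entries"]:
                if overlaps(box, q):      # descend only into overlapping subtrees
                    search(child, q, results)

    leaf = {"leaf": True,
            "entries": [(((0, 0), (2, 2)), "r1"), (((5, 5), (6, 6)), "r2")]}
    root = {"leaf": False, "entries": [(((0, 0), (6, 6)), leaf)]}
    out = []
    search(root, ((1, 1), (3, 3)), out)
    print(out)   # ['r1']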

Improving Search Using Constraints
• It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
• But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.
Insert Entry <B, ptr>
• Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
• If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
Splitting a Node during Insertion
• The entries in node L plus the newly inserted entry must be distributed between L1 and L2.
• The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
• Idea: redistribute so as to minimize the area of L1 plus the area of L2.
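A sketch of the two heuristics just described, under the same illustrative box representation (exhaustive minimum-total-area split is for illustration only; production R-trees use cheaper quadratic or linear split algorithms):

    from itertools import combinations
    from functools import reduce

    def area(box):
        (xlo, ylo), (xhi, yhi) = box
        return (xhi - xlo) * (yhi - ylo)

    def enlarge(box, b):
        """Smallest box covering both box and b."""
        (xlo, ylo), (xhi, yhi) = box
        (bxlo, bylo), (bxhi, byhi) = b
        return ((min(xlo, bxlo), min(ylo, bylo)), (max(xhi, bxhi), max(yhi, byhi)))

    def choose_child(entries, b):
        """Pick the child whose box needs least enlargement to cover b;
        break ties by smallest area."""
        return min(entries,
                   key=lambda e: (area(enlarge(e[0], b)) - area(e[0]), area(e[0])))

    def split(entries):
        """Try every 2-way partition; keep the one minimising the total area
        of the two bounding boxes (exponential: a sketch, not production code)."""
        best = None
        for k in range(1, len(entries)):
            for left in combinations(entries, k):
                right = [e for e in entries if e not in left]
                cost = area(reduce(enlarge, (e[0] for e in left))) + \
                       area(reduce(enlarge, (e[0] for e in right)))
                if best is None or cost < best[0]:
                    best = (cost, list(left), right)
        return best[1], best[2]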

R-Tree Variants
• The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
• R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
• Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.
GiST
• The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency-control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.
Indexing High-Dimensional Data
• Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.
• Nearest-neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
• As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
• Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
• Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
• Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.
Parallelism in Databases
• Data can be partitioned across multiple disks for parallel I/O.
• Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
• Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
• Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
• Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
• Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
• Horizontal partitioning: the tuples of a relation are divided among many disks such that each tuple resides on one disk.
• Partitioning techniques (number of disks = n):
Round-robin
• Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning
• Choose one or more attributes as the partitioning attributes.

• Choose a hash function h with range 0 … n - 1. Let i denote the result of the hash function h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
Range partitioning
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, ..., vn-2] is chosen.
• Let v be the partitioning-attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1.
• E.g., with a partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
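A sketch of the three partitioning techniques in Python (the key extractor and the use of Python's built-in hash are illustrative assumptions):

    import bisect

    def round_robin(tuples, n):
        """i-th inserted tuple goes to disk i mod n."""
        disks = [[] for _ in range(n)]
        for i, t in enumerate(tuples):
            disks[i % n].append(t)
        return disks

    def hash_partition(tuples, n, key):
        disks = [[] for _ in range(n)]
        for t in tuples:
            disks[hash(key(t)) % n].append(t)
        return disks

    def range_partition(tuples, vector, key):
        """vector = [v0, ..., vn-2]; v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1."""
        disks = [[] for _ in range(len(vector) + 1)]
        for t in tuples:
            disks[bisect.bisect_right(vector, key(t))].append(t)
        return disks

    rows = [(2,), (8,), (20,)]
    print(range_partition(rows, [5, 11], key=lambda t: t[0]))
    # [[(2,)], [(8,)], [(20,)]] -- matching the [5, 11] example above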

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.
Round-robin
• Best suited for a sequential scan of the entire relation on each query: all disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
• Range queries are difficult to process: there is no clustering, so tuples are scattered across all disks.
Hash partitioning
• Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced.
• Good for point queries on the partitioning attribute:
- Can look up a single disk, leaving the others available for answering other queries.
- An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.
Range partitioning
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed.
- The remaining disks are available for other queries.
- Good if the result tuples come from one to a few blocks.
- If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
Partitioning a Relation across Disks
• If a relation contains only a few tuples that will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed; that is, some disks have many tuples while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning if a good hash function is chosen.
Handling Skew in Range Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (a sketch of this construction follows below).
• Here n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
• An alternative technique, based on histograms, is used in practice.
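A sketch of the sort-based construction in Python (the key extractor is an assumption; a real system would sort externally rather than in memory):

    def balanced_partition_vector(tuples, n, key):
        """Scan the relation in sorted order and record the partitioning-attribute
        value after every 1/n-th of the tuples: yields n-1 cut points."""
        values = sorted(key(t) for t in tuples)
        step = len(values) // n
        return [values[i * step] for i in range(1, n)]

    rows = [(v,) for v in [1, 3, 3, 4, 7, 9, 12, 15, 18, 20, 21, 30]]
    print(balanced_partition_vector(rows, 4, key=lambda t: t[0]))
    # [4, 12, 20] -- roughly equal numbers of tuples fall between cut points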

Handling Skew Using Histograms
• A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
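The same protocol as Python-style pseudocode (the lockmgr and disk interfaces are illustrative assumptions, not any real DBMS API):

    def read_page(lockmgr, disk, page_id):
        lockmgr.lock(page_id, mode="shared")   # lock before reading
        page = disk.read(page_id)              # re-read from disk on locking, so a
        value = page.copy()                    # stale buffer copy is never used
        lockmgr.unlock(page_id)                # not modified: no flush needed
        return value

    def write_page(lockmgr, disk, page_id, update):
        lockmgr.lock(page_id, mode="exclusive")
        page = disk.read(page_id)
        update(page)                           # modify the page in the local buffer
        disk.write(page_id, page)              # flush before unlocking, so the
        lockmgr.unlock(page_id)                # next reader sees the latest version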

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically much larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
• Create a range-partition vector with m entries on the sorting attributes.
• Redistribute the relation using range partitioning: all tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di. This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.
• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn-1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
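A single-process simulation of this scheme in Python (lists stand in for the per-disk partitions, and heapq.merge plays the role of the pipelined merge at each merging processor):

    import bisect
    import heapq

    def parallel_external_sort_merge(partitions, vector):
        # Phase 1: each processor Pi sorts its own partition locally.
        runs = [sorted(p) for p in partitions]

        # Phase 2: range-partition every sorted run across the merging processors.
        streams = [[[] for _ in runs] for _ in range(len(vector) + 1)]
        for i, run in enumerate(runs):
            for v in run:
                streams[bisect.bisect_right(vector, v)][i].append(v)

        # Each merging processor merges the sorted streams it receives...
        merged = [list(heapq.merge(*s)) for s in streams]

        # ...and concatenating the merged ranges gives the final sorted output.
        return [v for part in merged for v in part]

    print(parallel_external_sort_merge([[9, 2, 14], [7, 1, 20]], vector=[8]))
    # [1, 2, 7, 9, 14, 20]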

Parallel Join
• The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor are collected together to produce the final result.
Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations whose join we want to compute. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
• Either range partitioning or hash partitioning can be used.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes the join of ri with si. Any of the standard join methods can be used.
Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case: asymmetric fragment-and-replicate.
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors rather than repartition r and s on the join attributes.
Partitioned Parallel Hash Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, so s is chosen as the build relation.
• A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.
• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
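A sequential simulation of the partitioned hash join in Python (h1 distributes tuples across "processors"; the local dictionary build/probe stands in for h2 and the local hash join; relations and key extractors are illustrative):

    def partitioned_hash_join(r, s, n, r_key, s_key):
        h1 = lambda v: hash(v) % n                 # maps a tuple to a processor
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in s:
            s_parts[h1(s_key(t))].append(t)        # redistribute build relation s
        for t in r:
            r_parts[h1(r_key(t))].append(t)        # redistribute probe relation r

        result = []
        for i in range(n):                         # each "processor" joins locally
            table = {}                             # build phase on si
            for t in s_parts[i]:
                table.setdefault(s_key(t), []).append(t)
            for t in r_parts[i]:                   # probe phase on ri
                for match in table.get(r_key(t), []):
                    result.append(t + match)
        return result

    r = [(1, "loan-17"), (2, "loan-23")]
    s = [(1, "Jones"), (3, "Smith")]
    print(partitioned_hash_join(r, s, 4, r_key=lambda t: t[0], s_key=lambda t: t[0]))
    # [(1, 'loan-17', 1, 'Jones')]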

Parallel Nested-Loop Join
• Assume that relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on a join attribute of relation r at each of the partitions of relation r.
• Use asymmetric fragment-and-replicate, with relation s being replicated and relation r using its existing partitioning.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperation Parallelism
Pipelined parallelism
• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 the computation of temp2 = temp1 ⋈ r3,
- and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.
Independent Parallelism
• Consider again a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
- P2 the computation of temp2 = r3 ⋈ r4,
- and P3 the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
• When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-Based Systems
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence", i.e. expert knowledge, and thereby makes intelligent decisions quickly and in repeatable form. They are also known as expert systems and are part of artificial intelligence.
Rule-based systems are:
- Knowledge-based systems.
- Part of the artificial intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.
Terminology:
• Knowledge engineering: building an expert system.
• Knowledge engineers: the people who build the system.
• Knowledge representation: the symbols used to represent the knowledge.
• Factual knowledge: knowledge of a particular task domain that is widely shared.
• Heuristic knowledge: more judgmental knowledge of performance in a task domain.
Uses of rule-based systems:
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
- Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
- Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
- Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System
• Set of rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.
Problem-Solving Models
• Forward chaining: starts from a set of conditions and moves towards some conclusion.
• Backward chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures. A toy forward-chaining sketch follows below.
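A toy forward-chaining engine in Python (the rules and facts are illustrative): it repeats the match-resolve-act cycle, firing any rule whose premises are all in working memory, until no new facts appear.

    def forward_chain(facts, rules):
        """rules: list of (premises, conclusion) pairs. Fire until fixpoint."""
        memory = set(facts)
        changed = True
        while changed:
            changed = False
            for premises, conclusion in rules:        # match phase
                if set(premises) <= memory and conclusion not in memory:
                    memory.add(conclusion)            # act phase (trivial conflict
                    changed = True                    # resolution: fire every
        return memory                                 # applicable rule)

    rules = [({"fever", "rash"}, "measles-suspected"),
             ({"measles-suspected"}, "refer-to-doctor")]
    print(forward_chain({"fever", "rash"}, rules))
    # {'fever', 'rash', 'measles-suspected', 'refer-to-doctor'}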

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.
Disadvantages
• Lack the human common sense needed in some decision making.
• Cannot give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Inability to recognize when no answer is available.
Knowledge Bases
Knowledge-Based Systems: Definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used: KBS = knowledge base + inference engine.
KBS Architecture

• The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
• It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components found in almost any real-world system, would look like this:
Knowledge representation formalisms & inference
• Logic: resolution principle.
• Production rules: backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven).
• Semantic nets & frames: inheritance & advanced reasoning.
• Case-based reasoning: similarity-based.
KBS tools - shells
• Consist of a knowledge-acquisition tool, database, and development interface.
• Inductive shells:
- The simplest: example cases are represented as a matrix of known data (premises) and resulting effects.
- The matrix is converted into a decision tree or IF-THEN statements.
- Examples are selected for the tool.
• Rule-based shells:
- Simple to complex.
- IF-THEN rules.
• Hybrid shells:
- Sophisticated & powerful.
- Support multiple KR paradigms & reasoning schemes.
- Generic tools applicable to a wide range of problems.
• Special-purpose shells:
- Specifically designed for particular types of problems.


- Restricted to specialized problems.
• Building from scratch:
- Requires more time and effort.
- No constraints, unlike shells.
- Shells should be investigated first.
Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term and long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance and flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages:
- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.
Limitations:
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES
Advantages of distributed databases over conventional (centralized) databases:
• Mimic the organisational structure with data.
• Local access and autonomy without exclusion.
• Cheaper to create and easier to expand.
• Improved availability, reliability, and performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.
Disadvantages compared to conventional databases:
• More complex to implement and more costly to maintain.
• Security and integrity control standards and experience are lacking.
• Design issues are more complex.

7. (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read those versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent because the writes occur at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; in other words, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if transaction Ti wants to write to object P while the timestamp of the transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At time t1, the state of the database could be:

    Time  Object1  Object2
    t0    "Foo"    "Bar"
    t1    "Hello"  "Bar"

This indicates that the current set of this database (perhaps a key-value store) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted immediately, because the database holds multiple versions, but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

    Time  Object1  Object2    Object3
    t2    "Hello"  (deleted)  "Foo-Bar"
    t1    "Hello"  "Bar"      -
    t0    "Foo"    "Bar"      -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction runs in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
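A sketch of this timestamp-based version selection in Python (the data layout mirrors the tables above and is illustrative): a reader at transaction ID t sees, for each object, the newest version written at or before t, so the long-running reader at t1 is unaffected by the t2 writes.

    versions = {
        "Object1": [(0, "Foo"), (1, "Hello")],   # (write_ts, value), ascending ts
        "Object2": [(0, "Bar"), (2, None)],      # None marks the deletion at t2
        "Object3": [(2, "Foo-Bar")],
    }

    def read(obj, ts):
        """Return the newest version of obj with write_ts <= ts, if any."""
        visible = [v for (wts, v) in versions.get(obj, []) if wts <= ts]
        return visible[-1] if visible else None

    print(read("Object1", 1))   # 'Hello' -- the t0 value 'Foo' was superseded
    print(read("Object2", 1))   # 'Bar'   -- the snapshot at t1 ignores the t2 delete
    print(read("Object3", 1))   # None    -- Object3 does not yet exist at t1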

Recovery

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems - which may involve data management, transaction management, and database recovery - have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications.
- The limited life of the power supply (battery).
- The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire area, the mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
• The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
• The server may not be able to reach a client.

• A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients.
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges.
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another without the client noticing.
- Client mobility also allows new applications that are location-based.
Data Management Issues
• From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
• Data management issues as applied to mobile databases include:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
• Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
• Query optimization is an important task in a relational DBMS.
• One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:
1. Consider a set of alternative plans.
- Must prune the search space; typically, only left-deep plans are considered.
2. Estimate the cost of each plan that is considered.
- Must estimate the size of the result and the cost for each plan node.
- Key issues: statistics, indexes, operator implementations.
• Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
• Two main issues:
- For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
- How is the cost of a plan estimated?
• Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.
Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: date, rname: string)
• Similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization
• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
• For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation always a base relation in the FROM clause, considering all relation permutations and join methods).
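A sketch of the plan space walked for one block, in Python (the size and cost numbers are made up; a real System R optimizer estimates them from catalog statistics and also considers access methods and join algorithms): enumerate relation permutations as left-deep trees and keep the cheapest.

    from itertools import permutations

    sizes = {"Sailors": 500, "Reserves": 1000, "Boats": 10}   # pages, made up

    def left_deep_plans(relations):
        """Each permutation becomes a left-deep tree: ((R1 join R2) join R3)..."""
        for order in permutations(relations):
            plan = order[0]
            for r in order[1:]:
                plan = (plan, r)          # inner relation is always a base relation
            yield plan

    def est_size(plan):                   # toy cardinality estimate
        return sizes[plan] if isinstance(plan, str) else max(est_size(plan[0]) // 10, 1)

    def cost(plan):                       # toy cost: sum of per-join work
        if isinstance(plan, str):
            return 0
        left, right = plan
        return cost(left) + est_size(left) * sizes[right]

    best = min(left_deep_plans(list(sizes)), key=cost)
    print(best, cost(best))               # cheapest left-deep join order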

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases
• To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
• The database must handle large-object representation.
• Similarity-based retrieval must be provided by special index structures.
• The database must provide guaranteed, steady retrieval rates for continuous-media data.
Multimedia Data Formats
• Store and transmit multimedia data in compressed form.
- JPEG and GIF are the most widely used formats for image data.
- MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape.
- Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
- Compresses 1 minute of audio-video to approximately 17 MB.
• Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
• The most important types are video and audio data.
• Characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.
Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
• SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
• Can we extend the query language to cover such queries? Yes, by adding recursion.
• Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required by SQL:1999.
15.1 Introduction to Recursive Queries
15.1.1 Datalog
• SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer."
• Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92
• Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
• For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:

  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH Big2(Part) AS
       (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
     Small2(Part) AS
       ((SELECT A2.Part FROM Assembly A2)
        EXCEPT
        (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are applied repeatedly in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(spoke).
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add "Magic" filters: modify each rule in the program by adding a "Magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the "Magic" relations: we must create new rules to define the "Magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 32: Database Technology

4. (a) With an example, explain the E-R Model in detail. (JUNE 2010)
Or
(b) (i) Explain the E-R model with an example. (8) (NOV/DEC 2010)

Entity-Relationship Model
The entity-relationship (E-R) data model perceives the real world as consisting of basic objects, called entities, and relationships among these objects.

Entity Sets
A database can be modeled as:
- a collection of entities
- relationships among entities
An entity is an object that exists and is distinguishable from other objects. Example: a specific person, company, event, plant.
An entity set is a set of entities of the same type that share the same properties. Example: the set of all persons, companies, trees, holidays.

Attributes
An entity is represented by a set of attributes, that is, descriptive properties possessed by all members of an entity set.
Examples:
customer = (customer-name, social-security, customer-street, customer-city)
account = (account-number, balance)
Domain: the set of permitted values for each attribute.
Attribute types:
- Simple and composite attributes
- Single-valued and multi-valued attributes
- Null attributes
- Derived attributes

Relationship Sets
A relationship is an association among several entities. Example:
Hayes (customer entity) - depositor (relationship set) - A-102 (account entity)

A relationship set is a mathematical relation among n >= 2 entities, each taken from entity sets:
{(e1, e2, ..., en) | e1 ∈ E1, e2 ∈ E2, ..., en ∈ En}
where (e1, e2, ..., en) is a relationship. Example: (Hayes, A-102) ∈ depositor.

An attribute can also be a property of a relationship set. For instance, the depositor relationship set between entity sets customer and account may have the attribute access-date.


Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
- Entity sets of a relationship need not be distinct.
- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities
- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line, signifying "one", or an undirected line, signifying "many", between the relationship set and the entity set.

One-To-One Relationship
- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationships
- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship
- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also, as sketched below.
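In SQL, this existence dependency is typically enforced with a cascading foreign key; a minimal sketch, assuming loan and payment tables as in the example (the column types are illustrative assumptions):

CREATE TABLE loan (
    loan_number CHAR(10) PRIMARY KEY,
    amount      NUMERIC(12,2)
);

CREATE TABLE payment (
    loan_number    CHAR(10),
    payment_number INT,
    payment_date   DATE,
    PRIMARY KEY (loan_number, payment_number),
    FOREIGN KEY (loan_number) REFERENCES loan
        ON DELETE CASCADE  -- deleting a dominant loan deletes its subordinate payments
);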

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles, and underline the discriminator of a weak entity set with a dashed line.
- payment-number: discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).

Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - a customer takes out a loan;
  - an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables

- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.
- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).
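A minimal SQL sketch of these reductions for the banking example; the column names follow the attributes listed earlier, while the data types are illustrative assumptions:

-- Strong entity set: a table with the same attributes
CREATE TABLE customer (
    social_security CHAR(11) PRIMARY KEY,
    customer_name   VARCHAR(30),
    customer_street VARCHAR(30),
    customer_city   VARCHAR(30)
);

-- Many-to-many relationship set: primary keys of both participants,
-- plus any descriptive attributes of the relationship set
CREATE TABLE borrower (
    social_security CHAR(11) REFERENCES customer,
    loan_number     CHAR(10) REFERENCES loan,
    PRIMARY KEY (social_security, loan_number)
);

The weak entity set payment (sketched earlier) already carries the key of its identifying strong entity set, loan, plus the discriminator payment_number, which is why a separate loan-payment table would be redundant.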

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized (higher-level) entity set, account. Form a table for each entity set that is generalized, including the primary key of the generalized entity set.
- Method 2: Form a table for each entity set that is generalized, with all local and inherited attributes; no table is formed for the higher-level entity set.

(b) Explain the features of Temporal and Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
- Time is considered an ordered sequence of points in some granularity. The term chronon is used instead of "point" to describe the minimum granularity.
- A calendar organizes time into different time units for convenience, and various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point Events


- Single time-point event, e.g., a bank deposit.
- A series of point events can form time-series data.

Duration Events
- Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction Time
- The time when the information from a certain transaction becomes valid (is recorded in the database).

Bitemporal Database
- A database dealing with two time dimensions: transaction time, and valid time (the period during which the fact is true in the modeled reality).

Incorporating Time in Relational Databases Using Tuple Versioning
- Add to every tuple:
  - valid start time
  - valid end time
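A minimal sketch of tuple versioning in SQL; the emp_history table, its columns, and the sentinel end date are assumptions for illustration, not from the source:

CREATE TABLE emp_history (
    emp_id      INT,
    salary      NUMERIC(10,2),
    valid_start DATE,
    valid_end   DATE,  -- e.g., DATE '9999-12-31' marks the currently valid version
    PRIMARY KEY (emp_id, valid_start)
);

-- Snapshot ("as of") query: reconstruct the valid state on a given date
SELECT emp_id, salary
FROM emp_history
WHERE DATE '2010-06-01' BETWEEN valid_start AND valid_end;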

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


- A single complex object stores all temporal changes of the object.
- Time-varying attribute: an attribute that changes over time, e.g., age.
- Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

- Point data: points in a multidimensional space, e.g., raster data such as satellite imagery, where each pixel stores a measured value; e.g., feature vectors extracted from text.
- Region data: objects have spatial extent, with location and boundary. The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries (SQL renderings of the first two are sketched below)
- Spatial range queries: "Find all cities within 50 miles of Madison." The query has an associated region (location, boundary); the answer includes overlapping or contained data regions.
- Nearest-neighbor queries: "Find the 10 cities nearest to Madison." Results must be ordered by proximity.
- Spatial join queries: "Find all cities near a lake." Expensive; the join condition involves regions and proximity.
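The first two query types above have direct renderings in spatial SQL dialects; a sketch using the PostGIS extension and a hypothetical cities(name, geom) table (one possible dialect, not standard SQL):

-- Spatial range query: cities within 50 miles (about 80467 meters) of Madison
SELECT c.name
FROM cities c, cities m
WHERE m.name = 'Madison'
  AND ST_DWithin(c.geom::geography, m.geom::geography, 80467);

-- Nearest-neighbor query: the 10 cities nearest to Madison, ordered by proximity
SELECT c.name
FROM cities c, cities m
WHERE m.name = 'Madison' AND c.name <> 'Madison'
ORDER BY ST_Distance(c.geom::geography, m.geom::geography)
LIMIT 10;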

Applications of Spatial Data
- Geographic Information Systems (GIS), e.g., ESRI's ArcInfo, OpenGIS Consortium: geospatial information; all classes of spatial queries and data are common.
- Computer-Aided Design/Manufacturing: store spatial objects, such as the surface of an airplane fuselage; range queries and spatial join queries are common.
- Multimedia Databases: images, video, text, etc., stored and retrieved by content; first converted to feature-vector form, with high dimensionality; nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
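A small SQL sketch of this linearization, assuming a hypothetical emp(age, sal) table:

CREATE INDEX age_sal_idx ON emp (age, sal);

-- Served well: the condition constrains a prefix (age) of the composite key
SELECT * FROM emp WHERE age = 12 AND sal > 10;

-- Served poorly: sal alone is not a prefix, so the linearized order does not help
SELECT * FROM emp WHERE sal = 80;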

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD):
  - Find all hotels within a radius of 5 miles from the conference venue.
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
  - Find all cities that lie on the Nile in Egypt.
  - Find all parts that touch the fuselage (in a plane design).
- Similarity queries (content-based retrieval): given a face, find the five most similar faces.
- Multidimensional range queries: 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
- An index based on spatial location is needed:
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
- The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box; the box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, ptr to child node>. The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root): we can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q: start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately? This would reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether. The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently: generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting: remove some (say, 30% of the) entries and reinsert them into the tree. This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic: minimizing box perimeters, rather than box areas, during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary: searches now take a single path to a leaf, at the cost of redundancy.

GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency-control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data

- Typically, high-dimensional datasets are collections of points, not regions, e.g., feature vectors in multimedia applications; very sparse.
- Nearest-neighbor queries are common. The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful. In any given data set, it is advisable to empirically test the contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases

Introduction
- Parallel machines are becoming quite common and affordable: prices of microprocessors, memory, and disks have dropped sharply.
- Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
- Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
- Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
- Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
- Partitioning techniques (number of disks = n):
  - Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
  - Hash partitioning: choose one or more attributes as the partitioning attributes; choose a hash function h with range 0 ... n-1; let i denote the result of h applied to the partitioning-attribute value of a tuple, and send the tuple to disk i.
  - Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, ..., vn-2]. Let v be the partitioning-attribute value of a tuple: tuples with vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A declarative rendering of these schemes is sketched after the comparison below.)

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-Robin
Advantages:

- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples, so retrieval work is well balanced between disks.
Disadvantages:
- Range queries are difficult to process: no clustering, tuples are scattered across all disks.

Hash Partitioning
Advantages:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is well balanced between disks.
- Good for point queries on the partitioning attribute: we can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
Disadvantages:
- No clustering, so range queries are difficult to answer.

Range Partitioning
Advantages:
- Provides data clustering by partitioning-attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries. Good if the result tuples come from one to a few blocks.
Disadvantages:
- If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
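As referenced above, a declarative sketch of the partitioning schemes. The PARTITION BY syntax is MySQL/Oracle-style rather than standard SQL, the account table is hypothetical, and in a parallel DBMS each partition would be placed on a different disk:

-- Range partitioning on balance with partitioning vector [5, 11]
CREATE TABLE account (
    account_number CHAR(10),
    balance        INT
)
PARTITION BY RANGE (balance) (
    PARTITION p0 VALUES LESS THAN (5),        -- v < 5:        disk 0
    PARTITION p1 VALUES LESS THAN (11),       -- 5 <= v < 11:  disk 1
    PARTITION p2 VALUES LESS THAN (MAXVALUE)  -- v >= 11:      disk 2
);

-- Hash partitioning of the same table across n = 4 disks:
-- CREATE TABLE account (...) PARTITION BY HASH (balance) PARTITIONS 4;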

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others have fewer. Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples, and all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique, based on histograms, is used in practice.

Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
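A balanced partition vector can also be computed directly in SQL with the NTILE window function (SQL:2003), assuming the account table above and n = 4 partitions; the boundary for each of the first n-1 buckets is the largest partitioning-attribute value that falls in it:

SELECT bucket, MAX(balance) AS boundary
FROM (
    SELECT balance,
           NTILE(4) OVER (ORDER BY balance) AS bucket  -- 4 equal-sized buckets
    FROM account
) AS t
WHERE bucket < 4   -- the last bucket needs no upper boundary
GROUP BY bucket
ORDER BY bucket;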


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

- More complex protocols with fewer disk reads/writes exist.
- Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page, or write it to disk, are sent to the home processor.

Intraquery Parallelism
- Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
- Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  - Interoperation parallelism: execute the different operations in a query expression in parallel.
- The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning: all tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it receives temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
- The join operation requires pairs of tuples to be tested to see whether they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case: asymmetric fragment-and-replicate. One of the relations, say r, is partitioned (any partitioning technique can be used); the other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., say s is small, and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s, and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.

Parallel Nested-Loop Join
- Assume that relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on a join attribute of relation r at each of the partitions of relation r.
- Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperation Parallelism
Pipelined parallelism:

- Consider a join of four relations, r1 join r2 join r3 join r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 join r2,
  - P2 the computation of temp2 = temp1 join r3,
  - and P3 the computation of temp2 join r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations, r1 join r2 join r3 join r4:
  - Let P1 be assigned the computation of temp1 = r1 join r2,
  - P2 the computation of temp2 = r3 join r4,
  - and P3 the computation of temp1 join temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.
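A keyword search over such a collection is commonly written with a vendor full-text extension; a sketch in MySQL-style syntax, assuming a hypothetical documents(doc_id, title, body) table:

-- Requires a full-text index on the searched column:
-- CREATE FULLTEXT INDEX doc_body_ft ON documents (body);

SELECT doc_id, title
FROM documents
WHERE MATCH(body) AGAINST('database recovery' IN NATURAL LANGUAGE MODE);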

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.

Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data, and draws a conclusion based on the user's responses.


Problem-Solving Models
- Forward-chaining: starts from a set of conditions, and moves towards some conclusion.
- Backward-chaining: starts with a list of goals, and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies, and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility, and inability to adapt to changing environments.
- Unable to recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition

- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used: KBS = knowledge base + inference engine.

KBS Architecture


- The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
- It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components found in almost any real-world system, would look like this:

Knowledge Representation Formalisms & Inference
KR                       | Inference
Logic                    | Resolution principle
Production rules         | Backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & frames   | Inheritance & advanced reasoning
Case-based reasoning     | Similarity-based

KBS Tools: Shells
- Consist of a KA tool, database, and development interface.
- Inductive shells:

  - simplest: example cases are represented as a matrix of known data (premises) and resulting effects; the matrix is converted into a decision tree or IF-THEN statements; examples are selected for the tool.
- Rule-based shells:
  - simple to complex; IF-THEN rules.
- Hybrid shells:
  - sophisticated and powerful; support multiple KR paradigms and reasoning schemes; generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems; restricted to specialised problems.
- From scratch:
  - requires more time and effort; no constraints like shells; shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical Tasks of KBS
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short-term and long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance and flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise that is otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of solutions.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create, and easier to expand.
- Improved availability, reliability, and performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages:
- More complex to implement, and more costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency-control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) that the system periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out, or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps, or increasing transaction IDs, to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object that precedes the transaction's timestamp, TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

For example, the stored versions of a database (perhaps a key-value store) could be:

Time | Object1 | Object2
t2   | "Foo"   | "Bar"
t1   | "Hello" | "Bar"

This indicates that the current committed set of the database is Object1 = "Foo", Object2 = "Bar". Previously, Object1 was "Hello"; that value has been superseded, but it is not deleted, because the database holds multiple versions (it will be deleted later).

If a long-running transaction starts a read operation, it will operate at transaction t1 and see the state as of t1. Suppose instead that the concurrent update at t2 (during that long-running read transaction) deletes Object2 and adds Object3 = "Foo-Bar"; the database will then look like:

Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     | -
t0   | "Hello" | "Bar"     | -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
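A minimal sketch of requesting such a snapshot read in SQL. In PostgreSQL, for example, the REPEATABLE READ level is implemented as snapshot isolation via MVCC; the accounts table is hypothetical:

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- Both reads see the same snapshot, taken at the first query,
-- even if concurrent writers commit new row versions in between.
SELECT SUM(balance) FROM accounts;
SELECT SUM(balance) FROM accounts;  -- same result; no read locks are taken
COMMIT;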

Recovery

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client. A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server, as in the sketch below.
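A small sketch of this idea in Python, assuming a simple in-memory queue; the UpdateProxy class and its method names are invented for illustration. The proxy accepts updates on the client's behalf while it is dozing or out of range, and flushes them when connectivity returns.

# Illustrative sketch: a proxy that caches updates for an unreachable server.
from collections import deque

class UpdateProxy:
    def __init__(self, send):
        self.send = send          # callable that delivers one update upstream
        self.pending = deque()    # updates cached while disconnected
        self.connected = False

    def submit(self, update):
        if self.connected:
            self.send(update)     # deliver immediately when reachable
        else:
            self.pending.append(update)   # cache until reconnection

    def on_reconnect(self):
        self.connected = True
        while self.pending:       # flush cached updates in FIFO order
            self.send(self.pending.popleft())

    def on_disconnect(self):
        self.connected = False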

Mobile computing poses challenges for servers as well as clients.

- The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues, as applied to mobile databases, include:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, only left-deep plans are considered.
- Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them (sketched below).
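A minimal sketch of this pull (iterator) interface in Python, using generators as operators; the scan/select/project names are illustrative, and the sample rows follow the Sailors schema given just below.

# Illustrative pull-based (iterator) operators: each `pulls' from its input.
def scan(table):
    for row in table:              # leaf operator: pulls from stored data
        yield row

def select(pred, child):
    for row in child:              # pull from input; pass qualifying tuples
        if pred(row):
            yield row

def project(cols, child):
    for row in child:
        yield tuple(row[c] for c in cols)

# Plan for: SELECT sname FROM Sailors WHERE rating > 7
sailors = [(1, "Dustin", 7, 45.0), (2, "Lubber", 8, 55.5)]
plan = project([1], select(lambda r: r[2] > 7, scan(sailors)))
print(list(plan))                  # [('Lubber',)]

Pulling on the root of the plan tree drives the whole pipeline: each operator computes tuples only on demand.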

Two main issues:
- For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
- How is the cost of a plan estimated?
Ideally: we want to find the best plan. Practically: we want to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB (as a rough check: at MPEG-1's rate of about 1.5 megabits per second, 60 seconds comes to roughly 11 MB, consistent with this figure).
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data – are PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.
The Problem with RA and SQL-92:
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.
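To make the fixpoint view concrete, here is a minimal Python sketch that evaluates the usual transitive-closure Comp program over Assembly by repeatedly applying the rules until no new tuples appear — the least fixpoint. The rules and the tiny Assembly instance are illustrative assumptions.

# Naive least-fixpoint evaluation of:
#   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
#   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

def apply_rules(comp):
    out = {(p, s) for (p, s, q) in assembly}                  # base rule
    out |= {(p, s2) for (p, s1, q) in assembly                # recursive rule
                    for (s1b, s2) in comp if s1 == s1b}
    return out | comp

comp = set()
while True:
    new = apply_rules(comp)
    if new == comp:      # fixpoint reached: applying the rules adds nothing
        break
    comp = new
print(sorted(comp))      # trike reaches wheel, spoke, tire; wheel reaches spoke, tire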

ii. Safe Datalog Programs
Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `If the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, then S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.
The rewritten recursive rule of the Comp program becomes (see the sketch below):
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
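A minimal Python sketch of seminaive evaluation for the Comp program, under the same illustrative set-of-tuples encoding as the naive version earlier: each iteration joins Assembly only with the delta produced by the previous iteration.

# Seminaive evaluation: join Assembly only with last iteration's new tuples.
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}    # base rule seeds the result
delta = set(comp)                            # delta_Comp: new tuples last round
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    derived = {(p, s2) for (p, p2, q) in assembly
                       for (p2b, s2) in delta if p2 == p2b}
    delta = derived - comp                   # keep only genuinely new tuples
    comp |= delta                            # update between iterations
print(sorted(comp))

Because every inference involves at least one tuple from the previous iteration's delta, no inference is ever repeated.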


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.



Degree of a Relationship Set
- Refers to the number of entity sets that participate in a relationship set.
- Relationship sets that involve two entity sets are binary (or degree two). Generally, most relationship sets in a database system are binary.
- Relationship sets may involve more than two entity sets. The entity sets customer, loan and branch may be linked by the ternary (degree three) relationship set CLB.

Roles
Entity sets of a relationship need not be distinct.

- The labels "manager" and "worker" are called roles; they specify how employee entities interact via the works-for relationship set.
- Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles.
- Role labels are optional, and are used to clarify the semantics of the relationship.

Design Issues

- Use of entity sets vs. attributes: the choice mainly depends on the structure of the enterprise being modeled, and on the semantics associated with the attribute in question.
- Use of entity sets vs. relationship sets: a possible guideline is to designate a relationship set to describe an action that occurs between entities.
- Binary versus n-ary relationship sets: although it is possible to replace a nonbinary (n-ary, for n > 2) relationship set by a number of distinct binary relationship sets, an n-ary relationship set shows more clearly that several entities participate in a single relationship.

Mapping Cardinalities

- Express the number of entities to which another entity can be associated via a relationship set.
- Most useful in describing binary relationship sets.
- For a binary relationship set, the mapping cardinality must be one of the following types:
  - One to one
  - One to many
  - Many to one
  - Many to many
- We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (—), signifying "many", between the relationship set and the entity set.

One-To-One Relationship

- A customer is associated with at most one loan via the relationship borrower.
- A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

- In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
- In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number: the discriminator of the payment entity set.
- Primary key for payment: (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.
- These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization
- A bottom-up design process: combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance: a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
- Completeness constraint: specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity.
  - Allows relationships between relationships.
  - Abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan.
  - An employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization: contributes to modularity in the design.
- The use of aggregation: can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
- Time is considered an ordered sequence of points in some granularity.
  - The term chronon is used instead of point to describe the minimum granularity.
- A calendar organizes time into different time units for convenience; various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point events


  - A single time point event, e.g., a bank deposit.
  - A series of point events can form time series data.
Duration events
  - Associated with a specific time period. A time period is represented by a start time and an end time.
Transaction time
  - The time when the information from a certain transaction becomes valid.
Bitemporal database
  - A database dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
- Add to every tuple (illustrated below):
  - Valid start time
  - Valid end time
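A small sketch of tuple versioning in Python, assuming rows carry valid-start and valid-end fields, with infinity marking a still-current version; the employee data is invented for illustration.

# Illustrative tuple versioning: each row carries a valid-time interval.
NOW = float("inf")   # stand-in for "still current"

# (name, salary, valid_start, valid_end) -- times as simple year numbers here
emp_history = [
    ("Smith", 25000, 2002, 2008),
    ("Smith", 30000, 2008, NOW),   # current version: open-ended end time
]

def versions_at(rows, t):
    """Return the tuple versions whose valid interval contains time t."""
    return [r for r in rows if r[2] <= t < r[3]]

print(versions_at(emp_history, 2005))   # [('Smith', 25000, 2002, 2008)]
print(versions_at(emp_history, 2010))   # [('Smith', 30000, 2008, inf)]

An update never overwrites a row: it closes the old version's valid end time and appends a new version, so the history remains queryable.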

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


- A single complex object stores all temporal changes of the object.
- Time-varying attribute: an attribute that changes over time, e.g., age.
- Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.
Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed:
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- We must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.

41

R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, pointer to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
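A direct transcription of this search into Python; the Node shape and the overlaps test on axis-aligned boxes are illustrative assumptions, not a production index.

# Illustrative R-tree overlap search. A box is (lo, hi) corner tuples.
def overlaps(a, b):
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return all(a_lo[d] <= b_hi[d] and b_lo[d] <= a_hi[d]
               for d in range(len(a_lo)))

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf
        self.entries = entries   # leaf: (box, rid); non-leaf: (box, child Node)

def search(node, q, results):
    for box, item in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                results.append(item)        # rid of a candidate object
            else:
                search(item, q, results)    # recurse into child subtree
    return results

# Example: two data rectangles under one root node
leaf = Node(True, [(((0, 0), (2, 2)), "r1"), (((5, 5), (7, 7)), "r2")])
root = Node(False, [(((0, 0), (7, 7)), leaf)])
print(search(root, ((1, 1), (3, 3)), []))   # ['r1']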

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root, and go down to the "best-fit" leaf L.
  - Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - This could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions.
  - E.g., feature vectors in multimedia applications.
  - Very sparse.
- Nearest-neighbor queries are common.
  - The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
  - In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
- Parallel machines are becoming quite common and affordable: prices of microprocessors, memory and disks have dropped sharply.
- Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.
- Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data;
  - processing time-consuming decision-support queries;
  - providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:

  - data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
- Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
- Horizontal partitioning: tuples of a relation are divided among many disks, such that each tuple resides on one disk.
- Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning: choose one or more attributes as the partitioning attributes;


choose a hash function h with range 0 … n – 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send that tuple to disk i.
Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
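The three schemes fit in a few lines of Python; the hash function and the vector values are illustrative assumptions.

# Illustrative partitioning functions mapping a tuple to a disk number 0..n-1.
n = 3                      # number of disks

def round_robin(i):        # i = insertion order of the tuple
    return i % n

def hash_partition(key):   # key = partitioning attribute value
    return hash(key) % n   # any hash function with range 0..n-1 works

vector = [5, 11]           # range-partitioning vector [v0, v1]
def range_partition(v):
    for i, boundary in enumerate(vector):
        if v < boundary:
            return i
    return len(vector)     # v >= last boundary: the last disk

print(range_partition(2), range_partition(8), range_partition(20))  # 0 1 2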

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 <= r.A < 25.

Round-robin
Advantages:
- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process.
- No clustering: tuples are scattered across all disks.

Hash partitioning
- Good for sequential access:

  - Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks.
  - Retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - the remaining disks are available for other queries;
  - good if the result tuples are from one to a few blocks;
  - if many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted – an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (see the sketch below).
- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
- A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
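A small Python sketch of the sorted-scan construction (the histogram and sampling refinements are omitted; the function name and data are illustrative):

# Build a balanced range-partitioning vector from sorted key values.
def partition_vector(keys, n):
    """Pick n-1 cut points so each of the n ranges gets ~len(keys)/n tuples."""
    keys = sorted(keys)
    step = len(keys) // n
    # after every 1/n-th of the relation, record the next key as a boundary
    return [keys[i * step] for i in range(1, n)]

keys = [3, 9, 1, 14, 7, 21, 2, 11, 6, 18, 4, 12]
print(partition_vector(keys, 3))   # [6, 12]: three ranges of ~4 keys each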


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained — reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
- Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
- Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism: parallelize the execution of each individual operation in the query.
  - Interoperation parallelism: execute the different operations in a query expression in parallel.
- The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n – 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

  - all tuples that lie in the i-th range are sent to processor Pi;
  - Pi stores the tuples it receives temporarily on disk Di;
  - this step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj (see the toy sketch below).

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
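A toy Python sketch of range-partitioning sort, with lists standing in for processors and disks; real systems overlap the redistribution, local sorts, and I/O, which this sketch does not attempt.

# Toy range-partitioning sort: partition, sort each partition, concatenate.
vector = [10, 20]                      # 2 boundaries -> 3 "processors"

def disk_for(v):
    for i, b in enumerate(vector):
        if v < b:
            return i
    return len(vector)

relation = [17, 3, 25, 9, 21, 14, 6, 30]
partitions = [[] for _ in range(len(vector) + 1)]
for v in relation:                     # redistribution step
    partitions[disk_for(v)].append(v)

for p in partitions:                   # each processor sorts locally (in parallel)
    p.sort()

result = [v for p in partitions for v in p]   # trivial final concatenation
print(result)                          # [3, 6, 9, 14, 17, 21, 25, 30]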

Parallel Join
- The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case – asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

  - E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join (see the sketch below).
- Hash-join optimizations can be applied to the parallel case; e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
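A compact Python sketch of the partitioned hash-join idea, with lists standing in for processors, Python's built-in hash as h1, and an ordinary build/probe hash join as the local join; all names and data are illustrative.

# Toy partitioned parallel hash-join on the join attribute (first field).
n = 2
r = [(1, "r1"), (2, "r2"), (3, "r3"), (4, "r4")]
s = [(1, "s1"), (3, "s3")]

def h1(key):                            # routes a tuple to a processor
    return hash(key) % n

r_parts = [[t for t in r if h1(t[0]) == i] for i in range(n)]
s_parts = [[t for t in s if h1(t[0]) == i] for i in range(n)]

result = []
for i in range(n):                      # each processor joins its partitions
    build = {}
    for key, val in s_parts[i]:         # build phase on the smaller input s
        build.setdefault(key, []).append(val)
    for key, val in r_parts[i]:         # probe phase with r
        for sval in build.get(key, []):
            result.append((key, val, sval))
print(sorted(result))                   # [(1, 'r1', 's1'), (3, 'r3', 's3')]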

Parallel Nested-Loop Join
- Assume that:
  - relation s is much smaller than relation r, and that r is stored by partitioning;
  - there is an index on a join attribute of relation r at each of the partitions of relation r.
- Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules Knowledge Bases and Image Databases (JUNE 2010)Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way They are often used in artificial intelligence applications and researchRule-based systems are specialized software that encapsulates ldquoHuman Intelligencerdquo like knowledge there by make intelligent decisions quickly and in repeatable form Also known as Rule Based Systems Expert Systems amp Artificial IntelligenceRule based systems are

- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.
- Heuristic knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

– A list of rules, or rule base, which is a specific type of knowledge base
– An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

– Temporary working memory
– A user interface or other connection to the outside world through which input and output signals are received and sent

Components of a Rule-Based System

– Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data
– Knowledge Engineer – decides how to represent the expert's knowledge and how to build the inference engine appropriately for the domain
– Interpreter – interprets the inputted data and draws a conclusion based on the user's responses


Problem-Solving Models
– Forward-chaining – starts from a set of conditions and moves towards some conclusion.
– Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.
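A minimal forward-chaining sketch in Python (illustrative only: the rule format and the medical facts are invented for this example, and conflict resolution is simply "first new match"):

# Minimal forward-chaining sketch. Each rule is (premises, conclusion);
# the loop is a match-resolve-act cycle over a working memory of facts.
def forward_chain(rules, facts):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:            # Match
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)                 # Act (trivial conflict resolution)
                changed = True
    return facts

rules = [({"has_fever", "has_rash"}, "suspect_measles"),   # hypothetical rules
         ({"suspect_measles"}, "refer_to_specialist")]
print(forward_chain(rules, {"has_fever", "has_rash"}))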

Advantages
– Provide consistent answers for repetitive decisions, processes and tasks
– Hold and maintain significant levels of information
– Reduce employee training costs
– Centralize the decision-making process
– Create efficiencies and reduce the time needed to solve problems
– Combine multiple human expert intelligences
– Reduce the amount of human errors
– Give strategic and comparative advantages, creating entry barriers to competitors
– Review transactions that human experts may overlook

Disadvantages
– Lack human common sense needed in some decision making
– Will not be able to give the creative responses that human experts can give in unusual circumstances
– Domain experts cannot always clearly explain their logic and reasoning
– Challenges of automating complex processes
– Lack of flexibility and ability to adapt to changing environments
– Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-Based Systems: Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

– Heuristic rather than algorithmic
– Heuristics in search vs. in KBS: general vs. domain-specific
– Highly specific domain knowledge
– Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
– the reasoning mechanism needs to be as stable as possible;
– the knowledge base must be able to grow and change as knowledge is added;
– this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
– Logic: Resolution principle
– Production rules: backward chaining (top-down, goal-directed), forward chaining (bottom-up, data-driven)
– Semantic nets & Frames: Inheritance & advanced reasoning
– Case-based Reasoning: Similarity-based

KBS tools – Shells
– Consist of KA Tool, Database & Development Interface
– Inductive shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
– Rule-based shells
  - simple to complex
  - IF-THEN rules
– Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
– Special-purpose shells
  - specifically designed for particular types of problems
  - restricted to specialised problems
– Scratch (building from scratch)
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration)

Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure.
(2) Interpretation – To provide an understanding of a situation from available information, e.g. DENDRAL.
(3) Prediction – To predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT.
(4) Design – To develop configurations that satisfy constraints of a design problem, e.g. XCON.
(5) Planning – Both short term & long term, in areas like project management, product development or financial planning, e.g. HRM.
(6) Monitoring – To check performance & flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – To collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment.
(8) Instruction – To train students and correct their performance, e.g. give medical students experience diagnosing illness.
(9) Debugging – To identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors.

Advantages
– Increase availability of expert knowledge / expertise not accessible / training future experts
– Efficient and cost effective
– Consistency of answers
– Explanation of solution
– Deal with uncertainty

Limitations
– Lack of common sense
– Inflexible, difficult to modify
– Restricted domain of expertise
– Lack of learning ability
– Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
– mimics organisational structure with data
– local access and autonomy without exclusion
– cheaper to create and easier to expand
– improved availability/reliability/performance by removing reliance on a central site
– reduced communication overhead: most data access is local, less expensive, and performs better
– improved processing power: many machines handling the database rather than a single server

Disadvantages:
– more complex to implement
– more costly to maintain
– security and integrity control standards and experience are lacking
– design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID. In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti). If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp. Every object also has a read timestamp: if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
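A minimal sketch of these timestamp rules in Python (illustrative, not any particular DBMS's implementation; the version layout and abort policy follow the description above):

# Minimal MVCC sketch: versions carry write/read timestamps; reads return the
# latest version preceding the reader's timestamp; a write aborts if a later
# transaction has already read the version it would supersede.
import bisect

class MVCCObject:
    def __init__(self):
        self.versions = []   # sorted list of (write_ts, value, read_ts)

    def read(self, ts):
        i = bisect.bisect_right([w for w, _, _ in self.versions], ts) - 1
        if i < 0:
            return None
        w, val, r = self.versions[i]
        self.versions[i] = (w, val, max(r, ts))   # bump the read timestamp
        return val

    def write(self, ts, value):
        i = bisect.bisect_right([w for w, _, _ in self.versions], ts) - 1
        if i >= 0 and self.versions[i][2] > ts:   # TS(Ti) < RTS(P): abort
            raise Exception("abort: a later transaction already read this version")
        self.versions.insert(i + 1, (ts, value, ts))

obj = MVCCObject()
obj.write(1, "Foo"); obj.write(2, "Hello")
print(obj.read(1))   # 'Foo'  -- old snapshot still visible
print(obj.read(3))   # 'Hello'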

At t1, the state of the DB could be:

Time    Object1    Object2
t0      "Foo"      "Bar"
t1      "Hello"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time    Object1    Object2      Object3
t2      "Hello"    (deleted)    "Foo-Bar"
t1      "Hello"    "Bar"
t0      "Foo"      "Bar"


Now there is a new version as of transaction ID t2. Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.

Recovery


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
– Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
– Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
– There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
– Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
– In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications – The wireless media have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

Client/Network Relationships – Mobile units can move freely in a geographic mobility domain, an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units. Data management issues, as applied to mobile databases:

– Data distribution and replication
– Transaction models
– Query processing
– Recovery and fault tolerance


– Mobile database design
– Location-based services
– Division of labor
– Security

Application: Intermittently Synchronized Databases
– Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

– A client connects to the server when it wants to exchange updates.
– The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.

– A server cannot connect to a client at will.
The characteristics of ISDBs (contd.):

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

Two parts to optimizing a query:
– Consider a set of alternative plans
  - Must prune the search space; typically left-deep plans only
– Must estimate the cost of each plan that is considered
  - Must estimate the size of the result and the cost for each plan node
  - Key issues: statistics, indexes, operator implementations

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


– Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.

Two main issues:
– For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan
– How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname added for variations.
Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
– All available access methods for each relation in the FROM clause
– All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods)
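As a rough illustration of the plan-space search (a sketch only: the cost function below is a toy stand-in, not System R's actual cost model; the relation sizes come from the example schema):

# Enumerate left-deep join orders and pick the cheapest under a toy cost model.
from itertools import permutations

sizes = {"Sailors": 500, "Reserves": 1000}   # pages, from the example schema

def plan_cost(order, selectivity=0.01):
    # Toy model: each join costs left_size * right_size page accesses;
    # the estimated output shrinks by the (assumed) selectivity.
    cost, left = 0.0, sizes[order[0]]
    for rel in order[1:]:
        cost += left * sizes[rel]
        left = left * sizes[rel] * selectivity   # estimated result size
    return cost

# All permutations = all left-deep orders (inner relation is a base table).
best = min(permutations(sizes.keys()), key=plan_cost)
print(best, plan_cost(best))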

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)

Multimedia databases

– To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
– The database must handle large object representation.
– Similarity-based retrieval must be provided by special index structures.
– Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
– Store and transmit multimedia data in compressed form:
  - JPEG and GIF: the most widely used formats for image data
  - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression
– MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB
– MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB
– Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data
– Most important types are video and audio data.
– Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video
  - Data must be delivered at a rate that does not cause overflow of system buffers
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio

Video Servers
– Video-on-demand systems deliver video from central video servers, across a network, to terminals
  - must guarantee end-to-end delivery rates
– Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements
– Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data
– Head-end terminals – used to view multimedia data
  - PCs or TVs attached to a small, inexpensive computer called a set-top box

Similarity-Based Retrieval
Examples of similarity-based retrieval:
– Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user
  - e.g. identify similar designs for registering a new trademark
– Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking
  - e.g. test user input against stored commands
– Handwritten data: Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
– Are we running low on any parts needed to build a ZX600 sports car?
– What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
– Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
– New: The answer table can appear in the From clause, i.e. be defined recursively.
– Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
– this takes us one level down the Assembly hierarchy;
– to find components that are one level deeper (e.g. rim), we need another join;
– to find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
– The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
– In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
– If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
– Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
– Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e. D is the set of all sets of integers), e.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
– If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
– But we have Small(trike) if Rule 2 is applied first.
– There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
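A minimal Python sketch of least-fixpoint evaluation for the (negation-free) Comp program; the toy Assembly instance is invented for this example:

# Naive least-fixpoint evaluation of Comp(Part, Subpt).
# Rules: Comp(P, S) :- Assembly(P, S, Q).
#        Comp(P, S) :- Assembly(P, P2, Q), Comp(P2, S).
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

def comp_fixpoint(assembly):
    comp = set()
    while True:
        new = {(p, s) for (p, s, q) in assembly}
        new |= {(p, s) for (p, p2, q) in assembly for (p2b, s) in comp if p2 == p2b}
        if new == comp:        # fixpoint reached: applying the rules adds nothing
            return comp
        comp = new

print(sorted(comp_fixpoint(assembly)))
# [('trike', 'spoke'), ('trike', 'tire'), ('trike', 'wheel'), ('wheel', 'spoke'), ('wheel', 'tire')]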

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

– T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
– Stratified program: if T depends on not S, then S cannot depend on T (or not T).
– If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: All database tables.
  - Stratum I: Tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS ((SELECT A2.Part FROM Assembly A2) EXCEPT (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
– The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
– In order to apply such a rule, we must have all of the Assembly relation available.
– Stratification with respect to use of < … > is the usual restriction to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
– Repeated inferences: When recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
– Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
– Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
– There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
– Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
– But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
– Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
– Add `Magic' Filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
– Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
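A companion sketch of the seminaive strategy from 15.4.1 (same toy Assembly data as the earlier fixpoint sketch; each iteration joins only against delta_Comp, the tuples generated in the previous iteration):

# Seminaive fixpoint evaluation of Comp (illustrative).
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

def comp_seminaive(assembly):
    comp = {(p, s) for (p, s, q) in assembly}        # base rule, iteration 0
    delta = set(comp)                                # delta_Comp: tuples new last round
    while delta:
        # The recursive rule uses delta_Comp, so no inference is repeated.
        new = {(p, s) for (p, p2, q) in assembly
                      for (p2b, s) in delta if p2 == p2b} - comp
        comp |= new
        delta = new
    return comp

print(sorted(comp_seminaive(assembly)))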



– Many to many
We distinguish among these types by drawing either a directed line (→), signifying "one", or an undirected line (—), signifying "many", between the relationship set and the entity set.

One-To-One Relationship

– A customer is associated with at most one loan via the relationship borrower.
– A loan is associated with at most one customer via borrower.

One-To-Many and Many-To-One Relationship

– In the one-to-many relationship (a), a loan is associated with at most one customer via borrower; a customer is associated with several (including 0) loans via borrower.
– In the many-to-one relationship (b), a loan is associated with several (including 0) customers via borrower; a customer is associated with at most one loan via borrower.

Many-To-Many Relationship


– A customer is associated with several (possibly 0) loans via borrower.
– A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
– If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan)
  - x is a subordinate entity (in the example below, payment)
– If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
– Rectangles represent entity sets.
– Ellipses represent attributes.
– Diamonds represent relationship sets.
– Lines link attributes to entity sets, and entity sets to relationship sets.
– Double ellipses represent multivalued attributes.
– Dashed ellipses denote derived attributes.
– Primary key attributes are underlined.

Weak Entity Sets
– An entity set that does not have a primary key is referred to as a weak entity set.
– The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
– The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
– The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
– We depict a weak entity set by double rectangles.
– We underline the discriminator of a weak entity set with a dashed line.
– payment-number – discriminator of the payment entity set.
– Primary key for payment – (loan-number, payment-number).

Specialization
– Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


– These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
– Depicted by a triangle component labeled ISA (i.e. savings-account "is an" account).

Generalization
– A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
– Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
– Attribute Inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
– Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined
  - user-defined
– Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint
  - overlapping
– Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total
  - partial

Aggregation
– Relationship sets borrower and loan-officer represent the same information.
– Eliminate this redundancy via aggregation:
  - Treat the relationship as an abstract entity
  - Allows relationships between relationships
  - Abstraction of a relationship into a new entity
– Without introducing redundancy, the following diagram represents that:
  - A customer takes out a loan
  - An employee may be a loan officer for a customer-loan pair

E-R Design Decisions
– The use of an attribute or entity set to represent an object.
– Whether a real-world concept is best expressed by an entity set or a relationship set.
– The use of a ternary relationship versus a pair of binary relationships.
– The use of a strong or weak entity set.
– The use of generalization – contributes to modularity in the design.
– The use of aggregation – can treat the aggregate entity set as a single unit without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
– Primary keys allow entity sets and relationship sets to be expressed uniformly as tables which represent the contents of the database.


– A database which conforms to an E-R diagram can be represented by a collection of tables.
– For each entity set and relationship set there is a unique table, which is assigned the name of the corresponding entity set or relationship set.
– Each table has a number of columns (generally corresponding to attributes), which have unique names.
– Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
– A strong entity set reduces to a table with the same attributes.
– A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
– A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
– The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e. the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
– Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
– Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)

Temporal Database
Time Representation, Calendars and Time Dimensions

– Time is considered an ordered sequence of points in some granularity.
  - Use the term chronon instead of point to describe minimum granularity.
– A calendar organizes time into different time units for convenience.
  - Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events


– Single time point event, e.g. bank deposit.
– A series of point events can form time series data.

Duration events
– Associated with a specific time period. A time period is represented by start time and end time.

Transaction time
– The time when the information from a certain transaction becomes valid.

Bitemporal database
– Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
– Valid start time
– Valid end time
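A minimal sketch of tuple versioning (illustrative: the table, the sentinel value and the years are invented for this example):

# Valid-time (tuple versioning) sketch: each row carries [valid_start, valid_end).
rows = [
    ("Smith", "salary", 30000, 2005, 2008),   # valid from 2005 up to (not incl.) 2008
    ("Smith", "salary", 35000, 2008, 9999),   # 9999 = "until changed" sentinel
]

def value_at(rows, name, attr, t):
    for n, a, v, vst, vet in rows:
        if n == name and a == attr and vst <= t < vet:
            return v

print(value_at(rows, "Smith", "salary", 2006))   # 30000
print(value_at(rows, "Smith", "salary", 2010))   # 35000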

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


– A single complex object stores all temporal changes of the object.
– Time-varying attribute
  - An attribute that changes over time, e.g. age.
– Non-time-varying attribute
  - An attribute that does not change over time, e.g. date of birth.

Spatial Database
Types of Spatial Data

– Point Data
  - Points in a multidimensional space
  - E.g. raster data such as satellite imagery, where each pixel stores a measured value
  - E.g. feature vectors extracted from text
– Region Data
  - Objects have spatial extent with location and boundary
  - DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data

Types of Spatial Queries
– Spatial Range Queries
  - Find all cities within 50 miles of Madison
  - Query has an associated region (location, boundary)
  - Answer includes overlapping or contained data regions
– Nearest-Neighbor Queries
  - Find the 10 cities nearest to Madison
  - Results must be ordered by proximity
– Spatial Join Queries
  - Find all cities near a lake
  - Expensive; the join condition involves regions and proximity

Applications of Spatial Data
– Geographic Information Systems (GIS)
  - E.g. ESRI's ArcInfo; OpenGIS Consortium
  - Geospatial information
  - All classes of spatial queries and data are common
– Computer-Aided Design/Manufacturing
  - Store spatial objects such as the surface of an airplane fuselage
  - Range queries and spatial join queries are common
– Multimedia Databases
  - Images, video, text, etc. stored and retrieved by content
  - First converted to feature vector form; high dimensionality
  - Nearest-neighbor queries are the most common

Single-Dimensional Indexes
– B+ trees are fundamentally single-dimensional indexes.


– When we create a composite search key B+ tree, e.g. an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
– A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
– Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
– Spatial queries (GIS, CAD)
  - Find all hotels within a radius of 5 miles from the conference venue
  - Find the city with population 500,000 or more that is nearest to Kalamazoo, MI
  - Find all cities that lie on the Nile in Egypt
  - Find all parts that touch the fuselage (in a plane design)
– Similarity queries (content-based retrieval)
  - Given a face, find the five most similar faces
– Multidimensional range queries
  - 50 < age < 55 AND 80K < sal < 90K

Drawbacks
– An index based on spatial location is needed.
  - One-dimensional indexes don't support multidimensional searching efficiently.
  - Hash indexes only support point queries; we want to support range queries as well.
  - Must support inserts and deletes gracefully.
– Ideally, want to support non-point data as well (e.g. lines, shapes).
– The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
– Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
– Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
– All leaves are at the same distance from the root.
– Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
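A recursive Python sketch of this search (illustrative; the node layout and box representation are assumptions, not a real R-tree implementation):

# Node entries are (box, child_node) for non-leaf nodes or (box, rid) for leaves.
# A box is a (lo_point, hi_point) pair of tuples.
class Node:
    def __init__(self, entries, is_leaf):
        self.entries = entries
        self.is_leaf = is_leaf

def overlaps(a, b):
    # Boxes overlap iff their intervals overlap in every dimension.
    return all(a[0][d] <= b[1][d] and b[0][d] <= a[1][d] for d in range(len(a[0])))

def search(node, q, results):
    for box, payload in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                results.append(payload)      # rid of a candidate object
            else:
                search(payload, q, results)  # descend into the child subtree
    return results

leaf = Node([(((0, 0), (2, 2)), "rid1"), (((5, 5), (6, 6)), "rid2")], True)
root = Node([(((0, 0), (6, 6)), leaf)], False)
print(search(root, ((1, 1), (3, 3)), []))    # ['rid1']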

Improving Search Using Constraints
– It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
– But why not use convex polygons to approximate query regions more accurately?
  - Will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
– Start at the root and go down to the "best-fit" leaf L.
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
– If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
– The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
– The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
– Idea: Redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
– The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - Remove some (say, 30% of the) entries and reinsert them into the tree.
  - Could result in all reinserted entries fitting on some existing pages, avoiding a split.
– R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
– Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
  - Searches now take a single path to a leaf, at the cost of redundancy.

GiST
– The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data


– Typically, high-dimensional datasets are collections of points, not regions.
  - E.g. feature vectors in multimedia applications.
  - Very sparse.
– Nearest-neighbor queries are common.
  - The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
– As dimensionality increases, contrast (the ratio of distances between nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
  - In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)

Parallel Databases
Introduction
– Parallel machines are becoming quite common and affordable:

  - Prices of microprocessors, memory and disks have dropped sharply.
– Databases are growing increasingly large:
  - Large volumes of transaction data are collected and stored for later analysis;
  - multimedia objects like images are increasingly stored in databases.
– Large-scale parallel database systems are increasingly used for:
  - storing large volumes of data;
  - processing time-consuming decision-support queries;
  - providing high throughput for transaction processing.

Parallelism in Databases
– Data can be partitioned across multiple disks for parallel I/O.
– Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
– Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
– Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
– Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
– Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
– Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
– Partitioning techniques (number of disks = n):

Round-robin: Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning: Choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 … n – 1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Partitioning techniques (cont.):
Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, …, vn-2] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1. Tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n – 1.
E.g. with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g. 10 <= r.A < 25.

Round-robin
Advantages:

– Best suited for sequential scan of the entire relation on each query.
– All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Range queries are difficult to process:
– No clustering – tuples are scattered across all disks.

Hash partitioning
Good for sequential access:

– Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.
– Retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
– Can look up a single disk, leaving others available for answering other queries.
– An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so difficult to answer range queries.

Range partitioning
– Provides data clustering by partitioning attribute value.
– Good for sequential access.
– Good for point queries on the partitioning attribute: only one disk needs to be accessed.
– For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - Remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted – an example of execution skew.
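A sketch of the three partitioning schemes above (illustrative; the partitioning vector [5, 11] is the example used earlier):

# Disk assignment under round-robin, hash and range partitioning (n disks).
import bisect

def round_robin(i, n):
    return i % n                       # i-th tuple inserted goes to disk i mod n

def hash_partition(value, n):
    return hash(value) % n             # hash of the partitioning attribute, mod n

def range_partition(value, vector):
    # vector = [v0, ..., v(n-2)]: v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1,
    # v >= v(n-2) -> disk n-1.
    return bisect.bisect_right(vector, value)

print(range_partition(2, [5, 11]))     # 0
print(range_partition(8, [5, 11]))     # 1
print(range_partition(20, [5, 11]))    # 2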

Partitioning a Relation across Disks
– If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
– Large relations are preferably partitioned across all the available disks.
– If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
– Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
– Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
– Sort the relation on the partitioning attribute.
– Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. (n denotes the number of partitions to be constructed.)
– Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
– A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
– Assume uniform distribution within each range of the histogram.
– A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
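A sketch of the "every 1/n-th tuple" method above for building a balanced range-partition vector (data invented for the example; assumes the relation has at least n tuples):

def balanced_vector(sorted_values, n):
    # Take the value at every 1/n-th position as a cut point: n-1 cut points.
    step = len(sorted_values) // n
    return [sorted_values[i * step] for i in range(1, n)]

values = sorted([3, 8, 1, 14, 9, 5, 21, 2, 11, 7, 4, 17])
print(balanced_vector(values, 3))   # [5, 11] -> three roughly equal partitions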


Interquery Parallelism
– Queries/transactions execute in parallel with one another.
– Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
– Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
– More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache-coherency has to be maintained — reads and writes of data in buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
– Before reading/writing to a page, the page must be locked in shared/exclusive mode.
– On locking a page, the page must be read from disk.
– Before unlocking a page, the page must be written to disk if it was modified.

– More complex protocols with fewer disk reads/writes exist.
– Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
– Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
– Two complementary forms of intraquery parallelism:
  - Intraoperation Parallelism – parallelize the execution of each individual operation in the query.
  - Interoperation Parallelism – execute the different operations in a query expression in parallel.
– The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
– Choose processors P0, …, Pm, where m <= n – 1, to do the sorting.
– Create a range-partition vector with m entries, on the sorting attributes.
– Redistribute the relation using range partitioning:
  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
– Each processor Pi sorts its partition of the relation locally.


– Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
– The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
– Assume the relation has already been partitioned among disks D0, …, Dn-1.
– Each processor Pi locally sorts the data on disk Di.
– The sorted runs on each processor are then merged to get the final sorted output.
– Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, …, Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, …, Pm-1 are concatenated to get the final result.
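A sequential simulation of range-partitioning sort (a sketch: real systems run the local sorts on separate processors; here the "processors" are just lists):

import bisect

def range_partition_sort(tuples, vector):
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in tuples:                          # redistribution step
        partitions[bisect.bisect_right(vector, t)].append(t)
    for p in partitions:                      # each "processor" sorts locally
        p.sort()
    out = []                                  # final merge is mere concatenation
    for p in partitions:
        out += p
    return out

print(range_partition_sort([9, 2, 14, 5, 21, 7], [5, 11]))
# [2, 5, 7, 9, 14, 21]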

Parallel Join
– The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
– Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
– In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
– For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
– Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, …, rn-1 and s0, s1, …, sn-1.
– Can use either range partitioning or hash partitioning: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
– Partitions ri and si are sent to processor Pi.
– Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case – asymmetric fragment-and-replicate:
  - one of the relations, say r, is partitioned (any partitioning technique can be used);
  - the other relation, s, is replicated across all the processors;
  - processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used.
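A small Python sketch of asymmetric fragment-and-replicate, assuming in-memory lists for r and s and an arbitrary join predicate theta; the round-robin fragmentation and all names are illustrative:

def asymmetric_fragment_and_replicate(r, s, theta, n):
    fragments = [r[i::n] for i in range(n)]   # partition r (any technique works)
    result = []
    for ri in fragments:                      # s is replicated: every Pi sees all of s
        for tr in ri:
            for ts in s:
                if theta(tr, ts):             # works for any join condition
                    result.append((tr, ts))
    return result

# A non-equijoin r.A > s.B, which a partitioned join cannot handle:
print(asymmetric_fragment_and_replicate([5, 9], [3, 7], lambda a, b: a > b, 2))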

For example, if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
- Hash-join optimizations can be applied to the parallel case; e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and thus avoid the cost of writing them out and reading them back in.
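A single-process Python sketch of the partitioned parallel hash join just described, assuming relations are lists of tuples joined on their first column; h1 is simulated with Python's built-in hash, and the local build/probe uses a dict (standing in for h2); all names are illustrative:

def parallel_hash_join(r, s, n, key=lambda t: t[0]):
    h1 = lambda k: hash(k) % n                  # h1 redistributes tuples across n processors
    s_parts = [[] for _ in range(n)]            # s is the smaller (build) relation
    r_parts = [[] for _ in range(n)]
    for t in s:
        s_parts[h1(key(t))].append(t)
    for t in r:
        r_parts[h1(key(t))].append(t)
    result = []
    for i in range(n):                          # each Pi computes a local hash join
        build = {}                              # build phase (dict plays the role of h2)
        for t in s_parts[i]:
            build.setdefault(key(t), []).append(t)
        for t in r_parts[i]:                    # probe phase
            for m in build.get(key(t), []):
                result.append((t, m))
    return result

print(parallel_hash_join([(1, "a"), (2, "b")], [(1, "x")], n=4))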

Parallel Nested-Loop Join
- Assume that:
  - relation s is much smaller than relation r, and r is stored by partitioning;
  - there is an index on a join attribute of relation r at each of the partitions of relation r.
- Use asymmetric fragment-and-replicate, with relation s being replicated and with the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.

- Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

- Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.
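Pipelined parallelism can be mimicked with Python generators: each stage yields result tuples as soon as they are computed, so the next stage can start consuming before the previous one finishes. The relations and the join condition below are purely illustrative:

def pipelined_join(outer, inner, pred):
    for o in outer:                 # 'outer' may itself be a lazy pipeline stage
        for i in inner:             # 'inner' is a materialized base relation here
            if pred(o, i):
                yield o + i         # emit each result tuple immediately

r1, r2, r3, r4 = [(1,), (2,)], [(1,)], [(1,)], [(1,)]
eq = lambda o, i: o[-1] == i[0]     # toy join condition

temp1 = pipelined_join(r1, r2, eq)              # P1
temp2 = pipelined_join(temp1, r3, eq)           # P2 consumes P1's stream
print(list(pipelined_join(temp2, r4, eq)))      # P3; prints [(1, 1, 1, 1)]

Independent parallelism would instead evaluate r1 ⋈ r2 and r3 ⋈ r4 as two separate streams and join them in a third stage.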

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering – building an expert system.
Knowledge Engineers – the people who build the system.
Knowledge Representation – the symbols used to represent the knowledge.
Factual Knowledge – knowledge of a particular task domain that is widely shared.


Heuristic Knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, consisting of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
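A toy Python sketch of the match-resolve-act cycle, assuming working memory is a set of facts, each rule is a (condition, action) pair, and conflict resolution simply picks the first firable instantiation; the diagnosis rules are hypothetical:

def run_production_system(rules, wm, max_cycles=50):
    for _ in range(max_cycles):
        # Match: test every left-hand side against working memory
        conflict_set = [(cond, act) for cond, act in rules
                        if cond(wm) and not act(wm) <= wm]
        if not conflict_set:        # no satisfied productions: the interpreter halts
            break
        cond, act = conflict_set[0] # Conflict resolution: naive "first rule wins"
        wm |= act(wm)               # Act: may change working memory, then re-match
    return wm

rules = [
    (lambda wm: {"fever", "rash"} <= wm, lambda wm: {"suspect_measles"}),
    (lambda wm: "suspect_measles" in wm, lambda wm: {"refer_specialist"}),
]
print(run_production_system(rules, {"fever", "rash"}))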

Components of a Rule-Based System
- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data.
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the input data and draws a conclusion based on the user's responses.


Problem-Solving Models
- Forward chaining – starts from a set of conditions and moves towards some conclusion.
- Backward chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference:
- Logic – resolution principle.
- Production rules – backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
- Semantic nets & frames – inheritance & advanced reasoning.
- Case-based reasoning – similarity-based.

KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated-teller-machine network and ways to correct the errors.

Advantages
- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimics organisational structure with data.
- Local access and autonomy without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages relative to conventional (centralized) databases:
- More complex to implement.
- More costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency-control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object that precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object-write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp: if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), then transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
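A minimal Python sketch of these timestamp rules, assuming an in-memory list of versions per object; it implements the read rule and the read-timestamp check on writes, and all names are illustrative:

class MVCCStore:
    def __init__(self):
        self.versions = {}    # object -> list of (write_ts, value)
        self.read_ts = {}     # object -> largest timestamp that has read it

    def read(self, obj, ts):
        # Return the most recent version whose write timestamp precedes ts
        candidates = [v for v in self.versions.get(obj, []) if v[0] <= ts]
        if not candidates:
            return None
        self.read_ts[obj] = max(self.read_ts.get(obj, 0), ts)
        return max(candidates)[1]

    def write(self, obj, ts, value):
        if ts < self.read_ts.get(obj, 0):
            return False      # TS(Ti) < RTS(P): Ti must be aborted and restarted
        self.versions.setdefault(obj, []).append((ts, value))
        return True

store = MVCCStore()
store.write("Object1", 0, "Foo"); store.write("Object1", 1, "Hello")
print(store.read("Object1", 1))   # 'Hello' -- snapshot as of t1
print(store.read("Object1", 0))   # 'Foo'   -- an older reader keeps its snapshot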

At t1, the state of the database could be:

  Time  Object1  Object2
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


  Time  Object1  Object2    Object3
  t2    "Hello"  (deleted)  "Foo-Bar"
  t1    "Hello"  "Bar"
  t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig. 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - communication latency;
  - intermittent connectivity;
  - limited battery life;
  - changing client location.
- The server may not be able to reach a client:


  - A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data-management challenges:

  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data-management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units and additional query- and transaction-management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components. Data-management responsibility is shared among base stations or fixed hosts and mobile units.
- Data-management issues as applied to mobile databases:
  - data distribution and replication;
  - transaction models;
  - query processing;
  - recovery and fault tolerance;


  - mobile database design;
  - location-based services;
  - division of labor;
  - security.

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  1. Consider a set of alternative plans.
     - Must prune the search space; typically, left-deep plans only.
  2. Estimate the cost of each plan that is considered.
     - Must estimate the size of the result and the cost for each plan node.
     - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational-algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
- Two main issues:
  1. For a given query, what plans are considered?
     - An algorithm to search the plan space for the cheapest (estimated) plan.
  2. How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans.
- We will study the System R approach.

Schema for Examples
  Sailors (sid: integer, sname: string, rating: integer, age: real)
  Reserves (sid: integer, bid: integer, day: dates, rname: string)

- Similar to the old schema; rname is added for variation.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
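A toy Python enumeration of left-deep plans with a stub cost model, in the spirit of the System R search; the page counts come from the schema above, but the cost and result-size formulas are placeholders, not the real optimizer's:

from itertools import permutations

PAGES = {"Sailors": 500, "Reserves": 1000}

def join_cost(outer_pages, inner_pages):
    # Placeholder: page-oriented nested-loops cost, outer + outer * inner
    return outer_pages + outer_pages * inner_pages

def best_left_deep_plan(relations):
    best = None
    for order in permutations(relations):   # each permutation = one left-deep tree
        cost, pages = 0, PAGES[order[0]]
        for rel in order[1:]:
            cost += join_cost(pages, PAGES[rel])
            pages = max(1, pages // 10)     # crude intermediate-result estimate
        if best is None or cost < best[0]:
            best = (cost, order)
    return best

print(best_left_deep_plan(["Sailors", "Reserves"]))  # picks Sailors as the outer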

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases

- To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- Guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data;
  - the MPEG standards for video data exploit commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses one minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained – video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different, as represented in the database, may be considered the same by a user; e.g., identify similar designs when registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking; e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries?
  - Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries, as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor:
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
  - this takes us one level down the Assembly hierarchy;
  - to find components that are one level deeper (e.g., rim), we need another join;
  - to find all components, we need as many joins as there are levels in the given instance.
- For any relational-algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational-algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational-algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


- Consider the following program:

  ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

- According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part.
- Database systems disallow unsafe programs by requiring that every variable in the head of a rule also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
- Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing.
- Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics.
- The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive Queries with Negation

  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
- If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe.
- If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences.
- A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, then S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
  Selection:      Result(Y) :- R(X, Y), X = c.
  Projection:     Result(Y) :- R(X, Y).
  Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
  Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
  Union:          Result(X, Y) :- R(X, Y).
                  Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

  WITH
    Big2(Part) AS
      (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
    Small2(Part) AS
      ((SELECT A2.Part FROM Assembly A2)
       EXCEPT
       (SELECT B1.Part FROM Big2 B1))
  SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

  SELECT A.Part, SUM(A.Qty)
  FROM Assembly A
  GROUP BY A.Part

  NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).

- The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of <...> is the usual restriction used to deal with this problem; it is similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, then computing the fixpoint of the Comp program and selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For the Comp program, the rewritten recursive rule is:

  Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
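A compact Python sketch of seminaive evaluation for the Comp program, assuming Assembly is a set of (part, subpart, qty) triples; delta holds only the tuples generated in the previous iteration, so no inference is repeated:

def seminaive_comp(assembly):
    comp = {(p, s) for (p, s, q) in assembly}        # base rule
    delta = set(comp)
    while delta:
        new = {(p, s2) for (p, p2, q) in assembly    # recursive rule, joined
                       for (q2, s2) in delta if p2 == q2}  # only with delta_Comp
        delta = new - comp                           # keep only genuinely new tuples
        comp |= delta
    return comp

asm = {("trike", "wheel", 3), ("wheel", "spoke", 36), ("wheel", "tire", 1)}
print(sorted(seminaive_comp(asm)))
# [('trike', 'spoke'), ('trike', 'tire'), ('trike', 'wheel'),
#  ('wheel', 'spoke'), ('wheel', 'tire')]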


15.4.2 Pushing Selections to Avoid Irrelevant Inferences

  SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:

  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  Magic_SL(spoke).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- The Magic Sets program-rewriting algorithm can be summarized as follows:
  - Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
  - Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 35: Database Technology

- A customer is associated with several (possibly 0) loans via borrower.
- A loan is associated with several (possibly 0) customers via borrower.

Existence Dependencies
- If the existence of entity x depends on the existence of entity y, then x is said to be existence dependent on y.
  - y is a dominant entity (in the example below, loan).
  - x is a subordinate entity (in the example below, payment).
- If a loan entity is deleted, then all its associated payment entities must be deleted also.

E-R Diagram Components
- Rectangles represent entity sets.
- Ellipses represent attributes.
- Diamonds represent relationship sets.
- Lines link attributes to entity sets, and entity sets to relationship sets.
- Double ellipses represent multivalued attributes.
- Dashed ellipses denote derived attributes.
- Primary-key attributes are underlined.

Weak Entity Sets
- An entity set that does not have a primary key is referred to as a weak entity set.
- The existence of a weak entity set depends on the existence of a strong entity set; it must relate to the strong set via a one-to-many relationship set.
- The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of the weak entity set.
- The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
- We depict a weak entity set by double rectangles.
- We underline the discriminator of a weak entity set with a dashed line.
- payment-number – discriminator of the payment entity set.
- Primary key for payment – (loan-number, payment-number).

Specialization
- Top-down design process: we designate subgroupings within an entity set that are distinctive from other entities in the set.


- These subgroupings become lower-level entity sets, which have attributes, or participate in relationships, that do not apply to the higher-level entity set.
- Depicted by a triangle component labeled ISA (i.e., savings-account "is an" account).

Generalization

- A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
- Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
- Attribute inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
- Constraint on which entities can be members of a given lower-level entity set:
  - condition-defined;
  - user-defined.
- Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  - disjoint;
  - overlapping.
- Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  - total;
  - partial.

Aggregation
- Relationship sets borrower and loan-officer represent the same information.
- Eliminate this redundancy via aggregation:
  - treat the relationship as an abstract entity;
  - allows relationships between relationships;
  - abstraction of a relationship into a new entity.
- Without introducing redundancy, the following diagram represents that:
  - a customer takes out a loan;
  - an employee may be a loan officer for a customer-loan pair.

E-R Design Decisions
- The use of an attribute or an entity set to represent an object.
- Whether a real-world concept is best expressed by an entity set or a relationship set.
- The use of a ternary relationship versus a pair of binary relationships.
- The use of a strong or weak entity set.
- The use of generalization – contributes to modularity in the design.
- The use of aggregation – can treat the aggregate entity set as a single unit, without concern for the details of its internal structure.

Reduction of an E-R Schema to Tables
- Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


- A database which conforms to an E-R diagram can be represented by a collection of tables.
- For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
- Each table has a number of columns (generally corresponding to attributes), which have unique names.
- Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
- A strong entity set reduces to a table with the same attributes.
- A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables
- A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
- The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
- Method 1: form a table for the higher-level (generalized) entity set, account; form a table for each lower-level entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: form a table for each lower-level entity set that is generalized, including both its own attributes and those of the higher-level entity set.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions
- Time is considered an ordered sequence of points at some granularity.
  - The term chronon is used, instead of "point", to describe the minimum granularity.
- A calendar organizes time into different time units for convenience.
  - Various calendars are accommodated: Gregorian (western), Chinese, Islamic, etc.

Point Events


- Single time-point events, e.g., a bank deposit.
- A series of point events can form time-series data.

Duration Events
- Associated with a specific time period; a time period is represented by a start time and an end time.

Transaction Time
- The time when the information from a certain transaction becomes valid.

Bitemporal Database
- Databases dealing with two time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
- Add to every tuple:
  - a valid start time;
  - a valid end time.
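A small Python sketch of a tuple-versioned update, assuming each row is (key, value, valid_start, valid_end) and valid_end None marks the current version; all names are illustrative:

def temporal_update(rows, key, new_value, now):
    for i, (k, v, start, end) in enumerate(rows):
        if k == key and end is None:
            rows[i] = (k, v, start, now)       # close the current version
            break
    rows.append((key, new_value, now, None))   # open a new current version

emp = [("e1", "clerk", 2001, None)]
temporal_update(emp, "e1", "manager", 2005)
print(emp)   # both versions are kept, so the history remains queryable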

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


- A single complex object stores all temporal changes of the object.

Time-Varying Attribute
- An attribute that changes over time, e.g., age.

Non-Time-Varying Attribute
- An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data
- Point data:
  - points in a multidimensional space;
  - e.g., raster data such as satellite imagery, where each pixel stores a measured value;
  - e.g., feature vectors extracted from text.
- Region data:
  - objects have spatial extent, with location and boundary;
  - the DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
- Spatial range queries:
  - "Find all cities within 50 miles of Madison."
  - The query has an associated region (location, boundary).
  - The answer includes overlapping or contained data regions.
- Nearest-neighbor queries:
  - "Find the 10 cities nearest to Madison."
  - Results must be ordered by proximity.
- Spatial join queries:
  - "Find all cities near a lake."
  - Expensive: the join condition involves regions and proximity.

Applications of Spatial Data
- Geographic Information Systems (GIS):
  - e.g., ESRI's ArcInfo; OpenGIS Consortium;
  - geospatial information;
  - all classes of spatial queries and data are common.
- Computer-Aided Design/Manufacturing:
  - store spatial objects, such as the surface of an airplane fuselage;
  - range queries and spatial join queries are common.
- Multimedia Databases:
  - images, video, text, etc., stored and retrieved by content;
  - first converted to feature-vector form; high dimensionality;
  - nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite-search-key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-Dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
- Spatial queries (GIS, CAD):
  - "Find all hotels within a radius of 5 miles from the conference venue."
  - "Find the city with population 500,000 or more that is nearest to Kalamazoo, MI."
  - "Find all cities that lie on the Nile in Egypt."
  - "Find all parts that touch the fuselage (in a plane design)."
- Similarity queries (content-based retrieval):
  - "Given a face, find the five most similar faces."
- Multidimensional range queries:
  - 50 < age < 55 AND 80K < sal < 90K.

Drawbacks
- An index based on spatial location is needed:
  - one-dimensional indexes don't support multidimensional searching efficiently;
  - hash indexes only support point queries; we want to support range queries as well;
  - must support inserts and deletes gracefully.
- Ideally, we want to support non-point data as well (e.g., lines, shapes).
- The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>.
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, ptr to child node>.
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
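A minimal Python sketch of this search, assuming boxes are (xmin, ymin, xmax, ymax) tuples and a toy node structure; all names are illustrative:

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

class Node:
    def __init__(self, is_leaf, entries):
        self.is_leaf = is_leaf    # leaf entries: (box, rid); non-leaf: (box, child)
        self.entries = entries

def search(node, q, hits):
    for box, item in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(item)       # rid of an object that might overlap Q
            else:
                search(item, q, hits)   # descend only into overlapping subtrees
    return hits

leaf = Node(True, [((0, 0, 1, 1), "rid1"), ((5, 5, 6, 6), "rid2")])
root = Node(False, [((0, 0, 6, 6), leaf)])
print(search(root, (0, 0, 2, 2), []))   # ['rid1']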

Improving Search Using Constraints
- It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
- But why not use convex polygons to approximate query regions more accurately?
  - Doing so will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
  - The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L:
  - go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - adjust the entry for L in its parent so that the box now covers (only) L1;
  - add an entry (in the parent node of L) for L2 (this could cause the parent node to recursively split).

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
- The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
  - remove some (say, 30% of the) entries and reinsert them into the tree;
  - this could result in all reinserted entries fitting on some existing pages, avoiding a split.
- R* trees also use a different heuristic: minimizing box perimeters, rather than box areas, during insertion.
- Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
  - searches now take a single path to a leaf, at the cost of redundancy.

GiST
- The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
  - Striking similarities in insert/delete/search, and even concurrency-control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
  - B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
  - GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


- Typically, high-dimensional datasets are collections of points, not regions:
  - e.g., feature vectors in multimedia applications;
  - very sparse.
- Nearest-neighbor queries are common:
  - the R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is no longer meaningful:
  - in any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)Parallel DatabasesIntroductionParallel machines are becoming quite common and affordable

Prices of microprocessors, memory, and disks have dropped sharply. Databases are growing increasingly large:

Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel.

Data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra).

This makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning: choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 … n − 1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn−2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v ≥ vn−2 go to disk n−1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 ≤ r.A < 25.
Round-robin
Advantages:

• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering – tuples are scattered across all disks.
Hash partitioning
Advantages:
• Good for sequential access:

  − Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.
  − Retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  − Can look up a single disk, leaving the others available for answering other queries.
  − An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
Disadvantages:
• No clustering, so it is difficult to answer range queries.

Range partitioning
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  − Remaining disks are available for other queries.
  − Good if result tuples are from one to a few blocks.


  − If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted – an example of execution skew.
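A minimal Python sketch of the three partitioning schemes described above (the function names are my own, for illustration):

from bisect import bisect_right

def round_robin(i, n):
    # Send the i-th inserted tuple to disk i mod n
    return i % n

def hash_partition(value, n):
    # Hash the partitioning attribute value into 0..n-1
    return hash(value) % n

def range_partition(value, vector):
    # vector = [v0, ..., v(n-2)]; values below v0 go to disk 0, etc.
    return bisect_right(vector, value)

# With vector [5, 11]: value 2 -> disk 0, value 8 -> disk 1, value 20 -> disk 2
print(range_partition(2, [5, 11]), range_partition(8, [5, 11]), range_partition(20, [5, 11]))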

Partitioning a Relation across Disks
• If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows:

• After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.

• n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

• Assume uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
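A minimal sketch of constructing a balanced partition vector from a sorted relation, assuming the partitioning attribute forms a key:

def balanced_partition_vector(sorted_keys, n):
    # Pick n-1 cut points so that each of the n partitions gets ~1/n of the tuples
    step = len(sorted_keys) // n
    return [sorted_keys[i * step] for i in range(1, n)]

keys = sorted([3, 8, 1, 15, 9, 4, 22, 7, 11, 2, 18, 5])
print(balanced_partition_vector(keys, 3))  # [5, 11] for this data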


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:

  − Locking and logging must be coordinated by passing messages between processors.
  − Data in a local buffer may have been updated at another processor.
  − Cache-coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
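A minimal sketch of this protocol; the lock manager, disk, and buffer objects are hypothetical stand-ins for illustration:

def read_page(pid, locks, disk, buffer):
    locks.acquire(pid, mode="shared")   # lock before reading
    buffer[pid] = disk.read(pid)        # on locking, the page must be read from disk
    value = buffer[pid]
    locks.release(pid)                  # not modified, so no write-back needed
    return value

def write_page(pid, new_value, locks, disk, buffer):
    locks.acquire(pid, mode="exclusive")
    buffer[pid] = disk.read(pid)        # fetch the latest version on locking
    buffer[pid] = new_value             # modify the page in the local buffer
    disk.write(pid, buffer[pid])        # modified: write to disk before unlocking
    locks.release(pid)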

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
• Intraoperation parallelism – parallelize the execution of each individual operation in the query.
• Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries, on the sorting attributes.
• Redistribute the relation using range partitioning:

  − All tuples that lie in the i-th range are sent to processor Pi.
  − Pi stores the tuples it received temporarily on disk Di.
  − This step requires I/O and communication overhead.

• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
• The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn-1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:

  − The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  − Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  − The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
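A minimal single-process sketch of range-partitioning sort; in a real system, each partition would be sorted by a separate processor on its own disk:

from bisect import bisect_right

def range_partition_sort(tuples, vector):
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in tuples:                      # redistribution step (I/O + communication)
        partitions[bisect_right(vector, t)].append(t)
    for p in partitions:                  # each Pi sorts its partition locally
        p.sort()
    result = []                           # final merge is trivial: concatenate in order
    for p in partitions:
        result.extend(p)
    return result

print(range_partition_sort([9, 1, 14, 6, 3, 22, 11], [5, 11]))  # [1, 3, 6, 9, 11, 14, 22]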

Parallel Join
• The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

• For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.

• Let r and s be the input relations, and suppose we want to compute the join r ⋈ s on the condition r.A = s.B.
• r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
• Can use either range partitioning or hash partitioning.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.

• Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case – asymmetric fragment-and-replicate:

• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.

• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used.

• E.g., say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

• Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
• Hash-join optimizations can be applied to the parallel case:

  − E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
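A minimal single-process sketch of the partitioned parallel hash join on r.A = s.B; h1 models the redistribution hash function, and each per-partition Python dict plays the role of the local hash table built with h2:

def h1(v, n):
    return hash(v) % n

def partitioned_hash_join(r, s, n):
    # r and s are lists of dicts; join condition is r["A"] == s["B"]
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for t in s:                       # redistribute the build relation s using h1
        s_parts[h1(t["B"], n)].append(t)
    for t in r:                       # then redistribute the larger relation r using h1
        r_parts[h1(t["A"], n)].append(t)
    out = []
    for i in range(n):                # each "processor" Pi joins its partitions locally
        build = {}
        for t in s_parts[i]:          # build phase on si
            build.setdefault(t["B"], []).append(t)
        for t in r_parts[i]:          # probe phase on ri
            for m in build.get(t["A"], []):
                out.append({**t, **m})
    return out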

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and using the existing partitioning of relation r:
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  − Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  − P2 the computation of temp2 = temp1 ⋈ r3,
  − and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  − Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  − P2 the computation of temp2 = r3 ⋈ r4,
  − and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  − Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• How to parallelize each operation, and how many processors to use for it.
• What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:

• E.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:

– Knowledge-based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some subject-specific knowledge of one or more human experts
– Made up of a set of rules that analyze user-supplied information about a specific class of problems
– Systems that utilize reasoning capabilities and draw conclusions

• Knowledge Engineering – building an expert system.
• Knowledge Engineers – the people who build the system.
• Knowledge Representation – the symbols used to represent the knowledge.
• Factual Knowledge – knowledge of a particular task domain that is widely shared.


• Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems

• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  − Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  − Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  − Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
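A minimal Python sketch of the match-resolve-act cycle just described; the (name, condition, action) rule format and the set-based working memory are assumptions made for the example:

def run(rules, wm, max_cycles=100):
    for _ in range(max_cycles):
        # Match: collect instantiations of all satisfied productions
        conflict_set = [r for r in rules if r[1](wm)]
        # Conflict-resolution: halt if none are satisfied; here, simply pick the first
        if not conflict_set:
            break
        _name, _cond, action = conflict_set[0]
        # Act: the action may change working memory; then return to the match phase
        action(wm)
    return wm

rules = [("dim->lights",
          lambda wm: "dim" in wm and "lights_on" not in wm,
          lambda wm: wm.add("lights_on"))]
print(run(rules, {"dim"}))  # {'dim', 'lights_on'}

Components of a Rule-Based System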

• Set of Rules – derived from the knowledge base, and used by the interpreter to evaluate the input data.
• Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter – interprets the input data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward-chaining – starts from a set of conditions and moves towards some conclusion.
• Backward-chaining – starts with a list of goals and then works backwards, to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

• Heuristic, rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
• Logic – Resolution principle
• Production rules – backward (top-down, goal-directed) and forward (bottom-up, data-driven)
• Semantic nets & Frames – Inheritance & advanced reasoning
• Case-based Reasoning – Similarity based
KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive Shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool

- Rule-based shells
  - simple to complex
  - IF-THEN rules

- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range

- Special-purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems
- Scratch

  - require more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs
• DENDRAL (chemical)
• MYCIN (medicine)
• XCON/R1 (computer)
Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions.

E.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information. E.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations. E.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem. E.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development, or financial planning. E.g., HRM.
(6) Monitoring – to check performance & flag exceptions. E.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence.

E.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance. E.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions.

E.g., identify errors in an automated teller machine network, and ways to correct the errors.
Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost-effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
• Mimics the organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
• More complex to implement; more costly to maintain.
• Security and integrity control is harder; standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out, or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
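A minimal sketch of MVCC-style snapshot reads over the example above; the version-list representation is an assumption made for illustration:

versions = {
    "Object1": [(0, "Foo"), (1, "Hello")],   # (write timestamp, value) pairs
    "Object2": [(1, "Bar")],
}

def read(obj, ts):
    # A reader at timestamp ts sees the most recent version written at or before ts
    visible = [v for wts, v in versions.get(obj, []) if wts <= ts]
    return visible[-1] if visible else None

def write(obj, value, ts):
    # Writes append a new version; old versions stay visible to older readers
    versions.setdefault(obj, []).append((ts, value))

write("Object3", "Foo-Bar", 2)     # the concurrent update at t2
print(read("Object1", 1))          # the long-running reader at t1 still sees 'Hello'
print(read("Object3", 1))          # the t2 write is invisible at t1: None

Recovery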


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
  − The limited and intermittent connectivity afforded by wireless communications.
  − The limited life of the power supply (battery).


  − The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  − Fixed hosts are general-purpose computers configured to manage mobile units.
  − Base stations function as gateways to the fixed network for the mobile units.

• Wireless Communications –
  − The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  − Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  − Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


seamless roaming throughout a geographical region.
  − Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  − Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

• Client/Network Relationships –
  − Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  − To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  − Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location
The server may not be able to reach a client:


• A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
• In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  − The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  − One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  − Servers must keep track of client locations, in order to efficiently route messages to them.
  − Client data should be stored in the network location that minimizes the traffic necessary to access it.
  − The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features, to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues, as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance


• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases
• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
• Consider a set of alternative plans.
  − Must prune the search space; typically, left-deep plans only.
• Must estimate the cost of each plan that is considered.
  − Must estimate the size of the result and the cost for each plan node.
  − Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
• For a given query, what plans are considered?
  − An algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname added for variations.
Reserves:
• Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
• Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
• All available access methods, for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
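A minimal sketch of left-deep plan enumeration by dynamic programming, in the spirit of System R; the cost numbers and the join_cost function are illustrative assumptions, not a real cost model:

from itertools import combinations

def best_left_deep_plan(rels, scan_cost, join_cost):
    # best maps a set of relations to (cost, plan) for the cheapest left-deep tree
    best = {frozenset([r]): (scan_cost[r], r) for r in rels}
    for k in range(2, len(rels) + 1):
        for subset in combinations(rels, k):
            s = frozenset(subset)
            candidates = []
            for inner in s:                 # try each relation as the innermost join
                c, p = best[s - {inner}]
                candidates.append((c + join_cost(s - {inner}, inner), (p, inner)))
            best[s] = min(candidates, key=lambda cand: cand[0])
    return best[frozenset(rels)]

scan = {"Sailors": 500, "Reserves": 1000, "Boats": 100}
print(best_left_deep_plan(list(scan), scan, lambda left, inner: 10 * len(left)))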

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia databases
• To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
• The database must handle large-object representation.
• Similarity-based retrieval must be provided by special index structures.
• Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
• JPEG and GIF are the most widely used formats for image data.
• MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
  − MPEG-1: quality comparable to VHS video tape. Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
  − MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality. Compresses 1 minute of audio-video to approximately 17 MB.
• Several alternatives for audio encoding:


  − MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
• The most important types are video and audio data.
• Characterized by high data volumes and real-time information-delivery requirements:
  − Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  − Data must be delivered at a rate that does not cause overflow of system buffers.
  − Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  − Must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals – used to view multimedia data.
  − PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  − E.g., identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  − E.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92: Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  − E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  − The set of all integers is a fixpoint of double+.
  − The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics, and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended, in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


  − Stratum 0: all database tables.
  − Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  − If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
• Selection: Result(Y) :- R(X, Y), X = c.
• Projection: Result(Y) :- R(X, Y).
• Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
• Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
• Union: Result(X, Y) :- R(X, Y).
         Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction to deal with this problem; similar to negation.
15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  − For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  − Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
• Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
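A minimal sketch of seminaive fixpoint evaluation for the Comp program, on a small hand-made Assembly instance:

assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}        # base case: direct subparts
delta = set(comp)                                # tuples generated last iteration
while delta:
    # Apply the rewritten rule, joining Assembly with delta_Comp only, so
    # every new inference uses at least one fact from the last iteration
    new = {(p, sub) for (p, mid, q) in assembly
                    for (mid2, sub) in delta if mid == mid2} - comp
    comp |= new
    delta = new
print(sorted(comp))  # includes ('trike', 'spoke') and ('trike', 'tire')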


Page 36: Database Technology

• These subgroupings become lower-level entity sets that have attributes, or participate in relationships, that do not apply to the higher-level entity set.
• Depicted by a triangle component labeled ISA (e.g., savings-account "is an" account).
Generalization
• A bottom-up design process – combine a number of entity sets that share the same features into a higher-level entity set.
• Specialization and generalization are simple inversions of each other; they are represented in an E-R diagram in the same way.
• Attribute Inheritance – a lower-level entity set inherits all the attributes and relationship participation of the higher-level entity set to which it is linked.

Design Constraints on a Generalization
• Constraint on which entities can be members of a given lower-level entity set:
  − condition-defined
  − user-defined
• Constraint on whether or not entities may belong to more than one lower-level entity set within a single generalization:
  − disjoint
  − overlapping
• Completeness constraint – specifies whether or not an entity in the higher-level entity set must belong to at least one of the lower-level entity sets within a generalization:
  − total
  − partial

Aggregation
• Relationship sets borrower and loan-officer represent the same information.
• Eliminate this redundancy via aggregation:
  − Treat the relationship as an abstract entity.
  − Allows relationships between relationships.
  − Abstraction of a relationship into a new entity.
• Without introducing redundancy, the following diagram represents that:
  − A customer takes out a loan.
  − An employee may be a loan officer for a customer-loan pair.
E-R Design Decisions
• The use of an attribute or entity set to represent an object.
• Whether a real-world concept is best expressed by an entity set or a relationship set.
• The use of a ternary relationship versus a pair of binary relationships.
• The use of a strong or weak entity set.
• The use of generalization – contributes to modularity in the design.
• The use of aggregation – can treat the aggregate entity set as a single unit, without

concern for the details of its internal structure.
Reduction of an E-R Schema to Tables

• Primary keys allow entity sets and relationship sets to be expressed uniformly as tables, which represent the contents of the database.


• A database which conforms to an E-R diagram can be represented by a collection of tables.
• For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.
• Each table has a number of columns (generally corresponding to attributes), which have unique names.
• Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
• A strong entity set reduces to a table with the same attributes.
• A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.
Representing Relationship Sets as Tables
• A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.
• The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant: the payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
• Method 1: Form a table for the generalized entity account. Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
• Method 2: Form a table for each entity set that is generalized.

(b) Explain the features of Temporal & Spatial Databases in detail (JUNE 2010)
Or
(a) Give features of Temporal and Spatial Databases (DECEMBER 2010)
Temporal Database

Time Representation, Calendars, and Time Dimensions

• Time is considered an ordered sequence of points in some granularity.
• Use the term chronon instead of point to describe the minimum granularity.

• A calendar organizes time into different time units for convenience.
  − Accommodates various calendars:

Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.
• A series of point events can form time series data.
Duration events

• Associated with a specific time period. A time period is represented by its start time and end time.

Transaction time
• The time when the information from a certain transaction becomes valid.

Bitemporal databasebull Databases dealing with two time dimensions

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
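A minimal code sketch of tuple versioning (the attribute names and year values are invented for illustration, not taken from the answer above): each version of a tuple carries a valid start and end time, and a point-in-time query returns the version whose interval covers the requested time.

# Minimal tuple-versioning sketch: each version carries [valid_start, valid_end).
# A valid_end of None marks the current version. All names are illustrative.

def value_at(versions, t):
    """Return the attribute value valid at time t, or None if no version covers t."""
    for value, valid_start, valid_end in versions:
        if valid_start <= t and (valid_end is None or t < valid_end):
            return value
    return None

# Salary history of one employee, stored as versioned tuples.
salary_versions = [
    (30000, 2005, 2010),   # valid from 2005 up to (not including) 2010
    (42000, 2010, None),   # current version, still valid
]

print(value_at(salary_versions, 2007))  # -> 30000
print(value_at(salary_versions, 2012))  # -> 42000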

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.

Time-varying attribute
- An attribute that changes over time, e.g., age.

Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.

Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.

Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.


- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
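A tiny sketch of this linearization, using the entries from the example (plain Python, no index structure): sorting on the composite key orders first by age, then by sal, so proximity in sal alone is not preserved.

# Composite-key ordering linearizes the 2-D (age, sal) space:
# entries are sorted by age first, then by sal within equal ages.
entries = [(13, 75), (11, 80), (12, 20), (12, 10)]
print(sorted(entries))
# -> [(11, 80), (12, 10), (12, 20), (13, 75)]
# (11, 80) and (13, 75) are close in sal but far apart in the linear order,
# so a query on sal alone cannot exploit this ordering.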

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.

Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root).
  - Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
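A sketch of this overlap search in code (the node layout – dicts with "leaf" and "entries" fields – is invented for illustration; a real R-tree stores these as disk pages):

# Sketch of R-tree overlap search. A box is ((xmin, ymin), (xmax, ymax));
# nodes hold (box, child-or-rid) entries.

def overlaps(a, b):
    """True if boxes a and b intersect in every dimension."""
    (alo, ahi), (blo, bhi) = a, b
    return all(alo[d] <= bhi[d] and blo[d] <= ahi[d] for d in range(len(alo)))

def search(node, q, results):
    """Collect rids of leaf entries whose boxes overlap query box q."""
    for box, payload in node["entries"]:
        if overlaps(box, q):
            if node["leaf"]:
                results.append(payload)          # rid of a candidate object
            else:
                search(payload, q, results)      # payload is a child node
    return results

leaf = {"leaf": True, "entries": [(((0, 0), (2, 2)), "rid1"),
                                  (((5, 5), (6, 6)), "rid2")]}
root = {"leaf": False, "entries": [(((0, 0), (6, 6)), leaf)]}
print(search(root, ((1, 1), (3, 3)), []))  # -> ['rid1']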

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
- The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2.
- The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries.
- Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency-control algorithms make it possible to provide "templates" for these algorithms, which can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest-neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test contrast.

5. (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin
- Send the i-th tuple inserted in the relation to disk i mod n.

Hash partitioning
- Choose one or more attributes as the partitioning attributes.


- Choose a hash function h with range 0 ... n-1.
- Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple. Send the tuple to disk i.

Range partitioning
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn-2] is chosen.
- Let v be the partitioning-attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; tuples with v >= vn-2 go to disk n-1.
- E.g., with partitioning vector [5, 11], a tuple with partitioning-attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
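The three techniques can be sketched in a few lines of code (a simulation only: disk numbers are returned rather than performing I/O). The range function reproduces the [5, 11] example above.

# Sketch of the three I/O-partitioning techniques (n = number of disks).
import bisect

def round_robin(i, n):
    """Disk for the i-th tuple inserted into the relation."""
    return i % n

def hash_partition(value, n):
    """Disk chosen by a hash function h with range 0..n-1."""
    return hash(value) % n

def range_partition(value, vector):
    """Disk chosen by a partitioning vector [v0, ..., v(n-2)]."""
    return bisect.bisect_right(vector, value)

# Running example from the notes: vector [5, 11] over 3 disks.
for v in (2, 8, 20):
    print(v, "-> disk", range_partition(v, [5, 11]))
# 2 -> disk 0, 8 -> disk 1, 20 -> disk 2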

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 <= r.A < 25.

Round-robin
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process.
- No clustering – tuples are scattered across all disks.

Hash partitioning
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning-attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed.
  - The remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted – an example of execution skew.

Partitioning a Relation across Disks
- If a relation contains only a few tuples that will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
- An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion.
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
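A sketch of the sorted-scan construction of a balanced partition vector (toy data; a real system would sample or use a histogram rather than sort the full relation):

# Sketch: build a balanced range-partitioning vector by scanning the sorted
# relation and recording a cut value after every 1/n-th of the tuples.

def balanced_partition_vector(values, n):
    """Return n-1 cut points that split `values` into n near-equal partitions."""
    ordered = sorted(values)           # relation sorted on the partitioning attribute
    step = len(ordered) // n
    return [ordered[i * step] for i in range(1, n)]

keys = [1, 3, 3, 4, 7, 8, 12, 15, 19, 23, 24, 30]
print(balanced_partition_vector(keys, 3))  # -> [7, 19]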


Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:

  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
- Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
- Two complementary forms of intraquery parallelism:
  - Intraoperation parallelism – parallelize the execution of each individual operation in the query.
  - Interoperation parallelism – execute the different operations in a query expression in parallel.
- The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n-1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:

  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it receives temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
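A sequential simulation of the range-partitioning sort (a sketch only: each list plays the role of one processor's partition, and the final step is simple concatenation, because range partitioning already orders the partitions relative to one another):

# Sequential simulation of a range-partitioning sort.
import bisect

def range_partitioning_sort(tuples, vector):
    partitions = [[] for _ in range(len(vector) + 1)]
    for t in tuples:                       # redistribution step
        partitions[bisect.bisect_right(vector, t)].append(t)
    for p in partitions:                   # each "processor" sorts locally
        p.sort()
    return [t for p in partitions for t in p]   # trivial final merge

print(range_partitioning_sort([8, 2, 20, 5, 13, 1], [5, 11]))
# -> [1, 2, 5, 8, 13, 20]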

Parallel Join
- The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used; r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si; any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case – asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join-attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si, to produce a partition of the final result of the hash join.
- Hash-join optimizations can be applied to the parallel case, e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
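A sequential sketch of the partitioned parallel hash join (lists simulate the n processors; the tuple format is invented: each tuple is a (join-key, payload) pair, and Python's dict plays the role of the local hash table built with h2):

# Sequential sketch of a partitioned parallel hash join on r.A = s.B.
from collections import defaultdict

def partitioned_hash_join(r, s, n):
    h1 = lambda key: hash(key) % n
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for a, rest in r:                 # redistribute r on its join attribute
        r_parts[h1(a)].append((a, rest))
    for b, rest in s:                 # redistribute s (the build relation)
        s_parts[h1(b)].append((b, rest))
    out = []
    for i in range(n):                # each "processor" joins locally
        build = defaultdict(list)
        for b, rest in s_parts[i]:    # build phase on the local s-partition
            build[b].append(rest)
        for a, rest in r_parts[i]:    # probe phase with the local r-partition
            for match in build.get(a, []):
                out.append((a, rest, match))
    return out

r = [(1, "r1"), (2, "r2"), (2, "r2b")]
s = [(2, "s1"), (3, "s2")]
print(partitioned_hash_join(r, s, 2))   # joins on key 2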

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r:
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored on Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
  - The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering – building an expert system.
- Knowledge engineers – the people who build the system.
- Knowledge representation – the symbols used to represent the knowledge.
- Factual knowledge – knowledge of a particular task domain that is widely shared.


- Heuristic knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle (a small sketch of this cycle appears after the component list below):
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
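A minimal forward-chaining sketch of the match-resolve-act cycle (the rules and facts are invented; conflict resolution here simply picks the first satisfied production):

# Minimal forward-chaining sketch of the match-resolve-act cycle.
# Rules are (premises, conclusion) pairs over a working memory of facts.

rules = [
    ({"fever", "rash"}, "measles_suspected"),
    ({"measles_suspected"}, "refer_to_doctor"),
]
working_memory = {"fever", "rash"}

while True:
    # Match: find productions whose premises are all in working memory.
    conflict_set = [c for premises, c in rules
                    if premises <= working_memory and c not in working_memory]
    if not conflict_set:        # no productions satisfied: interpreter halts
        break
    # Conflict resolution: pick one instantiation (here, simply the first).
    chosen = conflict_set[0]
    # Act: execute the action, which changes working memory.
    working_memory.add(chosen)

print(working_memory)
# -> {'fever', 'rash', 'measles_suspected', 'refer_to_doctor'}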

Components of a Rule-Based System
- Set of rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward chaining – starts from a set of conditions and moves towards some conclusion.
- Backward chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic (heuristics in search vs. in KBS; general vs. domain-specific).
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
  KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR – Inference:
- Logic – resolution principle.
- Production rules – backward (top-down, goal-directed)

  or forward (bottom-up, data-driven).
- Semantic nets & frames – inheritance & advanced reasoning.
- Case-based reasoning – similarity-based.

KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - matrix converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tool, applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)

Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions,

e.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages
- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6. (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimics the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional databases:
- More complex to implement and more costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7. (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency-control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out, or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object that precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Foo"   | "Bar"     |

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
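A sketch of MVCC snapshot reads over the example's version chains (transaction IDs t0-t2 are represented as integers; the helper is illustrative, not a real DBMS API):

# Sketch of MVCC snapshot reads: each version of an object carries the
# transaction ID that wrote it; a reader at snapshot `ts` sees the most
# recent version with write_ts <= ts.

def snapshot_read(versions, name, ts):
    """Latest value of `name` visible at snapshot `ts` (None = deleted/absent)."""
    visible = [(wts, val) for wts, val in versions.get(name, []) if wts <= ts]
    return max(visible)[1] if visible else None

# Version chains for the running example, as (write_ts, value) pairs.
db = {
    "Object1": [(0, "Foo"), (1, "Hello")],
    "Object2": [(0, "Bar"), (2, None)],      # deleted as of t2
    "Object3": [(2, "Foo-Bar")],
}

print(snapshot_read(db, "Object2", 1))   # long-running reader at t1 -> 'Bar'
print(snapshot_read(db, "Object2", 2))   # reader at t2 -> None (deleted)
print(snapshot_read(db, "Object3", 1))   # not yet visible at t1 -> None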


Recovery
(The recovery discussion appears as diagrams in the original document.)

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig. 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers, configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


  and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.

The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data-management challenges:
  - Servers must keep track of client locations, in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data-management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query- and transaction-management features, to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data-management responsibility is shared among base stations or fixed hosts and mobile units.

Data-management issues as applied to mobile databases:
- data distribution and replication;
- transaction models;
- query processing;
- recovery and fault tolerance;


- mobile database design;
- location-based services;
- division of labor;
- security.

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, left-deep plans only.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational-algebra operations, with a choice of algorithm for each operation.


- Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered?
    - An algorithm to search the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained – video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - Must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - E.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15. Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: these concepts are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational-algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational-algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive-query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational-algebra semantics for nonrecursive queries.

(i) Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

(ii) Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range-restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

(iii) The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

(iv) Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics, and the fact that the least model and the least fixpoint are identical.

(b) Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range-restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The <...> notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of <...> is the usual restriction to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
Add `Magic' filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
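A small Python sketch of the rewritten program's effect, under the same assumed Assembly instance as above; the magic set of relevant first-column values is computed first, and the SameLev fixpoint is then restricted to it. This is a simplified naive evaluation of the rewritten rules, with illustrative names:

    # Magic-sets-style evaluation of SameLev restricted to the query SameLev(spoke, ?).
    assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
                ("wheel", "spoke", 2), ("wheel", "tire", 1),
                ("frame", "seat", 1)}  # sample instance (assumed)

    # Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).   Seed: Magic_SL(spoke).
    magic = {"spoke"}
    while True:
        more = {p for (p, s, q) in assembly if s in magic} - magic
        if not more:
            break
        magic |= more      # magic = {spoke, wheel, trike}: the only relevant values

    # SameLev rules, each filtered by Magic_SL on the first column.
    same_lev = {(s1, s2) for (p1, s1, q1) in assembly
                         for (p2, s2, q2) in assembly
                         if p1 == p2 and s1 in magic}
    while True:
        new = {(s1, s2) for (p1, s1, q1) in assembly
                        for (l1, l2) in same_lev
                        for (p2, s2, q2) in assembly
                        if p1 == l1 and p2 == l2 and s1 in magic} - same_lev
        if not new:
            break
        same_lev |= new

    print(sorted(t for t in same_lev if t[0] == "spoke"))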


Page 37: Database Technology

§ A database which conforms to an E-R diagram can be represented by a collection of tables.

§ For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding entity set or relationship set.

§ Each table has a number of columns (generally corresponding to attributes), which have unique names.

§ Converting an E-R diagram to a table format is the basis for deriving a relational database design from an E-R diagram.

Representing Entity Sets as Tables
§ A strong entity set reduces to a table with the same attributes.
§ A weak entity set becomes a table that includes a column for the primary key of the identifying strong entity set.

Representing Relationship Sets as Tables

§ A many-to-many relationship set is represented as a table with columns for the primary keys of the two participating entity sets, and any descriptive attributes of the relationship set.

§ The table corresponding to a relationship set linking a weak entity set to its identifying strong entity set is redundant. The payment table already contains the information that would appear in the loan-payment table (i.e., the columns loan-number and payment-number).

E-R Diagram for Banking Enterprise


Representing Generalization as Tables
§ Method 1: Form a table for the generalized (higher-level) entity set account. Form a table for each entity set that is generalized (lower-level), including the primary key of the generalized entity set.
§ Method 2: Form a table for each entity set that is generalized, with all local and inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or

(a) Give features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
• Use the term chronon instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
• Accommodates various calendars:

Gregorian (western), Chinese, Islamic, etc.
Point events


• Single time point event, e.g., a bank deposit.

• A series of point events can form time series data.
Duration events

• Associated with a specific time period. A time period is represented by a start time and an end time.

Valid time
• The time period during which the information is true in the real world.
Transaction time
• The time when the information from a certain transaction becomes current in the database.

Bitemporal database
• Databases dealing with both time dimensions.

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:

• Valid start time
• Valid end time
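A minimal sketch of tuple versioning in Python, assuming a simple employee-salary example; the relation and attribute names are illustrative, not from the original notes:

    from datetime import date

    END_OF_TIME = date(9999, 12, 31)  # conventional "until changed" marker

    # Each version of a tuple carries its valid-time interval [vst, vet).
    emp_salary = [
        # (name, salary, valid_start_time, valid_end_time)
        ("Smith", 25000, date(2009, 1, 1), date(2010, 6, 1)),
        ("Smith", 30000, date(2010, 6, 1), END_OF_TIME),
    ]

    def update_salary(table, name, new_salary, when):
        """Close the current version and append a new one, instead of overwriting."""
        updated = []
        for (n, sal, vst, vet) in table:
            if n == name and vet == END_OF_TIME:
                updated.append((n, sal, vst, when))      # close the old version
            else:
                updated.append((n, sal, vst, vet))
        updated.append((name, new_salary, when, END_OF_TIME))
        return updated

    def salary_as_of(table, name, when):
        """Temporal point query: which version was valid at time `when`?"""
        for (n, sal, vst, vet) in table:
            if n == name and vst <= when < vet:
                return sal

    emp_salary = update_salary(emp_salary, "Smith", 32000, date(2012, 3, 1))
    print(salary_as_of(emp_salary, "Smith", date(2011, 1, 1)))  # 30000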

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.
Time-varying attribute

• An attribute that changes over time, e.g., age.

Non-time-varying attribute
• An attribute that does not change over time,

e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
• Points in a multidimensional space.
• E.g., raster data such as satellite imagery, where each pixel stores a measured value.
• E.g., feature vectors extracted from text.

Region Data
• Objects have spatial extent with location and boundary.
• The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries

• Find all cities within 50 miles of Madison.
• The query has an associated region (location, boundary).
• The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
• Find the 10 cities nearest to Madison.
• Results must be ordered by proximity.

Spatial Join Queries
• Find all cities near a lake.
• Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)

• E.g., ESRI's ArcInfo; OpenGIS Consortium.
• Geospatial information.
• All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
• Store spatial objects such as the surface of an airplane fuselage.
• Range queries and spatial join queries are common.

Multimedia Databases
• Images, video, text, etc. stored and retrieved by content.
• First converted to feature vector form; high dimensionality.
• Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
• B+ trees are fundamentally single-dimensional indexes.


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)

• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
• Given a face, find the five most similar faces.

Multidimensional range queries
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed.

• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>

• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.

Non-leaf entry = <n-dim box, ptr to child node>
• The box covers all boxes in the child node (in fact, subtree).

All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).

• Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at root:
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.
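The two-step overlap search above can be written down directly. A minimal sketch in Python, assuming axis-aligned boxes stored as (xlo, xhi, ylo, yhi) tuples; the node layout and names are illustrative:

    def overlaps(a, b):
        """Axis-aligned boxes (xlo, xhi, ylo, yhi) overlap iff they
        intersect on every dimension."""
        return a[0] <= b[1] and b[0] <= a[1] and a[2] <= b[3] and b[2] <= a[3]

    class Node:
        def __init__(self, leaf, entries):
            self.leaf = leaf          # True  => entries are (box, rid)
            self.entries = entries    # False => entries are (box, child Node)

    def search(node, q, result):
        for box, item in node.entries:
            if overlaps(box, q):
                if node.leaf:
                    result.append(item)       # rid of a candidate object
                else:
                    search(item, q, result)   # descend into the child subtree
        return result

    # Tiny example: two leaves under one root.
    leaf1 = Node(True, [((0, 2, 0, 2), "r1"), ((3, 5, 3, 5), "r2")])
    leaf2 = Node(True, [((6, 8, 1, 3), "r3")])
    root = Node(False, [((0, 5, 0, 5), leaf1), ((6, 8, 1, 3), leaf2)])
    print(search(root, (1, 4, 1, 4), []))  # ['r1', 'r2']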

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
• This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
• The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.

• Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with smallest area.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
• Adjust the entry for L in its parent so that the box now covers (only) L1.
• Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.
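A sketch of the two heuristics just described (least enlargement on descent, small total area on split), reusing the box representation from the search sketch above. The greedy split shown here is a simplification of the quadratic split used in practice:

    def area(b):
        return max(0, b[1] - b[0]) * max(0, b[3] - b[2])

    def enlarge(b, new):
        """Smallest box covering both b and new."""
        return (min(b[0], new[0]), max(b[1], new[1]),
                min(b[2], new[2]), max(b[3], new[3]))

    def best_child(entries, b):
        """Pick the child whose box needs least enlargement to cover b;
        break ties by smallest area."""
        return min(entries,
                   key=lambda e: (area(enlarge(e[0], b)) - area(e[0]), area(e[0])))

    def split(entries):
        """Greedy split: seed two groups with the pair of boxes whose combined
        bounding box is largest, then assign each remaining entry to whichever
        group's bounding box grows least."""
        seeds = max(((i, j) for i in range(len(entries))
                            for j in range(i + 1, len(entries))),
                    key=lambda p: area(enlarge(entries[p[0]][0], entries[p[1]][0])))
        g1, g2 = [entries[seeds[0]]], [entries[seeds[1]]]
        b1, b2 = g1[0][0], g2[0][0]
        for k, e in enumerate(entries):
            if k in seeds:
                continue
            grow1 = area(enlarge(b1, e[0])) - area(b1)
            grow2 = area(enlarge(b2, e[0])) - area(b2)
            if grow1 <= grow2:
                g1.append(e); b1 = enlarge(b1, e[0])
            else:
                g2.append(e); b2 = enlarge(b2, e[0])
        return g1, g2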

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:

• Remove some (say, 30% of the) entries and reinsert them into the tree.
• This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
• Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.

• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
• E.g., feature vectors in multimedia applications.
• Very sparse.

Nearest-neighbor queries are common.
• The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
• In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:

large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:

data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra),

which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.
I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations over multiple disks.
Horizontal partitioning - tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.

Choose a hash function h with range 0, ..., n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
Range partitioning:
Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively - point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range - range queries, e.g., 10 <= r.A < 25.
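The three partitioning functions can be stated in a few lines. A sketch in Python; disk numbering and the sample vector follow the text above:

    import bisect

    def round_robin(i, n):
        """ith tuple inserted goes to disk i mod n."""
        return i % n

    def hash_partition(value, n):
        """Tuple goes to disk h(value), with h ranging over 0, ..., n-1."""
        return hash(value) % n

    def range_partition(value, vector):
        """Partitioning vector [v0, ..., vn-2]: value < v0 -> disk 0,
        vi <= value < vi+1 -> disk i+1, value >= vn-2 -> disk n-1."""
        return bisect.bisect_right(vector, value)

    vector = [5, 11]
    print(range_partition(2, vector))   # 0
    print(range_partition(8, vector))   # 1
    print(range_partition(20, vector))  # 2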

Round-robin:
Advantages:
• Best suited for a sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process.
• No clustering - tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.
Range partitioning:
• Provides data clustering by partitioning attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
- The remaining disks are available for other queries.
- Good if result tuples are from one to a few blocks.


- If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted. Example of execution skew.

Partitioning a Relation across Disks
If a relation contains only a few tuples which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed - that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows:

after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.

• n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

• Assume uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
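A sketch of building a balanced partition vector from a histogram, under the uniformity assumption stated above; the bucket boundaries and counts are made-up sample data:

    def partition_vector_from_histogram(bounds, counts, n):
        """bounds[i], bounds[i+1] delimit bucket i; counts[i] is its tuple count.
        Returns n-1 cut values that split the total count into n equal parts,
        interpolating linearly inside buckets (uniformity assumption)."""
        total = sum(counts)
        cuts, acc, b = [], 0, 0
        for k in range(1, n):
            target = k * total / n
            while acc + counts[b] < target:   # find the bucket holding the target
                acc += counts[b]
                b += 1
            frac = (target - acc) / counts[b]
            lo, hi = bounds[b], bounds[b + 1]
            cuts.append(lo + frac * (hi - lo))
        return cuts

    # Skewed histogram: most tuples fall in [0, 10).
    bounds = [0, 10, 20, 30]
    counts = [600, 300, 100]
    print(partition_vector_from_histogram(bounds, counts, 4))  # narrow cuts near 0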


Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures:

• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained - reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
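A toy sketch of that shared-disk discipline in Python; the class, a single lock standing in for shared/exclusive modes, and the dict-as-disk are all illustrative assumptions:

    import threading

    class SharedDiskPage:
        """Toy cache-coherency rule for one page in a shared-disk system:
        re-read the page from disk on lock, flush it on unlock if dirty."""
        def __init__(self, disk, page_id):
            self.disk, self.page_id = disk, page_id
            self.lock = threading.RLock()   # stand-in for shared/exclusive modes
            self.buffer = None
            self.dirty = False

        def acquire(self):
            self.lock.acquire()
            self.buffer = self.disk[self.page_id]   # must re-read on locking:
            self.dirty = False                      # another node may have written it

        def write(self, value):
            self.buffer = value
            self.dirty = True

        def release(self):
            if self.dirty:
                self.disk[self.page_id] = self.buffer  # flush before unlocking
            self.lock.release()

    disk = {"p1": "old"}
    page = SharedDiskPage(disk, "p1")
    page.acquire(); page.write("new"); page.release()
    print(disk["p1"])  # 'new'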

More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation parallelism - parallelize the execution of each individual operation in the query.
Interoperation parallelism - execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
Create a range-partition vector with m entries, on the sorting attributes.
Redistribute the relation using range partitioning:

• all tuples that lie in the ith range are sent to processor Pi;
• Pi stores the tuples it received temporarily on disk Di;
• this step requires I/O and communication overhead.

Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1.
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:

• The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

• Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

• The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join

For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

Let r and s be the input relations, and suppose we want to compute the join of r and s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
Can use either range partitioning or hash partitioning.
r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
Partitions ri and si are sent to processor Pi.

Each processor Pi locally computes the join of ri with si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions,

e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case - asymmetric fragment-and-replicate:

• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:

e.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing partitioned hash join:
Assume s is smaller than r, and therefore s is chosen as the build relation.
A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi.


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case:

e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
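A compact sketch of the two-level hashing idea (h1 routes tuples to processors, h2 builds the local hash tables), simulating the processors with lists; the names and sample data are illustrative:

    from collections import defaultdict

    N = 3  # number of (simulated) processors

    def h1(key):            # routes a tuple to a processor
        return key % N

    def parallel_hash_join(r, s):
        """r, s: lists of (join_key, payload). s is the smaller build relation."""
        # Phase 1: redistribute both relations by h1 on the join attribute.
        r_parts = defaultdict(list)
        s_parts = defaultdict(list)
        for t in s:
            s_parts[h1(t[0])].append(t)
        for t in r:
            r_parts[h1(t[0])].append(t)

        # Phase 2: each processor joins its partitions locally.
        # h2 is the local hash function; Python's dict hashing stands in for it.
        result = []
        for i in range(N):
            build = defaultdict(list)
            for key, payload in s_parts[i]:        # build phase on s_i
                build[key].append(payload)
            for key, payload in r_parts[i]:        # probe phase with r_i
                for s_payload in build[key]:
                    result.append((key, payload, s_payload))
        return result

    r = [(1, "r-a"), (2, "r-b"), (4, "r-c"), (2, "r-d")]
    s = [(2, "s-x"), (4, "s-y")]
    print(parallel_hash_join(r, s))  # joins on matching keys only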

Parallel Nested-Loop Join
Assume that:

• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:

• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
• P2 be assigned the computation of temp2 = temp1 ⋈ r3,

• and P3 be assigned the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results,

provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.

• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

• P2 be assigned the computation of temp2 = r3 ⋈ r4,
• and P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

• Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

• Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• how to parallelize each operation, and how many processors to use for it;
• what operations to pipeline, what operations to execute independently in parallel, and

what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:

e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:

- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering - building an expert system
Knowledge Engineers - the people who build the system
Knowledge Representation - the symbols used to represent the knowledge
Factual Knowledge - knowledge of a particular task domain that is widely shared


Heuristic Knowledge - more judgmental knowledge of performance in a task domain
Uses of Rule-based Systems

• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:

Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.

Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

• Set of Rules - derived from the knowledge base and used by the interpreter to evaluate the inputted data.

• Knowledge Engineer - decides how to represent the experts' knowledge, and how to build the inference engine appropriately for the domain.

• Interpreter - interprets the inputted data and draws a conclusion based on the users' responses.


Problem-solving Models
• Forward-chaining - starts from a set of conditions and moves towards some conclusion.
• Backward-chaining - starts with a list of goals, and then works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the amount of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition

• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:

KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
Logic - Resolution principle
Production rules - backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & Frames - Inheritance & advanced reasoning
Case-based Reasoning - Similarity-based
KBS tools - Shells
- Consist of KA Tool, Database & Development Interface
- Inductive Shells

- simplest
- example cases represented as a matrix of known data (premises) and resulting effects
- matrix converted into a decision tree or IF-THEN statements
- examples selected for the tool

Rule-based shells
- simple to complex
- IF-THEN rules

Hybrid shells
- sophisticated & powerful
- support multiple KR paradigms & reasoning schemes
- generic tools, applicable to a wide range

Special purpose shells
- specifically designed for particular types of problems


- restricted to specialised problems
Scratch

- requires more time and effort
- no constraints like shells
- shells should be investigated first

Some example KBSs:
DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers)
Typical tasks of KBS:
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions,

e.g., diagnose reasons for engine failure.
(2) Interpretation - To provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction - To predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design - To develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning - Both short-term & long-term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring - To check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control - To collect and evaluate evidence and form opinions on that evidence,

e.g., control a patient's treatment.
(8) Instruction - To train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging - To identify and prescribe remedies for malfunctions,

e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

- Increased availability of expert knowledge:
  expertise not otherwise accessible;
  training future experts
- Efficient and cost-effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify

- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:

• mimics organisational structure, with data

• local access and autonomy, without exclusion

• cheaper to create and easier to expand

• improved availability/reliability/performance, by removing reliance on a central site

• reduced communication overhead:

most data access is local, less expensive, and performs better
• improved processing power:

many machines handling the database, rather than a single server
Disadvantages:
• more complex to implement
• more costly to maintain
• security and integrity control are harder
• standards and experience are lacking
• design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
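A minimal sketch of the timestamp rules above in Python; the version-list representation, class name and method names are illustrative assumptions, not any particular DBMS's API:

    class MVCCStore:
        """Each object keeps a list of versions (write_ts, value) plus a read
        timestamp; reads never block, writes follow the rules described above."""
        def __init__(self):
            self.versions = {}   # name -> list of (write_ts, value), ascending
            self.read_ts = {}    # name -> largest timestamp that has read it

        def read(self, name, ts):
            # Most recent version whose write timestamp precedes TS(Ti).
            candidates = [(w, v) for (w, v) in self.versions.get(name, []) if w <= ts]
            if not candidates:
                return None
            self.read_ts[name] = max(self.read_ts.get(name, 0), ts)
            return max(candidates)[1]

        def write(self, name, value, ts):
            # Abort if a later transaction already read this object.
            if ts < self.read_ts.get(name, 0):
                raise RuntimeError("abort and restart transaction")
            self.versions.setdefault(name, []).append((ts, value))
            self.versions[name].sort()
            self.read_ts[name] = max(self.read_ts.get(name, 0), ts)

    db = MVCCStore()
    db.write("Object1", "Foo", 0)
    db.write("Object1", "Hello", 1)
    print(db.read("Object1", 1))   # 'Hello' - snapshot as of t1
    print(db.read("Object1", 0))   # 'Foo'   - an older snapshot is still readable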

At t1, the state of a DB could be:
Time  Object1  Object2
t1    "Hello"  "Bar"
t2    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:
Time  Object1  Object2    Object3


t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Hello"  "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems - which may involve data management, transaction management and database recovery - have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

• The general architecture of a mobile platform is illustrated in Fig. 30.1.

• It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

• Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

• Wireless Communications:
  - The wireless medium has bandwidth significantly lower than that of a wired network.
  - The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony), to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

• Client/Network Relationships:
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units must be unrestricted throughout the cells of a domain, while maintaining information access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.
Characteristics of Mobile Environments
The characteristics of mobile computing include:

• Communication latency
• Intermittent connectivity
• Limited battery life
• Changing client location

The server may not be able to reach a client:


• A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.
• In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
• Client mobility also allows new applications that are location-based.
Data Management Issues
• From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
  2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
• Data management issues, as applied to mobile databases:
  - Data distribution and replication
  - Transaction models
  - Query processing
  - Recovery and fault tolerance


  - Mobile database design
  - Location-based service
  - Division of labor
  - Security

Application: Intermittently Synchronized Databases
• Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
• Query optimization is an important task in a relational DBMS.
• One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
• Two parts to optimizing a query:

1. Consider a set of alternative plans.
   • Must prune the search space; typically, left-deep plans only.
2. Must estimate the cost of each plan that is considered.
   • Must estimate the size of the result and the cost for each plan node.

• Key issues: statistics, indexes, operator implementations.
• Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.

• Two main issues:
  1. For a given query, what plans are considered?
     • Algorithm to search the plan space for the cheapest (estimated) plan.
  2. How is the cost of a plan estimated?

• Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

• Similar to the old schema; rname is added for variations.
• Reserves:

  Each tuple is 40 bytes long; 100 tuples per page; 1000 pages.
• Sailors:

  Each tuple is 50 bytes long; 80 tuples per page; 500 pages.
Query Blocks: Units of Optimization

• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
• For each block, the plans considered are:

  • All available access methods, for each relation in the FROM clause.
  • All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
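A toy sketch of left-deep plan enumeration over the example schema, with a made-up cost function standing in for the real cost model; everything here is illustrative, not the actual System R algorithm:

    from itertools import permutations

    relations = {"Sailors": 500, "Reserves": 1000}   # pages, from the example schema
    join_methods = ["nested loops", "sort-merge", "hash join"]

    def cost(outer_pages, inner_pages, method):
        """Stand-in cost formulas; a real optimizer uses statistics and
        per-operator cost models."""
        if method == "nested loops":
            return outer_pages + outer_pages * inner_pages
        return 3 * (outer_pages + inner_pages)       # crude sort-merge/hash estimate

    best = None
    for order in permutations(relations):            # all left-deep join orders
        outer = relations[order[0]]
        total = 0
        for inner_name in order[1:]:
            m, c = min(((m, cost(outer, relations[inner_name], m))
                        for m in join_methods), key=lambda x: x[1])
            total += c
            outer = max(1, outer // 10)              # fake result-size estimate
        if best is None or total < best[0]:
            best = (total, order)

    print(best)  # cheapest estimated left-deep plan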

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases

• To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.

• The database must handle large object representation.
• Similarity-based retrieval must be provided by special index structures.
• Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
• Store and transmit multimedia data in compressed form:

  - JPEG and GIF: the most widely used formats for image data.
  - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.

• MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.

• Several alternatives for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

• The most important types are video and audio data.
• Characterized by high data volumes and real-time information-delivery requirements:

  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained:

    video of a person speaking must show lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals:

  - must guarantee end-to-end delivery rates.

• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.

• Head-end terminals - used to view multimedia data:
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - E.g., identify similar designs for registering a new trademark.

• Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - E.g., test user input against stored commands.

• Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
• SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
• Can we extend the query language to cover such queries?
  - Yes, by adding recursion.
• Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
• SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
• Datalog is a query language that has the same if-then flavor:

  - New: The answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
• Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:

  - this takes us one level down the Assembly hierarchy;
  - to find components that are one level deeper (e.g., rim), we need another join;
  - to find all components, we need as many joins as there are levels in the given instance.

• For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
• The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
• The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
• The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

153 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

1531 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

1532 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables.
- Stratum I: tables defined in terms of tables in Stratum I and lower strata.
- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

1533 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to the treatment of negation.

154 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

1541 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

1542 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(spoke).
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 38: Database Technology

Representing Generalization as Tables
- Method 1: Form a table for the generalized entity (account). Form a table for each entity set that is generalized (include the primary key of the generalized entity set).
- Method 2: Form a table for each entity set that is generalized, with all local and inherited attributes.

(b) Explain the features of Temporal & Spatial Databases in detail. (JUNE 2010)
Or
(a) Give the features of Temporal and Spatial Databases. (DECEMBER 2010)

Temporal Database
Time Representation, Calendars, and Time Dimensions

Time is considered an ordered sequence of points in some granularity.
- Use the term chronon instead of point to describe the minimum granularity.

A calendar organizes time into different time units for convenience.
- Accommodates various calendars: Gregorian (western), Chinese, Islamic, etc.

Point events
- A single time point event, e.g., a bank deposit.
- A series of point events can form time series data.

Duration events
- Associated with a specific time period; a time period is represented by a start time and an end time.

Transaction time
- The time when the information from a certain transaction becomes current in the database.

Bitemporal database
- A database dealing with two time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
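A minimal sketch of tuple versioning in SQL, assuming a hypothetical Emp_VT table; a very large end time (here '9999-12-31') conventionally marks the currently valid version:

    CREATE TABLE Emp_VT (
        ssn    CHAR(9),
        salary INTEGER,
        vst    DATE,    -- valid start time
        vet    DATE     -- valid end time
    );

    -- Salaries as valid on 1 June 2002:
    SELECT ssn, salary
    FROM Emp_VT
    WHERE vst <= DATE '2002-06-01' AND vet >= DATE '2002-06-01';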

Incorporating Time in Object-Oriented Databases Using Attribute Versioning


A single complex object stores all temporal changes of the object.

Time-varying attribute
- An attribute that changes over time, e.g., age.

Non-time-varying attribute
- An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- Find all cities within 50 miles of Madison.
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.

Nearest-Neighbor Queries
- Find the 10 cities nearest to Madison.
- Results must be ordered by proximity.

Spatial Join Queries
- Find all cities near a lake.
- Expensive; the join condition involves regions and proximity.
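As a sketch only, here is how the range and join queries above might look in SQL with a spatial extension; Cities, Lakes, and the Distance function are hypothetical stand-ins, since spatial operators are system-specific:

    -- Spatial range query
    SELECT C.name
    FROM Cities C
    WHERE Distance(C.location, :madison_point) <= 50;

    -- Spatial join query (cities within 5 miles of some lake)
    SELECT DISTINCT C.name
    FROM Cities C, Lakes L
    WHERE Distance(C.location, L.region) <= 5;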

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.

Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.

Multimedia Databases
- Images, video, text, etc., stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
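To make the linearization point concrete, a short sketch assuming a hypothetical Employees(age, sal) table: the composite index serves conditions on the leading attribute well, but not conditions on sal alone.

    CREATE INDEX emp_age_sal ON Employees (age, sal);

    -- Served well: entries with age in [12, 13] are contiguous in the index.
    SELECT * FROM Employees WHERE age BETWEEN 12 AND 13;

    -- Served poorly: qualifying entries (e.g., <11, 80> and <13, 75>)
    -- are scattered through the linearized order.
    SELECT * FROM Employees WHERE sal BETWEEN 75 AND 80;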

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD)
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval)
- Given a face, find the five most similar faces.

Multidimensional range queries
- 50 < age < 55 AND 80K < sal < 90K

Drawbacks of conventional indexes: an index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
- This is Alternative (2), with the key value being a box.
- The box is the tightest bounding box for a data object.

Non-leaf entry = <n-dimensional box, ptr to child node>
- The box covers all boxes in the child node (in fact, subtree).

All leaves are at the same distance from the root. Nodes can be kept 50% full (except the root).
- Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1 If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2 If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.
- Go to the child whose box needs least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in the insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data

- Typically, high-dimensional datasets are collections of points, not regions; e.g., feature vectors in multimedia applications. Very sparse.
- Nearest-neighbor queries are common; the R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful. In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

- prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
- large volumes of transaction data are collected and stored for later analysis;
- multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
- data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other, and concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin: send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
- Choose one or more attributes as the partitioning attributes.

- Choose a hash function h with range 0 ... n-1.
- Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Partitioning techniques (cont.)
Range partitioning:
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn-2] is chosen.
- Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; and tuples with v >= vn-2 go to disk n-1.
- E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
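As an illustration only, the [5, 11] partitioning vector can be expressed with declarative partitioning; the sketch below uses PostgreSQL-style syntax and a hypothetical relation r(a, b), with one partition standing in for each disk:

    CREATE TABLE r (a INTEGER, b TEXT) PARTITION BY RANGE (a);
    CREATE TABLE r_d0 PARTITION OF r FOR VALUES FROM (MINVALUE) TO (5);   -- v < 5        -> disk 0
    CREATE TABLE r_d1 PARTITION OF r FOR VALUES FROM (5) TO (11);         -- 5 <= v < 11  -> disk 1
    CREATE TABLE r_d2 PARTITION OF r FOR VALUES FROM (11) TO (MAXVALUE);  -- v >= 11      -> disk 2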

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1 Scanning the entire relation.
2 Locating a tuple associatively (point queries), e.g., r.A = 25.
3 Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin
Advantages:
- Best suited for a sequential scan of the entire relation on each query.
- All disks have an almost equal number of tuples; retrieval work is thus well balanced between the disks.
Range queries are difficult to process:
- no clustering; tuples are scattered across all disks.

Hash partitioning
Good for sequential access:
- assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between the disks;
- retrieval work is then well balanced between the disks.
Good for point queries on the partitioning attribute:
- can look up a single disk, leaving the others available for answering other queries;
- an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so it is difficult to answer range queries.

Range partitioning
Provides data clustering by partitioning attribute value.
Good for sequential access.
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one to a few disks may need to be accessed:
- the remaining disks are available for other queries;
- good if the result tuples are from one to a few blocks;

- if many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted (an example of execution skew).

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
- Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- A histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:
  - locking and logging must be coordinated by passing messages between processors;
  - data in a local buffer may have been updated at another processor;
  - cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:
  - all tuples that lie in the i-th range are sent to processor Pi;
  - Pi stores the tuples it received temporarily on disk Di;
  - this step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism). The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - the sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1;
  - each processor Pi performs a merge on the streams as they are received, to get a single sorted run;
  - the sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Can use either range partitioning or hash partitioning.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:
- e.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
- e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r; therefore, s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them out and reading them back in.

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism

- Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2;
  - let P2 be assigned the computation of temp2 = temp1 ⋈ r3;
  - and let P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4:
  - let P1 be assigned the computation of temp1 = r1 ⋈ r2;
  - let P2 be assigned the computation of temp2 = r3 ⋈ r4;
  - and let P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
- Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
- e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence" (expert knowledge) and thereby makes intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
- knowledge-based systems;
- part of the artificial intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

- Knowledge engineering: building an expert system.
- Knowledge engineers: the people who build the system.
- Knowledge representation: the symbols used to represent the knowledge.
- Factual knowledge: knowledge of a particular task domain that is widely shared.
- Heuristic knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems

- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System
- Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and of the ability to adapt to changing environments.
- Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-based systems: definition
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic, rather than algorithmic.
- Heuristics in search vs. in KBS; general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference
  KR formalism              Inference
  Logic                     Resolution principle
  Production rules          Backward (top-down, goal-directed); forward (bottom-up, data-driven)
  Semantic nets & frames    Inheritance & advanced reasoning
  Case-based reasoning      Similarity-based

KBS tools: shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computer configuration).

Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages:
- increase the availability of expert knowledge (expertise otherwise not accessible; training future experts);
- efficient and cost-effective;
- consistency of answers;
- explanation of the solution;
- deal with uncertainty.

Limitations:
- lack of common sense;
- inflexible, difficult to modify;
- restricted domain of expertise;
- lack of learning ability;
- not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimic the organisational structure with data;
- local access and autonomy, without exclusion;
- cheaper to create and easier to expand;
- improved availability/reliability/performance, by removing reliance on a central site;
- reduced communication overhead: most data access is local, which is less expensive and performs better;
- improved processing power: many machines handle the database, rather than a single server.

Disadvantages relative to conventional (centralized) databases:
- more complex to implement;
- more costly to maintain;
- security and integrity control are harder;
- standards and experience are lacking;
- design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC) is, in the database field of computer science, a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).

If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of the DB could be:

    Time  Object1  Object2
    t1    "Hello"  "Bar"
    t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


    Time  Object1  Object2    Object3
    t2    "Hello"  (deleted)  "Foo-Bar"
    t1    "Hello"  "Bar"
    t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
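A small sketch of this snapshot behaviour in SQL, assuming a system whose REPEATABLE READ level is implemented with MVCC snapshots (as in PostgreSQL, for example) and a hypothetical Objects(name, value) table:

    -- Session 1 (long-running reader)
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT * FROM Objects;   -- sees the t1 snapshot: Object1='Hello', Object2='Bar'

    -- Session 2 (concurrent writer), meanwhile:
    BEGIN;
    DELETE FROM Objects WHERE name = 'Object2';
    INSERT INTO Objects (name, value) VALUES ('Object3', 'Foo-Bar');
    COMMIT;                  -- creates the t2 versions

    -- Session 1, still inside its transaction:
    SELECT * FROM Objects;   -- still sees the t1 snapshot, without taking locks
    COMMIT;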

Recovery

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems, which may involve data management, transaction management, and database recovery, have their origins in distributed database systems.
- In mobile computing, the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);

  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network; they instead form their own network, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.
The server may not be able to reach a client:
- A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as for clients:
- The latency involved in wireless communication makes scalability a problem; since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues, as applied to mobile databases:
- data distribution and replication;
- transaction models;
- query processing;
- recovery and fault tolerance;
- mobile database design;
- location-based service;
- division of labor;
- security.

Application: Intermittently Synchronized Databases
- Whenever clients connect, through a process known in industry as synchronization of a client with a server, they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, available communication nodes, available resources, etc.

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  1. Consider a set of alternative plans.
     - Must prune the search space; typically, only left-deep plans are considered.
  2. Estimate the cost of each plan that is considered.
     - Must estimate the size of the result, and the cost, for each plan node.
     - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: we want to find the best plan. Practically: we want to avoid the worst plans.
- We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long; 100 tuples per page; 1000 pages.
Sailors:
- Each tuple is 50 bytes long; 80 tuples per page; 500 pages.
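For concreteness, a typical single-block query over this schema (the constant 100 and the rating predicate are illustrative); for this block, the optimizer would consider the available access methods on Sailors and Reserves, and the left-deep join orders, estimating the cost of each combination:

    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid
      AND R.bid = 100
      AND S.rating > 5;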

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods, for each relation in the FROM clause;
  - all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases
- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database, in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form:
  - JPEG and GIF are the most widely used formats for image data;
  - MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding:


  - MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - data must be delivered sufficiently fast that there are no gaps in the audio or video;
  - data must be delivered at a rate that does not cause overflow of system buffers;
  - synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals:
  - must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data:
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries

15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: The answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

(i) Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is larger than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

(ii) Safe Datalog Programs
Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

(iii) The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
(A small sketch of double+ and its fixpoints follows.)
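To make the definitions concrete, here is a minimal Python sketch, reading double+ as the function S ↦ S ∪ {2x | x ∈ S}, consistent with the example above:

    def double_plus(s):
        # double+(S) = S ∪ {2x | x in S}; e.g. double_plus({1, 2, 5}) adds 4 and 10.
        return s | {2 * x for x in s}

    def is_fixpoint(f, v):
        return f(v) == v

    print(is_fixpoint(double_plus, set()))       # True: the empty set is a fixpoint
    print(is_fixpoint(double_plus, {1, 2, 5}))   # False: 4 and 10 are missing

    # Iterating double+ from {1, 2, 5} climbs toward the least fixpoint that
    # contains these facts; here that fixpoint is infinite, so we cap the loop.
    s = {1, 2, 5}
    for _ in range(3):
        s = double_plus(s)
    print(sorted(s))   # [1, 2, 4, 5, 8, 10, 16, 20, 40]

A range-restricted Datalog program over finite input relations, by contrast, always reaches its least fixpoint in finitely many iterations.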

(iv) Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: If T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: When recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations. The rewritten recursive rule for Comp is:

Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

(A small sketch of seminaive evaluation follows.)

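To make the seminaive idea concrete, here is a minimal Python sketch, representing relations as sets of tuples (names and data are illustrative):

    def seminaive_comp(assembly):
        # Seminaive evaluation of:
        #   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
        #   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
        # assembly: set of (part, subpart, qty) facts.
        comp = {(p, s) for (p, s, _) in assembly}   # base rule
        delta = set(comp)                           # tuples from the last iteration
        while delta:
            # The recursive rule requires the Comp body fact to come from delta,
            # so no inference is repeated across iterations.
            new = {(p, s)
                   for (p, p2, _) in assembly
                   for (p2b, s) in delta
                   if p2 == p2b} - comp
            comp |= new
            delta = new                             # update the delta table
        return comp

    # Example instance: trike -> wheel -> spoke / tire
    assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}
    print(seminaive_comp(assembly))
    # {('trike','wheel'), ('wheel','spoke'), ('wheel','tire'),
    #  ('trike','spoke'), ('trike','tire')}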

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: Modify each rule in the program by adding a 'Magic' condition to the body, which acts as a filter on the set of tuples generated by the rule.
- Define the 'Magic' relations: We must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.

Page 39: Database Technology

- Single time point event: e.g., a bank deposit.
- A series of point events can form time-series data.

Duration events:
- Associated with a specific time period. A time period is represented by a start time and an end time.

Transaction time:
- The time when the information from a certain transaction becomes valid.

Bitemporal database:
- Databases dealing with two time dimensions (valid time and transaction time).

Incorporating Time in Relational Databases Using Tuple Versioning
Add to every tuple:
- Valid start time
- Valid end time
(A small sketch of tuple versioning follows.)
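As an illustration, a minimal Python sketch of tuple versioning; the row layout and the FOREVER sentinel are illustrative assumptions, not a fixed schema:

    import datetime

    FOREVER = datetime.date.max   # sentinel meaning "still current"

    def update_value(rows, key, new_value, today):
        # Close the currently valid version of the tuple ...
        for row in rows:
            if row["key"] == key and row["vet"] == FOREVER:
                row["vet"] = today                  # set its valid end time
        # ... and open a new version valid from today onward.
        rows.append({"key": key, "value": new_value,
                     "vst": today, "vet": FOREVER})

    rows = []
    update_value(rows, "emp1", "clerk", datetime.date(2009, 1, 1))
    update_value(rows, "emp1", "manager", datetime.date(2010, 6, 1))
    # rows now holds both versions, so the old value remains queryable.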

Incorporating Time in Object-Oriented Databases Using Attribute Versioning

A single complex object stores all temporal changes of the object.

Time-varying attribute:
- An attribute that changes over time, e.g., age.
Non-time-varying attribute:
- An attribute that does not change over time, e.g., date of birth.

Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.

Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial range queries:
- "Find all cities within 50 miles of Madison."
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-neighbor queries:
- "Find the 10 cities nearest to Madison."
- Results must be ordered by proximity.
Spatial join queries:
- "Find all cities near a lake."
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS):
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing:
- Store spatial objects, such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases:
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature-vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.

- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the same entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- Find all hotels within a radius of 5 miles from the conference venue.
- Find the city with a population of 500,000 or more that is nearest to Kalamazoo, MI.
- Find all cities that lie on the Nile in Egypt.
- Find all parts that touch the fuselage (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K.

Drawbacks of one-dimensional schemes: an index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- The index must support inserts and deletes gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>:
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>:
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root):
  - We can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
(A sketch of this search follows.)
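A minimal Python sketch of this search. Boxes are axis-aligned 2-D rectangles (xlo, ylo, xhi, yhi); the node layout is an illustrative assumption:

    class Node:
        def __init__(self, is_leaf, entries):
            # Leaf entries: (box, rid); non-leaf entries: (box, child Node).
            self.is_leaf = is_leaf
            self.entries = entries

    def overlaps(a, b):
        return (a[0] <= b[2] and b[0] <= a[2] and
                a[1] <= b[3] and b[1] <= a[3])

    def search(node, q, out):
        if node.is_leaf:
            for box, rid in node.entries:
                if overlaps(box, q):
                    out.append(rid)          # object that might overlap Q
        else:
            for box, child in node.entries:
                if overlaps(box, q):         # only descend into overlapping subtrees
                    search(child, q, out)
        return out

    leaf = Node(True, [((0, 0, 2, 2), "r1"), ((5, 5, 6, 6), "r2")])
    root = Node(False, [((0, 0, 6, 6), leaf)])
    print(search(root, (1, 1, 3, 3), []))    # ['r1']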

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.
But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost, and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L:
  - Go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions:
- E.g., feature vectors in multimedia applications.
- Very sparse.
Nearest-neighbor queries are common:
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful:
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory, and disks have dropped sharply.
Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis.
- Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
- Data can be partitioned across multiple disks for parallel I/O.
- Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition.
- Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
- Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
- Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations across multiple disks.
Horizontal partitioning – the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
Round-robin:
- Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
- Choose one or more attributes as the partitioning attributes.

- Choose a hash function h with range 0 ... n – 1.
- Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Partitioning techniques (cont.):
Range partitioning:
- Choose an attribute as the partitioning attribute.
- A partitioning vector [v0, v1, ..., vn-2] is chosen.
- Let v be the partitioning attribute value of a tuple. Tuples such that vi <= v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n – 1.
- E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries, e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries, e.g., 10 <= r.A < 25.

Round-robin:
Advantages:

- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantage:
- Range queries are difficult to process: no clustering, so tuples are scattered across all disks.

Hash partitioning:
- Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: we can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.
  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted; this is an example of execution skew.
(A sketch of the three partitioning schemes follows.)
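As an illustration, a minimal Python sketch of the three partitioning schemes just compared; tuple and vector shapes are illustrative, and Python's built-in hash stands in for the hash function h:

    from bisect import bisect_right

    def round_robin_disk(i, n):
        # The i-th tuple inserted in the relation goes to disk i mod n.
        return i % n

    def hash_disk(value, n):
        # h(value) in 0 .. n-1 decides the disk.
        return hash(value) % n

    def range_disk(value, vector):
        # With vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
        return bisect_right(vector, value)

    print(range_disk(2, [5, 11]), range_disk(8, [5, 11]), range_disk(20, [5, 11]))
    # 0 1 2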

Partitioning a Relation across Disks
- If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows:
  - After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- Here n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
(A sketch of partition-vector construction follows.)
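A minimal Python sketch of the sort-and-scan method for building a balanced range-partition vector (assumes at least n tuples; attribute name is illustrative):

    def balanced_partition_vector(relation, attr, n):
        # Sort the relation on the partitioning attribute ...
        values = sorted(t[attr] for t in relation)
        step = len(values) // n
        # ... and after every 1/n-th of the tuples, record the next attribute value.
        return [values[i * step] for i in range(1, n)]

    rel = [{"sal": s} for s in (10, 12, 15, 20, 30, 31, 40, 80)]
    print(balanced_partition_vector(rel, "sal", 4))   # [15, 30, 40]
    # The vector [15, 30, 40] splits the 8 tuples into four partitions of 2.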

Interquery Parallelism
- Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- The easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:

  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
(A sketch of this protocol follows.)
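A schematic Python sketch of these three steps; the lock manager and disk are stub stand-ins, not a real implementation:

    class StubLocks:
        def lock(self, pid, mode): pass   # a real lock manager would block on conflicts
        def unlock(self, pid): pass

    class StubDisk:
        def __init__(self): self.pages = {}
        def read(self, pid): return self.pages.get(pid, b"")
        def write(self, pid, data): self.pages[pid] = data

    def access_page(pid, mode, locks, disk, buffer):
        locks.lock(pid, mode)           # 1. lock in shared/exclusive mode first
        buffer[pid] = disk.read(pid)    # 2. on locking, (re)read from disk, so a
        return buffer[pid]              #    stale copy from another processor is never used

    def release_page(pid, modified, locks, disk, buffer):
        if modified:
            disk.write(pid, buffer[pid])  # 3. write back before unlocking
        locks.unlock(pid)

    buffer, disk, locks = {}, StubDisk(), StubLocks()
    access_page("p1", "exclusive", locks, disk, buffer)
    buffer["p1"] = b"new contents"
    release_page("p1", True, locks, disk, buffer)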

More complex protocols, with fewer disk reads/writes, exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
- Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n – 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the ith range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.

- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
(A sketch of the range-partitioning sort follows.)
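A minimal, sequential Python simulation of the range-partitioning sort; each list plays the role of one processor's partition:

    from bisect import bisect_right

    def range_partition_sort(relation, vector):
        # Redistribute: tuples in the i-th range go to "processor" Pi.
        parts = [[] for _ in range(len(vector) + 1)]
        for t in relation:
            parts[bisect_right(vector, t)].append(t)
        # Each processor sorts its partition locally (in parallel in a real system).
        for p in parts:
            p.sort()
        # The final merge is trivial: all keys in Pi precede all keys in Pj for i < j.
        return [t for p in parts for t in p]

    print(range_partition_sort([7, 3, 18, 11, 1, 25], [5, 11]))
    # [1, 3, 7, 11, 18, 25]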

Parallel Join
- The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output.
- Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
- In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute r ⋈ s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- We can use either range partitioning or hash partitioning.
- r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes ri ⋈ si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
- Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B.
- For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
- Special case – asymmetric fragment-and-replicate:
  - One of the relations, say r, is partitioned; any partitioning technique can be used.
  - The other relation, s, is replicated across all the processors.
  - Processor Pi then locally computes the join of ri with all of s, using any join technique.
- Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
- Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
- Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:
  - E.g., say s is small and r is large, and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s, and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:
  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
(A sketch of the partitioned parallel hash-join follows.)
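A sequential Python simulation of the partitioned parallel hash-join. h1 distributes tuples to "processors"; for brevity the local join builds one in-memory table per processor rather than sub-partitioning with h2, and a real system would run the per-processor loop bodies in parallel:

    def partitioned_hash_join(r, s, n):
        # r, s: lists of (join_key, payload) tuples; s is the smaller build relation.
        h1 = lambda key: hash(key) % n
        r_parts = [[] for _ in range(n)]
        s_parts = [[] for _ in range(n)]
        for t in r:
            r_parts[h1(t[0])].append(t)     # redistribute r with h1
        for t in s:
            s_parts[h1(t[0])].append(t)     # redistribute s with h1
        result = []
        for i in range(n):                  # the work of processor Pi
            build = {}
            for key, v in s_parts[i]:       # build phase on si
                build.setdefault(key, []).append(v)
            for key, v in r_parts[i]:       # probe phase on ri
                for w in build.get(key, []):
                    result.append((key, v, w))
        return result

    r = [(1, "r-a"), (2, "r-b"), (1, "r-c")]
    s = [(1, "s-x"), (3, "s-y")]
    print(partitioned_hash_join(r, s, 2))
    # [(1, 'r-a', 's-x'), (1, 'r-c', 's-x')]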

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Then:
- Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2, P2 the computation of temp2 = temp1 ⋈ r3, and P3 the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation, even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent parallelism:
- Consider a join of four relations r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Let P1 be assigned the computation of temp1 = r1 ⋈ r2, P2 be assigned the computation of temp2 = r3 ⋈ r4, and P3 the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
- Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
- Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.
- When scheduling an execution tree in a parallel system, we must decide:
  - how to parallelize each operation, and how many processors to use for it;
  - what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
- Determining the amount of resources to allocate for each operation is a problem:
  - E.g., allocating more processors than optimal can result in high communication overhead.
- Long pipelines should be avoided, as the final operation may wait a long time for inputs, while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss Rules, Knowledge Bases, and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
– knowledge-based systems;
– part of the artificial intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

Key terms:
- Knowledge engineering – building an expert system.
- Knowledge engineers – the people who build the system.
- Knowledge representation – the symbols used to represent the knowledge.
- Factual knowledge – knowledge of a particular task domain that is widely shared.

- Heuristic knowledge – more judgmental knowledge of performance in a task domain.

Uses of Rule-Based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

1. A list of rules or rule base, which is a specific type of knowledge base.
2. An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
   - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
   - Conflict-resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
   - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

3. Temporary working memory.
4. A user interface, or other connection to the outside world, through which input and output signals are received and sent.
(A sketch of the match-resolve-act cycle follows.)
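A minimal Python sketch of the match-resolve-act cycle over a working memory of facts; rule shapes and the "first match" resolution strategy are illustrative assumptions:

    def run_productions(rules, wm):
        # rules: list of (condition, action) pairs; wm: working memory (a set of facts).
        while True:
            # Match: the conflict set holds all productions whose LHS is satisfied.
            conflict_set = [(c, a) for c, a in rules if c(wm)]
            fired = False
            # Conflict resolution: pick an instantiation that still adds something.
            for cond, act in conflict_set:
                new_facts = act(wm) - wm
                if new_facts:
                    wm |= new_facts        # Act: actions may change working memory
                    fired = True
                    break                  # return to the match phase
            if not fired:
                return wm                  # no production fired: the interpreter halts

    rules = [
        (lambda wm: "engine cranks" in wm and "no start" in wm,
         lambda wm: {"suspect fuel system"}),
        (lambda wm: "suspect fuel system" in wm,
         lambda wm: {"check fuel pump"}),
    ]
    print(run_productions(rules, {"engine cranks", "no start"}))
    # {'engine cranks', 'no start', 'suspect fuel system', 'check fuel pump'}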

Components of a Rule-Based System
- Set of rules – derived from the knowledge base, and used by the interpreter to evaluate the input data.
- Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the input data and draws a conclusion based on the user's responses.

Problem-Solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human error.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based systems – definition:
- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture

The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & inference:
- Logic: resolution principle.
- Production rules: backward chaining (top-down, goal-directed); forward chaining (bottom-up, data-driven).
- Semantic nets & frames: inheritance & advanced reasoning.
- Case-based reasoning: similarity-based.

KBS tools – shells:
- Consist of a KA (knowledge acquisition) tool, database, and development interface.
- Inductive shells:

  - simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range of problems.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).

Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages
- Increase the availability of expert knowledge: expertise not otherwise accessible; training future experts.
- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
- More complex to implement, and more costly to maintain.
- Security and integrity control is harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC) is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and lets a transaction Ti read the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t2    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Foo", Object2="Bar". Previously, Object1 was "Hello", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Hello"  "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction runs in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks. (A minimal sketch of the timestamp rules follows.)
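A minimal Python sketch of the timestamp rules described above; single-threaded, with an illustrative data layout:

    class MVStore:
        def __init__(self):
            self.versions = {}   # object -> list of {"wts", "rts", "val"}

        def read(self, obj, ts):
            # Ti reads the most recent version written at or before TS(Ti).
            older = [v for v in self.versions.get(obj, []) if v["wts"] <= ts]
            if not older:
                return None
            v = max(older, key=lambda v: v["wts"])
            v["rts"] = max(v["rts"], ts)          # record the latest reader
            return v["val"]

        def write(self, obj, ts, val):
            older = [v for v in self.versions.get(obj, []) if v["wts"] <= ts]
            if older:
                v = max(older, key=lambda v: v["wts"])
                if ts < v["rts"]:
                    # A later transaction already read this version: abort.
                    raise RuntimeError("abort and restart transaction")
                if v["wts"] == ts:
                    v["val"] = val                # rewrite our own version
                    return
            self.versions.setdefault(obj, []).append(
                {"wts": ts, "rts": ts, "val": val})

    db = MVStore()
    db.write("Object1", 1, "Hello")
    db.write("Object1", 2, "Foo")
    print(db.read("Object1", 1))   # 'Hello' — the long-running reader's snapshot
    print(db.read("Object1", 2))   # 'Foo'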

Recovery

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)

Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);

  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units may move unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2:

- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

- The server may not be able to reach a client:
  - A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate.
  - Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:

  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations, in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts, and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance

- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: date, rname: string)
Similar to the old schema; rname is added for variations.
Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time. Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods). A toy sketch of this enumeration is given below.
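As a toy illustration of the search-space side only (not System R's actual dynamic-programming algorithm), the sketch below enumerates left-deep join orders over a small catalog and picks the cheapest under an invented page-based nested-loop cost formula. The Boats relation, its page count, and the cost formula are all assumptions for illustration.

    # Toy sketch: enumerate left-deep join orders, estimate each plan's cost.
    from itertools import permutations

    pages = {"Reserves": 1000, "Sailors": 500, "Boats": 50}   # Boats invented

    def left_deep_cost(order):
        # page-oriented nested-loop join: outer + outer * inner page reads;
        # crudely assume each join result is as large as its outer input
        cost, outer = 0, pages[order[0]]
        for inner in order[1:]:
            cost += outer + outer * pages[inner]
        return cost

    best = min(permutations(pages), key=left_deep_cost)
    print(best, left_deep_cost(best))   # cheapest left-deep order under this model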

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system. The database must handle large object representation; similarity-based retrieval must be provided by special index structures; and guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form.
- JPEG and GIF: the most widely used formats for image data.
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

There are several alternatives for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates. Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.

Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.
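A minimal sketch of similarity-based retrieval over feature vectors, assuming the images (or audio clips) have already been converted to small numeric feature vectors; the vectors and names below are invented for illustration.

    # Minimal sketch: similarity-based retrieval as nearest-neighbor search
    # over feature vectors (all vectors invented for illustration).
    import math

    catalog = {
        "trademark_a": (0.9, 0.1, 0.3),
        "trademark_b": (0.8, 0.2, 0.4),
        "trademark_c": (0.1, 0.9, 0.9),
    }

    def most_similar(query, k=2):
        dist = lambda item: math.dist(query, item[1])   # Euclidean distance
        return [name for name, _ in sorted(catalog.items(), key=dist)[:k]]

    print(most_similar((0.85, 0.15, 0.35)))   # ['trademark_a', 'trademark_b']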

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)
15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: these concepts are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any fixed relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. A recursive query, sketched below, avoids this limitation.
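To make this concrete, the following is a minimal sketch of such a recursive query in SQL:1999-style syntax, run through Python's sqlite3 module (SQLite supports WITH RECURSIVE); the Assembly instance is invented for illustration.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Assembly(part TEXT, subpart TEXT, qty INTEGER)")
    con.executemany("INSERT INTO Assembly VALUES (?, ?, ?)",
                    [("trike", "wheel", 3), ("trike", "frame", 1),
                     ("wheel", "spoke", 2), ("wheel", "tire", 1),
                     ("tire", "rim", 1)])
    # Comp computes all direct and indirect components, however many levels deep
    rows = con.execute("""
        WITH RECURSIVE Comp(part, subpart) AS (
            SELECT part, subpart FROM Assembly
            UNION
            SELECT a.part, c.subpart
            FROM Assembly a JOIN Comp c ON a.subpart = c.part
        )
        SELECT subpart FROM Comp WHERE part = 'trike'
    """).fetchall()
    print(sorted(r[0] for r in rows))   # ['frame', 'rim', 'spoke', 'tire', 'wheel']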

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS. The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is greater than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
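As a concrete illustration of the least fixpoint semantics for the negation-free Comp program discussed earlier, the following is a minimal sketch that repeatedly applies the rules until no new tuples appear; the Assembly instance is invented, with quantities omitted for brevity.

    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

    def immediate_consequence(comp):
        # Rule 1: Comp(Part, Subpart) :- Assembly(Part, Subpart).
        new = set(assembly)
        # Rule 2: Comp(Part, Subpart) :- Assembly(Part, Part2), Comp(Part2, Subpart).
        new |= {(p, s) for (p, p2) in assembly for (q, s) in comp if q == p2}
        return new

    comp = set()
    while True:
        nxt = immediate_consequence(comp)
        if nxt == comp:      # fixpoint: applying the rules adds nothing new
            break
        comp = nxt
    print(sorted(comp))      # the least model of the program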

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables.
- Stratum i: tables defined in terms of tables in stratum i and lower strata.

- If T depends on not S, then S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations, as in the rewritten Comp rule below.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
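A minimal sketch of seminaive evaluation for the same Comp program: each iteration joins Assembly only with delta_Comp, the tuples produced in the previous iteration, so no inference is repeated (same invented Assembly instance as before, quantities omitted).

    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("tire", "rim")}

    comp = set(assembly)      # base rule: Comp(P, S) :- Assembly(P, S)
    delta = set(assembly)     # delta_Comp: tuples generated in the previous iteration
    while delta:
        # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
        derived = {(p, s) for (p, p2) in assembly for (d, s) in delta if p2 == d}
        delta = derived - comp    # keep only genuinely new tuples
        comp |= delta
    print(sorted(comp))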

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 40: Database Technology

A single complex object stores all temporal changes of the object.
Time-varying attribute: an attribute that changes over time, e.g., age.
Non-time-varying attribute: an attribute that does not change over time, e.g., date of birth.
Spatial Database
Types of Spatial Data

Point Data
- Points in a multidimensional space.
- E.g., raster data such as satellite imagery, where each pixel stores a measured value.
- E.g., feature vectors extracted from text.
Region Data
- Objects have spatial extent, with location and boundary.
- The DB typically uses geometric approximations constructed using line segments, polygons, etc., called vector data.

Types of Spatial Queries
Spatial Range Queries
- "Find all cities within 50 miles of Madison."
- The query has an associated region (location, boundary).
- The answer includes overlapping or contained data regions.
Nearest-Neighbor Queries
- "Find the 10 cities nearest to Madison."
- Results must be ordered by proximity.
Spatial Join Queries
- "Find all cities near a lake."
- Expensive; the join condition involves regions and proximity.

Applications of Spatial Data
Geographic Information Systems (GIS)
- E.g., ESRI's ArcInfo; OpenGIS Consortium.
- Geospatial information.
- All classes of spatial queries and data are common.
Computer-Aided Design/Manufacturing
- Store spatial objects such as the surface of an airplane fuselage.
- Range queries and spatial join queries are common.
Multimedia Databases
- Images, video, text, etc. stored and retrieved by content.
- First converted to feature vector form; high dimensionality.
- Nearest-neighbor queries are the most common.

Single-Dimensional Indexes
- B+ trees are fundamentally single-dimensional indexes.
- When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider the entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.
Multi-dimensional Indexes
- A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
- Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider the same entries <11, 80>, <12, 10>, <12, 20>, <13, 75>; a short illustration of the linearization follows.
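A tiny illustration of the linearization point: sorting on the composite key orders entries by age first, so two entries that are close in 2-dimensional space can end up far apart in the index order.

    entries = [(11, 80), (12, 10), (12, 20), (13, 75)]
    print(sorted(entries))   # [(11, 80), (12, 10), (12, 20), (13, 75)]
    # <11, 80> and <13, 75> are near each other in 2-D (both have high sal),
    # yet the sort order places <12, 10> and <12, 20> between them, so a
    # composite-key B+ tree cannot cluster 2-D neighbors together.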

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
- "Find all hotels within a radius of 5 miles of the conference venue."
- "Find the city with population 500,000 or more that is nearest to Kalamazoo, MI."
- "Find all cities that lie on the Nile in Egypt."
- "Find all parts that touch the fuselage" (in a plane design).
Similarity queries (content-based retrieval):
- Given a face, find the five most similar faces.
Multidimensional range queries:
- 50 < age < 55 AND 80K < sal < 90K.
Drawbacks: an index based on spatial location is needed.
- One-dimensional indexes don't support multidimensional searching efficiently.
- Hash indexes only support point queries; we want to support range queries as well.
- Inserts and deletes must be supported gracefully.
Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>. This is Alternative (2), with the key value being a box; the box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dimensional box, pointer to child node>. The box covers all boxes in the child node (in fact, in the subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept 50% full (except the root): we can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree
Search for Objects Overlapping Box Q: start at the root.
1 If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2 If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, then rid identifies an object that might overlap Q.
A minimal sketch of this search is given below.
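The following is a minimal sketch of the overlap search in Python, assuming a simple node representation with 2-dimensional boxes (xlo, ylo, xhi, yhi); it is illustrative, not a full R-tree implementation.

    class Node:
        def __init__(self, leaf, entries):
            self.leaf = leaf          # True for leaf nodes
            self.entries = entries    # list of (box, child_or_rid)

    def overlaps(a, b):
        # two boxes overlap iff they intersect in both dimensions
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def search(node, q, out):
        for box, payload in node.entries:
            if overlaps(box, q):
                if node.leaf:
                    out.append(payload)       # rid of a candidate object
                else:
                    search(payload, q, out)   # payload is a child node
        return out

    leaf = Node(True, [((0, 0, 2, 2), "rid1"), ((5, 5, 6, 6), "rid2")])
    root = Node(False, [((0, 0, 6, 6), leaf)])
    print(search(root, (1, 1, 3, 3), []))     # ['rid1']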

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly. But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched, by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
- Start at the root and go down to the "best-fit" leaf L: go to the child whose box needs least enlargement to cover B; resolve ties by going to the smallest-area child.
- If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
  - Adjust the entry for L in its parent so that the box now covers (only) L1.
  - Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)
Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even in concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data
- Typically, high-dimensional datasets are collections of points, not regions: e.g., feature vectors in multimedia applications. Very sparse.
- Nearest-neighbor queries are common: the R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.
- As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is then not meaningful. In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable.

Prices of microprocessors, memory, and disks have dropped sharply. Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases. Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.
IO Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

- Round-robin: send the ith tuple inserted in the relation to disk i mod n.
- Hash partitioning: choose one or more attributes as the partitioning attributes, and choose a hash function h with range 0 ... n-1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
- Range partitioning: choose an attribute as the partitioning attribute, and choose a partitioning vector [v0, v1, ..., vn-2]. Let v be the partitioning attribute value of a tuple: tuples with vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 goes to disk 0, a tuple with value 8 goes to disk 1, while a tuple with value 20 goes to disk 2.
A sketch of all three techniques is given below.
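A minimal sketch of the three partitioning techniques, simulating n disks numbered 0..n-1; Python's built-in hash stands in for the chosen hash function h, and the values are invented.

    import bisect

    def round_robin(i, n):
        # the ith tuple inserted in the relation goes to disk i mod n
        return i % n

    def hash_partition(value, n):
        # Python's hash() stands in for the chosen hash function h
        return hash(value) % n

    def range_partition(value, vector):
        # vector = [v0, ..., vn-2]; v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1
        return bisect.bisect_right(vector, value)

    print(round_robin(7, 4))            # 3
    print(hash_partition(8, 4))         # 0 (hash(8) == 8 for small ints in CPython)
    print([range_partition(v, [5, 11]) for v in (2, 8, 20)])   # [0, 1, 2]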

Comparison of Partitioning Techniques
Evaluate how well the partitioning techniques support the following types of data access:
1 Scanning the entire relation.
2 Locating a tuple associatively (point queries), e.g., r.A = 25.
3 Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.
Round-robin
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process: no clustering, tuples are scattered across all disks.
Hash partitioning
Advantages:
- Good for sequential access.

- Assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks; retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: we can look up a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
Disadvantage: no clustering, so it is difficult to answer range queries.
Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries. Good if the result tuples come from one to a few blocks. If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted (an example of execution skew).

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks: if a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples. Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector. (Here n denotes the number of partitions to be constructed.)
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion: assume a uniform distribution within each range of the histogram. The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation. A minimal sketch of the sort-based construction appears below.
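A minimal sketch of the sort-based construction of a balanced partitioning vector, assuming the partitioning attribute values form a key (no duplicates); the values are invented.

    def partition_vector(values, n):
        ordered = sorted(values)
        step = len(ordered) // n
        # after every 1/n-th of the relation, record the next attribute value
        return [ordered[i * step] for i in range(1, n)]

    values = [7, 42, 3, 19, 25, 11, 30, 5, 16]   # invented attribute values
    print(partition_vector(values, 3))           # [11, 25] -> 3 balanced ranges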


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:

- All tuples that lie in the ith range are sent to processor Pi, which stores the tuples it receives temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.
- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
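A minimal sketch of the range-partitioning sort, simulating each processor with a list and reusing a partitioning vector like the one constructed earlier; all data are invented.

    import bisect

    def range_partition_sort(tuples, vector):
        partitions = [[] for _ in range(len(vector) + 1)]
        for t in tuples:                      # redistribute by range
            partitions[bisect.bisect_right(vector, t)].append(t)
        for p in partitions:                  # each processor sorts locally
            p.sort()
        out = []
        for p in partitions:                  # final merge is concatenation
            out.extend(p)
        return out

    print(range_partition_sort([19, 3, 42, 11, 25, 7], [11, 25]))
    # [3, 7, 11, 19, 25, 42]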

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors, and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the equi-join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri with si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique. Special case: asymmetric fragment-and-replicate.
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. This usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor, based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

- Let ri denote the tuples of relation r that are sent to processor Pi.
- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
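A minimal sketch of the partitioned parallel hash-join on r.A = s.B, simulating the n processors with lists. The local Python dict stands in for the h2-based local hash join, and all data are invented.

    n = 3
    h1 = lambda v: v % n          # stands in for hash function h1
    r = [(1, "r1"), (2, "r2"), (4, "r4"), (5, "r5")]   # (A, payload), invented
    s = [(1, "s1"), (4, "s4"), (7, "s7")]              # (B, payload), invented

    r_parts = [[t for t in r if h1(t[0]) == i] for i in range(n)]
    s_parts = [[t for t in s if h1(t[0]) == i] for i in range(n)]

    result = []
    for i in range(n):            # each processor joins its partitions locally
        build = {}                # build phase on the smaller relation s
        for b, payload in s_parts[i]:
            build.setdefault(b, []).append(payload)
        for a, payload in r_parts[i]:   # probe phase on r
            for match in build.get(a, []):
                result.append((a, payload, match))
    print(result)   # [(1, 'r1', 's1'), (4, 'r4', 's4')]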

Parallel Nested-Loop Join
Assume that relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on a join attribute of relation r at each of the partitions of relation r.
- Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
- Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
- Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - Let P2 be assigned the computation of temp2 = temp1 ⋈ r3.
  - Let P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.
Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - Let P2 be assigned the computation of temp2 = r3 ⋈ r4.
  - Let P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism; it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention. When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulate "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence. Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more human experts.
- Made up of a set of rules that analyze user-supplied information about a specific class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.
Key terms:
- Knowledge Engineering: building an expert system.
- Knowledge Engineers: the people who build the system.
- Knowledge Representation: the symbols used to represent the knowledge.
- Factual Knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

1 A list of rules, or rule base, which is a specific type of knowledge base.
2 An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
- Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
- Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
- Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase. (A minimal sketch of this cycle appears after the problem-solving models below.)
3 Temporary working memory.
4 A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures. A minimal sketch of forward chaining follows.
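The following is a minimal sketch of a forward-chaining engine, with invented rules and facts: each rule is a (premises, conclusion) pair, and the engine fires rules until working memory stops changing.

    rules = [
        ({"fever", "cough"}, "flu_suspected"),
        ({"flu_suspected", "high_risk"}, "refer_to_doctor"),
    ]
    facts = {"fever", "cough", "high_risk"}   # working memory

    changed = True
    while changed:                            # match-act until nothing new fires
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)         # act: add the rule's conclusion
                changed = True
    print(facts)   # includes 'flu_suspected' and 'refer_to_doctor'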

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human experts' intelligence.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.
Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges in automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:
KBS = knowledge-base + inference engine
KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal directed); forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity based
KBS tools: Shells
- Consist of KA tool, database & development interface.
- Inductive shells:
  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells:
  - simple to complex
  - IF-THEN rules
- Hybrid shells:
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special purpose shells:
  - specifically designed for particular types of problems


  - restricted to specialised problems
- From scratch:
  - requires more time and effort
  - no constraints, unlike shells
  - shells should be investigated first
Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term and long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: to check performance and flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network, and ways to correct the errors.
Advantages
- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.
Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability, reliability, and performance, by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
- More complex to implement and more costly to maintain.
- Security and integrity control standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory. For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects.


For a document-oriented database such as CouchDB, Riak, or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID. In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object which precedes the transaction's timestamp, TS(Ti). If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp. Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P, and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At time t1, the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later. If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"
Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is thus able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks. A minimal sketch of this version-visibility rule follows.
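A minimal sketch of the version-visibility rule behind these tables: each object keeps (write-timestamp, value) versions, and a reader at timestamp ts sees the newest version no later than ts. The representation is invented for illustration.

    versions = {
        "Object1": [(0, "Foo"), (1, "Hello")],
        "Object2": [(0, "Bar"), (1, "Bar"), (2, None)],   # None marks deletion
        "Object3": [(2, "Foo-Bar")],
    }

    def read(obj, ts):
        candidates = [(w, v) for (w, v) in versions.get(obj, []) if w <= ts]
        return max(candidates)[1] if candidates else None  # newest version <= ts

    print(read("Object1", 1))   # 'Hello'  - snapshot at t1
    print(read("Object2", 1))   # 'Bar'    - still visible at t1
    print(read("Object2", 2))   # None     - deleted as of t2
    print(read("Object3", 1))   # None     - not yet created at t1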

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.

- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics that distinguish wireless connectivity options are: interference, locality of access, range, support for packet switching, and


seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility domain, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


- A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down), or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

Two parts to optimizing a query:
• Consider a set of alternative plans.
  - Must prune the search space; typically left-deep plans only.
• Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.

Two main issues:
• For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

For each block, the plans considered are:
• All available access methods for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods), as in the sketch below.
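As an illustration, the following minimal Python sketch enumerates left-deep join orders and picks the cheapest under a toy cost model; the page counts reuse the Sailors/Reserves figures above, but the cost function and assumed selectivity are inventions of this sketch, not the System R cost model.

from itertools import permutations

# Toy statistics: pages per relation (figures from the schema above).
PAGES = {"Sailors": 500, "Reserves": 1000}

def plan_cost(order):
    """Crude cost model: nested-loop join cost (outer pages +
    outer pages * inner pages), applied left-deep along the order."""
    cost, outer_pages = 0, PAGES[order[0]]
    for inner in order[1:]:
        cost += outer_pages + outer_pages * PAGES[inner]
        outer_pages = outer_pages * PAGES[inner] // 100  # assumed selectivity
    return cost

def best_left_deep_plan(relations):
    # Every permutation corresponds to one left-deep tree.
    return min(permutations(relations), key=plan_cost)

print(best_left_deep_plan(["Sailors", "Reserves"]))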

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
• The database must handle large object representation.
• Similarity-based retrieval must be provided by special index structures.
• It must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form.
• JPEG and GIF: the most widely used formats for image data.
• MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
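As a quick back-of-the-envelope sanity check of these figures (the megabyte counts are the approximate values quoted above, not exact specifications):

# Approximate sustained bit rates implied by the figures above.
MB = 8 * 1024 * 1024          # bits per megabyte
for fmt, mb_per_min in [("MPEG-1", 12.5), ("MPEG-2", 17)]:
    bits_per_sec = mb_per_min * MB / 60
    print(f"{fmt}: ~{bits_per_sec / 1e6:.1f} Mbit/s")
# MPEG-1: ~1.7 Mbit/s (close to MPEG-1's nominal ~1.5 Mbit/s target)
# MPEG-2: ~2.4 Mbit/s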

Several alternatives exist for audio encoding:


• MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers across a network to terminals.
• Must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals – used to view multimedia data.
• PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user. E.g., identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking. E.g., test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."

Datalog is a query language that has the same if-then flavor.
• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. A recursive query, sketched below, avoids this.
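For concreteness, here is a minimal runnable sketch of such a recursive query, using SQL:1999's WITH RECURSIVE via Python's sqlite3 module; the tiny Assembly instance (trike, wheel, spoke, etc.) is invented for illustration.

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Assembly (Part TEXT, Subpart TEXT, Qty INT)")
con.executemany("INSERT INTO Assembly VALUES (?, ?, ?)", [
    ("trike", "wheel", 3), ("trike", "frame", 1),
    ("wheel", "spoke", 2), ("wheel", "tire", 1),
    ("tire", "rim", 1),    ("tire", "tube", 1),
])

# All (direct and indirect) components of trike, regardless of depth.
rows = con.execute("""
    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT Part, Subpart FROM Assembly
        UNION
        SELECT c.Part, a.Subpart
        FROM Comp c JOIN Assembly a ON c.Subpart = a.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike'
""").fetchall()
print([r[0] for r in rows])   # includes spoke, tire, rim, tube, ...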

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical; a sketch of this evaluation loop follows.

b. Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
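A minimal sketch of naive fixpoint evaluation for the (negation-free) Comp program, assuming a toy Assembly instance; this is an illustration, not the textbook's code:

# Naive fixpoint evaluation of:
#   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
#   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = set()
while True:
    new = {(p, s) for (p, s, q) in assembly}
    new |= {(p, s2) for (p, p2, q) in assembly for (p1, s2) in comp if p1 == p2}
    if new <= comp:          # no new facts: least fixpoint reached
        break
    comp |= new

print(sorted(comp))  # transitive components, e.g. ('trike', 'spoke')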

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


• Stratum 0: all database tables.
• Stratum I: tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
• Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
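A minimal sketch of the seminaive (delta-based) loop described in 15.4.1, reusing the toy assembly set from the earlier fixpoint example; illustrative only:

# Seminaive evaluation: each round joins Assembly only with delta,
# the tuples produced in the previous round, never with all of comp.
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("wheel", "tire", 1)}

comp = {(p, s) for (p, s, q) in assembly}      # base facts
delta = set(comp)                               # last round's new tuples
while delta:
    derived = {(p, s2)
               for (p, p2, q) in assembly
               for (p1, s2) in delta if p1 == p2}
    delta = derived - comp                      # keep only genuinely new facts
    comp |= delta

print(sorted(comp))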


• When we create a composite search key B+ tree, e.g., an index on <age, sal>, we effectively linearize the 2-dimensional space, since we sort entries first by age and then by sal. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Multi-dimensional Indexes
• A multidimensional index clusters entries so as to exploit "nearness" in multidimensional space.
• Keeping track of entries and maintaining a balanced index structure presents a challenge. Consider entries <11, 80>, <12, 10>, <12, 20>, <13, 75>.

Motivation for Multidimensional Indexes
Spatial queries (GIS, CAD):
• Find all hotels within a radius of 5 miles from the conference venue.
• Find the city with population 500,000 or more that is nearest to Kalamazoo, MI.
• Find all cities that lie on the Nile in Egypt.
• Find all parts that touch the fuselage (in a plane design).

Similarity queries (content-based retrieval):
• Given a face, find the five most similar faces.

Multidimensional range queries:
• 50 < age < 55 AND 80K < sal < 90K

Drawbacks
An index based on spatial location is needed.
• One-dimensional indexes don't support multidimensional searching efficiently.
• Hash indexes only support point queries; we want to support range queries as well.
• Must support inserts and deletes gracefully.

Ideally, we want to support non-point data as well (e.g., lines, shapes). The R-tree meets these requirements, and variants are widely used today.


R-Tree

R-Tree Properties
Leaf entry = <n-dimensional box, rid>
• This is Alternative (2), with the key value being a box.
• The box is the tightest bounding box for a data object.
Non-leaf entry = <n-dim box, ptr to child node>
• The box covers all boxes in the child node (in fact, subtree).
All leaves are at the same distance from the root.
Nodes can be kept 50% full (except the root).
• Can choose a parameter m that is <= 50%, and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root:
1. If the current node is a non-leaf, for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.
2. If the current node is a leaf, for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q. A small sketch of this overlap search follows.
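A compact sketch of that search, assuming boxes are (xlo, ylo, xhi, yhi) tuples; the Node class and the sample tree are invented for illustration:

class Node:
    def __init__(self, entries, is_leaf):
        self.entries, self.is_leaf = entries, is_leaf

def overlaps(a, b):
    """Boxes as (xlo, ylo, xhi, yhi); true iff a and b intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def rtree_search(node, q, hits=None):
    hits = [] if hits is None else hits
    for box, item in node.entries:
        if overlaps(box, q):
            if node.is_leaf:
                hits.append(item)            # item is a rid
            else:
                rtree_search(item, q, hits)  # item is a child node
    return hits

leaf = Node([((0, 0, 2, 2), "rid1"), ((5, 5, 6, 6), "rid2")], is_leaf=True)
root = Node([((0, 0, 6, 6), leaf)], is_leaf=False)
print(rtree_search(root, (1, 1, 3, 3)))   # ['rid1']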

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
• This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
• The cost of the overlap test is higher than bounding box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L:
• Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.
If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
• Adjust the entry for L in its parent so that the box now covers (only) L1.
• Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
• Remove some (say, 30% of the) entries and reinsert them into the tree.
• This could result in all reinserted entries fitting on some existing pages, avoiding a split.
R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.
Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
• Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
• Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
• B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
• GiST provides an alternative for implementing other tree indexes in an ORDBS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
• E.g., feature vectors in multimedia applications.
• Very sparse.
Nearest neighbor queries are common.
• The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.
As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases; "nearest neighbor" is not meaningful.
• In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

• Prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:
• Large volumes of transaction data are collected and stored for later analysis.
• Multimedia objects like images are increasingly stored in databases.
Large-scale parallel database systems are increasingly used for:
• storing large volumes of data
• processing time-consuming decision-support queries
• providing high throughput for transaction processing

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:
• Data can be partitioned, and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts.
Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):
• Round-robin: send the ith tuple inserted in the relation to disk i mod n.
• Hash partitioning: choose one or more attributes as the partitioning attributes;


choose a hash function h with range 0...n-1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
• Range partitioning: choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A sketch of all three techniques appears below.)

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g., 10 <= r.A < 25.
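A minimal sketch of the three placement rules, assuming integer partitioning-attribute values; the function names are illustrative:

def round_robin_disk(i, n):
    # i = insertion sequence number of the tuple
    return i % n

def hash_disk(value, n):
    return hash(value) % n          # any hash function with range 0..n-1

def range_disk(value, vector):
    # vector = [v0, v1, ..., v(n-2)], one fewer entry than there are disks
    for i, bound in enumerate(vector):
        if value < bound:
            return i
    return len(vector)              # value >= last bound: last disk

print(range_disk(2, [5, 11]), range_disk(8, [5, 11]), range_disk(20, [5, 11]))
# -> 0 1 2, matching the example above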

Round-robin
Advantages:
• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Range queries are difficult to process:
• No clustering – tuples are scattered across all disks.

Hash partitioning
Good for sequential access:
• Assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks.
• Retrieval work is then well balanced between disks.
Good for point queries on the partitioning attribute:
• Can look up a single disk, leaving the others available for answering other queries.
• An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
No clustering, so it is difficult to answer range queries.

Range partitioning
Provides data clustering by partitioning attribute value.
Good for sequential access.
Good for point queries on the partitioning attribute: only one disk needs to be accessed.
For range queries on the partitioning attribute, one to a few disks may need to be accessed:
• The remaining disks are available for other queries.
• Good if the result tuples are from one to a few blocks.
• If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted – an example of execution skew.

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):
• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
• n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation. A small sketch follows.
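A minimal sketch of building a balanced range-partitioning vector from a sorted sample; the sampling step and the equal-depth split are assumptions of this illustration:

def balanced_partition_vector(sample, n):
    """Pick n-1 cut points that split the sorted sample into n
    roughly equal-depth ranges (an equi-depth histogram)."""
    s = sorted(sample)
    step = len(s) // n
    return [s[step * i] for i in range(1, n)]

values = [1, 3, 4, 5, 7, 8, 9, 11, 14, 20, 21, 25]
print(balanced_partition_vector(values, 3))   # -> [7, 14]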


Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
More complicated to implement on shared-disk or shared-nothing architectures:
• Locking and logging must be coordinated by passing messages between processors.
• Data in a local buffer may have been updated at another processor.
• Cache coherency has to be maintained – reads and writes of data in the buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:
• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist.
Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor; requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
• Intraoperation parallelism – parallelize the execution of each individual operation in the query.
• Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
Create a range-partition vector with m entries on the sorting attributes.
Redistribute the relation using range partitioning:
• All tuples that lie in the ith range are sent to processor Pi.
• Pi stores the tuples it received temporarily on disk Di.
• This step requires I/O and communication overhead.
Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1.
Each processor Pi locally sorts the data on disk Di.
The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:
• The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
• Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
• The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output.
Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute the join of r and s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
• Can use either range partitioning or hash partitioning.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes the join of ri and si; any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not applicable for some join conditions:
• E.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case – asymmetric fragment-and-replicate:
• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.
Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
• E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors, rather than repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash-join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

• Let ri denote the tuples of relation r that are sent to processor Pi.
• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case:
• E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in. (A condensed sketch of the two-level hashing follows.)
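A minimal single-process sketch of the two-level hashing idea (h1 routes tuples to processors, h2 builds the local hash table); the data and function names are illustrative:

N = 4                                    # number of (simulated) processors

def h1(key):                             # routes a tuple to a processor
    return hash(key) % N

def h2(key):                             # local hash for the build table
    return hash((key, "local"))

def partitioned_hash_join(r, s):
    """r, s: lists of (join_key, payload). Returns joined triples."""
    # Phase 1: route build relation s to processors with h1.
    s_at = [[] for _ in range(N)]
    for key, val in s:
        s_at[h1(key)].append((key, val))
    # Phase 2: each processor builds a local table keyed by h2.
    build = [dict() for _ in range(N)]
    for i in range(N):
        for key, val in s_at[i]:
            build[i].setdefault(h2(key), []).append((key, val))
    # Phase 3: probe with r, routed by the same h1.
    out = []
    for key, val in r:
        for skey, sval in build[h1(key)].get(h2(key), []):
            if skey == key:
                out.append((key, val, sval))
    return out

print(partitioned_hash_join([(1, "r1"), (2, "r2")], [(1, "s1"), (3, "s3")]))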

Parallel Nested-Loop Join
Assume that:
• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
• P2 the computation of temp2 = temp1 ⋈ r3,
• and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
• Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
• P2 be assigned the computation of temp2 = r3 ⋈ r4,
• and P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
• Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
• Does not provide a high degree of parallelism: useful with a lower degree of parallelism, less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.
When scheduling an execution tree in a parallel system, we must decide:
• How to parallelize each operation, and how many processors to use for it.
• What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
• E.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems, and artificial intelligence.
Rule-based systems are:
• Knowledge-based systems
• Part of the Artificial Intelligence field
• Computer programs that contain some subject-specific knowledge of one or more human experts
• Made up of a set of rules that analyze user-supplied information about a specific class of problems
• Systems that utilize reasoning capabilities and draw conclusions

• Knowledge Engineering – building an expert system
• Knowledge Engineers – the people who build the system
• Knowledge Representation – the symbols used to represent the knowledge
• Factual Knowledge – knowledge of a particular task domain that is widely shared


• Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules or rule base, which is a specific type of knowledge base.
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle (a toy version is sketched below):
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface or other connection to the outside world, through which input and output signals are received and sent.
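A toy forward-chaining interpreter illustrating the match-resolve-act cycle; the rules and the first-match conflict-resolution strategy are invented for illustration:

# Working memory is a set of facts; each rule maps a premise set to a new fact.
rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),
    ({"suspect_measles"}, "recommend_isolation"),
]

def forward_chain(memory):
    while True:
        # Match: rules whose premises hold and whose conclusion is new.
        conflict_set = [(p, c) for p, c in rules
                        if p <= memory and c not in memory]
        if not conflict_set:          # nothing satisfied: halt
            return memory
        # Resolve: trivially pick the first instantiation.
        premises, conclusion = conflict_set[0]
        # Act: fire the rule, updating working memory.
        memory.add(conclusion)

print(forward_chain({"has_fever", "has_rash"}))
# -> includes 'suspect_measles' and 'recommend_isolation'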

Components of a Rule-Based System
• Set of rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
• Knowledge engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
• Forward-chaining – starts from a set of conditions and moves towards some conclusion.
• Backward-chaining – starts with a list of goals, and then works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the amount of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages
• Lack the human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems – Definition:
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:
KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal-directed)
                         forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity-based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells:

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells:
  - simple to complex
  - IF-THEN rules
- Hybrid shells:
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special-purpose shells:
  - specifically designed for particular types of problems
  - restricted to specialised problems
- Scratch:
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).

Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions. E.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information. E.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations. E.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem. E.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning. E.g., HRM.
(6) Monitoring – to check performance & flag exceptions. E.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence, and form opinions on that evidence. E.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance. E.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions. E.g., identify errors in an automated teller machine network, and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise not otherwise accessible; training future experts
- Efficient and cost-effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
• Mimic the organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, less expensive, and performs better.
• Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared to conventional (centralized) databases:
• More complex to implement.
• More costly to maintain.
• Security and integrity control are harder.
• Standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
• If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.
• Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of a DB could be:

Time  Object1   Object2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds multiple versions, but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1   Object2    Object3
t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated ACID reads without any locks.
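A minimal sketch of the timestamp rules described above (version lists per object; reads pick the newest version not younger than the reader); purely illustrative, not any particular DBMS's implementation:

class MVCCStore:
    def __init__(self):
        self.versions = {}            # object -> list of (write_ts, value)
        self.read_ts = {}             # object -> largest reader timestamp

    def read(self, obj, ts):
        # Most recent version whose write timestamp precedes TS(Ti).
        older = [(w, v) for w, v in self.versions.get(obj, []) if w <= ts]
        if not older:
            return None
        self.read_ts[obj] = max(self.read_ts.get(obj, 0), ts)
        return max(older)[1]

    def write(self, obj, value, ts):
        # Abort if a later transaction has already read this object.
        if ts < self.read_ts.get(obj, 0):
            raise RuntimeError("abort and restart transaction")
        self.versions.setdefault(obj, []).append((ts, value))

db = MVCCStore()
db.write("Object1", "Foo", 0)
db.write("Object1", "Hello", 1)
print(db.read("Object1", 1))   # 'Hello'
print(db.read("Object1", 0))   # 'Foo': the old snapshot is still readable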

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems. In mobile computing, the problems are more difficult, mainly because of:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).


  - The changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Fig 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
• Fixed hosts are general-purpose computers configured to manage mobile units.
• Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
• The wireless medium has bandwidth significantly lower than that of a wired network:
  - The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
• Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage:
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units may be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of

distributed computing. Mobile databases can be distributed under two possible scenarios:

1. The entire database is distributed mainly among the wired components, possibly

with full or partial replication. A base station or fixed host manages its own database with DBMS-like

functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.

2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed

hosts and mobile units.

Data management issues, as they apply to mobile databases, include:

- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect (through a process known in industry as synchronization of a

client with a server) they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:

- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between

the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).

- A server cannot connect to a client at will.

- Issues of wireless versus wired client connections and power conservation are generally immaterial.

- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes, available resources, etc. A toy sketch of this synchronization step follows.
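As a rough illustration only, here is a minimal sketch of intermittent synchronization in Python. The numbered update log, the class and method names, and the sequence-token scheme are all illustrative assumptions, not an industry protocol:

# Toy model of intermittent synchronization: the server keeps a numbered
# update log; a client pulls the batch of updates it has not yet seen.
class SyncServer:
    def __init__(self):
        self.log = []                       # list of (seq, key, value) updates

    def record(self, key, value):
        self.log.append((len(self.log) + 1, key, value))

    def updates_since(self, seq):
        return [u for u in self.log if u[0] > seq]

class SyncClient:
    def __init__(self):
        self.data, self.last_seq = {}, 0    # local database + sync token

    def synchronize(self, server):
        # The client connects when it chooses; the server cannot connect at will.
        for seq, key, value in server.updates_since(self.last_seq):
            self.data[key] = value          # install the batched updates locally
            self.last_seq = seq

server = SyncServer()
server.record("part", "wheel")
client = SyncClient()
client.synchronize(server)                  # client pulls the batch of updates
print(client.data)                          # {'part': 'wheel'}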

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given

database design (relations, indexes) on a workload (set of queries). There are two parts to optimizing a query:

1. Consider a set of alternative plans.
   - Must prune the search space; typically only left-deep plans are considered.
2. Estimate the cost of each plan that is considered.
   - Must estimate the size of the result and the cost for each plan node.
   - Key issues: statistics, indexes, operator implementations.

A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.

Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them. (A sketch of this iterator interface follows.)
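A minimal sketch of the pull (iterator) idea in Python; the operator classes are illustrative, and real systems also provide open()/close() calls:

# Each operator returns its next output tuple by pulling on its inputs.
class Scan:
    def __init__(self, tuples):
        self.it = iter(tuples)
    def next(self):
        return next(self.it, None)          # None signals end of stream

class Select:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def next(self):
        t = self.child.next()               # pull on the input...
        while t is not None and not self.predicate(t):
            t = self.child.next()
        return t                            # ...and compute the next output

plan = Select(Scan([(1, 'x'), (2, 'y')]), lambda t: t[0] > 1)
t = plan.next()
while t is not None:
    print(t)                                # (2, 'y')
    t = plan.next()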

Two main issues:
1. For a given query, what plans are considered?
   - An algorithm to search the plan space for the cheapest (estimated) plan.
2. How is the cost of a plan estimated?

Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.

Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.

Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:

- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods). A toy sketch of this plan search follows.
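A toy sketch of the search over left-deep join orders, using the page counts above plus an invented Boats relation and a single assumed join selectivity. Real optimizers estimate per-predicate selectivities and use dynamic programming rather than full enumeration:

# Enumerate left-deep join orders and keep the cheapest, System R style.
# The cost model (sum of estimated intermediate-result sizes, in pages)
# is a deliberate oversimplification of real selectivity-based estimates.
from itertools import permutations

sizes = {'Sailors': 500, 'Reserves': 1000, 'Boats': 50}   # Boats is assumed
selectivity = 0.001                                       # assumed join selectivity

def cost(order):
    pages, total = sizes[order[0]], 0
    for rel in order[1:]:                   # inner relation at each level
        pages = pages * sizes[rel] * selectivity
        total += pages                      # accumulate intermediate sizes
    return total

best = min(permutations(sizes), key=cost)   # each permutation = a left-deep tree
print(best, cost(best))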

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.

The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form.

- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.

- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

There are several alternatives for audio encoding:


- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:

- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained:

video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers across a network to

terminals.
- Must guarantee end-to-end delivery rates.

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.

Head-end terminals, used to view multimedia data, are PCs or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).

- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).

- Handwritten data: Identify a handwritten data item or command stored in the database. A brute-force sketch of similarity retrieval follows.
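Under the hood, such retrieval typically reduces to nearest-neighbor search over feature vectors; a brute-force sketch follows. The feature vectors are invented for illustration, and real systems use index structures rather than a linear scan:

# Brute-force nearest-neighbor over feature vectors, the core of
# similarity-based retrieval.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Assumed: images have already been reduced to small feature vectors.
database = {'logo1': [0.9, 0.1, 0.3], 'logo2': [0.2, 0.8, 0.5]}
query = [0.85, 0.15, 0.35]

best = min(database, key=lambda k: euclidean(database[k], query))
print(best)                                 # logo1: the most similar design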

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-

92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy

the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.

- New: The answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.

- That takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. The fixpoint sketch below shows how recursion sidesteps this.
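As a concrete illustration of what recursion buys, the following sketch computes the full Comp relation by naive fixpoint iteration. The Assembly instance is invented for illustration; one loop iteration plays the role of one self-join, and the loop runs until no new tuples appear, however deep the hierarchy:

# Naive fixpoint computation of Comp (all part--subpart pairs) from Assembly.
assembly = {('trike', 'wheel', 3), ('trike', 'frame', 1),
            ('wheel', 'spoke', 2), ('wheel', 'tire', 1),
            ('tire', 'rim', 1)}

comp = {(p, s) for (p, s, q) in assembly}
while True:
    new = {(p, s2) for (p, s1) in comp
                   for (p2, s2, q) in assembly if s1 == p2}
    if new <= comp:
        break                               # least fixpoint reached
    comp |= new
print(('trike', 'rim') in comp)             # True, three levels down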

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model

semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:

Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
(A quick check of such fixpoints appears below.)
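The double+ example can be checked directly; a small sketch (the infinite fixpoints cannot be represented, so the check uses small finite ones):

def double_plus(s):
    # double every element, then union in the original set
    return {2 * x for x in s} | s

print(double_plus({1, 2, 5}))               # {1, 2, 4, 5, 10}
print(double_plus(set()) == set())          # the empty set is a fixpoint
print(double_plus({0}) == {0})              # {0} is another, larger fixpoint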

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `If the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation

Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: If T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: All database tables.
- Stratum I: Tables defined in terms of tables in Stratum I and lower strata.
- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).

- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: When recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences

SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.

Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.

A sketch of the seminaive delta evaluation described in 15.4.1 follows.
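To make the seminaive idea concrete, a minimal sketch in Python. The Assembly instance is invented; a real DBMS would implement this with delta relations and joins rather than Python sets:

# Seminaive fixpoint evaluation of Comp: each iteration joins Assembly only
# with delta (the tuples produced in the previous iteration), avoiding
# repeated inferences.
assembly = {('trike', 'wheel', 3), ('wheel', 'spoke', 2), ('wheel', 'tire', 1)}

comp = {(p, s) for (p, s, q) in assembly}   # base facts
delta = set(comp)                           # delta_Comp for the first iteration
while delta:
    derived = {(p, s2) for (p, s1, q) in assembly
                       for (d1, s2) in delta if s1 == d1}
    delta = derived - comp                  # keep only genuinely new tuples
    comp |= delta
print(sorted(comp))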


Page 42: Database Technology

R-Tree

R-Tree Properties
- Leaf entry = <n-dimensional box, rid>
  - This is Alternative (2), with the key value being a box.
  - The box is the tightest bounding box for a data object.
- Non-leaf entry = <n-dim box, ptr to child node>
  - The box covers all boxes in the child node (in fact, subtree).
- All leaves are at the same distance from the root.
- Nodes can be kept at least 50% full (except the root).
  - Can choose a parameter m that is <= 50% and ensure that every node is at least m% full.

Example of R-Tree

Search for Objects Overlapping Box Q
Start at the root.
1. If the current node is a non-leaf: for each entry <E, ptr>, if box E overlaps Q, search the subtree identified by ptr.


2. If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q. (A sketch of this search follows.)
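A minimal sketch of this search in Python, assuming nodes are dicts with a leaf flag and a list of (box, child-or-rid) entries; the node layout is an illustrative assumption, not a prescribed format:

# Recursive R-tree search for objects overlapping query box q.
# Boxes are (xlo, ylo, xhi, yhi) in two dimensions.
def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, q, result):
    if node['leaf']:
        for box, rid in node['entries']:    # step 2: candidate objects
            if overlaps(box, q):
                result.append(rid)
    else:
        for box, child in node['entries']:  # step 1: descend overlapping subtrees
            if overlaps(box, q):
                search(child, q, result)
    return result

leaf = {'leaf': True, 'entries': [((0, 0, 2, 2), 'r1'), ((5, 5, 6, 6), 'r2')]}
root = {'leaf': False, 'entries': [((0, 0, 6, 6), leaf)]}
print(search(root, (1, 1, 3, 3), []))       # ['r1']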

Improving Search Using Constraints
It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.

Insert Entry <B, ptr>
Start at the root and go down to the "best-fit" leaf L.

- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the child with the smallest area.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2.
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion
The entries in node L plus the newly inserted entry must be distributed between L1 and

L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: Redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants
The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a

node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary.
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST
The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes,

including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search and even concurrency control algorithms make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest neighbor queries are common.
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful.
- In any given data set, it is advisable to empirically test contrast.

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)

Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:

prices of microprocessors, memory and disks have dropped sharply. Databases are growing increasingly large:

large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for storing large volumes of data, processing time-consuming decision-support queries, and providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel:

data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra),

which makes parallelization easier. Different queries can be run in parallel with each other, and concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning: tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

- Round-robin: Send the i-th tuple inserted in the relation to disk i mod n.
- Hash partitioning: Choose one or more attributes as the partitioning attributes;


choose a hash function h with range 0 ... n - 1. Let i denote the result of the hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.
- Range partitioning: Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i + 1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n - 1. E.g., with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A sketch of these three partitioning functions appears after the comparison below.)

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries), e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries), e.g., 10 <= r.A < 25.

Round-robin
Advantages:

- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples, so retrieval work is well balanced

between disks.
Range queries are difficult to process:

no clustering; tuples are scattered across all disks.

Hash partitioning
- Good for sequential access:

assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks,

and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute:

can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more

efficient.
- No clustering, so it is difficult to answer range queries.

Range partitioning
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:

  - The remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted (an example of execution skew).
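As promised above, a minimal sketch of the three partitioning functions. Python's built-in hash stands in for the hash function h, and the range vector [5, 11] is the one from the example:

# The three horizontal partitioning schemes on n disks.
n = 3

def round_robin(i):                         # i = insertion order of the tuple
    return i % n

def hash_partition(value):
    return hash(value) % n                  # built-in hash as a stand-in for h

vector = [5, 11]                            # range-partitioning vector
def range_partition(value):
    for disk, boundary in enumerate(vector):
        if value < boundary:
            return disk
    return len(vector)                      # v >= v_{n-2}: last partition

print(range_partition(2), range_partition(8), range_partition(20))   # 0 1 2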

Partitioning a Relation across Disks
If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks. If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows:

after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.

- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning

attributes. An alternative technique based on histograms is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.


Interquery Parallelism
Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction-processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained: reads and writes of data in a buffer must find the

latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting.
- Create a range-partition vector with m entries on the sorting attributes.
- Redistribute the relation using range partitioning:

all tuples that lie in the i-th range are sent to processor Pi, which stores the tuples it received temporarily on disk Di. This step requires I/O and communication overhead.

- Each processor Pi sorts its partition of the relation locally.


- Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with the others (data parallelism).
- The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
- Assume the relation has already been partitioned among disks D0, ..., Dn-1.
- Each processor Pi locally sorts the data on disk Di.
- The sorted runs on each processor are then merged to get the final sorted output.
- Parallelize the merging of sorted runs as follows:

  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

- Let r and s be the input relations, and suppose we want to compute the join of r and s. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used. r and s must be partitioned on their join attributes (r.A and s.B) using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:

e.g., non-equijoin conditions such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case: asymmetric fragment-and-replicate:
- One of the relations, say r, is partitioned; any partitioning technique can be used.
- The other relation, s, is replicated across all the processors.
- Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

e.g., say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors.
- Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


- As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
- Hash-join optimizations can be applied to the parallel case:

e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in. (A simulated sketch of the partitioned parallel hash-join follows.)
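A single-process simulation of the scheme just described, with invented toy relations joined on their first attribute. Real systems run the per-partition loop on separate processors, and the local dictionary build stands in for the h2-based local hash-join:

# Partitioned parallel hash-join, simulated: h1 distributes both relations on
# the join attribute across "processors"; each processor then builds a local
# hash table on its s partition and probes it with its r partition.
m = 2
h1 = lambda v: hash(v) % m                  # distribution hash function

r = [(1, 'a'), (2, 'b'), (3, 'c')]
s = [(1, 'x'), (3, 'y')]

r_part = [[t for t in r if h1(t[0]) == i] for i in range(m)]
s_part = [[t for t in s if h1(t[0]) == i] for i in range(m)]

result = []
for i in range(m):                          # each iteration = one processor
    build = {}                              # local hash table on s_i
    for key, val in s_part[i]:
        build.setdefault(key, []).append(val)
    for key, val in r_part[i]:              # probe phase
        for sval in build.get(key, []):
            result.append((key, val, sval))
print(result)                               # [(1, 'a', 'x'), (3, 'c', 'y')]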

Parallel Nested-Loop Join
Assume that:

- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the

partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated, and use the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations, r1 join r2 join r3 join r4. Set up a pipeline that computes the three joins in parallel:

  - Let P1 be assigned the computation of temp1 = r1 join r2,
  - P2 the computation of temp2 = temp1 join r3,

  - and P3 the computation of temp2 join r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the

next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent parallelism:

- Consider a join of four relations, r1 join r2 join r3 join r4:

  - Let P1 be assigned the computation of temp1 = r1 join r2,

  - and P2 the computation of temp2 = r3 join r4. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2 to compute temp1 join temp2.

- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.

- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and

what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:

e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both. When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book. Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence", i.e., knowledge, and thereby makes intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.

Rule-based systems are:
- Knowledge-based systems.
- Part of the Artificial Intelligence field.
- Computer programs that contain some subject-specific knowledge of one or more

human experts.
- Made up of a set of rules that analyze user-supplied information about a specific

class of problems.
- Systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering: building an expert system.
- Knowledge Engineers: the people who build the system.
- Knowledge Representation: the symbols used to represent the knowledge.
- Factual Knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.

- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services,

production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based

on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against

the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.

  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.

  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

- Temporary working memory.
- A user interface or other connection to the outside world through which input and output

signals are received and sent. (A minimal sketch of the match-resolve-act cycle follows.)
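As referenced above, a minimal match-resolve-act loop in Python. The rules, the working-memory representation, and the first-match conflict-resolution policy are all illustrative assumptions:

# Minimal production-system interpreter over a working memory of facts.
rules = [
    (lambda wm: 'fever' in wm and 'rash' in wm, 'suspect_measles'),
    (lambda wm: 'fever' in wm, 'suspect_infection'),
]

wm = {'fever', 'rash'}                      # working memory
while True:
    conflict_set = [concl for cond, concl in rules
                    if cond(wm) and concl not in wm]   # match phase
    if not conflict_set:
        break                               # no production satisfied: halt
    wm.add(conflict_set[0])                 # resolve (take first) + act
print(wm)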

Components of a Rule-Based System
- Set of Rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.

- Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.

- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and works backwards to see if there

is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual

circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition

- A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.

- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge base + inference engine.

KBS Architecture


- The inference engine and knowledge base are separated because:
  - the reasoning mechanism needs to be as stable as possible;
  - the knowledge base must be able to grow and change as knowledge is added;
  - this arrangement enables the system to be built from, or converted to, a shell.
- It is reasonable to produce a richer, more elaborate description of the typical expert

system. A more elaborate description, which still includes the components that are to be found in

almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal-directed)

                         forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity-based

KBS tools - Shells
- Consist of KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - matrix converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.

- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.

- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.

- Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- From scratch:

  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).

Typical tasks of KBS:
(1) Diagnosis: To identify a problem given a set of symptoms or malfunctions,

e.g., diagnose reasons for engine failure.
(2) Interpretation: To provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: To predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: To develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning: Both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring: To check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: To collect and evaluate evidence and form opinions on that evidence,

e.g., control a patient's treatment.
(8) Instruction: To train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: To identify and prescribe remedies for malfunctions,

e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase availability of expert knowledge:
  - expertise that is otherwise not accessible;
  - training future experts.

- Efficient and cost-effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.

- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

- Mimics the organisational structure with data.
- Local access and autonomy without exclusion.

- Cheaper to create and easier to expand.

- Improved availability, reliability and performance by removing reliance on a central site.

- Reduced communication overhead:

most data access is local, which is less expensive and performs better.

- Improved processing power:

many machines handle the database, rather than a single server. On the other hand, distributed databases are more complex to implement, more costly to maintain, security and integrity control are harder, standards and experience are lacking, and design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory. For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID. In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti is allowed to read the most recent version of an object which precedes the transaction timestamp TS(Ti). If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp. Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti). The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later. If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks. (A sketch of such timestamped version lookups follows.)
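A sketch of the read side of this scheme, using the example's values. The version-list layout is an assumption; here a reader with timestamp ts sees the newest version written at or before ts:

# Each object keeps a list of (write_ts, value) versions; None marks deletion.
versions = {'Object1': [(0, 'Foo'), (1, 'Hello')],
            'Object2': [(0, 'Bar'), (2, None)],
            'Object3': [(2, 'Foo-Bar')]}

def read(obj, ts):
    candidates = [(w, v) for w, v in versions.get(obj, []) if w <= ts]
    return max(candidates)[1] if candidates else None

print(read('Object2', 1))                   # 'Bar'  (the long read's snapshot)
print(read('Object2', 2))                   # None   (deleted for later readers)
print(read('Object3', 1))                   # None   (not yet visible at t1)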

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new

dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction

management and database recovery) have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:

  - the limited and intermittent connectivity afforded by wireless communications,
  - the limited life of the power supply (battery),


  - and the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and

challenges.

Mobile Computing Architecture
The general architecture of a mobile platform is illustrated in Figure 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

- Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired

network. The current generation of wireless technology has data rates ranging from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).

- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

- Other characteristics distinguish the wireless connectivity options: interference, locality of access, range, support for packet switching,


seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the

frequency spectrum, which may cause interference with other appliances, such as cordless telephones.

- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization

Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

There are two parts to optimizing a query:
- Consider a set of alternative plans; the search space must be pruned (typically to left-deep plans only).
- Estimate the cost of each plan that is considered: estimate the size of the result and the cost for each plan node.

Key issues: statistics, indexes, operator implementations. A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.


Each operator is typically implemented using a `pull' interface: when an operator is pulled for the next output tuples, it pulls on its inputs and computes them.

Two main issues:
- For a given query, what plans are considered? An algorithm searches the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally, we want to find the best plan; practically, we must avoid the worst plans. We will study the System R approach.

Schema for examples:
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
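To make the idea of plan costing concrete, here is a small illustrative sketch (not part of the original answer): it compares the estimated I/O of the two join orders for the example schema, using the page counts given above and the standard cost formula M + M*N for a page-oriented nested-loops join.

# Hypothetical illustration: estimated I/O cost of two left-deep plans for
# Sailors JOIN Reserves, using a page-oriented simple nested-loops join.
# Cost model: cost(outer, inner) = M + M * N, with M = pages(outer), N = pages(inner).

PAGES = {"Sailors": 500, "Reserves": 1000}   # from the example schema above

def nested_loops_cost(outer: str, inner: str) -> int:
    """Scan the outer once; scan the inner once per outer page."""
    m, n = PAGES[outer], PAGES[inner]
    return m + m * n

for outer, inner in [("Sailors", "Reserves"), ("Reserves", "Sailors")]:
    print(f"{outer} outer, {inner} inner: {nested_loops_cost(outer, inner):,} I/Os")

# Sailors outer:  500 + 500*1000  = 500,500 I/Os
# Reserves outer: 1000 + 1000*500 = 501,000 I/Os
# The optimizer keeps the cheaper ordering; with indexes or other join
# algorithms the estimates (and the winner) change, which is why plans are
# enumerated and costed rather than chosen by a fixed rule.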

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia databases

To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed, steady retrieval rates for continuous-media data.

Multimedia Data Formats

Store and transmit multimedia data in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

There are several alternatives for audio encoding:


MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers

Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates. Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements. Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data. Head-end terminals (used to view multimedia data) are PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval

Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user; e.g., identify similar designs for registering a new trademark.
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking; e.g., test user input against stored commands.
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases

SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog

SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the FROM clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. A small fixpoint sketch follows.
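As an illustration (not from the original answer), the following sketch computes Comp(Part, Subpart) as a least fixpoint, by repeatedly applying the recursive rule until no new tuples appear. The Assembly instance is invented for the example.

# Naive fixpoint evaluation of:
#   Comp(P, S) :- Assembly(P, S, Q).
#   Comp(P, S) :- Assembly(P, P2, Q), Comp(P2, S).
# The Assembly tuples below are hypothetical.

assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1), ("tire", "tube", 1)}

comp = {(p, s) for (p, s, _q) in assembly}          # base rule
while True:
    new = {(p, s) for (p, p2, _q) in assembly
                  for (p2b, s) in comp if p2 == p2b}
    if new <= comp:                                  # no new inferences: fixpoint
        break
    comp |= new

print(sorted(comp))   # includes ("trike", "rim") after two iterations

Note that each round recomputes every inference from scratch; the seminaive technique discussed later (section 15.4.1) avoids exactly this waste.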

15.2 Theoretical Foundations

The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:

ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.

According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv Least Model = Least Fixpoint

Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b Recursive queries with negation

Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).

- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we get Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation

If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


- Stratum 0: all database tables.
- Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, then S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences

Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences

SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.

Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P2, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
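For comparison with the naive fixpoint sketch shown earlier, here is an illustrative seminaive version (again with the same invented Assembly instance): each iteration joins Assembly only with the delta tuples derived in the previous iteration, so no inference is ever repeated.

# Seminaive evaluation of Comp using a delta table, as described in 15.4.1.
# Rule applied each round: Comp(P, S) :- Assembly(P, P2, Q), delta_Comp(P2, S).

assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1), ("tire", "tube", 1)}

comp = {(p, s) for (p, s, _q) in assembly}       # base facts
delta = set(comp)                                # tuples new in the last iteration

while delta:
    derived = {(p, s) for (p, p2, _q) in assembly
                      for (d, s) in delta if d == p2}
    delta = derived - comp                       # keep only genuinely new tuples
    comp |= delta                                # old facts are never re-derived

print(sorted(comp))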


Page 43: Database Technology

2 If the current node is a leaf: for each entry <E, rid>, if E overlaps Q, rid identifies an object that might overlap Q.

Improving Search Using Constraints

It is convenient to store boxes in the R-tree as approximations of arbitrary regions, because boxes can be represented compactly.

But why not use convex polygons to approximate query regions more accurately?
- This will reduce overlap with nodes in the tree, and reduce the number of nodes fetched by avoiding some branches altogether.
- The cost of the overlap test is higher than a bounding-box intersection, but it is a main-memory cost and can actually be done quite efficiently. Generally a win.
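A minimal sketch of the search just described (the node layout and names are assumptions for illustration, not any DBMS's actual structures):

# Minimal R-tree search sketch. A node is (is_leaf, entries); each entry is
# (box, child_or_rid); a box is (xmin, ymin, xmax, ymax). Hypothetical layout.

def overlaps(a, b):
    """Two axis-aligned boxes overlap unless one lies strictly to the
    left/right of, or above/below, the other."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def search(node, query_box, results):
    is_leaf, entries = node
    for box, payload in entries:
        if overlaps(box, query_box):
            if is_leaf:
                results.append(payload)              # rid: object MIGHT overlap Q
            else:
                search(payload, query_box, results)  # descend into child node
    return results

# Leaf entries carry rids of candidate objects; the candidates must still be
# checked exactly, since the stored box only approximates the object's region.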

Insert Entry <B, ptr>

Start at the root and go down to the "best-fit" leaf L:
- Go to the child whose box needs the least enlargement to cover B; resolve ties by going to the smallest-area child.

If the best-fit leaf L has space, insert the entry and stop. Otherwise, split L into L1 and L2:
- Adjust the entry for L in its parent so that the box now covers (only) L1.
- Add an entry (in the parent node of L) for L2. (This could cause the parent node to recursively split.)

Splitting a Node during Insertion

The entries in node L, plus the newly inserted entry, must be distributed between L1 and L2. The goal is to reduce the likelihood of both L1 and L2 being searched on subsequent queries. Idea: redistribute so as to minimize the area of L1 plus the area of L2.

R-Tree Variants

The R* tree uses the concept of forced reinserts to reduce overlap in tree nodes. When a node overflows, instead of splitting:
- Remove some (say, 30% of the) entries and reinsert them into the tree.
- This could result in all reinserted entries fitting on some existing pages, avoiding a split.

R* trees also use a different heuristic, minimizing box perimeters rather than box areas during insertion.

Another variant, the R+ tree, avoids overlap by inserting an object into multiple leaves if necessary:
- Searches now take a single path to a leaf, at the cost of redundancy.

GiST

The Generalized Search Tree (GiST) abstracts the "tree" nature of a class of indexes, including B+ trees and R-tree variants.
- Striking similarities in insert/delete/search, and even concurrency control algorithms, make it possible to provide "templates" for these algorithms that can be customized to obtain the many different tree index structures.
- B+ trees are so important (and simple enough to allow further specialization) that they are implemented specially in all DBMSs.
- GiST provides an alternative for implementing other tree indexes in an ORDBMS.

Indexing High-Dimensional Data


Typically, high-dimensional datasets are collections of points, not regions.
- E.g., feature vectors in multimedia applications.
- Very sparse.

Nearest neighbor queries are common.
- The R-tree becomes worse than a sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is then not meaningful.
- In any given data set, it is advisable to empirically test the contrast.

5 (a) Explain the features of Parallel and Text Databases in detail. (JUNE 2010)

Parallel Databases

Introduction

Parallel machines are becoming quite common and affordable: prices of microprocessors, memory and disks have dropped sharply. Databases are growing increasingly large: large volumes of transaction data are collected and stored for later analysis, and multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases

Data can be partitioned across multiple disks for parallel I/O. Individual relational operations (e.g., sort, join, aggregation) can be executed in parallel: data can be partitioned, and each processor can work independently on its own partition. Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier. Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus, databases naturally lend themselves to parallelism.

I/O Parallelism

Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks. Horizontal partitioning: the tuples of a relation are divided among many disks, such that each tuple resides on one disk.

Partitioning techniques (number of disks = n):

Round-robin:
Send the i-th tuple inserted in the relation to disk i mod n.

Hash partitioning:
Choose one or more attributes as the partitioning attributes.


Choose a hash function h with range 0 ... n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send that tuple to disk i.

Partitioning techniques (cont.):

Range partitioning:
Choose an attribute as the partitioning attribute. A partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi <= v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v >= vn-2 go to disk n-1.
E.g., with a partitioning vector [5, 11], a tuple with a partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2. (A code sketch of the three schemes appears after the comparison below.)

Comparison of Partitioning Techniques

Evaluate how well the partitioning techniques support the following types of data access:
1 Scanning the entire relation.
2 Locating a tuple associatively - point queries, e.g., r.A = 25.
3 Locating all tuples such that the value of a given attribute lies within a specified range - range queries, e.g., 10 <= r.A < 25.

Round-robin:
Advantages

- Best suited for a sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.

Range queries are difficult to process:
- No clustering; tuples are scattered across all disks.

Hash partitioning:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: the lookup involves a single disk, leaving the others available for answering other queries. An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so range queries are difficult to answer.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed; the remaining disks are available for other queries. Good if the result tuples come from one to a few blocks.


- If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted: an example of execution skew.
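A small sketch of the three placement rules (helper names invented; n is the number of disks), reproducing the [5, 11] example above:

# Sketch of round-robin, hash, and range partitioning (disk numbering as above).

def round_robin_disk(i_th_tuple: int, n: int) -> int:
    return i_th_tuple % n                  # i-th inserted tuple goes to disk i mod n

def hash_disk(value, n: int) -> int:
    return hash(value) % n                 # hash function h with range 0..n-1

def range_disk(v, vector) -> int:
    """Partitioning vector [v0, ..., vn-2]: tuples with vi <= v < vi+1 go to
    disk i+1, v < v0 to disk 0, v >= vn-2 to disk n-1."""
    for i, cut in enumerate(vector):
        if v < cut:
            return i
    return len(vector)

assert [range_disk(v, [5, 11]) for v in (2, 8, 20)] == [0, 1, 2]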

Partitioning a Relation across Disks

If a relation contains only a few tuples, which will fit into a single disk block, then assign the relation to a single disk. Large relations are preferably partitioned across all the available disks: if a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew

The distribution of tuples to disks may be skewed: that is, some disks have many tuples, while others may have fewer tuples.

Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.

Handling Skew in Range-Partitioning

To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector (n denotes the number of partitions to be constructed).
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.

An alternative technique based on histograms is used in practice.

Handling Skew using Histograms

A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
- Assume a uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
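A minimal sketch of the balanced-vector construction just described, assuming the partitioning attribute forms a key (so the values are distinct):

# Build a balanced range-partitioning vector from sorted attribute values:
# after every 1/n-th of the relation, record the next value as a cut point.

def build_partition_vector(sorted_values, n_partitions):
    step = len(sorted_values) // n_partitions
    return [sorted_values[i * step] for i in range(1, n_partitions)]

values = sorted([3, 9, 1, 14, 7, 22, 5, 11, 18])    # hypothetical key values
print(build_partition_vector(values, 3))             # [7, 14] -> 3 equal partitions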


Interquery Parallelism

Queries/transactions execute in parallel with one another. This increases transaction throughput; it is used primarily to scale up a transaction processing system to support a larger number of transactions per second. It is the easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing. It is more complicated to implement on shared-disk or shared-nothing architectures:
- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache coherency has to be maintained: reads and writes of data in a buffer must find the latest version of the data.

Cache Coherency Protocol

Example of a cache coherency protocol for shared-disk systems:
- Before reading/writing a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.

More complex protocols with fewer disk reads/writes exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism

Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries. There are two complementary forms of intraquery parallelism:
- Intraoperation parallelism: parallelize the execution of each individual operation in the query.
- Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically larger than the number of operations in a query.

Parallel Sort

Range-Partitioning Sort
Choose processors P0, ..., Pm, where m <= n - 1, to do the sorting. Create a range-partition vector with m entries on the sorting attributes. Redistribute the relation using range partitioning:

- All tuples that lie in the i-th range are sent to processor Pi, which stores them temporarily on disk Di. This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 <= i < j <= m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output. Parallelize the merging of sorted runs as follows:
- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
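A toy, sequential simulation of range-partitioning sort (the list partitions stand in for the processors; all names are invented):

# Toy simulation of range-partitioning parallel sort: redistribute by range,
# sort each partition "locally", then concatenate (no merge step is needed).

def range_partition_sort(tuples, vector):
    partitions = [[] for _ in range(len(vector) + 1)]
    for v in tuples:
        i = next((k for k, cut in enumerate(vector) if v < cut), len(vector))
        partitions[i].append(v)                  # ship the tuple to processor P_i
    for part in partitions:
        part.sort()                              # each P_i sorts locally, in parallel
    return [v for part in partitions for v in part]   # trivial concatenation

print(range_partition_sort([9, 1, 22, 7, 14, 3], [5, 11]))  # [1, 3, 7, 9, 14, 22]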

Parallel Join

The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor are collected together to produce the final result.

Partitioned Join
- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
- Let r and s be the input relations, and suppose we want to compute the join of r and s on r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
- Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
- Partitions ri and si are sent to processor Pi.
- Each processor Pi locally computes the join of ri and si. Any of the standard join methods can be used.


Fragment-and-Replicate Join

Partitioning is not possible for some join conditions, e.g., non-equijoin conditions such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case: asymmetric fragment-and-replicate.
- One of the relations, say r, is partitioned; any partitioning technique can be used. The other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. They usually have a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used: e.g., if s is small and r is large and already partitioned, it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join

Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps the tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.

Hash-join optimizations can be applied to the parallel case: e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, avoiding the cost of writing them out and reading them back in.
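A compact, single-process simulation of the two-level hashing just described (h1 routes tuples to processors; the local dict plays the role of h2; all names are invented):

# Toy partitioned parallel hash join: h1 routes build tuples of s (and probe
# tuples of r) to processors; each processor then joins its partitions locally.

N_PROC = 4
h1 = lambda key: hash(key) % N_PROC

def partitioned_hash_join(r, s):
    """r and s are lists of (join_key, payload); returns joined triples."""
    s_parts = [[] for _ in range(N_PROC)]
    r_parts = [[] for _ in range(N_PROC)]
    for t in s: s_parts[h1(t[0])].append(t)      # distribute build relation s
    for t in r: r_parts[h1(t[0])].append(t)      # redistribute probe relation r
    out = []
    for i in range(N_PROC):                      # each P_i works independently
        table = {}                               # local build (dict hashing ~ h2)
        for k, v in s_parts[i]: table.setdefault(k, []).append(v)
        for k, v in r_parts[i]:                  # local probe phase
            for w in table.get(k, []): out.append((k, v, w))
    return out

print(partitioned_hash_join([(1, "r1"), (2, "r2")], [(1, "s1"), (3, "s3")]))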

Parallel Nested-Loop Join

Assume that:
- relation s is much smaller than relation r, and r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r being used. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the i-th partition of relation r.

Interoperator Parallelism

Pipelined parallelism:

- Consider a join of four relations, r1, r2, r3 and r4. Set up a pipeline that computes the three joins in parallel: let P1 be assigned the computation of temp1 = r1 JOIN r2, P2 the computation of temp2 = temp1 JOIN r3, and P3 the computation of temp2 JOIN r4.
- Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent parallelism:
- Consider a join of four relations, r1, r2, r3 and r4. Let P1 be assigned the computation of temp1 = r1 JOIN r2, P2 the computation of temp2 = r3 JOIN r4, and P3 the computation of temp1 JOIN temp2. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization

Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- which operations to pipeline, which operations to execute independently in parallel, and which operations to execute sequentially, one after the other.

Determining the amount of resources to allocate to each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases

A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase, or both.

When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.

Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.

Rule-based systems are:
- knowledge-based systems;
- part of the Artificial Intelligence field;
- computer programs that contain some subject-specific knowledge of one or more human experts;
- made up of a set of rules that analyze user-supplied information about a specific class of problems;
- systems that utilize reasoning capabilities and draw conclusions.

- Knowledge Engineering: building an expert system.
- Knowledge Engineers: the people who build the system.
- Knowledge Representation: the symbols used to represent the knowledge.
- Factual Knowledge: knowledge of a particular task domain that is widely shared.


- Heuristic Knowledge: more judgmental knowledge of performance in a task domain.

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.

Applications

A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction

A typical rule-based system has four basic components:

- A list of rules, or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of Rules: derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining: starts from a set of conditions and moves towards some conclusion.
- Backward-chaining: starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures; a sketch of the forward-chaining cycle follows.
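A minimal sketch of forward chaining via the match-resolve-act cycle (the rule format and facts are invented for illustration):

# Tiny forward-chaining engine: a rule is (antecedents, consequent); the
# match-resolve-act loop fires one applicable rule per cycle until nothing new.

rules = [({"has_fever", "has_rash"}, "suspect_measles"),
         ({"suspect_measles"}, "refer_to_doctor")]

facts = {"has_fever", "has_rash"}                # working memory

while True:
    # Match: rules whose antecedents hold and whose consequent is not yet known.
    conflict_set = [c for ante, c in rules if ante <= facts and c not in facts]
    if not conflict_set:
        break                                    # no production satisfied: halt
    facts.add(conflict_set[0])                   # Resolve (first match) + Act

print(facts)   # now includes "suspect_measles" and "refer_to_doctor"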

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the number of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Cannot give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Inability to recognize when no answer is available.

Knowledge Bases

Knowledge-based Systems: Definition

A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS; general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity based

KBS tools - Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - matrix converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range.
- Special purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
- Scratch:
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).

Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning: both short term & long term, in areas like project management, product development or financial planning, e.g., HRM.
(6) Monitoring: to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase the availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of the solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- Mimic the organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability/reliability/performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared to conventional (centralized) databases:
- More complex to implement and more costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction (Ti) is allowed to read the most recent version of an object that precedes the transaction's timestamp (TS(Ti)).

If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp; if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads that mostly involve reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of the DB could be:

Time  Object1  Object2
t0    "Foo"    "Bar"
t1    "Hello"  "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3


t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
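A toy sketch of the timestamp-based version selection just described (an in-memory version store invented for illustration, not any particular system's API):

# Toy MVCC version store: each object keeps (write_ts, value) versions; a
# reader at timestamp ts sees the newest version with write_ts <= ts, lock-free.

class MVCCStore:
    def __init__(self):
        self.versions = {}                       # key -> list of (write_ts, value)

    def write(self, key, ts, value):
        chain = self.versions.setdefault(key, [])
        chain.append((ts, value))
        chain.sort(key=lambda v: v[0])           # keep versions ordered by write_ts

    def read(self, key, ts):
        """Most recent version preceding (or at) the reader's timestamp."""
        result = None
        for write_ts, value in self.versions.get(key, []):
            if write_ts <= ts:
                result = value                   # later versions are invisible
        return result

db = MVCCStore()
db.write("Object1", 0, "Foo")
db.write("Object1", 1, "Hello")                  # supersedes "Foo" for readers at t >= 1
print(db.read("Object1", 0), db.read("Object1", 1))   # Foo Hello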

Recovery

(b) Discuss the client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing. Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized. Some of the software problems (which may involve data management, transaction management and database recovery) have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
- the limited and intermittent connectivity afforded by wireless communications;
- the limited life of the power supply (battery);


- the changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1. It is a distributed architecture, where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units may move unrestricted throughout the cells of a domain while maintaining information access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture. Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients. Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.

MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server:
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments

The characteristics of mobile computing include:
- communication latency;
- intermittent connectivity;
- limited battery life;
- changing client location.

The server may not be able to reach a client:


A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station. In either case, neither client nor server can reach the other, and modifications must be made to the architecture to compensate for this case: proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.

Mobile computing poses challenges for servers as well as clients:
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases

Whenever clients connect (through a process known in industry as synchronization of a client with a server), they receive a batch of updates to be installed on their local database. The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them. This environment has problems similar to those in distributed and client-server databases, and some from mobile databases. This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
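A small sketch of the synchronization exchange described above (unicast case; all names are invented): the server queues updates per client, and a client installs its batch only when it chooses to connect.

# Sketch of ISDB-style synchronization: the server queues updates per client;
# on connect, a client pulls and applies its batch (the server never initiates).

class SyncServer:
    def __init__(self):
        self.pending = {}                        # client_id -> list of updates

    def queue_update(self, client_id, update):
        self.pending.setdefault(client_id, []).append(update)

    def connect(self, client_id):
        """Called by the client; a server cannot connect to a client at will."""
        return self.pending.pop(client_id, [])

class SyncClient:
    def __init__(self, client_id):
        self.client_id, self.local_db = client_id, {}

    def synchronize(self, server):
        for key, value in server.connect(self.client_id):
            self.local_db[key] = value           # install the batch locally

server = SyncServer()
server.queue_update("c1", ("price", 42))
c1 = SyncClient("c1")
c1.synchronize(server)
print(c1.local_db)                               # {'price': 42}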


Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite


Page 44: Database Technology

Typically, high-dimensional datasets are collections of points, not regions:
- e.g. feature vectors in multimedia applications;
- very sparse.

Nearest-neighbor queries are common:
- The R-tree becomes worse than sequential scan for most datasets with more than a dozen dimensions.

As dimensionality increases, contrast (the ratio of distances between the nearest and farthest points) usually decreases, and "nearest neighbor" is not meaningful:
- In any given data set, it is advisable to empirically test contrast.
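That last check is easy to run directly. A minimal sketch (Python, with made-up uniform random data and plain Euclidean distance, purely for illustration) that estimates the contrast of a dataset with respect to a query point:

import random

def contrast(points, query):
    # Ratio of farthest to nearest Euclidean distance from the query point.
    dists = [sum((x - q) ** 2 for x, q in zip(p, query)) ** 0.5 for p in points]
    return max(dists) / min(dists)

for dim in (2, 10, 100):
    pts = [[random.random() for _ in range(dim)] for _ in range(1000)]
    q = [random.random() for _ in range(dim)]
    print(dim, round(contrast(pts, q), 2))   # contrast shrinks as dim grows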

5 (a) Explain the features of Parallel and Text Databases in detail (JUNE 2010)
Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable:
- Prices of microprocessors, memory and disks have dropped sharply.
Databases are growing increasingly large:
- Large volumes of transaction data are collected and stored for later analysis; multimedia objects like images are increasingly stored in databases.

Large-scale parallel database systems are increasingly used for:
- storing large volumes of data;
- processing time-consuming decision-support queries;
- providing high throughput for transaction processing.

Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g. sort, join, aggregation) can be executed in parallel:
- Data can be partitioned and each processor can work independently on its own partition.
Queries are expressed in a high-level language (SQL, translated to relational algebra), which makes parallelization easier.
Different queries can be run in parallel with each other; concurrency control takes care of conflicts. Thus databases naturally lend themselves to parallelism.

I/O Parallelism
Reduce the time required to retrieve relations from disk by partitioning the relations on multiple disks.
Horizontal partitioning – tuples of a relation are divided among many disks such that each tuple resides on one disk.
Partitioning techniques (number of disks = n):

Round-robin: send the ith tuple inserted in the relation to disk i mod n.

Hash partitioning: choose one or more attributes as the partitioning attributes; choose a hash function h with range 0…n-1. Let i denote the result of hash function h applied to the partitioning attribute value of a tuple; send the tuple to disk i.

Range partitioning: choose an attribute as the partitioning attribute; a partitioning vector [v0, v1, ..., vn-2] is chosen. Let v be the partitioning attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i+1, tuples with v < v0 go to disk 0, and tuples with v ≥ vn-2 go to disk n-1.
E.g. with a partitioning vector [5, 11], a tuple with partitioning attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.

Comparison of Partitioning Techniques
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively – point queries. E.g. r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range – range queries. E.g. 10 ≤ r.A < 25.

Round-robin:
Advantages:
- Best suited for sequential scan of the entire relation on each query.
- All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
- Range queries are difficult to process: no clustering – tuples are scattered across all disks.

Hash partitioning:
- Good for sequential access: assuming the hash function is good and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
- Good for point queries on the partitioning attribute: can look up a single disk, leaving the others available for answering other queries; an index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
- No clustering, so difficult to answer range queries.

Range partitioning:
- Provides data clustering by partitioning attribute value.
- Good for sequential access.
- Good for point queries on the partitioning attribute: only one disk needs to be accessed.
- For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  – Remaining disks are available for other queries.
  – Good if result tuples are from one to a few blocks.
  – If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted. Example of execution skew.
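As a concrete illustration of the three techniques compared above, a minimal Python sketch that maps a tuple's partitioning-attribute value to a disk number (the hash function and sample values are stand-ins for illustration):

import bisect

def round_robin_disk(i, n):
    # The ith tuple inserted into the relation goes to disk i mod n.
    return i % n

def hash_disk(value, n):
    # Hash the partitioning-attribute value; send the tuple to disk h(v).
    return hash(value) % n

def range_disk(value, vector):
    # vector = [v0, ..., v(n-2)]: v < v0 -> disk 0, vi <= v < vi+1 -> disk i+1.
    return bisect.bisect_right(vector, value)

# With partitioning vector [5, 11]: value 2 -> disk 0, 8 -> disk 1, 20 -> disk 2.
print([range_disk(v, [5, 11]) for v in (2, 8, 20)])   # [0, 1, 2]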

Partitioning a Relation across Disks
- If a relation contains only a few tuples that will fit into a single disk block, then assign the relation to a single disk.
- Large relations are preferably partitioned across all the available disks.
- If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.

Handling of Skew
The distribution of tuples to disks may be skewed – that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
- Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
- Partition skew: with range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning if a good hash function is chosen.

Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

- Sort the relation on the partitioning attribute.
- Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/nth of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
- n denotes the number of partitions to be constructed.
- Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.

Handling Skew using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:

- Assume uniform distribution within each range of the histogram.
- The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
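A minimal sketch of the sort-based construction described above (Python; modeling the relation as a list of dicts is an assumption made purely for illustration):

def balanced_partition_vector(relation, attr, n):
    # Sort on the partitioning attribute, then record the attribute value
    # after every 1/nth of the relation has been read (n-1 cut points).
    tuples = sorted(relation, key=lambda t: t[attr])
    step = len(tuples) // n
    return [tuples[k * step][attr] for k in range(1, n)]

rows = [{"a": v} for v in (1, 3, 3, 4, 7, 8, 10, 15, 21, 30, 31, 40)]
print(balanced_partition_vector(rows, "a", 3))   # [7, 21]
# Duplicates in the attribute can still leave the partitions imbalanced.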


Interquery Parallelism
Queries/transactions execute in parallel with one another.
- Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
- Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
- More complicated to implement on shared-disk or shared-nothing architectures:

- Locking and logging must be coordinated by passing messages between processors.
- Data in a local buffer may have been updated at another processor.
- Cache-coherency has to be maintained – reads and writes of data in buffer must find the latest version of the data.

Cache Coherency Protocol
Example of a cache coherency protocol for shared-disk systems:

- Before reading/writing to a page, the page must be locked in shared/exclusive mode.
- On locking a page, the page must be read from disk.
- Before unlocking a page, the page must be written to disk if it was modified.
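A sketch of those three steps (Python; the disk object and lock manager, with their read/write/acquire/release methods, are hypothetical stand-ins, not a real DBMS API):

class SharedDiskBuffer:
    # Shared-disk coherency: re-read a page from disk on every lock,
    # and flush it back before unlock if it was modified.
    def __init__(self, disk, lock_mgr):
        self.disk, self.locks = disk, lock_mgr
        self.cache, self.dirty = {}, set()

    def lock_page(self, pid, mode):             # mode is 'S' or 'X'
        self.locks.acquire(pid, mode)           # hypothetical lock-manager call
        self.cache[pid] = self.disk.read(pid)   # another node may have changed it

    def write_page(self, pid, data):
        self.cache[pid] = data
        self.dirty.add(pid)

    def unlock_page(self, pid):
        if pid in self.dirty:
            self.disk.write(pid, self.cache[pid])   # write back before unlocking
            self.dirty.discard(pid)
        del self.cache[pid]
        self.locks.release(pid)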

More complex protocols, with fewer disk reads/writes, exist. Cache coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.

Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
- Intraoperation parallelism – parallelize the execution of each individual operation in the query.
- Interoperation parallelism – execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.

Parallel Sort
Range-Partitioning Sort
- Choose processors P0, ..., Pm, where m ≤ n-1, to do the sorting.
- Create a range-partition vector with m entries, on the sorting attributes.
- Redistribute the relation using range partitioning:
  - All tuples that lie in the ith range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
- Each processor Pi sorts its partition of the relation locally.


Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism). The final merge operation is trivial: range-partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.

Parallel External Sort-Merge
Assume the relation has already been partitioned among disks D0, ..., Dn-1. Each processor Pi locally sorts the data on disk Di. The sorted runs on each processor are then merged to get the final sorted output.
Parallelize the merging of sorted runs as follows:

- The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.

- Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.

- The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
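A sequential Python simulation of the first scheme, range-partitioning sort (each list stands in for one processor's partition; in a real system the per-processor sorts run in parallel and the redistribution costs real I/O and communication):

from bisect import bisect_right

def range_partitioning_sort(disks, vector):
    # Redistribute tuples by range, sort each range locally, concatenate.
    processors = [[] for _ in range(len(vector) + 1)]
    for disk in disks:                        # redistribution step
        for v in disk:
            processors[bisect_right(vector, v)].append(v)
    for p in processors:                      # local sorts: data parallelism
        p.sort()
    return [v for p in processors for v in p]   # trivial final concatenation

print(range_partitioning_sort([[9, 2, 31], [7, 18, 1], [25, 4]], [5, 11]))
# [1, 2, 4, 7, 9, 18, 25, 31]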

Parallel Join
The join operation requires pairs of tuples to be tested to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.

Partitioned Join

- For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.

- Let r and s be the input relations, and suppose we want to compute the join r ⋈ s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1. Either range partitioning or hash partitioning can be used: r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function. Partitions ri and si are sent to processor Pi.

- Each processor Pi locally computes ri ⋈ si on the same condition. Any of the standard join methods can be used.


Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:

- e.g. non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

Special case – asymmetric fragment-and-replicate:

- One of the relations, say r, is partitioned; any partitioning technique can be used. The other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

- e.g. say s is small and r is large and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
- Assume s is smaller than r, and therefore s is chosen as the build relation.
- A hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
- As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally.
- Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1. Let ri denote the tuples of relation r that are sent to processor Pi.


- As the r tuples are received at the destination processors, they are repartitioned using the function h2.
- Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.

Hash-join optimizations can be applied to the parallel case:

- e.g. the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
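A compact single-process sketch of the partitioned parallel hash-join just described (Python; h1 distributes both relations, each loop iteration plays the role of one processor's local build and probe, and h2 is implicit in the local dict; joining on the first attribute is an assumption for illustration):

from collections import defaultdict

def partitioned_hash_join(r, s, n):
    # s is the smaller build relation; r is the probe relation.
    h1 = lambda k: hash(k) % n               # maps a join-key to a processor
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in s:
        s_parts[h1(t[0])].append(t)
    for t in r:                              # r redistributed with the same h1
        r_parts[h1(t[0])].append(t)
    out = []
    for i in range(n):                       # one iteration = one processor
        table = defaultdict(list)
        for t in s_parts[i]:                 # build phase on si
            table[t[0]].append(t)
        for t in r_parts[i]:                 # probe phase with ri
            out.extend(t + m for m in table[t[0]])
    return out

print(partitioned_hash_join([(1, "r1"), (2, "r2")], [(1, "s1"), (3, "s3")], n=4))
# [(1, 'r1', 1, 's1')]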

Parallel Nested-Loop Join
Assume that:

- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 ← r1 ⋈ r2,
  - P2 the computation of temp2 ← temp1 ⋈ r3,
  - and P3 the computation of temp2 ⋈ r4.

- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.

  - Let P1 be assigned the computation of temp1 ← r1 ⋈ r2,

  - P2 the computation of temp2 ← r3 ⋈ r4, and P3 the computation of temp1 ⋈ temp2. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

- The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.

- Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.

When scheduling an execution tree in a parallel system, we must decide:
- How to parallelize each operation, and how many processors to use for it.
- What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:

- e.g. allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing or downloading. In addition to text documents, images are often included, such as graphs, maps, photos and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:

– knowledge-based systems;
– part of the artificial intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the knowledge
Factual Knowledge – knowledge of a particular task domain that is widely shared


Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.

Components of a Rule-Based System

- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.

Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures; a small forward-chaining sketch follows.
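A toy forward-chaining sketch over propositional facts (Python; the medical rules and fact names are invented purely for illustration, not taken from any real expert system):

def forward_chain(facts, rules):
    # Fire every rule whose premises all hold (match), add its conclusion
    # (act), and repeat until no new facts can be inferred.
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

rules = [({"fever", "rash"}, "suspect_measles"),
         ({"suspect_measles", "koplik_spots"}, "diagnose_measles")]
print(forward_chain({"fever", "rash", "koplik_spots"}, rules))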

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks.
- Hold and maintain significant levels of information.
- Reduce employee training costs.
- Centralize the decision-making process.
- Create efficiencies and reduce the time needed to solve problems.
- Combine multiple human expert intelligences.
- Reduce the amount of human errors.
- Give strategic and comparative advantages, creating entry barriers to competitors.
- Review transactions that human experts may overlook.

Disadvantages
- Lack the human common sense needed in some decision making.
- Will not be able to give the creative responses that human experts can give in unusual circumstances.
- Domain experts cannot always clearly explain their logic and reasoning.
- Challenges of automating complex processes.
- Lack of flexibility and ability to adapt to changing environments.
- Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems
Definition:
- A system that draws upon the knowledge of human experts, captured in a knowledge-base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic (heuristics in search vs. in KBS: general vs. domain-specific).
- Highly specific domain knowledge.
- Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
- Logic: resolution principle
- Production rules: backward chaining (top-down, goal-directed) or forward chaining (bottom-up, data-driven)
- Semantic nets & frames: inheritance & advanced reasoning
- Case-based reasoning: similarity-based

KBS tools – Shells
- Consist of KA tool, database & development interface.
- Inductive shells:
  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - matrix converted into a decision tree or IF-THEN statements;
  - examples selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools, applicable to a wide range.
- Special purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- Scratch (building from scratch):
  - requires more time and effort;
  - no constraints like shells;
  - shells should be investigated first.

Some example KBSs
- DENDRAL (chemistry)
- MYCIN (medicine)
- XCON/R1 (computers)

Typical tasks of KBS
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions; e.g. diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information; e.g. DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations; e.g. Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem; e.g. XCON.
(5) Planning – both short term & long term, in areas like project management, product development or financial planning; e.g. HRM.
(6) Monitoring – to check performance & flag exceptions; e.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence; e.g. control a patient's treatment.
(8) Instruction – to train students and correct their performance; e.g. give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions; e.g. identify errors in an automated teller machine network and ways to correct the errors.

Advantages

- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts.
- Efficient and cost effective.
- Consistency of answers.
- Explanation of solution.
- Deal with uncertainty.

Limitations
- Lack of common sense.
- Inflexible; difficult to modify.
- Restricted domain of expertise.
- Lack of learning ability.
- Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:

- Mimics the organisational structure with its data.

- Local access and autonomy, without exclusion.

- Cheaper to create and easier to expand.

- Improved availability/reliability/performance, by removing reliance on a central site.

- Reduced communication overhead: most data access is local, which is less expensive and performs better.

- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared to conventional databases:
- More complex to implement, and more costly to maintain.
- Security and integrity control are harder; standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version; but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and a transaction (Ti) is allowed to read the most recent version of an object which precedes the transaction timestamp (TS(Ti)).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time   Object1   Object2
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time   Object1   Object2     Object3
t2     "Hello"   (deleted)   "Foo-Bar"
t1     "Hello"   "Bar"
t0     "Foo"     "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
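A minimal sketch of the timestamp-based reads described above (Python; it assumes versions are appended in timestamp order and deliberately omits the write-conflict and read-timestamp checks):

class MVCCStore:
    # Each key maps to a list of (write_ts, value) versions, oldest first.
    # A reader at timestamp ts sees the latest version with write_ts <= ts.
    def __init__(self):
        self.versions = {}

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))

    def read(self, key, ts):
        visible = [v for wts, v in self.versions.get(key, []) if wts <= ts]
        return visible[-1] if visible else None

db = MVCCStore()
db.write("Object1", "Foo", ts=0)
db.write("Object1", "Hello", ts=1)    # supersedes "Foo" without deleting it
print(db.read("Object1", ts=0))       # a long-running reader at t0 sees "Foo"
print(db.read("Object1", ts=2))       # a reader at t2 sees "Hello"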

Recovery

(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications.
  - The limited life of the power supply (battery).


  - The changing topology of the network.
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:

- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage:
  - To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2:
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server:
    - Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
    - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
  - Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case:
  - Proxies for unreachable components are added to the architecture.
  - For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:

- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication: a base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components: data management responsibility is shared among base stations or fixed hosts, and mobile units.
Data management issues as they apply to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance


- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.

- A server cannot connect to a client at will.

- Issues of wireless versus wired client connections and power conservation are generally immaterial.

- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.

- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans:
  - Must prune the search space; typically left-deep plans only.
- Must estimate the cost of each plan that is considered:
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: tree of relational algebra operators, with a choice of algorithm for each operator:

- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.

Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: date, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:

- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:

- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:

- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods); a small enumeration sketch follows.
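A small sketch of what "all left-deep join trees" means as a search space (Python; the nested-tuple plan representation and the three join-method names are stand-ins, and no cost model is attached):

from itertools import permutations

def left_deep_plans(relations, methods=("nested-loop", "hash", "sort-merge")):
    # Every permutation of the relations, joined one at a time,
    # with every choice of join method at each step.
    def extend(plan, rest):
        if not rest:
            yield plan
            return
        for m in methods:
            yield from extend((plan, m, rest[0]), rest[1:])
    for perm in permutations(relations):
        yield from extend(perm[0], list(perm[1:]))

plans = list(left_deep_plans(["Sailors", "Reserves", "Boats"]))
print(len(plans))   # 3! permutations x 3 methods at each of 2 joins = 54
print(plans[0])     # (('Sailors', 'nested-loop', 'Reserves'), 'nested-loop', 'Boats')

A real optimizer would prune this space with dynamic programming and pick the cheapest plan under its cost estimates.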

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database:
- rather than storing them outside the database, in a file system.

- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:

- JPEG and GIF are the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression:

- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.

- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

Several alternatives for audio encoding:


1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user; e.g., identify similar designs when registering a new trademark.

• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking; e.g., test user input against stored commands.

• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.

• New: the answer table can appear in the From clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:

• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive-query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
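To see the fixpoint definition in action, here is a small Python sketch; double_plus mirrors the double+ function from the text, and the value cap is an artificial device (the real domain, sets of all integers, is infinite, so unbounded iteration would not terminate):

```python
def double_plus(s):
    """double+ : S -> {2*x for x in S} union S."""
    return s | {2 * x for x in s}

def iterate_to_fixpoint(f, start, cap=100):
    """Repeatedly apply f until f(v) = v; cap keeps the demo finite."""
    current = start
    while True:
        nxt = {x for x in f(current) if x <= cap}
        if nxt == current:          # f(v) = v : a fixpoint is reached
            return current
        current = nxt

print(sorted(iterate_to_fixpoint(double_plus, {1, 2, 5})))
# [1, 2, 4, 5, 8, 10, 16, 20, 32, 40, 64, 80] (closure under doubling, capped)
```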

iv Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


• Stratum 0: all database tables.
• Stratum I: tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog:
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).
The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body, which acts as a filter on the set of tuples generated by this rule.
• Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
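As an illustration of the seminaive idea above, the following Python sketch evaluates the Comp program (transitive closure of Assembly) with a delta table; the tiny Assembly instance is invented:

```python
# Seminaive evaluation of Comp(Part, Subpt). Assembly facts: (part, subpart, qty).
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1)}

# Base case: every Assembly tuple yields a Comp tuple.
comp = {(p, s) for (p, s, q) in assembly}
delta = set(comp)                 # tuples generated in the previous iteration

while delta:
    # Recursive rule, restricted to join with delta_Comp only:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    new = {(p, s2) for (p, p2, q) in assembly
                   for (p2d, s2) in delta if p2 == p2d}
    delta = new - comp            # keep only genuinely new inferences
    comp |= delta

print(sorted(comp))               # includes ('trike', 'rim') after two iterations
```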


Page 45: Database Technology

Choose a hash function h with range 0 ... n - 1. Let i denote the result of hash function h applied to the partitioning-attribute value of a tuple; send the tuple to disk i.
Partitioning techniques (cont.):
Range partitioning:
• Choose an attribute as the partitioning attribute.
• A partitioning vector [v0, v1, ..., vn-2] is chosen.
• Let v be the partitioning-attribute value of a tuple: tuples such that vi ≤ v < vi+1 go to disk i + 1; tuples with v < v0 go to disk 0; and tuples with v ≥ vn-2 go to disk n - 1.
• E.g., with a partitioning vector [5, 11], a tuple with a partitioning-attribute value of 2 will go to disk 0, a tuple with value 8 will go to disk 1, while a tuple with value 20 will go to disk 2.
Comparison of Partitioning Techniques:
Evaluate how well partitioning techniques support the following types of data access:
1. Scanning the entire relation.
2. Locating a tuple associatively (point queries); e.g., r.A = 25.
3. Locating all tuples such that the value of a given attribute lies within a specified range (range queries); e.g., 10 ≤ r.A < 25.
Round robin:
Advantages:
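As a quick illustration of the three schemes, here is a minimal Python sketch (the tuple format and function names are invented for this example):

```python
# Map tuples to n disks under the three partitioning schemes.

def round_robin(i, n):
    """Send the i-th tuple of the relation to disk i mod n."""
    return i % n

def hash_partition(tup, attr, n):
    """Send a tuple to disk h(v), where h has range 0 .. n-1."""
    return hash(tup[attr]) % n

def range_partition(tup, attr, vector):
    """vector = [v0, ..., v_{n-2}]: v < v0 -> disk 0,
    v_i <= v < v_{i+1} -> disk i+1, v >= v_{n-2} -> disk n-1."""
    v = tup[attr]
    for i, cut in enumerate(vector):
        if v < cut:
            return i
    return len(vector)

# The example from the text, with vector [5, 11]:
print(range_partition({"A": 2}, "A", [5, 11]),   # disk 0
      range_partition({"A": 8}, "A", [5, 11]),   # disk 1
      range_partition({"A": 20}, "A", [5, 11]))  # disk 2
```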

• Best suited for sequential scan of the entire relation on each query.
• All disks have almost an equal number of tuples; retrieval work is thus well balanced between disks.
Disadvantages:
• Range queries are difficult to process: no clustering; tuples are scattered across all disks.
Hash partitioning:
• Good for sequential access: assuming the hash function is good, and the partitioning attributes form a key, tuples will be equally distributed between disks, and retrieval work is then well balanced between disks.
• Good for point queries on the partitioning attribute:
  - Can look up a single disk, leaving the others available for answering other queries.
  - An index on the partitioning attribute can be local to a disk, making lookup and update more efficient.
• No clustering, so it is difficult to answer range queries.
Range partitioning:
• Provides data clustering by partitioning-attribute value.
• Good for sequential access.
• Good for point queries on the partitioning attribute: only one disk needs to be accessed.
• For range queries on the partitioning attribute, one to a few disks may need to be accessed:
  - The remaining disks are available for other queries.
  - Good if result tuples are from one to a few blocks.


  - If many blocks are to be fetched, they are still fetched from one to a few disks, and potential parallelism in disk access is wasted: an example of execution skew.

Partitioning a Relation across Disks:
• If a relation contains only a few tuples that will fit into a single disk block, then assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks, and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew:
The distribution of tuples to disks may be skewed: some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range partitioning and hash partitioning.
• Partition skew: with range partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash partitioning, if a good hash function is chosen.
Handling Skew in Range Partitioning:
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows: after every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
• Here, n denotes the number of partitions to be constructed.
• Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew using Histograms:
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
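A sketch of the sort-based construction in Python (sample values invented; assumes the partitioning attribute is a key, as the text requires):

```python
def balanced_partition_vector(values, n):
    """Build an (n-1)-entry range-partition vector that splits `values`
    into n partitions of nearly equal size (no duplicates assumed)."""
    values = sorted(values)                 # sort on the partitioning attribute
    step = len(values) // n                 # 1/n-th of the relation
    # After every 1/n-th of the relation, record the next attribute value.
    return [values[i * step] for i in range(1, n)]

# Invented sample: 12 key values split into 3 partitions of 4 tuples each.
print(balanced_partition_vector([3, 9, 1, 14, 7, 22, 5, 18, 11, 2, 30, 26], 3))
# [7, 18]: tuples with v < 7 -> disk 0, 7 <= v < 18 -> disk 1, v >= 18 -> disk 2
```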


Interquery Parallelism:
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
  - Locking and logging must be coordinated by passing messages between processors.
  - Data in a local buffer may have been updated at another processor.
  - Cache coherency has to be maintained: reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol:
Example of a cache-coherency protocol for shared-disk systems:
• Before reading/writing to a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
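A minimal sketch of this discipline in Python (lock acquisition/release messaging is omitted; the shared disk is a dictionary; all names invented):

```python
# Toy shared-disk cache coherency: the shared disk is the only channel
# through which one processor sees another processor's updates.
disk = {}          # page_id -> contents (the shared disk)

class ProcessorCache:
    def __init__(self):
        self.buffer = {}                  # local buffer pool

    def lock_and_read(self, page_id):
        # On locking a page, it must be (re)read from disk, so any update
        # made by another processor becomes visible here.
        self.buffer[page_id] = disk.get(page_id)
        return self.buffer[page_id]

    def write_and_unlock(self, page_id, contents):
        # Before unlocking, a modified page must be flushed to disk.
        self.buffer[page_id] = contents
        disk[page_id] = contents

p1, p2 = ProcessorCache(), ProcessorCache()
p1.write_and_unlock("page7", "v1")
print(p2.lock_and_read("page7"))          # sees "v1" via the shared disk
```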

More complex protocols with fewer disk reads/writes exist. Cache-coherency protocols for shared-nothing systems are similar: each database page is assigned a home processor, and requests to fetch the page or write it to disk are sent to the home processor.
Intraquery Parallelism:
Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
• Intraoperation parallelism: parallelize the execution of each individual operation in the query.
• Interoperation parallelism: execute the different operations in a query expression in parallel.
The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically more than the number of operations in a query.
Parallel Sort:
Range-Partitioning Sort:
• Choose processors P0, ..., Pm, where m ≤ n - 1, to do the sorting.
• Create a range-partition vector with m entries, on the sorting attributes.
• Redistribute the relation using range partitioning:
  - All tuples that lie in the i-th range are sent to processor Pi.
  - Pi stores the tuples it received temporarily on disk Di.
  - This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge:
• Assume the relation has already been partitioned among disks D0, ..., Dn-1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows:
  - The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm-1.
  - Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
  - The sorted runs on processors P0, ..., Pm-1 are concatenated to get the final result.
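A single-process simulation of range-partitioning sort in Python (partition vector and data invented); the per-partition sorts are the steps that would run in parallel, one per processor:

```python
import bisect

def range_partition_sort(relation, vector):
    """Redistribute tuples by range, sort each 'processor' locally,
    then concatenate (the trivial final merge)."""
    partitions = [[] for _ in range(len(vector) + 1)]
    for v in relation:
        # bisect finds the range the value falls into (processor index).
        partitions[bisect.bisect_right(vector, v)].append(v)
    for p in partitions:
        p.sort()                      # local sort: one per processor, in parallel
    result = []
    for p in partitions:              # concatenation is the final 'merge'
        result.extend(p)
    return result

print(range_partition_sort([14, 3, 22, 9, 1, 18, 7], vector=[5, 11]))
# [1, 3, 7, 9, 14, 18, 22]
```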

Parallel Join:
The join operation requires pairs of tuples to be tested, to see if they satisfy the join condition; if they do, the pair is added to the join output. Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally. In a final step, the results from each processor can be collected together to produce the final result.
Partitioned Join:
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute the join of r and s on the condition r.A = s.B. r and s are each partitioned into n partitions, denoted r0, r1, ..., rn-1 and s0, s1, ..., sn-1.
• Can use either range partitioning or hash partitioning.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.
• Each processor Pi locally computes the join of ri and si; any of the standard join methods can be used.


Fragment-and-Replicate Join:
Partitioning is not applicable for some join conditions:
• E.g., non-equijoin conditions, such as r.A > s.B.
For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
Special case: asymmetric fragment-and-replicate:
• One of the relations, say r, is partitioned; any partitioning technique can be used.
• The other relation, s, is replicated across all the processors.
• Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable, even though partitioning could be used:
  - E.g., say s is small and r is large, and already partitioned: it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join:
Parallelizing the partitioned hash join:
• Assume s is smaller than r, and therefore s is chosen as the build relation.
• A hash function h1 takes the join-attribute value of each tuple in s and maps this tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di, and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

• Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case:
  - E.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and avoid the cost of writing them and reading them back in.
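A minimal single-process simulation of the partitioned parallel hash-join in Python (relations, tuple layout, join key and processor count are all invented; h1 distributes across 'processors', and a local hash table plays the role of h2):

```python
def partitioned_parallel_hash_join(r, s, n):
    """Join r and s on their first fields. s is the build relation."""
    h1 = lambda key: hash(key) % n
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for tup in s:                         # distribute build relation s via h1
        s_parts[h1(tup[0])].append(tup)
    for tup in r:                         # redistribute probe relation r via h1
        r_parts[h1(tup[0])].append(tup)

    result = []
    for i in range(n):                    # each 'processor' joins locally
        table = {}                        # local hash table (the role of h2)
        for tup in s_parts[i]:            # build phase
            table.setdefault(tup[0], []).append(tup)
        for tup in r_parts[i]:            # probe phase
            for match in table.get(tup[0], []):
                result.append((tup, match))
    return result

r = [(1, "r1"), (2, "r2"), (3, "r3")]
s = [(2, "s2"), (3, "s3"), (4, "s4")]
print(partitioned_parallel_hash_join(r, s, n=2))  # keys 2 and 3 match
```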

Parallel Nested-Loop Join:
Assume that:
• relation s is much smaller than relation r, and that r is stored by partitioning;
• there is an index on a join attribute of relation r at each of the partitions of relation r.
Then:
• Use asymmetric fragment-and-replicate, with relation s being replicated, and using the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism:
Pipelined parallelism:

• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - And P2 be assigned the computation of temp2 = temp1 ⋈ r3.
  - And P3 be assigned the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.

Independent Parallelism:
• Consider a join of four relations, r1 ⋈ r2 ⋈ r3 ⋈ r4:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2.
  - And P2 be assigned the computation of temp2 = r3 ⋈ r4.
  - And P3 be assigned the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
• Can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

• Does not provide a high degree of parallelism: useful with a lower degree of parallelism; less useful in a highly parallel system.
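Pipelined parallelism can be sketched with Python generators: each stage emits result tuples downstream as soon as they are produced, rather than materializing its full output first (the join predicate on the first field and the data are invented):

```python
def scan(relation):
    for tup in relation:
        yield tup

def pipelined_join(left, right_relation):
    """Consume tuples from the upstream stage one at a time and emit
    joined tuples immediately (nested-loop style, pipelineable)."""
    for ltup in left:
        for rtup in right_relation:
            if ltup[0] == rtup[0]:         # join on the first field
                yield ltup + rtup[1:]

r1 = [(1, "a"), (2, "b")]
r2 = [(1, "c"), (2, "d")]
r3 = [(1, "e")]

# P1, P2 as two pipeline stages computing r1 join r2 join r3:
stage1 = pipelined_join(scan(r1), r2)
stage2 = pipelined_join(stage1, r3)
print(list(stage2))                        # [(1, 'a', 'c', 'e')]
```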

Query Optimization:
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
• How to parallelize each operation, and how many processors to use for it.
• What operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:
• E.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases:
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel, or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, and thereby makes intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems and artificial intelligence.
Rule-based systems are:
• Knowledge-based systems.
• Part of the artificial intelligence field.
• Computer programs that contain some subject-specific knowledge of one or more human experts.
• Made up of a set of rules that analyze user-supplied information about a specific class of problems.
• Systems that utilize reasoning capabilities and draw conclusions.

• Knowledge engineering: building an expert system.
• Knowledge engineers: the people who build the system.
• Knowledge representation: the symbols used to represent the knowledge.
• Factual knowledge: knowledge of a particular task domain that is widely shared.


• Heuristic knowledge: more judgmental knowledge of performance in a task domain.
Uses of Rule-based Systems:
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production and human resources.
Applications:
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural-language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction:
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
  - Conflict resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System:

• Set of rules: derived from the knowledge base, and used by the interpreter to evaluate the inputted data.
• Knowledge engineer: decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter: interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models:
• Forward chaining: starts from a set of conditions and moves towards some conclusion.
• Backward chaining: starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.
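A toy forward-chaining interpreter in Python showing the match-resolve-act cycle (the rules, facts and first-match conflict-resolution strategy are all invented for illustration):

```python
# Each rule: (name, condition over working memory, facts its action adds).
RULES = [
    ("r1", lambda wm: "fever" in wm and "rash" in wm, {"suspect_measles"}),
    ("r2", lambda wm: "suspect_measles" in wm,        {"order_blood_test"}),
]

def forward_chain(facts):
    wm = set(facts)                       # working memory
    while True:
        # Match: all rules whose LHS is satisfied and would add new facts.
        conflict_set = [(name, adds) for (name, cond, adds) in RULES
                        if cond(wm) and not adds <= wm]
        if not conflict_set:              # no satisfied productions: halt
            return wm
        # Conflict resolution: here, simply pick the first instantiation.
        name, adds = conflict_set[0]
        wm |= adds                        # Act: fire the rule, update memory

print(forward_chain({"fever", "rash"}))
# {'fever', 'rash', 'suspect_measles', 'order_blood_test'}
```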

Advantages:
• Provide consistent answers for repetitive decisions, processes and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human expert intelligences.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers to competitors.
• Review transactions that human experts may overlook.

Disadvantages:
• Lack the human common sense needed in some decision making.
• Will not be able to give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges of automating complex processes.
• Lack of flexibility and ability to adapt to changing environments.
• Not being able to recognize when no answer is available.

Knowledge Bases
Knowledge-based Systems: Definition:
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used.
KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & Inference:
KR                     | Inference
Logic                  | Resolution principle
Production rules       | backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & Frames | Inheritance & advanced reasoning
Case-based Reasoning   | Similarity-based
KBS tools: Shells
• Consist of a KA tool, database and development interface.
• Inductive shells:

  - simplest;
  - example cases represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
• Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
• Hybrid shells:
  - sophisticated and powerful;
  - support multiple KR paradigms and reasoning schemes;
  - generic tools applicable to a wide range.
• Special-purpose shells:
  - specifically designed for particular types of problems;


  - restricted to specialised problems.
• From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemistry), MYCIN (medicine), XCON/R1 (computers).
Typical tasks of KBS:
(1) Diagnosis: to identify a problem given a set of symptoms or malfunctions; e.g., diagnose reasons for engine failure.
(2) Interpretation: to provide an understanding of a situation from available information; e.g., DENDRAL.
(3) Prediction: to predict a future state from a set of data or observations; e.g., Drilling Advisor, PLANT.
(4) Design: to develop configurations that satisfy the constraints of a design problem; e.g., XCON.
(5) Planning: both short-term and long-term, in areas like project management, product development or financial planning; e.g., HRM.
(6) Monitoring: to check performance and flag exceptions; e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control: to collect and evaluate evidence and form opinions on that evidence; e.g., control a patient's treatment.
(8) Instruction: to train students and correct their performance; e.g., give medical students experience diagnosing illness.
(9) Debugging: to identify and prescribe remedies for malfunctions; e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages:

• Increased availability of expert knowledge: expertise not otherwise accessible; training future experts.
• Efficient and cost-effective.
• Consistency of answers.
• Explanation of solutions.
• Deal with uncertainty.
Limitations:
• Lack of common sense.
• Inflexible; difficult to modify.
• Restricted domain of expertise.
• Lack of learning ability.
• Not always reliable.


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
• Mimic organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.
Disadvantages:
• More complex to implement.
• More costly to maintain.
• Security and integrity control: standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks:
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
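The read/write timestamp rules translate almost directly into code; here is a toy version store in Python (class and method names invented; not any particular DBMS's implementation):

```python
class MVCCError(Exception):
    pass

class VersionedObject:
    def __init__(self):
        self.versions = []        # list of (write_ts, value)
        self.read_ts = 0          # largest timestamp that has read this object

    def read(self, ts):
        """Return the most recent version preceding the reader's timestamp."""
        self.read_ts = max(self.read_ts, ts)
        candidates = [(wts, v) for (wts, v) in self.versions if wts <= ts]
        return max(candidates)[1] if candidates else None

    def write(self, ts, value):
        """Abort if a later transaction already read this object (TS(Ti) < RTS(P))."""
        if ts < self.read_ts:
            raise MVCCError("abort and restart transaction %d" % ts)
        self.versions.append((ts, value))   # otherwise create a new version

obj = VersionedObject()
obj.write(1, "Foo")
obj.write(2, "Hello")
print(obj.read(1))            # 'Foo'  : a reader at ts 1 still sees the old version
print(obj.read(5))            # 'Hello': the latest version precedes ts 5
try:
    obj.write(3, "X")         # rejected: a reader at ts 5 has seen the object
except MVCCError as e:
    print(e)
```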

At t1, the state of the DB could be:
Time | Object1 | Object2
t1   | "Hello" | "Bar"
t0   | "Foo"   | "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1 = "Hello", Object2 = "Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time | Object1 | Object2   | Object3
t2   | "Hello" | (deleted) | "Foo-Bar"
t1   | "Hello" | "Bar"     |
t0   | "Foo"   | "Bar"     |
Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases:
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems (which may involve data management, transaction management and database recovery) have their origins in distributed database systems. In mobile computing, the problems are more difficult, mainly:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);


  - the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture:
The general architecture of a mobile platform is illustrated in Fig. 30.1.

• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.
• Wireless Communications:
  - The wireless medium has bandwidth significantly lower than that of a wired network: the current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
  - Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
  - Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


    and seamless roaming throughout a geographical region.
  - Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
  - Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

• Client/Network Relationships:
  - Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units may be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2:
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  - Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
  - Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
  - Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments:
The characteristics of mobile computing include:
• communication latency;
• intermittent connectivity;
• limited battery life;
• changing client location.
The server may not be able to reach a client:


• A client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
• In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
• Proxies for unreachable components are added to the architecture: for a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible: a server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
  - Client mobility also allows new applications that are location-based.

Data Management Issues:
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication:
  - A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query- and transaction-management features to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components:
  - Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance


• Mobile database design
• Location-based services
• Division of labor
• Security

Application: Intermittently Synchronized Databases:
• Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers, or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections, and of power conservation, are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery, to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 46: Database Technology

– If many blocks are to be fetched, they are still fetched from one to a few disks, and the potential parallelism in disk access is wasted. This is an example of execution skew.

Partitioning a Relation across Disks
• If a relation contains only a few tuples that will fit into a single disk block, assign the relation to a single disk.
• Large relations are preferably partitioned across all the available disks.
• If a relation consists of m disk blocks and there are n disks available in the system, then the relation should be allocated min(m, n) disks.
Handling of Skew
The distribution of tuples to disks may be skewed — that is, some disks have many tuples, while others may have fewer tuples.
Types of skew:
• Attribute-value skew: Some values appear in the partitioning attributes of many tuples; all the tuples with the same value for the partitioning attribute end up in the same partition. Can occur with range-partitioning and hash-partitioning.
• Partition skew: With range-partitioning, a badly chosen partition vector may assign too many tuples to some partitions and too few to others. Less likely with hash-partitioning, if a good hash function is chosen.
Handling Skew in Range-Partitioning
To create a balanced partitioning vector (assuming the partitioning attribute forms a key of the relation):

• Sort the relation on the partitioning attribute.
• Construct the partition vector by scanning the relation in sorted order, as follows:
– After every 1/n-th of the relation has been read, the value of the partitioning attribute of the next tuple is added to the partition vector.
– Here n denotes the number of partitions to be constructed.
– Duplicate entries or imbalances can result if duplicates are present in the partitioning attributes.
An alternative technique, based on histograms, is used in practice.
Handling Skew Using Histograms
A balanced partitioning vector can be constructed from a histogram in a relatively straightforward fashion:
• Assume a uniform distribution within each range of the histogram.
• The histogram can be constructed by scanning the relation, or by sampling (blocks containing) tuples of the relation.
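A minimal sketch in Python of constructing a balanced partition vector from a histogram, assuming a uniform distribution of values inside each bucket (the bucket boundaries and counts below are made-up illustration values, not from the source):

def partition_vector(histogram, n):
    # histogram: list of (lo, hi, count) buckets, sorted on the key range.
    # Returns the n-1 cut points that split the tuples into n near-equal ranges.
    total = sum(count for _, _, count in histogram)
    per_part = total / n
    cuts, seen, next_target = [], 0, per_part
    for lo, hi, count in histogram:
        # Emit every cut point that falls inside this bucket, interpolating
        # linearly because of the uniformity assumption.
        while len(cuts) < n - 1 and seen + count >= next_target:
            frac = (next_target - seen) / count
            cuts.append(lo + frac * (hi - lo))
            next_target += per_part
        seen += count
    return cuts

# Example: 100 tuples skewed toward low key values, 4 partitions.
print(partition_vector([(0, 10, 60), (10, 20, 30), (20, 30, 10)], 4))
# -> [4.17, 8.33, 15.0] (approximately): narrow ranges where data is dense.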


Interquery Parallelism
• Queries/transactions execute in parallel with one another.
• Increases transaction throughput; used primarily to scale up a transaction-processing system to support a larger number of transactions per second.
• Easiest form of parallelism to support, particularly in a shared-memory parallel database, because even sequential database systems support concurrent processing.
• More complicated to implement on shared-disk or shared-nothing architectures:
– Locking and logging must be coordinated by passing messages between processors.
– Data in a local buffer may have been updated at another processor.
– Cache coherency has to be maintained — reads and writes of data in the buffer must find the latest version of the data.
Cache Coherency Protocol
Example of a cache-coherency protocol for shared-disk systems:
• Before reading/writing a page, the page must be locked in shared/exclusive mode.
• On locking a page, the page must be read from disk.
• Before unlocking a page, the page must be written to disk if it was modified.
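A minimal sketch of these three rules in Python; the disk and lock-table structures are assumptions for illustration, and shared/exclusive modes are collapsed into a single lock for brevity:

import threading

class CoherentBufferPool:
    # One instance per node; disk and lock_table are shared by all nodes.
    def __init__(self, disk, lock_table):
        self.disk = disk              # page_id -> data (the shared disk)
        self.lock_table = lock_table  # page_id -> Lock (global lock manager)
        self.buffer = {}              # this node's local buffer pool

    def read_page(self, page_id):
        with self.lock_table[page_id]:
            # On locking a page, it must be (re)read from disk: another node
            # may have flushed a newer version since we last cached it.
            self.buffer[page_id] = self.disk[page_id]
            return self.buffer[page_id]

    def write_page(self, page_id, data):
        with self.lock_table[page_id]:
            self.buffer[page_id] = data
            # A modified page must be written to disk before unlocking.
            self.disk[page_id] = data

disk = {1: "v0"}
locks = {1: threading.Lock()}
node_a, node_b = CoherentBufferPool(disk, locks), CoherentBufferPool(disk, locks)
node_a.write_page(1, "v1")
print(node_b.read_page(1))   # -> 'v1': node B sees node A's update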

• More complex protocols with fewer disk reads/writes exist.
• Cache-coherency protocols for shared-nothing systems are similar:
– Each database page is assigned a home processor.
– Requests to fetch the page, or to write it to disk, are sent to the home processor.
Intraquery Parallelism
• Execution of a single query in parallel on multiple processors/disks; important for speeding up long-running queries.
• Two complementary forms of intraquery parallelism:
– Intraoperation Parallelism – parallelize the execution of each individual operation in the query.
– Interoperation Parallelism – execute the different operations in a query expression in parallel.
• The first form scales better with increasing parallelism, because the number of tuples processed by each operation is typically greater than the number of operations in a query.
Parallel Sort
Range-Partitioning Sort
• Choose processors P0, ..., Pm, where m ≤ n − 1, to do the sorting.
• Create a range-partition vector with m entries on the sorting attributes.
• Redistribute the relation using range partitioning:

– All tuples that lie in the i-th range are sent to processor Pi.
– Pi stores the tuples it receives temporarily on disk Di.
– This step requires I/O and communication overhead.
• Each processor Pi sorts its partition of the relation locally.


• Each processor executes the same operation (sort) in parallel with the other processors, without any interaction with them (data parallelism).
• The final merge operation is trivial: range partitioning ensures that, for 1 ≤ i < j ≤ m, the key values in processor Pi are all less than the key values in Pj.
Parallel External Sort-Merge
• Assume the relation has already been partitioned among disks D0, ..., Dn−1.
• Each processor Pi locally sorts the data on disk Di.
• The sorted runs on each processor are then merged to get the final sorted output.
• Parallelize the merging of sorted runs as follows (a sketch of the range-partitioning variant appears after this list):

– The sorted partitions at each processor Pi are range-partitioned across the processors P0, ..., Pm−1.
– Each processor Pi performs a merge on the streams as they are received, to get a single sorted run.
– The sorted runs on processors P0, ..., Pm−1 are concatenated to get the final result.
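A minimal sketch of the range-partitioning sort in Python; a thread pool stands in for the processors, and the data and partition vector are made-up illustration values:

from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor

def range_partition_sort(tuples, key, cuts):
    # cuts: partition vector of m-1 ascending cut points for m "processors".
    partitions = [[] for _ in range(len(cuts) + 1)]
    for t in tuples:                      # redistribution (I/O + communication)
        partitions[bisect_right(cuts, key(t))].append(t)
    with ThreadPoolExecutor() as pool:    # each processor sorts its partition locally
        sorted_parts = list(pool.map(lambda p: sorted(p, key=key), partitions))
    result = []                           # the final "merge" is mere concatenation:
    for part in sorted_parts:             # range i holds only keys below range i+1
        result.extend(part)
    return result

rows = [(5, 'e'), (1, 'a'), (9, 'i'), (3, 'c'), (7, 'g')]
print(range_partition_sort(rows, key=lambda t: t[0], cuts=[4, 8]))
# -> [(1, 'a'), (3, 'c'), (5, 'e'), (7, 'g'), (9, 'i')]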

Parallel Join
• The join operation requires pairs of tuples to be tested to see whether they satisfy the join condition; if they do, the pair is added to the join output.
• Parallel join algorithms attempt to split the pairs to be tested over several processors. Each processor then computes part of the join locally.
• In a final step, the results from each processor are collected together to produce the final result.
Partitioned Join
• For equi-joins and natural joins, it is possible to partition the two input relations across the processors and to compute the join locally at each processor.
• Let r and s be the input relations, and suppose we want to compute the join r ⋈r.A=s.B s; r and s are each partitioned into n partitions, denoted r0, r1, ..., rn−1 and s0, s1, ..., sn−1.
• Either range partitioning or hash partitioning can be used.
• r and s must be partitioned on their join attributes (r.A and s.B), using the same range-partitioning vector or hash function.
• Partitions ri and si are sent to processor Pi.

• Each processor Pi locally computes ri ⋈ri.A=si.B si. Any of the standard join methods can be used.


Fragment-and-Replicate Join
• Partitioning is not possible for some join conditions:
– e.g., non-equijoin conditions, such as r.A > s.B.
• For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.
• Special case – asymmetric fragment-and-replicate:

– One of the relations, say r, is partitioned; any partitioning technique can be used.
– The other relation, s, is replicated across all the processors.
– Processor Pi then locally computes the join of ri with all of s, using any join technique.
• Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s.
• Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated.
• Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

– E.g., say s is small and r is large and already partitioned. It may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.
Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join:
• Assume s is smaller than r, so s is chosen as the build relation.
• A hash function h1 takes the join-attribute value of each tuple in s and maps the tuple to one of the n processors.
• Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor, based on the hash function h1. Let si denote the tuples of relation s that are sent to processor Pi.
• As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash join locally.
• Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.
• Let ri denote the tuples of relation r that are sent to processor Pi.


• As the r tuples are received at the destination processors, they are repartitioned using the function h2.
• Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si to produce a partition of the final result of the hash join.
• Hash-join optimizations can be applied to the parallel case:
– e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and thus avoid the cost of writing them out and reading them back in.
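A minimal sketch of the two-level scheme in Python: h1 routes tuples to processors, and a local dictionary plays the role of the h2 build table (the relation contents are made up, and the per-processor loop simulates what would run in parallel):

from collections import defaultdict

def parallel_hash_join(r, s, n):
    # r and s are sequences of tuples whose join-attribute value is t[0].
    h1 = lambda key: hash(key) % n               # maps a tuple to a processor
    s_parts, r_parts = defaultdict(list), defaultdict(list)
    for t in s:
        s_parts[h1(t[0])].append(t)              # redistribute build relation s
    for t in r:
        r_parts[h1(t[0])].append(t)              # then redistribute probe relation r
    result = []
    for i in range(n):                           # one iteration = processor Pi's work
        table = defaultdict(list)                # local build phase (h2 = dict hashing)
        for t in s_parts[i]:
            table[t[0]].append(t)
        for t in r_parts[i]:                     # local probe phase
            for match in table[t[0]]:
                result.append(t + match)
    return result

r = [(1, 'r1'), (2, 'r2'), (2, 'r3')]
s = [(2, 's1'), (3, 's2')]
print(parallel_hash_join(r, s, n=4))   # -> [(2, 'r2', 2, 's1'), (2, 'r3', 2, 's1')]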

Parallel Nested-Loop Join
• Assume that:
– relation s is much smaller than relation r, and r is stored by partitioning;
– there is an index on a join attribute of relation r at each of the partitions of relation r.
• Use asymmetric fragment-and-replicate, with relation s being replicated and with the existing partitioning of relation r.
• Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj, and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
• Each processor Pi performs an indexed nested-loop join of relation s with the i-th partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 the computation of temp2 = temp1 ⋈ r3,
– and P3 the computation of temp2 ⋈ r4.
• Each of these operations can execute in parallel, sending the result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join-evaluation algorithm is used.
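A minimal sketch of pipelined evaluation in Python: generators model the pipeline, so each join yields result tuples downstream as soon as they are produced (the relations and indexes below are made-up illustration values):

def indexed_nlj(left, right_index):
    # Pipelineable indexed nested-loop join: left is a (possibly lazy) stream,
    # right_index maps a join-key value to the matching right-hand tuples.
    for lt in left:
        for rt in right_index.get(lt[-1], []):   # join last column = first column
            yield lt + rt

r1 = [(1, 2)]                     # a stream of (key, next-key) tuples
idx_r2 = {2: [(2, 3)]}            # r2, r3, r4 pre-indexed on their first column
idx_r3 = {3: [(3, 4)]}
idx_r4 = {4: [(4, 5)]}

# P1 computes temp1 = r1 join r2, P2 computes temp2 = temp1 join r3, and P3
# computes temp2 join r4; nothing is materialized between the stages.
pipeline = indexed_nlj(indexed_nlj(indexed_nlj(r1, idx_r2), idx_r3), idx_r4)
print(list(pipeline))             # -> [(1, 2, 2, 3, 3, 4, 4, 5)]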

Independent Parallelism

• Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
– Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
– P2 the computation of temp2 = r3 ⋈ r4,
– and P3 the computation of temp1 ⋈ temp2.
• P1 and P2 can work independently, in parallel; P3 has to wait for input from P1 and P2.
– The output of P1 and P2 can be pipelined to P3, combining independent parallelism and pipelined parallelism.
• Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
• Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
• Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


• When scheduling an execution tree in a parallel system, we must decide:
– how to parallelize each operation, and how many processors to use for it;
– what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
• Determining the amount of resources to allocate for each operation is a problem:
– e.g., allocating more processors than optimal can result in high communication overhead.
• Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence systems.
Rule-based systems are:
– knowledge-based systems;
– part of the artificial intelligence field;
– computer programs that contain some subject-specific knowledge of one or more human experts;
– made up of a set of rules that analyze user-supplied information about a specific class of problems;
– systems that utilize reasoning capabilities and draw conclusions.

• Knowledge Engineering – building an expert system.
• Knowledge Engineers – the people who build the system.
• Knowledge Representation – the symbols used to represent the knowledge.
• Factual Knowledge – knowledge of a particular task domain that is widely shared.


• Heuristic Knowledge – more judgmental knowledge of performance in a task domain.
Uses of Rule-Based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
• Solve problems that would normally be tackled by a medical or other professional.
• Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

• A list of rules, or rule base, which is a specific type of knowledge base.
• An inference engine, or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production-system program by performing the following match-resolve-act cycle:
– Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working-memory elements that satisfies the left-hand side of the production.
– Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
– Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.

• Temporary working memory.
• A user interface, or other connection to the outside world, through which input and output signals are received and sent.
Components of a Rule-Based System
• Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data.
• Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
• Interpreter – interprets the input data and draws a conclusion based on the user's responses.


Problem-Solving Models
• Forward-chaining – starts from a set of conditions and moves towards some conclusion (a minimal sketch follows below).
• Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
• Both problem-solving methods are built into inference engines or inference procedures.
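A minimal sketch of forward chaining with the match-resolve-act cycle described earlier, in Python; the rules and facts are made-up illustrations, not from the source:

rules = [
    ({"has_fever", "has_rash"}, "suspect_measles"),   # (conditions, conclusion)
    ({"suspect_measles"}, "refer_to_specialist"),
]

def forward_chain(facts):
    working_memory = set(facts)
    while True:
        # Match: collect rules whose left-hand sides are satisfied and which
        # would add something new (the conflict set).
        conflict_set = [concl for conds, concl in rules
                        if conds <= working_memory and concl not in working_memory]
        if not conflict_set:          # no productions satisfied: halt
            return working_memory
        # Resolve + Act: fire the first instantiation, updating working memory.
        working_memory.add(conflict_set[0])

print(forward_chain({"has_fever", "has_rash"}))
# -> {'has_fever', 'has_rash', 'suspect_measles', 'refer_to_specialist'}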

Advantages
• Provide consistent answers for repetitive decisions, processes, and tasks.
• Hold and maintain significant levels of information.
• Reduce employee training costs.
• Centralize the decision-making process.
• Create efficiencies and reduce the time needed to solve problems.
• Combine multiple human experts' intelligence.
• Reduce the number of human errors.
• Give strategic and comparative advantages, creating entry barriers for competitors.
• Review transactions that human experts may overlook.
Disadvantages
• Lack the human common sense needed in some decision making.
• Cannot give the creative responses that human experts can give in unusual circumstances.
• Domain experts cannot always clearly explain their logic and reasoning.
• Challenges in automating complex processes.
• Lack of flexibility and of the ability to adapt to changing environments.
• Inability to recognize when no answer is available.

Knowledge Bases
Knowledge-Based Systems: Definition
• A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic.
• Heuristics in search vs. in KBS: general vs. domain-specific.
• Highly specific domain knowledge.
• Knowledge is separated from how it is used:

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & inference:
• Logic – resolution principle.
• Production rules – backward (top-down, goal-directed) and forward (bottom-up, data-driven).
• Semantic nets & frames – inheritance & advanced reasoning.
• Case-based reasoning – similarity-based.
KBS tools – Shells
• Consist of a KA tool, a database, and a development interface.
• Inductive shells:
– the simplest;
– example cases are represented as a matrix of known data (premises) and resulting effects;
– the matrix is converted into a decision tree or IF-THEN statements;
– examples are selected for the tool.
• Rule-based shells:
– simple to complex;
– IF-THEN rules.
• Hybrid shells:
– sophisticated & powerful;
– support multiple KR paradigms & reasoning schemes;
– generic tools applicable to a wide range of problems.
• Special-purpose shells:
– specifically designed for particular types of problems;


– restricted to specialised problems.
• From scratch:
– requires more time and effort;
– no constraints, unlike shells;
– shells should be investigated first.

Some example KBSs
• DENDRAL (chemistry)
• MYCIN (medicine)
• XCON/R1 (computers)
Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – To provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – To predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – To develop configurations that satisfy the constraints of a design problem, e.g., XCON.
(5) Planning – Both short-term & long-term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – To check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle.
(7) Control – To collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – To train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – To identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages

– Increase availability of expert knowledge: expertise otherwise not accessible; training of future experts.
– Efficient and cost-effective.
– Consistency of answers.
– Explanation of solutions.
– Deal with uncertainty.
Limitations
– Lack of common sense.
– Inflexible; difficult to modify.
– Restricted domain of expertise.
– Lack of learning ability.
– Not always reliable.


6 (a) Compare Distributed databases and conventional databases (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
• Mimic the organisational structure with data.
• Local access and autonomy, without exclusion.
• Cheaper to create and easier to expand.
• Improved availability/reliability/performance, by removing reliance on a central site.
• Reduced communication overhead: most data access is local, which is less expensive and performs better.
• Improved processing power: many machines handle the database, rather than a single server.
Disadvantages of distributed databases:
• More complex to implement and more costly to maintain.
• Security and integrity control are harder; standards and experience are lacking.
• Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and by programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but it requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows


the system to optimize documents by writing entire documents onto contiguous sections of disk — when updated, the entire document can be re-written, rather than bits and pieces being cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes: writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object that precedes the transaction's timestamp TS(Ti).
• If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; that is, a write cannot complete if there are outstanding transactions with an earlier timestamp.
• Every object also has a read timestamp: if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
• The obvious drawback of this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
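A minimal sketch of these read/write timestamp rules in Python (the class and its structures are assumptions for illustration, not a real system's API):

class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (write_ts, value), kept ascending
        self.read_ts = {}    # key -> largest timestamp that has read the key

    def read(self, key, ts):
        # Return the most recent version whose write timestamp precedes ts;
        # reads never block, they just pick the right version.
        candidates = [(w, v) for w, v in self.versions.get(key, []) if w <= ts]
        self.read_ts[key] = max(self.read_ts.get(key, 0), ts)
        return candidates[-1][1] if candidates else None

    def write(self, key, value, ts):
        if ts < self.read_ts.get(key, 0):
            return False     # TS(Ti) < RTS(P): abort and restart the transaction
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()
        return True          # a new version of the object is created

db = MVCCStore()
db.write("Object1", "Foo", ts=1)
db.write("Object1", "Hello", ts=2)
print(db.read("Object1", ts=1))   # -> 'Foo': the old snapshot is still readable
print(db.read("Object1", ts=3))   # -> 'Hello'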

At t1, the state of the DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      –
t0    "Foo"    "Bar"      –

Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
Recovery


(b) Discuss client/server model and mobile databases (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
• In mobile computing, the problems are more difficult, mainly because of:
– the limited and intermittent connectivity afforded by wireless communications;
– the limited life of the power supply (battery);


– the changing topology of the network.
• In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

• The general architecture of a mobile platform is illustrated in Fig. 30.1.
• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
– Fixed hosts are general-purpose computers configured to manage mobile units.
– Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications
– The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
– Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
– Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching,


and seamless roaming throughout a geographical region.
– Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
– Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
– Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
– To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
– Mobile units must be unrestricted throughout the cells of a domain, while maintaining information-access contiguity.
– The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
– Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

– In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
– In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
– Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
– MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
– Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
– Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
– Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
• communication latency;
• intermittent connectivity;
• limited battery life;
• changing client location.
The server may not be able to reach a client:


• A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
• In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
• Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
• The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
• One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
• Servers must keep track of client locations in order to efficiently route messages to them.
• Client data should be stored in the network location that minimizes the traffic necessary to access it.
• The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
• The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features, to meet the requirements of mobile environments.
• The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases include:
• data distribution and replication;
• transaction models;
• query processing;
• recovery and fault tolerance;


• mobile database design;
• location-based services;
• division of labor;
• security.

Application: Intermittently Synchronized Databases
• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
• A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
• A server cannot connect to a client at will.
• Issues of wireless versus wired client connections and power conservation are generally immaterial.
• A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
• A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
• Consider a set of alternative plans.
– Must prune the search space; typically, left-deep plans only.
• Must estimate the cost of each plan that is considered.
– Must estimate the size of the result and the cost for each plan node.
– Key issues: statistics, indexes, operator implementations.
A plan is a tree of relational algebra operations, with a choice of algorithm for each operation.


• Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.
Two main issues:
• For a given query, what plans are considered?
– An algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
• Ideally: we want to find the best plan. Practically: we avoid the worst plans.
• We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
• For each block, the plans considered are:
– all available access methods, for each relation in the FROM clause;
– all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
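A minimal sketch in Python of the plan-space search idea: enumerate left-deep join orders and keep the cheapest under a toy cost model (the relation sizes and the selectivity constant are assumptions, not the System R formulas):

from itertools import permutations

SELECTIVITY = 0.001
sizes = {"Reserves": 100000, "Sailors": 40000, "Boats": 500}   # tuples (assumed)

def plan_cost(order):
    # Left-deep plan: join the next relation on the right of the running result.
    cost, result = 0, sizes[order[0]]
    for rel in order[1:]:
        cost += result                          # toy cost: scan the outer input
        result = result * sizes[rel] * SELECTIVITY
    return cost

best = min(permutations(sizes), key=plan_cost)
print(best, plan_cost(best))   # the join order that minimizes the estimated cost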

8 (a) Discuss multimedia databases in detail (8) (NOV/DEC 2010)
Multimedia Databases
• To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
• The database must handle large-object representation.
• Similarity-based retrieval must be provided by special index structures.
• The database must provide guaranteed, steady retrieval rates for continuous-media data.

Multimedia Data Formats
• Multimedia data are stored and transmitted in compressed form:
– JPEG and GIF are the most widely used formats for image data.
– The MPEG standards for video data exploit commonalities among a sequence of frames to achieve a greater degree of compression.
– MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
– MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
– There are several alternatives for audio encoding:


MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
• The most important types are video and audio data.
• They are characterized by high data volumes and real-time information-delivery requirements:
– Data must be delivered sufficiently fast that there are no gaps in the audio or video.
– Data must be delivered at a rate that does not cause overflow of system buffers.
– Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

• Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.
• Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
• Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
• SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
– New: the answer table can appear in the From clause, i.e., be defined recursively.
– Prolog-style syntax is commonly used.
The Problem with RA and SQL-92
• Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
– This takes us one level down the Assembly hierarchy.
– To find components that are one level deeper (e.g., rim), we need another join.
– To find all components, we need as many joins as there are levels in the given instance.
• For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
• The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
• The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
• The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.
i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers):
– E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
– The set of all integers is a fixpoint of double+.
– The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
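A minimal sketch in Python of computing a least fixpoint by iteration, using a bounded variant of double+ so that the iteration terminates (the bound of 100 is an assumption for illustration):

def double_plus(s):
    # double+(S) = {2x | x in S} union S, restricted to a small universe.
    return {2 * x for x in s if 2 * x <= 100} | s

def least_fixpoint(f, seed=frozenset()):
    # Iterate from the seed; for a monotone f this converges to the least
    # fixpoint that contains the seed.
    current = set(seed)
    while True:
        nxt = f(current)
        if nxt == current:     # f(v) = v: a fixpoint has been reached
            return current
        current = nxt

print(least_fixpoint(double_plus, {5}))   # -> {5, 10, 20, 40, 80}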

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


– Stratum 0: all database tables.
– Stratum I: tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
• Selection: Result(Y) :- R(X, Y), X = c.
• Projection: Result(Y) :- R(X, Y).
• Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
• Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
• Union: Result(X, Y) :- R(X, Y).
  Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2(Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2(Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction to deal with this problem; it is similar to negation.
15.4 Efficient evaluation of recursive queries
• Repeated inferences: When recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: Avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
– For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
– Rewrite the program to use the delta tables, and update the delta tables between iterations.


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
• Add `Magic' filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
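A minimal sketch in Python of seminaive fixpoint evaluation for the Comp program (the Assembly instance is a made-up trike example): each iteration joins Assembly only with delta_comp, the tuples generated in the previous iteration, so no inference is repeated:

assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

# Base rule seeds the fixpoint: Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
comp = {(p, s) for p, s, _ in assembly}
delta_comp = set(comp)
while delta_comp:
    # Recursive rule, rewritten to use the delta table:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    new = {(p, sub) for p, p2, _ in assembly
                    for p2_, sub in delta_comp if p2 == p2_}
    delta_comp = new - comp          # keep only genuinely new tuples
    comp |= delta_comp

print(sorted(comp))   # trike transitively contains frame, wheel, spoke and tire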


Page 47: Database Technology

Interquery ParallelismQueriestransactions execute in parallel with one anotherIncreases transaction throughput used primarily to scale up a transaction processing system to support a larger number of transactions per secondEasiest form of parallelism to support particularly in a shared memory parallel database because even sequential database systems support concurrent processingMore complicated to implement on shared-disk or shared-nothing architectures

Locking and logging must be coordinated by passing messages between processors Data in a local buffer may have been updated at another processor Cache-coherency has to be maintained mdash reads and writes of data in buffer must find

latest version of dataCache Coherency ProtocolExample of a cache coherency protocol for shared disk systems

Before readingwriting to a page the page must be locked in sharedexclusive mode On locking a page the page must be read from disk Before unlocking a page the page must be written to disk if it was modified

More complex protocols with fewer disk readswrites existCache coherency protocols for shared-nothing systems are similar Each database page is assigned a home processor Requests to fetch the page or write it to disk are sent to the homeprocessorIntraquery ParallelismExecution of a single query in parallel on multiple processorsdisks important for speeding up long-running queriesTwo complementary forms of intraquery parallelismIntraoperation Parallelism ndash parallelize the execution of each individual operation in the queryInteroperation Parallelism ndash execute the different operations in a query expression in parallelThe first form scales better with increasing parallelism because the number of tuples processed by each operation is typically more than the number of operations in a queryParallel SortRange-Partitioning SortChoose processors P0 Pm where m _ n -1 to do sortingCreate range-partition vector with m entries on the sorting attributesRedistribute the relation using range partitioning

all tuples that lie in the ith range are sent to processor Pi Pi stores the tuples it received temporarily on disk Di This step requires IO and communication overhead

Each processor Pi sorts its partition of the relation locally

47

Each processors executes same operation (sort) in parallel with other processors without any interaction with the others (data parallelism)Final merge operation is trivial range-partitioning ensures that for 1 jm the key values in processor Pi are all less than the key values in PjParallel External Sort-MergeAssume the relation has already been partitioned among disks D0 Dn-1 Each processor Pi locally sorts the data on disk DiThe sorted runs on each processor are then merged to get the final sorted outputParallelize the merging of sorted runs as follows

The sorted partitions at each processor Pi are range-partitioned across the processors P0 Pm-1

Each processor Pi performs a merge on the streams as they are received to get a single sorted run

The sorted runs on processors P0 Pm-1 are concatenated to get the final result

Parallel JoinThe join operation requires pairs of tuples to be tested to see if they satisfy the join condition and if they do the pair is added to the join outputParallel join algorithms attempt to split the pairs to be tested over several processors Each processor then computes part of the join locallyIn a final step the results from each processor can be collected together to produce the final resultPartitioned Join

For equi-joins and natural joins it is possible to partition the two input relations across the processors and compute the join locally at each processor

Let r and s be the input relations and we want to compute r and s each are partitioned into n partitions denoted r0 r1 rn-1 and s0 s1 sn-1Can use either range partitioning or hash partitioningr and s must be partitioned on their join attributes rA and sB) using the same range-partitioning vector or hash functionPartitions ri and si are sent to processor Pi

Each processor Pi locally computes Any of the standard join methods can be used

48

Fragment-and-Replicate Join
Partitioning is not possible for some join conditions:

e.g. non-equijoin conditions, such as r.A > s.B. For joins where partitioning is not applicable, parallelization can be accomplished by the fragment-and-replicate technique.

The technique is depicted on the next slide. Special case - asymmetric fragment-and-replicate:

One of the relations, say r, is partitioned; any partitioning technique can be used. The other relation, s, is replicated across all the processors. Processor Pi then locally computes the join of ri with all of s, using any join technique.

Both versions of fragment-and-replicate work with any join condition, since every tuple in r can be tested with every tuple in s. Fragment-and-replicate usually has a higher cost than partitioning, since one of the relations (for asymmetric fragment-and-replicate) or both relations have to be replicated. Sometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used:

e.g. say s is small and r is large and already partitioned; it may be cheaper to replicate s across all processors than to repartition r and s on the join attributes.

Partitioned Parallel Hash-Join
Parallelizing the partitioned hash join: assume s is smaller than r, and therefore s is chosen as the build relation. A hash function h1 takes the join-attribute value of each tuple in s and maps this tuple to one of the n processors. Each processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1. Let si denote the tuples of relation s that are sent to processor Pi. As tuples of relation s are received at the destination processors, they are partitioned further using another hash function, h2, which is used to compute the hash-join locally. Once the tuples of s have been distributed, the larger relation r is redistributed across the n processors using the hash function h1.

Let ri denote the tuples of relation r that are sent to processor Pi


As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join. Hash-join optimizations can be applied to the parallel case:

e.g. the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and so avoid the cost of writing them out and reading them back in.
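A compact Python sketch of the two hash functions at work: h1 distributes tuples to processors, h2 drives the local build/probe phases. The helper names are illustrative, not a real DBMS API, and a single key selector is assumed to extract the join attribute from both relations:

from collections import defaultdict

def parallel_hash_join(r, s, key, n_procs=4, n_buckets=8):
    h1 = lambda v: hash(("h1", v)) % n_procs    # distributes across processors
    h2 = lambda v: hash(("h2", v)) % n_buckets  # partitions locally on each Pi
    s_at, r_at = defaultdict(list), defaultdict(list)
    for tup in s:                       # build relation s distributed by h1
        s_at[h1(key(tup))].append(tup)
    for tup in r:                       # larger relation r redistributed by h1
        r_at[h1(key(tup))].append(tup)
    out = []
    for i in range(n_procs):            # each iteration = one processor's work
        buckets = defaultdict(list)
        for tup in s_at[i]:             # build phase: partition si further by h2
            buckets[h2(key(tup))].append(tup)
        for tup in r_at[i]:             # probe phase: search only the matching
            for m in buckets[h2(key(tup))]:  # h2 bucket
                if key(m) == key(tup):
                    out.append(tup + m)
    return out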

Parallel Nested-Loop Join
Assume that

relation s is much smaller than relation r, that r is stored by partitioning, and that there is an index on a join attribute of relation r at each of the partitions of relation r.

Use asymmetric fragment-and-replicate, with relation s being replicated and the existing partitioning of relation r retained. Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r. Each processor Pi then performs an indexed nested-loop join of relation s with the ith partition of relation r.

Interoperator Parallelism
Pipelined parallelism:

Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4. Set up a pipeline that computes the three joins in parallel:

Let P1 be assigned the computation of temp1 = r1 ⋈ r2, and P2 be assigned the computation of temp2 = temp1 ⋈ r3,

and P3 be assigned the computation of temp2 ⋈ r4. Each of these operations can execute in parallel, sending the result tuples it computes to the

next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism

Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.

Let P1 be assigned the computation of temp1 = r1 ⋈ r2,

and P2 be assigned the computation of temp2 = r3 ⋈ r4, and P3 be assigned the computation of temp1 ⋈ temp2. P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.

We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.

Independent parallelism does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.
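Pipelined evaluation can be mimicked in Python with generators: each join yields result tuples to the next operation as they are produced, rather than materializing its full result first. The tiny relations and index positions are invented for the example:

def pipelined_join(left_stream, right, l_key, r_key):
    index = {}
    for tup in right:                 # build an index on the stored relation
        index.setdefault(r_key(tup), []).append(tup)
    for tup in left_stream:           # probe as tuples arrive from upstream,
        for m in index.get(l_key(tup), []):   # emitting results immediately
            yield tup + m

r1 = [(1, "a"), (2, "b")]; r2 = [(1, "x")]; r3 = [("x", 10)]; r4 = [(10, "z")]
temp1 = pipelined_join(iter(r1), r2, lambda t: t[0], lambda t: t[0])  # P1
temp2 = pipelined_join(temp1, r3, lambda t: t[3], lambda t: t[0])     # P2
final = pipelined_join(temp2, r4, lambda t: t[5], lambda t: t[0])     # P3
print(list(final))   # pulling on `final` pulls on temp2, which pulls on temp1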

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases. Cost models are more complicated, since we must take into account partitioning costs, and issues such as skew and resource contention.

When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem:

e.g. allocating more processors than optimal can result in high communication overhead. Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.

Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.

(b) Discuss the Rules, Knowledge Bases and Image Databases (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems, and artificial intelligence. Rule-based systems are:

- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering - building an expert system
Knowledge Engineers - the people who build the system
Knowledge Representation - the symbols used to represent the knowledge
Factual Knowledge - knowledge of a particular task domain that is widely shared


Heuristic Knowledge - more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solve problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components (a sketch of the interpreter's cycle follows the list):

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface, or other connection to the outside world, through which input and output signals are received and sent.
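A minimal Python sketch of the match-resolve-act cycle, with rules as (condition-set, action-set) pairs and a first-match conflict-resolution policy; the medical facts are invented:

def run(rules, memory):
    while True:
        # Match: collect instantiations whose conditions all hold and whose
        # actions have not already been added (to avoid re-firing forever)
        conflict_set = [(cond, action) for cond, action in rules
                        if cond <= memory and not action <= memory]
        if not conflict_set:            # Conflict-resolution: halt if none fire
            return memory
        _, action = conflict_set[0]     # pick one instantiation (policy: first)
        memory |= action                # Act: update working memory, then loop

rules = [({"fever", "rash"}, {"suspect measles"}),
         ({"suspect measles"}, {"order blood test"})]
print(sorted(run(rules, {"fever", "rash"})))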

Components of a Rule-Based System
- Set of Rules - derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer - decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter - interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining - starts from a set of conditions and moves towards some conclusion.
- Backward-chaining - starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.
Both problem-solving methods are built into inference engines or inference procedures.

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems Definition:

A system that draws upon the knowledge of human experts, captured in a knowledge-base, to solve problems that normally require human expertise.

- Heuristic rather than algorithmic
- Heuristics in search vs. in KBS: general vs. domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because: the reasoning mechanism needs to be as stable as possible; the knowledge base must be able to grow and change as knowledge is added; and this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR formalism             Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity based

KBS tools - Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells

  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special purpose shells
  - specifically designed for particular types of problems
  - restricted to specialised problems
- Scratch
  - requires more time and effort
  - no constraints like shells
  - shells should be investigated first

Some example KBSs
DENDRAL (chemical)
MYCIN (medicine)
XCON/R1 (computer)

Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure.
(2) Interpretation - To provide an understanding of a situation from available information, e.g. DENDRAL.
(3) Prediction - To predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT.
(4) Design - To develop configurations that satisfy constraints of a design problem, e.g. XCON.
(5) Planning - Both short term & long term, in areas like project management, product development or financial planning, e.g. HRM.
(6) Monitoring - To check performance & flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control - To collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment.
(8) Instruction - To train students and correct their performance, e.g. give medical students experience diagnosing illness.
(9) Debugging - To identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors.

Advantages
- Increase availability of expert knowledge:
  - expertise otherwise not accessible
  - training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible: difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages:
- mimics organisational structure with its data
- local access and autonomy without exclusion
- cheaper to create and easier to expand
- improved availability, reliability and performance, by removing reliance on a central site
- reduced communication overhead: most data access is local, which is less expensive and performs better
- improved processing power: many machines handle the database, rather than a single server

Disadvantages:
- more complex to implement and more costly to maintain
- security and integrity control are more difficult
- standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. Read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and a transaction (Ti) is allowed to read the most recent version of an object which precedes the transaction timestamp, TS(Ti).
If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e. TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3


t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"      -
t0    "Foo"    "Bar"      -

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2: the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
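A small Python sketch of the timestamp rules above; only the read-timestamp check is modelled, and transactions are bare integer timestamps:

class MVObject:
    def __init__(self, value, ts=0):
        self.versions = [(ts, value)]   # (write timestamp, value), ascending
        self.rts = ts                   # largest timestamp that read the object

    def read(self, ts):
        self.rts = max(self.rts, ts)
        # most recent version whose write timestamp precedes TS(Ti)
        return max(v for v in self.versions if v[0] <= ts)[1]

    def write(self, ts, value):
        if ts < self.rts:               # TS(Ti) < RTS(P): a later transaction
            raise Exception("abort and restart transaction %d" % ts)
        self.versions.append((ts, value))   # otherwise create a new version
        self.versions.sort()

p = MVObject("Foo")        # state as of t0
p.write(1, "Hello")        # t1 supersedes "Foo"; the old version is kept
print(p.read(1))           # a reader at t1 sees "Hello"
print(p.read(0))           # a long-running reader at t0 still sees "Foo"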

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
Recent advances in portable and wireless technology led to mobile computing, a new dimension in data communication and processing.
Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.
There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
Some of the software problems - which may involve data management, transaction management, and database recovery - have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)
- The changing topology of the network
In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 30.1.
It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications -
The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
Other characteristics that distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships -
Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
Transaction processing and data-consistency control become more difficult, since there is no central control in this architecture.
Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client:
A client may be unreachable because it is dozing - in an energy-conserving state in which many subsystems are shut down - or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate.
Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
Mobile computing poses challenges for servers as well as clients:
The latency involved in wireless communication makes scalability a problem. Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
Client mobility also poses many data management challenges:
Servers must keep track of client locations in order to efficiently route messages to them.
Client data should be stored in the network location that minimizes the traffic necessary to access it.
The act of moving between cells must be transparent to the client. The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features, to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect - through a process known in industry as synchronization of a client with a server - they receive a batch of updates to be installed on their local database.
The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast - one-on-one communication between the server and the client - or multicast - one sender or server may periodically communicate with a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to, based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, only left-deep plans are considered.
- Estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operations, with a choice of algorithm for each operation.
- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered?
  - Algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?
Ideally: we want to find the best plan. Practically: we avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e. all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
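A toy Python enumeration of left-deep join orders in the spirit of System R. The Sailors/Reserves cardinalities follow from the page counts above; the Boats relation, the uniform selectivity, and the cost model are invented, and a real optimizer estimates costs from statistics and prunes with dynamic programming:

from itertools import permutations

def best_left_deep_plan(relations, selectivity=0.01):
    best_cost, best_order = float("inf"), None
    for order in permutations(relations):   # outer-to-inner join order
        size = order[0][1]                  # cardinality of the leftmost reln
        cost = 0.0
        for name, card in order[1:]:        # inner reln is always a base reln
            size = size * card * selectivity    # est. intermediate result size
            cost += size                    # charge for each intermediate
        if cost < best_cost:
            best_cost, best_order = cost, order
    return best_order, best_cost

order, cost = best_left_deep_plan(
    [("Sailors", 40000), ("Reserves", 100000), ("Boats", 100)])
print([name for name, _ in order], cost)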

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)
Multimedia databases
To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database,
- rather than storing them outside the database, in a file system.
The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. Guaranteed steady retrieval rates must be provided for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- MPEG is the standard for video data; it uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape. Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality. Compresses 1 minute of audio-video to approximately 17 MB.
Several alternatives exist for audio encoding:


- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: e.g. a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals,
- and must guarantee end-to-end delivery rates.
Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
Head-end terminals - used to view multimedia data:
- PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g. identify similar designs for registering a new trademark.
- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g. test user input against stored commands.
- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: The answer table can appear in the From clause, i.e. be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- this takes us one level down the Assembly hierarchy;
- to find components that are one level deeper (e.g. rim), we need another join;
- to find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e. D is the set of all sets of integers):
  - e.g. double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}
  - the set of all integers is a fixpoint of double+
  - the set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint
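double+ acts on (possibly infinite) sets, so here is a finite Python analogue of naive fixpoint iteration: a monotone function applied repeatedly until f(v) = v. The function and the bound x < 5 are invented so that the iteration converges:

def f(s):
    return s | {x + 1 for x in s if x < 5}

v = {0}
while f(v) != v:     # naive fixpoint iteration: apply f until nothing changes
    v = f(v)
print(sorted(v))     # [0, 1, 2, 3, 4, 5]: the least fixpoint containing {0}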

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X=c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part
The same query in Datalog notation:
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: When recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive Fixpoint Evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations. For the Comp program, the rewritten recursive rule is:
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
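A Python sketch of seminaive evaluation for this Comp (transitive closure) program; the Assembly facts are a made-up trike instance, with Qty omitted for brevity:

assembly = {("trike", "wheel"), ("trike", "frame"),
            ("wheel", "spoke"), ("wheel", "tire")}   # (Part, Subpart)

comp = set(assembly)        # base case: Comp contains the Assembly facts
delta = set(assembly)       # delta_Comp: tuples derived in the last iteration
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
    new = {(part, subpt)
           for part, part2 in assembly
           for part2_, subpt in delta if part2 == part2_} - comp
    comp |= new
    delta = new             # the next round only joins freshly derived tuples,
print(sorted(comp))         # so no inference is ever repeated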


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add `Magic' filters: Modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the `Magic' relations: We must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 48: Database Technology

Each processors executes same operation (sort) in parallel with other processors without any interaction with the others (data parallelism)Final merge operation is trivial range-partitioning ensures that for 1 jm the key values in processor Pi are all less than the key values in PjParallel External Sort-MergeAssume the relation has already been partitioned among disks D0 Dn-1 Each processor Pi locally sorts the data on disk DiThe sorted runs on each processor are then merged to get the final sorted outputParallelize the merging of sorted runs as follows

The sorted partitions at each processor Pi are range-partitioned across the processors P0 Pm-1

Each processor Pi performs a merge on the streams as they are received to get a single sorted run

The sorted runs on processors P0 Pm-1 are concatenated to get the final result

Parallel JoinThe join operation requires pairs of tuples to be tested to see if they satisfy the join condition and if they do the pair is added to the join outputParallel join algorithms attempt to split the pairs to be tested over several processors Each processor then computes part of the join locallyIn a final step the results from each processor can be collected together to produce the final resultPartitioned Join

For equi-joins and natural joins it is possible to partition the two input relations across the processors and compute the join locally at each processor

Let r and s be the input relations and we want to compute r and s each are partitioned into n partitions denoted r0 r1 rn-1 and s0 s1 sn-1Can use either range partitioning or hash partitioningr and s must be partitioned on their join attributes rA and sB) using the same range-partitioning vector or hash functionPartitions ri and si are sent to processor Pi

Each processor Pi locally computes Any of the standard join methods can be used

48

Fragment-and-Replicate JoinPartitioning not possible for some join conditions

eg non-equijoin conditions such as rA gt sBFor joins were partitioning is not applicable parallelization can be accomplished by fragment and replicate technique

Depicted on next slideSpecial case ndash asymmetric fragment-and-replicate

One of the relations say r is partitioned any partitioning technique can be used The other relation s is replicated across all the processors Processor Pi then locally computes the join of ri with all of s using any join technique

Both versions of fragment-and-replicate work with any join condition since every tuple in r can be tested with every tuple in sUsually has a higher cost than partitioning since one of therelations (for asymmetric fragment-and-replicate) or both relations have to be replicatedSometimes asymmetric fragment-and-replicate is preferable even though partitioning could be used

Eg say s is small and r is large and already partitioned It may becheaper to replicate s across all processors rather than repartition r and s on the join attributesPartitioned Parallel Hash-JoinParallelizing partitioned hash joinAssume s is smaller than r and therefore s is chosen as the build relationA hash function h1 takes the join attribute value of each tuple in s and maps this tuple to one of the n processorsEach processor Pi reads the tuples of s that are on its disk Di and sends each tuple to the appropriate processor based on hash function h1 Let si denote the tuples of relation s that aresent to processor PiAs tuples of relation s are received at the destination processors they are partitioned further using another hash function h2 which is used to compute the hash-join locallyOnce the tuples of s have been distributed the larger relation r is redistributed across the m processors using the hash function h1

Let ri denote the tuples of relation r that are sent to processor Pi

49

As the r tuples are received at the destination processors they are repartitioned using the function h2Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and s of r and s to produce a partition of the final result of the hash-joinHash-join optimizations can be applied to the parallel case

eg the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory and avoid the cost of writing them and reading them back in

Parallel Nested-Loop JoinAssume that

relation s is much smaller than relation r and that r is stored by partitioning there is an index on a join attribute of relation r at each of the

partitions of relation rUse asymmetric fragment-and-replicate with relation s being replicated and using the existing partitioning of relation rEach processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor PiAt the end of this phase relation s is replicated at all sites that store tuples of relation rEach processor Pi performs an indexed nested-loop join of relation s with the ith partition of relation rInteroperator ParallelismPipelined parallelism

Consider a join of four relations Set up a pipeline that computes the three joins in parallel

Let P1 be assigned the computation of And P2 be assigned the computation of

And P3 be assigned the computation of Each of these operations can execute in parallel sending result tuples it computes to the

next operation even as it is computing further results Provided a pipelineable join evaluation algorithm is used

Independent Parallelism

Consider a join of four relations

Let P1 be assigned the computation of

And P2 be assigned the computation of And P3 be assigned the computation of P1 and P2 can work independently in parallelP3 has to wait for input from P1 and P2

Can pipeline output of P1 and P2 to P3 combining independent parallelism and pipelined parallelism

Does not provide a high degree of parallelism useful with a lower degree of parallelism less useful in a highly parallel system

Query OptimizationQuery optimization in parallel databases is significantly more complex than query optimization in sequential databasesCost models are more complicated since we must take into account partitioning costs and issues such as skew and resource contention

50

When scheduling execution tree in parallel system must decide How to parallelize each operation and how many processors to use for it What operations to pipeline what operations to execute independently in parallel and

what operations to execute sequentially one after the otherDetermining the amount of resources to allocate for each operation is a problem

Eg allocating more processors than optimal can result in high communication overheadLong pipelines should be avoided as the final operation may wait a lot for inputs while holding precious resourcesText DatabasesA text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing printing or downloading In addition to text documents images are often included such as graphs maps photos and diagrams A text database is searchable by keyword phrase or bothWhen an item in a text database is viewed it may appear in ASCII format (as a text file with the txt extension) as a word-processed file (requiring a program such as Microsoft Word) as an PDF) file When a document appears as a PDF file it is usually a scanned hardcopy of the original article chapter or bookA text databases are used by college and university libraries as a convenience to their students and staff Full-text databases are ideally suited to online courses of study where the student remains at home and obtains course materials by downloading them from the Internet Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item Full-text databases are also used by some corporations law offices and government agencies

(b) Discuss the Rules Knowledge Bases and Image Databases (JUNE 2010)Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way They are often used in artificial intelligence applications and researchRule-based systems are specialized software that encapsulates ldquoHuman Intelligencerdquo like knowledge there by make intelligent decisions quickly and in repeatable form Also known as Rule Based Systems Expert Systems amp Artificial IntelligenceRule based systems are

ndash Knowledge based systemsndash Part of the Artificial Intelligence fieldndash Computer programs that contain some subject-specific knowledge of one or more

human expertsndash Made up of a set of rules that analyze user supplied information about a specific

class of problems ndash Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering ndash building an expert system Knowledge Engineers ndash the people who build the system Knowledge Representation ndash the symbols used to represent the knowledge Factual Knowledge ndash knowledge of a particular task domain that is widely shared

51

Heuristic Knowledge ndash more judgmental knowledge of performance in a task domain Uses of Rule based Systems

Very useful to companies with a high-level of experience and expertise that cannot easily be transferred to other members

Solves problems that would normally be tackled by a medical or other professional Currently used in fields such as accounting medicine process control financial service

production and human resourcesApplicationsA classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices For example an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms or select tactical moves to play a gameRule-based systems can be used to perform lexical analysis to compile or interpret computer programs or in natural language processingRule-based programming attempts to derive execution instructions from a starting set of data and rules This is a more indirect method than that employed by an imperative programming language which lists execution steps sequentiallyConstructionA typical rule-based system has four basic components

A list of rules or rule base which is a specific type of knowledge base An inference engine or semantic reasoner which infers information or takes action based

on the interaction of input and the rule base The interpreter executes a production system program by performing the following match-resolve-act cycle Match In this first phase the left-hand sides of all productions are matched against

the contents of working memory As a result a conflict set is obtained which consists of instantiations of all satisfied productions An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production

Conflict-Resolution In this second phase one of the production instantiations in the conflict set is chosen for execution If no productions are satisfied the interpreter halts

Act In this third phase the actions of the production selected in the conflict-resolution phase are executed These actions may change the contents of working memory At the end of this phase execution returns to the first phase

Temporary working memory A user interface or other connection to the outside world through which input and output

signals are received and sentComponents of an Rule Based System

Set of Rules ndash derived from the knowledge base and used by the interpreter to evaluate the inputted data

Knowledge Engineer ndash decides how to represent the experts knowledge and how to build the inference engine appropriately for the domain

Interpreter ndash interprets the inputted data and draws a conclusion based on the users responses

52

Problem-solving Models Forward-chaining ndash starts from a set of conditions and moves towards some conclusion Backward-chaining ndash starts with a list of goals and the works backwards to see if there

is any data that will allow it to conclude any of these goals Both problem-solving methods are built into inference engines or inference procedures

Advantages Provide consistent answers for repetitive decisions processes and tasks Hold and maintain significant levels of information Reduce employee training costs Centralize the decision making process Create efficiencies and reduce the time needed to solve problems Combine multiple human expert intelligences Reduce the amount of human errors Give strategic and comparative advantages creating entry barriers to competitors Review transactions that human experts may overlook

Disadvantages Lack human common sense needed in some decision making Will not be able to give the creative responses that human experts can give in unusual

circumstances Domain experts cannot always clearly explain their logic and reasoning Challenges of automating complex processes Lack of flexibility and ability to adapt to changing environments Not being able to recognize when no answer is available

Knowledge BasesKnowledge-based Systems Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

Heuristic rather than algorithmic Heuristics in search vs in KBS general vs domain-specific Highly specific domain knowledge Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture

53

The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR – Inference
Logic – Resolution principle
Production rules – backward (top-down, goal directed), forward (bottom-up, data-driven)
Semantic nets & Frames – Inheritance & advanced reasoning
Case-based Reasoning – Similarity based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
Inductive Shells
- simplest
- example cases represented as a matrix of known data (premises) and resulting effects
- matrix converted into a decision tree or IF-THEN statements
- examples selected for the tool
Rule-based shells
- simple to complex
- IF-THEN rules
Hybrid shells
- sophisticated & powerful
- support multiple KR paradigms & reasoning schemes
- generic tool applicable to a wide range
Special purpose shells
- specifically designed for particular types of problems
- restricted to specialised problems
Scratch
- require more time and effort
- no constraints like shells
- shells should be investigated first

Some example KBSs
- DENDRAL (chemical)
- MYCIN (medicine)
- XCON/RI (computer)

Typical tasks of KBS
(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions. e.g. diagnose reasons for engine failure
(2) Interpretation - To provide an understanding of a situation from available information. e.g. DENDRAL
(3) Prediction - To predict a future state from a set of data or observations. e.g. Drilling Advisor, PLANT
(4) Design - To develop configurations that satisfy constraints of a design problem. e.g. XCON
(5) Planning - Both short term & long term, in areas like project management, product development or financial planning. e.g. HRM
(6) Monitoring - To check performance & flag exceptions. e.g. a KBS monitors radar data and estimates the position of the space shuttle
(7) Control - To collect and evaluate evidence and form opinions on that evidence. e.g. control a patient's treatment
(8) Instruction - To train students and correct their performance. e.g. give medical students experience diagnosing illness
(9) Debugging - To identify and prescribe remedies for malfunctions. e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages
- Increase availability of expert knowledge / expertise not accessible / training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible, difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases:
- mimics organisational structure with data
- local access and autonomy without exclusion
- cheaper to create and easier to expand
- improved availability/reliability/performance by removing reliance on a central site
- reduced communication overhead: most data access is local, so it is less expensive and performs better
- improved processing power: many machines handle the database, rather than a single server

Disadvantages compared with conventional (centralised) databases:
- more complex to implement
- more costly to maintain
- security and integrity control are more difficult
- standards and experience are lacking
- design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk—when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version but, at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction (Ti) wants to write to an object, and there is another transaction (Tk), the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1 the state of a DB could be:

Time  Object1  Object2
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time  Object1  Object2    Object3
t2    "Hello"  (deleted)  "Foo-Bar"
t1    "Hello"  "Bar"
t0    "Foo"    "Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
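The snapshot behaviour above can be observed directly in an MVCC-based SQL system. The following is a minimal sketch, assuming a PostgreSQL-style database and an illustrative objects(name, value) table (the table and the session labels are hypothetical, not part of the original example):

-- Session A: a long-running read transaction pinned to one snapshot.
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT name, value FROM objects;  -- sees Object1 = 'Hello', Object2 = 'Bar'

-- Session B: a concurrent update that commits while Session A is still open.
BEGIN;
DELETE FROM objects WHERE name = 'Object2';
INSERT INTO objects (name, value) VALUES ('Object3', 'Foo-Bar');
COMMIT;

-- Session A: the same query, inside the same transaction.
SELECT name, value FROM objects;  -- still sees the old snapshot; the read was
COMMIT;                           -- never blocked and took no lock on the rows

Session B's versions become visible only to transactions that start after its commit, exactly as in the t1/t2 tables above.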

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can be unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client.


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
- Proxies for unreachable components are added to the architecture.
- For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
- Mobile computing poses challenges for servers as well as clients:
  - The latency involved in wireless communication makes scalability a problem.
  - Since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible.
  - A server can simply broadcast data periodically.
  - Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client.
  - The server must be able to gracefully divert the shipment of data from one base to another, without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
Data management issues as applied to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based service
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- Must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- Two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically left-deep plans only.
  - Must estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of RA operators, with a choice of algorithm for each operator.
  - Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered?
    - An algorithm to search the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
- Ideally: want to find the best plan. Practically: avoid the worst plans.
- We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
Reserves:
- Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
- Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
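For instance (an illustrative query on the schema above, not part of the original answer), the following SQL parses into two query blocks; the nested block is optimized separately, and the outer block treats its result as a value:

SELECT S.sname
FROM Sailors S
WHERE S.rating > (SELECT AVG(S2.rating) FROM Sailors S2);

-- Outer block:  SELECT S.sname FROM Sailors S WHERE S.rating > <nested result>
-- Nested block: SELECT AVG(S2.rating) FROM Sailors S2

Since this nested block is uncorrelated with the outer block, it can be evaluated once; a correlated nested block would conceptually be re-evaluated once per outer tuple.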

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)

Multimedia databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF: the most widely used formats for image data.
  - MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape.
  - Stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality.
  - Compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives for audio encoding:
  - MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data

- The most important types are video and audio data.
- Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals.
  - They must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data.
  - PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: Two pictures or images that are slightly different as represented in the database may be considered the same by a user.
  - e.g. identify similar designs for registering a new trademark.
- Audio data: Speech-based user interfaces allow the user to give a command or identify a data item by speaking.
  - e.g. test user input against stored commands.
- Handwritten data: Identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
- Yes, by adding recursion.
Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries

15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
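Such a query can be written directly in SQL:1999 with a recursive block. Below is a minimal sketch, assuming the Assembly(Part, Subpart, Qty) relation used later in this answer and an acyclic part hierarchy (so the UNION ALL terminates):

WITH RECURSIVE Comp (Part, Subpart) AS (
    SELECT A.Part, A.Subpart        -- base case: direct subparts
    FROM Assembly A
  UNION ALL
    SELECT C.Part, A.Subpart        -- recursive case: one level deeper
    FROM Comp C, Assembly A
    WHERE A.Part = C.Subpart
)
SELECT Subpart FROM Comp WHERE Part = 'trike';  -- all components, at every level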

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
  ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
  Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
  Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance; trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y).
       Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2 (Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem; similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations.


  Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
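One iteration of this rewritten rule maps directly to SQL. The following is a rough sketch (the new_delta helper table and the surrounding loop are illustrative assumptions; Assembly columns are taken as (Part, Subpart, Qty)):

-- One seminaive step: join the previous iteration's new tuples (delta_Comp)
-- with Assembly, keeping only Comp tuples not derived in earlier iterations.
INSERT INTO new_delta (Part, Subpt)
SELECT A.Part, D.Subpt
FROM Assembly A, delta_Comp D
WHERE A.Subpart = D.Part
EXCEPT
SELECT Part, Subpt FROM Comp;
-- Then add new_delta to Comp, make it the next delta_Comp,
-- and repeat until new_delta comes up empty.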

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
  SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
  SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation, to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
  Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
  Magic_SL (spoke).
  SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
  SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Act In this third phase the actions of the production selected in the conflict-resolution phase are executed These actions may change the contents of working memory At the end of this phase execution returns to the first phase

Temporary working memory A user interface or other connection to the outside world through which input and output

signals are received and sentComponents of an Rule Based System

Set of Rules ndash derived from the knowledge base and used by the interpreter to evaluate the inputted data

Knowledge Engineer ndash decides how to represent the experts knowledge and how to build the inference engine appropriately for the domain

Interpreter ndash interprets the inputted data and draws a conclusion based on the users responses

52

Problem-solving Models Forward-chaining ndash starts from a set of conditions and moves towards some conclusion Backward-chaining ndash starts with a list of goals and the works backwards to see if there

is any data that will allow it to conclude any of these goals Both problem-solving methods are built into inference engines or inference procedures

Advantages Provide consistent answers for repetitive decisions processes and tasks Hold and maintain significant levels of information Reduce employee training costs Centralize the decision making process Create efficiencies and reduce the time needed to solve problems Combine multiple human expert intelligences Reduce the amount of human errors Give strategic and comparative advantages creating entry barriers to competitors Review transactions that human experts may overlook

Disadvantages Lack human common sense needed in some decision making Will not be able to give the creative responses that human experts can give in unusual

circumstances Domain experts cannot always clearly explain their logic and reasoning Challenges of automating complex processes Lack of flexibility and ability to adapt to changing environments Not being able to recognize when no answer is available

Knowledge BasesKnowledge-based Systems Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

Heuristic rather than algorithmic Heuristics in search vs in KBS general vs domain-specific Highly specific domain knowledge Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture

53

The inference engine and knowledge base are separated because the reasoning mechanism needs to be as stable as possible the knowledge base must be able to grow and change as knowledge is added this arrangement enables the system to be built from or converted to a shell It is reasonable to produce a richer more elaborate description of the typical expert

system A more elaborate description which still includes the components that are to be found in

almost any real-world system would look like thisKnowledge representation formalisms amp InferenceKR InferenceLogic Resolution principleProduction rules backward (top-down goal directed)

forward (bottom-up data-driven)Semantic nets amp Frames Inheritance amp advanced reasoningCase-based Reasoning Similarity basedKBS tools ndash Shells- Consist of KA Tool Database amp Development Interface- Inductive Shells

- simplest - example cases represented as matrix of known data(premises) and resulting effects - matrix converted into decision tree or IF-THEN statements- examples selected for the tool

Rule-based shells - simple to complex - IF-THEN rules

Hybrid shells - sophisticate amp powerful - support multiple KR paradigms amp reasoning schemes - generic tool applicable to a wide range

Special purpose shells - specifically designed for particular types of problems

54

- restricted to specialised problems-Scratch

- require more time and effort - no constraints like shells - shells should be investigated first

Some example KBSsDENDRAL (chemical)MYCIN (medicine)XCONRI (computer)Typical tasks of KBS(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions

eg diagnose reasons for engine failure(2) Interpretation - To provide an understanding of a situation from available information eg DENDRAL(3) Prediction - To predict a future state from a set of data or observations eg Drilling Advisor PLANT(4) Design - To develop configurations that satisfy constraints of a design problem eg XCON(5) Planning - Both short term amp long term in areas like project management product development or financial planning eg HRM(6) Monitoring - To check performance amp flag exceptions eg KBS monitors radar data and estimates the position of the space shuttle(7) Control - To collect and evaluate evidence and form opinions on that evidence

eg control patientrsquos treatment(8) Instruction - To train students and correct their performance eg give medical students experience diagnosing illness (9) Debugging - To identify and prescribe remedies for malfunctions

eg identify errors in an automated teller machine network and ways to correct the errorsAdvantages

- Increase availability of expert knowledgeexpertise not accessibletraining future experts

- Efficient and cost effective- Consistency of answers- Explanation of solution- Deal with uncertainty

Limitations- Lack of common sense- Inflexible Difficult to modify

- Restricted domain of expertise- Lack of learning ability- Not always reliable

55

6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

mimics organisational structure with data

local access and autonomy without exclusion

cheaper to create and easier to expand

improved availabilityreliabilityperformance by removing reliance on a central site

Reduced communication overhead

Most data access is local less expensive and performs better Improved processing power

Many machines handling the database rather than a single server more complex to implement more costly to maintain security and integrity control standards and experience are lacking Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages(NOVDEC 2010)

Multi-Version Locks Multiversion concurrency control (abbreviated MCC or MVCC) in the database field of computer science is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memoryFor instance a database will implement updates not by deleting an old piece of data and overwriting it with a new one but instead by marking the old data as obsolete and adding the newer version Thus there are multiple versions stored but only one is the latest This allows the database to avoid overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old obsolete data objects For a document-oriented database such as CouchDB Riak or MarkLogic Server it also allows

56

the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates.
- The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
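A minimal sketch of how a client might install a server-shipped batch of updates at synchronization time; the Items replica table, the Server_Updates staging table, and their columns are hypothetical, and MERGE support varies by DBMS (it was standardized after SQL:1999):

    MERGE INTO Items I                          -- client's local replica
    USING Server_Updates U                      -- batch received from the server
    ON (I.id = U.id)
    WHEN MATCHED THEN
        UPDATE SET I.val = U.val                -- apply changed rows
    WHEN NOT MATCHED THEN
        INSERT (id, val) VALUES (U.id, U.val);  -- install new rows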

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. We must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
Two main issues:
- For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
- How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we avoid the worst plans. We will study the System R approach.

Schema for Examples
    Sailors (sid: integer, sname: string, rating: integer, age: real)
    Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple. For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
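As a purely illustrative example on the Sailors/Reserves schema above (the predicate values are made up), the following query parses into an outer block and one nested block; the nested block is optimized separately and treated as a subroutine by the outer block:

    SELECT S.sname
    FROM   Sailors S, Reserves R
    WHERE  S.sid = R.sid
      AND  R.bid = 100
      AND  S.rating > (SELECT MAX(S2.rating)   -- nested query block
                       FROM   Sailors S2
                       WHERE  S2.age < 30);

For the outer block, the optimizer would consider all access methods for Sailors and Reserves and all left-deep join orders over them, in the System R style described above.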

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.
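A minimal sketch of keeping the media object itself inside the database, assuming standard SQL large-object types; the table and column names are illustrative only:

    CREATE TABLE Video (
        video_id   INTEGER PRIMARY KEY,
        title      VARCHAR(100),
        mime_type  VARCHAR(40),    -- e.g. 'video/mpeg'
        content    BLOB            -- the large object, managed by the DBMS
    );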

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF are the most widely used formats for image data.
- The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- There are several alternatives for audio encoding:


MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals – used to view multimedia data – are PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.
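In SQL terms, similarity-based retrieval is typically a nearest-neighbour query over stored feature vectors; the sketch below is illustrative only, with img_distance standing for a hypothetical user-defined distance function and a hypothetical Trademarks table:

    SELECT T.mark_id, T.owner
    FROM   Trademarks T
    ORDER  BY img_distance(T.features, :query_features)  -- hypothetical UDF
    FETCH  FIRST 10 ROWS ONLY;                           -- ten most similar designs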

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
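A recursive query, by contrast, covers all levels at once. A sketch in SQL:1999 notation, assuming an Assembly(Part, Subpart, Qty) relation (some systems require UNION ALL in the recursive step):

    WITH RECURSIVE Comp (Part, Subpart) AS (
          SELECT A.Part, A.Subpart              -- base case: direct subparts
          FROM   Assembly A
        UNION
          SELECT A.Part, C.Subpart              -- recursive case: deeper subparts
          FROM   Assembly A, Comp C
          WHERE  A.Subpart = C.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike';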

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
    Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
- E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
15.3 Recursive Queries with Negation
    Big(Part)   :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
    Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
    Selection:      Result(Y) :- R(X, Y), X = c.
    Projection:     Result(Y) :- R(X, Y).
    Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
    Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
    Union:          Result(X, Y) :- R(X, Y).
                    Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
    WITH Big2(Part) AS
           (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
         Small2(Part) AS
           ((SELECT A2.Part FROM Assembly A2)
            EXCEPT
            (SELECT B1.Part FROM Big2 B1))
    SELECT * FROM Big2 B2;
15.3.3 Aggregate Operations
    SELECT   A.Part, SUM(A.Qty)
    FROM     Assembly A
    GROUP BY A.Part;
    NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.
15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations.


    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
    SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
    Magic_SL(spoke).
    Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
    SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 50: Database Technology

As the r tuples are received at the destination processors, they are repartitioned using the function h2. Each processor Pi executes the build and probe phases of the hash-join algorithm on the local partitions ri and si of r and s, to produce a partition of the final result of the hash-join.
Hash-join optimizations can be applied to the parallel case:

e.g., the hybrid hash-join algorithm can be used to cache some of the incoming tuples in memory, and so avoid the cost of writing them out and reading them back in.

Parallel Nested-Loop Join
Assume that:
- relation s is much smaller than relation r, and that r is stored by partitioning;
- there is an index on a join attribute of relation r at each of the partitions of relation r.
Use asymmetric fragment-and-replicate, with relation s being replicated and with the existing partitioning of relation r retained.
Each processor Pj where a partition of relation s is stored reads the tuples of relation s stored in Dj and replicates the tuples to every other processor Pi. At the end of this phase, relation s is replicated at all sites that store tuples of relation r.
Each processor Pi then performs an indexed nested-loop join of relation s with the ith partition of relation r.
Interoperator Parallelism
Pipelined parallelism:

- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
- Set up a pipeline that computes the three joins in parallel:
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = temp1 ⋈ r3,
  - and P3 be assigned the computation of temp2 ⋈ r4.
- Each of these operations can execute in parallel, sending result tuples it computes to the next operation even as it is computing further results, provided a pipelineable join evaluation algorithm is used.

Independent Parallelism
- Consider a join of four relations: r1 ⋈ r2 ⋈ r3 ⋈ r4.
  - Let P1 be assigned the computation of temp1 = r1 ⋈ r2,
  - P2 be assigned the computation of temp2 = r3 ⋈ r4,
  - and P3 be assigned the computation of temp1 ⋈ temp2.
- P1 and P2 can work independently in parallel; P3 has to wait for input from P1 and P2.
  - We can pipeline the output of P1 and P2 to P3, combining independent parallelism and pipelined parallelism.
- This does not provide a high degree of parallelism: it is useful with a lower degree of parallelism, and less useful in a highly parallel system.

Query Optimization
Query optimization in parallel databases is significantly more complex than query optimization in sequential databases.
Cost models are more complicated, since we must take into account partitioning costs and issues such as skew and resource contention.


When scheduling an execution tree in a parallel system, we must decide:
- how to parallelize each operation, and how many processors to use for it;
- what operations to pipeline, what operations to execute independently in parallel, and what operations to execute sequentially, one after the other.
Determining the amount of resources to allocate for each operation is a problem: e.g., allocating more processors than optimal can result in high communication overhead.
Long pipelines should be avoided, as the final operation may wait a long time for inputs while holding precious resources.
Text Databases
A text database is a compilation of documents or other information in the form of a database, in which the complete text of each referenced document is available for online viewing, printing, or downloading. In addition to text documents, images are often included, such as graphs, maps, photos, and diagrams. A text database is searchable by keyword, phrase, or both.
When an item in a text database is viewed, it may appear in ASCII format (as a text file with the .txt extension), as a word-processed file (requiring a program such as Microsoft Word), or as a PDF file. When a document appears as a PDF file, it is usually a scanned hardcopy of the original article, chapter, or book.
Text databases are used by college and university libraries as a convenience to their students and staff. Full-text databases are ideally suited to online courses of study, where the student remains at home and obtains course materials by downloading them from the Internet. Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item. Full-text databases are also used by some corporations, law offices, and government agencies.
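In the simplest case, keyword search over such a database is just a predicate on the stored text; the Documents table below is hypothetical, and production systems would use full-text indexes rather than LIKE scans:

    SELECT D.doc_id, D.title
    FROM   Documents D
    WHERE  D.body LIKE '%concurrency%'   -- first keyword
      AND  D.body LIKE '%recovery%';     -- second keyword; a phrase search would match the whole phrase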

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)
Rule-based systems are used as a way to store and manipulate knowledge, to interpret information in a useful way. They are often used in artificial intelligence applications and research.
Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. Also known as rule-based systems, expert systems & artificial intelligence.
Rule-based systems are:
- Knowledge-based systems
- Part of the Artificial Intelligence field
- Computer programs that contain some subject-specific knowledge of one or more human experts
- Made up of a set of rules that analyze user-supplied information about a specific class of problems
- Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the knowledge
Factual Knowledge – knowledge of a particular task domain that is widely shared


Heuristic Knowledge – more judgmental knowledge of performance in a task domain
Uses of Rule-based Systems
- Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members.
- Solves problems that would normally be tackled by a medical or other professional.
- Currently used in fields such as accounting, medicine, process control, financial services, production, and human resources.
Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game.
Rule-based systems can be used to perform lexical analysis to compile or interpret computer programs, or in natural language processing.
Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.
Construction
A typical rule-based system has four basic components:

- A list of rules or rule base, which is a specific type of knowledge base.
- An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: in this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: in this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: in this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
- Temporary working memory.
- A user interface or other connection to the outside world through which input and output signals are received and sent.
Components of a Rule-Based System

- Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the inputted data.
- Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain.
- Interpreter – interprets the inputted data and draws a conclusion based on the user's responses.


Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals.
- Both problem-solving methods are built into inference engines or inference procedures.
Advantages
- Provide consistent answers for repetitive decisions, processes, and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook
Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases
Knowledge-based Systems – Definition:
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic.
- Heuristics in search vs. in KBS: general vs. domain-specific.
- Highly specific domain knowledge.
- Knowledge is separated from how it is used.
KBS = knowledge base + inference engine
KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.
It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:
Knowledge representation formalisms & inference:
- Logic: resolution principle
- Production rules: backward chaining (top-down, goal-directed) and forward chaining (bottom-up, data-driven)
- Semantic nets & frames: inheritance & advanced reasoning
- Case-based reasoning: similarity-based
KBS tools – Shells
- Consist of a KA tool, database & development interface.
- Inductive shells:
  - the simplest;
  - example cases are represented as a matrix of known data (premises) and resulting effects;
  - the matrix is converted into a decision tree or IF-THEN statements;
  - examples are selected for the tool.
- Rule-based shells:
  - simple to complex;
  - IF-THEN rules.
- Hybrid shells:
  - sophisticated & powerful;
  - support multiple KR paradigms & reasoning schemes;
  - generic tools applicable to a wide range of problems.
- Special-purpose shells:
  - specifically designed for particular types of problems;
  - restricted to specialised problems.
- From scratch:
  - requires more time and effort;
  - no constraints, unlike shells;
  - shells should be investigated first.

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).
Typical tasks of KBS:
(1) Diagnosis – to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure.
(2) Interpretation – to provide an understanding of a situation from available information, e.g., DENDRAL.
(3) Prediction – to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT.
(4) Design – to develop configurations that satisfy constraints of a design problem, e.g., XCON.
(5) Planning – both short term & long term, in areas like project management, product development, or financial planning, e.g., HRM.
(6) Monitoring – to check performance & flag exceptions, e.g., a KBS monitors radar data and estimates the position of the space shuttle.
(7) Control – to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment.
(8) Instruction – to train students and correct their performance, e.g., give medical students experience diagnosing illness.
(9) Debugging – to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors.
Advantages:
- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty
Limitations:
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic organisational structure with data.
- Local access and autonomy, without exclusion.
- Cheaper to create and easier to expand.
- Improved availability, reliability, and performance by removing reliance on a central site.
- Reduced communication overhead: most data access is local, which is less expensive and performs better.
- Improved processing power: many machines handle the database, rather than a single server.

Disadvantages compared with conventional databases:
- More complex to implement.
- More costly to maintain.
- Security and integrity control is harder.
- Standards and experience are lacking.
- Design issues are more complex.

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk – when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction timestamp TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1, the state of the DB could be:

    Time   Object1   Object2
    t1     "Hello"   "Bar"
    t0     "Foo"     "Bar"
This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:


    Time   Object1   Object2     Object3
    t2     "Hello"   (deleted)   "Foo-Bar"
    t1     "Hello"   "Bar"
    t0     "Foo"     "Bar"
Now there is a new version, as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
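The snapshot behaviour described above can be sketched with two interleaved SQL sessions; the obj table is hypothetical, and the isolation-level syntax follows PostgreSQL-style snapshot isolation:

    -- Session 1 (long-running reader)
    BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT val FROM obj WHERE id = 1;   -- sees 'Hello' (snapshot taken at t1)

    -- Session 2 (concurrent writer) commits meanwhile:
    --   BEGIN; UPDATE obj SET val = 'Foo-Bar' WHERE id = 1; COMMIT;

    SELECT val FROM obj WHERE id = 1;   -- still sees 'Hello'; the reader is never blocked
    COMMIT;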

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications;
  - the limited life of the power supply (battery);
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.
Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.
Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, which are used in wired networks in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage the entire mobility domain, it is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
  - Mobile units must be unrestricted throughout the cells of a domain, while maintaining information access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
  - In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
  - In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
  - Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
  - MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
  - Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 51: Database Technology

When scheduling execution tree in parallel system must decide How to parallelize each operation and how many processors to use for it What operations to pipeline what operations to execute independently in parallel and

what operations to execute sequentially one after the otherDetermining the amount of resources to allocate for each operation is a problem

Eg allocating more processors than optimal can result in high communication overheadLong pipelines should be avoided as the final operation may wait a lot for inputs while holding precious resourcesText DatabasesA text database is a compilation of documents or other information in the form of a database in which the complete text of each referenced document is available for online viewing printing or downloading In addition to text documents images are often included such as graphs maps photos and diagrams A text database is searchable by keyword phrase or bothWhen an item in a text database is viewed it may appear in ASCII format (as a text file with the txt extension) as a word-processed file (requiring a program such as Microsoft Word) as an PDF) file When a document appears as a PDF file it is usually a scanned hardcopy of the original article chapter or bookA text databases are used by college and university libraries as a convenience to their students and staff Full-text databases are ideally suited to online courses of study where the student remains at home and obtains course materials by downloading them from the Internet Access to these databases is normally restricted to registered personnel or to people who pay a specified fee per viewed item Full-text databases are also used by some corporations law offices and government agencies

(b) Discuss the Rules, Knowledge Bases and Image Databases. (JUNE 2010)

Rule-based systems are used as a way to store and manipulate knowledge to interpret information in a useful way. They are often used in artificial intelligence applications and research. Rule-based systems are specialized software that encapsulates "human intelligence"-like knowledge, thereby making intelligent decisions quickly and in repeatable form. They are also known as rule-based systems, expert systems and artificial intelligence.

Rule-based systems are:
– Knowledge-based systems
– Part of the Artificial Intelligence field
– Computer programs that contain some subject-specific knowledge of one or more human experts
– Made up of a set of rules that analyze user-supplied information about a specific class of problems
– Systems that utilize reasoning capabilities and draw conclusions

Knowledge Engineering – building an expert system
Knowledge Engineers – the people who build the system
Knowledge Representation – the symbols used to represent the knowledge
Factual Knowledge – knowledge of a particular task domain that is widely shared


Heuristic Knowledge – more judgmental knowledge of performance in a task domain

Uses of Rule-based Systems
• Very useful to companies with a high level of experience and expertise that cannot easily be transferred to other members
• Solve problems that would normally be tackled by a medical or other professional
• Currently used in fields such as accounting, medicine, process control, financial services, production and human resources

Applications
A classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices. For example, an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms, or select tactical moves to play a game. Rule-based systems can be used to perform lexical analysis, to compile or interpret computer programs, or in natural language processing. Rule-based programming attempts to derive execution instructions from a starting set of data and rules. This is a more indirect method than that employed by an imperative programming language, which lists execution steps sequentially.

Construction
A typical rule-based system has four basic components:

• A list of rules or rule base, which is a specific type of knowledge base
• An inference engine or semantic reasoner, which infers information or takes action based on the interaction of input and the rule base. The interpreter executes a production system program by performing the following match-resolve-act cycle:
  - Match: In this first phase, the left-hand sides of all productions are matched against the contents of working memory. As a result, a conflict set is obtained, which consists of instantiations of all satisfied productions. An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production.
  - Conflict-Resolution: In this second phase, one of the production instantiations in the conflict set is chosen for execution. If no productions are satisfied, the interpreter halts.
  - Act: In this third phase, the actions of the production selected in the conflict-resolution phase are executed. These actions may change the contents of working memory. At the end of this phase, execution returns to the first phase.
• Temporary working memory
• A user interface or other connection to the outside world through which input and output signals are received and sent
(A minimal sketch of the match-resolve-act cycle follows below.)
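The cycle can be made concrete with a small Python sketch. This is illustrative only; the encodings of rules and facts are assumptions made for the example, not part of any standard rule-engine API:

    # Minimal sketch of the match-resolve-act cycle. Assumed encodings: rules are
    # (name, conditions, action) triples; working memory is a set of fact tuples.
    working_memory = {("symptom", "fever"), ("symptom", "rash")}
    rules = [
        ("r1", [("symptom", "fever"), ("symptom", "rash")],
         lambda wm: wm.add(("diagnosis", "measles-suspected"))),
        ("r2", [("diagnosis", "measles-suspected")],
         lambda wm: wm.add(("action", "refer-to-doctor"))),
    ]
    fired = set()  # instantiations already executed, so they are not re-fired
    while True:
        # Match: collect rules whose conditions are all present in working memory
        conflict_set = [(name, act) for (name, conds, act) in rules
                        if all(c in working_memory for c in conds) and name not in fired]
        if not conflict_set:
            break                    # no satisfied productions: the interpreter halts
        name, act = conflict_set[0]  # Conflict-resolution: simplest strategy, pick first
        act(working_memory)          # Act: may change the contents of working memory
        fired.add(name)
    print(working_memory)

Here conflict resolution just picks the first satisfied rule; real production systems use strategies such as recency, specificity or explicit priorities.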

Components of a Rule-Based System
• Set of Rules – derived from the knowledge base and used by the interpreter to evaluate the input data
• Knowledge Engineer – decides how to represent the expert's knowledge, and how to build the inference engine appropriately for the domain
• Interpreter – interprets the input data and draws a conclusion based on the user's responses


Problem-solving Models
• Forward-chaining – starts from a set of conditions and moves towards some conclusion
• Backward-chaining – starts with a list of goals and works backwards to see if there is any data that will allow it to conclude any of these goals
Both problem-solving methods are built into inference engines or inference procedures. (A minimal backward-chaining sketch follows the lists below.)

Advantages
• Provide consistent answers for repetitive decisions, processes and tasks
• Hold and maintain significant levels of information
• Reduce employee training costs
• Centralize the decision-making process
• Create efficiencies and reduce the time needed to solve problems
• Combine multiple human expert intelligences
• Reduce the amount of human errors
• Give strategic and comparative advantages, creating entry barriers to competitors
• Review transactions that human experts may overlook

Disadvantages
• Lack human common sense needed in some decision making
• Will not be able to give the creative responses that human experts can give in unusual circumstances
• Domain experts cannot always clearly explain their logic and reasoning
• Challenges of automating complex processes
• Lack of flexibility and ability to adapt to changing environments
• Not being able to recognize when no answer is available
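For contrast with the forward-chaining sketch above, here is a hypothetical backward-chaining sketch; the facts, rules and prove function are invented for the example:

    # Backward chaining: prove a goal by recursively proving the conditions
    # of any rule that concludes it (toy example).
    facts = {"has-fever", "has-rash"}
    rules = {                       # conclusion -> alternative lists of conditions
        "measles": [["has-fever", "has-rash"]],
        "refer":   [["measles"]],
    }
    def prove(goal):
        if goal in facts:                     # the goal is a known fact
            return True
        for conds in rules.get(goal, []):     # try each rule concluding the goal
            if all(prove(c) for c in conds):  # prove every condition recursively
                return True
        return False
    print(prove("refer"))   # True: refer <- measles <- has-fever & has-rash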

Knowledge Bases
Knowledge-based Systems

Definition
A system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
• Heuristic rather than algorithmic (heuristics in search vs. in KBS: general vs. domain-specific)
• Highly specific domain knowledge
• Knowledge is separated from how it is used

KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
• the reasoning mechanism needs to be as stable as possible;
• the knowledge base must be able to grow and change, as knowledge is added;
• this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR formalism              Inference
Logic                     Resolution principle
Production rules          backward (top-down, goal-directed); forward (bottom-up, data-driven)
Semantic nets & Frames    Inheritance & advanced reasoning
Case-based Reasoning      Similarity-based

KBS tools – Shells
- Consist of KA tool, database & development interface
- Inductive shells
  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tools applicable to a wide range of problems
- Special-purpose shells
  - specifically designed for particular types of problems
  - restricted to specialised problems
- Scratch (building from scratch)
  - requires more time and effort
  - no constraints, unlike shells
  - shells should be investigated first

Some example KBSs
DENDRAL (chemical)
MYCIN (medicine)
XCON/RI (computer)

Typical tasks of KBS
(1) Diagnosis – To identify a problem given a set of symptoms or malfunctions, e.g. diagnose reasons for engine failure
(2) Interpretation – To provide an understanding of a situation from available information, e.g. DENDRAL
(3) Prediction – To predict a future state from a set of data or observations, e.g. Drilling Advisor, PLANT
(4) Design – To develop configurations that satisfy constraints of a design problem, e.g. XCON
(5) Planning – Both short term & long term, in areas like project management, product development or financial planning, e.g. HRM
(6) Monitoring – To check performance & flag exceptions, e.g. a KBS monitors radar data and estimates the position of the space shuttle
(7) Control – To collect and evaluate evidence and form opinions on that evidence, e.g. control a patient's treatment
(8) Instruction – To train students and correct their performance, e.g. give medical students experience diagnosing illness
(9) Debugging – To identify and prescribe remedies for malfunctions, e.g. identify errors in an automated teller machine network and ways to correct the errors

Advantages
- Increase availability of expert knowledge: expertise otherwise not accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6. (a) Compare distributed databases and conventional databases. (16) (NOV/DEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
• Mimic organisational structure with data
• Local access and autonomy, without exclusion
• Cheaper to create and easier to expand
• Improved availability/reliability/performance, by removing reliance on a central site
• Reduced communication overhead: most data access is local, which is less expensive and performs better
• Improved processing power: many machines handle the database, rather than a single server

Disadvantages:
• More complex to implement
• More costly to maintain
• Security and integrity control are harder; standards and experience are lacking
• Design issues are more complex

7. (a) Explain the Multi-Version Locks and Recovery in Query Languages. (NOV/DEC 2010)

Multi-Version Locks
Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.
For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk; when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.
MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.
In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.
MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures that a transaction never has to wait for a database object, by maintaining several versions of each object. Each version has a write timestamp, and a transaction Ti reads the most recent version of an object which precedes the transaction's timestamp, TS(Ti).
If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.
Every object also has a read timestamp, and if a transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), the transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).
The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.
At t1 the state of a DB could be:

Time   Object1   Object2
t0     "Foo"     "Bar"
t1     "Hello"   "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.
If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1   Object2     Object3
t2     "Hello"   (deleted)   "Foo-Bar"
t1     "Hello"   "Bar"       –
t0     "Foo"     "Bar"       –

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2, so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks. (A toy version-visibility sketch follows below.)
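A toy Python sketch of version visibility under MVCC; the data structures here are assumptions made for illustration, not how a real engine stores versions:

    # Each object keeps (write_ts, value) versions; a reader with timestamp ts
    # sees the most recent version whose write_ts <= ts. None marks a deletion.
    versions = {
        "Object1": [(0, "Foo"), (1, "Hello")],
        "Object2": [(0, "Bar"), (2, None)],       # deleted at t2
        "Object3": [(2, "Foo-Bar")],
    }
    def read(obj, ts):
        visible = [v for v in versions.get(obj, []) if v[0] <= ts]
        return max(visible)[1] if visible else None
    # The long-running reader at t1 keeps its coherent snapshot despite t2 writes:
    print([read(o, 1) for o in ("Object1", "Object2", "Object3")])  # ['Hello', 'Bar', None]
    print([read(o, 2) for o in ("Object1", "Object2", "Object3")])  # ['Hello', None, 'Foo-Bar']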

Recovery

(b) Discuss client/server model and mobile databases. (16) (NOV/DEC 2010)


Mobile Databases
• Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
• Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
• There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
• Some of the software problems – which may involve data management, transaction management and database recovery – have their origins in distributed database systems.
• In mobile computing the problems are more difficult, mainly because of:
  - The limited and intermittent connectivity afforded by wireless communications
  - The limited life of the power supply (battery)
  - The changing topology of the network
• In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

• The general architecture of a mobile platform is illustrated in Fig. 30.1.
• It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
• The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
• Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
• Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
• Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
• Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
• Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage it, the entire mobility domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  - Mobile units can be unrestricted throughout the cells of a domain while maintaining information-access contiguity.
• The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
• Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
• In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own using cost-effective technologies such as Bluetooth.
• In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
• Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
• MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
• Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
• Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
• Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
• The characteristics of mobile computing include:
  - Communication latency
  - Intermittent connectivity
  - Limited battery life
  - Changing client location
• The server may not be able to reach a client.
  - A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
  - In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.
  - Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server.
• Mobile computing poses challenges for servers as well as clients.
  - The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
  - One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
• Client mobility also poses many data management challenges.
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
• Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
1. The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
2. The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as applied to mobile databases:
• Data distribution and replication
• Transaction models
• Query processing
• Recovery and fault tolerance
• Mobile database design
• Location-based service
• Division of labor
• Security

Application: Intermittently Synchronized Databases
• Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
• The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
• This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
• This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
• The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
(A small sketch of batch synchronization appears below.)
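A minimal sketch of the synchronization step, assuming a sequence-numbered server update log; the names and structures are hypothetical, chosen only to illustrate the batch exchange:

    # On connect, the client pulls the batch of server updates made since its
    # last sync point and installs them on its local database.
    server_log = [                      # (seq, key, value) updates kept by the server
        (1, "price:widget", 10),
        (2, "price:gadget", 25),
        (3, "price:widget", 12),
    ]
    client_db = {}
    client_seq = 0
    def synchronize():
        global client_seq
        batch = [u for u in server_log if u[0] > client_seq]  # updates since last sync
        for seq, key, value in batch:                         # install the batch locally
            client_db[key] = value
            client_seq = seq
        return len(batch)
    print(synchronize(), client_db)  # 3 {'price:widget': 12, 'price:gadget': 25}
    print(synchronize())             # 0 -- nothing new since the last sync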

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
• Consider a set of alternative plans.
  - Must prune the search space; typically, left-deep plans only.
• Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.
Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
• Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
Two main issues:
• For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
• How is the cost of a plan estimated?
Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization
• An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
• Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
• For each block, the plans considered are:
  - All available access methods for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
(A toy cost comparison for left-deep join orders is sketched below.)
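As a toy illustration of why plan choice matters, the sketch below enumerates the two left-deep orders for joining the example Reserves and Sailors relations and compares a crude page-I/O cost for a page-oriented nested-loops join. The cost formula is a deliberate simplification, not System R's actual cost model:

    from itertools import permutations

    pages = {"Reserves": 1000, "Sailors": 500}   # page counts from the example schema

    def nl_join_cost(outer, inner):
        # page-oriented nested loops: scan outer once, scan inner once per outer page
        return pages[outer] + pages[outer] * pages[inner]

    for outer, inner in permutations(pages):
        print(f"{outer} outer, {inner} inner: {nl_join_cost(outer, inner)} page I/Os")
    # Reserves outer: 1000 + 1000*500 = 501000; Sailors outer: 500 + 500*1000 = 500500

Even in this tiny example, one join order is cheaper; a real optimizer makes such comparisons across many orders, access methods and join algorithms.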

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia databases
• To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database, in a file system.
• The database must handle large object representation.
• Similarity-based retrieval must be provided by special index structures.
• Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
• Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data uses commonalities among a sequence of frames to achieve a greater degree of compression.
• MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
• MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
• Several alternatives for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data
• The most important types are video and audio data.
• Characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
• Video-on-demand systems deliver video from central video servers, across a network, to terminals; they must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
• Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g. identify similar designs for registering a new trademark.
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g. test user input against stored commands.
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries?
• Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
• SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
• Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
• Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - This takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
• For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. (A naive fixpoint sketch that sidesteps this limit follows below.)
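The fixpoint idea can be sketched in a few lines of Python; this is a hypothetical toy, with Qty dropped from the Assembly tuples for brevity:

    # Naive fixpoint of Comp: apply the rules until no new tuples appear,
    # so any number of hierarchy levels is handled without a fixed join count.
    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("frame", "seat")}
    # Comp(Part, Subpt) :- Assembly(Part, Subpt).
    # Comp(Part, Subpt) :- Assembly(Part, Part2), Comp(Part2, Subpt).
    comp = set()
    while True:
        derived = set(assembly) | {(p, s) for (p, x) in assembly
                                   for (y, s) in comp if x == y}
        if derived <= comp:            # no new inferences: least fixpoint reached
            break
        comp |= derived
    print(sorted(comp))                # includes (trike, spoke), two levels down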

15.2 Theoretical Foundations
• The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
• The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
• The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
• In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs

• Consider the following program:
  Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers).
  - E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
  - The set of all integers is a fixpoint of double+.
  - The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
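A quick check of these definitions in Python (finite sets only; the interesting fixpoints of double+ named above are infinite, but the empty set is a finite one):

    def double_plus(s):
        return {2 * x for x in s} | s          # double+(S) = {2x : x in S} U S

    print(double_plus({1, 2, 5}))              # {1, 2, 4, 5, 10} -- not a fixpoint
    print(double_plus(set()) == set())         # True: the empty set is a fixpoint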

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2 (Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2 (Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
• The < … > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < … > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient evaluation of recursive queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive Fixpoint Evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For Comp, the rewritten recursive rule is:
    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
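A hypothetical seminaive evaluation of Comp in Python (same toy Assembly instance as before, with Qty omitted):

    # Each round joins Assembly only with delta (the tuples that were new in the
    # previous round), so no inference is ever repeated.
    assembly = {("trike", "wheel"), ("trike", "frame"),
                ("wheel", "spoke"), ("wheel", "tire"), ("frame", "seat")}
    comp = set(assembly)       # base rule seeds Comp (and the first delta)
    delta = set(assembly)
    while delta:
        derived = {(p, s) for (p, x) in assembly for (y, s) in delta if x == y}
        delta = derived - comp     # only genuinely new tuples survive to the next round
        comp |= delta
    print(sorted(comp))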

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(spoke).
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add `Magic' Filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
(A small evaluation sketch follows below.)
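The effect of the rewriting can be seen in a small Python sketch; the Assembly instance and the naive set-based evaluation are assumptions made for illustration only:

    # Step 1: compute the magic filter -- all first-column values relevant to "spoke".
    assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
                ("frame", "seat", 1), ("frame", "pedal", 1),
                ("wheel", "spoke", 2), ("wheel", "tire", 1)}
    magic = {"spoke"}                       # Magic_SL(spoke).
    while True:
        # Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
        new = {p for (p, s, q) in assembly if s in magic} - magic
        if not new:
            break
        magic |= new
    print(magic)                            # {'spoke', 'wheel', 'trike'}

    # Step 2: compute SameLev, but only for tuples whose first column is relevant.
    same_lev = set()
    while True:
        base = {(s1, s2) for (p1, s1, q1) in assembly for (p2, s2, q2) in assembly
                if p1 == p2 and s1 in magic}
        rec = {(s1, s2) for (p1, s1, q1) in assembly for (l1, l2) in same_lev
               for (p2, s2, q2) in assembly if l1 == p1 and l2 == p2 and s1 in magic}
        new = (base | rec) - same_lev
        if not new:
            break
        same_lev |= new
    print(sorted(t for t in same_lev if t[0] == "spoke"))
    # [('spoke', 'pedal'), ('spoke', 'seat'), ('spoke', 'spoke'), ('spoke', 'tire')]

Only parts on the path up from spoke enter the filter, so inferences about irrelevant parts (e.g., tuples with seat or pedal in the first column) are never made.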

70

Page 52: Database Technology

Heuristic Knowledge ndash more judgmental knowledge of performance in a task domain Uses of Rule based Systems

Very useful to companies with a high-level of experience and expertise that cannot easily be transferred to other members

Solves problems that would normally be tackled by a medical or other professional Currently used in fields such as accounting medicine process control financial service

production and human resourcesApplicationsA classic example of a rule-based system is the domain-specific expert system that uses rules to make deductions or choices For example an expert system might help a doctor choose the correct diagnosis based on a cluster of symptoms or select tactical moves to play a gameRule-based systems can be used to perform lexical analysis to compile or interpret computer programs or in natural language processingRule-based programming attempts to derive execution instructions from a starting set of data and rules This is a more indirect method than that employed by an imperative programming language which lists execution steps sequentiallyConstructionA typical rule-based system has four basic components

A list of rules or rule base which is a specific type of knowledge base An inference engine or semantic reasoner which infers information or takes action based

on the interaction of input and the rule base The interpreter executes a production system program by performing the following match-resolve-act cycle Match In this first phase the left-hand sides of all productions are matched against

the contents of working memory As a result a conflict set is obtained which consists of instantiations of all satisfied productions An instantiation of a production is an ordered list of working memory elements that satisfies the left-hand side of the production

Conflict-Resolution In this second phase one of the production instantiations in the conflict set is chosen for execution If no productions are satisfied the interpreter halts

Act In this third phase the actions of the production selected in the conflict-resolution phase are executed These actions may change the contents of working memory At the end of this phase execution returns to the first phase

Temporary working memory A user interface or other connection to the outside world through which input and output

signals are received and sentComponents of an Rule Based System

Set of Rules ndash derived from the knowledge base and used by the interpreter to evaluate the inputted data

Knowledge Engineer ndash decides how to represent the experts knowledge and how to build the inference engine appropriately for the domain

Interpreter ndash interprets the inputted data and draws a conclusion based on the users responses

52

Problem-solving Models Forward-chaining ndash starts from a set of conditions and moves towards some conclusion Backward-chaining ndash starts with a list of goals and the works backwards to see if there

is any data that will allow it to conclude any of these goals Both problem-solving methods are built into inference engines or inference procedures

Advantages Provide consistent answers for repetitive decisions processes and tasks Hold and maintain significant levels of information Reduce employee training costs Centralize the decision making process Create efficiencies and reduce the time needed to solve problems Combine multiple human expert intelligences Reduce the amount of human errors Give strategic and comparative advantages creating entry barriers to competitors Review transactions that human experts may overlook

Disadvantages Lack human common sense needed in some decision making Will not be able to give the creative responses that human experts can give in unusual

circumstances Domain experts cannot always clearly explain their logic and reasoning Challenges of automating complex processes Lack of flexibility and ability to adapt to changing environments Not being able to recognize when no answer is available

Knowledge BasesKnowledge-based Systems Definition

A system that draws upon the knowledge of human experts captured in a knowledge-base to solve problems that normally require human expertise

Heuristic rather than algorithmic Heuristics in search vs in KBS general vs domain-specific Highly specific domain knowledge Knowledge is separated from how it is used

KBS = knowledge-base + inference engine

KBS Architecture

53

The inference engine and knowledge base are separated because the reasoning mechanism needs to be as stable as possible the knowledge base must be able to grow and change as knowledge is added this arrangement enables the system to be built from or converted to a shell It is reasonable to produce a richer more elaborate description of the typical expert

system A more elaborate description which still includes the components that are to be found in

almost any real-world system would look like thisKnowledge representation formalisms amp InferenceKR InferenceLogic Resolution principleProduction rules backward (top-down goal directed)

forward (bottom-up data-driven)Semantic nets amp Frames Inheritance amp advanced reasoningCase-based Reasoning Similarity basedKBS tools ndash Shells- Consist of KA Tool Database amp Development Interface- Inductive Shells

- simplest - example cases represented as matrix of known data(premises) and resulting effects - matrix converted into decision tree or IF-THEN statements- examples selected for the tool

Rule-based shells - simple to complex - IF-THEN rules

Hybrid shells - sophisticate amp powerful - support multiple KR paradigms amp reasoning schemes - generic tool applicable to a wide range

Special purpose shells - specifically designed for particular types of problems

54

- restricted to specialised problems-Scratch

- require more time and effort - no constraints like shells - shells should be investigated first

Some example KBSsDENDRAL (chemical)MYCIN (medicine)XCONRI (computer)Typical tasks of KBS(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions

eg diagnose reasons for engine failure(2) Interpretation - To provide an understanding of a situation from available information eg DENDRAL(3) Prediction - To predict a future state from a set of data or observations eg Drilling Advisor PLANT(4) Design - To develop configurations that satisfy constraints of a design problem eg XCON(5) Planning - Both short term amp long term in areas like project management product development or financial planning eg HRM(6) Monitoring - To check performance amp flag exceptions eg KBS monitors radar data and estimates the position of the space shuttle(7) Control - To collect and evaluate evidence and form opinions on that evidence

eg control patientrsquos treatment(8) Instruction - To train students and correct their performance eg give medical students experience diagnosing illness (9) Debugging - To identify and prescribe remedies for malfunctions

eg identify errors in an automated teller machine network and ways to correct the errorsAdvantages

- Increase availability of expert knowledgeexpertise not accessibletraining future experts

- Efficient and cost effective- Consistency of answers- Explanation of solution- Deal with uncertainty

Limitations- Lack of common sense- Inflexible Difficult to modify

- Restricted domain of expertise- Lack of learning ability- Not always reliable

55

6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

mimics organisational structure with data

local access and autonomy without exclusion

cheaper to create and easier to expand

improved availabilityreliabilityperformance by removing reliance on a central site

Reduced communication overhead

Most data access is local less expensive and performs better Improved processing power

Many machines handling the database rather than a single server more complex to implement more costly to maintain security and integrity control standards and experience are lacking Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages(NOVDEC 2010)

Multi-Version Locks Multiversion concurrency control (abbreviated MCC or MVCC) in the database field of computer science is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memoryFor instance a database will implement updates not by deleting an old piece of data and overwriting it with a new one but instead by marking the old data as obsolete and adding the newer version Thus there are multiple versions stored but only one is the latest This allows the database to avoid overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old obsolete data objects For a document-oriented database such as CouchDB Riak or MarkLogic Server it also allows

56

the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
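As a rough illustration of one synchronization round in such an environment, the sketch below assumes a hypothetical server keeping a version-numbered update log; the class names, the push-then-pull order, and the absence of conflict resolution are simplifying assumptions for the example, not a description of any particular product.

# Hypothetical sketch of one ISDB synchronization round: the client
# uploads changes made while disconnected, then installs the batch of
# server updates it has not yet seen. Conflict resolution is omitted.
class Server:
    def __init__(self):
        self.version = 0              # monotonically increasing update counter
        self.log = []                 # list of (version, key, value) updates

    def apply(self, key, value):
        self.version += 1
        self.log.append((self.version, key, value))

    def updates_since(self, v):
        return [u for u in self.log if u[0] > v]

class Client:
    def __init__(self):
        self.last_sync = 0            # highest server version already installed
        self.local_db = {}
        self.pending = []             # offline writes not yet pushed

    def write_offline(self, key, value):
        self.local_db[key] = value
        self.pending.append((key, value))

    def synchronize(self, server):
        for key, value in self.pending:     # push local changes first
            server.apply(key, value)
        self.pending.clear()
        for version, key, value in server.updates_since(self.last_sync):
            self.local_db[key] = value      # install the update batch
            self.last_sync = version

server, client = Server(), Client()
server.apply("price", 10)             # change made while the client was away
client.write_offline("stock", 5)      # change made while disconnected
client.synchronize(server)
print(client.local_db)                # {'stock': 5, 'price': 10}

In a multicast setting, the same update batch could be broadcast to a whole group of such clients at once.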

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization

Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

Two parts to optimizing a query:
1. Consider a set of alternative plans.
   - Must prune the search space; typically, left-deep plans only.
2. Estimate the cost of each plan that is considered.
   - Must estimate the size of the result and the cost for each plan node.
   - Key issues: statistics, indexes, operator implementations.

Plan: a tree of RA operators, with a choice of algorithm for each operator.


Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.

Two main issues:
1. For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
2. How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.

Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

For each block, the plans considered are:
- All available access methods, for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
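As a rough sketch of the enumeration half of this task, the code below exhaustively costs every left-deep join order under a toy page-based nested-loops cost model. The page counts echo the example schema above; the third relation, Boats, and its page count are invented for the example, and a real optimizer (System R included) prunes with dynamic programming and uses far richer cost and size estimation.

from itertools import permutations

# Toy statistics: pages per relation (Reserves and Sailors as above;
# Boats is a hypothetical third relation added for illustration).
PAGES = {"Reserves": 1000, "Sailors": 500, "Boats": 50}

def nested_loops_cost(outer_pages, inner_pages):
    # Page-oriented simple nested loops: read the outer once, and the
    # inner once per outer page.
    return outer_pages + outer_pages * inner_pages

def left_deep_plans(relations):
    # Enumerate all left-deep join orders with a crude cost estimate;
    # each join result is assumed to be as large as its outer input.
    for order in permutations(relations):
        outer_pages = PAGES[order[0]]
        cost = 0
        for inner in order[1:]:
            cost += nested_loops_cost(outer_pages, PAGES[inner])
        yield order, cost

best = min(left_deep_plans(["Reserves", "Sailors", "Boats"]), key=lambda p: p[1])
print(best)   # the cheapest order puts the smallest relation on the outside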

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)

Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats

Store and transmit multimedia data in compressed form.
- JPEG and GIF: the most widely used formats for image data.
- MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

Several alternatives exist for audio encoding:
- MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data

The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval

Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.
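One way to make similarity concrete for pictorial data is to reduce each image to a feature vector and rank stored images by vector similarity. The sketch below uses gray-level histograms and cosine similarity as an assumed, illustrative feature; the image data are random stand-ins, and a production system would use special index structures over such features rather than this linear scan.

import numpy as np

def histogram_feature(image, bins=16):
    # Reduce a 2-D array of gray levels (0..255) to a normalized histogram.
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def most_similar(query_image, stored_images):
    # Rank stored images by similarity of their features to the query's.
    q = histogram_feature(query_image)
    scored = [(name, cosine_similarity(q, histogram_feature(img)))
              for name, img in stored_images.items()]
    return sorted(scored, key=lambda s: -s[1])

rng = np.random.default_rng(0)
db = {"design_a": rng.integers(0, 256, (32, 32)),   # full-range image
      "design_b": rng.integers(0, 128, (32, 32))}   # darker image
query = rng.integers(0, 256, (32, 32))
print(most_similar(query, db)[0])    # the stored design closest to the query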

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases

SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog

SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor.
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92

Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- One join takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
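To see what a fixpoint computation buys here, the sketch below evaluates the recursive components program naively, iterating until no new tuples appear; the Assembly tuples are an assumed toy instance.

# Naive fixpoint evaluation of:
#   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty)
#   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt)
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1)}

comp = {(p, s) for (p, s, q) in assembly}        # base rule: direct subparts
while True:
    derived = {(p, s2) for (p, p2, q) in assembly
                       for (p2b, s2) in comp if p2 == p2b}
    new = comp | derived
    if new == comp:                               # least fixpoint reached
        break
    comp = new

print(("trike", "rim") in comp)                   # True: three levels down

No fixed number of joins could have produced ("trike", "rim") for every possible instance, which is exactly the limitation described above.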

15.2 Theoretical Foundations

The first approach to defining what a Datalog program means is called the least model semantics, and gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check if it is a complex part.

Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers). E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint

Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of 'if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).
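The order-dependence is easy to see by executing the two rules directly; the tiny Assembly instance below is assumed for illustration.

# Rule-application order matters once 'not' is involved:
#   Big(Part)   :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
#   Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
assembly = {("trike", "wheel", 3), ("trike", "frame", 1)}

def big_rule(small):
    return {p for (p, s, q) in assembly if q > 2 and p not in small}

def small_rule(big):
    return {p for (p, s, q) in assembly if p not in big}

# Apply the Big rule first: trike becomes Big, so it is never Small.
big = big_rule(set()); small = small_rule(big)
print(big, small)        # {'trike'} set()

# Apply the Small rule first: trike becomes Small, blocking the Big rule.
small = small_rule(set()); big = big_rule(small)
print(big, small)        # set() {'trike'}

The two runs land in the two different minimal fixpoints mentioned above.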

15.3.1 Range-Restriction and Negation

If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
Big2(Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2)
Small2(Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations, as in the sketch below.
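A minimal sketch of seminaive evaluation, on the same assumed Assembly instance as the earlier fixpoint sketch; the recursive rule is joined only against the delta of the previous iteration, never against all of Comp.

# Seminaive evaluation of the Comp program using a delta table.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("tire", "rim", 1)}

comp = {(p, s) for (p, s, q) in assembly}
delta = set(comp)                        # tuples new in the last iteration
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt)
    derived = {(p, s2) for (p, p2, q) in assembly
                       for (p2b, s2) in delta if p2 == p2b}
    delta = derived - comp               # keep only genuinely new tuples
    comp |= delta

print(sorted(comp))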


The rewritten recursive rule for Comp uses the delta table:
Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.


Page 53: Database Technology

Problem-solving Models
- Forward-chaining – starts from a set of conditions and moves towards some conclusion.
- Backward-chaining – starts with a list of goals and then works backwards to see if there is any data that will allow it to conclude any of these goals.

Both problem-solving methods are built into inference engines or inference procedures.
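A minimal sketch of forward chaining over IF-THEN rules; the facts and rules about an engine fault are invented for the example.

# Forward chaining: repeatedly fire rules whose conditions are all
# satisfied by the known facts, until no new facts can be derived.
rules = [({"engine_cranks", "no_spark"}, "ignition_fault"),
         ({"ignition_fault"}, "replace_ignition_coil")]
facts = {"engine_cranks", "no_spark"}

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)        # the rule fires; assert a new fact
            changed = True

print(facts)   # includes the derived diagnosis and the recommended action

Backward chaining would instead start from the goal replace_ignition_coil and search for rules and facts that establish it.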

Advantages
- Provide consistent answers for repetitive decisions, processes and tasks
- Hold and maintain significant levels of information
- Reduce employee training costs
- Centralize the decision-making process
- Create efficiencies and reduce the time needed to solve problems
- Combine multiple human expert intelligences
- Reduce the amount of human errors
- Give strategic and comparative advantages, creating entry barriers to competitors
- Review transactions that human experts may overlook

Disadvantages
- Lack human common sense needed in some decision making
- Will not be able to give the creative responses that human experts can give in unusual circumstances
- Domain experts cannot always clearly explain their logic and reasoning
- Challenges of automating complex processes
- Lack of flexibility and ability to adapt to changing environments
- Not being able to recognize when no answer is available

Knowledge Bases / Knowledge-based Systems

Definition: a system that draws upon the knowledge of human experts, captured in a knowledge base, to solve problems that normally require human expertise.
- Heuristic rather than algorithmic
- Heuristics in search vs. in KBS: general vs. domain-specific
- Highly specific domain knowledge
- Knowledge is separated from how it is used

KBS = knowledge base + inference engine

KBS Architecture


The inference engine and knowledge base are separated because:
- the reasoning mechanism needs to be as stable as possible;
- the knowledge base must be able to grow and change, as knowledge is added;
- this arrangement enables the system to be built from, or converted to, a shell.

It is reasonable to produce a richer, more elaborate description of the typical expert system. A more elaborate description, which still includes the components that are to be found in almost any real-world system, would look like this:

Knowledge representation formalisms & Inference
KR                       Inference
Logic                    Resolution principle
Production rules         backward (top-down, goal directed)
                         forward (bottom-up, data-driven)
Semantic nets & Frames   Inheritance & advanced reasoning
Case-based Reasoning     Similarity based

KBS tools – Shells
- Consist of KA Tool, Database & Development Interface
- Inductive shells
  - simplest
  - example cases represented as a matrix of known data (premises) and resulting effects
  - matrix converted into a decision tree or IF-THEN statements
  - examples selected for the tool
- Rule-based shells
  - simple to complex
  - IF-THEN rules
- Hybrid shells
  - sophisticated & powerful
  - support multiple KR paradigms & reasoning schemes
  - generic tool applicable to a wide range
- Special purpose shells
  - specifically designed for particular types of problems


  - restricted to specialised problems
- Scratch (building the KBS from scratch rather than using a shell)
  - requires more time and effort
  - no constraints, unlike shells
  - shells should be investigated first

Some example KBSs: DENDRAL (chemical), MYCIN (medicine), XCON/R1 (computer).

Typical tasks of KBS:
(1) Diagnosis - to identify a problem given a set of symptoms or malfunctions, e.g., diagnose reasons for engine failure
(2) Interpretation - to provide an understanding of a situation from available information, e.g., DENDRAL
(3) Prediction - to predict a future state from a set of data or observations, e.g., Drilling Advisor, PLANT
(4) Design - to develop configurations that satisfy constraints of a design problem, e.g., XCON
(5) Planning - both short term & long term, in areas like project management, product development or financial planning, e.g., HRM
(6) Monitoring - to check performance & flag exceptions, e.g., a KBS that monitors radar data and estimates the position of the space shuttle
(7) Control - to collect and evaluate evidence and form opinions on that evidence, e.g., control a patient's treatment
(8) Instruction - to train students and correct their performance, e.g., give medical students experience diagnosing illness
(9) Debugging - to identify and prescribe remedies for malfunctions, e.g., identify errors in an automated teller machine network and ways to correct the errors

Advantages

- Increase availability of expert knowledge: expertise not accessible; training future experts
- Efficient and cost effective
- Consistency of answers
- Explanation of solution
- Deal with uncertainty

Limitations
- Lack of common sense
- Inflexible; difficult to modify
- Restricted domain of expertise
- Lack of learning ability
- Not always reliable


6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

Advantages of distributed databases over conventional (centralized) databases:
- Mimic organisational structure with data
- Local access and autonomy without exclusion
- Cheaper to create and easier to expand
- Improved availability/reliability/performance by removing reliance on a central site
- Reduced communication overhead: most data access is local, which is less expensive and performs better
- Improved processing power: many machines handle the database, rather than a single server

Disadvantages of distributed databases:
- More complex to implement
- More costly to maintain
- Security and integrity control is harder; standards and experience are lacking
- Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages (NOVDEC 2010)

Multi-Version Locks

Multiversion concurrency control (abbreviated MCC or MVCC), in the database field of computer science, is a concurrency control method commonly used by database management systems to provide concurrent access to the database, and in programming languages to implement transactional memory.

For instance, a database will implement updates not by deleting an old piece of data and overwriting it with a new one, but instead by marking the old data as obsolete and adding the newer version. Thus there are multiple versions stored, but only one is the latest. This allows the database to avoid the overhead of filling in holes in memory or disk structures, but requires (generally) the system to periodically sweep through and delete the old, obsolete data objects. For a document-oriented database such as CouchDB, Riak or MarkLogic Server, it also allows the system to optimize documents by writing entire documents onto contiguous sections of disk: when updated, the entire document can be re-written, rather than bits and pieces cut out or maintained in a linked, non-contiguous database structure.

MVCC also provides potential point-in-time consistent views. In fact, read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read, and read these versions of the data. This avoids managing locks for read transactions, because writes can be isolated by virtue of the old versions being maintained, rather than through a process of locks or mutexes. Writes affect a future version, but at the transaction ID that the read is working at, everything is guaranteed to be consistent, because the writes are occurring at a later transaction ID.

In other words, MVCC provides each user connected to the database with a snapshot of the database for that person to work with. Any changes made will not be seen by other users of the database until the transaction has been committed.

MVCC uses timestamps or increasing transaction IDs to achieve transactional consistency. MVCC ensures a transaction never has to wait for a database object, by maintaining several versions of an object. Each version has a write timestamp, and it lets a transaction Ti read the most recent version of an object which precedes the transaction timestamp TS(Ti).

If a transaction Ti wants to write to an object, and there is another transaction Tk, the timestamp of Ti must precede the timestamp of Tk (i.e., TS(Ti) < TS(Tk)) for the object write operation to succeed; which is to say, a write cannot complete if there are outstanding transactions with an earlier timestamp.

Every object also has a read timestamp, and if transaction Ti wants to write to object P, and the timestamp of that transaction is earlier than the object's read timestamp (TS(Ti) < RTS(P)), transaction Ti is aborted and restarted. Otherwise, Ti creates a new version of P and sets the read/write timestamps of P to the timestamp of the transaction, TS(Ti).

The obvious drawback to this system is the cost of storing multiple versions of objects in the database. On the other hand, reads are never blocked, which can be important for workloads mostly involving reading values from the database. MVCC is particularly adept at implementing true snapshot isolation, something which other methods of concurrency control frequently do either incompletely or with high performance costs.

At t1, the state of a DB could be:

Time   Object1   Object2
t0     "Foo"     "Bar"
t1     "Hello"   "Bar"

This indicates that the current set of this database (perhaps a key-value store database) is Object1="Hello", Object2="Bar". Previously, Object1 was "Foo", but that value has been superseded. It is not deleted, because the database holds "multiple versions", but it will be deleted later.

If a long-running transaction starts a read operation, it will operate at transaction t1 and see this state. If there is a concurrent update (during that long-running read transaction) which deletes Object2 and adds Object3 = "Foo-Bar", the database will look like:

Time   Object1   Object2     Object3
t0     "Foo"     "Bar"       -
t1     "Hello"   "Bar"       -
t2     "Hello"   (deleted)   "Foo-Bar"

Now there is a new version as of transaction ID t2. Note, critically, that the long-running read transaction still has access to a coherent snapshot of the system at t1, even though the write transaction added data as of t2; so the read transaction is able to run in isolation from the update transaction that created the t2 values. This is how MVCC allows isolated, ACID reads without any locks.
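A minimal sketch of the version-selection idea, assuming an in-memory list of (write-timestamp, value) versions per object; it illustrates timestamp-based reads only, not the write-conflict and abort rules described above.

# Each object keeps a list of (write_timestamp, value) versions.
# A reader at timestamp ts sees the newest version written at or before ts.
class MVCCStore:
    def __init__(self):
        self.versions = {}                  # key -> list of (wts, value)

    def write(self, key, value, ts):
        self.versions.setdefault(key, []).append((ts, value))
        self.versions[key].sort()           # keep versions ordered by timestamp

    def read(self, key, ts):
        visible = [v for (wts, v) in self.versions.get(key, []) if wts <= ts]
        return visible[-1] if visible else None

db = MVCCStore()
db.write("Object1", "Foo", ts=0)
db.write("Object1", "Hello", ts=1)          # a newer version; the old one is kept
print(db.read("Object1", ts=0))             # 'Foo'   - the snapshot at t0
print(db.read("Object1", ts=2))             # 'Hello' - the latest version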

Recovery

(b) Discuss client/server model and mobile databases (16) (NOVDEC 2010)


Mobile Databases

Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing. Portable computing devices, coupled with wireless communications, allow clients to access data from virtually anywhere and at any time.

There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized. Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
- The limited and intermittent connectivity afforded by wireless communications
- The limited life of the power supply (battery)


- The changing topology of the network

In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1. It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network:
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage the mobility, the entire domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can roam unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own, using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered as peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments

The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
- The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 54: Database Technology

The inference engine and knowledge base are separated because the reasoning mechanism needs to be as stable as possible the knowledge base must be able to grow and change as knowledge is added this arrangement enables the system to be built from or converted to a shell It is reasonable to produce a richer more elaborate description of the typical expert

system A more elaborate description which still includes the components that are to be found in

almost any real-world system would look like thisKnowledge representation formalisms amp InferenceKR InferenceLogic Resolution principleProduction rules backward (top-down goal directed)

forward (bottom-up data-driven)Semantic nets amp Frames Inheritance amp advanced reasoningCase-based Reasoning Similarity basedKBS tools ndash Shells- Consist of KA Tool Database amp Development Interface- Inductive Shells

- simplest - example cases represented as matrix of known data(premises) and resulting effects - matrix converted into decision tree or IF-THEN statements- examples selected for the tool

Rule-based shells - simple to complex - IF-THEN rules

Hybrid shells - sophisticate amp powerful - support multiple KR paradigms amp reasoning schemes - generic tool applicable to a wide range

Special purpose shells - specifically designed for particular types of problems

54

- restricted to specialised problems-Scratch

- require more time and effort - no constraints like shells - shells should be investigated first

Some example KBSsDENDRAL (chemical)MYCIN (medicine)XCONRI (computer)Typical tasks of KBS(1) Diagnosis - To identify a problem given a set of symptoms or malfunctions

eg diagnose reasons for engine failure(2) Interpretation - To provide an understanding of a situation from available information eg DENDRAL(3) Prediction - To predict a future state from a set of data or observations eg Drilling Advisor PLANT(4) Design - To develop configurations that satisfy constraints of a design problem eg XCON(5) Planning - Both short term amp long term in areas like project management product development or financial planning eg HRM(6) Monitoring - To check performance amp flag exceptions eg KBS monitors radar data and estimates the position of the space shuttle(7) Control - To collect and evaluate evidence and form opinions on that evidence

eg control patientrsquos treatment(8) Instruction - To train students and correct their performance eg give medical students experience diagnosing illness (9) Debugging - To identify and prescribe remedies for malfunctions

eg identify errors in an automated teller machine network and ways to correct the errorsAdvantages

- Increase availability of expert knowledgeexpertise not accessibletraining future experts

- Efficient and cost effective- Consistency of answers- Explanation of solution- Deal with uncertainty

Limitations- Lack of common sense- Inflexible Difficult to modify

- Restricted domain of expertise- Lack of learning ability- Not always reliable

55

6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

mimics organisational structure with data

local access and autonomy without exclusion

cheaper to create and easier to expand

improved availabilityreliabilityperformance by removing reliance on a central site

Reduced communication overhead

Most data access is local less expensive and performs better Improved processing power

Many machines handling the database rather than a single server more complex to implement more costly to maintain security and integrity control standards and experience are lacking Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages(NOVDEC 2010)

Multi-Version Locks Multiversion concurrency control (abbreviated MCC or MVCC) in the database field of computer science is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memoryFor instance a database will implement updates not by deleting an old piece of data and overwriting it with a new one but instead by marking the old data as obsolete and adding the newer version Thus there are multiple versions stored but only one is the latest This allows the database to avoid overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old obsolete data objects For a document-oriented database such as CouchDB Riak or MarkLogic Server it also allows

56

the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems (which may involve data management, transaction management, and database recovery) have their origins in distributed database systems.
- In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications,
  - the limited life of the power supply (battery), and
  - the changing topology of the network.
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture
- The general architecture of a mobile platform is illustrated in Fig 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
  - Fixed hosts are general-purpose computers configured to manage mobile units.
  - Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications
- The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
  - To manage mobility, the entire domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station.
  - Mobile units move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture just described is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
- The characteristics of mobile computing include:
  - communication latency,
  - intermittent connectivity,
  - limited battery life, and
  - changing client location.
- The server may not be able to reach a client: a client may be unreachable because it is dozing (in an energy-conserving state in which many subsystems are shut down) or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server; a minimal sketch of this idea follows.
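As an illustration only (all names here are invented, and send stands in for real network I/O), such a proxy can be little more than a queue that is flushed opportunistically:

    from collections import deque

    class ClientProxy:
        """Caches updates for an unreachable server; flushes on reconnect."""

        def __init__(self, send):
            self.send = send          # delivers one update, may raise ConnectionError
            self.pending = deque()    # updates cached while disconnected

        def submit(self, update):
            self.pending.append(update)
            self.flush()              # opportunistically try to deliver

        def flush(self):
            while self.pending:
                try:
                    self.send(self.pending[0])
                except ConnectionError:
                    return            # still unreachable; keep updates cached
                self.pending.popleft()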

- Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client; the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
- From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
  - The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, plus additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments.
  - The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.
- Data management issues as applied to mobile databases:
  - data distribution and replication,
  - transaction models,
  - query processing,
  - recovery and fault tolerance,
  - mobile database design,
  - location-based services,
  - division of labor, and
  - security.

Application: Intermittently Synchronized Databases
- Whenever clients connect (through a process known in industry as synchronization of a client with a server) they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast (one-on-one communication between the server and the client) or multicast (one sender or server may periodically communicate to a set of receivers or update a group of clients).
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected; it can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
A sketch of one such synchronization round is given below.
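A minimal sketch of one synchronization round, with invented names and an in-memory stand-in for the server: the client uploads its locally queued updates and receives the batch of updates it has not yet seen.

    # Hypothetical ISDB synchronization round; 'last_sync' is a sequence
    # number marking what the client has already installed.

    class SyncServer:
        def __init__(self):
            self.log = []                      # global update log: (seqno, update)

        def exchange(self, client_updates, last_sync):
            # Install the client's batch, then return everything newer than last_sync.
            for u in client_updates:
                self.log.append((len(self.log) + 1, u))
            return [(n, u) for (n, u) in self.log if n > last_sync]

    class SyncClient:
        def __init__(self, server):
            self.server = server
            self.outbox = []                   # updates made while disconnected
            self.last_sync = 0
            self.local_db = []

        def synchronize(self):
            batch = self.server.exchange(self.outbox, self.last_sync)
            self.outbox = []
            for seqno, update in batch:
                self.local_db.append(update)   # install received updates locally
                self.last_sync = seqno
            # (the client's own updates are echoed back here; a real
            #  system would filter them out or reconcile conflicts)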

(b)(ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, only left-deep plans are considered.
  - Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.
  - Each operator is typically implemented using a "pull" interface: when an operator is "pulled" for the next output tuples, it "pulls" on its inputs and computes them.
- Two main issues:
  - For a given query, what plans are considered? (An algorithm to search the plan space for the cheapest estimated plan.)
  - How is the cost of a plan estimated?
- Ideally: find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.
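A quick back-of-the-envelope cost comparison using these statistics, for a page-oriented simple nested loops join of the two relations (counting page I/Os only; the snippet is illustrative arithmetic, not an actual optimizer):

    # Rough plan-cost arithmetic with the Reserves/Sailors statistics above.
    M, N = 1000, 500                  # pages in Reserves, pages in Sailors

    scan_reserves = M                 # full scan of Reserves: 1000 I/Os
    nl_reserves_outer = M + M * N     # nested loops, Reserves as outer
    nl_sailors_outer = N + N * M      # same join, Sailors as outer

    print(nl_reserves_outer)          # 501000 I/Os
    print(nl_sailors_outer)           # 500500 I/Os: the cheaper alternative

Even this tiny example shows why the optimizer must compare alternatives: the same join, with the smaller relation as the outer, already saves 500 page I/Os.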

Query Blocks: Units of Optimization
- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - all available access methods for each relation in the FROM clause, and
  - all left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods); a small enumeration sketch follows.
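As an illustration of how large even the restricted left-deep space is, the toy enumerator below pairs every relation permutation with a join-method choice per join (relation and method names invented; this is not System R's dynamic-programming search itself):

    # Enumerate left-deep join alternatives: permutation x join method per join.
    from itertools import permutations, product

    def left_deep_plans(relations, join_methods):
        k = len(relations) - 1            # number of joins in a left-deep tree
        return [(perm, methods)
                for perm in permutations(relations)
                for methods in product(join_methods, repeat=k)]

    plans = left_deep_plans(["Sailors", "Reserves", "Boats"],
                            ["nested-loops", "sort-merge", "hash"])
    print(len(plans))                     # 3! * 3**2 = 54 alternatives

With just three relations and three join methods there are already 54 left-deep plans, which is why cost estimation and pruning matter.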

8 (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases

- To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Store and transmit multimedia data in compressed form.
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standard for video data exploits commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
- MPEG-2: designed for digital broadcast systems and digital video disks, with negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

- Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.

Continuous-Media Data
- The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.
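To see what "sufficiently fast" means for the MPEG figures above, the per-minute storage sizes convert directly into sustained per-stream delivery rates (a rough calculation, taking 1 MB = 10^6 bytes):

    # Convert per-minute storage figures into required delivery rates.
    for name, mb_per_min in [("MPEG-1", 12.5), ("MPEG-2", 17.0)]:
        bits_per_sec = mb_per_min * 1e6 * 8 / 60
        print(f"{name}: {bits_per_sec / 1e6:.2f} Mbit/s")
    # MPEG-1: 1.67 Mbit/s, MPEG-2: 2.27 Mbit/s sustained per stream

These are the end-to-end rates a video server must guarantee per stream, without gaps or buffer overflow.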

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries, for example:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries? Yes, by adding recursion.
- Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries

15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer."
- Datalog is a query language that has the same if-then flavor.
  - New: the answer table can appear in the From clause, i.e., be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
  - That takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. (A fixpoint computation, sketched below, does not have this limitation.)
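To make the contrast concrete, here is a naive least-fixpoint evaluation of the recursive Comp (transitive components) program in Python; the small Assembly instance is invented for illustration:

    # Naive fixpoint evaluation of:
    #   Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
    #   Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
    # The Assembly instance below is made up for illustration.

    assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
                ("wheel", "spoke", 2), ("wheel", "tire", 1),
                ("frame", "seat", 1), ("frame", "pedal", 1)}

    comp = set()
    while True:
        new = {(p, s) for (p, s, q) in assembly}
        new |= {(p, s2) for (p, p2, q) in assembly
                        for (p1, s2) in comp if p1 == p2}
        if new <= comp:
            break            # least fixpoint reached: no new tuples
        comp |= new

    print(sorted(comp))      # includes ('trike', 'spoke'), ('trike', 'tire'), ...

The loop applies both rules until nothing new is derived, so it finds components at every depth regardless of how many levels the instance has.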

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, this semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
- Consider the following program:
  Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
  According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part.
- Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers), and returns the doubled values unioned with the input. E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
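A quick concrete check of the definition f(v) = v, with finite test sets chosen only for illustration:

    # Fixpoint check for the double+ function on small finite sets.
    def double_plus(s):
        return {2 * x for x in s} | s

    print(double_plus({0}) == {0})               # True: {0} is a fixpoint
    print(double_plus({1, 2, 5}) == {1, 2, 5})   # False: not a fixpoint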

iv. Least Model = Least Fixpoint
- Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing.
- Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics.
- The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
- If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe.
- If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences.
- A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification
- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X=c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
- Semi-naive fixpoint evaluation avoids repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. The recursive Comp rule becomes (a runnable sketch follows):
    Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
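A sketch of semi-naive evaluation, reusing the invented assembly instance from the earlier naive sketch; only tuples derived in the previous iteration (the delta) are joined with Assembly:

    # Semi-naive evaluation of the recursive Comp rule: join Assembly with
    # delta_Comp (last iteration's new tuples), never with all of Comp.

    comp = {(p, s) for (p, s, q) in assembly}      # apply the base rule once
    delta = set(comp)                              # delta_Comp
    while delta:
        derived = {(p, s2) for (p, p2, q) in assembly
                           for (p2b, s2) in delta if p2b == p2}
        delta = derived - comp                     # keep only genuinely new tuples
        comp |= delta

Because every inference uses at least one delta fact, no inference is repeated across iterations, which is the point of the delta rewriting.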

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
  SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
  Magic_SL(spoke).
  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
  SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
- The Magic Sets program rewriting algorithm can be summarized as follows:
  - Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by the rule.
  - Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.




the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 56: Database Technology

6 (a) Compare Distributed databases and conventional databases (16) (NOVDEC 2010)

DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES

mimics organisational structure with data

local access and autonomy without exclusion

cheaper to create and easier to expand

improved availabilityreliabilityperformance by removing reliance on a central site

Reduced communication overhead

Most data access is local less expensive and performs better Improved processing power

Many machines handling the database rather than a single server more complex to implement more costly to maintain security and integrity control standards and experience are lacking Design issues are more complex

7 (a) Explain the Multi-Version Locks and Recovery in Query Languages(NOVDEC 2010)

Multi-Version Locks Multiversion concurrency control (abbreviated MCC or MVCC) in the database field of computer science is a concurrency control method commonly used by database management systems to provide concurrent access to the database and in programming languages to implement transactional memoryFor instance a database will implement updates not by deleting an old piece of data and overwriting it with a new one but instead by marking the old data as obsolete and adding the newer version Thus there are multiple versions stored but only one is the latest This allows the database to avoid overhead of filling in holes in memory or disk structures but requires (generally) the system to periodically sweep through and delete the old obsolete data objects For a document-oriented database such as CouchDB Riak or MarkLogic Server it also allows

56

the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases
- Recent advances in portable and wireless technology have led to mobile computing, a new dimension in data communication and processing.
- Portable computing devices coupled with wireless communications allow clients to access data from virtually anywhere and at any time.
- There are a number of hardware and software problems that must be resolved before the capabilities of mobile computing can be fully utilized.
- Some of the software problems – which may involve data management, transaction management, and database recovery – have their origins in distributed database systems. In mobile computing the problems are more difficult, mainly because of:
  - the limited and intermittent connectivity afforded by wireless communications
  - the limited life of the power supply (battery)
  - the changing topology of the network
- In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

- The general architecture of a mobile platform is illustrated in Fig. 30.1.
- It is a distributed architecture where a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.
- Fixed hosts are general-purpose computers configured to manage mobile units.
- Base stations function as gateways to the fixed network for the Mobile Units.

Wireless Communications –
- The wireless medium has bandwidth significantly lower than that of a wired network.
- The current generation of wireless technology has data rates ranging from the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi).
- Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.
- Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region.
- Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones.
- Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships –
- Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage.
- To manage it, the entire mobility domain is divided into one or more smaller domains, called cells, each of which is supported by at least one base station.
- Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.
- The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.
- Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.
- In a MANET, co-located mobile units do not need to communicate via a fixed network, but instead form their own network using cost-effective technologies such as Bluetooth.
- In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.
- Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.
- MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.
- Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.
- Resource discovery and data routing by mobile units make computing in a MANET even more complicated.
- Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location
The server may not be able to reach a client:


- A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.
- In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate.
- Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server; a sketch of this idea follows this list.
- Mobile computing poses challenges for servers as well as clients.
- The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.
- One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.
- Client mobility also poses many data management challenges:
  - Servers must keep track of client locations in order to efficiently route messages to them.
  - Client data should be stored in the network location that minimizes the traffic necessary to access it.
  - The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
  - Client mobility also allows new, location-based applications.
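A minimal sketch of the proxy idea (illustrative Python; ClientProxy and its methods are hypothetical names, not drawn from any real mobile middleware):

    class ClientProxy:
        # Sits in the fixed network; caches updates for a dozing or
        # out-of-range client and flushes them when the client reconnects.
        def __init__(self, send):
            self.send = send        # transport callback, e.g. a socket write
            self.pending = []       # updates the client has not yet received
            self.reachable = False

        def deliver(self, update):
            if self.reachable:
                self.send(update)
            else:
                self.pending.append(update)   # cache on the client's behalf

        def on_reconnect(self):
            self.reachable = True
            while self.pending:
                self.send(self.pending.pop(0))

    proxy = ClientProxy(send=print)
    proxy.deliver("row 42 -> 'a'")   # client dozing: update is cached
    proxy.on_reconnect()             # cached update is flushed to the client

A symmetric proxy on the client side can cache updates intended for the server while the client is out of range.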

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues as they apply to mobile databases:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
- Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
- The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.
- This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
- This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
- The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
  - A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
  - A server cannot connect to a client at will.
  - Issues of wireless versus wired client connections and power conservation are generally immaterial.
  - A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
  - A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.
A sketch of the synchronization exchange follows.
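A minimal sketch of one synchronization round, assuming the server keeps a per-client position in its update log (illustrative Python; SyncServer and its methods are hypothetical, not a specific product's protocol):

    class SyncServer:
        def __init__(self):
            self.log = []          # committed updates, in commit order
            self.client_pos = {}   # client id -> index of next update to send

        def commit(self, update):
            self.log.append(update)

        def synchronize(self, client_id, client_updates):
            # Client-initiated exchange: compute what the client has missed,
            # then install the client's own batch of updates.
            start = self.client_pos.get(client_id, 0)
            batch = self.log[start:]            # updates missed while disconnected
            for u in client_updates:
                self.commit(u)
            self.client_pos[client_id] = len(self.log)   # client is now current
            return batch

    server = SyncServer()
    server.commit("price(widget) = 10")
    print(server.synchronize("laptop-7", ["order(widget, 2)"]))   # ['price(widget) = 10']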

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
- Query optimization is an important task in a relational DBMS.
- One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
- There are two parts to optimizing a query:
  - Consider a set of alternative plans.
    - Must prune the search space; typically, only left-deep plans are considered.
  - Estimate the cost of each plan that is considered.
    - Must estimate the size of the result and the cost for each plan node.
    - Key issues: statistics, indexes, operator implementations.
- A plan is a tree of relational algebra (RA) operators, with a choice of algorithm for each operator.


- Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for its next output tuples, it `pulls' on its inputs and computes them (a sketch follows below).
- Two main issues:
  - For a given query, what plans are considered? We need an algorithm to search the plan space for the cheapest (estimated) plan.
  - How is the cost of a plan estimated?
- Ideally, we want to find the best plan; practically, we aim to avoid the worst plans. We will study the System R approach.
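A minimal sketch of this pull (iterator) style, using tuples shaped like the Sailors schema introduced below (illustrative Python; the operator classes are toy stand-ins, not a real DBMS's executor API):

    class Scan:
        # Leaf operator: when pulled, yields the tuples of a stored relation.
        def __init__(self, relation):
            self.relation = relation
        def __iter__(self):
            yield from self.relation

    class Select:
        # When pulled for its next output tuple, pulls on its input
        # until a tuple satisfies the predicate.
        def __init__(self, child, predicate):
            self.child, self.predicate = child, predicate
        def __iter__(self):
            for t in self.child:
                if self.predicate(t):
                    yield t

    sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5)]
    plan = Select(Scan(sailors), lambda t: t[2] > 7)   # rating > 7
    print(list(plan))                                  # [(31, 'lubber', 8, 55.5)]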

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
- Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

- An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
- Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
- For each block, the plans considered are:
  - All available access methods for each relation in the FROM clause.
  - All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods). A sketch of this enumeration follows.
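A minimal sketch of enumerating left-deep join trees (illustrative Python; the join-method list is an assumption for illustration, and no cost model is included):

    from itertools import permutations

    JOIN_METHODS = ["nested-loops", "sort-merge", "hash"]

    def left_deep_plans(relations):
        # Every permutation of the relations, joined one at a time,
        # choosing a join method at each of the N-1 joins.
        for order in permutations(relations):
            def extend(plan, remaining):
                if not remaining:
                    yield plan
                    return
                for method in JOIN_METHODS:
                    yield from extend((plan, method, remaining[0]), remaining[1:])
            yield from extend(order[0], order[1:])

    plans = list(left_deep_plans(["Sailors", "Reserves"]))
    print(len(plans), plans[0])   # 6 plans: 2 orders x 3 join methods

A real optimizer (e.g., System R's dynamic programming) prunes this space and keeps only the cheapest plan found for each subset of relations.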

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases

- To provide database functions such as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
- The database must handle large-object representation.
- Similarity-based retrieval must be provided by special index structures.
- The database must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
- Multimedia data are stored and transmitted in compressed form:
  - JPEG and GIF are the most widely used formats for image data.
  - The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
  - MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
  - MPEG-2: designed for digital broadcast systems and digital video disks, with negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.

  - There are several alternatives for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc. A quick check of the per-minute figures above follows.
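These figures follow from the nominal bit rates, assuming roughly 1.5 Mbit/s for MPEG-1 and about 2.3 Mbit/s for MPEG-2 material at this quality (the rates are illustrative assumptions; real MPEG-2 rates vary with content):

    def mb_per_minute(mbit_per_s):
        # bits/second * 60 seconds, divided by 8 bits/byte and 10^6 bytes/MB
        return mbit_per_s * 1e6 * 60 / 8 / 1e6

    print(round(mb_per_minute(1.5), 1))   # ~11.2 MB/min, near the 12.5 MB MPEG-1 figure
    print(round(mb_per_minute(2.3), 1))   # ~17.2 MB/min, matching the 17 MB MPEG-2 figure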

Continuous-Media Data
- The most important types are video and audio data.
- They are characterized by high data volumes and real-time information-delivery requirements:
  - Data must be delivered sufficiently fast that there are no gaps in the audio or video.
  - Data must be delivered at a rate that does not cause overflow of system buffers.
  - Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
- Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.
- Current video-on-demand servers are based on file systems; existing database systems do not meet the real-time response requirements.
- Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.
- Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs when registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15 Deductive Databases
- SQL-92 cannot express some queries:
  - Are we running low on any parts needed to build a ZX600 sports car?
  - What is the total component and assembly cost to build a ZX600 at today's part prices?
- Can we extend the query language to cover such queries? Yes, by adding recursion.
- Recursion in SQL: these concepts are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
- SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer."
- Datalog is a query language that has the same if-then flavor:
  - New: the answer table can appear in the FROM clause, i.e., it can be defined recursively.
  - Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
- Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
  - One join takes us one level down the Assembly hierarchy.
  - To find components that are one level deeper (e.g., rim), we need another join.
  - To find all components, we need as many joins as there are levels in the given instance.
- For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression. A recursive Datalog program avoids this limitation; a sketch of its evaluation follows.
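A minimal sketch of evaluating the recursive Comp program (the usual two rules from this chapter) by repeatedly applying the rules until no new tuples appear, i.e., computing its least fixpoint (illustrative Python; the Assembly tuples are an assumed toy instance):

    # Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
    assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2),
                ("wheel", "tire", 1), ("tire", "rim", 1)}

    comp = {(p, s) for (p, s, q) in assembly}           # base rule
    while True:
        new = {(p, s2) for (p, s1, q) in assembly
                       for (t, s2) in comp if t == s1}  # recursive rule
        if new <= comp:
            break          # least fixpoint reached: no new inferences
        comp |= new

    print(sorted(comp))    # includes ('trike', 'spoke'), ('trike', 'rim'), ...

No fixed number of self-joins can compute this for every instance, which is exactly the limitation described above.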

15.2 Theoretical Foundations
- The first approach to defining what a Datalog program means is called the least model semantics. It gives users a way to understand the program without thinking about how the program is to be executed; that is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
- The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
- The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for non-recursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is larger than or equal to v.
- In general there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs


- Consider the following program: ComplexParts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
- According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part.
- Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers): e.g., double+({1, 2, 5}) = {2, 4, 10} union {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
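This is easy to experiment with (illustrative Python; the extra fixpoints shown are just further examples):

    def double_plus(s):
        # double+(S) = {2n | n in S} union S
        return {2 * n for n in s} | s

    print(double_plus({1, 2, 5}))        # {1, 2, 4, 5, 10}
    print(double_plus({0}) == {0})       # True: {0} is a (finite) fixpoint
    print(double_plus(set()) == set())   # True: the empty set is also a fixpoint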

iv. Least Model = Least Fixpoint
- Every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing.
- Users can understand a program in terms of "if the body is true, the head is also true", thanks to the least model semantics.
- The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
- If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe.
- If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences.
- A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:


  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  - If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
Big2 (Part) AS
  (SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
  ((SELECT A2.Part FROM Assembly A2)
   EXCEPT
   (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to negation.

15.4 Efficient Evaluation of Recursive Queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
  - For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  - Rewrite the program to use the delta tables, and update the delta tables between iterations. For example, the Comp rule is rewritten as:
Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).
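A minimal sketch of seminaive evaluation for Comp, using a delta table between iterations (illustrative Python, over the same assumed toy Assembly instance as in the earlier sketch):

    assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2),
                ("wheel", "tire", 1), ("tire", "rim", 1)}

    comp = {(p, s) for (p, s, q) in assembly}   # base rule
    delta = set(comp)                           # tuples new in the last iteration
    while delta:
        # Join only against delta_Comp, so every inference uses a new fact
        # and cannot have been made in an earlier iteration.
        derived = {(p, s2) for (p, s1, q) in assembly
                           for (t, s2) in delta if t == s1}
        delta = derived - comp                  # keep only genuinely new tuples
        comp |= delta

    print(sorted(comp))   # same least fixpoint as naive evaluation, fewer re-derivations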


15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P2, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.
- Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P2, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
- Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
- Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.

70

Page 57: Database Technology

the system to optimize documents by writing entire documents onto contiguous sections of diskmdashwhen updated the entire document can be re-written rather than bits and pieces cut out or maintained in a linked non contiguous database structureMVCC also provides potential point in time consistent views In fact read transactions under MVCC typically use a timestamp or transaction ID to determine what state of the DB to read and read these versions of the data This avoids managing locks for read transactions because writes can be isolated by virtue of the old versions being maintained rather than through a process of locks or mutexes Writes affect future version but at the transaction ID that the read is working at everything is guaranteed to be consistent because the writes are occurring at a later transaction IDIn other words MVCC provides each user connected to the database with a snapshot of the database for that person to work with Any changes made will not be seen by other users of the database until the transaction has been committedMVCC uses timestamps or increasing transaction IDs to achieve transactional consistency MVCC ensures a transaction never has to wait for a database object by maintaining several versions of an object Each version would have a write timestamp and it would let a transaction (Ti) read the most recent version of an object which precedes the transaction timestamp (TS (Ti))If a transaction (Ti) wants to write to an object and if there is another transaction (Tk) the timestamp of Ti must precede the timestamp of Tk (ie TS(Ti) lt TS(Tk)) for the object write operation to succeed which is to say a write cannot complete if there are outstanding transactions with an earlier timestampEvery object would also have a read timestamp and if a transaction Ti wanted to write to object P and the timestamp of that transaction is earlier than the objects read timestamp (TS(Ti) lt RTS(P)) the transaction Ti is aborted and restarted Otherwise Ti creates a new version of P and sets the readwrite timestamps of P to the timestamp of the transaction TS (Ti)The obvious drawback to this system is the cost of storing multiple versions of objects in the database On the other hand reads are never blocked which can be important for workloads mostly involving reading values from the database MVCC is particularly adept at implementing true snapshot isolation something which other methods of concurrency control frequently do either incompletely or with high performance costsAt t1 the state of a DB could be

Time Object1 Object2t1 ldquoHellordquo ldquoBarrdquot2 ldquoFoordquo ldquoBarrdquoThis indicates that the current set of this database (perhaps a key-value store database) is Object1=Hello Object2=Bar Previously Object1 was Foo but that value has been superseded It is not deleted because the database holds ldquomultiple versionsrdquo but will be deleted laterIf a long running transaction starts a read operation it will operate at transaction t1 and see this state If there is a concurrent update (during that long-running read transaction) which deletes Object 2 and adds Object 3 = ldquofoo-barrdquo the database will look likeTime Object1 Object2 Object3

57

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 58: Database Technology

t2 ldquoHellordquo (deleted) ldquoFoo-Barrdquot1 ldquoHellordquo Bart0 ldquoHellordquo BarNow there is a new version as of transaction ID t2 Note critically that the long-running read transaction still has access to a coherent snapshot of the system at t1 even though the write transaction added data as of t2 so the read transaction is able to run in isolation from the update transaction that created the t2 values This is how MVCC allows isolated ACID reads without any locksRecovery

58

(b)Discuss clientserver model and mobile databases (16) (NOVDEC 2010)

59

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
There are two parts to optimizing a query:
• Consider a set of alternative plans.
  – The search space must be pruned; typically only left-deep plans are considered.
• Estimate the cost of each plan that is considered.
  – This requires estimating the size of the result and the cost of each plan node.
  – Key issues: statistics, indexes, operator implementations.
A plan is a tree of relational algebra operators, with a choice of algorithm for each operator.


• Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them.
There are two main issues:
• For a given query, what plans are considered? (An algorithm searches the plan space for the cheapest estimated plan.)
• How is the cost of a plan estimated?
Ideally, we want to find the best plan; practically, we settle for avoiding the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)
This is similar to the old schema; rname is added for variations.
• Reserves: each tuple is 40 bytes long; 100 tuples per page; 1000 pages.
• Sailors: each tuple is 50 bytes long; 80 tuples per page; 500 pages.
Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
For each block, the plans considered are:
• All available access methods for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods).
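To make this concrete with the example schema, the following single-block query (the constants are ours, chosen for illustration) is the kind of input the optimizer plans by picking an access method for each relation and a left-deep join order:

    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid
      AND R.bid = 100
      AND S.rating > 5;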

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)
Multimedia Databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing them outside the database in a file system.
• The database must handle large-object representation.
• Similarity-based retrieval must be provided by special index structures.
• The database must provide guaranteed steady retrieval rates for continuous-media data.
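A minimal sketch of large-object storage in SQL (the table and column names are hypothetical; BLOB is the SQL:1999 large-object type):

    CREATE TABLE Video (
      video_id INTEGER PRIMARY KEY,
      title    VARCHAR(100),
      content  BLOB(2G)   -- the video itself, stored inside the database
    );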

Multimedia Data Formats
Multimedia data are stored and transmitted in compressed form.
• JPEG and GIF are the most widely used formats for image data.
• The MPEG standard for video data exploits commonalities among a sequence of frames to achieve a greater degree of compression.
  – MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
  – MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses one minute of audio-video to approximately 17 MB.
• There are several alternatives for audio encoding:


  – MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained: a video of a person speaking must show lips moving synchronously with the audio.
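As back-of-the-envelope arithmetic (ours, derived from the MPEG-1 figure above, not a number in the original notes), the per-stream delivery requirement and server fan-out can be estimated:

\[
\frac{12.5\ \text{MB/min} \times 8\ \text{bit/byte}}{60\ \text{s/min}} \approx 1.7\ \text{Mbit/s per stream}, \qquad \frac{100\ \text{Mbit/s}}{1.7\ \text{Mbit/s}} \approx 60
\]

so a server with a 100 Mbit/s link could sustain at most about 60 such streams concurrently, ignoring all protocol overhead.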

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals.
• They must guarantee end-to-end delivery rates.
• Current video-on-demand servers are based on file systems; existing database systems do not meet the real-time response requirements.
• Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.
• Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.
Similarity-Based Retrieval
Examples of similarity-based retrieval:
• Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs when registering a new trademark).
• Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
• Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15. Deductive Databases
SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
Can we extend the query language to cover such queries? Yes, by adding recursion.
Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.
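As a brief sketch of the SQL:1999 facility (assuming the Assembly(Part, Subpart, Qty) relation of the running example; 'trike' is the example part used below), a recursive query to list all components, direct or indirect, of a trike could be written as:

    WITH RECURSIVE Comp(Part, Subpart) AS (
        SELECT A.Part, A.Subpart           -- direct subparts
        FROM Assembly A
      UNION
        SELECT C.Part, A.Subpart           -- subparts of subparts, recursively
        FROM Comp C, Assembly A
        WHERE C.Subpart = A.Part
    )
    SELECT Subpart FROM Comp WHERE Part = 'trike';

Vendor syntax varies: DB2, for instance, omits the RECURSIVE keyword and expects UNION ALL in the recursive definition.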

15.1 Introduction to Recursive Queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer."
Datalog is a query language that has the same if-then flavor:
• New: the answer table can appear in the FROM clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• One join takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance.
For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, simply by including more levels than the number of joins in the expression.
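For instance, a single self-join (table and column names as in the running example) reaches exactly two levels of the hierarchy; each additional level requires one more join:

    SELECT A1.Part, A2.Subpart
    FROM Assembly A1, Assembly A2
    WHERE A1.Subpart = A2.Part;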

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, the semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.
The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.
The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
• The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is greater than or equal to v.
• In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
• If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
Consider the following program:
Complex_Parts(Part) :- Assembly(Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.
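For contrast, a hypothetical unsafe rule (ours, not from the original notes): the head variable Price appears nowhere in the body, so it could be bound to any value from an infinite domain, and the rule is not range restricted:

    Priced_Part(Part, Price) :- Assembly(Part, Subpt, Qty).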

iii. The Fixpoint Operator
• Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
• Consider the function double+, which is applied to a set of integers and returns a set of integers containing the doubles of the input elements together with the input itself (i.e., D is the set of all sets of integers).
• E.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
• The set of all integers is a fixpoint of double+.
• The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.
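A worked observation (ours, not in the original notes): applying double+ to the empty set shows that its least fixpoint is the empty set, which is smaller than both fixpoints mentioned above:

\[
\text{double}^{+}(\emptyset) \;=\; \{\,2x \mid x \in \emptyset\,\} \cup \emptyset \;=\; \emptyset .
\]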

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of "if the body is true, the head is also true," thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.
15.3 Recursive Queries with Negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
• If rules contain not, there may not be a least fixpoint. Consider an Assembly instance in which trike is the only part that has 3 or more copies of some subpart. Intuitively, trike should be in Big, and it will be if we apply Rule 1 first.
• But we have Small(trike) if Rule 2 is applied first.
• There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints).

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.
15.3.2 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• A program is stratified if, whenever T depends on not S, S does not depend on T (or on not T).
• If a program is stratified, the tables in the program can be partitioned into strata:


  – Stratum 0: all database tables.
  – Stratum I: tables defined in terms of tables in Stratum I and lower strata.
  – If T depends on not S, then S is in a lower stratum than T.
Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).
The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:
WITH
  Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2;
15.3.3 Aggregate Operations
SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part;
The corresponding Datalog rule, with aggregation in the head, is:
NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).
• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to the treatment of negation.
15.4 Efficient Evaluation of Recursive Queries
• Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
• Unnecessary inferences: if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.
15.4.1 Fixpoint Evaluation without Repeated Inferences
• Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that, when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means the inference could not have been carried out in earlier iterations).
  – For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
  – Rewrite the program to use the delta tables, and update the delta tables between iterations. For example:


Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).
The Magic Sets Algorithm
• Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev so that it infers only tuples with a relevant value in the first column.
Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).
The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by the rule.
2. Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
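Returning to the seminaive rewriting above, a minimal sketch of one iteration in SQL terms follows (the working tables Comp, delta_Comp, and delta_next are our hypothetical names; the loop itself is described in comments, since plain SQL has no iteration construct):

    -- Derive candidate facts that use at least one tuple from the last
    -- iteration, keeping only those not already in Comp:
    INSERT INTO delta_next
      (SELECT A.Part, D.Subpt
       FROM Assembly A, delta_Comp D
       WHERE A.Subpart = D.Part)
      EXCEPT
      (SELECT Part, Subpt FROM Comp);
    -- Then add delta_next to Comp, make delta_next the new delta_Comp,
    -- and repeat until delta_next comes out empty.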


each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 60: Database Technology

Mobile Databases Recent advances in portable and wireless technology led to mobile computing a new

dimension in data communication and processing Portable computing devices coupled with wireless communications allow clients to

access data from virtually anywhere and at any time There are a number of hardware and software problems that must be resolved before the

capabilities of mobile computing can be fully utilized Some of the software problems ndash which may involve data management transaction

management and database recovery ndash have their origins in distributed database systems In mobile computing the problems are more difficult mainly

The limited and intermittent connectivity afforded by wireless communications The limited life of the power supply(battery)

60

The changing topology of the network In addition mobile computing introduces new architectural possibilities and

challengesMobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig 301

It is distributed architecture where a number of computers generally referred to as Fixed Hosts and Base Stations are interconnected through a high-speed wired network

Fixed hosts are general purpose computers configured to manage mobile units Base stations function as gateways to the fixed network for the Mobile Units

Wireless Communications ndash The wireless medium have bandwidth significantly lower than those of a wired

network The current generation of wireless technology has data rates range from

the tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet popularly known as WiFi)

Modern (wired) Ethernet by comparison provides data rates on the order of hundreds of megabits per second

The other characteristics distinguish wireless connectivity options interference locality of access range support for packet switching

61

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint


Page 61: Database Technology

The changing topology of the network.

In addition, mobile computing introduces new architectural possibilities and challenges.

Mobile Computing Architecture

The general architecture of a mobile platform is illustrated in Fig. 30.1.

It is a distributed architecture in which a number of computers, generally referred to as Fixed Hosts and Base Stations, are interconnected through a high-speed wired network.

Fixed hosts are general-purpose computers configured to manage mobile units. Base stations function as gateways to the fixed network for the mobile units.

Wireless Communications – The wireless medium has bandwidth significantly lower than that of a wired network. The current generation of wireless technology has data rates ranging from tens to hundreds of kilobits per second (2G cellular telephony) to tens of megabits per second (wireless Ethernet, popularly known as WiFi). Modern (wired) Ethernet, by comparison, provides data rates on the order of hundreds of megabits per second.

Other characteristics distinguish wireless connectivity options: interference, locality of access, range, support for packet switching, and seamless roaming throughout a geographical region. Some wireless networks, such as WiFi and Bluetooth, use unlicensed areas of the frequency spectrum, which may cause interference with other appliances, such as cordless telephones. Modern wireless networks can transfer data in units called packets, as used in wired networks, in order to conserve bandwidth.

Client/Network Relationships – Mobile units can move freely in a geographic mobility domain, an area that is circumscribed by wireless network coverage. To manage the entire area, the mobility domain is divided into one or more smaller domains called cells, each of which is supported by at least one base station. Mobile units can move unrestricted throughout the cells of a domain while maintaining information-access contiguity.

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network, emulating a traditional client-server architecture.

Wireless communications, however, make other architectures possible. One alternative is a mobile ad-hoc network (MANET), illustrated in Fig. 29.2.

In a MANET, co-located mobile units do not need to communicate via a fixed network; instead, they form their own network using cost-effective technologies such as Bluetooth.

In a MANET, mobile units are responsible for routing their own data, effectively acting as base stations as well as clients.

Moreover, they must be robust enough to handle changes in the network topology, such as the arrival or departure of other mobile units.

MANET applications can be considered peer-to-peer, meaning that a mobile unit is simultaneously a client and a server.

Transaction processing and data consistency control become more difficult, since there is no central control in this architecture.

Resource discovery and data routing by mobile units make computing in a MANET even more complicated.

Sample MANET applications are multi-user games, shared whiteboards, distributed calendars, and battle information sharing.

Characteristics of Mobile Environments
The characteristics of mobile computing include:
- Communication latency
- Intermittent connectivity
- Limited battery life
- Changing client location

The server may not be able to reach a client.


A client may be unreachable because it is dozing – in an energy-conserving state in which many subsystems are shut down – or because it is out of range of a base station.

In either case, neither client nor server can reach the other, and modifications must be made to the architecture in order to compensate for this case.

Proxies for unreachable components are added to the architecture. For a client (and symmetrically for a server), the proxy can cache updates intended for the server, as in the sketch below.
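As a concrete illustration of such a proxy, here is a minimal Python sketch, assuming a simple in-memory queue; the class and method names are illustrative, not part of any real mobile DBMS.

from collections import deque

class UpdateProxy:
    # Client-side proxy: caches updates while the server is unreachable
    # and flushes them, in order, on reconnection.
    def __init__(self, send_to_server):
        self.pending = deque()            # updates cached while disconnected
        self.send_to_server = send_to_server
        self.connected = False

    def apply_update(self, update):
        if self.connected:
            self.send_to_server(update)   # normal path: forward immediately
        else:
            self.pending.append(update)   # dozing or out of range: cache

    def on_reconnect(self):
        self.connected = True
        while self.pending:               # flush cached updates, FIFO
            self.send_to_server(self.pending.popleft())

For example, proxy = UpdateProxy(print); proxy.apply_update({'sid': 1}); proxy.on_reconnect() delivers the cached update once the connection returns.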

Mobile computing poses challenges for servers as well as clients. The latency involved in wireless communication makes scalability a problem: since latency due to wireless communications increases the time to service each client request, the server can handle fewer clients.

One way servers relieve this problem is by broadcasting data whenever possible. A server can simply broadcast data periodically. Broadcast also reduces the load on the server, as clients do not have to maintain active connections to it.

Client mobility also poses many data management challenges:
- Servers must keep track of client locations in order to efficiently route messages to them.
- Client data should be stored in the network location that minimizes the traffic necessary to access it.
- The act of moving between cells must be transparent to the client: the server must be able to gracefully divert the shipment of data from one base station to another without the client noticing.
- Client mobility also allows new applications that are location-based.

Data Management Issues
From a data management standpoint, mobile computing may be considered a variation of distributed computing. Mobile databases can be distributed under two possible scenarios:
- The entire database is distributed mainly among the wired components, possibly with full or partial replication. A base station or fixed host manages its own database with DBMS-like functionality, with additional functionality for locating mobile units, and additional query and transaction management features to meet the requirements of mobile environments.
- The database is distributed among wired and wireless components. Data management responsibility is shared among base stations or fixed hosts and mobile units.

Data management issues, as applied to mobile databases, include:
- Data distribution and replication
- Transaction models
- Query processing
- Recovery and fault tolerance
- Mobile database design
- Location-based services
- Division of labor
- Security

Application: Intermittently Synchronized Databases
Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.

The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.

This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).

The characteristics that make Intermittently Synchronized Databases (ISDBs) distinct from mobile databases are:
- A client connects to the server when it wants to exchange updates. The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers, or update a group of clients.
- A server cannot connect to a client at will.
- Issues of wireless versus wired client connections and power conservation are generally immaterial.
- A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
- A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

A sketch of one synchronization exchange follows.
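The following is a minimal sketch of such an exchange, assuming a toy in-memory server that keeps an outbox of updates per client; all names and structures are illustrative assumptions, not a real synchronization API.

class SyncServer:
    def __init__(self):
        self.outbox = {}                  # client_id -> queued updates

    def register(self, client_id):
        self.outbox[client_id] = []

    def queue_update(self, client_id, update):
        self.outbox[client_id].append(update)

    def synchronize(self, client_id, client_updates):
        # Install the client's uploaded updates by fanning them out to
        # the outboxes of all other registered clients, then hand back
        # the batch queued for this client.
        for update in client_updates:
            for other, box in self.outbox.items():
                if other != client_id:
                    box.append(update)
        batch, self.outbox[client_id] = self.outbox[client_id], []
        return batch

server = SyncServer()
server.register('laptop'); server.register('phone')
server.queue_update('laptop', {'table': 'Reserves', 'op': 'insert'})
print(server.synchronize('laptop', [{'table': 'Sailors', 'op': 'update'}]))
# The laptop installs the returned batch; the phone receives the
# uploaded update the next time it connects.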

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

Two parts to optimizing a query:
- Consider a set of alternative plans.
  - Must prune the search space; typically, only left-deep plans are considered.
- Must estimate the cost of each plan that is considered.
  - Must estimate the size of the result and the cost for each plan node.
  - Key issues: statistics, indexes, operator implementations.

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


Each operator is typically implemented using a `pull' interface: when an operator is `pulled' for the next output tuples, it `pulls' on its inputs and computes them, as in the sketch below.
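To make the pull model concrete, here is a minimal Python sketch of an iterator-style operator tree over in-memory inputs; the operator set and data are illustrative, and real systems expose this as open()/next()/close() methods.

def scan(table):                          # leaf operator: scan a stored table
    for row in table:
        yield row

def select(predicate, child):             # selection pulls from its child
    for row in child:
        if predicate(row):
            yield row

def project(columns, child):              # projection pulls from its child
    for row in child:
        yield tuple(row[c] for c in columns)

reserves = [(1, 101, '2010-06-07'), (2, 102, '2010-06-08')]
plan = project([0], select(lambda r: r[1] == 102, scan(reserves)))
print(list(plan))                         # pulling on the root drives the whole tree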

Two main issues:
- For a given query, what plans are considered?
  - An algorithm to search the plan space for the cheapest (estimated) plan.
- How is the cost of a plan estimated?

Ideally: want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
- Reserves: each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
- Sailors: each tuple is 50 bytes long, 80 tuples per page, 500 pages.

A worked cost example using these statistics follows.
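These statistics drive cost estimation. As a worked example, the following computes the I/O cost of a page-oriented nested loops join of Reserves and Sailors, assuming the simple cost model of M + M*N page I/Os (an assumption made here for illustration; M and N are the outer and inner page counts).

M, N = 1000, 500                    # pages in Reserves and Sailors
cost_reserves_outer = M + M * N     # 1000 + 1000*500 = 501,000 I/Os
cost_sailors_outer = N + N * M      #  500 +  500*1000 = 500,500 I/Os
print(cost_reserves_outer, cost_sailors_outer)

Even for this single join method, choosing the smaller relation as the outer saves 500 I/Os; differences of this kind are exactly what the plan enumerator weighs.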

Query Blocks: Units of Optimization
An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.

Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

For each block, the plans considered are:
- All available access methods for each relation in the FROM clause.
- All left-deep join trees (i.e., all ways to join the relations one at a time, with the inner relation in the FROM clause, considering all relation permutations and join methods), as enumerated in the sketch below.
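For intuition, a minimal Python sketch that enumerates left-deep join orders as permutations of the FROM-clause relations (the relation names are illustrative; access-method and join-method choices, and the pruning a real optimizer performs, are omitted):

from itertools import permutations

relations = ['Sailors', 'Reserves', 'Boats']    # illustrative FROM clause
for order in permutations(relations):
    plan = order[0]
    for r in order[1:]:
        plan = '(' + plan + ' JOIN ' + r + ')'  # new relation is always inner
    print(plan)
# 3 relations yield 3! = 6 left-deep orders before join-method choices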

8 (a) Discuss multimedia databases in detail (8) (NOVDEC 2010)

Multimedia databases

To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.

The database must handle large object representation. Similarity-based retrieval must be provided by special index structures. The system must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
Store and transmit multimedia data in compressed form:
- JPEG and GIF: the most widely used formats for image data.
- The MPEG standards for video data use commonalities among a sequence of frames to achieve a greater degree of compression.
- MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB (see the arithmetic check after this list).
- MPEG-2: designed for digital broadcast systems and digital video disks, with negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
- Several alternatives exist for audio encoding: MPEG-1 Layer 3 (MP3), RealAudio, Windows Media format, etc.
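As a sanity check on the MPEG-1 figure, assuming MPEG-1's nominal rate of about 1.5 Mbit/s (a figure assumed here, not stated above):

bitrate_mbits = 1.5                     # assumed nominal MPEG-1 rate, Mbit/s
mb_per_minute = bitrate_mbits * 60 / 8  # megabits -> megabytes over a minute
print(round(mb_per_minute, 1))          # ~11.2 MB, consistent with ~12.5 MB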

Continuous-Media Data
The most important types are video and audio data, characterized by high data volumes and real-time information-delivery requirements:
- Data must be delivered sufficiently fast that there are no gaps in the audio or video.
- Data must be delivered at a rate that does not cause overflow of system buffers.
- Synchronization among distinct data streams must be maintained: video of a person speaking must show lips moving synchronously with the audio.

Video Servers
Video-on-demand systems deliver video from central video servers, across a network, to terminals, and must guarantee end-to-end delivery rates.

Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.

Multimedia data are stored on several disks (in a RAID configuration), or on tertiary storage for less frequently accessed data.

Head-end terminals are used to view multimedia data: PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:

- Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user (e.g., identify similar designs for registering a new trademark).
- Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking (e.g., test user input against stored commands).
- Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail (8) (NOVDEC 2010)

15 Deductive Databases
SQL-92 cannot express some queries:
- Are we running low on any parts needed to build a ZX600 sports car?
- What is the total component and assembly cost to build a ZX600 at today's part prices?

Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: the concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to recursive queries


15.1.1 Datalog
SQL queries can be read as follows: "If some tuples exist in the From tables that satisfy the Where conditions, then the Select tuple is in the answer." Datalog is a query language that has the same if-then flavor:
- New: the answer table can appear in the From clause, i.e., be defined recursively.
- Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire:
- This takes us one level down the Assembly hierarchy.
- To find components that are one level deeper (e.g., rim), we need another join.
- To find all components, we need as many joins as there are levels in the given instance.

For any relational algebra expression, we can create an Assembly instance for which some answers are not computed, by including more levels than the number of joins in the expression.
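For reference in the sections that follow, which refer to `the Comp program', the recursive components query can be written in the standard two-rule Datalog form (the second rule is the one rewritten with delta tables in Section 15.4.1):
Comp (Part, Subpt) :- Assembly (Part, Subpt, Qty).
Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), Comp (Part2, Subpt).
The first rule says that direct subparts are components; the second adds, recursively, all components of each subpart.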

15.2 Theoretical Foundations
The first approach to defining what a Datalog program means is called the least model semantics; it gives users a way to understand the program without thinking about how the program is to be executed. That is, this semantics is declarative, like the semantics of relational calculus, and not operational, like relational algebra semantics.

The second approach, called the least fixpoint semantics, gives a conceptual evaluation strategy to compute the desired relation instances. This serves as the basis for recursive query evaluation in a DBMS.

The fixpoint semantics is thus operational, and plays a role analogous to that of relational algebra semantics for nonrecursive queries.

i. Least Model Semantics
- The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v.
- In general, there may be no least fixpoint (we could have two minimal fixpoints, neither of which is smaller than the other).
- If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples, this function (fortunately) always has a least fixpoint.

ii. Safe Datalog Programs
Consider the following program:
ComplexParts (Part) :- Assembly (Part, Subpart, Qty), Qty > 2.
According to this rule, a complex part is defined to be any part that has more than two copies of any one subpart. For each part mentioned in the Assembly relation, we can easily check whether it is a complex part. Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body. Such programs are said to be range restricted, and every range-restricted Datalog program has a finite least model if the input relation instances are finite.

iii. The Fixpoint Operator
- Let f be a function that takes values from domain D and returns values from D. A value v in D is a fixpoint of f if f(v) = v.
- Consider the function double+, which is applied to a set of integers and returns a set of integers (i.e., D is the set of all sets of integers); e.g., double+({1, 2, 5}) = {2, 4, 10} ∪ {1, 2, 5}.
- The set of all integers is a fixpoint of double+.
- The set of all even integers is another fixpoint of double+; it is smaller than the first fixpoint.

iv. Least Model = Least Fixpoint
Further, every Datalog program is guaranteed to have a least model, and the least model is equal to the least fixpoint of the program. These results provide the basis for Datalog query processing. Users can understand a program in terms of `if the body is true, the head is also true', thanks to the least model semantics. The DBMS can compute the answer by repeatedly applying the program rules, thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identical.

b. Recursive queries with negation
Big(Part) :- Assembly(Part, Subpt, Qty), Qty > 2, not Small(Part).
Small(Part) :- Assembly(Part, Subpt, Qty), not Big(Part).
- If rules contain not, there may not be a least fixpoint. Consider the Assembly instance: trike is the only part that has 3 or more copies of some subpart. Intuitively, it should be in Big, and it will be if we apply Rule 1 first.
- But we have Small(trike) if Rule 2 is applied first.
- There are two minimal fixpoints for this program: Big is empty in one, and contains trike in the other (and all other parts are in Small in both fixpoints). The sketch below makes the order dependence concrete.
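The order dependence can be checked with a small Python sketch; the Assembly instance and the iteration bound are illustrative.

assembly = {('trike', 'wheel', 3), ('wheel', 'spoke', 2)}
parts = {p for (p, s, q) in assembly}

def eval_big_small(big_first):
    big, small = set(), set()
    for _ in range(len(parts) + 1):   # enough iterations to stabilize here
        if big_first:
            big |= {p for (p, s, q) in assembly if q > 2 and p not in small}
            small |= {p for (p, s, q) in assembly if p not in big}
        else:
            small |= {p for (p, s, q) in assembly if p not in big}
            big |= {p for (p, s, q) in assembly if q > 2 and p not in small}
    return big, small

print(eval_big_small(True))    # ({'trike'}, {'wheel'})
print(eval_big_small(False))   # (set(), {'trike', 'wheel'})

Applying Rule 1 first puts trike in Big; applying Rule 2 first puts it in Small, giving the two minimal fixpoints described above.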

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

- T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
- Stratified program: if T depends on not S, then S cannot depend on T (or not T).
- If a program is stratified, the tables in the program can be partitioned into strata:
  - Stratum 0: all database tables.
  - Stratum I: tables defined in terms of tables in Stratum I and lower strata.

- If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection: Result(Y) :- R(X, Y), X = c.
Projection: Result(Y) :- R(X, Y).
Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union: Result(X, Y) :- R(X, Y). Result(X, Y) :- S(X, Y).

The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH
Big2 (Part) AS
(SELECT A1.Part FROM Assembly A1 WHERE Qty > 2),
Small2 (Part) AS
((SELECT A2.Part FROM Assembly A2)
EXCEPT
(SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations
SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).
- The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
- In order to apply such a rule, we must have all of the Assembly relation available.
- Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to negation.

15.4 Efficient evaluation of recursive queries
- Repeated inferences: when recursive rules are repeatedly applied in the naive way, we make the same inferences in several iterations.
- Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
- Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
- For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
- Rewrite the program to use the delta tables, and update the delta tables between iterations; a sketch follows, and the rewritten Comp rule is stated after it.
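A minimal Python sketch of seminaive evaluation of the Comp program on a toy Assembly instance (the data is illustrative); replacing delta with the full comp table in the join would give naive evaluation, which re-derives old facts every round.

assembly = {('trike', 'wheel', 3), ('trike', 'frame', 1),
            ('wheel', 'spoke', 2), ('wheel', 'tire', 1)}

def seminaive_comp(assembly):
    comp = {(p, s) for (p, s, q) in assembly}   # base rule: direct subparts
    delta = set(comp)                           # tuples new in last iteration
    while delta:
        # Recursive rule, joining only against last iteration's facts:
        new = {(p, s) for (p, p2, q) in assembly
                      for (d1, s) in delta if p2 == d1} - comp
        comp |= new
        delta = new
    return comp

print(sorted(seminaive_comp(assembly)))
# trike reaches spoke and tire transitively through wheel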


Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences
SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).
- There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

- Suppose that we want to find all SameLev tuples with spoke in the first column. We should `push' this selection into the fixpoint computation to avoid unnecessary inferences.
- But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:
SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
- Idea: define a `filter' table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.
Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
1. Add `Magic' filters: modify each rule in the program by adding a `Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
2. Define the `Magic' relations: we must create new rules to define the `Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R. A small evaluation sketch on a toy instance follows.
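A compact Python sketch of the rewritten program's evaluation on a toy Assembly instance (data illustrative) shows the filter at work:

assembly = {('trike', 'wheel', 3), ('trike', 'frame', 1),
            ('wheel', 'spoke', 2), ('wheel', 'tire', 1),
            ('frame', 'seat', 1), ('frame', 'pedal', 1)}

magic = {'spoke'}        # Magic_SL (spoke).
while True:              # Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
    new = {p for (p, s, q) in assembly if s in magic} - magic
    if not new:
        break
    magic |= new
print(magic)             # {'spoke', 'wheel', 'trike'}: the only relevant values

# SameLev rules, each filtered by Magic_SL on the first column:
samelev = {(s1, s2) for (p1, s1, q1) in assembly
                    for (p2, s2, q2) in assembly
                    if p1 == p2 and s1 in magic}
while True:
    new = {(s1, s2) for (p1, s1, q1) in assembly
                    for (m1, m2) in samelev
                    for (p2, s2, q2) in assembly
                    if p1 == m1 and m2 == p2 and s1 in magic} - samelev
    if not new:
        break
    samelev |= new
print(sorted(t for t in samelev if t[0] == 'spoke'))
# Only tuples whose first column is a relevant value are ever inferred.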

70

Page 62: Database Technology

seamless roaming throughout a geographical region Some wireless networks such as WiFi and Bluetooth use unlicensed areas of the

frequency spectrum which may cause interference with other appliances such as cordless telephones

Modern wireless networks can transfer data in units called packets that are used in wired networks in order to conserve bandwidth

ClientNetwork Relationships ndash Mobile units can move freely in a geographic mobility domain an area that is

circumscribed by wireless network coverage To manage entire mobility domain is divided into one or more smaller

domains called cells each of which is supported by at least one base station

Mobile units be unrestricted throughout the cells of domain while maintaining information access contiguity

The communication architecture described earlier is designed to give the mobile unit the impression that it is attached to a fixed network emulating a traditional client-server architecture

Wireless communications however make other architectures possible One alternative is a mobile ad-hoc network (MANET) illustrated in 292

In a MANET co-located mobile units do not need to communicate via a fixed network but instead form their own using cost-effective technologies such as Bluetooth

In a MANET mobile units are responsible for routing their own data effectively acting as base stations as well as clients

Moreover they must be robust enough to handle changes in the network topology such as the arrival or departure of other mobile units

MANET applications can be considered as peer-to-peer meaning that a mobile unit is simultaneously a client and a server

Transaction processing and data consistency control become more difficult since there is no central control in this architecture

Resource discovery and data routing by mobile units make computing in a MANET even more complicated

Sample MANET applications are multi-user games shared whiteboard distributed calendars and battle information sharing

Characteristics of Mobile Environments The characteristics of mobile computing include

Communication latency Intermittent connectivity Limited battery life Changing client location

The server may not be able to reach a client

62

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 63: Database Technology

A client may be unreachable because it is dozing ndash in an energy-conserving state in which many subsystems are shut down ndash or because it is out of range of a base station

In either case neither client nor server can reach the other and modifications must be made to the architecture in order to compensate for this case

Proxies for unreachable components are added to the architecture For a client (and symmetrically for a server) the proxy can cache updates

intended for the server Mobile computing poses challenges for servers as well as clients

The latency involved in wireless communication makes scalability a problem Since latency due to wireless communications increases the time to service

each client request the server can handle fewer clients One way servers relieve this problem is by broadcasting data whenever possible A server can simply broadcast data periodically Broadcast also reduces the load on the server as clients do not have to maintain

active connections to it Client mobility also poses many data management challenges

Servers must keep track of client locations in order to efficiently route messages to them

Client data should be stored in the network location that minimizes the traffic necessary to access it

The act of moving between cells must be transparent to the client The server must be able to gracefully divert the shipment of data from one base to

another without the client noticing Client mobility also allows new applications that are location-based

Data Management Issues From a data management standpoint mobile computing may be considered a variation of

distributed computing Mobile databases can be distributed under two possible scenarios The entire database is distributed mainly among the wired components possibly

with full or partial replication A base station or fixed host manages its own database with a DBMS-like

functionality with additional functionality for locating mobile units and additional query and transaction management features to meet the requirements of mobile environments

The database is distributed among wired and wireless components Data management responsibility is shared among base stations or fixed

hosts and mobile units Data management issues as it is applied to mobile databases

Data distribution and replication Transactions models Query processing Recovery and fault tolerance

63

Mobile database design Location-based service Division of labor Security

Application Intermittently Synchronized Databases Whenever clients connect ndash through a process known in industry as synchronization of a

client with a server ndash they receive a batch of updates to be installed on their local database

The primary characteristic of this scenario is that the clients are mostly disconnected the server is not necessarily able reach them

This environment has problems similar to those in distributed and client-server databases and some from mobile databases

This environment is referred to as Intermittently Synchronized Database Environment (ISDBE)

The characteristics of Intermittently Synchronized Databases (ISDBs) make them distinct from the mobile databases are

A client connects to the server when it wants to exchange updates The communication can be unicast ndashone-on-one communication between

the server and the clientndash or multicastndash one sender or server may periodically communicate to a set of receivers or update a group of clients

A server cannot connect to a client at will The characteristics of ISDBs (contd)

Issues of wireless versus wired client connections and power conservation are generally immaterial

A client is free to manage its own data and transactions while it is disconnected It can also perform its own recovery to some extent

A client has multiple ways connecting to a server and in case of many servers may choose a particular server to connect to based on proximity communication nodes available resources available etc

(b)(ii) Discuss optimization and research issues (8) (NOVDEC 2010)

Optimization Query optimization is an important task in a relational DBMS Must understand optimization in order to understand the performance impact of a given

database design (relations indexes) on a workload (set of queries) Two parts to optimizing a query

1048707 Consider a set of alternative plans bull Must prune search space typically left-deep plans only1048707 Must estimate cost of each plan that is considered bull Must estimate size of result and cost for each plan node

bull Key issues Statistics indexes operator implementations Plan Tree of RA ops with choice of alg for each op

64

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

15.3.1 Range-Restriction and Negation
If rules are allowed to contain not in the body, the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe. If a relation appears in the body of a rule preceded by not, we call this a negated occurrence. Relation occurrences in the body that are not negated are called positive occurrences. A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body.

15.3.2 Stratification

 T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
 Stratified program: if T depends on not S, then S cannot depend on T (or not T).
 If a program is stratified, the tables in the program can be partitioned into strata:


 Stratum 0: all database tables.
 Stratum I: tables defined in terms of tables in Stratum I and lower strata.
 If T depends on not S, then S is in a lower stratum than T. (A small stratum-assignment sketch follows.)
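Strata can be assigned mechanically from the rule dependencies. The Python sketch below is ours (the rule encoding and the bound on stratum values are simplifications, not a standard algorithm): it raises a predicate's stratum past each negated dependency and reports failure when no assignment can exist, as for the Big/Small program above.

# Sketch (ours): assign strata, or detect that the program is not stratifiable.
def stratify(rules):
    # rules: head predicate -> list of (body predicate, is_negated)
    preds = set(rules) | {p for body in rules.values() for (p, _) in body}
    stratum = dict.fromkeys(preds, 0)
    changed = True
    while changed:
        changed = False
        for head, body in rules.items():
            for pred, neg in body:
                need = stratum[pred] + (1 if neg else 0)
                if stratum[head] < need:
                    if need > len(preds):  # strata indices can never exceed #predicates
                        return None        # => not stratifiable
                    stratum[head] = need
                    changed = True
    return stratum

print(stratify({"Big": [("Assembly", False), ("Small", True)],
                "Small": [("Assembly", False), ("Big", True)]}))   # None
print(stratify({"Comp": [("Assembly", False), ("Comp", False)]}))  # e.g. {'Comp': 0, 'Assembly': 0}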

Relational Algebra and Stratified Datalog
 Selection: Result(Y) :- R(X, Y), X = c.
 Projection: Result(Y) :- R(X, Y).
 Cross-product: Result(X, Y, U, V) :- R(X, Y), S(U, V).
 Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
 Union: Result(X, Y) :- R(X, Y).
   Result(X, Y) :- S(X, Y).

The stratified BigSmall program is shown below in SQL:1999 notation, with a final additional selection on Big2:

WITH Big2(Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2),
  Small2(Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM(A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts(Part, SUM(<Qty>)) :- Assembly(Part, Subpt, Qty).

 The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
 In order to apply such a rule, we must have all of the Assembly relation available.
 Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem, similar to negation.

15.4 Efficient Evaluation of Recursive Queries
 Repeated inferences: when recursive rules are repeatedly applied in the naïve way, we make the same inferences in several iterations.
 Unnecessary inferences: also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
 Seminaive fixpoint evaluation: avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration (which means this inference could not have been carried out in earlier iterations).
 For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
 Rewrite the program to use the delta tables, and update the delta tables between iterations. For Comp, the rewritten recursive rule is shown below, followed by a sketch of the evaluation:

Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
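As a concrete illustration of the rewriting, here is a small Python sketch (ours; the three-row Assembly instance is invented) of seminaive evaluation of Comp: each round joins Assembly only against delta, the tuples produced in the previous round.

# Sketch (ours) of seminaive evaluation for Comp using a delta table.
assembly = {("trike", "wheel", 3), ("wheel", "spoke", 2), ("spoke", "nipple", 1)}
comp = {(p, s) for (p, s, _) in assembly}  # iteration 0: the base rule
delta = set(comp)                          # delta_Comp: tuples new in the last round
while delta:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    new = {(p, b) for (p, mid, _) in assembly
                  for (a, b) in delta if a == mid}
    new -= comp                            # keep only genuinely new tuples
    comp |= new
    delta = new                            # only last round's tuples feed the next join
print(sorted(comp))
# [('spoke', 'nipple'), ('trike', 'nipple'), ('trike', 'spoke'),
#  ('trike', 'wheel'), ('wheel', 'nipple'), ('wheel', 'spoke')]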

15.4.2 Pushing Selections to Avoid Irrelevant Inferences

SameLev(S1, S2) :- Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

 There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2, with the same number of up and down edges.

 Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
 But we can't just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

SameLev(spoke, seat) :- Assembly(wheel, spoke, 2), SameLev(wheel, frame), Assembly(frame, seat, 1).

The Magic Sets Algorithm
 Idea: define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column:

Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
Magic_SL(spoke).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), Assembly(P1, S2, Q2).
SameLev(S1, S2) :- Magic_SL(S1), Assembly(P1, S1, Q1), SameLev(P1, P2), Assembly(P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
 Add 'Magic' filters: modify each rule in the program by adding a 'Magic' condition to the body that acts as a filter on the set of tuples generated by this rule.
 Define the 'Magic' relations: we must create new rules to define the 'Magic' relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R. (A sketch of the rewritten evaluation follows.)
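Here is a small Python sketch (ours; the Assembly instance is invented) of the rewritten program: first the Magic_SL filter is computed from the seed spoke, then SameLev is evaluated only for tuples whose first column is in the filter.

# Sketch (ours) of the Magic Sets idea for the SameLev program.
assembly = {("trike", "wheel", 3), ("trike", "frame", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1),
            ("frame", "seat", 1), ("frame", "pedal", 1)}

# Step 1: Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).  Seed: Magic_SL(spoke).
magic = {"spoke"}
changed = True
while changed:
    changed = False
    for (p1, s1, _) in assembly:
        if s1 in magic and p1 not in magic:
            magic.add(p1)
            changed = True
print(sorted(magic))  # ['spoke', 'trike', 'wheel']: only spoke's ancestors are relevant

# Step 2: evaluate SameLev, restricted to first columns in the magic filter.
same_lev = {(s1, s2) for (p, s1, _) in assembly if s1 in magic
                     for (p2, s2, _) in assembly if p2 == p}
changed = True
while changed:
    changed = False
    for (p1, s1, _) in assembly:
        if s1 not in magic:
            continue
        for (a, b) in list(same_lev):
            if a != p1:
                continue
            for (p2, s2, _) in assembly:
                if p2 == b and (s1, s2) not in same_lev:
                    same_lev.add((s1, s2))
                    changed = True
print(sorted(t for t in same_lev if t[0] == "spoke"))
# [('spoke', 'pedal'), ('spoke', 'seat'), ('spoke', 'spoke'), ('spoke', 'tire')]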


Page 64: Database Technology

 Mobile database design
 Location-based service
 Division of labor
 Security

Application: Intermittently Synchronized Databases
 Whenever clients connect – through a process known in industry as synchronization of a client with a server – they receive a batch of updates to be installed on their local database.
 The primary characteristic of this scenario is that the clients are mostly disconnected; the server is not necessarily able to reach them.

 This environment has problems similar to those in distributed and client-server databases, and some from mobile databases.
 This environment is referred to as an Intermittently Synchronized Database Environment (ISDBE).
 The characteristics of Intermittently Synchronized Databases (ISDBs) that make them distinct from mobile databases are:

 A client connects to the server when it wants to exchange updates.
 The communication can be unicast – one-on-one communication between the server and the client – or multicast – one sender or server may periodically communicate to a set of receivers or update a group of clients.
 A server cannot connect to a client at will.

 Issues of wireless versus wired client connections and power conservation are generally immaterial.
 A client is free to manage its own data and transactions while it is disconnected. It can also perform its own recovery to some extent.
 A client has multiple ways of connecting to a server and, in the case of many servers, may choose a particular server to connect to based on proximity, communication nodes available, resources available, etc.

(b) (ii) Discuss optimization and research issues. (8) (NOV/DEC 2010)

Optimization
Query optimization is an important task in a relational DBMS. One must understand optimization in order to understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).

Two parts to optimizing a query:
 Consider a set of alternative plans.
• Must prune the search space; typically, only left-deep plans are considered.
 Must estimate the cost of each plan that is considered.
• Must estimate the size of the result and the cost for each plan node.
• Key issues: statistics, indexes, operator implementations.

Plan: a tree of relational algebra operators, with a choice of algorithm for each operator.


 Each operator is typically implemented using a 'pull' interface: when an operator is 'pulled' for the next output tuples, it 'pulls' on its inputs and computes them.

Two main issues:
 For a given query, what plans are considered?
• Algorithm to search the plan space for the cheapest (estimated) plan.
 How is the cost of a plan estimated?

Ideally: we want to find the best plan. Practically: avoid the worst plans. We will study the System R approach.

Schema for Examples
Sailors(sid: integer, sname: string, rating: integer, age: real)
Reserves(sid: integer, bid: integer, day: dates, rname: string)

Similar to the old schema; rname is added for variations.
Reserves:
 Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
 Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Query Blocks: Units of Optimization

 An SQL query is parsed into a collection of query blocks, and these are optimized one block at a time.
 Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.
 For each block, the plans considered are:
• All available access methods for each relation in the FROM clause.
• All left-deep join trees (i.e., all ways to join the relations one-at-a-time, with the inner relation in the FROM clause, considering all relation permutations and join methods); see the sketch below.
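A toy Python sketch (ours; the Boats page count and the cost and result-size formulas are invented simplifications) of what "consider all left-deep trees and estimate each plan's cost" amounts to.

# Sketch (ours): enumerate left-deep join orders and pick the cheapest estimate.
from itertools import permutations

# Page counts: Reserves and Sailors as above; Boats is an invented third relation.
pages = {"Reserves": 1000, "Sailors": 500, "Boats": 50}

def nlj_cost(outer_pages, inner_pages):
    # Toy cost of a simple nested-loops join, in page I/Os.
    return outer_pages + outer_pages * inner_pages

def left_deep_cost(order):
    cost, result_pages = 0, pages[order[0]]
    for rel in order[1:]:
        cost += nlj_cost(result_pages, pages[rel])
        result_pages = max(1, result_pages // 10)  # crude result-size estimate
    return cost

best = min(permutations(pages), key=left_deep_cost)
print(best, left_deep_cost(best))  # the cheapest (estimated) left-deep order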

8. (a) Discuss multimedia databases in detail. (8) (NOV/DEC 2010)

Multimedia Databases

 To provide such database functions as indexing and consistency, it is desirable to store multimedia data in a database, rather than storing it outside the database in a file system.
 The database must handle large object representation.
 Similarity-based retrieval must be provided by special index structures.
 Must provide guaranteed steady retrieval rates for continuous-media data.

Multimedia Data Formats
 Store and transmit multimedia data in compressed form.
 JPEG and GIF: the most widely used formats for image data.
 MPEG: standard for video data; uses commonalities among a sequence of frames to achieve a greater degree of compression.
 MPEG-1: quality comparable to VHS video tape; stores a minute of 30-frame-per-second video and audio in approximately 12.5 MB.
 MPEG-2: designed for digital broadcast systems and digital video disks; negligible loss of video quality; compresses 1 minute of audio-video to approximately 17 MB.
 Several alternatives for audio encoding:


• MPEG-1 Layer 3 (MP3), RealAudio, WindowsMedia format, etc.

Continuous-Media Data

 The most important types are video and audio data.
 Characterized by high data volumes and real-time information-delivery requirements:
• Data must be delivered sufficiently fast that there are no gaps in the audio or video.
• Data must be delivered at a rate that does not cause overflow of system buffers.
• Synchronization among distinct data streams must be maintained, e.g., video of a person speaking must show lips moving synchronously with the audio.

Video Servers
 Video-on-demand systems deliver video from central video servers, across a network, to terminals.
• Must guarantee end-to-end delivery rates.
 Current video-on-demand servers are based on file systems; existing database systems do not meet real-time response requirements.
 Multimedia data are stored on several disks (RAID configuration), or on tertiary storage for less frequently accessed data.

 Head-end terminals – used to view multimedia data.
• PCs, or TVs attached to a small, inexpensive computer called a set-top box.

Similarity-Based Retrieval
Examples of similarity-based retrieval:
 Pictorial data: two pictures or images that are slightly different as represented in the database may be considered the same by a user, e.g., identify similar designs for registering a new trademark.
 Audio data: speech-based user interfaces allow the user to give a command or identify a data item by speaking, e.g., test user input against stored commands.
 Handwritten data: identify a handwritten data item or command stored in the database.

(b) Explain the features of active and deductive databases in detail. (8) (NOV/DEC 2010)

15. Deductive Databases
 SQL-92 cannot express some queries:
• Are we running low on any parts needed to build a ZX600 sports car?
• What is the total component and assembly cost to build a ZX600 at today's part prices?
 Can we extend the query language to cover such queries? Yes, by adding recursion.

Recursion in SQL: The concepts discussed in this chapter are not included in the SQL-92 standard. However, the revised version of the SQL standard, SQL:1999, includes support for recursive queries, and IBM's DB2 system already supports recursive queries as required in SQL:1999.

15.1 Introduction to Recursive Queries


15.1.1 Datalog
 SQL queries can be read as follows: "If some tuples exist in the FROM tables that satisfy the WHERE conditions, then the SELECT tuple is in the answer."
 Datalog is a query language that has the same if-then flavor.
• New: the answer table can appear in the FROM clause, i.e., be defined recursively.
• Prolog-style syntax is commonly used.

The Problem with RA and SQL-92
 Intuitively, we must join Assembly with itself to deduce that trike contains spoke and tire.
• This takes us one level down the Assembly hierarchy.
• To find components that are one level deeper (e.g., rim), we need another join.
• To find all components, we need as many joins as there are levels in the given instance (see the sketch below).
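A minimal Python sketch (ours; the Assembly instance is invented) of the recursive alternative: compute Comp as a least fixpoint, which handles any number of levels, whereas a fixed number of joins handles only a fixed depth.

# Sketch (ours): naive least-fixpoint evaluation of Comp.
assembly = {("trike", "wheel", 3), ("wheel", "rim", 1),
            ("wheel", "spoke", 2), ("wheel", "tire", 1)}

# Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
# Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), Comp(Part2, Subpt).
comp = {(p, s) for (p, s, _) in assembly}
changed = True
while changed:                  # apply the recursive rule until a fixpoint
    changed = False
    for (p, mid, _) in assembly:
        for (a, b) in list(comp):
            if a == mid and (p, b) not in comp:
                comp.add((p, b))
                changed = True
print(sorted(comp))
# [('trike', 'rim'), ('trike', 'spoke'), ('trike', 'tire'), ('trike', 'wheel'),
#  ('wheel', 'rim'), ('wheel', 'spoke'), ('wheel', 'tire')]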

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 65: Database Technology

1048707 Each operator typically implemented using a `pullrsquo interface when an operator is `pulledrsquo for the next output tuples it `pullsrsquo on its inputs and computes them

Two main issues1048707 For a given query what plans are considered bull Algorithm to search plan space for cheapest (estimated) plan1048707 How is the cost of a plan estimated

Ideally Want to find best plan Practically Avoid worst plans We will study the System R approach

Schema for ExamplesSailors (sid integer sname string rating integer age real)Reserves (sid integer bid integer day dates rname string)

Similar to old schema rname added for variations Reserves

1048707 Each tuple is 40 bytes long 100 tuples per page 1000 pages Sailors

1048707 Each tuple is 50 bytes long 80 tuples per page 500 pagesQuery Blocks Units of Optimization

An SQL query is parsed into a collection of query blocks and these are optimized one block at a time

Nested blocks are usually treated as calls to a subroutine made once per outer tuple For each block the plans considered are

1048707 All available access methods for each reln in FROM clause1048707 All left-deep join trees (ie all ways to join the relations oneat-a-time with the inner reln in the FROM clause considering all reln permutations and join methods)

8 (a)Discuss multimedia databases in detail (8) (NOVDEC 2010)Multimedia databases

To provide such database functions as indexing and consistency it is desirable to store multimedia data in a database 1048707 Rather than storing them outside the database in a file system

The database must handle large object representation Similarity-based retrieval must be provided by special index structures Must provide guaranteed steady retrieval rates for continuous-media data

Multimedia Data Formats Store and transmit multimedia data in compressed form

1048707 JPEG and GIF the most widely used formats for image data 1048707 MPEG standard for video data use commonalties among a sequence of frames to achieve a greater degree of compression

MPEG-1 quality comparable to VHS video tape 1048707 Stores a minute of 30-frame-per-second video and audio in approximately 125 MB

MPEG-2 designed for digital broadcast systems and digital video disks negligible loss of video quality 1048707 Compresses 1 minute of audio-video to approximately 17 MB

Several alternatives of audio encoding

65

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 66: Database Technology

1048707 MPEG-1 Layer 3 (MP3) RealAudio WindowsMedia format etcContinuous-Media Data

Most important types are video and audio data Characterized by high data volumes and real-time information-delivery requirements

1048707 Data must be delivered sufficiently fast that there are no gaps in the audio or video 1048707 Data must be delivered at a rate that does not cause overflow of system buffers 1048707 Synchronization among distinct data streams must be maintained

video of a person speaking must show lips moving synchronously with the audio

Video Servers Video-on-demand systems deliver video from central video servers across a network to

terminals 1048707 must guarantee end-to-end delivery rates

Current video-on-demand servers are based on file systems existing database systems do not meet real-time response requirements

Multimedia data are stored on several disks (RAID configuration) or on tertiary storage for less frequently accessed data

Head-end terminals - used to view multimedia data 1048707 PCs or TVs attached to a small inexpensive computer called a set-top boxSimilarity-Based RetrievalExamples of similarity based retrieval

Pictorial data Two pictures or images that are slightly different as represented in the database may be considered the same by a user 1048707 eg identify similar designs for registering a new trademark

Audio data Speech-based user interfaces allow the user to give a command or identify a data item by speaking 1048707 eg test user input against stored commands

Handwritten data Identify a handwritten data item or command stored in the database

(b) Explain the features of active and deductive databases in detail (8)(NOVDEC 2010)

15 Deductive Databases SQL-92 cannot express some queries 1048707 Are we running low on any parts needed to build a ZX600 sports car 1048707 What is the total component and assembly cost to build a ZX600 at todays part prices Can we extend the query language to cover such queries 1048707 Yes by adding recursion Recursion in SQL The concepts discussed in this chapter are not included in the SQL-

92 standard However the revised version of the SQL standard SQL1999 includes support for recursive queries and IBMs DB2 system already supports recursive queries as required in SQL1999

151 Introduction to recursive queries

66

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 67: Database Technology

1511 Datalog SQL queries can be read as follows ldquoIf some tuples exist in the From tables that satisfy

the Where conditions then the Select tuple is in the answerrdquo Datalog is a query language that has the same if-then flavor

1048707 New The answer table can appear in the From clause ie be defined recursively1048707 Prolog style syntax is commonly used

The Problem with RA and SQL-92 Intuitively we must join Assembly with itself to deduce that trike contains spoke and tire

1048707 takes us one level down Assembly hierarchy1048707 To find components that are one level deeper (eg rim) need another join1048707 To find all components need as many joins as there are levels in the given instance

For any relational algebra expression we can create an Assembly instance for which some answers are not computed by including more levels than the number of joins in the expression

152 Theoretical Foundations The first approach to defining what a Datalog program means is called the least model

semantics and gives users a way to understand the program without thinking about how the program is to be executed That is the semantics is declarative like the semantics of relational calculus and not operational like relational algebra semantics

The second approach called the least fixpoint semantics gives a conceptual evaluationstrategy to compute the desired relation instances This serves as the basis for recursivequery evaluation in a DBMS

The fixpoint semantics is thus operational and plays a role analogous to that of relational algebra semantics for nonrecursive queries

i Least Model Semantics1048707 The least fixpoint of a function f is a fixpoint v of f such that every other fixpoint of f is smaller than or equal to v1048707 In general there may be no least fixpoint (we could have two minimal fixpoints neither of which is smaller than the other)1048707 If we think of a Datalog program as a function that is applied to a set of tuples and returns another set of tuples this function (fortunately) always has a least fixpoint

ii Safe Datalog Programs

67

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

1048707 T depends on S if some rule with T in the head contains S or (recursively) some predicate that depends on S in the body1048707 Stratified program If T depends on not S then S cannot depend on T (or not T)1048707 If a program is stratified the tables in the program can be partitioned into strata

68

1048707 Stratum 0 All database tables 1048707 Stratum I Tables defined in terms of tables in Stratum I and lower strata

1048707 If T depends on not S S is in lower stratum than TRelational Algebra and Stratified DatalogSelection Result(Y)- R(X Y) X=cProjection Result(Y)- R(X Y)Cross-product Result(X Y U V)- R(X Y) S (U V)Set-difference Result(X Y)- R(X Y) not S(UV)Union Result(X Y)- R(X Y)Result(X Y)- S(X Y) The stratified BigSmall program is shown below in SQL 1999 notation with a final additional selection on Big2Big2 (Part) AS(SELECT A1Part FROM Assembly A1 WHERE Qty gt 2)Small2 (Part) AS((SELECT A2Part FROM Assembly A2)EXCEPT(SELECT B1Part from Big2 B1))SELECT FROM Big2 B21533 Aggregate OperationsSELECT A Part SUM (AQty)FROM Assembly AGROUP BY A PartNumParts (Part SUM (ltQtygt))- Assembly (Part Subpt Qty)1048707 The lt hellip gt notation in the head indicates grouping the remaining arguments are the GROUP BY fields1048707 In order to apply such a rule must have all of Assembly relation available1048707 Stratification with respect to use of lt hellip gt is the usual restriction to deal with this problemsimilar to negation154 Efficient evaluation of recursive queries1048707 Repeated inferences When recursive rules are repeatedly applied in the naiumlve way we make the same inferences in several iterations1048707 Unnecessary inferences Also if we just want to find the components of a particular part say wheel computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful in that we compute many irrelevant facts1541 Fixpoint Evaluation without Repeated InferencesAvoiding Repeated Interferences1048707 Seminaive Fixpoint Evaluation Avoid repeated inferences by ensuring that when a rule is applied at least one of the body facts was generated in the most recent iteration (Which means this inference could not have been carried out in earlier iterations) 1048707 for each recursive table P use a table delta_P to store the P tuples generated in the previous iteration 1048707 Rewrite the program to use the delta tables and update the delta tables between iterations

69

Comp (Part Subpt)- Assembly (Part Part2 Qty) delta_Comp (Part2 Subpt)1542 Pushing Selections to Avoid Irrelevant Inferences SameLev (S1 S2)- Assembly (P1 S1 Q1) Assembly (P2 S2 Q2)SameLev (S1S2) - Assembly(P1S1Q1) SameLev(P1P2) Assembly(P2S2Q2)1048707 There is a tuple (S1S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges

1048707 Suppose that we want to find all SameLev tuples with spoke in the first column We should ldquopushrdquo this selection into the fixpoint computation to avoid unnecessary inferences1048707 But we canrsquot just compute SameLev tuples with spoke in the first column because someother SameLev tuples are needed to compute all such tuplesSameLev (spoke seat)- Assembly (wheel spoke 2) SameLev (wheel frame) Assembly (frame seat 1)The Magic Sets Algorithm1048707 Idea Define a ldquofilterrdquo table that computes allrelevant values and restrict the computationof SameLev to infer only tuples with arelevant value in the first columnMagic_SL (P1)- Magic_SL (S1) Assembly (P1 S1 Q1)Magic (spoke)SameLev(S1S2) - Magic_SL(S1) Assembly(P1S1Q1) Assembly(P2S2Q2)SameLev(S1S2)- Magic_SL(S1) Assembly(P1S1Q1)SameLev(P1P2) Assembly(P2S2Q2)The Magic Sets program rewriting algorithm can be summarized as followsAdd `Magic Filters Modify each rule in the program by adding a `Magic condition to the body that acts as a filter on the set of tuples generated by this ruleDefine the `Magic relations We must create new rules to define the `Magic relations Intuitively from each occurrence of an output relation R in the body of a program rule we obtain a rule defining the relation Magic R

70

Page 68: Database Technology

Consider following program Complex Parts (Part)- Assembly (Part Subpart Qty) Qty gt2According to this rule complex part is defined to be any part that has more than two copies of any one subpart For each part mentioned in the Assembly relation we can easily check if it is a complex part Database systems disallow unsafe programs by requiring that every variable in the head of a rule must also appear in the body Such programs are said to be range restricted and every range-restricted Datalog program has a finite least model if theinput relation instances are finite

iii The Fixpoint Operator1048707 Let f be a function that takes values from domain D and returns values from D A value v in D is a fixpoint of f if f(v)=v1048707 Consider the fn double+ which is applied to a set of integers and returns a set of integers (Ie D is the set of all sets of integers) 1048707 Eg double+ (1 2 5) = 2 4 10 Union 1 2 5 1048707 the set of all integers is a fixpoint of double+ 1048707 the set of all even integers is another fixpoint of double+ it is smaller than the first fixpoint

iv Least Model = Least FixpointFurther every Datalog program is guaranteed to have a least model and the least model is equal to the least fixpoint of the program These results provide the basis for Datalog query processing Users can understand a program in terms of `If the body is true the head is also true thanks to the least model semantics The DBMS can compute the answer by repeatedly applying the program rules thanks to the least fixpoint semantics and the fact that the least model and the least fixpoint are identicalb Recursive queries with negationBig(Part) - Assembly(Part Subpt Qty) Qty gt2 not Small(Part)Small(Part) - Assembly(Part Subpt Qty) not Big(Part)1048707 If rules contain not there may not be a least fixpoint Consider the Assembly instance trike is the only part that has 3 or more copies of some subpart Intuitively it should be in Big and it will be if we apply Rule 1 first 1048707 But we have Small(trike) if Rule 2 is applied first 1048707 There are two minimal fixpoints for this program Big is empty in one and contains trike in the other (and all other parts are in Small in both fixpoints)

1531 Range-Restriction and NegationIf rules are allowed to contain not in the body the definition of range-restriction must be extended in order to ensure that all range-restricted programs are safe If a relation appears in the body of a rule preceded by not we call this a negated occurrence Relation occurrences in the body that are not negated are called positive occurrences A program is range restricted if every variable in the head of the rule appears in some positive relation occurrence in the body1532 Stratification

• T depends on S if some rule with T in the head contains S, or (recursively) some predicate that depends on S, in the body.
• Stratified program: if T depends on not S, then S cannot depend on T (or not T).
• If a program is stratified, the tables in the program can be partitioned into strata:

• Stratum 0: All database tables.
• Stratum I: Tables defined in terms of tables in Stratum I and lower strata.

• If T depends on not S, S is in a lower stratum than T.

Relational Algebra and Stratified Datalog
Selection:      Result(Y) :- R(X, Y), X = c.
Projection:     Result(Y) :- R(X, Y).
Cross-product:  Result(X, Y, U, V) :- R(X, Y), S(U, V).
Set-difference: Result(X, Y) :- R(X, Y), not S(X, Y).
Union:          Result(X, Y) :- R(X, Y).
                Result(X, Y) :- S(X, Y).

The stratified Big/Small program is shown below in SQL:1999 notation, with a final additional selection on Big2:

Big2 (Part) AS
    (SELECT A1.Part FROM Assembly A1 WHERE A1.Qty > 2)
Small2 (Part) AS
    ((SELECT A2.Part FROM Assembly A2)
     EXCEPT
     (SELECT B1.Part FROM Big2 B1))
SELECT * FROM Big2 B2

15.3.3 Aggregate Operations

SELECT A.Part, SUM (A.Qty)
FROM Assembly A
GROUP BY A.Part

NumParts (Part, SUM (<Qty>)) :- Assembly (Part, Subpt, Qty).

• The < ... > notation in the head indicates grouping; the remaining arguments are the GROUP BY fields.
• In order to apply such a rule, we must have all of the Assembly relation available.
• Stratification with respect to the use of < ... > is the usual restriction used to deal with this problem; it is similar to the restriction on negation.

15.4 Efficient Evaluation of Recursive Queries
• Repeated inferences: When recursive rules are applied repeatedly in the naive way, we make the same inferences in several iterations.
• Unnecessary inferences: Also, if we just want to find the components of a particular part, say wheel, computing the fixpoint of the Comp program and then selecting tuples with wheel in the first column is wasteful, in that we compute many irrelevant facts.

15.4.1 Fixpoint Evaluation without Repeated Inferences
Avoiding repeated inferences:
• Seminaive fixpoint evaluation: Avoid repeated inferences by ensuring that when a rule is applied, at least one of the body facts was generated in the most recent iteration. (This means the inference could not have been carried out in earlier iterations.)
• For each recursive table P, use a table delta_P to store the P tuples generated in the previous iteration.
• Rewrite the program to use the delta tables, and update the delta tables between iterations.

The recursive rule of the Comp program is rewritten to use the delta table:

Comp (Part, Subpt) :- Assembly (Part, Part2, Qty), delta_Comp (Part2, Subpt).

15.4.2 Pushing Selections to Avoid Irrelevant Inferences

SameLev (S1, S2) :- Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).

• There is a tuple (S1, S2) in SameLev if there is a path up from S1 to some node and down to S2 with the same number of up and down edges.
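A minimal Python sketch of seminaive evaluation for Comp, over the same hypothetical Assembly instance as in the earlier sketches; delta_comp holds only the tuples produced in the previous iteration, so each join uses at least one new fact:

# Hypothetical Assembly instance: (part, subpart, qty).
assembly = {
    ("trike", "wheel", 3),
    ("trike", "frame", 1),
    ("wheel", "spoke", 2),
    ("frame", "seat", 1),
}

# Base rule: Comp(Part, Subpt) :- Assembly(Part, Subpt, Qty).
comp = {(p, s) for (p, s, q) in assembly}
delta_comp = set(comp)  # tuples generated in the previous iteration

while delta_comp:
    # Recursive rule, restricted to join with the delta table:
    # Comp(Part, Subpt) :- Assembly(Part, Part2, Qty), delta_Comp(Part2, Subpt).
    new = {(p, s)
           for (p, p2, q) in assembly
           for (p2b, s) in delta_comp if p2 == p2b}
    delta_comp = new - comp   # keep only genuinely new tuples
    comp |= delta_comp

print(sorted(comp))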

• Suppose that we want to find all SameLev tuples with spoke in the first column. We should "push" this selection into the fixpoint computation to avoid unnecessary inferences.
• But we cannot just compute SameLev tuples with spoke in the first column, because some other SameLev tuples are needed to compute all such tuples:

SameLev (spoke, seat) :- Assembly (wheel, spoke, 2), SameLev (wheel, frame), Assembly (frame, seat, 1).

The Magic Sets Algorithm
• Idea: Define a "filter" table that computes all relevant values, and restrict the computation of SameLev to infer only tuples with a relevant value in the first column.

Magic_SL (P1) :- Magic_SL (S1), Assembly (P1, S1, Q1).
Magic_SL (spoke).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), Assembly (P1, S2, Q2).
SameLev (S1, S2) :- Magic_SL (S1), Assembly (P1, S1, Q1), SameLev (P1, P2), Assembly (P2, S2, Q2).

The Magic Sets program rewriting algorithm can be summarized as follows:
• Add "Magic" filters: Modify each rule in the program by adding a "Magic" condition to the body that acts as a filter on the set of tuples generated by this rule.
• Define the "Magic" relations: We must create new rules to define the "Magic" relations. Intuitively, from each occurrence of an output relation R in the body of a program rule, we obtain a rule defining the relation Magic_R.
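A minimal Python sketch of the rewritten program, again over the hypothetical Assembly instance used earlier. The magic set (spoke and its ancestors) is computed first; SameLev inferences are then restricted to first columns in that set:

# Hypothetical Assembly instance: (part, subpart, qty).
assembly = {
    ("trike", "wheel", 3),
    ("trike", "frame", 1),
    ("wheel", "spoke", 2),
    ("frame", "seat", 1),
}

# Magic_SL(spoke).  Magic_SL(P1) :- Magic_SL(S1), Assembly(P1, S1, Q1).
magic = {"spoke"}
while True:
    new = {p for (p, s, q) in assembly if s in magic}
    if new <= magic:
        break
    magic |= new
# magic == {'spoke', 'wheel', 'trike'}

# SameLev, filtered so the first column is always in the magic set.
same_lev = set()
while True:
    new = {(s1, s2)
           for (p1, s1, q1) in assembly if s1 in magic
           for (p1b, s2, q2) in assembly if p1b == p1}
    new |= {(s1, s2)
            for (p1, s1, q1) in assembly if s1 in magic
            for (pa, pb) in same_lev if pa == p1
            for (p2, s2, q2) in assembly if p2 == pb}
    if new <= same_lev:
        break
    same_lev |= new

print(sorted(t for t in same_lev if t[0] == "spoke"))

Only tuples whose first column lies in the magic set are ever inferred, so irrelevant facts such as SameLev(seat, seat) are never computed, yet SameLev(spoke, seat) is still derived via SameLev(wheel, frame).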
