47
Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Embed Size (px)

Citation preview

Page 1: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Presented by: Chaitanya Sambhara

- Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Page 2: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Index

• Why this topic• Introduction• What is a probabilistic database• Why probabilistic databases/ the need• Where did this concept come from and gains• Uncertainties and their types• Dependencies and their types• Constraints• Drawbacks• Current research and challenges

Page 3: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Why this topic ?

• The traditional databases are deterministic.

• Every item is either in the database or is not, 0 or 1.

• But modern enterprise applications need to deal with lots of imprecise data: different representation of the same object in multiple data sources.

• Imprecise data is best modeled as probabilistic data, and managed by a probabilistic database system.

Page 4: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li
Page 5: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Relational Databases

• Usually a relational database (RDB) is defined a finite collection of relations where each relation is a subset of cartesian product of sets referred to as domains.

• Where as a probabilistic databases provide means of representing types of information that cannot be captured by relational database.

Page 6: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

What is a probabilistic Database?

• Traditional databases store solid “facts” that can be considered certain

• In many cases, we don’t know things precisely

• I saw a bird, but not sure if it was a dove or a sparrow

• Uncertain, incomplete data is becoming more and more common

Page 7: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Why probabilistic databases ?

• A conventional relational database model cannot handle probabilistic information well, though there is much application of it.

• To treat such information in relational models, database users themselves need to determine how to represent such information and define its semantics.

• Thus, there has been a need for probabilistic database

models which can represent such information.

Page 8: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Need for probabilistic databases

• Example : Supplier X supplies part Y

• From the point of view of data-modelling many situations require more complex forms of information.

Page 9: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Need for probabilistic databases

• How reliable is part Y when supplied by the supplier X?

• Is the probability that a person of type X will purchase product Y greater if that person has also purchased product Z ?

• If X is known how much additional information about Y is provided by knowledge of Z ?

Page 10: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Where did this concept come from ?

Application pull:

• The need to manage imprecision in data

Technology push:

• Advances in query processing techniques

Page 11: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

The gain ?

• Probabilistic Databases provide a framework that generalizes the relational database model.

• Extends all the concepts that have been developed to deal with the collection of yes/no facts to apply.

• Also to facts about which one is uncertain (probabilistic databases) or about one has vague or fuzzy information (fuzzy databases).

Page 12: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Types of Uncertainty's related to probabilistic databases

• Tuple-level uncertainty

• Attribute-level uncertainty

Page 13: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Tuple Level Uncertainty

• All attributes in a tuple are known precisely; existence of the tuple is uncertain.

• Tuple (“GSU”, ...) will be present in the answer with some uncertainty

Page 14: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Trio

• Witness 1: Car green; driver was a man, 20 years.

• Witness 2: Car yellow; driver was 35 years old. •

• Statements are partly contradictory. Statements are partly independent.

• Need to gather facts and weigh options: – What is the probability that the driver was a man?

• What is the probability that the car is green and the driver is 35 years old? – (Which witness do you trust; do you trust a witness all or nothing?)

Page 15: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Inconsistent Data

• Types of imprecision addressed:

• Data from different sources is contradictory.

• Data is uncertain, hence, arguably, probabilistic

• Query answers are probabilistic

Page 16: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

A probabilistic would…

• Give better recall and precision!

• Needs to support disjoint tuple events

Page 17: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example for a tuple level uncertainty

Example for a tuple level uncertainty

•This figure represents that a tuple (p,e) exists in the database with probability 0.4, which is indicated by a special attribute named pS .However, these models assume that the existence of each tuple is independent, and are not able to represent dependency among tuples such as “at least one tuple of two or more tuples must exist in a database.”

State Event pS

p e 0.4

p f 0.6

q e 0.5

q f 0.4

Page 18: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example for a tuple level uncertainty

Example for a tuple level uncertainty

•For example, in a this figure , the probability that neither of the two tuples (p,e) and (p,f) exists in the database is given by (1-0.4) * (1-0.6) = 0.24

•0.24>0 , so it cannot represent the case that at least one tuple of the two must exist in the database.

State Event pS

p e 0.4

p f 0.6

q e 0.5

q f 0.4

Page 19: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li
Page 20: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Attribute level uncertainty

• Tuples (identified by keys) exist for certain; an attribute value is however uncertain

• Tomorrow temperature will be somewhere between 50F and 70F

Page 21: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Attribute level uncertainty

• Attribute-level uncertainty means the attribute values are uncertain.

• The attributes are independent of each other

Page 22: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example for attribute level uncertainty

• Tuples exist with certainty

• Temperature at time t1 at location 1 etc.

• But the attribute values (temperatures) are uncertain

• In particular, each temperature value is a Gaussian

Page 23: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example for attribute level uncertainty

Example for attribute level uncertainty

•Tuples exist with certainty.

•• Temperature at time t1 at location 1 etc.

•• But the attribute values (temperatures) are uncertain

Time temp low

temp high

1 20 21

2 22 23

3 18 19

Page 24: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Probabilistic Data Dependencies

• In simple relational database applications data dependencies can be inferred from the meaning of the attributes as determined by the application.

• In complex probabilistic databases though dependencies are not known beforehand.

Page 25: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

An important analogue of Probabilistic database design

• To determine which dependencies exist

• Relative strength of various dependencies when they do not exist in absolute sense.

• Hence the probabilistic dependencies can broadly be classified in to three types.

Page 26: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Types of probabilistic dependencies

• JOIN, FUNCTIONAL and MULTIVALUED dependencies.

• They generalize the database in the sense that-

• They apply both to probabilistic and relational databases• They are straightforward, by application of results from

information theory.

Page 27: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Join Dependency

• A join dependency hold when a relational or probabilistic database system can be decomposed into a collection of (sub)systems such that the system is equal to the join of the subsystems.

Page 28: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Functional Dependencies

• In relational databases XY

• For probabilistic databases the same applies, that is, once the tuple values for attributes X are known, then there is no uncertainty regarding the possible tuple values for attributes Y.

Page 29: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Multivalued Dependencies

• X Y

• What does it mean ?

• X Y if knowledge of R-(Y-X) gives the same or less information about Y than does the knowledge of X alone

Page 30: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example of multivalued dependency

Drinkers(name, addr, phones, beersLiked)• A drinker’s phones are independent of the

beers they like.– name->->phones and name ->->beersLiked.

• Thus, each of a drinker’s phones appears with each of the beers they like in all combinations.

• This repetition is unlike FD redundancy.name->addr is the only FD.

Page 31: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Constraints

• The constraints for probabilistic databases are nearly same as those for relational databases.

• Such as domain, referential, key and entity.

• Intergrity constraints might have issues

Page 32: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Integrity constraints

• Integrity constraints might not be forced enforced or satisfied.

• In some environments checking consistency of constraints might be too expensive.

• Like work loads with high update rates.

Page 33: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Query Evaluation

• Queries in

• Databases and Information retrieval :

Fundamentally different

Page 34: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example

Page 35: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example

Page 36: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li
Page 37: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li
Page 38: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Query Evauluation

• But what if a query looks like:

Page 39: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Support for uncertain matches

• SQL query with uncertain predicates.

• Assign probability to each tuple according to how well it matches the uncertain predicates.

• Outputs are now ranked and have probabilistic semantics

Page 40: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example

Page 41: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Example

Page 42: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Drawbacks

• Commercial database systems have been relatively slow to incorporate imprecision capabilities

• Practitioners are also concerned with compatibility.

• Many of the query evaluation algorithms for matching imprecise data or for processing imprecise queries are fairly complex and inefficient.

Page 43: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Current research and challenges

• Preliminary work in this area includes investigation of incompleteness in logical databases.

• The design of a knowledge-rich system that can deal with various aspects of imprecision

Page 44: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Current research and challenges

• The method of data organization presumes knowledge of probabilistic independence among attributes, which may not be known at design time.

• A model is proposed with some restrictions such as: each relational scheme contains a set of deterministic key attributes.

Page 45: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Bibliography• Query Evaluation on Probabilistic Databases By : Christopher R´e, Nilesh Dalvi and Dan Suciu University of Washington year 2006

http://www.cs.washington.edu/homes/nilesh/papers/debul06probdb.pdf

• Probabilistic databases with correlated tuples By : Prithviraj Sen, Amol Deshpande and Lise Getoor A journal published at Dept of Computer science University of Maryland College Park 2006 http://www.cs.purdue.edu/probdb/updb06/prob_db_wkshp.pdf

• A General Probabilistic database model By :Veronica Biazzo, Alfredo Ferro, Angelo Gilio and Rosalba Giugno Dipartimento di Matematica e Informatica, Viale A. Doria, 6 - 95125 Catania, Italy December 18, 2002 http://www.dmi.unict.it/~giugno/ipmu00.pdf

Page 46: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Bibliography• Models for Incomplete and Probabilistic information By : Todd J. Green and Val Tannen Department of Computer and Information Science University of Pennsylvania International Workshop on Inconsistency and Incompleteness in

Databases March 26, 2006 http://www.cis.upenn.edu/~tjgreen/iidb06-talk.pdf • Efficient Management of Inconsistent Databases. By : Ariel Fuxman, Elham Fazli, and Renee J. Miller. In ACM SIGMOD

Conference , 155-166, 2005 http://www.cs.toronto.edu/~afuxman/publications/sigmod05.pdf

• A System for Efficient Query Answering Over Inconsistent Databases. Ariel Fuxman, Diego Fuxman, and Renee J. Miller. In International Conference on Very Large Databases (VLDB) , 1354-

1357, 2005 http://www.cs.toronto.edu/~afuxman/publications/vldbdemo05.pdf

Page 47: Presented by: Chaitanya Sambhara - Graduate Student - CSc 8712 ( Advanced Database Systems ) - Spring 2008 - Instructor : Dr Yingshu Li

Suggestions/ Question time