PIP: A Database System for Great and Small Expectations

Oliver Kennedy and Christoph Koch
Department of Computer Science
Cornell University, Ithaca, NY, USA
{okennedy, koch}@cs.cornell.edu

Abstract— Estimation via sampling out of highly selective join queries is well known to be problematic, most notably in online aggregation. Without goal-directed sampling strategies, samples falling outside of the selection constraints lower estimation efficiency at best, and cause inaccurate estimates at worst. This problem appears in general probabilistic database systems, where query processing is tightly coupled with sampling. By committing to a set of samples before evaluating the query, the engine wastes effort on samples that will be discarded, query processing that may need to be repeated, or unnecessarily large numbers of samples.

We describe PIP, a general probabilistic database system that uses symbolic representations of probabilistic data to defer computation of expectations, moments, and other statistical measures until the expression to be measured is fully known. This approach is sufficiently general to admit both continuous and discrete distributions. Moreover, deferring sampling enables a broad range of goal-oriented sampling-based (as well as exact) integration techniques for computing expectations, allows the selection of the integration strategy most appropriate to the expression being measured, and can reduce the amount of sampling work required.

We demonstrate the effectiveness of this approach by showing that even straightforward algorithms can make use of the added information. These algorithms have a profoundly positive impact on the efficiency and accuracy of expectation computations, particularly in the case of highly selective join queries.

I. INTRODUCTION

Uncertain data comes in many forms: statistical models, scientific applications, and data extraction from unstructured text all produce it. Measurements have error margins, while model predictions are often drawn from well-known distributions. Traditional database management systems (DBMS) are ill-equipped to manage this kind of uncertainty. For example, consider a risk-management application that uses statistical models to evaluate the long-term effects of corporate decisions and policies. This application may use a DBMS to store predictions and statistical measures (e.g., error bounds) of those predictions. However, arbitrary queries made on the predictions do not translate naturally into queries on the corresponding statistical measures. A user who requires error bounds on the sum of a join over several tables of predictions must first obtain a formula for computing those bounds, assuming a closed-form formula even exists.

Probabilistic database management systems [3], [4], [21], [5], [11], [17], [18], [10], [20] aim at providing better support for querying uncertain data. Queries in these systems preserve the statistical properties of the data being queried, allowing users to obtain metrics about and representations of query results. The previously mentioned risk-management application, built on top of a probabilistic database, could use the database itself to obtain error bounds on the results of arbitrary queries over its predictions. By encoding the statistical model for its predictions in the database itself, the risk-management application could even use the probabilistic database to estimate complex functions over many correlated variables in its model. In effect, the application could compute all of its predictions within the probabilistic database in the first place.

Few systems are general enough to efficiently query probabilistic data defined over both discrete and continuous distributions. Those that are generally rely on sampling to estimate desired values, as exact solutions can be hard to obtain. If a query contains a selection predicate, samples violating the predicate are dropped and do not contribute to the expectation. The more selective the predicate, the more samples are needed to maintain consistent accuracy. For example, a query may combine a model predicting customer profits with a model predicting dissatisfied customers, perhaps as a result of a corporate decision to use a cheaper, but slower, shipping company. If the query asks for profit loss due to dissatisfied customers, the query need only consider profit from customers under those conditions where the customer is dissatisfied (i.e., the underlying model may include a correlation between ordering patterns and dependence on fast shipping).

Without knowing the likelihood that customer A is satisfied, the query engine must over-provision and waste time generating large numbers of samples, or risk needing to re-evaluate the query if additional samples are needed. This problem is well known in online aggregation, but ignored in general-purpose (i.e., both discrete and continuous) probabilistic databases.

Example 1.1: Suppose a database captures customer orders expected for the next quarter, including prices and destinations of shipment. The order prices are uncertain, but a probability distribution is assumed. The database also stores distributions of shipping durations for each location. Here are two c-tables defining such a probabilistic database:

Order:    Cust  ShipTo  Price
          Joe   NY      X1
          Bob   LA      X3

Shipping: Dest  Duration
          NY    X2
          LA    X4


We assume a suitable specification of the joint distribution p of the random variables X1, . . . , X4 occurring in this database.

Now consider the query

select expected_sum(O.Price)
from Order O, Shipping S
where O.ShipTo = S.Dest
and O.Cust = 'Joe'
and S.Duration >= 7;

asking for the expected loss due to late deliveries to customers named Joe, where the product is free if not delivered within seven days. This can be approximated by Monte Carlo sampling from p, where q represents the result of the sum aggregate query on a sample, here

q(~x) = x1 if x2 ≥ 7, and 0 otherwise.

In a naive sample-first approach, if x2 ≥ 7 is a relatively rare event, a large number of samples will be required to compute a good approximation to the expectation. Moreover, the profit x1 is independent of the shipping time x2. Despite this, samples for x1 are still discarded if the constraint on x2 is not met.

A. Contributions

Selective queries exemplify the need for contextual information when computing expectations and moments. This paper presents PIP, a highly extensible, general probabilistic database system built around this need for information. PIP evaluates queries on symbolic representations of probabilistic data, computing a complete representation of the probabilistic expression to be evaluated before an expectation or moment is taken. To our knowledge, PIP is the first probabilistic database system supporting continuous distributions to evaluate queries in this way.

PIP's approach encompasses and extends the strengths of discrete systems that use c-tables such as Trio [21], MystiQ [4], and MayBMS [1], as well as the generality of the sample-first approach taken by MCDB [10]. It supports both discrete and continuous probability distributions, statistical dependencies definable by queries, expectations of aggregates and distinct-aggregates with or without group-by, and the computation of confidences. The detailed technical contributions of this paper are as follows.

• We propose PIP, the first probabilistic database system based on c-tables to efficiently support continuous probability distributions.

• We show how PIP acts as a generalizable framework for exploiting information about distributions beyond simple sampling functionality (e.g., inverse CDFs) to enhance query processing speed and accuracy. We demonstrate this framework by implementing several traditional statistical optimizations within it.

• We propose a technique for identifying variable independence in c-table conditions and exploit it to accelerate sampling.

• We provide experimental evidence for the competitiveness of our approach. We compare PIP with a reimplementation of the refined sample-first approach taken by MCDB by using a common codebase (both systems are implemented on top of Postgres) to enable fair comparison. We show that PIP's framework performs considerably better than MCDB over a wide range of queries, despite applying only a relatively straightforward set of statistical optimizations. Even in the worst case, PIP remains competitive with MCDB (it essentially never does more work).

II. BACKGROUND AND RELATED WORK

The estimation of probabilities of continuous distributions frequently devolves into the computation of complex integrals. PIP's architecture allows it to identify cases where efficient algorithms exist to obtain a solution. For more complex problems not covered by these cases, PIP relies on Monte Carlo integration [14], a conceptually simple technique that allows for the (approximate) numerical integration of even the most general functions. Conceptually, to compute the expectation of a function q(~x), one approximates the integral by taking n samples ~x1, . . . , ~xn of ~X from their distribution p(~X) and averaging the function evaluated on those n values.
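To make the technique concrete, the following minimal C sketch estimates E[q] by plain Monte Carlo; the distribution (Uniform(0,1)), the function q, and the sample count are illustrative assumptions rather than anything prescribed by PIP's interface.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative target: q(x) = x^2, with X uniform on [0,1].
 * The true expectation is 1/3. */
static double q(double x) { return x * x; }

/* Draw one sample from the assumed distribution p (here: Uniform(0,1)). */
static double sample_p(void) { return (double)rand() / RAND_MAX; }

int main(void) {
    const int n = 100000;            /* number of Monte Carlo samples */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += q(sample_p());        /* average q over the samples of X */
    printf("E[q] ~= %f (exact: %f)\n", sum / n, 1.0 / 3.0);
    return 0;
}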

In general, even taking a sample from a complicated PDF is difficult. Constraints imposed by queries break traditional Monte Carlo assumptions of normalization on p(~X) and require that the sampling technique account for them or lose precision. A variety of techniques exist to address this problem, from straightforward rejection sampling, where constraint-violating samples are repeatedly discarded, to more heavy-duty Markov-chain Monte Carlo (MCMC, cf. e.g., [6]) style techniques such as the Metropolis-Hastings algorithm [13], [6].

Recently, a paper on the MCDB system [10] has promoted an integrated sampling-based approach to probabilistic databases. Conceptually, MCDB uses a sample-first approach: it first computes samples of entire databases and then processes queries on these samples. This is a very general and flexible approach, largely due to its modular approach to probability distributions via black box sample generators called VG Functions. Using Tuple-Bundles, a highly compressed representation of the sampled database instances, MCDB shares computation across instances where possible during query evaluation.

Conditional tables (c-tables, [9]) are relational tables in which tuples have associated conditions expressed as boolean expressions over comparisons of random variables and constants. C-tables are a natural way to represent the deterministic skeleton of a probabilistic relational database in a succinct and tabular form. That is, complete information about uncertain data is encoded using random variables, excluding only specifications of the joint probability distribution of the random variables themselves. This model allows representation of input databases with nontrivial statistical dependencies that are normally associated with graphical models.

For discrete probabilistic databases, a canon of systems has been developed that essentially use c-tables, without referring to them as such. MystiQ [4] uses c-tables internally for query processing but uses a simpler model for input databases. Trio [21] uses c-tables with additional syntactic sugar and calls conditions lineage. MayBMS [1] uses a form of c-tables called U-relations that define how relational algebra representations of queries can encode the corresponding condition transformations.

ORION [18] is a probabilistic database management system for continuous distributions that can alternate between sampling and transforming distributions. However, its representation system is not based on c-tables but essentially on the world-set decompositions of [2], a factorization-based approach related to graphical models. Selection queries in this model may require an exponential blow-up in the representation size, while selections are efficient in c-tables.

A. C-tables

A c-table over a set of variables is a relational table¹ extended by a column holding a local condition for each tuple. A local condition is a Boolean combination (using "and", "or", and "not") of atomic conditions, which are constructed from variables and constants using =, <, ≤, ≠, >, and ≥. The fields of the remaining data columns may hold domain values or variables.

Given a variable assignment θ that maps each variable to a domain value and a condition φ, θ(φ) denotes the condition obtained from φ by replacing each variable X occurring in it by θ(X). Analogously, θ(~t) denotes the tuple obtained from tuple ~t by replacing all variables using θ.

The semantics of c-tables are defined in terms of possible worlds as follows. A possible world is identified with a variable assignment θ. A relation R in that possible world is obtained from its c-table CR as

R := {| θ(~t) | (~t, φ) ∈ CR, θ(φ) is true |}.

That is, for each tuple (~t, φ) of the c-table, where φ is the local condition and ~t is the remainder of the tuple, θ(~t) exists in the world if and only if θ(φ) is true. Note that each c-table has at least one possible world, but worlds constructed from distinct variable assignments do not necessarily represent different database instances.

B. Relational algebra on c-tables

Evaluating relational algebra on c-tables (and, without the slightest difference, on probabilistic c-tables, since probabilities need not be touched at all) is surprisingly straightforward. The evaluation of the operators of relational algebra on multiset c-tables is summarized in Figure 1. An explicit operator "distinct" is used to perform duplicate elimination.

Example 2.1: We continue the example from the introduction. The input c-tables are

COrder = {| ((Joe,NY,X1), true), ((Bob, LA,X3), true) |}

CShipping = {| ((NY,X2), true), ((LA,X4), true) |}.

¹In the following, we use a multiset semantics for tables: tables may contain duplicate tuples. Set transformations are defined in comprehension notation {| · | · |} with ∈ as an iterator. Transformations preserve duplicates. We use ⊎ to denote bag union, which can be thought of as list concatenation if the multisets are represented as unsorted lists.

Cσψ(R) = {| (~r, φ ∧ ψ[~r]) | (~r, φ) ∈ CR |}
    ... ψ[~r] denotes ψ with each reference to a column A of R replaced by ~r.A.
Cπ~A(R) = {| (~r.~A, φ) | (~r, φ) ∈ CR |}
CR×S = {| (~r, ~s, φ ∧ ψ) | (~r, φ) ∈ CR, (~s, ψ) ∈ CS |}
CR∪S = CR ⊎ CS
Cdistinct(R) = {| (~r, ∨{φ | (~r, φ) ∈ CR}) | (~r, ·) ∈ CR |}
CR−S = {| (~r, φ ∧ ψ) | (~r, φ) ∈ Cdistinct(R),
          if (~r, π) ∈ Cdistinct(S) then ψ := ¬π else ψ := true |}

Fig. 1. Relational algebra on c-tables.

The relational algebra query is

πPrice(σShipTo=Dest(σCust='Joe'(Order) × σDuration≥7(Shipping))).

We compute

CσCust='Joe'(Order) = {| ((Joe, NY, X1), true) |}

CσDuration≥7(Shipping) = {| ((NY, X2), X2 ≥ 7), ((LA, X4), X4 ≥ 7) |}

CσCust='Joe'(Order)×σDuration≥7(Shipping) =
    {| ((Joe, NY, X1, NY, X2), X2 ≥ 7), ((Joe, NY, X1, LA, X4), X4 ≥ 7) |}
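The following C sketch walks through the same example in miniature; the tuple and condition representations are invented for illustration and do not reflect PIP's internal layout. Deterministic predicates (Cust = 'Joe', ShipTo = Dest) filter rows directly, while the predicate over the random variable X2 is conjoined into the surviving tuple's local condition, as prescribed by Figure 1.

#include <stdio.h>
#include <string.h>

/* A toy c-table tuple: data fields plus a local condition, stored here
 * as a plain string purely for illustration. */
typedef struct { char cust[8]; char dest[8]; char price[8]; char cond[64]; } CTuple;

int main(void) {
    /* C_Order and C_Shipping from Example 2.1 (all input conditions are true). */
    struct { char cust[8], ship[8], price[8]; } order[] =
        { {"Joe", "NY", "X1"}, {"Bob", "LA", "X3"} };
    struct { char dest[8], dur[8]; } shipping[] =
        { {"NY", "X2"}, {"LA", "X4"} };

    for (int i = 0; i < 2; i++) {
        if (strcmp(order[i].cust, "Joe") != 0) continue;   /* deterministic selection */
        for (int j = 0; j < 2; j++) {
            if (strcmp(order[i].ship, shipping[j].dest) != 0) continue; /* ShipTo = Dest */
            CTuple t;
            strcpy(t.cust, order[i].cust);
            strcpy(t.dest, shipping[j].dest);
            strcpy(t.price, order[i].price);
            /* Duration >= 7 involves a random variable, so instead of
             * filtering, it becomes part of the tuple's local condition. */
            snprintf(t.cond, sizeof t.cond, "%s >= 7", shipping[j].dur);
            printf("(%s, %s, %s) | %s\n", t.cust, t.dest, t.price, t.cond);
        }
    }
    return 0;
}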

C. Probabilistic c-tables; expectations

A probabilistic c-table (cf. [7], [11]) is a c-table in which each variable is simply considered a (discrete or continuous) random variable, and a joint probability distribution is given for the random variables. As a convention, we will denote the discrete random variables by ~X and the continuous ones by ~Y. Throughout the paper, we will always assume without saying that discrete random variables have a finite domain.

We will assume a suitable function p(~X = ~x, ~Y = ~y) specifying a joint distribution which is essentially a PDF on the continuous and a probability mass function on the discrete variables. To clarify this, p is such that we can define the expectation of a function q as

E[q] = Σ~x ∫y1 · · · ∫yn p(~x, ~y) · q(~x, ~y) d~y ≈ (1/n) · Σi=1..n q(~xi, ~yi)

given samples (~x1, ~y1), . . . , (~xn, ~yn) from the distribution p.

We can specify events (sets of possible worlds) via Boolean conditions φ that are true on a possible world (given by an assignment θ) iff the condition obtained by replacing each variable x occurring in φ by θ(x) is true. The characteristic function χφ of a condition (event) φ returns 1 on a variable assignment if it makes φ true and returns zero otherwise. The probability Pr[φ] of event φ is simply E[χφ].

The expected sum of a function h applied to the tuples of a table R,

select expected_sum(h(*)) from R;

can be computed as

E[ Σ~t∈R h(~t) ] = E[ Σ(t,φ)∈CR χφ · h(t) ] = Σ(t,φ)∈CR E[ χφ · (h ◦ t) ]

(the latter by linearity of expectation). Here t(~x, ~y) denotes the tuple t, where any variable that may occur is replaced by the value assigned to it in (~x, ~y).

Example 2.2: Returning to our running example, for CR = {| (x1, x2 ≥ 7) |}, the expected sum of prices is

Σ(t,φ)∈CR ∫x1 · · · ∫x4 p(~x) · χφ(~x) · t(~x).Price d~x = ∫x1 · · · ∫x4 p(~x) · χX2≥7(~x) · x1 d~x.

Counting and group-by. Expected count aggregates are obviously special cases of expected sum aggregates where h is the constant function 1. We generally consider grouping by (continuously) uncertain columns to be of doubtful value. Group-by on nonprobabilistic columns (i.e., those which contain no random variables) poses no difficulty in the c-tables framework: the above summation simply proceeds within groups of tuples from CR that agree on the group columns. In particular, by delaying any sampling process until after the relational algebra part of the query has been evaluated on the c-table representation, we find it easy to create as many samples as we need for each group in a goal-directed fashion. This is a considerable strength of the c-tables approach used in PIP.

III. DESIGN OF THE PIP SYSTEM

Representing the uncertain components of a query's output symbolically as a c-table makes a wide variety of integration techniques available for use in evaluating the statistical characteristics of the expression. If our risk-management application assumes a general model of customer profit and customer satisfaction that relies on queries to create correlations between them, the sampler can detect this lack of dependency, estimate profit and probability of dissatisfaction separately, and combine the two afterwards. Even with relatively straightforward integration techniques, additional knowledge of this form has a profoundly positive impact on the efficiency and accuracy with which expectations of query results can be computed.

Accuracy is especially relevant in cases where the integral has no closed form and exact methods are unavailable. This is the case in a surprising range of practical applications, even when strong simplifying assumptions are made about the input data. For example, even if the input data contains only independent variables sampled from well-studied distributions (e.g., the normal distribution), it is still possible for queries to create complex statistical dependencies in their own right. It is well known, at least in the case of discrete and finite probability distributions, that relational algebra on block-independent-disjoint tables can construct any finite probability distribution [15], [9].

Fig. 2. PIP Query Engine Architecture. (Diagram: a query evaluation layer, with query plans on c-tables computing probabilities, moments, and statistical tests, on top of a data store of (probabilistic) c-tables, a succinct, interchangeable representation of joint distributions of random variables.)

A. Symbolic Representation

PIP represents probabilistic data values symbolically via random variables defined in terms of parametrized probability distribution classes. PIP supports several generic classes of probability distributions (e.g., Normal, Uniform, Exponential, Poisson), and may be extended with additional classes. Variables are treated as opaque while they are manipulated by traditional relational operators. The resulting symbolic representation is a c-table. As the final stage of the query, special operators defined within PIP compute expectations and moments of the uncertain data, or sample the data to generate histograms.

These expectation operators are invoked with a lossless representation of the expression to be evaluated. Because the variables have been treated as opaque, the expectation operator can obtain information about the distribution a variable corresponds to. Similarly, the lossless representation allows on-the-spot generation of samples if necessary; there is no bias from samples shared between multiple query runs.

Developers can (but need not) provide supplemental information (e.g., functions defining the PDF and the CDF) about distributions they extend PIP with. The operator can exploit this additional information to accelerate the sampling process, or potentially even sidestep it entirely. For example, if a query asks for the probability that a variable will fall within specified bounds, the expectation operator can compute it with at most two evaluations of the variable's CDF.

Because the symbolic representation PIP uses is lossless, intermediate query results or views may be materialized. Expectations of values in these views or subsequent queries based on them will not be biased by estimation errors introduced by materializing the view. This is especially useful when a significant fraction of query processing time is devoted to managing deterministic data (e.g., to obtain parameters for the model's variables). Not only can this accelerate processing of commonly used subqueries, but it makes online sampling feasible; the sampler need not evaluate the entire query from scratch to generate additional samples.

Example 3.1: Consider the running example in the context of c-tables. The result of the relational algebra part of the example query can be easily computed as

R:  Price   Condition
    Y1      Y2 ≥ 7

without looking at p. This c-table compactly represents all data still relevant after the application of the relational algebra part of the query, other than p, which remains unchanged. Sampling from R to compute

select expected_sum(Price) from R;

is a much more focused effort. First, we only have to consider the random variables relating to Joe; but determining that random variable Y2 is relevant while Y4 is not requires executing a query involving a join. We want to do this query first, before we start sampling.

Second, assume that delivery times are independent of sales volumes. Then we can approximate the query result by first sampling a Y2 value and only sampling a Y1 value if Y2 ≥ 7. Otherwise, we use 0 as the Y1 value. If Y2 ≥ 7 is relatively rare (e.g., the average shipping times to NY are very slow, with a low variance), this may considerably reduce the number of samples for Y1 that are first computed and then discarded without seeing use. If CDFs are available, we can of course do even better.
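A minimal C sketch of this goal-directed sampling order follows. The concrete distributions for the shipping time Y2 and the price Y1 are invented for illustration, and the Box-Muller generator merely stands in for whatever sampler the relevant distribution class provides.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Box-Muller: one draw from Normal(mu, sigma). Illustrative only. */
static double normal(double mu, double sigma) {
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return mu + sigma * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
}

int main(void) {
    const int n = 100000;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        /* Sample the shipping time first; it alone decides the condition. */
        double y2 = normal(5.0, 2.0);          /* assumed Duration distribution */
        if (y2 >= 7.0)
            sum += normal(100.0, 10.0);        /* only now sample the price Y1 */
        /* otherwise the row is absent in this world: it contributes 0 */
    }
    printf("expected_sum(Price) ~= %f\n", sum / n);
    return 0;
}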

B. Random Variables

At the core of PIP's symbolic representation of uncertainty is the random variable. The simplest form of a random variable in PIP consists of a unique identifier, a subscript (for multi-variate distributions), a distribution class, and a set of parameters for the distribution.

For example, we write

[Y ⇒ Normal(µ, σ²)]

to represent a normally distributed random variable Y with mean µ and variance σ². Multivariate distributions may be specified using array notation. For instance,

[Y[n] ⇒ MVNormal(µ, σ², N)]

Thus, when instantiating a random variable, users specify both a distribution for the variable to be sampled from and a parameter set for that distribution. As a single variable may appear simultaneously at multiple points within the database, the unique identifier is used to ensure that the sampling process generates consistent values for the variable within a given sample.

Rather than storing random variables directly, PIP employs the equation datatype, a flattened parse tree of an arithmetic expression, where leaves are random variables or constants. Because an equation itself describes a random variable, albeit a composite one, we refer to equations and random variables interchangeably. Observe that while we limit our implementation to arithmetic operators, any non-recursive expression may be similarly represented. The equation datatype can be used to encode any target expression accepted by PostgreSQL.
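As an illustration, the following C sketch models an equation as a flattened parse tree whose leaves are constants or random-variable identifiers and evaluates it against one sampled assignment. The node layout is hypothetical and does not reflect PIP's actual encoding.

#include <stdio.h>

/* A toy "equation" node: leaves are constants or random-variable
 * references; inner nodes are arithmetic operators. The nodes live in a
 * contiguous array, i.e., a flattened parse tree. */
typedef enum { EQ_CONST, EQ_VAR, EQ_ADD, EQ_MUL } EqKind;

typedef struct {
    EqKind kind;
    double value;      /* EQ_CONST */
    int    var_id;     /* EQ_VAR: unique identifier of the random variable */
    int    lhs, rhs;   /* EQ_ADD / EQ_MUL: indices of child nodes */
} EqNode;

/* Evaluate an equation over one sampled assignment of the variables. */
static double eq_eval(const EqNode *eq, int root, const double *sample) {
    const EqNode *n = &eq[root];
    switch (n->kind) {
    case EQ_CONST: return n->value;
    case EQ_VAR:   return sample[n->var_id];
    case EQ_ADD:   return eq_eval(eq, n->lhs, sample) + eq_eval(eq, n->rhs, sample);
    case EQ_MUL:   return eq_eval(eq, n->lhs, sample) * eq_eval(eq, n->rhs, sample);
    }
    return 0.0;
}

int main(void) {
    /* X * 3 + 5, with X stored as variable id 0 */
    EqNode eq[] = {
        { EQ_VAR,   0, 0, 0, 0 },       /* node 0: X        */
        { EQ_CONST, 3, 0, 0, 0 },       /* node 1: 3        */
        { EQ_MUL,   0, 0, 0, 1 },       /* node 2: X * 3    */
        { EQ_CONST, 5, 0, 0, 0 },       /* node 3: 5        */
        { EQ_ADD,   0, 0, 2, 3 },       /* node 4: X*3 + 5  */
    };
    double sample[] = { 2.0 };          /* one sampled value for X */
    printf("X*3 + 5 at X=2: %f\n", eq_eval(eq, 4, sample));
    return 0;
}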

Random variable equations can be combined freely with constant expressions, both in the target clause of a select statement and in the where clause. All targets including random variable equations are encoded as random variable equations themselves. All where clauses including random variable equations are encoded as c-table conditions, the context of the row. As this allows a variable to appear several times in a given context, the variable's identifier is used as part of the seed for the pseudorandom number generator used by the sampling process.

C-table conditions allow PIP to represent per-tuple uncertainty. Each tuple is tagged with a condition that must hold for the tuple to be present in the table. C-table conditions are expressed as a boolean formula of atoms, arbitrary inequalities of random variables. The independent probability, or confidence, of the tuple is the probability of the condition being satisfied.

Given tables in which all conditions are conjunctions of atomic conditions and a query that does not employ duplicate elimination, all conditions in the output table are conjunctions. Thus it makes sense to particularly optimize this scenario [1]. In the case of positive relational algebra with the duplicate elimination operator (i.e., we trade duplicate elimination against difference), we can still efficiently maintain the conditions in DNF, i.e., as a simple disjunction of conjunctions of atomic conditions.

Without loss of generality, the model can be limited to conditions that are conjunctions of constraint atoms. Generality is maintained by using bag semantics to encode disjunctions; disjunctive terms are encoded as separate rows, and the DISTINCT operator is used to coalesce terms. This restriction provides several benefits. First, constraint validation is simplified; a pairwise comparison of all atoms in the clause is sufficient to catch the inconsistencies listed above. As an additional benefit, if all atoms of a clause define convex and contiguous regions in the space ~x, ~y, these same properties are also shared by their intersection.

C. Condition Inconsistency

Conditions can become inconsistent by combining contradictory conditions using conjunction, which may happen in the implementations of the operators selection, product, and difference. If such tuples are discovered, they may be freely removed from the c-table.

A condition is consistent if there is a variable assignment that makes the condition true. For general boolean formulas, deciding consistency is computationally hard. But we do not need to decide it during the evaluation of relational algebra operations. Rather, we exploit straightforward cases of inconsistency to clean up c-tables and reduce their sizes. We rely on the later Monte Carlo simulation phase to enforce the remaining inconsistencies.

1) The consistency of conditions not involving variable values is always immediately apparent.

2) Conditions Xi = c1 ∧ Xi = c2 with constants c1 ≠ c2 are always inconsistent.

3) Equality conditions over continuous variables Yj = (·), with the exception of the identity Yj = Yj, are not inconsistent but can be treated as such (the probability mass will always be zero). Similarly, conditions Yj ≠ (·), with the exception of Yj ≠ Yj, can be treated as true and removed or ignored.

4) Other forms of inconsistency can also be detected where efficient techniques to do so are known.


Algorithm 3.2: Checking the consistency of a condition

1) consistencyCheck(ConditionSet C)
2)   foreach discrete condition [X = c1] ∈ C
3)     if ∃[X = c2] ∈ C s.t. c1 ≠ c2, return Inconsistent.
4)   foreach continuous variable group K (see Section IV-A)
5)     initialize bounds map S0 s.t. S0[X] = [−∞, ∞] ∀X ∈ K
6)     while St ≠ St−1 (incrementing t > 0 each iteration)
7)       let St = St−1
8)       foreach equation E ∈ K
9)         if at most 1 variable in E is unbounded
             (use the bounded variables to shrink variable bounds)
10)          foreach X, St[X] = St−1[X] ∩ tightenN(X, E, St)
             (where N = the degree of E)
             (computing bounds from some equations may be slow)
11)        if tightenN has not been defined, skip E.
12)     if ∃X s.t. St[X] = ∅, return Inconsistent.
13)  if no equations were skipped, return Consistent (strong); else return Consistent (weak).

1) tighten1(Variable X, Equation E, Map S)
     (example of variable bounding for a degree-1 polynomial)
2)   express E in normal form aX + bY + cZ + . . . > 0
3)   if a > 0, return [−(b · max(S[Y]) + c · max(S[Z]) + . . .)/a, ∞]
4)   if a < 0, return [−∞, −(b · max(S[Y]) + c · max(S[Z]) + . . .)/a]

Inconsistent and the first Consistent result are strong guarantees; the final Consistent result is weak, since some equations were skipped.

With respect to discrete variables, inconsistency detection may be further simplified. Rather than using abstract representations, every row containing discrete variables may be exploded into one row for every possible valuation. Condition atoms matching each variable to its valuation are used to ensure mutual exclusion of each row. Thus, discrete variable columns may be treated as constants for the purpose of consistency checks. As shown in [1], deterministic database query optimizers do a satisfactory job of ensuring that constraints over discrete variables are filtered as soon as possible.

This consistency-checking algorithm is presented as Algorithm 3.2. Due to space constraints only tighten1 is presented in this paper, but all polynomial equations may be handled using a similar, albeit more complex, enumeration of coefficients. The PIP implementation currently limits users to simple algebraic operators, thus all variable expressions are polynomial. However, since consistency checking is optional, more complex equations can simply be ignored.
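The following C sketch shows the degree-1 bound tightening in isolation, under the illustrative assumption that the constraint has already been normalized to a·X + b·Y + c > 0 and that Y carries finite bounds. It also accounts for the sign of b, which the abbreviated pseudocode above leaves implicit.

#include <stdio.h>
#include <math.h>

typedef struct { double lo, hi; } Interval;

/* tighten1 for a constraint  a*X + b*Y + c > 0: derive a bound on X using
 * the value of b*Y that is least favourable to the constraint, mirroring
 * the degree-1 case sketched in Algorithm 3.2. */
static Interval tighten1(double a, double b, double c, Interval y) {
    double worst_bY = (b >= 0) ? b * y.hi : b * y.lo;   /* max of b*Y over Y's bounds */
    double bound = -(worst_bY + c) / a;
    Interval x = { -INFINITY, INFINITY };
    if (a > 0) x.lo = bound;       /* constraint implies X > bound */
    else if (a < 0) x.hi = bound;  /* constraint implies X < bound */
    return x;
}

int main(void) {
    /* Constraint: 2*X - 1*Y + 4 > 0, with Y already bounded to [0, 10]. */
    Interval y = { 0.0, 10.0 };
    Interval x = tighten1(2.0, -1.0, 4.0, y);
    printf("X in [%g, %g]\n", x.lo, x.hi);
    return 0;
}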

D. Distributions

As variables are defined in terms of parametrized distribution classes, PIP's infrastructure is agnostic to the implementation of the underlying distributions. When defining a distribution, programmers need only include a mechanism for sampling from that distribution, much like [10]'s VG Functions. However, PIP is not limited to simple sampling functionality. If it is possible to efficiently compute or estimate the distribution's probability density function (PDF), cumulative distribution function (CDF), and/or inverse cumulative distribution function (CDF⁻¹), these may be included to improve PIP's efficiency.

Distribution-specific values like the PDF, CDF, and inverse CDF are used to demonstrate what can be achieved with PIP's framework. Further distribution-specific values like weighted sampling, mean, entropy, and higher moments can be used by more advanced statistical methods to achieve even better performance. The process of defining a variable distribution is described further in Section V.

Though PIP abstracts the details of a variable's distribution from query evaluation, it distinguishes between discrete and continuous distributions. As described in Section II, existing research into c-tables has demonstrated efficient ways of querying variables sampled from discrete distributions. PIP employs similar techniques when it is possible to do so.

IV. SAMPLING AND INTEGRATION

Conceptually, query evaluation in PIP is broken up into two components: query and sampling. PIP relies on Postgres to evaluate queries; as described in Section II, a query rewriting pass suffices to translate c-table relational algebra extensions into traditional relational algebra. Details on how query rewriting is implemented are provided in Section V.

As the query is being evaluated, special sampling operators in the query are used to transform random variable expressions into histograms, expectations, and other moments. Both the computation of moments and of probabilities in the general case reduces to numerical integration, and a dominant technique for doing this is Monte Carlo simulation. The approximate computation of the expectation

E[χφ · (h ◦ t)] ≈ (1/n) · Σi=1..n p(~yi) · χφ(~yi) · h(t(~yi))     (1)

faces a number of difficulties. In particular, samples for which χφ is zero do not contribute to an expectation. If φ is a very selective condition, most samples do not contribute to the summation computation of the approximate expectation. (This is closely related to the most prominent problem in online aggregation systems [8], [16], and also in MCDB.)

Example 4.1: Consider a row containing the variable

[Y ⇒ Normal(µ = 5, σ² = 10)]

and the condition predicate (Y > −3) and (Y < 2). The expectation of the variable Y in the context of this row is not 5. Rather, the expectation is taken only over samples of Y that fall in the range (−3, 2) (coming out to approximately 0.17).

A. Sampling Techniques

a) Rejection Sampling: One straightforward approach to this problem is to perform rejection sampling; sample sets are repeatedly generated until a sufficient number of viable (satisfying) samples have been obtained.

However, without scaling the number of samples taken based on E[χφ], information can get very sparse and the approximate expectations will have a high relative error. Unfortunately, as the probability of satisfying the constraint drops, the work required to produce a viable sample increases. Consequently, any mechanism that can improve the chances of satisfying the constraint is beneficial.

b) Sampling using inverse CDFs: As an alternative to generator functions, PIP can also use the inverse-transform method [12]. If available, the distribution's inverse-CDF function is used to translate a uniform-random number in the range [0, 1] to the variable's distribution.

This technique makes constrained sampling more efficient. If the uniform-random input is selected from the range [CDF(a), CDF(b)], the generated value is guaranteed to fall in the range [a, b]. Even if precise constraints cannot be obtained, this technique still reduces the volume of the sampling space, increasing the probability of getting a useful sample.
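A small C sketch of constrained inverse-CDF sampling follows, using the exponential distribution because its CDF and inverse CDF have simple closed forms; the rate parameter and the bounds are illustrative.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Exponential(lambda): closed-form CDF and inverse CDF, so a constrained
 * sample on [a,b] costs exactly one uniform draw. */
static double exp_cdf(double lambda, double x)     { return 1.0 - exp(-lambda * x); }
static double exp_inv_cdf(double lambda, double u) { return -log(1.0 - u) / lambda; }

/* Draw a sample of Exponential(lambda) conditioned on falling in [a,b]. */
static double constrained_sample(double lambda, double a, double b) {
    double lo = exp_cdf(lambda, a), hi = exp_cdf(lambda, b);
    double u = lo + (hi - lo) * ((double)rand() / RAND_MAX);  /* uniform on [CDF(a),CDF(b)] */
    return exp_inv_cdf(lambda, u);
}

int main(void) {
    /* Every sample lands in [7, 20]: no rejections, however rare the event. */
    for (int i = 0; i < 5; i++)
        printf("%f\n", constrained_sample(0.5, 7.0, 20.0));
    return 0;
}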

c) Exploiting independence: Prior to sampling, PIP subdivides constraint predicates into minimal independent subsets: sets of predicates sharing no common variables. When determining subset independence, variables representing distinct values from a multivariate distribution are treated as the set of all of their component variables. For example, consider the one-row c-table of nullary schema (i.e., there is only a condition column)

R:  φ
    (Y1 > 4) ∧ ([Y1 · Y2] > Y3) ∧ (A < 6)

In this case, the atoms (Y1 > 4) and ([Y1 · Y2] > Y3) form one minimal independent subset, while (A < 6) forms another.

Because these subsets share no variables, each may be sampled independently. Sampling fewer variables at a time not only reduces the work lost generating non-satisfying samples, but also decreases the frequency with which this happens.
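One way such a partition can be computed is with a union-find structure over variable identifiers, as in the C sketch below; the variable ids and atoms are those of the example above, while the data structure choice is ours for illustration and not necessarily PIP's.

#include <stdio.h>

#define NVARS 4

/* Union-find over variable ids: atoms sharing a variable end up in the
 * same group, so each group can be sampled independently. */
static int parent[NVARS];
static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
static void unite(int a, int b) { parent[find(a)] = find(b); }

int main(void) {
    for (int i = 0; i < NVARS; i++) parent[i] = i;

    /* Variable ids: Y1=0, Y2=1, Y3=2, A=3.
     * Atom (Y1 > 4) mentions only Y1: pair {0,0}.
     * Atom (Y1*Y2 > Y3) mentions Y1, Y2, Y3: pairs {0,1} and {1,2}.
     * Atom (A < 6) mentions only A: pair {3,3}. */
    int atoms[][2] = { {0, 0}, {0, 1}, {1, 2}, {3, 3} };
    int natoms = sizeof atoms / sizeof atoms[0];
    for (int i = 0; i < natoms; i++) unite(atoms[i][0], atoms[i][1]);

    for (int v = 0; v < NVARS; v++)
        printf("variable %d -> group %d\n", v, find(v));
    return 0;
}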

d) Metropolis: A final alternative available to PIP is the Metropolis algorithm [13]. Starting from an arbitrary point within the sample space, this algorithm performs a random walk weighted towards regions with higher probability densities. Samples taken at regular intervals during the random walk may be used as samples of the distribution.

The Metropolis algorithm has an expensive startup cost, as there is a lengthy 'burn-in' period while it generates a sufficiently random initial value. Despite this startup cost, the algorithm typically requires only a relatively small number of steps between each sample. We can estimate the work required for both Metropolis and naive rejection sampling:

W_metropolis = C_burn-in + [# samples] · C_steps per sample

W_naive = (1 / (1 − P[reject])) · [# samples]

By generating a small number of samples for the subgroup, PIP can generate a rough estimate of P[reject] and decide which approach is less expensive.
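A toy C calculation of this decision follows; every constant (the rejection probability estimate, burn-in cost, and steps per sample) is chosen purely for illustration.

#include <stdio.h>

/* Decide between rejection sampling and Metropolis for one variable group,
 * using the cost model from the text. All constants are illustrative. */
int main(void) {
    double p_reject     = 0.98;    /* rough estimate from a small pilot run */
    double n_samples    = 1000.0;
    double c_burn_in    = 5000.0;  /* Metropolis burn-in cost (in draws) */
    double c_per_sample = 10.0;    /* Metropolis steps between kept samples */

    double w_naive      = n_samples / (1.0 - p_reject);
    double w_metropolis = c_burn_in + n_samples * c_per_sample;

    printf("rejection: %.0f draws, metropolis: %.0f draws -> use %s\n",
           w_naive, w_metropolis,
           w_metropolis < w_naive ? "Metropolis" : "rejection");
    return 0;
}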

B. Row-Level Sampling Operators

Evaluating a query on a probabilistic table (or tables) produces as output another probabilistic table. Though the raw probabilistic data has value, the ultimate goal is to compute statistical aggregates: expectations, moments, etc. To achieve this goal, PIP provides a set of sampling operators: functions that convert probabilistic data into deterministic aggregates.

Example 4.2: Consider the c-tables

R:  A   φ1
    5   (Y1 > 4)

S:  B   φ2
    X   (Y2 > 2)

The query

select A * B as C from R, S;

produces the result table

T:  C       φ
    5 · X   (Y1 > 4) ∧ (Y2 > 2)

A human may find it useful to know that, for a given possible world, T contains one row with C equal to 5 · X in worlds where Y1 > 4 and Y2 > 2, and is empty in all other worlds. However, analysis of large volumes of data in this form requires the ability to summarize and/or aggregate. Analyzing a table containing several hundred or more rows of this form becomes easier if a histogram is available.

Sampling operators take in an expression to be evaluated and the expression's context, the boolean formula of constraints associated with the expression's row, and output a histogram, expectation, or higher moment for the expression under the constraints of its context. As described in Section II, without loss of generality, it is possible to express the context as a set of conditions to be combined conjunctively.

This paper focuses predominantly on sampling operators that follow per-row sampling semantics. Under these semantics, each row is sampled independently. The aggregate in question is taken only over the volume of probability space defined by the expression's context for each row. For example, in the case of Monte Carlo sampling, samples are generated for each row, but only samples satisfying the row's context are considered. All other samples are discarded. If the context is unsatisfiable, a value of NAN will result.

The choice to focus on per-row sampling operators is motivated by efficiency concerns. If the results table is larger than main memory, the sampling process can become IO-bound. Per-row sampling operators require only a single pass (or potentially a second pass if additional precision is required) over the results. While we do consider table-wide sampling semantics out of necessity for some aggregates, the development of additional table-wide techniques is beyond the scope of this paper.

Note that the resulting aggregates are still probabilistic data. For example, the expectation of a given cell is computed in the context of the cell's row; the expectation is computed only over those worlds where the row's condition holds, as in all other worlds the row does not exist. In order to compute the probability of satisfying the row's condition, also referred to as the row's confidence, we define a confidence operator. If the confidence operator is present, all conditions applying to the row are removed from the result and the resulting table is deterministic.

PIP's sampling process, including all techniques described above in Section IV-A, is summarized in Algorithm 4.3. Despite the limited number of sampling techniques employed, this algorithm demonstrates the breadth of state available to the PIP framework at sample-time. Independent group sampling requires the set of constraints. CDF sampling requires distribution-specific knowledge. Metropolis sampling requires similar knowledge, and also employs bounds on P[reject] to make efficiency decisions. All of this information is available to the expectation operator, making it the ideal place to implement these, as well as more advanced, optimization techniques.


Algorithm 4.3: The expectation operator. Given an expression and a context condition, compute the expression's expectation given that the condition is true, and optionally also compute the probability P[C] of the condition being true. If sampling is necessary, both values are computed with (ε, δ) precision.

1) expectation(Expression E, Condition C, Bool getP, Goal {ε, δ})
2)   let X = the set of variables in E
3)   let target = √2 · erf⁻¹(1 − ε)
4)   let N = 0; Sum = 0; SumSq = 0
5)   foreach variable group K ∈ C s.t. ∃X ∈ K ∧ X ∈ X
6)     let Count[K] = 0; Sampler[X] = Natural ∀X ∈ K
7)     consistencyCheck(K)  (see Alg. 3.2)
       (save the bounds map S generated by consistencyCheck())
8)     if inconsistent, return (NAN, 0)
       (if a given variable has bounds, try to sample within those bounds)
9)   foreach X s.t. S[X] ≠ [−∞, ∞] ∧ X has CDF/CDF⁻¹ functions
10)    Sampler[X] = CDF
11)  (keep sampling until we have enough samples for ε, δ precision)
12)  while [ target · |(Sum/N)² − SumSq/N| + Sum/N ] < [δ · Sum] AND N < 1/δ
13)    N = N + 1
       (we only need samples for variables in both E and C)
14)    foreach variable group K ∈ C s.t. ∃X ∈ K ∧ X ∈ X
15)      if Sampler[X] = Metropolis for any X ∈ K
16)        run Metropolis over the saved state for group K
17)      else do ...
18)        Count[K] = Count[K] + 1
19)        if (Count[K] − N)/Count[K] > Metropolis Threshold
20)          if all X have a PDF function
21)            Sampler[X] = Metropolis ∀X ∈ K
22)            scan for a start point to initialize Metropolis state for K
23)            if unable to find a start point, return (NAN, 0)
24)            continue to next K (line 14)
25)          else return (NAN, 0)
26)        generate one instance of all X ∈ K using Sampler[X]
27)      ... while samples do not satisfy K
28)    update Sum = Sum + E[X], SumSq = SumSq + E[X]²
29)  let Prob = ∏K N/Count[K]  ∀K not sampled via Metropolis
     (we have gotten some work towards P[C] for free already)
     (if we actually care about it, we may have more work to do)
30)  if getP = True
     (variables in C but not in E have not been processed yet)
     (also, Metropolis does not give us a probability)
31)    foreach variable group K ∈ C s.t. ∄X ∈ K with [X ∈ X ∨ Sampler[X] = Metropolis]
32)      if |K|Vars = 1 AND X ∈ K has a CDF function
33)        use the CDF to integrate P[K]
34)      else integrate by sampling as above (without Metropolis)
35)      update Prob = Prob · P[K]
36)  return (Sum/N, Prob)  (Prob is ignored if getP = False)

C. Aggregate Sampling Operators

Aggregate operators (e.g., sum, avg, stddev) applied to c-tables introduce a new form of complexity into the sampling process: the result of an aggregate operator applied to a c-table is difficult to represent and sample from. Even if the values being aggregated are constants, each row's context must be evaluated independently. The result is 2^n possible outputs, each with a number of conditions linear in the number of rows. If the values being aggregated are variable expressions, the result is an identical number of outputs, each containing data linear in the size of the table.

This added complexity, coupled with the frequency with which aggregates appear at the root of a query plan, makes them an ideal point at which to perform sampling. As aggregates compute expectations over entire tables, the probability of a given row's presence in the table can be included in the aggregate's expectation. This behavior is termed per-table sampling semantics.

We begin with the simplest form of aggregate expectation, that of an aggregate that obeys linearity of expectation (E[f(~y)] = f(E[~y])), such as sum(). Such aggregates are straightforward to implement: per-row expectations of f(~y)χ(~y) are computed and aggregated (e.g., summed up). Of note, however, is the effect that the operator has on the variance of the result. In the case of sum(), each expectation can be viewed as a Normally distributed random variable with a shared, predetermined variance. By the law of large numbers, the sum of a set of N random variables with equal standard deviation σ has a variance of σ/√N. In other words, when computing the expected sum of N variables, we can reduce the number of samples taken for each individual element by a factor of 1/√N.

If the operator does not obey linearity of expectation (e.g., the max aggregate), the aggregate implementation is made more difficult. Any aggregate may still be implemented naively by evaluating it in parallel on a set of sample worlds instantiated prior to evaluation. This is a worst-case approach to the problem; it may be necessary to perform a second pass over the results if an insufficient number of sample worlds are generated. However, more efficient special-case aggregates, specifically designed to compute expectations, are possible.

For example, consider the max() aggregate. If the target expression is a constant, this aggregate can be implemented extremely efficiently. Given a table sorted by the target expression in descending order, PIP estimates the probability that the first element in the table (the highest value) is present. The aggregate expectation is initialized as the product of this probability and the first element. The second term is maximal only if the first term is not present; when computing the probability of the second term, we must compute the probability of all the second term's constraint atoms being fulfilled while at least one of the first term's atoms is not fulfilled. Though the complexity of this process is exponential in the number of rows, the probability of each successive row being maximal drops exponentially.

Example 4.4: To illustrate this, consider the table (annotated with probabilities)

R:  A   φ       P[φ]
    5   X ≥ 7   0.7
    4   Y ≥ 7   0.8
    1   Z ≥ 7   0.3
    0   Q ≥ 7   0.6

Based on the probabilities listed above,

E[max(A)] = 5 · 0.7 + 4 · 0.8 + 1 · 0.3 + 0 · 0.6

However, if the desired precision is 0.1, we can stop scanning after the second record, since the most any later record can change the result is 1 · (1 − 0.7) · (1 − 0.8) = 0.06.
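The following C sketch implements an early-terminating scan of this kind over the table of Example 4.4, under the simplifying assumption that the rows' conditions are mutually independent; it illustrates the stopping rule rather than PIP's implementation.

#include <stdio.h>

/* Early-terminating expected max() over a table sorted descending by value,
 * assuming independent row conditions. Values and probabilities are those
 * of Example 4.4. */
int main(void) {
    double value[] = { 5.0, 4.0, 1.0, 0.0 };
    double prob[]  = { 0.7, 0.8, 0.3, 0.6 };   /* P[row's condition holds] */
    int    n = 4;
    double precision = 0.1;

    double exp_max = 0.0;
    double p_none_before = 1.0;    /* probability that no earlier row exists */
    for (int i = 0; i < n; i++) {
        /* row i is the maximum only if it exists and no earlier row does */
        exp_max += value[i] * prob[i] * p_none_before;
        p_none_before *= (1.0 - prob[i]);
        /* remaining rows can add at most value[i+1] * p_none_before */
        if (i + 1 < n && value[i + 1] * p_none_before < precision) break;
    }
    printf("E[max(A)] ~= %f\n", exp_max);
    return 0;
}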

V. IMPLEMENTATION

In order to evaluate the viability of PIP's c-tables approach to continuous variables, we have implemented an initial version of PIP as an extension to the PostgreSQL DBMS, as shown in Figure 3.

Fig. 3. The PIP Postgres plugin architecture. (Diagram: the PIP plugin, consisting of a rewriter, SQL functions, a sampler/integrator, PIP datatypes, and distribution definitions (pvars), layered on top of the Postgres core.)

A. Query Rewriting

Much of this added functionality takes advantage of PostgreSQL's extensibility features, and can be used "out-of-the-box". For example, we define the function

CREATE VARIABLE(distribution[,params])

which is used to create continuous variables². Each call allocates a new variable, or a set of jointly distributed variables, and initializes it with the specified parameters. When defining selection targets, operator overloading is used to make random variables appear as normal variables; arbitrary equations may be constructed in this way.

To complete the illusion of working with static data, we have modified PostgreSQL itself to add support for c-table constructs. Under the modified PostgreSQL, when defining a datatype it is possible to declare it as a CTYPE; doing so has the following effects:

• CTYPE columns (and conjunctions of CTYPE columns) may appear in the WHERE and HAVING clauses of a SELECT statement. When found, the CTYPE components of the clause are moved to the SELECT's target clause.

select *
from inputs
where X>Y and Z like '%foo'

is rewritten to

select *, X>Y
from inputs
where Z like '%foo'

• SELECT target clauses are rewritten to ensure that all CTYPE columns in input tables are passed through. The exception to this is in the case of special probability-removing functions. If the select statement contains one or more such functions (typically aggregates, or the conf operator), CTYPE columns are not passed through.

select X,Y
from inputs

is rewritten to

select X,Y,inputs.phi1,inputs.phi2,...
from inputs

• In the case of aggregates, the mechanism by which CTYPE columns may be passed through is unclear. Thus, if the select statement contains an aggregate and one or more input tables have CTYPE columns, the query causes an error unless the aggregate is labeled as a probability-removing function.

• UNION operations are rewritten to ensure that the number of CTYPE columns in their inputs is consistent. If one input table has more CTYPE columns of a given type than the other, the latter is padded with NULL constraints.

-- left(X,phi1), right(X,phi1,phi2)
select *
from left UNION right

is rewritten to

select *
from (select *, NULL AS phi2 FROM left
) UNION right

Note that these extensions are not required to access PIP's core functionality; they exist to allow users to seamlessly use deterministic queries on probabilistic data.

PIP takes advantage of this by encoding constraint atoms in a CTYPE datatype; overloaded > and < operators return a constraint atom instead of a boolean if a random variable is involved in the inequality, and the user can ignore the distinction between random variable and constant value (until the final statistical analysis).

²For discrete distributions, PIP uses a repair-key operator similar to that used in [11].

Rctable:  A      B   φ
          X ∗ 3  5   X > Y ∧ Y > 3
          Y      3   Y < 3 ∨ X < Y

          ⇓

Rint:     A (VarExp)  B (integer)  φ1 (CTYPE)  φ2 (CTYPE)
          X ∗ 3       5            X > Y       Y > 3
          Y           3            Y < 3       NULL
          Y           3            X < Y       NULL

Fig. 4. Internal representation of c-tables.

B. Defining Distributions

PIP's primary benefit over other c-tables implementations is its ability to admit variables chosen from arbitrary continuous distributions. These distributions are specified in terms of general distribution classes, sets of C functions that describe the distribution. In addition to a small number of functions used to parse and encode parameter strings, each PIP distribution class defines one or more of the following functions.

• Generate(Parameters, Seed) uses a pseudorandom number generator to generate a value sampled from the distribution. The seed value allows PIP to limit the amount of state it needs to maintain; multiple calls to Generate with the same seed value produce the same sample, so only the seed value need be stored.

• PDF(Parameters, x) evaluates the probability density function of the distribution at the specified point.

• CDF(Parameters, x) evaluates the cumulative distribution function at the specified point.

• InverseCDF(Parameters, Value) evaluates the inverse of the cumulative distribution function at the specified point.

PIP requires that all distribution classes define a Generate function. All other functions are optional, but can be used to improve PIP's performance if provided; the supplemental functions need only be included when known methods exist for evaluating them efficiently.
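For concreteness, here is what such a distribution class might look like for the exponential distribution. The function names, signatures, and parameter struct are illustrative, since the paper does not spell out PIP's actual registration interface.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Illustrative parameter block for Exponential(lambda). */
typedef struct { double lambda; } ExpParams;

/* Generate(Parameters, Seed): the same seed always yields the same sample,
 * so only the seed needs to be stored with the variable. */
static double exp_generate(const ExpParams *p, unsigned int seed) {
    srand(seed);
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);   /* u in (0,1) */
    return -log(1.0 - u) / p->lambda;
}

static double exp_pdf(const ExpParams *p, double x) {
    return x < 0 ? 0.0 : p->lambda * exp(-p->lambda * x);
}

static double exp_cdf(const ExpParams *p, double x) {
    return x < 0 ? 0.0 : 1.0 - exp(-p->lambda * x);
}

static double exp_inverse_cdf(const ExpParams *p, double u) {
    return -log(1.0 - u) / p->lambda;
}

int main(void) {
    ExpParams p = { 0.5 };
    printf("sample: %f\n", exp_generate(&p, 42));   /* repeatable for seed 42 */
    printf("CDF(7): %f  invCDF(0.9): %f  PDF(1): %f\n",
           exp_cdf(&p, 7.0), exp_inverse_cdf(&p, 0.9), exp_pdf(&p, 1.0));
    return 0;
}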

C. Sampling Functionality

PIP provides several functions for analyzing the uncertainty encoded in a c-table. The two core analysis functions are conf() and expectation().


[Figure 5: Execution time (s) vs. selectivity (.25, .05, .01, .005) for PIP and Sample-First.]
Fig. 5. Time to complete a 1000 sample query, accounting for selectivity-induced loss of accuracy.

[Figure 6: Execution time (s) for Queries 1–4; PIP times split into query and sample phases, compared against Sample-First (one off-scale value at 2985 s).]
Fig. 6. Query evaluation times in PIP and Sample-First for a range of queries. Sample-First's sample-count has been adjusted to match PIP's accuracy.

• conf() performs a conjunctive integration to estimate the probability of a specific row's condition being true. For tables of purely conjunctive conditions, conf() can be used to compute each row's confidence.

• aconf(), a variant of conf(), is used to perform general integration. This function is an aggregate that computes the joint probability of all equivalent rows in the table, a necessity if disjunctions are in use.

• expectation() computes the expectation of a variable by repeated sampling. If a row is specified when the function is called, the sampling process is constrained by the constraint atoms present in the row.

• expected_sum(), expected_max() are aggregate variants of expectation. As with expectation(), they can be parametrized by a row to specify constraints.

• expected_sum_hist(), expected_max_hist() are similar to the above aggregates in that they perform sampling. However, instead of outputting the average of the results, they output an array of all the generated samples. This array may be used to generate histograms and similar visualizations.

Aggregates pose a challenge for the query phase of the PIP evaluation process. Though it is theoretically possible to create composite variables that represent aggregates of their inputs, in practice it is infeasible to do so. The size of such a composite is not just unbounded, but linear in the size of the input table. A variable symbolically representing an aggregate's output could easily grow to an unmanageable level. Instead, PIP limits random variable aggregation to the sampling phase.
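The following sketch illustrates the idea of evaluating an aggregate such as expected_sum() entirely in the sampling phase: each Monte Carlo trial draws one sample per row and adds a row's value only when that row's condition holds in that trial. The row representation and callbacks here are illustrative assumptions rather than PIP's internal API.

    /* Sketch of a sampling-phase sum aggregate over a result c-table. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        /* Draws a sample of this row's value expression for the given trial. */
        double (*sample_value)(void *row_state, unsigned int trial);
        /* Evaluates this row's condition under the same trial's samples. */
        bool   (*condition_holds)(void *row_state, unsigned int trial);
        void   *row_state;
    } agg_row;

    double expected_sum_sketch(agg_row *rows, size_t n_rows, unsigned int n_trials)
    {
        double total = 0.0;
        for (unsigned int t = 0; t < n_trials; t++) {
            double world_sum = 0.0;
            for (size_t r = 0; r < n_rows; r++) {
                if (rows[r].condition_holds(rows[r].row_state, t))
                    world_sum += rows[r].sample_value(rows[r].row_state, t);
            }
            total += world_sum;
        }
        return total / n_trials;   /* Monte Carlo estimate of E[sum] */
    }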

VI. EVALUATION

As a comparison point for PIP's ability to manage continuous random variables, we have constructed a sample-first probabilistic extension to Postgres that emulates MCDB's tuple-bundle concept using ordinary Postgres rows. A sampled variable is represented using an array of floats, while the tuple bundle's presence in each sampled world is represented using a densely packed array of booleans. In lieu of an optimizer, test queries were constructed by hand so as to minimize the lifespan of either array type.

Using Postgres as a basis for both implementations places them on an equal footing with respect to DBMS optimizations unrelated to probabilistic data. This makes it possible to focus our comparison solely on the new, probabilistic functionality of both systems. However, to make the distinction from MCDB (which is a separate development not based on Postgres) explicit, we refer to our implementation of the MCDB approach as Sample-First.

We evaluated both the PIP C-Tables and the Sample-First infrastructure against a variety of related queries. Tests were run over a single connection to a modified instance of PostgreSQL 8.3.4 with default settings, running on a 2x4 core 2.0 GHz Intel Xeon with a 4MB cache. Unless otherwise specified, queries were evaluated over a 1 GB database generated by the TPC-H benchmark, all sampling processes generate 1000 samples apiece, and results shown are the average of 10 sequential trials with error bars indicating one standard deviation.

First, we demonstrate PIP's performance on a simple set of queries ideally suited to the strengths of Sample-First. These two queries (identical to Q1 and Q2 from [10]) involve parametrizing a table of random values, applying a simple set of math operations to the values, and finally estimating the sum of a large aggregate over the table.

The first query computes the rate at which customer purchases have increased over the past two years. The percent increase parametrizes a Poisson distribution that is used to predict how much more each customer will purchase in the coming year. Given this predicted increase in purchasing, the query estimates the company's increased revenue for the coming year.

In the second query, past orders are used to compute the mean and standard deviation of manufacturing and shipping times. These values parametrize a pair of Normal distributions that combine to predict delivery dates for each part ordered today from a Japanese supplier. Finally, the query computes the maximum of these dates, providing the customer with an estimate of how long it will take to have all of their parts delivered.

The results of these tests are shown as queries Q1 and Q2, respectively, in Figure 6. Note that performance times for PIP are divided into two components, query and sample, to distinguish between time spent evaluating the deterministic components of a query and building the result c-table, and time spent computing expectations and confidences of the results. The results are positive; the overhead of the added infrastructure is minimal, even on queries where Sample-First is sufficient. Furthermore, especially in Q2, the sampling process comprises a relatively small portion of the query; additional samples can be generated without incurring the nearly 1 minute query time.


[Figure 7: RMS error vs. number of samples (log-log), PIP vs. Sample-First, panels (a) and (b).]
Fig. 7. RMS error across the results of 30 trials of (a) a simple group-by query Q4 with a selectivity of 0.005, and (b) a complex selection query Q5 with an average selectivity of 0.05.

The third query Q3 in Figure 6 combines a simplified form of queries Q1 and Q2. Rather than aggregating, the query compares the delivery times of Q2 to a set of "satisfaction thresholds." This comparison results in a (probabilistic) table of dissatisfied customers that is used in conjunction with Q1's profit expectations to estimate profit lost to dissatisfied customers. A query of this form might be run on a regular basis, perhaps even daily. As per this usage pattern, we pre-materialize the component of this query unlikely to change on a daily basis: the expected shipping time parameters.

Though PIP and Sample-First both take the same amount of time to generate 1000 samples under this query, the query's selectivity causes Sample-First to disregard a significant fraction of the samples generated; for the same amount of work, Sample-First generates a less accurate answer. To illustrate this point, see Figure 7(a). This figure shows the RMS error, normalized by the correct value, in the results of a query for predicted sales of 5000 parts in the database, given a Poisson distribution for the increase in sales and a popularity multiplier chosen from an exponential distribution. As an additional restriction, the query considers only the extreme scenario where the given product has become extremely popular (resulting in a selectivity of e^-5.29 ≈ 0.005).

RMS error was computed over 30 trials using the algebraically computed correct value as a mean, and then averaged over all 5000 parts. Note that PIP's error is over two orders of magnitude lower than the sample-first approach for a comparable number of samples. This corresponds to the selectivity of the query; as the query becomes more selective, the sample-first error increases. Furthermore, because CDF sampling is used to restrict the sampling bounds, the time taken by both approaches to compute the same number of samples is equivalent.
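The following sketch illustrates CDF-restricted sampling for the one-sided case just described: to sample X conditioned on X > c, a uniform draw is taken from [CDF(c), 1) and mapped through the inverse CDF, so no samples are rejected. The exponential distribution and the function names are assumptions used only for illustration.

    /* Sketch: draw X ~ Exponential(lambda) conditioned on X > cutoff
     * by restricting the uniform draw to [CDF(cutoff), 1). */
    #include <math.h>
    #include <stdlib.h>

    static double exp_cdf(double lambda, double x)
    {
        return (x < 0.0) ? 0.0 : 1.0 - exp(-lambda * x);
    }

    static double exp_inverse_cdf(double lambda, double u)
    {
        return -log(1.0 - u) / lambda;
    }

    double sample_exp_above(double lambda, double cutoff, unsigned int *seed)
    {
        double lo = exp_cdf(lambda, cutoff);
        double u  = (double)rand_r(seed) / ((double)RAND_MAX + 1.0);
        return exp_inverse_cdf(lambda, lo + u * (1.0 - lo));   /* always > cutoff */
    }

Under plain rejection sampling the expected number of draws per accepted sample grows as 1/selectivity; restricting the bounds through the CDF keeps the cost of each accepted sample constant, which is why both approaches generate the same number of samples in the same time here.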

A similar example is shown in Figure 7(b). Here, a model is constructed for how much product suppliers are likely to be able to produce in the coming year based on an Exponential distribution, and for how much product the company expects to sell in the coming year as in Q1. From this model, the expected underproduction is computed, with a lower bound of 0; the selection criterion considers only those worlds where demand exceeds supply. For the purposes of this test, the model was chosen to generate an average selectivity of 0.05. Though the comparison of two random variables necessitates the use of rejection sampling and increases the time PIP spends generating samples, the decision to drop a sample is made immediately after generating it; PIP can continue generating samples until it has a sufficient number, while the Sample-First approach must rerun the entire query.
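When two random variables are compared, as in the demand-versus-supply predicate above, inverting a single CDF no longer suffices and rejection sampling is used instead; the sketch below shows the accept/reject loop in isolation. The joint sampler and its types are hypothetical stand-ins, not PIP's code.

    /* Sketch: rejection sampling for a predicate over two random variables.
     * Rejected draws are retried immediately until enough samples are kept. */
    #include <stdlib.h>

    typedef struct { double demand, supply; } world_sample;

    /* Caller-supplied sampler that draws one joint (demand, supply) sample. */
    typedef world_sample (*joint_sampler)(unsigned int *seed);

    /* Fills out[] with n samples satisfying demand > supply and returns the
     * total number of draws made (accepted + rejected). */
    unsigned long sample_until(world_sample *out, unsigned int n,
                               joint_sampler draw, unsigned int *seed)
    {
        unsigned long draws = 0;
        for (unsigned int accepted = 0; accepted < n; ) {
            world_sample s = draw(seed);
            draws++;
            if (s.demand > s.supply)   /* predicate holds: keep the sample */
                out[accepted++] = s;
            /* otherwise reject and immediately draw again */
        }
        return draws;
    }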

Note the relatively large variance in the RMS error of the Sample-First results in these figures, particularly the first one. Here, both the selectivity and the price for each part vary with the part. Thus, some parts become more important while others become harder to sample from. In order to get a consistent answer for the entire query, Sample-First must provision enough samples for the worst case, while PIP can dynamically scale the number of samples required for each term.

Returning to Figure 6, Queries Q3 and Q4 have been run with PIP at a fixed 1000 samples. As Sample-First drops all but a relatively small number of samples corresponding to the selectivity of the query, we run Sample-First with a correspondingly larger number of samples. For Query 3, the average selectivity of 0.1 resulted in Sample-First discarding 90% of its samples. To maintain comparable accuracies, Sample-First was run at 10,000 samples.

We expand on this datapoint in Figure 5, where we evaluate Q4, altered to have varying selectivities. The sample-first tests are run with 1/selectivity times as many samples as PIP to compensate for the selectivity-induced loss of accuracy, in accordance with Figure 7(a). Note that selectivity is a factor that a user must be aware of when constructing a query with sample-first, while PIP is able to account for selectivity automatically, even if rejection sampling is required.

It should also be noted that both of these queries include two distinct, independent variables involved in the expectation computation. A studious user may note this fact and hand-optimize the query to compute these values independently. However, without this optimization, a sample-first approach will generate one pair of values for each customer for each world. As shown in the RMS error example, an arbitrarily large number of customer revenue values will be discarded and the query result will suffer. In this test, customer satisfaction thresholds were set such that an average of 10% of customers were dissatisfied. Consequently, sample-first discarded an average of 90% of its values. To maintain comparable accuracies, the sample-first query was evaluated with 10,000 samples while the PIP query remained at 1000 samples.

[Figure 8: Sample-First error (as a fraction of the correct result) plotted against its cumulative distribution over runs.]
Fig. 8. Sample-First error as a fraction of the correct result in a danger-estimation query on the NSIDC's Iceberg Sighting Database. PIP was able to obtain an exact result.

As a final test, we evaluated both PIP and our Sample-First implementation on the NSIDC's Iceberg Sighting Database [19] for the past 4 years. 100 virtual ships were placed at random locations in the North Atlantic, and each ship's location was evaluated for its proximity to potential threats; each iceberg in the database was assigned a normally distributed position relative to its last sighting, and an exponentially decaying danger level based on time since last sighting. Recently sighted icebergs constituted a high threat, while historic sightings represented potential new iceberg locations. The query identified icebergs with greater than a 0.1% chance of being located near the ship and estimated the total threat posed by all potentially nearby icebergs. The results of this experiment are shown in Figure 8. PIP was able to employ CDF sampling and obtain an exact result within 10 seconds. By comparison, the Sample-First implementation generating 10,000 samples took over 10 minutes and produced results deviating by as much as 25% from the correct result on a typical run.

VII. CONCLUSION

The use of symbolic representations can be exploited to significantly reduce query processing time and improve query accuracy for a wide range of queries, even with only straightforward algorithms. For the remaining queries, we have shown that the overhead created by the symbolic representation has only a negligible impact on query processing time. This, combined with PIP's extensibility, makes it a powerful platform for evaluating a wide range of queries over uncertain data.

We have shown that symbolic representations of uncertainty like C-Tables can be used to make the computation of expectations, moments, and other statistical measures in probabilistic databases more accurate and more efficient. The availability of the expression being measured enables a broad range of sampling techniques that rely on this information and allows more effective selection of the appropriate technique for a given expression.

Acknowledgments: This material is based upon work supported by the National Science Foundation under Grant IIS-0812272. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

REFERENCES

[1] L. Antova, T. Jansen, C. Koch, and D. Olteanu. "Fast and Simple Relational Processing of Uncertain Data". In Proc. ICDE, 2008.

[2] L. Antova, C. Koch, and D. Olteanu. "10^10^6 Worlds and Beyond: Efficient Representation and Processing of Incomplete Information". In Proc. ICDE, 2007.

[3] R. Cheng, D. V. Kalashnikov, and S. Prabhakar. "Evaluating Probabilistic Queries over Imprecise Data". In Proc. SIGMOD, pages 551–562, 2003.

[4] N. Dalvi and D. Suciu. "Efficient query evaluation on probabilistic databases". VLDB Journal, 16(4):523–544, 2007.

[5] A. Deshpande and S. Madden. "MauveDB: supporting model-based user views in database systems". In Proc. SIGMOD, pages 73–84, 2006.

[6] W. Gilks, S. Richardson, and D. Spiegelhalter. Markov Chain Monte Carlo in Practice: Interdisciplinary Statistics. Chapman and Hall/CRC, 1995.

[7] T. J. Green and V. Tannen. "Models for Incomplete and Probabilistic Information". In International Workshop on Incompleteness and Inconsistency in Databases (IIDB), 2006.

[8] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In SIGMOD Conference, pages 171–182, 1997.

[9] T. Imielinski and W. Lipski. "Incomplete information in relational databases". Journal of the ACM, 31(4):761–791, 1984.

[10] R. Jampani, F. Xu, M. Wu, L. L. Perez, C. M. Jermaine, and P. J. Haas. "MCDB: A Monte Carlo approach to managing uncertain data". In Proc. ACM SIGMOD Conference, pages 687–700, 2008.

[11] C. Koch. "MayBMS: A system for managing large uncertain and probabilistic databases". In C. Aggarwal, editor, Managing and Mining Uncertain Data, chapter 6. Springer-Verlag, Feb. 2009.

[12] A. M. Law and W. D. Kelton. Simulation Modeling and Analysis. McGraw-Hill Book Company, 1982.

[13] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6), June 1953.

[14] N. Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(335), 1949.

[15] C. Re and D. Suciu. Materialized views in probabilistic databases: for information exchange and query optimization. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 51–62. VLDB Endowment, 2007.

[16] F. Rusu, F. Xu, L. L. Perez, M. Wu, R. Jampani, C. Jermaine, and A. Dobra. The DBO database system. In SIGMOD Conference, pages 1223–1226, 2008.

[17] P. Sen and A. Deshpande. "Representing and Querying Correlated Tuples in Probabilistic Databases". In Proc. ICDE, pages 596–605, 2007.

[18] S. Singh, C. Mayfield, S. Mittal, S. Prabhakar, S. E. Hambrusch, and R. Shah. "Orion 2.0: native support for uncertain data". In Proc. ACM SIGMOD Conference, pages 1239–1242, 2008.

[19] National Snow and Ice Data Center, World Data Center for Glaciology. International Ice Patrol (IIP) Iceberg Sightings Database. Digital media, 2008.

[20] D. Z. Wang, E. Michelakis, M. Garofalakis, and J. M. Hellerstein. "BayesStore: Managing large, uncertain data repositories with probabilistic graphical models". In VLDB '08, Proc. 34th Int. Conference on Very Large Databases, 2008.

[21] J. Widom. "Trio: a system for data, uncertainty, and lineage". In C. Aggarwal, editor, Managing and Mining Uncertain Data. Springer-Verlag, Feb. 2009.