24
1 Agrios: A Hybrid Approach to Scalable Data Analysis Systems XLDB 2012 Patrick Leyshock, advised by David Maier NSF Award # 1110917

Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

1

Agrios: A Hybrid Approach to Scalable Data Analysis Systems

XLDB 2012 Patrick Leyshock, advised by David Maier

NSF Award # 1110917

Page 2: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

2

The problem – high level

The problem: scientists and engineers want to analyze large datasets … • We need useful information from this data

• Data: • size often exceeds main memory • modeled as multidimensional arrays

• Traditional tools cannot do the job

• Analytic systems do not perform well on large data • Database systems cannot perform sophisticated analytics

What to do?

Page 3: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

3

Three strategies for tackling the problem

Make analytic systems work better with large datasets

Create database systems that perform sophisticated analytics

Combine an analytic system with a database: “let each do what it does best”

+

Page 4: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

4

Agrios - overview

One example of a hybrid system: Agrios Integrates R and SciDB

Middleware between the two systems

+

Page 5: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

5

My contributions include: Semantic mapping between the R language and SciDB's Array Functional Language Design of an automated, cost-based interaction model between R and SciDB, that reduces data movement Start-to-finish system implementation – Agrios – constructed using R and SciDB Test results quantifying the performance of this particular hybrid approach

Contributions

Page 6: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

6

My contributions include: Semantic mapping between the R language and SciDB's Array Functional Language Design of an automated, cost-based interaction model between R and SciDB, that reduces data movement Start-to-finish system implementation – Agrios – constructed using R and SciDB Test results quantifying the performance of this particular hybrid approach

Contributions

Page 7: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

7

Decisions about data movement matter:

Decisions about data movement matter - example

Vectors R and C are stored at B, and we need their product at A. There are choices, and some choices are better than others

Page 8: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

8

Options for moving data

The four options require that different intermediate results be moved:

Option 1: Do both subtrees at R (move neither)

Option 2: Do the left subtree at R, the right subtree at SciDB (move results of right subtree)

Option 3: Do the left subtree at SciDB, the right subtree at R (move results of left subtree)

Option 4: Do both subtrees at SciDB (move results of both subtrees)

Page 9: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

9

What if the size of the inputs vary? What if the locations of the inputs vary?

Things that influence data movement

Page 10: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

10

Three strategies for automating the reduction of data movement:

• “Staging” expressions

• Rewriting expressions

• Consolidating expressions

Three strategies for reducing data movement

Page 11: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

11

A staging is an answer to the question: Where do we perform each operation?

Introduction to staging

operation_1: R operation_2: SciDB operation_3: SciDB . . . . . . operation_n: R

Page 12: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

12

Some possible stagings:

Staging – many options with complex expressions

there are many more …

Page 13: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

13

Select the best staging

Staging: many options with complex expressions

Page 14: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

14

Staging: 1 of 2

We need to calculate the total cost of the plan.

What do we already know? • Cost of left subtree at R: 100

• Cost of left subtree at SciDB: 1000

• Cost of right subtree at R: 2000

• Cost of right subtree at SciDB: 1

The cost of the plan

= cost of left subtree at R (100)

+ cost of right subtree at SciDB (1)

+ cost of moving right subtree from SciDB to R (# of data elements * cost per data element)

Page 15: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

15

Staging: 2 of 2

We calculate costs for each of the four options…

… and choose the best one.

We’ve used plan cost to determine the best staging.

1000 150

13,000 75

Page 16: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

16

Rewriting expressions: 1 of 3

The best staging has a cost of 4000 data elements.

Expression rewriting A sample plan:

Page 17: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

17

Rewrite the plan, using association:

Rewriting expressions: 2 of 3

Page 18: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

18

Rewriting expressions: 3 of 3

Through expression rewriting, we’ve exposed new staging opportunities, decreasing the cost of the plan:

Best staging: 4000 data elements

Best staging: 2000 data elements

Once we begin rewriting expressions, the number of staging opportunities grows

Page 19: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

19

Accumulation Suppose we have a couple of R statements:

Accumulating expressions: 1 of 2

A C op D;

. . .

result A op B;

Page 20: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

20

We can accumulate these statements:

Accumulating expressions: 2 of 2

Accumulating expressions exposes new staging opportunities

A C op D;

. . .

result A op B;

Page 21: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

21

Agrios design

Agrios contains four major subsystems: Parser Accumulator Stager Executor

Page 22: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

22

Preliminary results - staging

Compared staging to alternative approaches: • typically outperforms “do everything at R” • typically outperforms “do everything at SciDB” • performance matched by greedy approach

Page 23: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

23

Preliminary results – expression rewriting and accumulation

Expression rewriting: • Implemented associative and commutative rewrites • Bonneville successfully identifying optimal plans using these general rewrite rules

Accumulation: • Accumulated expressions are at least as good as individually-executed expressions • Beneficial when execution location is arbitrary

Page 24: Agrios: A Hybrid Approach to Scalable Data Analysis Systemsdche2/files/agrios.pdfSciDB's Array Functional Language Design of an automated, cost-based interaction model between R and

24

Thanks to David Maier, the SciDB Team, and the National Science Foundation

(Award # 1110917)

Thanks