Pavan Poluri, Siddharth Deokar, Varun Sudhakar
gigawordcorpus.sourceforge.net/Docs/tikal-final-presentation.pdf

Generic View of System
• I/O Processing & File size management
• Workload Distribution
• Get the article count and word count
• Redistribute files for calculating suffix arrays
• Create one final Suffix Array per P
• Perform a Binomial Reduction
• Retrieve the top R interesting ngrams
• Done!!!

LEGEND: Siddharth, Varun, Pavan, All

FOSTER’S DESIGN IN OUR PROJECT
• Partitioning: Domain Decomposition
• Communication: Broadcasting, Point-to-Point Communication and Customized Communication
• Agglomeration: Gathering of suffix arrays
• Mapping: Cyclic Mapping Strategy

Data Structure
• Customized suffix array 1 to hold the following data
  • Position of ngram in the file
  • File index to identify the file
  • Term Frequency
  • Document Frequency
• Customized suffix array 2 to hold the following data
  • Position of ngram in the file
  • File index to identify the file
  • Term Frequency
  • TF*IDF value
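The two record layouts listed above can be sketched as plain data classes. This is only an illustration: the class and field names are hypothetical, and the slides do not say how the project actually declares these structures.

```python
from dataclasses import dataclass


@dataclass
class SuffixEntry1:
    """Hypothetical record for customized suffix array 1."""
    position: int            # position of the ngram in the file
    file_index: int          # identifies which file the ngram is in
    term_frequency: int
    document_frequency: int


@dataclass
class SuffixEntry2:
    """Hypothetical record for customized suffix array 2."""
    position: int            # position of the ngram in the file
    file_index: int          # identifies which file the ngram is in
    term_frequency: int
    tf_idf: float            # TF*IDF value


# Example values are made up for illustration only.
entry = SuffixEntry1(position=42, file_index=3,
                     term_frequency=1, document_frequency=1)
```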

Algorithm
• I/O processing
  • Reading directory and storing file information
• File size Management
  • Partitioning files
  • Communication
• Workload Distribution
  • Interleaved Allocation
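Interleaved (cyclic) allocation can be illustrated as follows: file i goes to process i mod P. This is a minimal sketch under that assumption; the function name and the example file names are hypothetical and the slides do not show the project's own code.

```python
def interleaved_allocation(files, num_procs):
    """Assign files[i] to process i % num_procs (cyclic mapping)."""
    return [files[rank::num_procs] for rank in range(num_procs)]


# Seven hypothetical file names spread over three processes.
allocation = interleaved_allocation(list("abcdefg"), 3)
```

With cyclic mapping, neighbouring files land on different processes, which tends to even out the load when file sizes vary gradually across the directory listing.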

Contd…
• Alpha Requirement:
  • Calculating the number of words and articles
  • Reduction
    • MPI_Reduce()
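The counting step can be pictured as each process computing local totals, followed by a sum reduction. The sketch below simulates that in plain Python rather than calling MPI; the helper names are hypothetical, and it assumes each article is already available as one string with whitespace-separated words.

```python
from functools import reduce


def local_counts(articles):
    """Return (word_count, article_count) for one process's articles."""
    return (sum(len(article.split()) for article in articles), len(articles))


def sum_reduce(per_rank_counts):
    """Element-wise sum over all ranks, mimicking a sum reduction."""
    return reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                  per_rank_counts, (0, 0))


# Two simulated processes with made-up articles.
partitions = [["the cat sat", "a house"], ["the house is big"]]
total_words, total_articles = sum_reduce(
    [local_counts(part) for part in partitions])
```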

Contd…
• Suffix Array Calculation
  • Every word has a suffix array associated with it
  • Allocating memory to suffix array based on the alpha output
  • Filling the details of suffix arrays of all words
    • Getting the position of the word in the file
    • Getting the file index of the file the word is in
    • Assigning term frequency
    • Assigning document frequency

Contd…
• Sorting the Suffix arrays
  • Based on Quick sort algorithm
  • Time complexity of quick sort: O(NlogN) (average case)
  • Memory Requirement: O(NlogN)
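As an illustration of sorting suffix entries by the term they point to, here is a small quick sort over word positions. It is a sketch only: it builds new lists instead of sorting in place, so its memory behaviour differs from whatever the project implemented, and the sample sentence is made up.

```python
def quicksort(entries, key):
    """Quick sort that returns a new list ordered by key(entry)."""
    if len(entries) <= 1:
        return list(entries)
    pivot = key(entries[len(entries) // 2])
    less = [e for e in entries if key(e) < pivot]
    equal = [e for e in entries if key(e) == pivot]
    greater = [e for e in entries if key(e) > pivot]
    return quicksort(less, key) + equal + quicksort(greater, key)


words = "the house by the cat house".split()
# One entry per word position, ordered by the word found at that position.
sorted_positions = quicksort(list(range(len(words))), key=lambda i: words[i])
```

After sorting, equal terms sit next to each other, which is what makes the later frequency counting a single linear scan.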

Contd…
• Finding Distinct terms in same article

Cat(TF=1, DF=1)

House(TF=1,DF=1)

House(TF=2,DF=1)

House(TF=3,DF=1)

House(TF=6,DF=1)

Contd…
• Finding Distinct terms in different articles

Cat(TF=1, DF=1)

House(TF=1,DF=1)

House(TF=2,DF=1)

House(TF=3,DF=2)
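The Cat/House examples suggest that TF counts every occurrence of a term while DF counts the distinct articles it occurs in. A minimal sketch of that bookkeeping over a sorted list of (term, article) pairs follows; the function name and the input layout are assumptions, not the project's code.

```python
from itertools import groupby


def tf_df(sorted_entries):
    """Map each term to (TF, DF) from entries sorted by term.

    sorted_entries is a list of (term, article_id) pairs.
    """
    stats = {}
    for term, group in groupby(sorted_entries, key=lambda e: e[0]):
        articles = [article for _, article in group]
        stats[term] = (len(articles), len(set(articles)))
    return stats


# All three "house" occurrences in one article: DF stays at 1.
same_article = tf_df(sorted(
    [("house", 0), ("cat", 0), ("house", 0), ("house", 0)]))
# Third "house" occurrence in a second article: DF becomes 2.
different_articles = tf_df(sorted(
    [("house", 0), ("cat", 0), ("house", 0), ("house", 1)]))
```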

Contd…
• Merging Suffix Arrays
  • Input: Two sorted suffix arrays
  • Reading ngrams from file
  • Output: One sorted suffix array

MERGE EXAMPLE
Input 1: F I R S
Input 2: C H Z
Output, built one element at a time: C F H I R S Z
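The merge of two sorted arrays can be sketched with the usual two-pointer loop, shown here on the example inputs F I R S and C H Z. Plain letters stand in for ngrams; the function name is hypothetical.

```python
def merge_sorted(left, right):
    """Merge two sorted lists into one sorted list."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of the two inputs still has elements left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


result = merge_sorted(list("FIRS"), list("CHZ"))
```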

Contd…
• Communication Strategies
  • Reading and Writing files (Strategy 1 - deprecated)
    • Binomial Tree Reduction and Nomenclature
    • Use of MPI_Barrier()
    • Single file corresponding to suffix array
  • Communicating Structures (Strategy 2)
    • Binomial Tree Reduction
    • Use of MPI_Pack(), MPI_Unpack()

Contd…
• Binomial Tree Reduction

[Diagram: binomial tree over processes P0 to P7, with edges labelled by reduction step]
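One common way to schedule a binomial tree reduction is to double the partner distance at each step, so that P processes finish in about log2(P) steps. The sketch below only computes which process sends to which; the pairing is the textbook one and is an assumption, since the exact pairing used in the project cannot be read from the diagram.

```python
def binomial_reduction_schedule(num_procs):
    """Return a list of steps, each a list of (sender, receiver) pairs."""
    steps = []
    stride = 1
    while stride < num_procs:
        step = []
        for receiver in range(0, num_procs, 2 * stride):
            sender = receiver + stride
            if sender < num_procs:
                step.append((sender, receiver))
        steps.append(step)
        stride *= 2
    return steps


schedule = binomial_reduction_schedule(8)
```

In the suffix array setting, each receiver would merge the sender's sorted array into its own, so the final array ends up on process 0.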

Contd…
• Finding top R interesting terms
  • Calculation and Storage
    • New suffix array structure with TF*IDF measure
  • Sorting
  • Merging
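Ranking terms by TF*IDF and keeping the top R can be sketched as below. The slides do not give the IDF formula, so this assumes the common definition idf = log(N / DF), where N is the number of articles; the function name and the sample statistics are made up.

```python
import heapq
import math


def top_r_terms(stats, num_articles, r):
    """Return the r (term, score) pairs with the highest TF*IDF.

    stats maps term -> (TF, DF). IDF is assumed to be log(N / DF).
    """
    scored = {term: tf * math.log(num_articles / df)
              for term, (tf, df) in stats.items()}
    return heapq.nlargest(r, scored.items(), key=lambda item: item[1])


stats = {"the": (10, 4), "house": (3, 2), "cat": (1, 1)}
top_two = [term for term, _ in top_r_terms(stats, num_articles=4, r=2)]
```

A term that appears in every article scores zero under this definition, which is why a frequent word such as "the" drops out of the ranking.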

Analysis: Alpha

[Chart: time in seconds vs. number of processors (16, 32, 64, 128, 256)]

#P     Time in secs
16     18.8
32     11.2
64     6.2
128    4.8
256    3.2

[Charts: time in seconds vs. number of processors (2 to 64) for 0.6 MB, 80 MB and 120 MB of data]

#P     Time in secs    Data
2      326             0.6 MB
4      314             0.6 MB
4      6600            80 MB
8      9317            80 MB
4      8000            120 MB
16     3156            120 MB

[Chart: data in MB vs. number of processors (2 to 64), ngram = 1]

#P     Data in MB    Execution time in sec
2      0.6           326
4      28            1620
8      80            9317
16     272           7297
32     548           12000
64     1024          14694

Formula
• Amdahl’s Law
  • Ψ ≤ 1 / (f + (1 − f)/p)
  • where f is the serial component and p is the number of processors
  • Ψ is the speedup
• Gustafson’s Law
  • Ψ ≤ p + (1 − p)s
  • Ψ is the scaled speedup
  • s is the serial component and p is the number of processors

Contd…
• Using our results for data of size 120 MB
  • Speedup = 7680/3156 = 2.4
  • The 4-processor run is treated as the serial case and the 16-processor run as the parallel case
  • Using the formula for Amdahl’s Law with p = 16/4 = 4 and substituting Ψ = 2.4, we get f = 0.22
  • According to Gustafson’s Law, using s = 0.22, Ψ (scaled speedup) = 3.34
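The two figures above can be checked numerically. The sketch assumes an effective processor count of p = 16/4 = 4, because the 4-processor run serves as the baseline; that value is inferred from the numbers rather than stated on the slide, and the function names are hypothetical.

```python
def amdahl_serial_fraction(speedup, p):
    """Solve speedup = 1 / (f + (1 - f) / p) for the serial fraction f."""
    return (1.0 / speedup - 1.0 / p) / (1.0 - 1.0 / p)


def gustafson_scaled_speedup(s, p):
    """Scaled speedup p + (1 - p) * s for serial fraction s."""
    return p + (1 - p) * s


p = 16 // 4                      # assumed effective processor count
f = amdahl_serial_fraction(2.4, p)
scaled = gustafson_scaled_speedup(round(f, 2), p)
```

Rounding f to two decimals before applying Gustafson’s Law mirrors the slide, which reuses s = 0.22.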

Contact Info
• Project web page: giga word corpus
• Email
  • Pavan Poluri: [email protected]
  • Siddharth Deokar: [email protected]
  • Varun Sudhakar: [email protected]