Pavan PoluriPavan PoluriSiddharth Deokar
Varun Sudhakar
I/O Processing & File size
management
Workload Distribution
Get the article count and word
count
Redistribute files Create one Perform a
Generic View of System
Redistribute files for calculating suffix arrays
Create one final Suffix Array per P
Perform a Binomial
Reduction
Retrieve the top R interesting
ngrams
Done!!! LEGEND:SiddharthVarunPavanAll
FOSTER’S DESIGN IN OUR PROJECT� Partitioning: Domain Decomposition� Communication : Broadcasting, Point to Point
Communication and Customized Communication� Agglomeration: Gathering of suffix arrays� Agglomeration: Gathering of suffix arrays� Mapping: Cyclic Mapping Strategy
Data Structure� Customized suffix array1 to hold the following data
� Position of ngram in the file� File index to identify the file� Term Frequency� Term Frequency� Document Frequency
� Customized suffix array2 to hold the following data� Position of ngram in the file� File index to identify the file� Term Frequency� TF*IDF value
Algorithm� I/O processing
� Reading directory and storing file information
� File size ManagementPartitioning files � Partitioning files
� Communication
� Workload Distribution� Interleaved Allocation
Contd…..� Alpha Requirement:
� Calculating the number of words and articles � Reduction
� MPI_Reduce()� MPI_Reduce()
Contd…� Suffix Array Calculation
� Every word has a suffix array associated with it� Allocating memory to suffix array based on the alpha
outputoutput� Filling the details of suffix arrays of all words
� Getting the position of the word in the file� Getting the file index of the file the word is in� Assigning term frequency� Assigning document frequency
Contd…� Sorting the Suffix arrays
� Based on Quick sort algorithm� Timing Complexity of quick sort : O(NlogN) (average case)� Memory Requirement : O(NlogN)� Memory Requirement : O(NlogN)
Contd…� Finding Distinct terms in same article
Cat(TF=1, DF=1) Cat(TF=1, DF=1)
House(TF=1,DF=1)
House(TF=2,DF=1)
House(TF=3,DF=1)
House(TF=6,DF=1)
Contd…� Finding Distinct terms in different articles
Cat(TF=1, DF=1) Cat(TF=1, DF=1)
House(TF=1,DF=1)
House(TF=2,DF=1)
House(TF=3,DF=2)
Contd…� Merging Suffix Arrays
� Input: Two sorted suffix arrays � Reading ngrams from file� Output: One sorted suffix array� Output: One sorted suffix array
F
I
R
S
C
H
Z
C F
I
R
S
C
H
Z
C
F
F
I
R
S
C
H
Z
C
F
H
MERGE EXAMPLE
F
I
R
S
C
H
Z
C
F
H
I
F
I
R
S
C
H
Z
C
F
H
I
R
F
I
R
S
C
H
Z
C
F
H
I
R
S
F
I
R
S
C
H
Z
C
F
H
I
R
S
Z
Contd…� Communication Strategies
� Reading and Writing files (Strategy 1 - deprecated)� Binomial Tree Reduction and Nomenclature
� Use of MPI_Barrier� Use of MPI_Barrier� Single file corresponding to suffix array
� Communicating Structures (Strategy 2)� Binomial Tree Reduction
� Use of MPI_Pack, MPI_Unpack()
Contd…� Binomial Tree Reduction
P1 P03
P3 P2
P7
P5
P6
P41
1
1
1
Contd…� Finding top R interesting terms
� Calculation and Storage� New suffix array structure with IDFTF measure
� Sorting � Sorting � Merging
Analysis � Alpha
15
20
Alpha
Tim
e in
sec
onds
#P#P#P#P TimeTimeTimeTime in in in in secssecssecssecs
0
5
10
16 32 64 128 256
Alpha
Number of Processors
Tim
e in
sec
onds16 18.8
32 11.2
64 6.2
128 4.8
256 3.2
50006000700080009000
10000
0.6mb
80mb
Tim
e in
sec
onds
#P#P#P#P TimeTimeTimeTime in in in in secssecssecssecs DataDataDataData
2 326 0.6MB
4 314 0.6MB
#P#P#P#P TimeTimeTimeTime in in in in secssecssecssecs DataDataDataData
4 6600 80MB
01000200030004000
2 4 8 16 32 64
80mb
120mb
Number of Processors
Tim
e in
sec
onds
4 6600 80MB
8 9317 80MB
#P#P#P#P Time inTime inTime inTime in secssecssecssecs DataDataDataData
4 8000 120MB
16 3156 120MB
600
800
1000
1200
ngram = 1
Dat
a in
mb
#P#P#P#P Data in Data in Data in Data in MBMBMBMB
Execution Execution Execution Execution Time in secTime in secTime in secTime in sec
2 0.6 326
4 28 1620
8 80 9317
0
200
400
600
2 4 8 16 32 64
ngram = 1
Number of Processors
Dat
a in
8 80 9317
16 272 7297
32 548 12000
64 1024 14694
Formula� Amdahl’s Law
� Ψ <= 1/f+(1-f)/p� where f is the serial component and p is the number of
processorsprocessors� Ψ is the speedup
� Gustafson’s Law� Ψ <= p+(1-p)s
� Ψ is the scaled speed up� s is the serial component and p is the number of processors
Contd…� Using our results for data of size 120 MB
� Speed up = 7680/3156=2.4� Considering the case where 4 processors as serial and 16
processors as parallelprocessors as parallel
� Using the formula for Amdahl’s Law and substituting Ψ as 2.4 we get f = 0.22
� According to Gustafson’s Law using s = 0.22, Ψ (scaled speed up) = 3.34
Contact Info� Project web page: giga word corpus� Email� Pavan Poluri: [email protected]
Siddharth Deokar: [email protected]� Siddharth Deokar: [email protected]� Varun Sudhakar: [email protected]