Computing n-Gram Statistics in MapReduce · kberberi/presentations/... · 2017-12-09

Computing n-Gram Statistics in MapReduce

Klaus Berberich

Srikanta Bedathur


n-Gram Statistics

✦ Statistics about variable-length word sequences (e.g., lord of the rings, at the end of, …) have many applications in fields including

✦ Information Retrieval

✦ Natural Language Processing

✦ Digital Humanities

[Figure: example word sequences: the query "rates hilton paris"; document d42, "the hilton paris offers great rates in the summer"; "siri how is the"; "thou shalt not"; "don't ya"]

Problem Statement

✦ Can be seen as a special case of frequent sequence mining (no gaps, single-item transaction only) with slightly different notion of frequency

✦ Our focus is on large-scale document collections (millions of documents or more, natural language)

How can we efficiently compute statistics about n-grams that occur at least τ times and consist of at most σ words, using MapReduce?


Outline

✦ Motivation

✦ Competitors & Challenges

✦ SUFFIX-σ

✦ Extensions

✦ Experimental Evaluation

✦ Conclusion



MapReduce

✦ Distributed data processing platform by Google [1]

✦ for clusters of commodity hardware

✦ handles hardware/software failures transparently

✦ available as open-source Apache Hadoop

✦ Programming model operating on key-value pairs

✦ map() : <k1,v1> -> list<k2,v2>

✦ reduce() : <k2,list<v2>> -> list<k3,v3>

✦ compare(), partition()


[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
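
To make the programming model concrete, the following is a minimal single-process sketch of one MapReduce job. It is not Hadoop code: the function name run_mapreduce, its parameters, and the parity example are assumptions made for illustration. It only shows where map(), partition(), compare(), and reduce() plug in.

```python
from collections import defaultdict
from functools import cmp_to_key


def run_mapreduce(records, map_fn, reduce_fn, partition_fn=None, compare_fn=None, m=2):
    """Single-process stand-in for one MapReduce job (illustration only)."""
    if partition_fn is None:
        partition_fn = lambda key, m: hash(key) % m  # default: hash partitioning
    # Map + shuffle: route every intermediate pair to one of m partitions.
    partitions = [defaultdict(list) for _ in range(m)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            partitions[partition_fn(k2, m)][k2].append(v2)
    # Reduce: within each partition, visit keys in compare() order.
    output = []
    sort_key = cmp_to_key(compare_fn) if compare_fn else None
    for part in partitions:
        for k2 in sorted(part, key=sort_key):
            output.extend(reduce_fn(k2, part[k2]))
    return output


# Hypothetical toy job: sum the even and the odd numbers separately.
def parity_map(_, numbers):
    return [(n % 2, n) for n in numbers]


def sum_reduce(key, values):
    return [(key, sum(values))]


result = run_mapreduce([("r1", [1, 2, 3]), ("r2", [4, 5])], parity_map, sum_reduce)
```

A real cluster runs the mappers and reducers on different machines; the sketch keeps everything in one process so that the roles of the four functions stay visible.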


WORD COUNT

✦ Determine counts of all individual words

map(did, content):
    for all words in content:
        emit(word, did)

reduce(word, list<did>):
    emit(word, length(list<did>))


[Figure: WORD COUNT data flow. Documents d1@t1 (a x b b a y) and d2@t2 (b y a x a b) are read by mappers M1 … Mn, whose map() emits (a,d1@t1), (x,d1@t1), … and (b,d2@t2), (y,d2@t2), …; the shuffle phase, controlled by partition() and compare(), routes the pairs to reducers R1 … Rm, which receive (a,d1@t1), (a,d2@t2), … and (x,d1@t1), (x,d2@t2), …; reduce() outputs (a,4), (b,4), … and (x,2), (y,2), …]
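
The WORD COUNT pseudocode can be tried out on the two example documents from the figure. The sketch below is a plain Python rendering under the assumption that words are separated by whitespace; the helper word_count, which stands in for the shuffle phase, is hypothetical.

```python
from collections import defaultdict


def map_word_count(did, content):
    # Emit one (word, did) pair per word occurrence.
    for word in content.split():
        yield word, did


def reduce_word_count(word, dids):
    # The count is the number of received document identifiers.
    yield word, len(dids)


def word_count(docs):
    groups = defaultdict(list)  # stand-in for the shuffle phase
    for did, content in docs.items():
        for word, value in map_word_count(did, content):
            groups[word].append(value)
    return dict(pair for word in groups for pair in reduce_word_count(word, groups[word]))


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
counts = word_count(docs)
```

For the two documents this yields the counts shown in the figure: 4 for a and b, 2 for x and y.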


N-GRAM COUNT

map(did, content):
    for k in <1 ... σ>:
        for all k-grams in content:
            emit(k-gram, did)

reduce(n-gram, list<did>):
    if length(list<did>) >= τ:
        emit(n-gram, length(list<did>))

[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
[2] T. Brants et al.: Large Language Models in Machine Translation, EMNLP-CoNLL 2007


[Figure: for document d1@t1 (a x b b a y), map() emits every k-gram: (a,d1@t1), (x,d1@t1), …, (ax,d1@t1), (xb,d1@t1), …, (axb,d1@t1), (xbb,d1@t1), …, (axbb,d1@t1), (xbba,d1@t1), …, (axbba,d1@t1), (xbbay,d1@t1), …, (axbbay,d1@t1)]
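
As a rough illustration of N-GRAM COUNT and of why its communication cost is high, the sketch below renders the pseudocode in Python and counts how many pairs map() emits for d1@t1. The helper ngram_count and the whitespace tokenization are assumptions made for the example.

```python
from collections import defaultdict


def map_ngram_count(did, content, sigma):
    words = content.split()
    for k in range(1, sigma + 1):               # every length 1 ... sigma
        for i in range(len(words) - k + 1):     # every start position
            yield tuple(words[i:i + k]), did


def reduce_ngram_count(ngram, dids, tau):
    if len(dids) >= tau:                        # keep only frequent n-grams
        yield ngram, len(dids)


def ngram_count(docs, sigma, tau):
    groups = defaultdict(list)                  # stand-in for the shuffle phase
    for did, content in docs.items():
        for ngram, value in map_ngram_count(did, content, sigma):
            groups[ngram].append(value)
    return dict(pair for g in groups for pair in reduce_ngram_count(g, groups[g], tau))


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
frequent = ngram_count(docs, sigma=2, tau=2)
emitted_d1 = sum(1 for _ in map_ngram_count("d1@t1", docs["d1@t1"], sigma=6))
```

With σ = 6 the six-word document d1@t1 alone produces 21 intermediate pairs (6 + 5 + 4 + 3 + 2 + 1), all of which have to be shuffled.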





APRIORI-SCAN & APRIORI-INDEX

✦ Apriori Principle: a k-gram can occur at least τ times only if its constituent (k-1)-grams occur at least τ times

[1] R. Srikant and R. Agrawal: Mining Sequential Patterns: Generalizations and Performance Improvements, EDBT 1996
[2] M. J. Zaki: SPADE: An Efficient Algorithm for Mining Frequent Sequences, ML 42(1/2):31-60, 2001



APRIORI-SCAN

[Figure: repeated scans over d1@t1 (a x b b a y). Scan (1), frequent n-grams so far { }: emits (a,d1@t1), (b,d1@t1), (x,d1@t1), (y,d1@t1). Scan (2), frequent n-grams so far {a,b,x,y}: emits (ax,d1@t1), (ay,d1@t1), …, (bb,d1@t1), (ba,d1@t1), …, (xb,d1@t1)]

APRIORI-INDEX

[Figure: positional posting lists of shorter n-grams are combined into posting lists of longer n-grams.
(2) ab: d5@t5 [2,7], d7@t7 [1,11]; bx: d5@t5 [8], d7@t7 [2]; abx: d5@t5 [7], d7@t7 [1]
(3) abx: d8@t8 [2,7], d9@t9 [1,11]; bxy: d8@t8 [3]; abxy: d8@t8 [2]]
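
The scan-based variant can be sketched as a loop with one pass over the collection per n-gram length, where each pass would be a separate MapReduce job. The code below is a single-process approximation under that assumption, not the authors' implementation; the function name apriori_scan is hypothetical.

```python
from collections import Counter


def apriori_scan(docs, sigma, tau):
    frequent = {}      # all frequent n-grams found so far
    previous = None    # frequent (k-1)-grams from the previous scan
    for k in range(1, sigma + 1):            # one scan (one job) per length k
        counts = Counter()
        for content in docs.values():
            words = content.split()
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Apriori pruning: both constituent (k-1)-grams must be frequent.
                if k == 1 or (gram[:-1] in previous and gram[1:] in previous):
                    counts[gram] += 1
        previous = {g: c for g, c in counts.items() if c >= tau}
        if not previous:                     # no frequent k-grams: stop early
            break
        frequent.update(previous)
    return frequent


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
frequent = apriori_scan(docs, sigma=3, tau=2)
```

The set of frequent (k-1)-grams has to be available to every mapper in scan k, which is one way to read the main-memory concern raised for the Apriori-based methods.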


Challenges & Desiderata

✦ Single MapReduce Job (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)

✦ Communication Cost (N-GRAM COUNT ✗ / APRIORI-SCAN ✓ / APRIORI-INDEX ✓)

✦ Main-Memory Consumption (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)

✦ Ease of Implementation (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)




SUFFIX-σ

✦ SUFFIX-σ is based on three key ideas, inspired by methods from String Processing (e.g., suffix arrays)

✦ emit only suffixes of documents in map() to reduce communication cost

✦ partition() suffixes based on their first word

✦ sort suffixes in reverse lexicographic order to limit main-memory consumption in reduce()


Suffixes

✦ SUFFIX-σ emits only suffixes of documents in map()

✦ each of them represents multiple n-grams corresponding to its prefixes (e.g., axbbay represents a, ax, axb, axbb, axbba, and axbbay)

[Figure: document d1@t1 with content a x b b a y]
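
A small sketch may help to see how much less is emitted when only suffixes are sent. It assumes that each suffix is truncated to at most σ words, which the slides do not state explicitly but which the bounded stack height suggests; the helper names suffixes and prefixes are hypothetical.

```python
def suffixes(words, sigma):
    # One suffix per start position, truncated to at most sigma words (assumption).
    return [tuple(words[i:i + sigma]) for i in range(len(words))]


def prefixes(suffix):
    # Every prefix of a suffix is an n-gram that the suffix represents.
    return [suffix[:k] for k in range(1, len(suffix) + 1)]


words = "a x b b a y".split()
emitted = suffixes(words, sigma=6)
represented = [p for s in emitted for p in prefixes(s)]
first = ["".join(p) for p in prefixes(emitted[0])]
```

For d1@t1 this emits 6 suffixes that together represent the same 21 n-gram occurrences that N-GRAM COUNT would emit one by one.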



Partitioning

✦ SUFFIX-σ partitions suffixes based on their first word

✦ brings together suffixes representing same n-gram

✦ crucial for computation in single MapReduce job

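[Figure: the suffixes (axbbay,d1@t1), (xbbay,d1@t1), (yyabbx,d3@t3), (yabbx,d3@t3), (axbyyx,d4@t4), (xbyyx,d4@t4) are partitioned by their first word:
first word a: (axbbay,d1@t1), (axbyyx,d4@t4)
first word x: (xbbay,d1@t1), (xbyyx,d4@t4)
first word y: (yyabbx,d3@t3), (yabbx,d3@t3)]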
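
Partitioning on the first word can be sketched as follows. The use of CRC32 as a deterministic hash is an assumption for this example (the deck's own implementation operates on integer word identifiers), and every letter stands for one word as in the figure.

```python
import zlib
from collections import defaultdict


def partition(suffix, m):
    # Only the first word decides the partition, so all suffixes that share
    # a prefix (and hence represent the same n-gram) meet at one reducer.
    return zlib.crc32(suffix[0].encode("utf-8")) % m


pairs = [("axbbay", "d1@t1"), ("xbbay", "d1@t1"), ("yyabbx", "d3@t3"),
         ("yabbx", "d3@t3"), ("axbyyx", "d4@t4"), ("xbyyx", "d4@t4")]

by_first_word = defaultdict(list)
for suffix, did in pairs:
    by_first_word[suffix[0]].append((suffix, did))

# Suffixes with the same first word never end up in different partitions.
consistent = all(len({partition(s, 4) for s, _ in group}) == 1
                 for group in by_first_word.values())
```

Several first words may still share a partition when there are fewer reducers than distinct words; what matters is that no n-gram is split across reducers.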


Sorting

✦ SUFFIX-σ sorts suffixes in reverse lexicographic order

✦ bookkeeping using stack of bounded height σ

✦ crucial for low main-memory consumption

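[Figure: the suffixes of one partition, (axbbay,d1@t1), (axbyyx,d4@t4), (abbxa,d5@t5), (aaxxa,d6@t6), (axxxa,d7@t7), (aax,d8@t9), arrive at reduce() in reverse lexicographic order: (axxxa,d7@t7), (axbyyx,d4@t4), (axbbay,d1@t1), (abbxa,d5@t5), (aaxxa,d6@t6), (aax,d8@t9). After the first three suffixes the stack holds a x b b a y (bottom to top), with {d1@t1} at the top, {d4@t4} at axb, and {d7@t7} at ax. Reading (abbxa,d5@t5) pops the stack down to the common prefix a and emits (axbbay,1), (axbba,1), (axbb,1), (axb,2), (ax,3); the stack then holds a b b x a with {d5@t5} at the top and {d1@t1, d4@t4, d7@t7} at a]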


SUFFIX-σ

map(did, content):
    for all suffixes in content:
        emit(suffix, did)

partition(suffix, did):
    return suffix[0] % m

compare(suffix0, suffix1):
    return -strcmp(suffix0, suffix1)
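
The slide lists map(), partition(), and compare() but not reduce(). The following is one possible sketch of the stack-based bookkeeping on the reducer side, reconstructed from the sorting example and not taken from the authors' code. It simplifies by taking one list entry per suffix occurrence and by keeping counts instead of document identifiers; the names compare and reduce_suffixes are assumptions.

```python
from functools import cmp_to_key


def compare(suffix0, suffix1):
    # Negated lexicographic comparison, i.e. reverse lexicographic order.
    return -((suffix0 > suffix1) - (suffix0 < suffix1))


def reduce_suffixes(sorted_suffixes, tau):
    """Count all n-grams of one partition; input must be in compare() order."""
    words, counts = [], []   # parallel stacks: current n-gram and one count per prefix
    frequent = {}

    def pop():
        ngram = tuple(words)
        count = counts.pop()
        words.pop()
        if count >= tau:
            frequent[ngram] = count
        if counts:
            counts[-1] += count   # occurrences of an n-gram also count for its prefix

    for suffix in sorted_suffixes:
        # Length of the longest common prefix of the stack and the new suffix.
        lcp = 0
        while lcp < min(len(words), len(suffix)) and words[lcp] == suffix[lcp]:
            lcp += 1
        while len(words) > lcp:   # n-grams that cannot occur again are final
            pop()
        for word in suffix[lcp:]:
            words.append(word)
            counts.append(0)
        counts[-1] += 1
    while words:
        pop()
    return frequent


# Suffixes from the sorting example; every letter stands for one word.
partition_a = [tuple(s) for s in ["axbbay", "axbyyx", "abbxa", "aaxxa", "axxxa", "aax"]]
ordered = sorted(partition_a, key=cmp_to_key(compare))
result = reduce_suffixes(ordered, tau=1)
```

Because longer suffixes precede their own prefixes in reverse lexicographic order, a popped n-gram is guaranteed to be complete, and the stack never grows beyond the length of the longest suffix.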




Extensions

✦ Closed/Maximal n-Grams

✦ SUFFIX-σ can emit only prefix-closed/maximal n-grams in reduce(); additional MapReduce job then identifies suffix-closed/maximal n-grams

✦ Other Aggregations

✦ n-gram time series

✦ n-gram inverted index

[Figure: n-gram inverted index: a b: d2, d7, d9; a b c: d2, d7; b c x: d3, d6]

[1] J.-B. Michel et al.: Quantitative Analysis of Culture Using Millions of Digitized Books, Science 2010
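
The two other aggregations can be illustrated by changing what is collected per n-gram. The sketch below assumes that the part after "@" in an identifier such as d1@t1 is the document's timestamp, which the slides suggest but do not spell out; the function aggregate and the sample postings are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate(postings):
    """Build an inverted index and a time series from (n-gram, 'did@timestamp') pairs."""
    index = defaultdict(set)        # n-gram -> documents containing it
    series = defaultdict(Counter)   # n-gram -> occurrences per timestamp
    for ngram, tagged in postings:
        did, timestamp = tagged.split("@")
        index[ngram].add(did)
        series[ngram][timestamp] += 1
    return ({g: sorted(d) for g, d in index.items()},
            {g: dict(c) for g, c in series.items()})


# Made-up postings for illustration only.
postings = [("a b", "d2@1987"), ("a b", "d7@1987"), ("a b", "d9@1990"),
            ("a b c", "d2@1987"), ("a b c", "d7@1987")]
index, series = aggregate(postings)
```

In a SUFFIX-σ setting the same idea would replace the per-prefix counts on the stack with per-prefix document lists or per-timestamp counters.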




Datasets & Setup

✦ The New York Times Annotated Corpus (NYT): 1.8 million newspaper articles, 1987 – 2007, ~3 GB

✦ ClueWeb09-B (CW): 50 million web documents, 2009, ~246 GB

✦ 10 cluster nodes (2x6 cores, 64 GB RAM, 4x2 TB HDD, Debian 5.0.9, 1 Gbit Ethernet, CDH3u0)

✦ Implementation operates on compressed integer sequences; datasets pre-processed accordingly


Use Cases

✦ Training a Statistical Language Model (LM)

✦ σ = 5 (i.e., n-grams consisting of up to five words)

✦ τ = 10 (NYT) / τ = 100 (CW)

✦ Identifying Repeated Text Fragments (RT)

✦ σ = 100 (to also capture quotations, idioms, etc.)

✦ τ = 100 (NYT) / τ = 1,000 (CW)



Results (LM)

[Figure: bar chart of wallclock time in minutes (log scale, 1 to 10,000) on NYT and CW for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ; bar labels as extracted: 81, 3, 3,809, 37, 240, 9, 309, 10]


Results (RT)

[Figure: bar chart of wallclock time in minutes (log scale, 1 to 10,000) on NYT and CW for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ; bar labels as extracted: 229, 5, 393, 62, 338, 77, 15,000, 117]




Conclusion

✦ SUFFIX-σ – computes n-gram statistics in MapReduce

✦ based on “suffix idea” from String Processing

✦ robust to wide variety of parameter choices

✦ outperforms state-of-the-art competitors

✦ runs in a single MapReduce job, consumes little main memory, and is easy to implement



Advertisements

✦ Code: http://github.com/kberberi/mpiingrams

✦ EU Project: Longitudinal Analytics of Web Archive Data, http://www.lawa-project.eu

✦ Follow-Up Work: I. Miliaraki, K. Berberich, R. Gemulla, S. Zoupanos: Mind the Gap: Large-Scale Frequent Sequence Mining, SIGMOD 2013


Thank you!

Questions?
