Computing n-Gram Statistics in MapReduce Klaus Berberich ([email protected]) Srikanta Bedathur ([email protected])

Computing n-Gram Statistics in MapReduce · kberberi/presentations/... · 2017-12-09

Computing n-Gram Statistics in MapReduce – Klaus Berberich / 26

n-Gram Statistics

✦ Statistics about variable-length word sequences (e.g., lord of the rings, at the end of, …) have many applications in fields including

✦ Information Retrieval

✦ Natural Language Processing

✦ Digital Humanities

Examples shown on the slide: "siri how is the"; "rates hilton paris" next to document d42, "the hilton paris offers great rates in the summer"; "thou shalt not" and "don’t ya".

Problem Statement

How can we efficiently compute statistics about n-grams that occur at least τ times and consist of at most σ words, using MapReduce?

✦ The problem can be seen as a special case of frequent sequence mining (no gaps, single-item transactions only) with a slightly different notion of frequency

✦ Our focus is on large-scale document collections (millions of documents or more, natural language)

Outline

✦ Motivation

✦ Competitors & Challenges

✦ SUFFIX-σ

✦ Extensions

✦ Experimental Evaluation

✦ Conclusion

MapReduce

✦ Distributed data processing platform by Google [1]

✦ for clusters of commodity hardware

✦ handles hardware/software failures transparently

✦ available as open-source Apache Hadoop

✦ Programming model operating on key-value pairs

✦ map() : <k1,v1> -> list<k2,v2>

✦ reduce() : <k2,list<v2>> -> list<k3,v3>

✦ compare() and partition()

[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
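The programming model above can be illustrated with a small in-memory simulation. This is only a teaching sketch, not how Hadoop works internally: the function `run_mapreduce` and its parameter names are hypothetical, and the whole "cluster" is a list of Python dictionaries.

```python
from collections import defaultdict
from functools import cmp_to_key


def run_mapreduce(records, map_fn, reduce_fn, partition_fn=None, compare_fn=None, m=2):
    """Simulate one MapReduce job in memory (hypothetical helper, not a Hadoop API)."""
    if partition_fn is None:
        partition_fn = lambda key: hash(key) % m      # default: hash partitioning
    if compare_fn is None:
        compare_fn = lambda a, b: (a > b) - (a < b)   # default: ascending key order

    # Map phase: every input record <k1,v1> yields a list of <k2,v2> pairs.
    partitions = [defaultdict(list) for _ in range(m)]
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle, part 1: partition() decides which reducer receives a key.
            partitions[partition_fn(k2)][k2].append(v2)

    output = []
    for groups in partitions:
        # Shuffle, part 2: compare() fixes the order in which a reducer sees its keys.
        for k2 in sorted(groups, key=cmp_to_key(compare_fn)):
            # Reduce phase: <k2,list<v2>> yields a list of <k3,v3> pairs.
            output.extend(reduce_fn(k2, groups[k2]))
    return output


# Usage: total length of the values seen per key, with a single reducer.
def map_len(k1, v1):
    return [(k1, len(v1))]


def reduce_sum(k2, values):
    return [(k2, sum(values))]


result = run_mapreduce([("x", "ab"), ("y", "c"), ("x", "def")], map_len, reduce_sum, m=1)
print(result)
```

The point of the sketch is that partition() and compare() are customizable hooks of the shuffle phase, which SUFFIX-σ relies on later in the deck.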

WORD COUNT

✦ Determine counts of all individual words

map(did, content):
  for all words in content:
    emit(word, did)

reduce(word, list<did>):
  emit(word, length(list<did>))

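Figure: data flow of WORD COUNT. Mappers M1 … Mn run map() on the documents d1@t1 = a x b b a y and d2@t2 = b y a x a b and emit (a,d1@t1), (x,d1@t1), … and (b,d2@t2), (y,d2@t2), …. The shuffle phase uses partition() and compare() to route the pairs to reducers R1 … Rm, which receive, e.g., (a,d1@t1), (a,d2@t2), … and (x,d1@t1), (x,d2@t2), …, run reduce(), and output (a,4), (b,4), … and (x,2), (y,2), ….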
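As a rough illustration, the WORD COUNT pseudocode can be written in Python and run on the two example documents. The function names and the dictionary standing in for the shuffle phase are assumptions made for this sketch.

```python
from collections import defaultdict


def map_word_count(did, content):
    # Emit one (word, did) pair per word occurrence, as in the pseudocode.
    return [(word, did) for word in content.split()]


def reduce_word_count(word, dids):
    # The number of received pairs equals the number of occurrences of the word.
    return [(word, len(dids))]


def word_count(docs):
    groups = defaultdict(list)  # stands in for the shuffle phase
    for did, content in docs.items():
        for word, value in map_word_count(did, content):
            groups[word].append(value)
    counts = {}
    for word in sorted(groups):
        for key, count in reduce_word_count(word, groups[word]):
            counts[key] = count
    return counts


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
counts = word_count(docs)
print(counts)
```

On these two documents the sketch yields the counts from the example: 4 for a and b, 2 for x and y.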

N-GRAM COUNT

map(did, content):
  for k in <1 ... σ>:
    for all k-grams in content:
      emit(k-gram, did)

reduce(n-gram, list<did>):
  if length(list<did>) >= τ:
    emit(n-gram, length(list<did>))

Example: for d1@t1 = a x b b a y, map() emits
(a,d1@t1), (x,d1@t1), …
(ax,d1@t1), (xb,d1@t1), …
(axb,d1@t1), (xbb,d1@t1), …
(axbb,d1@t1), (xbba,d1@t1), …
(axbba,d1@t1), (xbbay,d1@t1), …
(axbbay,d1@t1)

[1] J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004
[2] T. Brants et al.: Large Language Models in Machine Translation, EMNLP-CoNLL 2007
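A minimal Python sketch of N-GRAM COUNT makes its communication cost visible. The function names, the tuple representation of n-grams, and the `emitted` counter are assumptions of this sketch; the second example document is taken from the WORD COUNT example.

```python
from collections import defaultdict


def map_ngram_count(did, words, sigma):
    # Emit every k-gram with 1 <= k <= sigma, once per occurrence.
    for k in range(1, sigma + 1):
        for i in range(len(words) - k + 1):
            yield tuple(words[i:i + k]), did


def ngram_count(docs, sigma, tau):
    groups = defaultdict(list)  # stands in for the shuffle phase
    emitted = 0                 # number of key-value pairs sent over the network
    for did, content in docs.items():
        for ngram, value in map_ngram_count(did, content.split(), sigma):
            groups[ngram].append(value)
            emitted += 1
    # reduce(): keep n-grams that occur at least tau times.
    frequent = {g: len(v) for g, v in groups.items() if len(v) >= tau}
    return frequent, emitted


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
frequent, emitted = ngram_count(docs, sigma=3, tau=2)
print(frequent, emitted)
```

Even for two six-word documents and σ = 3, the mappers emit 30 pairs to find five frequent n-grams, which hints at why communication cost is the weak spot of this method.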

APRIORI-SCAN & APRIORI-INDEX

✦ Apriori Principle: a k-gram can occur at least τ times only if its constituent (k-1)-grams occur at least τ times

APRIORI-SCAN

(1) With dictionary { }: for d1@t1 = a x b b a y, emit (a,d1@t1), (b,d1@t1), (x,d1@t1), (y,d1@t1)

(2) With dictionary {a,b,x,y}: for d1@t1 = a x b b a y, emit (ax,d1@t1), (ay,d1@t1), … (bb,d1@t1), (ba,d1@t1), … (xb,d1@t1)
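The scan-based use of the Apriori principle might look as follows in Python. This is a single-machine sketch under stated assumptions: each loop iteration stands in for one MapReduce job, the dictionary of frequent (k-1)-grams is a plain Python set, and the name `apriori_scan` is hypothetical.

```python
from collections import Counter


def apriori_scan(docs, sigma, tau):
    """One scan of the collection per n-gram length k (one job per k in MapReduce)."""
    frequent = {}     # all frequent n-grams found so far -> count
    previous = set()  # dictionary of frequent (k-1)-grams
    for k in range(1, sigma + 1):
        counts = Counter()
        for words in docs.values():
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Apriori pruning: both constituent (k-1)-grams must be frequent.
                if k == 1 or (gram[:-1] in previous and gram[1:] in previous):
                    counts[gram] += 1
        current = {g: c for g, c in counts.items() if c >= tau}
        if not current:
            break     # no frequent k-grams, hence no frequent longer n-grams
        frequent.update(current)
        previous = set(current)
    return frequent


docs = {"d1@t1": "a x b b a y".split(), "d2@t2": "b y a x a b".split()}
frequent = apriori_scan(docs, sigma=5, tau=2)
print(frequent)
```

The sketch shows the trade-off: fewer pairs are counted in later passes, but the method needs up to σ passes and must keep the dictionary in main memory.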

APRIORI-INDEX

(2) ab: d5@t5 [2,7], d7@t7 [1,11]
    bx: d5@t5 [8], d7@t7 [2]
    abx: d5@t5 [7], d7@t7 [1]

(3) abx: d8@t8 [2,7], d9@t9 [1,11]
    bxy: d8@t8 [3]
    abxy: d8@t8 [2]

[1] R. Srikant and R. Agrawal: Mining Sequential Patterns: Generalizations & Performance Improvements, EDBT 1996
[2] M. J. Zaki: SPADE: An Efficient Algorithm for Mining Frequent Sequences, ML 42(1/2):31-60, 2001
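The APRIORI-INDEX example suggests a positional join of posting lists. A minimal sketch of such a join is given below, assuming posting lists are dictionaries from document id to start positions; the name `join_postings` is hypothetical and the real implementation may differ.

```python
def join_postings(left, right):
    """Join the postings of two k-grams that overlap in k-1 words.

    `left` and `right` map a document id to the start positions of the k-gram.
    The (k+1)-gram occurs at position p whenever the left k-gram starts at p
    and the right k-gram starts at p + 1.
    """
    joined = {}
    for did, positions in left.items():
        right_positions = set(right.get(did, ()))
        hits = [p for p in positions if p + 1 in right_positions]
        if hits:
            joined[did] = hits
    return joined


# Posting lists from the example, step (2): ab joined with bx gives abx.
ab = {"d5@t5": [2, 7], "d7@t7": [1, 11]}
bx = {"d5@t5": [8], "d7@t7": [2]}
abx = join_postings(ab, bx)

# Step (3): abx joined with bxy gives abxy.
abxy = join_postings({"d8@t8": [2, 7], "d9@t9": [1, 11]}, {"d8@t8": [3]})
print(abx, abxy)
```

Both joins reproduce the posting lists shown in the example.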

Challenges & Desiderata

✦ Single MapReduce Job (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)

✦ Communication Cost (N-GRAM COUNT ✗ / APRIORI-SCAN ✓ / APRIORI-INDEX ✓)

✦ Main-Memory Consumption (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)

✦ Ease of Implementation (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)

SUFFIX-σ

✦ SUFFIX-σ is based on three key ideas, inspired by methods from String Processing (e.g., suffix arrays)

✦ emit only suffixes of documents in map() to reduce communication cost

✦ partition() suffixes based on their first word

✦ sort suffixes in reverse lexicographic order to limit main-memory consumption in reduce()

Suffixes

✦ SUFFIX-σ emits only suffixes of documents in map()

✦ each of them represents multiple n-grams corresponding to its prefixes (e.g., axbbay represents a, ax, axb, axbb, axbba, and axbbay)

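Example: for d1@t1 = a x b b a y, N-GRAM COUNT emits
(a,d1@t1), (x,d1@t1), …
(ax,d1@t1), (xb,d1@t1), …
(axb,d1@t1), (xbb,d1@t1), …
(axbb,d1@t1), (xbba,d1@t1), …
(axbba,d1@t1), (xbbay,d1@t1), …
(axbbay,d1@t1)
whereas SUFFIX-σ emits only the suffixes among them, e.g., (axbbay,d1@t1) and (xbbay,d1@t1).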
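The relation between suffixes and the n-grams they represent can be checked with a few lines of Python. The helper names `suffixes` and `prefixes` are assumptions of this sketch.

```python
def suffixes(words):
    # One suffix per position; map() would emit each of them with the document id.
    return [tuple(words[i:]) for i in range(len(words))]


def prefixes(suffix):
    # Every prefix of a suffix is an n-gram starting at the suffix's position.
    return [suffix[:k] for k in range(1, len(suffix) + 1)]


words = "a x b b a y".split()
emitted = suffixes(words)
represented = ["".join(p) for p in prefixes(emitted[0])]
total_ngrams = sum(len(prefixes(s)) for s in emitted)
print(len(emitted), total_ngrams, represented)
```

For the six-word document, six emitted suffixes stand for all 21 n-gram occurrences, which is where the saving in communication cost comes from.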

Partitioning

✦ SUFFIX-σ partitions suffixes based on their first word

✦ brings together suffixes representing same n-gram

✦ crucial for computation in single MapReduce job

Example: the suffixes (axbbay,d1@t1), (xbbay,d1@t1), (yyabbx,d3@t3), (yabbx,d3@t3), (axbyyx,d4@t4), (xbyyx,d4@t4) are partitioned by their first word into
(axbbay,d1@t1), (axbyyx,d4@t4)
(xbbay,d1@t1), (xbyyx,d4@t4)
(yyabbx,d3@t3), (yabbx,d3@t3)
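Partitioning by first word can be sketched as below. The use of `zlib.crc32` as a hash is an illustrative choice for this sketch only, and suffixes are written as strings of one-letter words, as in the example.

```python
import zlib


def partition(suffix, m):
    # Route by first word only, so all suffixes sharing a first word
    # (and hence every n-gram they represent) meet at the same reducer.
    # crc32 is an illustrative, process-independent hash, not the deck's choice.
    return zlib.crc32(suffix[0].encode("utf-8")) % m


pairs = [("axbbay", "d1@t1"), ("xbbay", "d1@t1"), ("yyabbx", "d3@t3"),
         ("yabbx", "d3@t3"), ("axbyyx", "d4@t4"), ("xbyyx", "d4@t4")]
m = 3
buckets = {}
for suffix, did in pairs:
    buckets.setdefault(partition(suffix, m), []).append((suffix, did))
print(buckets)
```

Suffixes with different first words may still share a reducer; what matters is that suffixes with the same first word are never split.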

Sorting

✦ SUFFIX-σ sorts suffixes in reverse lexicographic order

✦ bookkeeping using stack of bounded height σ

✦ crucial for low main-memory consumption

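Example: the suffixes
(axbbay,d1@t1), (axbyyx,d4@t4), (abbxa,d5@t5), (aaxxa,d6@t6), (axxxa,d7@t7), (aax,d8@t9)
reach reduce() in reverse lexicographic order:
(axxxa,d7@t7), (axbyyx,d4@t4), (axbbay,d1@t1), (abbxa,d5@t5), (aaxxa,d6@t6), (aax,d8@t9)

Stack after reading (axbbay,d1@t1), top to bottom: y a b b x a, holding {d1@t1} at the top entry and {d4@t4}, {d7@t7} at lower entries (the prefixes axb and ax).

Reading the next suffix, (abbxa,d5@t5), pops the stack down to a and emits (axbbay,1), (axbba,1), (axbb,1), (axb,2), (ax,3).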

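Stack after reading (abbxa,d5@t5), top to bottom: a x b b a, holding {d5@t5} at the top entry and {d1@t1, d4@t4, d7@t7} at the bottom entry a.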
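The reverse lexicographic order from the example can be reproduced with a comparator. The `strcmp` helper below is a Python stand-in for the C function, written for this sketch.

```python
from functools import cmp_to_key


def strcmp(s0, s1):
    # Three-way comparison in the style of C's strcmp: negative, zero, or positive.
    return (s0 > s1) - (s0 < s1)


def compare(suffix0, suffix1):
    # Negating the result reverses the lexicographic order.
    return -strcmp(suffix0, suffix1)


received = ["axbbay", "axbyyx", "abbxa", "aaxxa", "axxxa", "aax"]
ordered = sorted(received, key=cmp_to_key(compare))
print(ordered)
```

In this order a suffix always precedes its own prefixes (aaxxa comes before aax), so once the stack is popped below an n-gram, that n-gram cannot reappear and its count is final.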

SUFFIX-σ

map(did, content):
  for all suffixes in content:
    emit(suffix, did)

partition(suffix, did):
  return suffix[0] % m

compare(suffix0, suffix1):
  return -strcmp(suffix0, suffix1)
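The pseudocode leaves reduce() out. The following end-to-end sketch fills that gap under explicit assumptions: suffixes are truncated to σ words, words are strings hashed with Python's `hash` instead of the integer term ids implied by `suffix[0] % m`, and the two-stack reducer `reduce_partition` is a reconstruction from the sorting example, not the authors' code.

```python
from collections import defaultdict
from functools import cmp_to_key


def map_suffixes(did, words, sigma):
    # Assumption: suffixes are cut off after sigma words.
    for i in range(len(words)):
        yield tuple(words[i:i + sigma]), did


def partition(suffix, m):
    return hash(suffix[0]) % m  # first word only


def compare(s0, s1):
    return -((s0 > s1) - (s0 < s1))  # reverse lexicographic order


def reduce_partition(sorted_suffixes, tau):
    """Count n-grams with a stack of words and a parallel stack of counts."""
    terms, counts, out = [], [], {}

    def pop():
        gram = tuple(terms)
        count = counts.pop()
        terms.pop()
        if count >= tau:
            out[gram] = count
        if counts:
            counts[-1] += count  # occurrences also count for the shorter prefix

    for suffix in sorted_suffixes:
        lcp = 0  # length of the common prefix of stack and suffix
        while lcp < min(len(terms), len(suffix)) and terms[lcp] == suffix[lcp]:
            lcp += 1
        while len(terms) > lcp:
            pop()
        for word in suffix[lcp:]:
            terms.append(word)
            counts.append(0)
        counts[-1] += 1  # one occurrence of the full suffix
    while terms:
        pop()
    return out


def suffix_sigma(docs, sigma, tau, m=2):
    partitions = defaultdict(list)
    for did, content in docs.items():
        for suffix, _ in map_suffixes(did, content.split(), sigma):
            partitions[partition(suffix, m)].append(suffix)
    result = {}
    for part in partitions.values():
        part.sort(key=cmp_to_key(compare))
        result.update(reduce_partition(part, tau))
    return result


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
result = suffix_sigma(docs, sigma=3, tau=2)
print(result)
```

The stacks never grow beyond the longest suffix, so with truncation their height is bounded by σ, matching the bookkeeping claim on the sorting slide.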

Extensions

✦ Closed/Maximal n-Grams

✦ SUFFIX-σ can emit only prefix-closed/maximal n-grams in reduce(); additional MapReduce job then identifies suffix-closed/maximal n-grams

✦ Other Aggregations

✦ n-gram time series

✦ n-gram inverted index

Example (n-gram inverted index):
a b: d2, d7, d9
a b c: d2, d7
b c x: d3, d6

[1] J.-B. Michel et al.: Quantitative Analysis of Culture Using Millions of Digitized Books, Science 2010
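Swapping the aggregation from a count to a set of document ids gives an n-gram inverted index. The brute-force sketch below only illustrates the aggregation, not the SUFFIX-σ reducer; the five documents are hypothetical, chosen so that the resulting posting lists match the example, and applying τ to the number of documents is an assumption.

```python
from collections import defaultdict


def ngram_inverted_index(docs, sigma, tau):
    postings = defaultdict(set)
    for did, content in docs.items():
        words = content.split()
        for i in range(len(words)):
            for k in range(1, min(sigma, len(words) - i) + 1):
                # Aggregate document ids instead of incrementing a counter.
                postings[tuple(words[i:i + k])].add(did)
    # Assumption: tau is applied to the number of distinct documents.
    return {g: sorted(d) for g, d in postings.items() if len(d) >= tau}


# Hypothetical documents, not from the deck.
docs = {"d2": "a b c", "d7": "a b c", "d9": "a b", "d3": "b c x", "d6": "b c x"}
index = ngram_inverted_index(docs, sigma=3, tau=2)
print(index[("a", "b")], index[("a", "b", "c")], index[("b", "c", "x")])
```

An n-gram time series could be built the same way by aggregating the timestamps attached to the document ids instead of the ids themselves.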

Datasets & Setup

✦ The New York Times Annotated Corpus (NYT): 1.8 million newspaper articles, 1987 – 2007, ~3 GB

✦ ClueWeb09-B (CW): 50 million web documents, 2009, ~246 GB

✦ 10 Cluster Nodes (2x6 cores, 64 GB RAM, 4x2 TB HDD, Debian 5.0.9, 1 Gbit Ethernet, CDH3u0)

✦ Implementation operates on compressed integer sequences; datasets pre-processed accordingly

Use Cases

✦ Training a Statistical Language Model (LM)

✦ σ = 5 (i.e., n-grams consisting of up to five words)

✦ τ = 10 (NYT) / τ = 100 (CW)

✦ Identifying Repeated Text Fragments (RT)

✦ σ = 100 (to also capture quotations, idioms, etc.)

✦ τ = 100 (NYT) / τ = 1,000 (CW)

Results (LM)

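Figure (bar chart): wallclock time in minutes, axis ticks 1, 10, 100, 1,000, 10,000, for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ on NYT and CW. Bar labels in extraction order: 81, 3, 3,809, 37, 240, 9, 309, 10 (their assignment to methods and datasets is not recoverable from the extracted text).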

Results (RT)

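Figure (bar chart): wallclock time in minutes, axis ticks 1, 10, 100, 1,000, 10,000, for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ on NYT and CW. Bar labels in extraction order: 229, 5, 393, 62, 338, 77, 15,000, 117 (their assignment to methods and datasets is not recoverable from the extracted text).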

Conclusion

✦ SUFFIX-σ – computes n-gram statistics in MapReduce

✦ based on “suffix idea” from String Processing

✦ robust to wide variety of parameter choices

✦ outperforms state-of-the-art competitors

✦ runs in a single MapReduce job, consumes little main memory, and is easy to implement

Advertisements

✦ Code: http://github.com/kberberi/mpiingrams

✦ EU Project: Longitudinal Analytics of Web Archive Data

✦ Follow-Up Work: I. Miliaraki, K. Berberich, R. Gemulla, S. Zoupanos: Mind the Gap: Large-Scale Frequent Sequence Mining, SIGMOD 2013

Thank you!

Questions?