33
1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

Embed Size (px)

Citation preview

Page 1: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

1

Florian Thomas, Christian ReßMap/Reduce Algorithms on Hadoop

6. Juli 2009

tf/idf computation

Page 2: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

2

tf/idf computation

•Was ist tf/idf?

•Verschiedene Implementierungen

•Map/Reduce-Aufbau

•Implementierungsbesonderheiten

•Analyse

•Fazit

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 3: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

3

Was ist tf/idf?

term frequency

=

d}||{d|t

|D|

i

logiiiiidfidfidfidf

∑=

k k,j

i,j

freq

freqjjjji,i,i,i,tftftftf

(tf - idf) i, j = tf i,j ⋅ idf i

inverse document frequency

tf/idf

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 4: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

4

Phase 1

Phase 2

#

Implementierungen – 2 Phasen

=

d}||{d|tidf

i

i

||||DDDD||||log

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 5: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

5

Phase 1

Phase 2

#

Wort1: 5

Wort2: 8

Wort3: 2∑

=

k k,jfreqi,j

tfjjjji,i,i,i,freqfreqfreqfreq

=

d}||{d|tidf

i

i

||||DDDD||||log

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 2 Phasen

Page 6: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

6

Phase 1

Phase 2

#

tfWort1,Dokument

tfWort2,Dokument

tfWort3,Dokument

Wort1: 5

Wort2: 8

Wort3: 2

Summe: 15

∑=

k k,jfreqi,j

tfjjjji,i,i,i,freqfreqfreqfreq

∑= kkkk jjjjk,k,k,k,jjjji,i,i,i, freqfreqfreqfreqtftftftf i,jfreq

=

d}||{d|tidf

i

i

||||DDDD||||log

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 2 Phasen

Page 7: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

7

Phase 1

Phase 2

#

ni

tfWort1,Dokument

tfWort2,Dokument

tfWort3,Dokument

Wort1: 5

Wort2: 8

Wort3: 2

Summe: 15

tf-idfWort,Dokument

idfWort

∑=

k k,jfreqi,j

tfjjjji,i,i,i,freqfreqfreqfreq

∑= kkkk jjjjk,k,k,k,jjjji,i,i,i, freqfreqfreqfreqtftftftf i,jfreq

⋅= ||||d}d}d}d}tttt||||{d{d{d{d|||| iiii ||log,

Dtf ji

=

d}||{d|tidf

i

i

||||DDDD||||log

iiiijjjji,i,i,i, idfidfidfidfidf)idf)idf)idf)----(tf(tf(tf(tf ⋅= jitf ,

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 2 Phasen

Page 8: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

8

Phase 1

Phase 2

Phase 3

Phase 4

#

ni

tfWort1,Dokument

tfWort2,Dokument

tfWort3,Dokument

Wort1: 5

Wort2: 8

Wort3: 2

Summe: 15

tf-idfWort,Dokument

idfWort

∑=

k k,jfreqi,j

tfjjjji,i,i,i,freqfreqfreqfreq

∑= kkkk jjjjk,k,k,k,jjjji,i,i,i, freqfreqfreqfreqtftftftf i,jfreq

⋅= ||||d}d}d}d}tttt||||{d{d{d{d|||| iiii ||log,

Dtf ji

=

d}||{d|tidf

i

i

||||DDDD||||log

(tf - idf) i, j = tf i, j ⋅ idf i

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 4 Phasen

Page 9: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

9

Phase 2

Phase 3

tfWort1,Dokument

tfWort2,Dokument

tfWort3,Dokument

Summe: 15

Wort1: 5

Wort2: 8

Wort3: 2∑

=

kkkk jjjjk,k,k,k,freqfreqfreqfreqjjjji,i,i,i,freqfreqfreqfreq

i,jtf

∑=

k k,j

i,j

freq

freqjjjji,i,i,i,tftftftf

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 5 Phasen

Page 10: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

10

Phase 2

Phase 3

Phase 4

tfWort1,Dokument

tfWort2,Dokument

tfWort3,Dokument

tf-idfWort,Dokument

idfWort

ni

Phase 5

Summe: 15

Wort1: 5

Wort2: 8

Wort3: 2∑

=

kkkk jjjjk,k,k,k,freqfreqfreqfreqjjjji,i,i,i,freqfreqfreqfreq

i,jtf

∑=

k k,j

i,j

freq

freqjjjji,i,i,i,tftftftf

= ||||d}d}d}d}tttt||||{d{d{d{d||||idfidfidfidf iiiiiiii ||log

D

iji idftf ⋅= ,jjjji,i,i,i,idf)idf)idf)idf)----(tf(tf(tf(tf

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Implementierungen – 5 Phasen

Page 11: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

11

Map/Reduce-Aufbau – 5 Phasen

Phase 1: Dokumente zählen

DokID, Text

„N“, 1

„N“, (1,1,...)

„N“, #Dokumente

Map

Reduce

=

d}||{d|tidf

i

i

||||DDDD||||log

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 12: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

12

Phase 2: Wörter zählen in jedem Dokument

DokID, Text

(Wort, DokID),1 (Wort, DokID),1 DokID, Sum

(Wort, DokID), (1,1,...) (Wort, DokID), (1,1,...) DokID, Sum

(Wort, DokID), freq (Wort, DokID), freq DokID, Sum

Map

Reduce

...

∑=

kkkk jjjjk,k,k,k,jjjji,i,i,i,

freqfreqfreqfreqfreqfreqfreqfreq

i,jtf

Map/Reduce-Aufbau – 5 Phasen

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 13: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

13

Phase 3: tf berechnen

(Wort, DokID), freq

DokID, (Wort, freq)

DokID, (Sum, (Wort, freq), (Wort, freq),...)

(Wort, DokID), tf (Wort, DokID), tf (Wort, DokID), tf

Map

Reduce

DokID, Sum

DokID, Sum

... ...

tf i, j =

freqi,j

freqk,jk

Map/Reduce-Aufbau – 5 Phasen

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 14: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

14

Phase 4: #DokumenteWort bestimmen, idf berechnen

(Wort, DokID), tf

Wort, (DokID, tf)

Wort, ((DokID, tf), (DokID, tf),...)

Wort, (DokID, tf) Wort, (DokID, tf) Wort, idf

Map

Reduce

...

idf i = log| D |

| {d | ti ∈ d} |

Map/Reduce-Aufbau – 5 Phasen

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 15: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

15

Phase 5: tf-idf berechnen

Wort, (DokID, tf)

Wort, (DokID, tf)

Wort, (idf, (DokID, tf), (DokID, tf), ...)

(Wort, DokID), tf-idf (Wort, DokID), tf-idf (Wort, DokID), tf-idf

Map

Reduce

... ...

Wort, idf

Wort, idf

(tf - idf) i, j = tf i, j ⋅ idf i

Map/Reduce-Aufbau – 5 Phasen

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 16: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

16

Motivation: Phase 5

Wort, (DokID, tf)

Wort, (DokID, tf)

Wort, (idf, (DokID, tf), (DokID, tf), ...)

(Wort, DokID), tf-idf (Wort, DokID), tf-idf (Wort, DokID), tf-idf

Map

Reduce

... ...

Wort, idf

Wort, idf

Implementierung – Secondary Sort

Normalerweise: Wort, ((DokID, tf), (DokID, tf), ..., idf, …)

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 17: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

17

Hadoop – normales Verhalten

Implementierung – Secondary Sort

mental (1, 0.4)

mental (3, 0.9)

mental (5, 0.3)

mental (0, 0.7)

mental (2, 0.6)

mental 1.8

retardation (8, 0.5)

retardation (6, 0.1)

retardation (9, 0.4)

retardation (4, 0.8)

retardation 5.1

Wort (DokID, tf)

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 18: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

18

Hadoop – normales Verhalten

Implementierung – Secondary Sort

mental (1, 0.4)

mental (3, 0.9)

mental (5, 0.3)

mental (0, 0.7)

mental (2, 0.6)

mental 1.8

retardation (8, 0.5)

retardation (6, 0.1)

retardation (9, 0.4)

retardation (4, 0.8)

retardation 5.1

Wort (DokID, tf)

→ mental, ((1, 0.4), (3, 0.9), (5, 0.3), (0, 0.7), 1.8, (2, 0.6))

→ retardation, ((8, 0.5), 5.1, (6, 0.1), (9, 0.4), (4, 0.8))

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 19: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

19

Hadoop – gewünschtes Verhalten

Implementierung – Secondary Sort

mental 1 (1, 0.4)

mental 1 (3, 0.9)

mental 1 (5, 0.3)

mental 1 (0, 0.7)

mental 1 (2, 0.6)

mental 0 1.8

retardation 1 (8, 0.5)

retardation 1 (6, 0.1)

retardation 1 (9, 0.4)

retardation 1 (4, 0.8)

retardation 0 5.1

Wort (DokID, tf)

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

• Anhängen eines Index-Wertes an den Schlüssel

Page 20: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

20

Hadoop – gewünschtes Verhalten

Implementierung – Secondary Sort

mental 1 (1, 0.4)

mental 1 (3, 0.9)

mental 1 (5, 0.3)

mental 1 (0, 0.7)

mental 1 (2, 0.6)

mental 0 1.8

retardation 1 (8, 0.5)

retardation 1 (6, 0.1)

retardation 1 (9, 0.4)

retardation 1 (4, 0.8)

retardation 0 5.1

Wort (DokID, tf)

• Anhängen eines Index-Wertes an den Schlüssel

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 21: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

21

Hadoop – gewünschtes Verhalten

Implementierung – Secondary Sort

mental 1 (1, 0.4)

mental 1 (3, 0.9)

mental 1 (5, 0.3)

mental 1 (0, 0.7)

mental 1 (2, 0.6)

mental 0 1.8

retardation 1 (8, 0.5)

retardation 1 (6, 0.1)

retardation 1 (9, 0.4)

retardation 1 (4, 0.8)

retardation 0 5.1

Wort (DokID, tf)

→ mental, (1.8, (1, 0.4), (3, 0.9), (5, 0.3), (0, 0.7), (2, 0.6))

→ retardation, (5.1, (8, 0.5), (6, 0.1), (9, 0.4), (4, 0.8))

• Anhängen eines Index-Wertes an den Schlüssel

• Anpassen der Gruppierung

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Page 22: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

22

Tupel

Implementierung – Eigene Datentypen

Tuple<F,S>

F FirstS Second

getFirst(): F;getSecond(): S;readFields(DataInput);write(DataOutput);compareTo(Tuple): int;toString(): String;

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Comparator

compare(byte[], int, int, byte[], int, int): int

Page 23: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

23

Tupel

Implementierung – Eigene Datentypen

Tuple<F,S>

F FirstS Second

getFirst(): F;getSecond(): S;readFields(DataInput);write(DataOutput);compareTo(Tuple): int;toString(): String;

TupleTextLong TupleLongDouble

tf/idf computation | Florian Thomas, Christian Reß | 6. Juli 2009

Comparator

compare(byte[], int, int, byte[], int, int): int

Page 24: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

Speicherauslastung – 2 Phasen

� Phase 1� Map: konstant� Reduce: konstant – 1 long

� Phase 2� Map: 1 Dokument, pro Wort ein float� Reduce: | {Wort ∈ Korpus} | * (Wort, Long, Double)

� Abschätzung Obergrenze:

| {Wort ∈ Korpus} | = | Korpus |

1’000’000 * (10+8+8 Byte) ≈ 25 MiB3 GiB RAM ≈ 3,2 Mrd. Dokumente

Page 25: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

Speicherauslastung – 4 Phasen

� Phase 1� Map: konstant� Reduce: konstant – 1 long

� Phase 2� Map: 1 Dokument, 1 int� Reduce: konstant, 1 int

� Phase 3� Map: 1 Tupel (String, Long, Double)� Reduce: { (Wort, tf) } eines Dokuments

� Phase 4� Map: 1 Tupel (String, Long, Double)� Reduce: | {Wort ∈ Korpus} | * (String, Double)

Page 26: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

Speicherauslastung – 5 Phasen

� Phase 1� Map: konstant� Reduce: konstant – 1 long

� Phase 2� Map: 1 Dokument, 1 int� Reduce: konstant, 1 int

� Phase 3� Map: 1 Tupel (String, Long, Long)� Reduce: 1 Tupel (String, Long, Long)

� Phase 4 & Phase 5 � Map: 1 Tupel (String, Long, Double)� Reduce: 1 Tupel (String, Long, Double)

Page 27: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

27

Obergrenze – 5 Phasen

� durch Phase 2 beschränkt

� 1 Wort => max. INT_MAX Vorkommen im Korpus� 1 Dokument => max. LONG_MAX Worte� max. LONG_MAX Dokumente

� Abschätzung: 10^19 * 10^19 * 10Byte = 10 ^ 39 Byte

=> unbegrenzt (10^80 Atome im Universum…)

Page 28: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

28

Laufzeit

Page 29: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

29

Laufzeit

Page 30: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

30

Laufzeit

Page 31: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

31

Laufzeit

Page 32: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

32

Fazit

• beide Varianten bei größeren Korpora anwendbar, max. O(N)

• Vergleich bei großen Korpora nötig (Wikipedia!)

Page 33: tf/idf computation - hpi.de · 1 Florian Thomas, Christian Reß Map/Reduce Algorithms on Hadoop 6. Juli 2009 tf/idf computation

33

Quellen

• Hadoop API Documentationhttp://hadoop.apache.org/core/docs/r0.20.0/api/

• Yahoo! Hadoop Tutorialhttp://developer.yahoo.com/hadoop/tutorial/

• Wikipediahttp://en.wikipedia.org/wiki/Tf-idf

• Introduction to Information Retrievalhttp://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html