World class IT in a world-wide market

Text Mining Highlights

Marten Trautwein

Syllogic

Research & Development

RoadMap

• TextHub – A parallel information retrieval tool

• Text Mine – A document clustering extension

• Emile – Grammar induction & clustering

What is TextHub?

• Intelligent Parallel Information Retrieval Tool

• Intuitive Web based graphical user interface

• Compression / Decompression

• Indexing / Retrieval

• Document clustering & categorization

The star topology

• Master receives requests

• Master delegates tasks

• Slave performs tasks

• Master collects results

• Master returns answer

[Figure: chart with a time axis (0:07 to 1:04), node counts of 4, 8 and 16, and data sizes of 250 Mbyte, 500 Mbyte and 1 Gbyte]

Use of parallelism

• Documents outnumber processors

• Divide and conquer

• Distribute documents

• Communication overhead minimum

• Linear speed-up (1 GB per hour)
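
The divide-and-conquer scheme above can be illustrated with a minimal sketch. It is not TextHub's implementation: the function names (`partition`, `count_matches`, `master_search`) are hypothetical, and Python threads stand in for the slave nodes only to show the delegation pattern, not to reproduce the reported speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(documents, n_workers):
    """Deal documents round-robin into n_workers roughly equal shares."""
    return [documents[i::n_workers] for i in range(n_workers)]

def count_matches(share, term):
    """Worker task: count the documents in this share that contain the term."""
    return sum(1 for text in share if term in text.lower().split())

def master_search(documents, term, n_workers=4):
    """Master: delegate one share per worker, then collect and combine results."""
    shares = partition(documents, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(lambda s: count_matches(s, term), shares))
    return sum(partial_counts)

docs = [
    "Alice in Wonderland",
    "The Queen of Hearts",
    "Alice and the Queen",
    "White Rabbit",
]
print(master_search(docs, "alice", n_workers=2))
```

Each worker only sees its own share and returns a single number, which is one way to keep communication overhead small relative to the work done per share.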

Functionality details

• Compression / Decompression

– Canonical Huffman encoding

• Indexing

– Inverted file index with canonical terms

• Retrieval

– Boolean (AND, OR, MINUS)

– Search modifiers (stemming, case folding, stop list, synonyms, semantic network)

– Proximity (AT, FAR, NEAR)

• Relevance ranking

– Score documents
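
As an illustration of the canonical Huffman encoding named above, the following sketch derives code lengths from symbol frequencies and then assigns canonical codes by sorting on (length, symbol) and counting upward. It is a textbook formulation, not TextHub's code, and the function names are hypothetical.

```python
import heapq
from collections import Counter

def code_lengths(text):
    """Return {symbol: code length} from a standard Huffman tree build."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie breaker, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freq, 0)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # every merge pushes these symbols one level deeper
        heapq.heappush(heap, (w1 + w2, counter, s1 + s2))
        counter += 1
    return lengths

def canonical_codes(lengths):
    """Assign canonical codes: shorter codes first, ties broken by symbol."""
    codes = {}
    code = 0
    prev_len = 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes

lengths = code_lengths("aaaabbcd")
print(canonical_codes(lengths))
```

Because the codes follow from the lengths alone, a decoder only needs the code length of each symbol rather than the full tree.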

Retrieval (Boolean)

Type    Query                    Document contains
Phrase  Alice in Wonderland      This phrase
AND     Alice AND Wonderland     Both terms
OR      Alice OR Wonderland      Either term or both
MINUS   Alice MINUS Wonderland   First term but not second
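
A minimal sketch of how the AND, OR and MINUS operators in the table map onto set operations over an inverted index. The sample documents and function names are assumptions for illustration; phrase queries are left out because they would also need word positions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query_and(index, a, b):
    """Documents containing both terms."""
    return index.get(a, set()) & index.get(b, set())

def query_or(index, a, b):
    """Documents containing either term or both."""
    return index.get(a, set()) | index.get(b, set())

def query_minus(index, a, b):
    """Documents containing the first term but not the second."""
    return index.get(a, set()) - index.get(b, set())

docs = {
    1: "Alice in Wonderland",
    2: "Alice meets the Queen",
    3: "Wonderland tales",
}
index = build_index(docs)
print(query_and(index, "alice", "wonderland"))
print(query_or(index, "alice", "wonderland"))
print(query_minus(index, "alice", "wonderland"))
```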

Retrieval (Search modifiers)

Modifier          Effect
Stemming          walked -> walk, queens -> queen
Case folding      Queen -> queen
Stoplist          The queen -> queen
Synonyms          monarch -> queen
Semantic network  queen -> woman
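
The modifiers in the table can be read as a chain of term rewrites. The sketch below is a toy version under stated assumptions: the stop list, synonym table and semantic network are hypothetical one-entry dictionaries, and the stemmer only strips "ed" and "s", which is far cruder than a real stemming algorithm.

```python
STOP_WORDS = {"the", "a", "an", "of"}   # hypothetical stop list
SYNONYMS = {"monarch": "queen"}         # hypothetical synonym table
BROADER = {"queen": "woman"}            # hypothetical semantic network link

def stem(word):
    """Toy stemmer: strip a trailing 'ed' or 's' from longer words."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(query):
    """Apply case folding, stop list, stemming and synonyms, in that order."""
    terms = []
    for word in query.lower().split():
        if word in STOP_WORDS:
            continue
        word = stem(word)
        terms.append(SYNONYMS.get(word, word))
    return terms

def broaden(terms):
    """Add the related term from the semantic network after each term."""
    result = []
    for term in terms:
        result.append(term)
        if term in BROADER:
            result.append(BROADER[term])
    return result

print(normalize("The Queens walked"))
print(broaden(normalize("Monarch")))
```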

Retrieval (Proximity)

Type  Query                     Document contains terms separated by
AT    Alice AT/1 Wonderland     Exactly 1 word
FAR   Alice FAR/10 Wonderland   At least 10 words
NEAR  Alice NEAR/4 Wonderland   At most 4 words
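
A sketch of the proximity operators, assuming that "separated by n words" counts the words lying between the two terms, that term order does not matter, and that one qualifying pair of occurrences is enough for a match. None of these details is stated in the slides, and the function names are hypothetical.

```python
def gaps(tokens, a, b):
    """Number of words between every occurrence of a and every occurrence of b."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return [abs(i - j) - 1 for i in pos_a for j in pos_b if i != j]

def proximity(text, a, op, n, b):
    """Evaluate 'a OP/n b' against a text, with OP one of AT, NEAR, FAR."""
    tokens = text.lower().split()
    found = gaps(tokens, a.lower(), b.lower())
    if op == "AT":      # exactly n words in between
        return any(g == n for g in found)
    if op == "NEAR":    # at most n words in between
        return any(g <= n for g in found)
    if op == "FAR":     # at least n words in between
        return any(g >= n for g in found)
    raise ValueError("unknown proximity operator: " + op)

text = "Alice in Wonderland"
print(proximity(text, "Alice", "AT", 1, "Wonderland"))
print(proximity(text, "Alice", "NEAR", 4, "Wonderland"))
print(proximity(text, "Alice", "FAR", 10, "Wonderland"))
```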

Relevance ranking

• Rate relevance of document

• Score based on number of occurrences

• Score compensated for large documents

• TextHub marks where document is relevant
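
One simple way to realise the scoring bullets above is to count term occurrences and divide by document length. The slides do not give TextHub's formula, so the normalisation below is an assumption, and `hit_positions` is a hypothetical helper showing how relevant spots could be marked.

```python
def score(text, terms):
    """Occurrences of the query terms divided by the document length in words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(tokens.count(t) for t in terms)
    return hits / len(tokens)

def hit_positions(text, terms):
    """Word offsets at which a query term occurs, usable for highlighting."""
    return [i for i, t in enumerate(text.lower().split()) if t in terms]

def rank(docs, terms):
    """Document ids ordered from highest to lowest score."""
    return sorted(docs, key=lambda d: score(docs[d], terms), reverse=True)

docs = {
    "short": "alice alice rabbit",
    "long": "alice " + "filler " * 20,
}
print(rank(docs, ["alice"]))
print(hit_positions(docs["short"], ["alice"]))
```

Dividing by length keeps a long document from outranking a short, focused one merely because it contains more words.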

Text Mine - Document clustering

• Improve relevance feedback

• Clustering of related documents

• Categorization of documents

• Minimum spanning tree algorithm

Using minimum spanning tree

• Combine different measures

• Ordinary query retrieves relevant nodes

• Nodes serve as entry-points

• No global minimum spanning tree

[Figure: graph with nodes labeled A, B, C, D, E, F, S, T, U, V and "?"]
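
The idea of growing a tree from an entry point instead of computing a global minimum spanning tree can be sketched with Prim's algorithm stopped early. This is an illustrative reading of the bullets, not Text Mine's actual procedure: the distance table, the `max_nodes` cut-off and the function name are assumptions.

```python
import heapq

def local_spanning_tree(dist, entry, max_nodes):
    """Grow a tree from `entry` with Prim's algorithm, stopping at max_nodes."""
    visited = {entry}
    edges = []
    heap = [(d, entry, v) for v, d in dist[entry].items()]
    heapq.heapify(heap)
    while heap and len(visited) < max_nodes:
        d, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # a cheaper edge already reached this node
        visited.add(v)
        edges.append((u, v, d))
        for w, dw in dist[v].items():
            if w not in visited:
                heapq.heappush(heap, (dw, v, w))
    return edges

# Hypothetical symmetric distances between four documents.
dist = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 3},
    "D": {"B": 5, "C": 3},
}
print(local_spanning_tree(dist, "A", 3))
```

Here a document returned by an ordinary query would play the role of `entry`, and the tree collects its nearest neighbours without touching the rest of the collection.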

Emile

• In cooperation with the University of Amsterdam

• Engine enabling

– Grammar induction

– Knowledge base construction

– Compound term separation

• Language independent

Grammar induction

• Fragment of Phaistos disk

1 41 40 7.

2 12 4 40 33.

2 12 6 18 *.

2 12 13 1.

2 12 13 1 18.

2 12 27 14 32 18 27.

2 12 27 35 37 21.

2 12 31 26.

2 12 32 23 38.

2 12 41 19 35.

2 27 25 10 23 18.

16 14 18.

16 23 18 43.

• Fragment of grammar

[0] --> [3] .

[3] --> [16] [47]

[14] --> 15 [40]

[14] --> 2 12

[16] --> 2 [57] 25 10 23

[16] --> [14] 13 1

[16] --> 16 14

[40] --> 7

[40] --> 29

[47] --> 18

[47] --> 24 40

[57] --> 27

[57] --> 29
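
To show how the grammar fragment relates to the disk fragment, the sketch below expands the rules listed above into every sentence they derive. It does not perform grammar induction; it only checks that some of the listed sign sequences can be regenerated from the rules. The dictionary encoding and the `expand` function are assumptions made for illustration, and the final period is printed as a separate token.

```python
from itertools import product

# Rules copied from the grammar fragment; bracketed names are nonterminals.
GRAMMAR = {
    "[0]": [["[3]", "."]],
    "[3]": [["[16]", "[47]"]],
    "[14]": [["15", "[40]"], ["2", "12"]],
    "[16]": [["2", "[57]", "25", "10", "23"], ["[14]", "13", "1"], ["16", "14"]],
    "[40]": [["7"], ["29"]],
    "[47]": [["18"], ["24", "40"]],
    "[57]": [["27"], ["29"]],
}

def expand(symbol):
    """Return every terminal sequence derivable from symbol (no recursion in this fragment)."""
    if symbol not in GRAMMAR:
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        for parts in product(*(expand(s) for s in rhs)):
            results.append([tok for part in parts for tok in part])
    return results

sentences = {" ".join(tokens) for tokens in expand("[0]")}
for line in ("2 12 13 1 18 .", "2 27 25 10 23 18 .", "16 14 18 ."):
    print(line, line in sentences)
```

The three sequences checked here correspond to the disk lines "2 12 13 1 18.", "2 27 25 10 23 18." and "16 14 18." in the fragment above.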

Knowledge base construction

Dictionary Type [35]

K033

k033

K105

k33

Dictionary Type [87]

Vrachtgeb

vrachtgeb

Vrachtgebouw

Vracht

Dictionary Type [89]

CGOADTP6

Printqueue

Dictionary Type [114]

is

Userid

Password

Dictionary Type [138]

status

Error

Dictionary Type [196]

scarlos

vrachtbrieven

Dictionary Type [215]

G239

g239

Dictionary Type [237]

enorm

ontzettend

super

Dictionary Type [290]

pingen

benaderen
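
Dictionary types like those listed above group words that behave alike. As a rough illustration only, and not Emile's actual algorithm, the sketch below groups words that occur in an identical sentence context, meaning the same words before and after. The sample sentences are invented and the function name is hypothetical.

```python
from collections import defaultdict

def dictionary_types(sentences):
    """Group words that share an identical (prefix, suffix) sentence context."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            context = (tuple(tokens[:i]), tuple(tokens[i + 1:]))
            contexts[context].add(word)
    # Keep only contexts that were filled by more than one word.
    return [sorted(words) for words in contexts.values() if len(words) > 1]

# Invented toy sentences, not taken from the original data set.
sentences = [
    "status is Error",
    "status is OK",
    "Userid is unknown",
    "Password is unknown",
]
print(dictionary_types(sentences))
```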

Emile on Biomed (1)

[Figure: chart with a numeric axis from 0 to 5000 and a percentage axis from 10% to 100%; labels: "Number of different sentences read" and "Number of different words"]

Emile on Biomed (2)

[Figure: chart with a numeric axis from 0 to 400000 and a percentage axis from 10% to 100%; labels: "Number of different contexts" and "Number of different expressions"]

Emile on Biomed (3)

[Figure: chart with a numeric axis from 0 to 180 and a percentage axis from 10% to 100%; labels: "Number of different grammatical types" and "Number of dictionary types"]

Emile outcome

[16] --> School of Medicine , University of Washington , Seattle 98195 , USA

[16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan

[16] --> Heinrich-Heine-University , Dusseldorf , Germany

[16] --> School of Medicine , Chiba University

[5] --> Department of Urology , [16]

[94] --> Chinese

[94] --> Japanese

[94] --> Polish

[101] --> 32 : Cancer Res 1996 Oct

[101] --> 35 : Genomics 1996 Aug

[101] --> 44 : Cancer Res 1995 Dec

[101] --> 50 : Cancer Res 1995 Feb

[101] --> 54 : Eur J Biochem 1994 Sep

[101] --> 58 : Cancer Res 1994 Mar

[105] --> identified in 13 cases ( 72

[105] --> detected in 9 of 87 informative cases ( 10

[105] --> observed in 5 ( 55

[11] --> LOH was [105] %