World class IT in a world-wide market. Text Mining Highlights Marten Trautwein Syllogic Research & Development

Roadmap

• TextHub – A parallel information retrieval tool
• Text Mine – A document clustering extension
• Emile – Grammar induction & clustering

What is TextHub?

• Intelligent parallel information retrieval tool
• Intuitive Web-based graphical user interface
• Compression / Decompression
• Indexing / Retrieval
• Document clustering & categorization

The star topology

• Master receives requests
• Master delegates tasks
• Slave performs tasks
• Master collects results
• Master returns answer
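The master/slave flow listed above can be sketched in a few lines. This is only an illustration under assumptions, not TextHub's implementation: the thread pool, the `count_hits` task and the sample shards are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_hits(shard, term):
    """Slave task (hypothetical): count the documents in one shard that contain the term."""
    return sum(1 for doc in shard if term in doc.lower().split())


def master(shards, term):
    """Master: delegate one task per shard, collect partial results, return the answer."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partial = list(pool.map(lambda shard: count_hits(shard, term), shards))
    return sum(partial)


# Made-up sample data: two shards of two documents each.
shards = [
    ["Alice in Wonderland", "The queen of hearts"],
    ["Alice and the queen", "Through the looking glass"],
]
total = master(shards, "alice")
```

Only the master talks to the outside world; the slaves never exchange data with each other, which is what makes the layout a star.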

Use of parallelism

• Documents outnumber processors
• Divide and conquer
• Distribute documents
• Minimum communication overhead
• Linear speed-up (1 GB per hour)

[Figure: chart with a time axis from 0:07 to 1:04, node counts of 4, 8 and 16, and collection sizes of 250 Mbyte, 500 Mbyte and 1 Gbyte]
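One way to distribute documents so that every node gets a similar amount of work is a greedy balance by document size. The slides do not say how TextHub assigns documents, so the strategy, the function names and the sizes below are assumptions made for illustration.

```python
import heapq


def distribute(doc_sizes, n_nodes):
    """Greedy balancing: hand each document (largest first) to the least-loaded node."""
    heap = [(0, node) for node in range(n_nodes)]  # (current load, node id)
    assignment = {node: [] for node in range(n_nodes)}
    for doc_id, size in sorted(doc_sizes.items(), key=lambda kv: -kv[1]):
        load, node = heapq.heappop(heap)
        assignment[node].append(doc_id)
        heapq.heappush(heap, (load + size, node))
    return assignment


def loads(assignment, doc_sizes):
    """Total size assigned to each node."""
    return {node: sum(doc_sizes[d] for d in docs) for node, docs in assignment.items()}


# Made-up document sizes in Mbyte.
doc_sizes = {"a": 40, "b": 30, "c": 20, "d": 10, "e": 10, "f": 10}
assignment = distribute(doc_sizes, 2)
node_loads = loads(assignment, doc_sizes)
```

With balanced loads and little communication, each node processes roughly total size divided by node count, which is the condition under which a near-linear speed-up can be expected.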

Functionality details

• Compression / Decompression
  – Canonical Huffman encoding
• Indexing
  – Inverted file index with canonical terms
• Retrieval
  – Boolean (AND, OR, MINUS)
  – Search modifiers (stemming, case folding, stop list, synonyms, semantic network)
  – Proximity (AT, FAR, NEAR)
• Relevance ranking
  – Score documents
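As a reminder of what makes a Huffman code canonical: once the code length of every symbol is known, the codes themselves follow from a fixed numbering rule, so only the lengths need to be stored. The sketch below assumes the lengths were already produced by an ordinary Huffman construction; the symbols and lengths are made up.

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes given each symbol's code length in bits."""
    code = 0
    prev_len = 0
    codes = {}
    # Symbols are ordered by (length, symbol) and receive consecutive code values;
    # whenever the length grows, the running value is shifted left to match.
    for symbol, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)
        codes[symbol] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


# Made-up code lengths for four symbols.
lengths = {"a": 1, "b": 2, "c": 3, "d": 3}
codes = canonical_codes(lengths)
```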

Retrieval (Boolean)

Type     Query                    Document contains
Phrase   Alice in Wonderland      This phrase
AND      Alice AND Wonderland     Both terms
OR       Alice OR Wonderland      Either term or both
MINUS    Alice MINUS Wonderland   First term but not second
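On an inverted file index, the three Boolean operators in the table map directly onto set operations over posting lists. A minimal sketch, assuming whitespace tokenization and a tiny made-up collection (phrase queries would additionally need word positions and are left out):

```python
from collections import defaultdict


def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Made-up sample documents.
docs = {
    1: "Alice in Wonderland",
    2: "Alice and the queen",
    3: "Wonderland revisited",
}
index = build_index(docs)

both = index["alice"] & index["wonderland"]        # Alice AND Wonderland
either = index["alice"] | index["wonderland"]      # Alice OR Wonderland
only_first = index["alice"] - index["wonderland"]  # Alice MINUS Wonderland
```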

Retrieval (Search modifiers)

Modifier           Effect
Stemming           walked → walk, queens → queen
Case folding       Queen → queen
Stop list          The queen → queen
Synonyms           monarch → queen
Semantic network   queen → woman
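The modifiers in the table can be read as a chain of term normalizations applied before the index is consulted. The sketch below is deliberately naive and entirely hypothetical: the suffix list, stop list, synonym table and the reading of the semantic network as a "broader concept" lookup are assumptions, not TextHub's actual resources.

```python
STOP_WORDS = {"the", "a", "an", "of"}   # hypothetical stop list
SYNONYMS = {"monarch": "queen"}         # hypothetical synonym table
BROADER = {"queen": "woman"}            # hypothetical semantic network
SUFFIXES = ("ed", "s")                  # toy stemmer: strip these endings


def stem(word):
    """Toy stemming: strip a known suffix if a reasonable stem remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Apply case folding, stop list, stemming and synonyms, in that order."""
    terms = []
    for word in text.split():
        word = word.lower()
        if word in STOP_WORDS:
            continue
        word = stem(word)
        terms.append(SYNONYMS.get(word, word))
    return terms


def expand(term):
    """Semantic network: also search for the broader concept, if one is known."""
    return {term, BROADER[term]} if term in BROADER else {term}
```

For example, `normalize("The Queen walked")` gives `["queen", "walk"]`, and `expand("queen")` adds `"woman"` to the search terms.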

Retrieval (Proximity)

Type   Query                     Document contains terms separated by
AT     Alice AT/1 Wonderland     Exactly 1 word
FAR    Alice FAR/10 Wonderland   At least 10 words
NEAR   Alice NEAR/4 Wonderland   At most 4 words
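The proximity operators can be illustrated with word positions. The sketch assumes that "separated by n words" means n words standing between the two terms, in either order; the slides do not spell this out, and the helper names are hypothetical.

```python
def positions(words, term):
    """Indices at which the term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]


def gaps(text, first, second):
    """Number of words between every pair of occurrences of the two terms."""
    words = text.lower().split()
    return [
        abs(i - j) - 1
        for i in positions(words, first)
        for j in positions(words, second)
    ]


def at(text, first, second, n):
    """AT/n: some pair is separated by exactly n words."""
    return any(g == n for g in gaps(text, first, second))


def far(text, first, second, n):
    """FAR/n: some pair is separated by at least n words."""
    return any(g >= n for g in gaps(text, first, second))


def near(text, first, second, n):
    """NEAR/n: some pair is separated by at most n words."""
    return any(g <= n for g in gaps(text, first, second))
```

Under this reading, "Alice in Wonderland" satisfies `Alice AT/1 Wonderland` and `Alice NEAR/4 Wonderland`, but not `Alice FAR/10 Wonderland`.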

Relevance ranking

• Rate relevance of document
• Score based on number of occurrences
• Score compensated for large documents
• TextHub marks where document is relevant
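A simple scoring rule consistent with the bullets above is the number of query-term occurrences divided by document length. TextHub's actual formula is not given in the slides; the division by length is just one assumed way to compensate for large documents.

```python
def score(text, query_terms):
    """Occurrences of the query terms, normalized by document length (assumed rule)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(words.count(term) for term in query_terms)
    return hits / len(words)


def rank(docs, query_terms):
    """Ids of matching documents, best score first."""
    scored = [(score(text, query_terms), doc_id) for doc_id, text in docs.items()]
    return [doc_id for s, doc_id in sorted(scored, reverse=True) if s > 0]


# Made-up documents: the long one mentions the term twice but is mostly filler.
docs = {
    "short": "alice meets the queen",
    "long": "alice " + "and more words " * 10 + "alice",
    "none": "nothing here",
}
ranking = rank(docs, ["alice"])
```

The short document wins even though the long one contains the term more often, which is the effect the length compensation is meant to have.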

Text Mine - Document clustering

• Improve relevance feedback
• Clustering of related documents
• Categorization of documents
• Minimum spanning tree algorithm

Using minimum spanning tree

• Combine different measures
• Ordinary query retrieves relevant nodes
• Nodes serve as entry-points
• No global minimum spanning tree

[Figure: graph of document nodes labelled A, B, C, D, E, F, S, T, U, V and ?]
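One possible reading of "entry-points" without a global tree is to grow a spanning tree locally, starting from a retrieved document and stopping after a few neighbours. The sketch uses Prim's algorithm with a Jaccard word-overlap distance; both choices, and the sample documents, are assumptions and not taken from Text Mine.

```python
import heapq


def jaccard_distance(a, b):
    """1 minus the word-set overlap of two texts (assumed similarity measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(a & b) / len(a | b)


def local_tree(docs, entry, max_nodes):
    """Prim's algorithm started at `entry`, stopped once `max_nodes` documents are linked."""
    in_tree = {entry}
    edges = []  # (parent, child, distance) in the order they were added
    frontier = [
        (jaccard_distance(docs[entry], docs[d]), entry, d) for d in docs if d != entry
    ]
    heapq.heapify(frontier)
    while frontier and len(in_tree) < max_nodes:
        dist, parent, child = heapq.heappop(frontier)
        if child in in_tree:
            continue
        in_tree.add(child)
        edges.append((parent, child, dist))
        for d in docs:
            if d not in in_tree:
                heapq.heappush(frontier, (jaccard_distance(docs[child], docs[d]), child, d))
    return edges


# Made-up documents; "A" plays the role of a node retrieved by an ordinary query.
docs = {
    "A": "parallel text retrieval",
    "B": "parallel text indexing",
    "C": "text indexing compression",
    "D": "grammar induction",
}
edges = local_tree(docs, "A", 3)
```

Starting from A, the tree links A to B and B to C and never reaches the unrelated document D, so only the neighbourhood of the entry point is explored.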

Emile

• In cooperation with University of Amsterdam
• Engine enabling
  – Grammar induction
  – Knowledge base construction
  – Compound term separation
• Language independent

Grammar induction

• Fragment of Phaistos disk

1 41 40 7.

2 12 4 40 33.

2 12 6 18 *.

2 12 13 1.

2 12 13 1 18.

2 12 27 14 32 18 27.

2 12 27 35 37 21.

2 12 31 26.

2 12 32 23 38.

2 12 41 19 35.

2 27 25 10 23 18.

16 14 18.

16 23 18 43.

• Fragment of grammar

[0] --> [3] .

[3] --> [16] [47]

[14] --> 15 [40]

[14] --> 2 12

[16] --> 2 [57] 25 10 23

[16] --> [14] 13 1

[16] --> 16 14

[40] --> 7

[40] --> 29

[47] --> 18

[47] --> 24 40

[57] --> 27

[57] --> 29
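To see how the grammar fragment relates to the disk fragment, the rules can be expanded mechanically. The sketch below transcribes the rules shown above (bracketed names are grammatical types, bare numbers are signs) and enumerates every sentence they derive; the dictionary layout and function name are illustrative choices.

```python
from itertools import product

# Rules transcribed from the grammar fragment; "[n]" is a type, "n" is a sign.
GRAMMAR = {
    "[0]": [["[3]", "."]],
    "[3]": [["[16]", "[47]"]],
    "[14]": [["15", "[40]"], ["2", "12"]],
    "[16]": [["2", "[57]", "25", "10", "23"], ["[14]", "13", "1"], ["16", "14"]],
    "[40]": [["7"], ["29"]],
    "[47]": [["18"], ["24", "40"]],
    "[57]": [["27"], ["29"]],
}


def expand(symbol):
    """Return every token sequence derivable from the symbol."""
    if symbol not in GRAMMAR:  # a sign, not a type: nothing to rewrite
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([tok for part in parts for tok in part])
    return results


sentences = {" ".join(tokens) for tokens in expand("[0]")}
```

Among the derived sentences are `2 12 13 1 18 .`, `16 14 18 .` and `2 27 25 10 23 18 .`, which correspond to lines of the disk fragment (there the full stop is written directly after the last sign).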

Knowledge base construction

Dictionary Type [35]: K033, k033, K105, k33
Dictionary Type [87]: Vrachtgeb, vrachtgeb, Vrachtgebouw, Vracht
Dictionary Type [89]: CGOADTP6, Printqueue
Dictionary Type [114]: is, Userid, Password
Dictionary Type [138]: status, Error
Dictionary Type [196]: scarlos, vrachtbrieven
Dictionary Type [215]: G239, g239
Dictionary Type [237]: enorm, ontzettend, super
Dictionary Type [290]: pingen, benaderen

Emile on Biomed (1)

[Figure: number of different sentences read and number of different words (axis 0 to 5000), plotted over 10% to 100%]

Emile on Biomed (2)

[Figure: number of different contexts and number of different expressions (axis 0 to 400000), plotted over 10% to 100%]

Emile on Biomed (3)

[Figure: number of different grammatical types and number of dictionary types (axis 0 to 180), plotted over 10% to 100%]

Emile outcome

[16] --> School of Medicine , University of Washington , Seattle 98195 , USA

[16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan

[16] --> Heinrich-Heine-University , Dusseldorf , Germany

[16] --> School of Medicine , Chiba University

[5] --> Department of Urology , [16]

[94] --> Chinese

[94] --> Japanese

[94] --> Polish

[101] --> 32 : Cancer Res 1996 Oct

[101] --> 35 : Genomics 1996 Aug

[101] --> 44 : Cancer Res 1995 Dec

[101] --> 50 : Cancer Res 1995 Feb

[101] --> 54 : Eur J Biochem 1994 Sep

[101] --> 58 : Cancer Res 1994 Mar

[105] --> identified in 13 cases ( 72

[105] --> detected in 9 of 87 informative cases ( 10

[105] --> observed in 5 ( 55

[11] --> LOH was [105] %