World class IT in a world-wide market
Text Mining Highlights
Marten Trautwein
Syllogic
Research & Development
RoadMap
• TextHub – A parallel information retrieval tool
• Text Mine – A document clustering extension
• Emile – Grammar induction & clustering
What is TextHub?
• Intelligent parallel information retrieval tool
• Intuitive Web-based graphical user interface
• Compression / Decompression
• Indexing / Retrieval
• Document clustering & categorization
The star topology
• Master receives requests
• Master delegates tasks
• Slave performs tasks
• Master collects results
• Master returns answer
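The five steps above can be sketched in a few lines. This is only an illustration under stated assumptions: threads stand in for slave nodes, a plain term lookup stands in for the real task, and the names `slave_search` and `master_search` are hypothetical, not part of TextHub.

```python
from concurrent.futures import ThreadPoolExecutor


def slave_search(partition, term):
    """Slave: scan only its own partition and return ids of matching documents."""
    return [doc_id for doc_id, text in partition if term in text.lower().split()]


def master_search(documents, term, n_slaves=4):
    """Master: receive a request, delegate it, collect and return the answer."""
    items = sorted(documents.items())
    # Distribute documents round-robin so every slave gets a similar share.
    partitions = [items[i::n_slaves] for i in range(n_slaves)]
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        partial = pool.map(slave_search, partitions, [term] * n_slaves)
    # Collect the partial results and merge them into one answer.
    return sorted(doc_id for hits in partial for doc_id in hits)


documents = {1: "Alice in Wonderland", 2: "The queen", 3: "Alice and the queen"}
result = master_search(documents, "alice", n_slaves=2)
```

Slaves never talk to each other in this sketch; all traffic passes through the master, which is what makes the topology a star.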
[Chart: time (axis ticks 0:07 to 1:04) by number of nodes (4, 8, 16) for 250 Mbyte, 500 Mbyte and 1 Gbyte of data]
Use of parallelism
• Documents outnumber processors
• Divide and conquer
• Distribute documents
• Communication overhead minimum
• Linear speed-up (1GB per hour)
Functionality details
• Compression / Decompression
– Canonical Huffman encoding
• Indexing
– Inverted file index with canonical terms
• Retrieval
– Boolean (AND, OR, MINUS)
– Search modifiers (stemming, case folding, stop list, synonyms, semantic network)
– Proximity (AT, FAR, NEAR)
• Relevance ranking
– Score documents
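As an illustration of the compression bullet, the sketch below derives Huffman code lengths for the characters of a string and then assigns canonical codes from those lengths. It shows the general textbook technique, not TextHub's actual coder; the function names and the choice of characters as symbols are assumptions.

```python
import heapq
from collections import Counter


def code_lengths(text):
    """Huffman code length per symbol, by repeatedly merging the two rarest subtrees."""
    freq = Counter(text)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth in subtree}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def canonical_codes(lengths):
    """Assign codes in (length, symbol) order: count upward, shift left when length grows."""
    codes, code, prev_len = {}, 0, 0
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= length - prev_len
        codes[sym] = format(code, "0{}b".format(length))
        code += 1
        prev_len = length
    return codes


codes = canonical_codes(code_lengths("aaaabbc"))
```

Because canonical codes are fully determined by the code lengths, a decoder only needs the length of each symbol's code rather than the whole tree.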
Retrieval (Boolean)
Type     Query                    Document contains
Phrase   Alice in Wonderland      This phrase
AND      Alice AND Wonderland     Both terms
OR       Alice OR Wonderland      Either term or both
MINUS    Alice MINUS Wonderland   First term but not second
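The three Boolean operators in the table map directly onto set operations over an inverted index. The sketch below assumes a toy index built by splitting on whitespace and lower-casing; `build_index` and `boolean_query` are hypothetical names, and phrase queries are left out.

```python
from collections import defaultdict


def build_index(documents):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_query(index, left, op, right):
    """Evaluate 'left OP right' where OP is AND, OR or MINUS."""
    a = index.get(left.lower(), set())
    b = index.get(right.lower(), set())
    if op == "AND":
        return a & b      # both terms
    if op == "OR":
        return a | b      # either term or both
    if op == "MINUS":
        return a - b      # first term but not second
    raise ValueError("unknown operator: " + op)


documents = {
    1: "Alice in Wonderland",
    2: "Alice and the queen",
    3: "Wonderland tales",
}
index = build_index(documents)
```

With these three documents, `boolean_query(index, "Alice", "MINUS", "Wonderland")` keeps only document 2.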
Retrieval (Search modifiers)
Modifier           Effect
Stemming           walked → walk, queens → queen
Case folding       Queen → queen
Stoplist           The queen → queen
Synonyms           monarch → queen
Semantic network   queen → woman
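A rough sketch of how such modifiers could be chained to turn raw text into canonical terms. The stop list, the synonym table and the two-suffix stemmer are deliberately tiny, made-up stand-ins (a real system would use something like Porter's stemmer), and the semantic network is not modelled.

```python
STOP_WORDS = {"the", "a", "an", "in", "of"}   # illustrative stop list
SYNONYMS = {"monarch": "queen"}               # illustrative synonym table


def stem(word):
    """Toy suffix stripper covering only the examples in the table."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def canonical_terms(text):
    """Apply case folding, stop list, stemming and synonyms, in that order."""
    terms = []
    for word in text.split():
        word = word.lower()                 # case folding
        if word in STOP_WORDS:              # stop list
            continue
        word = stem(word)                   # stemming
        word = SYNONYMS.get(word, word)     # synonyms
        terms.append(word)
    return terms


terms = canonical_terms("The Queen walked")
```

Applying the same function to both documents and queries keeps the index and the search terms in one canonical form.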
Retrieval (Proximity)
Type   Query                     Document contains terms separated by
AT     Alice AT/1 Wonderland     Exactly 1 word
FAR    Alice FAR/10 Wonderland   At least 10 words
NEAR   Alice NEAR/4 Wonderland   At most 4 words
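The proximity operators can be checked against word positions. The sketch below rests on two assumptions the slide does not settle: "separated by n words" is read as n words lying between the two terms, in either order, and a document matches if any pair of occurrences satisfies the condition. `proximity_match` is a hypothetical helper.

```python
def positions(words, term):
    """All positions at which term occurs in the word list."""
    return [i for i, w in enumerate(words) if w == term]


def proximity_match(text, left, op, n, right):
    """True if some occurrence pair of left/right satisfies AT/n, NEAR/n or FAR/n."""
    words = text.lower().split()
    # Gap = number of words strictly between the two occurrences.
    gaps = [
        abs(i - j) - 1
        for i in positions(words, left.lower())
        for j in positions(words, right.lower())
        if i != j
    ]
    if op == "AT":
        return any(g == n for g in gaps)    # exactly n words apart
    if op == "NEAR":
        return any(g <= n for g in gaps)    # at most n words apart
    if op == "FAR":
        return any(g >= n for g in gaps)    # at least n words apart
    raise ValueError("unknown operator: " + op)


hit = proximity_match("Alice in Wonderland", "Alice", "AT", 1, "Wonderland")
```

Under this reading the phrase "Alice in Wonderland" satisfies `Alice AT/1 Wonderland`, because exactly one word separates the two terms.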
Relevance ranking
• Rate relevance of document
• Score based on number of occurrences
• Score compensated for large documents
• TextHub marks where document is relevant
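One simple way to realise these bullets is to count term occurrences and divide by document length. The slide does not give TextHub's actual formula, so the normalisation below is an assumption chosen for illustration, as are the names `score` and `rank`. The positions of the hits are returned as well, so that the relevant places in a document can be marked.

```python
def score(text, terms):
    """Return (occurrences / document length, positions of the matching words)."""
    words = text.lower().split()
    wanted = {t.lower() for t in terms}
    hits = [i for i, w in enumerate(words) if w in wanted]
    value = len(hits) / len(words) if words else 0.0
    return value, hits


def rank(documents, terms):
    """Ids of documents with at least one hit, best score first."""
    scored = [(score(text, terms)[0], doc_id) for doc_id, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for value, doc_id in ordered if value > 0]


documents = {
    1: "the queen met alice",
    2: "queen queen and eight more words follow here ok done",
    3: "no match here",
}
ranking = rank(documents, ["queen"])
```

Document 2 mentions the term twice but is ten words long, so after length compensation it ranks below the four-word document 1.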
Text Mine - Document clustering
• Improve relevance feedback
• Clustering of related documents
• Categorization of documents
• Minimum spanning tree algorithm
Using minimum spanning tree
• Combine different measures
• Ordinary query retrieves relevant nodes
• Nodes serve as entry-points
• No global minimum spanning tree
[Figure: graph with nodes labelled A, B, C, D, E, F, S, T, U, V and ?]
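The idea of growing a tree from an entry point, rather than computing a global minimum spanning tree, can be sketched with Prim's algorithm stopped after a fixed number of nodes. The stopping rule, the distance table and the name `local_mst` are assumptions made for illustration; the slide does not say which spanning-tree algorithm or distance measure Text Mine uses.

```python
import heapq


def local_mst(distance, entry, max_nodes):
    """Grow a minimum spanning tree from entry (Prim), stopping at max_nodes nodes."""
    in_tree = {entry}
    edges = []
    frontier = [(d, entry, other) for other, d in distance[entry].items()]
    heapq.heapify(frontier)
    while frontier and len(in_tree) < max_nodes:
        d, src, dst = heapq.heappop(frontier)
        if dst in in_tree:
            continue                      # edge would close a cycle
        in_tree.add(dst)
        edges.append((src, dst, d))
        for other, d2 in distance[dst].items():
            if other not in in_tree:
                heapq.heappush(frontier, (d2, dst, other))
    return edges


# Symmetric document distances (made-up values); smaller means more similar.
distance = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2, "D": 5},
    "C": {"A": 4, "B": 2, "D": 1},
    "D": {"B": 5, "C": 1},
}
cluster = local_mst(distance, "A", 3)
```

Here a query hit on document A serves as the entry point, and the three-node tree around it forms a small cluster of related documents.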
Emile
• In cooperation with University of Amsterdam
• Engine enabling
– Grammar induction
– Knowledge base construction
– Compound term separation
• Language independent
Grammar induction
• Fragment of Phaistos disk
1 41 40 7.
2 12 4 40 33.
2 12 6 18 *.
2 12 13 1.
2 12 13 1 18.
2 12 27 14 32 18 27.
2 12 27 35 37 21.
2 12 31 26.
2 12 32 23 38.
2 12 41 19 35.
2 27 25 10 23 18.
…
16 14 18.
16 23 18 43.
• Fragment of grammar
[0] --> [3] .
[3] --> [16] [47]
[14] --> 15 [40]
[14] --> 2 12
[16] --> 2 [57] 25 10 23
[16] --> [14] 13 1
[16] --> 16 14
[40] --> 7
[40] --> 29
[47] --> 18
[47] --> 24 40
[57] --> 27
[57] --> 29
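To see what the grammar fragment generates, the rules can be typed in as data and expanded exhaustively. The sketch below copies the rules shown above verbatim; bracketed symbols are treated as non-terminals and bare numbers as terminals, and `expand` and `sentences` are hypothetical helper names. Since only a fragment of the grammar is shown, only part of the corpus can be derived from it.

```python
GRAMMAR = {
    "[0]": [["[3]", "."]],
    "[3]": [["[16]", "[47]"]],
    "[14]": [["15", "[40]"], ["2", "12"]],
    "[16]": [["2", "[57]", "25", "10", "23"], ["[14]", "13", "1"], ["16", "14"]],
    "[40]": [["7"], ["29"]],
    "[47]": [["18"], ["24", "40"]],
    "[57]": [["27"], ["29"]],
}


def expand(symbols):
    """Yield every terminal sequence derivable from a list of symbols."""
    if not symbols:
        yield []
        return
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        heads = [seq for rhs in GRAMMAR[head] for seq in expand(rhs)]
    else:
        heads = [[head]]          # terminal: stands for itself
    for h in heads:
        for tail in expand(rest):
            yield h + tail


def sentences(start="[0]"):
    """All sentences of the fragment, as space-separated strings."""
    return {" ".join(seq) for seq in expand([start])}


generated = sentences()
```

Three lines of the disk fragment, `2 12 13 1 18.`, `2 27 25 10 23 18.` and `16 14 18.`, are among the generated sentences (with the final period as a separate token).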
Knowledge base construction
Dictionary Type [35]
K033
k033
K105
k33
Dictionary Type [87]
Vrachtgeb
vrachtgeb
Vrachtgebouw
Vracht
Dictionary Type [89]
CGOADTP6
Printqueue
Dictionary Type [114]
is
Userid
Password
Dictionary Type [138]
status
Error
Dictionary Type [196]
scarlos
vrachtbrieven
Dictionary Type [215]
G239
g239
Dictionary Type [237]
enorm
ontzettend
super
Dictionary Type [290]
pingen
benaderen
Emile on Biomed (1)
[Chart: "Number of different sentences read" and "Number of different words"; axis ticks 0 to 5000 and 10% to 100%]
Emile on Biomed (2)
[Chart: "Number of different contexts" and "Number of different expressions"; axis ticks 0 to 400000 and 10% to 100%]
Emile on Biomed (3)
[Chart: "Number of different grammatical types" and "Number of dictionary types"; axis ticks 0 to 180 and 10% to 100%]
Emile outcome
[16] --> School of Medicine , University of Washington , Seattle 98195 , USA
[16] --> University of Kitasato Hospital , Sagamihara , Kanagawa , Japan
[16] --> Heinrich-Heine-University , Dusseldorf , Germany
[16] --> School of Medicine , Chiba University
[5] --> Department of Urology , [16]
[94] --> Chinese
[94] --> Japanese
[94] --> Polish
[101] --> 32 : Cancer Res 1996 Oct
[101] --> 35 : Genomics 1996 Aug
[101] --> 44 : Cancer Res 1995 Dec
[101] --> 50 : Cancer Res 1995 Feb
[101] --> 54 : Eur J Biochem 1994 Sep
[101] --> 58 : Cancer Res 1994 Mar
[105] --> identified in 13 cases ( 72
[105] --> detected in 9 of 87 informative cases ( 10
[105] --> observed in 5 ( 55
[11] --> LOH was [105] %