Lars Juhl Jensen
Text-mining practical
the task
named entity recognition
human proteins
link proteins to diseases
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
62,755 abstracts
65,588 abstracts
one directory with each set
one file with each abstract
dictionary
tab-delimited file
human proteins
22,523 entities
synonyms
from many databases
orthographic variation
prefixes and postfixes
automatically generated
2,726,495 names
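One way the automatic name expansion could work is sketched below in Python. The slides do not say which prefixes and postfixes were used, so the affix lists, the function names, and the entity identifier are all assumptions made for illustration.

```python
from itertools import product

# Hypothetical affixes; the slides do not list the ones actually used.
PREFIXES = ["", "h"]
POSTFIXES = ["", " protein", " gene"]


def expand_name(name):
    """Return every variant of a name built from one prefix and one postfix."""
    return {prefix + name + postfix
            for prefix, postfix in product(PREFIXES, POSTFIXES)}


def expand_dictionary(entries):
    """Map each generated variant to the entities it may refer to.

    entries: iterable of (entity_id, name) pairs, e.g. the rows of a
    tab-delimited dictionary file (assumed layout).
    """
    variants = {}
    for entity_id, name in entries:
        for variant in expand_name(name):
            variants.setdefault(variant, set()).add(entity_id)
    return variants


if __name__ == "__main__":
    # "ENTITY1" is a made-up identifier.
    rows = [("ENTITY1", "FOLH1"), ("ENTITY1", "Glutamate carboxypeptidase II")]
    for variant, entities in sorted(expand_dictionary(rows).items()):
        print(variant, sorted(entities), sep="\t")
```

Even two affix lists this short turn each synonym into six names, which suggests how 22,523 entities can grow into millions of names.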
tagdir program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
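The idea of flexible matching can be illustrated with a small Python sketch. This is not the tagdir implementation: the normalization rule, the function names, and the three-column output layout are assumptions chosen to show how ignoring case, spaces, and hyphens lets one dictionary name match several spellings.

```python
import re


def normalize(text):
    """Lower-case the text and drop spaces and hyphens."""
    return re.sub(r"[\s-]+", "", text.lower())


def tag_text(doc_id, text, dictionary, max_words=4):
    """Return tab-delimited lines: document, entity, matched text.

    dictionary maps a normalized name to an entity identifier.
    Candidate mentions are runs of up to max_words consecutive words.
    """
    words = re.findall(r"[A-Za-z0-9-]+", text)
    lines = []
    for start in range(len(words)):
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            candidate = " ".join(words[start:end])
            entity = dictionary.get(normalize(candidate))
            if entity is not None:
                lines.append(f"{doc_id}\t{entity}\t{candidate}")
    return lines


if __name__ == "__main__":
    names = ["FOLH1", "Glutamate carboxypeptidase II"]
    dictionary = {normalize(name): "FOLH1" for name in names}
    text = "Expression of glutamate-carboxypeptidase II and folh-1 was measured."
    print("\n".join(tag_text("doc1", text, dictionary)))
```

Normalizing both the dictionary and the text keeps the lookup a plain dictionary access, at the cost of also matching spellings nobody intended. That cost is where the "black list" comes in.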
what you will do
named entity recognition
find unfortunate names
create “black list”
information extraction
co-mentioning
within documents
link proteins to diseases
link between the diseases
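Co-mentioning within documents can be sketched as simple counting. The Python below assumes each abstract has already been reduced to the set of proteins tagged in it; the data structures, function names, and the threshold parameter are illustrative assumptions, not part of the practical's material.

```python
from collections import Counter


def protein_document_counts(doc_proteins):
    """Count, per protein, the documents of one disease set that mention it.

    doc_proteins maps a document identifier to the proteins tagged in it.
    """
    counts = Counter()
    for proteins in doc_proteins.values():
        counts.update(set(proteins))  # count each protein once per document
    return counts


def shared_proteins(counts_a, counts_b, min_docs=1):
    """Return proteins mentioned in at least min_docs documents of both sets."""
    return sorted(
        protein for protein in counts_a
        if counts_a[protein] >= min_docs and counts_b.get(protein, 0) >= min_docs
    )


if __name__ == "__main__":
    # Toy data; document identifiers and PROT_* names are made up.
    prostate = {"d1": {"FOLH1", "PROT_A"}, "d2": {"FOLH1"}}
    schizophrenia = {"d3": {"FOLH1"}, "d4": {"PROT_B"}}
    a = protein_document_counts(prostate)
    b = protein_document_counts(schizophrenia)
    print(shared_proteins(a, b))
```

Proteins that are frequent in one document set link to that disease; proteins that appear in both sets are candidate links between the two diseases.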
a helping hand
“black list”
100+ matches
10+ matches
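One possible reading of these thresholds is that names with many matches are the first ones worth inspecting by hand. The Python sketch below counts matches per name and filters tagger output against a black list. It assumes the matched text is the last tab-delimited column, which is a guess about the layout rather than the documented tagdir format, and the function names are hypothetical.

```python
from collections import Counter


def matched_name(line):
    """Extract the matched text (assumed to be the last column), lower-cased."""
    return line.rstrip("\n").split("\t")[-1].lower()


def frequent_names(tagger_lines, min_matches=100):
    """Return (name, count) pairs for names matched at least min_matches times."""
    counts = Counter(matched_name(line) for line in tagger_lines)
    return [(name, n) for name, n in counts.most_common() if n >= min_matches]


def apply_black_list(tagger_lines, black_list):
    """Drop every match whose name is on the black list."""
    blocked = {name.lower() for name in black_list}
    return [line for line in tagger_lines if matched_name(line) not in blocked]


if __name__ == "__main__":
    # Toy output; identifiers and the name "ABC" are made up.
    lines = ["d1\tE1\tABC"] * 3 + ["d2\tE2\tFOLH1"]
    print(frequent_names(lines, min_matches=3))
    print(apply_black_list(lines, ["abc"]))
```

Whether a frequent name is unfortunate still has to be judged by reading some of its matches; the counts only say where to look first.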
wrap up
prostate cancer
FOLH1
schizophrenia
Glutamate carboxypeptidase II
same protein
synonyms matter
“black list” is crucial
text mining is useful
not black magic