Lars Juhl Jensen
Text mining exercise
~5 m
the task
named entity recognition
human proteins
link proteins to diseases
what I have done
information retrieval
two diseases
prostate cancer
schizophrenia
two sets of documents
62,755 abstracts
65,588 abstracts
one directory with each set
one file with each abstract
dictionary
tab-delimited file
22,523 entities
synonyms
from many databases
orthographic variation
prefixes and postfixes
automatically generated
2,726,495 names
tagdir program
flexible matching
upper- and lower-case
spaces and hyphens
tab-delimited output
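A minimal sketch of what flexible matching could look like, assuming it means ignoring upper- and lower-case and treating spaces and hyphens as interchangeable. This is not the tagdir implementation; the function names and the two-column dictionary rows are hypothetical.

```python
import re


def normalize(name: str) -> str:
    """Lower-case a name and drop spaces and hyphens (assumed matching rule)."""
    return re.sub(r"[\s-]+", "", name.lower())


def build_index(rows):
    """Map each normalized synonym to the set of entity identifiers that use it."""
    index = {}
    for entity_id, synonym in rows:
        index.setdefault(normalize(synonym), set()).add(entity_id)
    return index


# Hypothetical dictionary rows: (entity identifier, synonym)
rows = [
    ("FOLH1", "FOLH1"),
    ("FOLH1", "Glutamate carboxypeptidase II"),
]
index = build_index(rows)

# Differently spelled variants resolve to the same entity
print(index.get(normalize("glutamate-carboxypeptidase II")))
```

Normalizing both the dictionary and the text in the same way keeps the lookup a plain dictionary access, at the cost of making short names match more often than intended.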
what you will do
find unfortunate names
create “black list”
information extraction
co-mentioning
within documents
link between the diseases
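A rough sketch of co-mentioning within documents, assuming the tagger output has been read into (document, protein, matched name) tuples. The column layout, the identifiers and the toy rows below are assumptions for illustration, not the real tagdir output or real results. Because each document set belongs to one disease, a protein tagged in a document counts as co-mentioned with that disease.

```python
from collections import defaultdict


def documents_per_protein(matches, blacklist=frozenset()):
    """Collect the documents mentioning each protein, skipping black-listed names."""
    docs = defaultdict(set)
    for doc_id, protein, name in matches:
        if name.lower() not in blacklist:
            docs[protein].add(doc_id)
    return docs


def shared_proteins(docs_a, docs_b):
    """Proteins found in both sets, ranked by the smaller of the two document counts."""
    shared = set(docs_a) & set(docs_b)
    return sorted(shared, key=lambda p: (-min(len(docs_a[p]), len(docs_b[p])), p))


# Invented toy input, one list per disease directory
prostate = [
    ("pc1", "FOLH1", "FOLH1"),
    ("pc2", "FOLH1", "FOLH1"),
    ("pc2", "P2", "name two"),
]
schizophrenia = [
    ("sz1", "FOLH1", "Glutamate carboxypeptidase II"),
    ("sz2", "P3", "name three"),
]

links = shared_proteins(
    documents_per_protein(prostate),
    documents_per_protein(schizophrenia),
)
print(links)
```

Counting documents rather than individual mentions keeps one abstract that repeats a name many times from dominating the ranking.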
a helping hand
“black list”
100+ matches
10+ matches
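One way to produce such a helping hand, assuming the thresholds refer to how many times a name was matched: count the matches per name and list the frequent ones for manual inspection. The function name and the placeholder names are hypothetical.

```python
from collections import Counter


def frequent_names(matched_names, min_matches=100):
    """Return (name, count) pairs with at least min_matches, most frequent first."""
    counts = Counter(name.lower() for name in matched_names)
    return [(name, n) for name, n in counts.most_common() if n >= min_matches]


# Placeholder input: one entry per match in the tagger output
names = ["NameA"] * 120 + ["NameB"] * 12 + ["NameC"] * 3

candidates_100 = frequent_names(names, min_matches=100)
candidates_10 = frequent_names(names, min_matches=10)
print(candidates_100)
print(candidates_10)
```

The lists only suggest candidates; deciding which frequent names are unfortunate and belong on the black list still takes a human reading them.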
wrap up
FOLH1
Glutamate carboxypeptidase II
same protein
synonyms matter
“black list” is crucial
text mining is quite simple
diseases.jensenlab.org