7/24/2019 1 Text Mining Review Slides.pptx
TEXT MINING: TECHNIQUES, TOOLS, ONTOLOGIES AND SHARED TASKS
14 Spring
Introduction
Text mining, also referred to as text data mining, is the process of deriving high-quality information from text.
Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics.
Text mining techniques have been applied in a large number of areas, such as business intelligence, national security, scientific discovery (especially life science) and social media monitoring.
Introduction
In this set of slides, we are going to cover:
- The most commonly used text mining techniques
- Ontologies that are often used in text mining
- Open source text mining tools
- Shared tasks in text mining, which reflect the hot topics in this area
- A research case which applies text mining techniques to solve a healthcare-related problem with social media data
TEXT MINING TECHNIQUES
- Text Classification
- Sentiment Analysis
- Topic Modeling
- Named Entity Recognition
- Entity Relation Extraction
Text Classification
Text classification, or text categorization, is a problem in library science, information science, and computer science. Text classification is the task of choosing the correct class label for a given input.
Some examples of text classification tasks are:
- Deciding whether an email is spam or not (spam detection).
- Deciding whether the topic of a news article is from a fixed list of topic areas such as "sports", "technology", and "politics" (document classification).
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution (word sense disambiguation).
Text Classification
Text classification is a supervised machine learning task, as it is built on training corpora containing the correct label for each input. The framework for classification is shown in the figure below.
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.
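The training and prediction steps (a) and (b) can be sketched in a few lines of Python. This is a minimal illustration, not the tooling the slides refer to: it assumes a bag-of-words feature extractor and a multinomial naive Bayes model with add-one smoothing, and the function names and the four training messages are made up for the example.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor shared by training and prediction: bag-of-words counts."""
    return Counter(text.lower().split())


def train(labeled_texts):
    """(a) Training: turn (text, label) pairs into a naive Bayes model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        features = extract_features(text)
        label_counts[label] += 1
        word_counts[label].update(features)
        vocabulary.update(features)
    return {"labels": label_counts, "words": word_counts, "vocab": vocabulary}


def predict(model, text):
    """(b) Prediction: same extractor, then return the highest-scoring label."""
    features = extract_features(text)
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, doc_count in model["labels"].items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(model["words"][label].values())
        for word, count in features.items():
            # Add-one smoothing keeps unseen words from zeroing the score.
            likelihood = (model["words"][label][word] + 1) / (total_words + vocab_size)
            score += count * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training corpus for spam detection.
TRAINING = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project status and agenda", "ham"),
]
model = train(TRAINING)
print(predict(model, "claim free money"))
print(predict(model, "monday project meeting"))
```

The key point is that `extract_features` is called in both phases, so unseen inputs are mapped into the same feature space the model was trained on.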
Text Classification
Common features for text classification include bag-of-words (BOW), bigrams, trigrams and part-of-speech (POS) tags for each word in the document.
The most commonly adopted machine learning algorithms for text classification are naïve Bayes, support vector machines, and maximum entropy classification.
Algorithm | Language | Tools
Naïve Bayes | |
Support Vector Machines | C++ | SVM-light, mySVM, LibSVM
Support Vector Machines | MatLab | SVM Toolbox
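The bag-of-words, bigram and trigram features mentioned above can be produced with a short helper. This is only a sketch under simple assumptions (whitespace tokenization, lowercase text); the function names and feature-name prefixes are invented for the example, and POS-tag features are left out because they need a tagger.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_ngram_features(text):
    """Count unigram (bag-of-words), bigram and trigram features."""
    tokens = text.lower().split()
    features = Counter()
    for n, name in ((1, "bow"), (2, "bigram"), (3, "trigram")):
        for gram in ngrams(tokens, n):
            features[name + "=" + " ".join(gram)] += 1
    return features


features = extract_ngram_features("the cat sat")
for name in sorted(features):
    print(name, features[name])
```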
Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source material.
The rise of social media such as forums, microblogging and blogs has fueled interest in sentiment analysis.
- Online reviews, ratings and recommendations on social media sites have turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations.
- As businesses look to automate the process of filtering out the noise, identifying relevant content and understanding reviewers' opinions, sentiment analysis is the right technique.
Sentiment Analysis
The main tasks, their descriptions and approaches are summarized in the table below:
Task | Description | Approaches | Lexicons/Algorithms
Polarity Classification | Classifying a given text at the document, sentence, or feature/aspect level into positive, negative or neutral | Lexicon-based scoring | SentiWordNet, LIWC
Polarity Classification | (as above) | Machine learning classification | SVM
Affect Analysis | Classifying a given text into affect states such as "angry", "sad", and "happy" | Lexicon-based scoring | WordNet-Affect
Affect Analysis | (as above) | Machine learning classification | SVM
Subjectivity Analysis | Classifying a given text into two classes: objective and subjective | Lexicon-based scoring | SentiWordNet, LIWC
Subjectivity Analysis | (as above) | Machine learning classification | SVM
Feature/Aspect Based Analysis | Determining the opinions or sentiment expressed on different features or aspects of entities (e.g., the screen [feature] of a cell phone [entity]) | Named entity recognition + entity relation detection | SentiWordNet, LIWC, WordNet; SVM
Opinion | Detecting the holder of a sentiment (i.e. | Named entity | SentiWordNet,
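Lexicon-based scoring for polarity classification can be illustrated with a very small example. The sketch below assumes a hand-made word-to-score lexicon (`TOY_LEXICON` is hypothetical and is not SentiWordNet or LIWC data) and simply sums the scores of the words it finds.

```python
# Hypothetical toy lexicon; a real system would load scores from a resource
# such as SentiWordNet or LIWC instead.
TOY_LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}


def polarity(text, lexicon=TOY_LEXICON):
    """Classify text as positive, negative or neutral by summing word scores."""
    tokens = [token.strip(".,!?").lower() for token in text.split()]
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("I love this great phone!"))
print(polarity("Terrible battery, bad screen."))
print(polarity("The phone arrived today."))
```

A scorer like this ignores negation and context ("not good" still counts as positive), which is one reason the table also lists machine learning classification as an alternative.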
Topic Modeling
Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents.
Topic modeling algorithms include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA).
- Among them, Latent Dirichlet Allocation (LDA) is the most commonly used nowadays.
Topic modeling algorithms can be applied to massive collections of documents.
- Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.
Topic modeling algorithms can be adapted to many kinds of
Topic Modeling - LDA
The figure below shows the intuitions behind latent Dirichlet allocation. We assume that some number of "topics", which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
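The generative story described above can be written out directly. The following is a sketch of the sampling process only, not of LDA inference: the two topics, their word probabilities and the `alpha` value are made-up assumptions, and the Dirichlet draw is built from normalized gamma samples using the standard library.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw a k-dimensional symmetric Dirichlet sample via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [draw / total for draw in draws]


def generate_document(topics, n_words, alpha=0.5, seed=0):
    """Generate one document following the LDA generative story."""
    rng = random.Random(seed)
    topic_names = list(topics)
    # Step 1: choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, len(topic_names), rng)
    words = []
    for _ in range(n_words):
        # Step 2: choose a topic assignment for this word position.
        topic = rng.choices(topic_names, weights=theta)[0]
        # Step 3: choose the word from the corresponding topic.
        vocab = list(topics[topic])
        weights = [topics[topic][word] for word in vocab]
        words.append((topic, rng.choices(vocab, weights=weights)[0]))
    return theta, words


# Hypothetical topics: each is a distribution over words.
TOPICS = {
    "genetics": {"gene": 0.5, "dna": 0.3, "genome": 0.2},
    "computing": {"data": 0.5, "computer": 0.3, "model": 0.2},
}
theta, words = generate_document(TOPICS, n_words=8)
print(theta)
print(words)
```

Inference in LDA runs this story in reverse: given only the words, it estimates the topics and the per-document topic proportions.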
Topic Modeling - LDA
The figure below shows real inference with LDA. A 100-topic LDA model is fitted to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in the previous figure. At right are the top 15 most frequent words from the most frequent topics found in this article.
Topic Modeling - Tools
Name | Model/Algorithm | Language | Author | Notes
lda-c | Latent Dirichlet allocation | C | D. Blei | This implements variational inference for LDA.
class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
lda | R package for Gibbs sampling in many models
Named Entity Recognition
A named entity is anything that can be referred to with a proper name.
Named entity recognition aims to:
- Find spans of text that constitute proper names
- Classify the entities being referred to according to their type
Type | Sample Categories | Example
People | Individuals, fictional characters | Turing is often considered to be the father of modern computer science.
Organization | Companies, parties | Amazon plans to use drone copters for deliveries.
Location | Mountains, lakes, seas | The highest point in the Catalinas is Mount Lemmon at an elevation of 9,157 feet above sea level.
Geo-Political | Countries, states, provinces | The Catalinas are located north and northeast of Tucson, Arizona, United States.
Facility | Bridges, airports | In the late 1940s, Chicago Midway was the busiest airport in the United States by total aircraft operations.
Vehicles | Planes, trains, cars | The updated Mini Cooper retains its charm and agility.
In practice, named entity recognition can be extended to types that are not in the table above, such as temporal expressions (times and dates), genes, proteins and medical concepts (diseases, treatments and medical events).
Named Entity Recognition
Named entity recognition techniques can be categorized into knowledge-based approaches and machine learning based approaches.
Category | Advantage | Disadvantage | Tools/Ontology
Knowledge-based approach | Requires little training data | Creating a lexicon manually is time-consuming and expensive; encoded knowledge might not be portable across domains. | General entity types: WordNet, lexicons created by experts. Medical domain: GATE (University of Sheffield), UMLS (National Library of Medicine), MedLEE (originally from Columbia University, commercialized now)
Machine learning approach: Conditional Random Field (CRF), Hidden Markov Model (HMM) | Reduced human effort in maintaining rules and dictionaries | Requires a prepared set of annotated training data | Conditional Random Field tools: Stanford NER, CRF++, Mallet. Hidden Markov Model tools: Mallet, Natural Language Toolkit (NLTK)
Links:
WordNet: http://wordnet.princeton.edu/
GATE: http://gate.ac.uk/
UMLS: http://www.nlm.nih.gov/research/umls/
MedLEE: http://healthfidelity.com/columbia-grants-health-fidelity-exclusive-license-to-medlee-nlp
Stanford NER: http://nlp.stanford.edu/software/CRF-NER.shtml
CRF++: https://code.google.com/p/crfpp/
Mallet: http://mallet.cs.umass.edu/
NLTK: http://nltk.org/
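The knowledge-based approach can be illustrated with a dictionary (gazetteer) lookup. This is a minimal sketch that assumes exact string matching against a hand-built lexicon; `GAZETTEER` is hypothetical and reuses names from the example sentences on the previous slide, and it does not reflect how GATE, UMLS or MedLEE work internally.

```python
# Hypothetical mini-gazetteer mapping proper names to entity types.
GAZETTEER = {
    "Turing": "People",
    "Amazon": "Organization",
    "Mount Lemmon": "Location",
    "Tucson": "Geo-Political",
    "Chicago Midway": "Facility",
    "Mini Cooper": "Vehicles",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, name, type) for every gazetteer name found in text."""
    entities = []
    for name, entity_type in gazetteer.items():
        start = text.find(name)
        while start != -1:
            entities.append((start, start + len(name), name, entity_type))
            start = text.find(name, start + 1)
    return sorted(entities)


for entity in find_entities("Turing never flew from Chicago Midway."):
    print(entity)
```

The sketch also shows the disadvantage listed in the table: every new name or domain requires someone to extend the lexicon by hand.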
Entity Relation Extraction
Entity relation extraction discerns the relationships that exist among the entities detected in a text. Entity relation extraction techniques are applied in a variety of areas.
- Question answering
  - Extracting entities and relational patterns for answering factoid questions
- Feature/aspect based sentiment analysis
  - Extracting relational patterns among entities, features and sentiments in text (entity, feature, sentiment)
- Mining biomedical texts
  - Protein binding relations useful for drug discovery
  - Detection of gene-disease relations from biomedical literature
  - Finding drug-side effect relations in health social media
Entity Relation Extraction
Entity relation extraction approaches can be categorized into three types:
Category | Method | Advantage | Disadvantage | Tools
Co-occurrence Analysis | If two entities co-occur within a certain distance, they are considered to have a relation | Simplicity and flexibility; high recall | Low precision; can't decide relation types |
Rule-based approaches | Create rules for relation extraction based on syntactic and semantic information in the sentences | General, flexible | Lower portability across different domains; manual encoding of syntactic and semantic rules | Syntactic information: Stanford Parser; OpenNLP. Semantic information: domain knowledge bases
Supervised Learning | Feature-based methods: feature representation. Kernel-based methods: | Little or no manual development of rules and templates | Annotated corpora are required. | Dan Bikel's parser; MST parser; Stanford parser; SVM classifier:
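Co-occurrence analysis is simple enough to sketch in a few lines. The example assumes single-token entity mentions, whitespace tokenization and a hypothetical `max_distance` of five tokens; the sentence and entity set are made up.

```python
from itertools import combinations


def cooccurring_pairs(text, entities, max_distance=5):
    """Pair up entity mentions that occur within max_distance tokens of each other."""
    tokens = text.lower().split()
    positions = [(i, token) for i, token in enumerate(tokens) if token in entities]
    pairs = []
    for (i, first), (j, second) in combinations(positions, 2):
        if first != second and j - i <= max_distance:
            pairs.append((first, second))
    return pairs


SENTENCE = ("aspirin can cause nausea but ibuprofen taken with food "
            "on a full stomach rarely causes dizziness")
ENTITIES = {"aspirin", "nausea", "ibuprofen", "dizziness"}
print(cooccurring_pairs(SENTENCE, ENTITIES))
# -> [('aspirin', 'nausea'), ('aspirin', 'ibuprofen'), ('nausea', 'ibuprofen')]
```

The output reflects the trade-off in the table: the method returns every nearby pair, including ones with no real relation, and it says nothing about what kind of relation holds.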
Supervised Learning Approaches for Entity Relation Extraction
The supervised learning approach breaks relation extraction into two subtasks (relation detection and relation classification). Each task is a text classification problem.
Supervised learning approaches can be categorized into feature-based methods and kernel-based methods.
Classifier 1: Detect when a relation is present between two entities
Classifier 2: Classify the relation types
[Figure labels: Sentences; Text Analysis (POS, Parse Trees); Feature Extraction (feature-based methods); Kernel Function (kernel-based methods); Classifier]
Supervised Learning Approach to Entity Relation Extraction
Feature-based methods rely on features to represent instances for classification. The features for relation extraction can be categorized into:
Entity-based features
- Entity types of the two candidate arguments
- Concatenation of the two entity types
- Headwords
- Bag-of-words from the arguments
Word-based features
- Bag-of-words and bag-of-bigrams between entities
- Stemmed version of bag-of-words and bag-of-bigrams between entities
- Words and stems immediately preceding and following the entities
- Distance in words between the arguments
- Number of entities between the arguments
Syntactic features
- Presence of particular constructions in a constituent structure
- Chunk base-phrase paths
- Bags of chunk heads
- Dependency-tree paths
- Constituent-tree paths
- Tree distance between the arguments
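A few of the entity-based and word-based features listed above can be computed without any parser. The sketch below is an assumption-laden illustration: the argument format (token offsets plus a type), the feature names, and the DRUG/SYMPTOM entity types are all hypothetical, and stemming and syntactic features are omitted.

```python
def relation_features(tokens, arg1, arg2):
    """Build a feature dict for a candidate relation between two entity mentions.

    tokens: list of words; arg1 and arg2: dicts with 'start', 'end' (exclusive)
    and 'type'. arg1 is assumed to come before arg2 in the sentence.
    """
    between = [word.lower() for word in tokens[arg1["end"]:arg2["start"]]]
    features = {
        "type1": arg1["type"],                               # entity type of argument 1
        "type2": arg2["type"],                               # entity type of argument 2
        "type_pair": arg1["type"] + "-" + arg2["type"],      # concatenated types
        "distance": len(between),                            # distance in words
    }
    for word in between:                                     # bag-of-words between entities
        features["between_word=" + word] = 1
    for left, right in zip(between, between[1:]):            # bag-of-bigrams between entities
        features["between_bigram=" + left + "_" + right] = 1
    return features


tokens = "Aspirin may cause nausea".split()
drug = {"start": 0, "end": 1, "type": "DRUG"}
symptom = {"start": 3, "end": 4, "type": "SYMPTOM"}
features = relation_features(tokens, drug, symptom)
print(features)
```

A dictionary like this is what would be paired with a label and handed to the classifier in the supervised setup.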
Supervised Learning Approach to Entity Relation Extraction
Kernel-based methods are an effective alternative to explicit feature extraction.
- They retain the original representation of objects and use the object only via computing a kernel function between a pair of objects.
- Kernel K(x,y) defines similarity between objects x and y implicitly in a higher-dimensional space.
Commonly used kernel functions for relation extraction are:
Author | Kernels | Description | Node attributes
Zelenko et al. 2003 | Shallow parse tree kernel | Uses shallow parse trees | Entity type, word, POS tag
Culotta et al. 2004 | Dependency tree kernel | Uses dependency parse trees | Word, POS, generalized POS, chunk tag, entity type, entity level
Bunescu et al. 2005 | Shortest dependency path kernel | Shortest path between entities in a dependency tree | Word, POS, generalized POS, entity type
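To make the idea of a kernel K(x,y) concrete, here is a sketch in the spirit of the shortest dependency path kernel. It assumes each path is given as a list of feature sets (word, POS, generalized POS, entity type), that paths of different length have similarity zero, and that otherwise the kernel is the product of the number of shared features at each position. The two example paths are illustrative inputs written by hand, not parser output.

```python
def shortest_path_kernel(path_x, path_y):
    """Similarity of two dependency paths given as lists of feature sets."""
    if len(path_x) != len(path_y):
        return 0
    similarity = 1
    for features_x, features_y in zip(path_x, path_y):
        # Count the features both paths share at this position.
        similarity *= len(features_x & features_y)
    return similarity


# Hand-written paths for "his actions in Brcko" and "his arrival in Beijing".
PATH_X = [{"his", "PRP", "PERSON"}, {"->"}, {"actions", "NNS", "Noun"}, {"<-"},
          {"in", "IN"}, {"<-"}, {"Brcko", "NNP", "Noun", "LOCATION"}]
PATH_Y = [{"his", "PRP", "PERSON"}, {"->"}, {"arrival", "NN", "Noun"}, {"<-"},
          {"in", "IN"}, {"<-"}, {"Beijing", "NNP", "Noun", "LOCATION"}]
print(shortest_path_kernel(PATH_X, PATH_Y))
# -> 18  (3 * 1 * 1 * 1 * 2 * 1 * 3)
```

No explicit feature vector is ever built: the classifier only needs the pairwise similarity values, which is the point made in the first bullet above.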
Ontology
An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties, and interrelationships of those concepts.
In text mining, ontologies are often used to extract named entities, detect entity relations and conduct sentiment analysis. Commonly used ontologies are listed in the table below:
Name | Creator | Description | Application
WordNet | Princeton University | A large lexical database of English. | Word sense disambiguation; text summarization; text similarity analysis
SentiWordNet | Andrea Esuli, Fabrizio Sebastiani | SentiWordNet is a lexical resource for opinion mining. | Sentiment analysis
Linguistic Inquiry and Word Count (LIWC)
WordNet
WordNet is an online lexical database in which English nouns, verbs, adjectives and adverbs are organized into sets of synonyms.
- Each word represents a lexicalized concept.
- Semantic relations link the synonym sets (synsets).
WordNet contains more than 118,000 different word forms and more than 90,000 senses.
- Approximately 17% of the words in WordNet are polysemous (have more than one sense); 40%
WordNet
Six semantic relations are presented in WordNet because they apply broadly throughout English and because a user need not have advanced training in linguistics to understand them. The table below shows the included semantic relations.
WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, text classification, text summarization, machine translation and semantic textual similarity analysis.
Semantic Relation | Syntactic Category | Examples
Synonymy (similar) | Noun, Verb, Adjective, Adverb | pipe, tube; rise, ascent; sad, unhappy; rapidly, speedily
Antonymy (opposite) | Adjective, Adverb | wet, dry; powerful, powerless; rapidly, slowly
Hyponymy (subordinate) | Noun | maple, tree; tree, plant
Meronymy (part) | Noun | brim, hat; ship, fleet
Troponymy (manner) | Verb | march, walk; whisper, speak
Entailment | Verb | drive, ride; divorce, marry
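The relations in the table can be pictured as labeled links between words. The sketch below stores the example pairs from the table in a plain dictionary (a toy stand-in, not the real WordNet database or any WordNet API) and follows the hyponymy links upward to collect the more general terms for a word.

```python
# Toy relation store built from the example pairs in the table above.
RELATIONS = {
    "hyponymy": {"maple": "tree", "tree": "plant"},
    "meronymy": {"brim": "hat", "ship": "fleet"},
    "troponymy": {"march": "walk", "whisper": "speak"},
    "entailment": {"drive": "ride", "divorce": "marry"},
}


def hypernym_chain(word, relations=RELATIONS):
    """Follow hyponymy links upward and return the increasingly general terms."""
    chain = []
    parents = relations["hyponymy"]
    while word in parents:
        word = parents[word]
        chain.append(word)
    return chain


print(hypernym_chain("maple"))
# -> ['tree', 'plant']
```

Chains like this are what make WordNet useful for similarity measures: two words that share a nearby ancestor are likely to be semantically close.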
SentiWordNet
SentiWordNet is a lexical resource explicitly devised for supporting sentiment analysis and opinion mining applications.
SentiWordNet is the result of the automatic annotation of all the synsets of WordNet according to the notions of "positivity", "negativity" and "objectivity".
Each of the "positivity", "negativity" and "objectivity" scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset.
The figure above shows the graphical representation adopted by SentiWordNet for representing the opinion-related properties of a term sense.
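Because the three scores sum to 1.0, storing two of them is enough to recover the third. The following sketch models that constraint; the class name and the example values are hypothetical and are not taken from the SentiWordNet data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    """Opinion scores for one synset; objectivity is derived from the other two."""
    positivity: float
    negativity: float

    def __post_init__(self):
        for value in (self.positivity, self.negativity):
            if not 0.0 <= value <= 1.0:
                raise ValueError("scores must lie in [0.0, 1.0]")
        if self.positivity + self.negativity > 1.0:
            raise ValueError("positivity + negativity cannot exceed 1.0")

    @property
    def objectivity(self):
        # The three scores sum to 1.0 for each synset.
        return 1.0 - self.positivity - self.negativity


scores = SynsetScores(positivity=0.75, negativity=0.0)
print(scores.objectivity)
# -> 0.25
```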
SentiWordNet
In SentiWordNet, different senses of the same term may have different opinion-related properties.
The figure above shows the visualization of opinion-related properties of the term estimable in SentiWordNet (http://sentiwordnet.isti.cnr.it/search.php?q=estimable).
[Figure labels: search term; Sense 1, Sense 2, Sense 3; positivity, objectivity and negativity score; synonym of estimable in this sense]
Linguistic Inquiry and Word Count (LIWC)
Linguistic Inquiry and Word Count (LIWC) is a text analysis program that looks for and counts words in psychology-relevant categories across text files.
Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.
LIWC is often adopted in NLP applications for sentiment analysis, affect analysis, deception detection, etc.
Linguistic Inquiry and Word Count (LIWC)
The LIWC program has two major components: the processing component and the dictionaries.
- Processing
  - Opens a series of text files (posts, blogs, essays, novels, and so on)
  - Each word in a given text is compared with the dictionary file.
- Dictionaries: the collection of words that define a particular category
  - English dictionary: over 455,555 words across over 65 categories examined by human experts.
  - Major categories: functional words, social processes, affective processes, positive emotion, negative emotion, cognitive processes, biological processes, relativity, etc.
  - Multilingual: Arabic, Chinese, Dutch, French, German, Italian, Portuguese, Russian, Serbian, Spanish and Turkish.
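The processing step, comparing each word with the dictionary and counting matches per category, can be sketched as follows. `TOY_DICTIONARY` is a hypothetical three-category stand-in for the LIWC dictionaries, and reporting results as a percentage of all words is an assumption made for the example.

```python
from collections import Counter

# Hypothetical mini-dictionary; the real LIWC dictionaries are far larger.
TOY_DICTIONARY = {
    "positive emotion": {"happy", "love", "good"},
    "negative emotion": {"sad", "hate", "worried"},
    "social processes": {"friend", "family", "talk"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the percentage of words in text that belong to it."""
    words = [word.strip(".,!?;:").lower() for word in text.split()]
    words = [word for word in words if word]
    counts = Counter()
    for word in words:
        for category, members in dictionary.items():
            if word in members:
                counts[category] += 1
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {category: 100.0 * counts[category] / total for category in dictionary}


print(category_percentages("Happy family, sad friend"))
# -> {'positive emotion': 25.0, 'negative emotion': 25.0, 'social processes': 50.0}
```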
Linguistic Inquiry and Word Count (LIWC)
[Figure: LIWC online demo (http://www.liwc.net/tryonlineresults.php). Input text: a post from a 40-year-old female member of the American Diabetes Association online community. The output shows the LIWC categories, the LIWC results from the input text, and LIWC results from personal text and formal writing for comparison.]
Unified Medical Language System (UMLS)
The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the US National Library of Medicine.
- UMLS integrates over 2.5 million names for 900,551 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts.
- Ontologies integrated in the UMLS Metathesaurus include the NCBI taxonomy, Gene Ontology (GO), the Medical Subject Headings (MeSH), Online Mendelian Inheritance in Man (OMIM), the University of Washington Digital Anatomist symbolic knowledge base (UWDA) and Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT).
Unified Medical Language System (UMLS)
Name | Creator | Description | Application
National Center for Biotechnology Information (NCBI) Taxonomy | National Library of Medicine | All of the organisms in the public sequence database | Identify organisms
University of Washington Digital Anatomist Source Information (UWDA) | University of Washington Structural Informatics Group | Symbolic models of the structures and relationships that constitute the human body. | Identify terms in anatomy
Gene Ontology (GO) | Gene Ontology Consortium | Gene product characteristics and gene product annotation data | Gene product annotation
Medical Subject Headings (MeSH) | National Library of Medicine | Vocabulary thesaurus used for indexing articles for PubMed | Covers terms in biomedical literature
Online Mendelian Inheritance in Man (OMIM) | McKusick-Nathans Institute of Genetic Medicine
Accessing UMLS data
- No fee associated, license agreement required
- Available for research purposes; restrictions apply for other kinds of applications
UMLS related tools
- MetamorphoSys (command line program)
  - UMLS installation wizard and customization tool
  - Selecting concepts from a given sub-domain
  - Selecting the preferred name of concepts
- MetaMap (
MedEffect
MedEffect is the Canada Vigilance Adverse Reaction Online Database, which contains information about suspected adverse reactions to health products.
- Reports are submitted by consumers and health professionals.
- It contains a complete list of medications, adverse reactions and drug indications (medical conditions for legitimate use of medication).
MedEffect is often used in healthcare research for annotating medications and adverse reactions from text (Leaman et al. 2010; Chee et al. 2011).
Consumer Health Vocabulary (CHV)
Consumer Health Vocabulary (CHV) is a lexicon linking UMLS standard medical terms to health consumer vocabulary.
- Laypeople use a different vocabulary from healthcare professionals to describe medical problems.
- CHV helps to bridge the communication gap between consumers and healthcare professionals by mapping the UMLS standard medical terms to consumer health language.
It has been applied in prior studies to better understand and match user expressions for medical entity extraction in social media (Yang et al. 2012; Benton et al. 2011).
FDA's Adverse Event Reporting System (FAERS)
FDA's Adverse Event Reporting System (FAERS) documents adverse drug event reports and drug indications of all the medical products in the US market.
- Reports are submitted by consumers, health professionals, pharmaceutical companies and researchers.
- It contains a complete list of medical products in the United States and their suspected adverse reactions.
FAERS has been applied in healthcare research for medical named entity recognition and adverse drug event extraction (Bian et al. 2012; Liu et al. 2013).
A LIST OF OPEN SOURCE NLP TOOLKITS
Name | Main Features | Language | Creators | Website
Antelope framework | Part-of-speech tagging, dependency parsing, WordNet lexicon | C#, VB.net | Proxem | [1]
Apertium | Machine translation for language pairs from Spanish, English, French, Portuguese, Catalan and Occitan | C++,
Name | Main Features | Language | Creators | Website
Learning Based Java | POS tagger, chunking, coreference resolution, named entity recognition
Name | Main Features | Language | Creators | Website
OpenNLP | Tokenization, sentence segmentation, POS tagging, named entity extraction, chunking, parsing, coreference resolution
Name | Main Features | Language | Creators | Website
UIMA | Industry standard for content analytics; contains a set of rule-based and machine learning annotators and tools
SHARED TASKS (COMPETITIONS) IN HEALTHCARE AND NATURAL LANGUAGE PROCESSING DOMAINS
Introduction
Shared task series in Natural Language Processing often represent community-wide trends and hot topics that have not been fully explored in the past.
To keep up with state-of-the-art techniques and new research topics in the NLP community, we explore major conferences, workshops and special interest groups belonging to the Association for Computational Linguistics (ACL).
We organize our findings into two categories: ongoing shared tasks and a watch list.
- The ongoing list contains competitions that have already made task descriptions, data and schedules for 2014 publicly available.
  - International Workshop on Semantic Evaluation (SemEval)
  - CLEF eHealth Evaluation Lab
- The watch list contains competitions that haven't made content available but are relevant to our interests.
  - Conference on Natural Language Learning (CoNLL) Shared Tasks
SemEval
Overview
- SemEval, the International Workshop on Semantic Evaluation, is an ongoing series of evaluations of computational semantic analysis systems. It evolved from the SensEval (word sense evaluation) series.
- SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics, is the umbrella organization for SemEval.
- SemEval-2014 will be the 8th workshop on semantic evaluation. The workshop will be co-located with the 25th International Conference on Computational Linguistics (COLING) in Dublin, Ireland.
SemEval
Past workshops
Workshop | No. of Tasks | Areas of study | Languages of Data Evaluated
Senseval-1 (1998) | 3 | Word Sense Disambiguation (WSD) - Lexical Sample WSD tasks | English, French, Italian
Senseval-2 (2001) | 12 | Word Sense Disambiguation (WSD) - Lexical Sample, All Words, Translation WSD tasks | Czech, Dutch, English, Estonian, Basque, Chinese, Danish, English, Italian, Japanese, Korean, Spanish, Swedish
Senseval-3 (2004) | 16 | Logic Form Transformation, Machine Translation (MT) Evaluation, Semantic Role Labeling, WSD | Basque, Catalan, Chinese, English, Italian, Romanian, Spanish
SemEval-2007 | 19 | Cross-lingual, Frame Extraction, Information Extraction, Lexical Substitution, Lexical Sample, Metonymy, Semantic Annotation, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Time Expression, WSD | Arabic, Catalan, Chinese, English, Spanish, Turkish
SemEval-2010 | 18 | Co-reference, Cross-lingual, Ellipsis, Information Extraction, Lexical Substitution, Metonymy, Noun Compounds, Parsing, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Textual Entailment, Time Expressions, WSD | Catalan, Chinese, Dutch, English, French, German, Italian, Japanese, Spanish
SemEval-2012 | 8 | Common Sense Reasoning, Lexical Simplification, Relational Similarity, Spatial Role Labeling, Semantic Dependency Parsing, Semantic and Textual Similarity | Chinese, English
SemEval-2013 | 14 | Temporal Annotation, Sentiment Analysis, Spatial Role Labeling, Noun Compounds, Phrasal Semantics, Textual Similarity, Response Analysis, Cross-lingual Textual Entailment, BioMedical Texts, Cross and Multi-lingual WSD, Word Sense Induction, and Lexical Sample | Catalan, French, German, English, Italian, Spanish
SemEval-2014
Task ID | Task Name | Description | Data
1 | Evaluation of compositional distributional semantic models (CDSMs) on full sentences | Subtask A: predicting the degree of relatedness between two sentences. Subtask B: detecting the entailment relation holding between them. | 10,000 English sentence pairs, each annotated for relatedness score in meaning and the entailment relation (entail, contradiction, and neutral) between the two sentences.
2 | Grammar Induction for Spoken Dialogue Systems | Creating clusters consisting of semantically similar fragments. For example, the following two fragments: "depart from <City>" and "fly out of <City>" are in the same cluster as they refer to the concept of departure city. | Training data will cover two domains: air travel and tourism. The data will be available in two languages: Greek and English.
3 | Cross-level semantic similarity | Evaluating similarity across different sizes of text: paragraph to sentence, sentence to phrase, phrase to word and word to sense. | Information about data hasn't been released yet.
4 | Aspect Based Sentiment Analysis | Subtask 1: Aspect term extraction. Subtask 2: Aspect term polarity. Subtask 3: Aspect category detection. Subtask 4: Aspect category polarity. | Two domain-specific datasets (restaurant reviews and laptop reviews), consisting of over 6,500 sentences with fine-grained aspect-level human-authored annotations, will be provided.
5 | L2 writing assistant | Build a translation assistance system that concerns the translation of fragments of one language (L1), i.e. words or phrases, in a second language (L2) context. For example, input (L1 French, L2 English): I rentre la | The data set covers the following L1 and L2 pairs: English-German, English-Spanish, French-English and Dutch-English. The trial data contains 500 sentences for each language pair. Information about
SemEval-2014
Task ID | Task Name | Description | Data
6 | Spatial Robot Commands | Parse spatial robot commands using data from an annotated corpus, collected from a simplified 'blocks world' game (http://www.trainrobot.com) | In trial data, each natural language command is annotated into a robot command. "Move the blue block on top of the grey block." is labeled as (event: (action: move) (entity: (color: blue) (type: cube)) (destination: (spatial-relation: (relation: above) (entity: (color: gray) (type: cube)))))
7 | Analysis of Clinical Text | Combine supervised methods for entity/acronym/abbreviation recognition and mapping to UMLS CUIs (Concept Unique Identifiers) with unsupervised discovery and sense induction of the entities/acronyms/abbreviations. | Information about data hasn't been released yet.
8 | Broad-Coverage and Cross-Framework Semantic Dependency Parsing | This task seeks to stimulate more generalized semantic dependency parsing and give a more direct analysis of 'who did what to whom' from sentences. | In trial data, 198 sentences from WSJ are annotated with the desired semantic representation.
9 | Sentiment Analysis for Twitter | Subtask A - Contextual Polarity Disambiguation: given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. Subtask B - Message Polarity Classification: given a message, decide whether the message is of positive, negative, or neutral sentiment. | Training: 9,728 Twitter messages. Development: 1,654 Twitter messages (can be used for training as well). Development-test A: 3,814 Twitter messages (CANNOT be used for training). Development-test B: 2,094 SMS messages (CANNOT be used for training). The annotations and systems will use a
SemEval-2014
Important Dates
- Trial data ready: Oct. 30, 2013
- Training data ready: Dec. 15, 2013
- Test data ready: Mar. 10, 2014
- Evaluation end: Mar. 30, 2014
- Paper submission due: Apr. 30, 2014
- Paper reviews due: May 30, 2014
- Camera ready due
CLEF eHealth Evaluation Lab
Overview
- The CLEF Initiative (Conference and Labs of the Evaluation Forum) is a self-organized body whose main mission is to promote research, innovation, and development of information access systems with an emphasis on multilingual and multimodal information with various levels of structure.
- Started in 2000, CLEF aims to stimulate investigation and research in a wide range of key areas in the information retrieval domain, and has become well known in the international IR community. The results were traditionally presented and discussed at annual workshops in conjunction with the European Conference for Digital Libraries (ECDL), now called Theory and Practice on Digital Libraries (TPDL).
CLEF eHealth Evaluation Lab
Overview
- In 2013, CLEF started the eHealth Evaluation Lab, a shared task focused on natural language processing (NLP) and information retrieval (IR) for clinical care.
- The CLEF eHealth Evaluation Lab 2013 has three tasks:
  - Annotation of disorder mention spans from clinical reports
  - Annotation of acronym/abbreviation mention spans from clinical reports
  - Information retrieval on medical-related web documents
CLEF eHealth 2014
Task ID | Task | Description | Data
1 | Visual-Interactive Search and Exploration of eHealth Data | Subtask A: visualize the discharge summary together with the disorder standardization and shorthand expansion data in an effective and understandable way for laypeople. Subtask B: design a visual exploration approach that will provide an effective overview over a larger set of possibly relevant documents to meet the patient's information need. | 6 de-identified discharge summaries and 50 real patient search queries generated from the discharge summary
2 | Information extraction from clinical text | Develop annotated data, resources and methods that make clinical documents easier to understand from nurses' and patients' perspective. 10 different attributes (Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression) should be captured from clinical text and classified into certain value slots. | A set of de-identified clinical reports is provided by the MIMIC II database. A training set of 300 reports and their disease/disorder mention templates with filled attribute: value slots will be provided. A test set of 200 reports and their disease/disorder mention templates with default-filled attribute: value slots will be provided for the Task 2 challenge one week before the run submission deadline.
3 | User-centered health information retrieval | Subtask A: monolingual information retrieval task - retrieve the relevant medical documents for the user queries. Subtask B: multilingual information retrieval task - German, French and Czech. | A set of medical-related documents in four languages (English, German, French and Czech) is provided by the Khresmoi project (approximately 1 million medical documents for each language). 5 training queries and 50 test queries are provided.
CLEF eHealth 2014
Important Dates
CLEF 2014 Lab registration opens: Nov. 2013
Task data release begins: Nov. 15, 2013
Participant submission deadline (final submission to be evaluated): May 01, 2014
Results released
CoNLL
Overview
CoNLL, the Conference on Natural Language Learning, is a yearly meeting of the Special Interest Group on Natural Language Learning (SIGNLL) of the Association for Computational Linguistics (started in 1997).
Since 1999, CoNLL has included a shared task in which training and test data are provided by the organizers, which allows participating systems to be evaluated and compared in a systematic way. Descriptions of the systems and evaluations of their performance are presented both at the conference and in the proceedings.
The last CoNLL was held in August 2013 in Sofia, Bulgaria. Information about CoNLL 2014 and its shared task will be released next month.
CoNLL
Recent shared tasks from CoNLL
Year | Task | Data | Language
2013 | Grammatical Error Correction | National University of Singapore Corpus of Learner English (NUCLE) | English
2012 | Modeling Multilingual Unrestricted Coreference in OntoNotes | OntoNotes dataset from Linguistic Data Consortium | Arabic, Chinese, English
2011 | Modeling Unrestricted Coreference in OntoNotes | OntoNotes dataset from Linguistic Data Consortium | English
2010 | Subtask A: Learning to detect sentences containing uncertainty. Subtask B: Learning to resolve the in-sentence scope of hedge cues | A: biological abstracts and full articles from the BioScope (biomedical domain) corpus. B: paragraphs from Wikipedia possibly containing weasel information | English
2009 | Syntactic and Semantic Dependencies in Multiple Languages | Data with gold standard annotation of syntactic dependency, type of dependency, frame, role set and sense in multiple languages | English, Catalan, Chinese, Czech, German,
*SEM
Overview
*SEM 2012 shared task:
  Description: Resolving the scope and the focus of negation
  Data: Stories by Conan Doyle, and WSJ PropBank Data (about 8,000 sentences in total). All occurrences of negation, their scope and focus are annotated.
*SEM 2013 shared task:
  Description: Create a unified framework for the evaluation of semantic textual similarity modules and characterize their impact on NLP applications.
  Data: The data covers 5 areas: paraphrase sentence pairs (MSRpar), sentence pairs from video descriptions (MSRvid), MT evaluation sentence pairs (MTnews and MTeuroparl) and gloss pairs (OnWN).
BioNLP
Overview
BioNLP shared tasks are organized by the ACL's Special Interest Group for biomedical natural language processing.
BioNLP 2013 was the twelfth workshop on biomedical natural language processing, held in conjunction with the annual ACL or NAACL meeting.
BioNLP shared tasks are biennial events held with the BioNLP workshop, starting from 2009. The next event will be held in 2015.
BioNLP Past Shared Tasks
2013 (released Oct. 2012, end date Apr. 2013)
  1. Genia Event Extraction for NFkB knowledge base construction. Data: NFKB knowledge base
  2. Cancer Genetics. Data: PubMed literature
  3. Pathway Curation. Data: PubMed abstracts
  4. Corpus Annotation with Gene Regulation Ontology. Data: PubMed literature
  5. Bacteria Biotopes. Data: webpage documents with general information about bacteria species
  6. Gene Regulation Network in Bacteria. Data: PubMed abstracts
2011 (released Dec. 2010, end date Apr. 2011)
  1. GENIA. Data: PubMed abstracts
  2. Epigenetics and Post-translational Modifications. Data: PubMed abstracts
  3. Infectious Diseases. Data: PubMed abstracts
  4. Bacteria Biotopes. Data: PubMed abstracts
  5. Bacteria Interactions. Data: PubMed abstracts
  6. Co-reference. Data: PubMed abstracts
  7. Gene/Protein Entity Relations. Data: PubMed abstracts
  8. Gene renaming. Data: PubMed abstracts
2009 (released Dec. 15, 2008, end date Mar. 30, 2009)
  1. Core event extraction (identify events concerning the given proteins). Data: PubMed abstracts
  2. Event enrichment. Data: PubMed abstracts
i2b2 Challenges
Informatics for Integrating Biology and the Bedside (i2b2) is an NIH funded National Center for Biomedical Computing (NCBC).
The i2b2 center organizes data challenges to motivate the development of scalable computational frameworks to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health.
i2b2 challenge workshops are held in conjunction with the Annual Meeting of the American Medical Informatics Association.
Previous i2b2 Challenges
Year | Task | Data | Release Date | End Date
2012 | Temporal relation extraction | EHR
APPLYING TEXT MINING IN HEALTH SOCIAL MEDIA RESEARCH: AN EXAMPLE
Extracting Adverse Drug Events from Health Social Forums
Online patient forums can provide valuable supplementary information on drug effectiveness and side effects. Those forums cover a large and diverse population and contain data directly from patients.
Patient forum ADE reports can serve as an economical alternative to expensive and time-consuming patient-oriented drug safety data collection projects. They can help to generate new clinical hypotheses, cross-validate the adverse drug events detected from other data sources, and conduct comparison studies.
Post ID | Post Content | Contain ADE? | Report source
9043 | I had horrible chest pain [Event] under Actos [Treatment]. | ADE | Patient
12200 | From what you have said, it seems that Lantus [Treatment] has had some negative side effects related to depression [Event] and mood swings [Event]. | ADE | Hearsay
25139 | I never experienced fatigue [Event] when using Zocor [Treatment]. | Negated ADE | Patient
34188 | When taking Zocor [Treatment], I had headaches [Event] and bruising [Event]. | ADE | Patient
63828 | Another study of people with multiple risk factors for stroke [Event] found that Lipitor [Treatment] reduced the risk of stroke [Event] by 26% compared to those taking a placebo, the company said. | Drug Indication | Diabetes research
Test Bed
Forum Name | Number of Posts | Number of Topics | Number of Member Profiles | Time Span | Total Number of Sentences
American Diabetes Association | 184,874 | [illegible] | 6,544 | 2009.2-2012.11 | [illegible]
Diabetes Forums | [illegible] | 45,830 | 12,075 | 2002.2-2012.11 | 3,303,804
Diabetes Forum | [illegible] | 6,474 | 3,007 | 2007.2-2012.11 | 422,355
Discussion about disease and medical problems
Discussion about disease monitoring and medical products
Extracting Adverse Drug Events from Health Social Forums
Challenges
  Topics in patient social media come from various sources, including news and research, hearsay (stories of other people) and patients' own experience. Redundant and noisy information often masks patient-experienced ADEs.
  Currently, extracting adverse event and drug relations from patient comments results in low precision because of confounding with drug indications (legitimate medical conditions a drug is used for) and negated ADEs (contradiction or denial of experiencing ADEs) in sentences.
Solutions
  Develop a relation extractor for recognizing and extracting adverse drug event relations.
  Develop a text classifier to extract adverse drug event reports based on patient experience.
Extracting Adverse Drug Events from Health Social Forums
Patient Forum Data Collection: collect patient forum data through a web crawler
Data Preprocessing: remove noisy text including URLs, duplicated punctuation, etc., and separate each post into individual sentences
Medical entity extraction: identify treatments and adverse events discussed in the forum
Adverse drug event extraction: identify drug-event pairs indicating an adverse drug event, based on the results of medical entity extraction
Report source classification: classify the source of reported events as either patient experience or hearsay
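The data preprocessing step can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the authors' implementation: the regular expressions, the function name preprocess_post and the naive sentence splitter are all hypothetical choices.

```python
import re

# Assumed patterns: the slides do not specify how URLs or punctuation were detected.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
REPEATED_PUNCT = re.compile(r"([!?.,;:])\1+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess_post(post):
    """Return the cleaned sentences of one forum post."""
    text = URL_PATTERN.sub(" ", post)          # remove URLs
    text = REPEATED_PUNCT.sub(r"\1", text)     # collapse "!!!" into "!"
    text = re.sub(r"\s+", " ", text).strip()   # normalise whitespace
    # Naive split: break after ., ! or ? followed by whitespace.
    return [s for s in SENTENCE_END.split(text) if s]


post = "I started Lantus last week!!! See http://example.com/lantus for details... It gave me headaches??"
for sentence in preprocess_post(post):
    print(sentence)
```

A production pipeline would more likely rely on a trained sentence splitter, since forum text often lacks reliable punctuation.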
Medical Entity Extraction
Initialize the medical entity extraction with MetaMap to match terms related to drugs and ADEs in forum discussion.
Filter out the terms extracted by MetaMap that never appear in FAERS reports.
Query the Consumer Health Vocabulary for consumer preferred terms of the entities extracted by MetaMap, and look up those consumer vocabularies in the discussions.
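The three steps above amount to a filter followed by a vocabulary expansion. The sketch below illustrates that idea with toy in-memory data only; it does not call MetaMap, FAERS or the Consumer Health Vocabulary, and the function names, term lists and substring matching are assumptions made for illustration.

```python
def filter_and_expand(candidates, faers_terms, chv_synonyms):
    """Keep candidate terms seen in FAERS and attach their consumer synonyms."""
    faers = {t.lower() for t in faers_terms}
    kept = {}
    for term in candidates:
        key = term.lower()
        if key in faers:  # drop terms that never appear in FAERS reports
            kept[key] = sorted(chv_synonyms.get(key, []))
    return kept


def find_mentions(sentence, lexicon):
    """Return terms whose own form or a consumer synonym occurs in the sentence."""
    text = sentence.lower()
    found = []
    for term, synonyms in lexicon.items():
        if any(form in text for form in [term, *synonyms]):  # naive substring match
            found.append(term)
    return sorted(found)


# Toy stand-ins for the MetaMap output, FAERS terms and CHV synonyms.
candidates = ["Hypoglycemia", "Fatigue", "Forum"]
faers_terms = {"hypoglycemia", "fatigue", "nausea"}
chv_synonyms = {"hypoglycemia": ["low blood sugar"], "fatigue": ["tiredness", "worn out"]}

lexicon = filter_and_expand(candidates, faers_terms, chv_synonyms)
print(find_mentions("Lantus gave me low blood sugar and I felt worn out.", lexicon))
```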
MetaMap is a
Adverse Drug Event Extraction
Kernel based statistical learning
  Feature generation: generate representations of the relation instances
  Syntactic and semantic classes mapping: categorize lexical features into syntactic and semantic classes to reduce feature sparsity
  Shortest dependency path kernel: compute the similarity score between two relation instances
Semantic filtering
  Drug indications from FAERS: incorporate medical domain knowledge for differentiating drug indications from adverse events
  NegEx: incorporate linguistic knowledge to identify negated adverse drug events
  Semantic templates: form filtering templates using the knowledge from FAERS and NegEx
Rule based classification
NegEx: https://code.google.com/p/negex/
66/78
(dverse 0rug -vent-xtraction
Heature generation 9e utili/ed the Stanford 6arser !http$EEnlp.stanford.eduEsoftwareEstanford7
dependencies.shtml" for dependency parsing.
The gure aove shows the dependency tree of a sentence. In this sentence,hypoglycemia is an adverse event and =antus is a diaetes treatment.Nrammatical relations etween words are illustrated in the gure. Hor instance,bcauseB and bhypoglycemiaB have a relation bdoAB as bhypoglycemiaB is the directoAect of bcauseB. In this relation, bcauseB is the governor and bhypoglycemiaB is thedependent.
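To make the governor/dependent structure concrete, the sketch below finds the shortest path between an event and a treatment in a small dependency graph using breadth-first search. The edge list is a hypothetical parse of a sentence such as "The action of Lantus may cause hypoglycemia" (the slide's figure is not available), and the function name is our own. Arrows point from dependent to governor.

```python
from collections import deque


def shortest_dependency_path(edges, start, end):
    """edges: (governor, dependent) pairs. Returns words alternating with arrows."""
    neighbours = {}
    for gov, dep in edges:
        neighbours.setdefault(dep, []).append((gov, "->"))  # step up to the governor
        neighbours.setdefault(gov, []).append((dep, "<-"))  # step down to a dependent
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        word = path[-1]
        if word == end:
            return path
        for nxt, arrow in neighbours.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [arrow, nxt])
    return None  # the two words are not connected


# Hypothetical (governor, dependent) edges, not taken from a real parser run.
edges = [
    ("cause", "hypoglycemia"),  # dobj
    ("cause", "action"),        # nsubj
    ("cause", "may"),           # aux
    ("action", "Lantus"),       # prep_of
    ("action", "The"),          # det
]
print(shortest_dependency_path(edges, "hypoglycemia", "Lantus"))
```

Because a dependency tree has exactly one path between any two words, breadth-first search is sufficient here.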
Adverse Drug Event Extraction
Syntactic and Semantic Classes Mapping
To reduce data sparsity and increase the robustness of our method, we expand the shortest dependency path by categorizing the words on the path into syntactic and semantic classes with varying degrees of generality.
Word classes include part-of-speech (POS) tags and generalized POS tags. POS tags are extracted with the Stanford CoreNLP packages. We generalized the POS tags following the Penn Tree Bank guidelines for POS tags. Semantic types (Event and Treatment) are also used for the two ends of the shortest path.
Figure: Syntactic and Semantic Classes Mapping from dependency graph
The relation instance in the figure is represented as a sequence of features X = [x1, x2, x3, x4, x5, x6, x7], where x1 = {Hypoglycemia, NN, Noun, Event}, x2 = {->}, x3 = {cause, VB, Verb}, x4 = {<-}, x5 = {action, NN, Noun}, x6 = {<-}, x7 = {Lantus, NN, Noun, Treatment}.
Adverse Drug Event Extraction
Shortest Dependency Path Kernel function
If $x = x_1 x_2 \ldots x_m$ and $y = y_1 y_2 \ldots y_n$ are two relation examples, where $x_i$ denotes the set of word classes corresponding to position $i$, the kernel function is computed as in the equation below (Bunescu et al. 2005).
$$K(x, y) = \begin{cases} 0, & m \neq n \\ \prod_{i=1}^{n} c(x_i, y_i), & m = n \end{cases}$$
$c(x_i, y_i) = |x_i \cap y_i|$ is the number of common word classes between $x_i$ and $y_i$.
Relation instance x = [Hypoglycemia, NN, Noun, Event, ->, cause, VB, Verb, <-, action, NN, Noun, <-, Lantus, NN, Noun, Treatment].
Relation instance y = [depression, NN, Noun, Event, ->, indicate, VBP, Verb, <-, effect, NN, Noun, <-, Lantus, NNP, Noun, Treatment].
K(x, y) can be computed as the product of the number of common features between $x_i$ and $y_i$ at each position $i$.
K(x, y) = 3 × 1 × 1 × 1 × 2 × 1 × 3 = 18.
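The worked example above can be checked with a few lines of Python. This is a minimal sketch of the shortest dependency path kernel as described on this slide; representing each position as a Python set and the function name sdp_kernel are our own choices.

```python
from math import prod


def sdp_kernel(x, y):
    """Product over positions of the number of shared word classes; 0 if lengths differ."""
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))


# The two relation instances from the slide, one set of word classes per position.
x = [
    {"Hypoglycemia", "NN", "Noun", "Event"}, {"->"},
    {"cause", "VB", "Verb"}, {"<-"},
    {"action", "NN", "Noun"}, {"<-"},
    {"Lantus", "NN", "Noun", "Treatment"},
]
y = [
    {"depression", "NN", "Noun", "Event"}, {"->"},
    {"indicate", "VBP", "Verb"}, {"<-"},
    {"effect", "NN", "Noun"}, {"<-"},
    {"Lantus", "NNP", "Noun", "Treatment"},
]
print(sdp_kernel(x, y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3
```

Note that a single position with no shared class makes the whole product zero, so paths with different arrow directions never match.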
Adverse Drug Event Extraction
SVM Classification
Many SVM software tools have been developed and commercialized.
Among them, the SVM-light package and LIBSVM are two of the most widely used. Both are free of charge and can be downloaded from the Internet.
SVM-light is available at http://svmlight.joachims.org/
LIBSVM can be found at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
SVM-light
Adverse Drug Event Extraction
ALGORITHM 1. STATISTICAL LEARNING FOR ADVERSE DRUG EVENT EXTRACTION
Input: all the relation instances with a pair of related drug and medical events, R(drug, event)
Output: whether the instances have a pair of related drug and event
Procedure:
1. For each relation instance R(drug, event):
     Generate dependency tree T of R(drug, event)
     Features = Shortest Dependency Path Extraction (T, R)
     Features = Syntactic and Semantic Classes Mapping (Features)
2. Separate relation instances into training set and test set
3. Train an SVM classifier C with the shortest dependency path kernel function based on the training set
4. Use the SVM classifier C to classify instances in the test set into two classes, R(drug, event) = True and R(drug, event) = False
Adverse Drug Event Extraction
ALGORITHM 2. SEMANTIC FILTERING ALGORITHM
Input: a relation instance i with a pair of related drug and medical events, R(drug, event)
Output: the relation type
If drug exists in FAERS:
    Get indication list for drug;
    For indication in indication list:
        If event = indication:
            Return R(drug, event) = 'Drug Indication';
For rule in NegEx:
    If relation instance i matches rule:
        Return R(drug, event) = 'Negated Adverse Drug Event';
Return R(drug, event) = 'Adverse Drug Event';
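The semantic filtering algorithm above translates almost directly into code. The sketch below is an assumption-laden illustration: the indication dictionary stands in for FAERS, the short trigger list stands in for the NegEx rules, and a trigger counts as a match only if it appears before the event mention, which is a simplification of how NegEx scopes negation.

```python
import re

# Hypothetical stand-ins: a toy indication table and a few negation triggers.
INDICATIONS = {"lipitor": {"stroke", "high cholesterol"}}
NEGATION_TRIGGERS = ("never", "no", "not", "without", "denied")


def semantic_filter(drug, event, sentence,
                    indications=INDICATIONS, triggers=NEGATION_TRIGGERS):
    """Return the relation type for one drug-event pair found in a sentence."""
    # Step 1: the event is a known indication of the drug.
    if event.lower() in indications.get(drug.lower(), set()):
        return "Drug Indication"
    # Step 2: a negation trigger precedes the event mention.
    text = sentence.lower()
    position = text.find(event.lower())
    prefix = text[:position] if position >= 0 else text
    for trigger in triggers:
        if re.search(r"\b" + re.escape(trigger) + r"\b", prefix):
            return "Negated Adverse Drug Event"
    # Step 3: otherwise keep the pair as an adverse drug event.
    return "Adverse Drug Event"


print(semantic_filter("Zocor", "fatigue", "I never experienced fatigue when using Zocor."))
print(semantic_filter("Actos", "chest pain", "I had horrible chest pain under Actos."))
print(semantic_filter("Lipitor", "stroke", "Lipitor reduced the risk of stroke by 26%."))
```

The three example sentences are adapted from the post table earlier in these slides.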
Report Source Classification
In order to classify the report source of adverse drug events, we developed a feature-based classification model to distinguish patient reports from hearsay, based on prior studies.
We adopted BOW features and Transductive Support Vector Machines in SVM-light for classification.
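As a rough illustration of bag-of-words (BOW) features, the sketch below counts the words of each sentence and writes them in the sparse "label index:value" line format that SVM-light reads. The tokenizer, the vocabulary construction, the label coding (1 for patient report, -1 for hearsay) and the function names are assumptions for illustration, not the authors' feature set.

```python
import re
from collections import Counter


def bag_of_words(sentence):
    """Count lower-cased word tokens in a sentence."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def build_vocabulary(sentences):
    """Map every word seen in the sentences to a 1-based feature index."""
    words = sorted({w for s in sentences for w in bag_of_words(s)})
    return {w: i for i, w in enumerate(words, start=1)}


def to_svmlight_line(label, sentence, vocabulary):
    """Format one example as '<label> <index>:<count> ...' with increasing indices."""
    counts = bag_of_words(sentence)
    feats = sorted((vocabulary[w], c) for w, c in counts.items() if w in vocabulary)
    return " ".join([str(label)] + [f"{i}:{c}" for i, c in feats])


sentences = [
    "I had headaches on Zocor",
    "My friend said Zocor gave her headaches",
]
vocabulary = build_vocabulary(sentences)
print(to_svmlight_line(1, sentences[0], vocabulary))   # patient report
print(to_svmlight_line(-1, sentences[1], vocabulary))  # hearsay
```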
Evaluation on Medical Entity Extraction
The performance of our system (F-measure) surpasses the best performance in prior studies (F-measure 73.9%), which was achieved by applying UMLS and MedEffect to extract adverse events from DailyStrength (Leaman et al., 2010). There may be several reasons why our approach outperforms prior work.
  Combination of multiple lexicons improves precision.
  DailyStrength is a general health social website where users may have a more diverse health vocabulary and show more linguistic creativity. Extracting medical named entities there could be more difficult than in our data source.
Figure: Results of Medical Entity Extraction (series: Precision, Recall, F-measure)
Evaluation on Adverse Drug Event Extraction
Compared to the co-occurrence based approach (CO), statistical learning (SL) increased precision from around 40% to above 60%, while recall dropped from 100% to around 60%. The F-measure of SL is better than that of the CO method.
Semantic filtering (SF) further improved the precision of extraction from 60% to about 80% by filtering drug indications and negated ADEs.
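The F-measure combines precision and recall as their harmonic mean, which explains how a method can gain in F-measure while giving up recall. The sketch below uses the rounded precision and recall figures quoted in the text ("around 40%", "around 60%"), so its outputs are only approximate illustrations and not the reported results.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the text, used only for illustration.
print(round(f_measure(0.40, 1.00), 3))  # co-occurrence: low precision, full recall
print(round(f_measure(0.60, 0.60), 3))  # statistical learning
```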
Figure: Results of Adverse Drug Event Extraction (Precision, Recall and F-measure of CO, SL and SL+SF on American Diabetes Association, Diabetes Forums and Diabetes Forum)
Evaluation on Report Source Classification
Without report source classification (RSC), the performance of extraction is heavily affected by noise in the discussion.
  The precision ranged from 51% to 62% without RSC.
  Overall performance (F-measure) ranged from 68% to 76%.
After report source classification, the precision and F-measure improved significantly.
  The precision increased from 51% up to 84%.
  The overall performance (F-measure) increased from 68% to above 80%.
Figure: Results of Report Source Classification (series: Precision, Recall, F-measure)
Contrast of Our Proposed Framework to Co-occurrence based approach
A large number of false adverse drug events cannot be filtered out by the co-occurrence based approach.
Based on our approach, only 35% to 40% of all the relation instances contain adverse drug events.
Among them, about 50% come from patient reports.
Figure: Contrast of Our Proposed Framework to Co-occurrence based approach
Forum | Total Relation Instances | Adverse Drug Events | Patient Reported ADEs
American Diabetes Association | 100% (2,972) | 35.97% (1,069) | 21.94% (652)
Diabetes Forums | 100% | 37.98% | 19.74%
Diabetes Forum | 100% | 39.27% | 18.10%
References
*SEM: http://ixa2.si.ehu.es/starsem/
CoNLL: http://ifarm.nl/signll/conll/
SemEval: http://alt.qcri.org/semeval2014/
CLEF eHealth: http://clefehealth2014.dcu.ie/home
BioNLP: http://2013.bionlp-st.org/
i2b2: https://www.i2b2.org/
Benton A., Ungar L., Hill S., Hennessy S., Mao … Kernel for Relation Extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724-731.
Chee B. W., Berlin R., & Schatz B. (2011). Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings Vol. 2011, pp. 217-226.
Culotta, A., & Sorensen,