1 Text Mining Review Slides.pptx


  • 7/24/2019 1 Text Mining Review Slides.pptx

    1/78

    TEXT MINING: TECHNIQUES, TOOLS, ONTOLOGIES AND SHARED TASKS

    14 Spring


    Introduction

    Text mining, also referred to as text data mining, refers to the process of deriving high quality information from text.

    Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics.

    Text mining techniques have been applied in a large number of areas, such as business intelligence, national security, scientific discovery (especially life science), and social media monitoring.


    Introduction

    In this set of slides, we are going to cover:

      The most commonly used text mining techniques
      Ontologies that are often used in text mining
      Open source text mining tools
      Shared tasks in text mining which reflect the hot topics in this area
      A research case which applies text mining techniques to solve a healthcare related problem with social media data


    TEXT MINING TECHNIQUES

      Text Classification
      Sentiment Analysis
      Topic Modeling
      Named Entity Recognition
      Entity Relation Extraction


    Text Classification

    Text classification, or text categorization, is a problem in library science, information science, and computer science. Text classification is the task of choosing the correct class label for a given input.

    Some examples of text classification tasks are:

      Deciding whether an email is spam or not (spam detection).
      Deciding whether the topic of a news article is from a fixed list of topic areas such as "sports", "technology", and "politics" (document classification).
      Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution (word sense disambiguation).


    Text Classification

    Text classification is a supervised machine learning task, as it is built on training corpora containing the correct label for each input. The framework for classification is shown in the figure below.

    (a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.

    (b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.
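The training and prediction steps described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a bag-of-words feature extractor and a multinomial naive Bayes model with add-one smoothing; the function names (`extract_features`, `train`, `predict`) and the four training sentences are made up for the example and do not come from the slides.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def train(labeled_texts):
    """Training: turn (text, label) pairs into a multinomial naive Bayes model."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        features = extract_features(text)
        label_counts[label] += 1
        word_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """Prediction: same feature extractor, then pick the most probable label."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in extract_features(text).items():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing
            p = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy training corpus: (input text, correct label) pairs.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch with the project team", "ham"),
]
model = train(TRAINING)
print(predict(model, "free prize money"))
print(predict(model, "agenda for the team meeting"))
```

The point of the sketch is that `extract_features` is shared by both phases, which is what keeps training and prediction consistent.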


    Text Classification

    Common features for text classification include bag-of-words (BOW), bigrams, trigrams and part-of-speech (POS) tags for each word in the document.

    The most commonly adopted machine learning algorithms for text classification are naïve Bayes, support vector machines, and maximum entropy classifiers.

    Algorithm | Language | Tools
    Naïve Bayes | |
    Support Vector Machines | C++ | SVM-light, mySVM, LibSVM
    Support Vector Machines | MATLAB | SVM Toolbox
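As a rough illustration of the bag-of-words, bigram and trigram features mentioned above, the sketch below counts n-grams over whitespace tokens. The helper names `ngrams` and `ngram_features` are hypothetical, and part-of-speech tags are left out because they would need a separate tagger.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=3):
    """Count unigrams (bag-of-words), bigrams and trigrams in one Counter."""
    tokens = text.lower().split()
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(ngrams(tokens, n))
    return features


features = ngram_features("the bank raised the rate")
print(features[("the",)])                   # unigram count
print(features[("the", "bank")])            # bigram count
print(features[("raised", "the", "rate")])  # trigram count
```

Keeping every n-gram as a tuple key lets one `Counter` serve as the feature set for any of the classifiers listed in the table.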


    Sentiment Analysis

    Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source material.

    The rise of social media such as forums, microblogging and blogs has fueled interest in sentiment analysis.

      Online reviews, ratings and recommendations on social media sites have turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations.
      As businesses look to automate the process of filtering out the noise, identifying relevant content and understanding reviewers' opinions, sentiment analysis is the right technique.


    Sentiment Analysis

    The main tasks, their descriptions and approaches are summarized in the table below:

    Task | Description | Approaches (lexicons / algorithms)
    Polarity Classification | Classifying a given text at the document, sentence, or feature/aspect level into positive, negative or neutral | Lexicon based scoring (SentiWordNet, LIWC); machine learning classification (SVM)
    Affect Analysis | Classifying a given text into affect states such as "angry", "sad", and "happy" | Lexicon based scoring (WordNet-Affect); machine learning classification (SVM)
    Subjectivity Analysis | Classifying a given text into two classes: objective and subjective | Lexicon based scoring (SentiWordNet, LIWC); machine learning classification (SVM)
    Feature/Aspect Based Analysis | Determining the opinions or sentiment expressed on different features or aspects of entities (e.g., the screen [feature] of a cell phone [entity]) | Named entity recognition + entity relation detection (SentiWordNet, LIWC, WordNet; SVM)
    Opinion ... | Detecting the holder of a sentiment (i.e. ... | Named entity ... (SentiWordNet, ...)
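To make "lexicon based scoring" for polarity classification concrete, here is a minimal sketch. It assumes a tiny hand-made lexicon (the words and weights are invented, not taken from SentiWordNet or LIWC) and a deliberately naive rule that a negator flips only the next token.

```python
# Toy polarity lexicon; the words and weights are invented for illustration.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "poor": -1.0, "terrible": -2.0}
NEGATORS = {"not", "never", "no"}


def polarity(text):
    """Sum lexicon scores over tokens, flipping the sign after a negator."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        value = LEXICON.get(word, 0.0)
        score += -value if negate else value
        negate = False  # a negator only affects the next token
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("The screen is great!"))
print(polarity("The battery is not good."))
print(polarity("It arrived on Monday."))
```

A machine learning approach would replace the fixed lexicon with a trained classifier such as an SVM over features of the text.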


    Topic Modeling

    Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents.

    Topic modeling algorithms include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA).

      Among them, Latent Dirichlet Allocation (LDA) is the most commonly used nowadays.

    Topic modeling algorithms can be applied to massive collections of documents.

      Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.

    Topic modeling algorithms can be adapted to many kinds of ...


    Topic Modeling - LDA

    The figure below shows the intuitions behind latent Dirichlet allocation. We assume that some number of "topics", which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
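The generative story just described can be simulated directly. The sketch below is an illustration only: the two "topics" and their word probabilities are hand-written, the Dirichlet draw is obtained by normalising gamma samples, and the names `sample_dirichlet` and `generate_document` are hypothetical. It generates a document; it does not perform inference.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document following the LDA generative story."""
    # 1. Choose this document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Choose a topic assignment for this word position.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # 3. Choose the word from the corresponding topic.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Two hand-written "topics" (distributions over words), for illustration only.
TOPICS = [
    {"gene": 0.5, "dna": 0.3, "genetic": 0.2},
    {"data": 0.5, "computer": 0.3, "number": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 3) for p in theta])
print(words)
```

Inference with LDA runs this story in reverse: given only the words, it estimates the topics and each document's topic proportions.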


    Topic Modeling - LDA

    The figure below shows real inference with LDA. A 100-topic LDA model is fitted to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in the previous figure. At right are the top 15 most frequent words from the most frequent topics found in this article.


    Topic Modeling - Tools

    Name | Model/Algorithm | Language | Author | Notes
    lda-c | Latent Dirichlet allocation | C | D. Blei | This implements variational inference for LDA.
    class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
    lda | R package for Gibbs sampling in many models | | |


    Named Entity Recognition

    A named entity is anything that can be referred to with a proper name.

    Named entity recognition aims to:

      Find spans of text that constitute proper names
      Classify the entities being referred to according to their type

    Type | Sample Categories | Example
    People | Individuals, fictional characters | Turing is often considered to be the father of modern computer science.
    Organization | Companies, parties | Amazon plans to use drone copters for deliveries.
    Location | Mountains, lakes, seas | The highest point in the Catalinas is Mount Lemmon at an elevation of 9,157 feet above sea level.
    Geo-Political | Countries, states, provinces | The Catalinas are located north and northeast of Tucson, Arizona, United States.
    Facility | Bridges, airports | In the late 1940s, Chicago Midway was the busiest airport in the United States by total aircraft operations.
    Vehicles | Planes, trains, cars | The updated Mini Cooper retains its charm and agility.

    In practice, named entity recognition can be extended to types that are not in the table above, such as temporal expressions (times and dates), genes, proteins, and medical concepts (diseases, treatments and medical events).


    Named Entity Recognition

    Named entity recognition techniques can be categorized into knowledge-based approaches and machine learning based approaches.

    Category | Advantage | Disadvantage | Tools / Ontology
    Knowledge-based approach | Requires little training data | Creating lexicons manually is time-consuming and expensive; encoded knowledge might not be portable across domains. | General entity types: WordNet, lexicons created by experts. Medical domain: GATE (University of Sheffield), UMLS (National Library of Medicine), MedLEE (originally from Columbia University, commercialized now)
    Machine learning approach: Conditional Random Field (CRF), Hidden Markov Model (HMM) | Reduced human effort in maintaining rules and dictionaries | Requires a prepared set of annotated training data | Conditional Random Field tools: Stanford NER, CRF++, Mallet. Hidden Markov Model tools: Mallet, Natural Language Toolkit (NLTK)

    Links:
      WordNet: http://wordnet.princeton.edu/
      GATE: http://gate.ac.uk/
      UMLS: http://www.nlm.nih.gov/research/umls/
      MedLEE: http://healthfidelity.com/columbia-grants-health-fidelity-exclusive-license-to-medlee-nlp
      Stanford NER: http://nlp.stanford.edu/software/CRF-NER.shtml
      CRF++: https://code.google.com/p/crfpp/
      Mallet: http://mallet.cs.umass.edu/
      NLTK: http://nltk.org/
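A knowledge-based recognizer can be as simple as a lexicon lookup. The sketch below is a hedged illustration of that idea, assuming a tiny hand-made gazetteer and exact string matching; the `GAZETTEER` entries and the `tag_entities` helper are invented for the example and do not reflect how GATE, UMLS or MedLEE work internally.

```python
import re

# Tiny hand-made gazetteer, for illustration only.
GAZETTEER = {
    "Mount Lemmon": "Location",
    "Chicago Midway": "Facility",
    "Amazon": "Organization",
    "Tucson": "Geo-Political",
    "Arizona": "Geo-Political",
}


def tag_entities(text):
    """Return (start, end, surface, type) for every gazetteer match."""
    # Longest names first so that multi-word names win over their parts.
    names = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, names)) + r")\b")
    return [(m.start(), m.end(), m.group(1), GAZETTEER[m.group(1)])
            for m in pattern.finditer(text)]


sentence = "Amazon may test drone deliveries near Tucson, Arizona."
for start, end, surface, entity_type in tag_entities(sentence):
    print(surface, entity_type)
```

The sketch also shows the stated disadvantage: every name must be entered by hand, and an ambiguous entry such as "Amazon" always receives the same type regardless of context.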

    Entity Relation Extraction

    Entity relation extraction discerns the relationships that exist among the entities detected in a text. Entity relation extraction techniques are applied in a variety of areas.

    Question Answering
      Extracting entities and relational patterns for answering factoid questions

    Feature/Aspect based Sentiment Analysis
      Extracting relational patterns among entities, features and sentiments in text (entity, feature, sentiment)

    Mining bio-medical texts
      Protein binding relations useful for drug discovery
      Detection of gene-disease relations from biomedical literature
      Finding drug-side effect relations in health social media


    Entity Relation Extraction

    Entity relation extraction approaches can be categorized into three types:

    Category | Method | Advantage | Disadvantage | Tools
    Co-occurrence Analysis | If two entities co-occur within a certain distance, they are considered to have a relation | Simplicity and flexibility; high recall | Low precision; can't decide relation types |
    Rule-based approaches | Create rules for relation extraction based on syntactic and semantic information in the sentences | General, flexible | Lower portability across different domains; manual encoding of syntactic and semantic rules | Syntactic information: Stanford Parser; OpenNLP. Semantic information: domain knowledge bases
    Supervised Learning | Feature-based methods: feature representation. Kernel-based methods: ... | Little or no manual development of rules and templates | Annotated corpora are required. | Dan Bikel's parser; MST parser; Stanford parser; SVM classifier: ...
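Co-occurrence analysis is simple enough to sketch in full. The example below is a minimal illustration that assumes entities have already been recognized and are identified by token position; the sentence, the entity list and the helper name `cooccurring_pairs` are all made up.

```python
from itertools import combinations


def cooccurring_pairs(entities, max_distance):
    """Pair up entities whose token positions are within max_distance.

    `entities` is a list of (token_index, name) tuples for one sentence.
    """
    pairs = []
    for (i, a), (j, b) in combinations(sorted(entities), 2):
        if abs(j - i) <= max_distance:
            pairs.append((a, b))
    return pairs


# Invented example sentence with four pre-recognized entity mentions.
tokens = "aspirin can cause nausea but ibuprofen rarely leads to severe dizziness".split()
KNOWN_ENTITIES = {"aspirin", "nausea", "ibuprofen", "dizziness"}
entities = [(i, t) for i, t in enumerate(tokens) if t in KNOWN_ENTITIES]
print(cooccurring_pairs(entities, max_distance=3))
```

With a window of 3 the sketch pairs "nausea" with "ibuprofen" although the sentence asserts no such relation, which illustrates the low precision noted in the table; it also says nothing about what kind of relation holds.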


    Supervised Learning Approaches for Entity Relation Extraction

    The supervised learning approach breaks relation extraction into two subtasks (relation detection and relation classification). Each subtask is a text classification problem.

      Classifier 1: Detect when a relation is present between two entities
      Classifier 2: Classify the relation types

    Supervised learning approaches can be categorized into feature based methods and kernel based methods.

    [Figure: Sentences -> Text Analysis (POS, Parse Trees) -> Feature Extraction (feature based methods) or Kernel Function (kernel based methods) -> Classifier]


    Supervised Learning Approach to Entity Relation Extraction

    Feature based methods rely on features to represent instances for classification. The features for relation extraction can be categorized into:

    Entity-based features | Word-based features | Syntactic features
    Entity types of the two candidate arguments | Bag-of-words and bag-of-bigrams between entities | Presence of particular constructions in a constituent structure
    Concatenation of the two entity types | Stemmed version of bag-of-words and bag-of-bigrams between entities | Chunk based-phrase paths
    Headwords | Words and stems immediately preceding and following the entities | Bags of chunk heads
    Bag-of-words from the arguments | Distance in words between the arguments | Dependency-tree paths
    | Number of entities between the arguments | Constituent-tree paths
    | | Tree distance between the arguments
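A few of the entity-based and word-based features can be computed from tokens and entity spans alone. The sketch below is illustrative: the function name `relation_features`, the span format, and the choice of the last token of a span as its headword are all assumptions, and the syntactic features are omitted because they require a parser.

```python
def relation_features(tokens, arg1, arg2):
    """Build a feature dict for one candidate entity pair.

    arg1 and arg2 are (start, end, entity_type) token spans, with arg1 first.
    """
    start1, end1, type1 = arg1
    start2, end2, type2 = arg2
    between = [t.lower() for t in tokens[end1:start2]]
    features = {
        # Entity-based features
        "type1": type1,
        "type2": type2,
        "type_pair": type1 + "-" + type2,
        "head1": tokens[end1 - 1].lower(),  # assumption: last token is the head
        "head2": tokens[end2 - 1].lower(),
        # Word-based features
        "word_distance": len(between),
        "word_before_arg1": tokens[start1 - 1].lower() if start1 > 0 else "BOS",
        "word_after_arg2": tokens[end2].lower() if end2 < len(tokens) else "EOS",
    }
    for word in between:
        features["between=" + word] = True
    for left, right in zip(between, between[1:]):
        features["between_bigram=" + left + "_" + right] = True
    return features


tokens = "Amazon plans to open an office in Tucson".split()
features = relation_features(tokens, (0, 1, "ORG"), (7, 8, "GPE"))
print(features["type_pair"], features["word_distance"])
```

A dictionary like this would be the input to the two classifiers of the supervised approach: one to detect a relation, one to label its type.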


    Supervised Learning Approach to Entity Relation Extraction

    Kernel-based methods are an effective alternative to explicit feature extraction.

      They retain the original representation of objects and use the objects only via computing a kernel function between a pair of objects.
      The kernel K(x,y) defines similarity between objects x and y implicitly in a higher dimensional space.

    Commonly used kernel functions for relation extraction are:

    Author | Kernel | Description | Node attributes
    Zelenko et al. 2003 | Shallow parse tree kernel | Uses shallow parse trees | Entity type, word, POS tag
    Culotta et al. 2004 | Dependency tree kernel | Uses dependency parse trees | Word, POS, generalized POS, chunk tag, entity type, entity level
    Bunescu et al. 2005 | Shortest dependency path kernel | Shortest path between entities in a dependency tree | Word, POS, generalized POS, entity type
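To give a feel for what a kernel K(x,y) over structured objects looks like, here is a simplified sketch in the spirit of the shortest dependency path kernel. It assumes each path is already given as a list of feature sets (one per position) and scores two paths of equal length by multiplying the number of shared features at each position. The paths and feature values are invented, and this is not a reproduction of any published implementation.

```python
def path_kernel(path_x, path_y):
    """Similarity of two dependency paths given as lists of feature sets.

    Returns 0 when the paths differ in length; otherwise the product, over
    positions, of the number of features the two paths share at that position.
    """
    if len(path_x) != len(path_y):
        return 0
    similarity = 1
    for features_x, features_y in zip(path_x, path_y):
        similarity *= len(features_x & features_y)
    return similarity


# Each position carries a word, a POS tag and (for entities) an entity type.
path_a = [{"protesters", "NNS", "PERSON"}, {"->"}, {"seized", "VBD"},
          {"<-"}, {"stations", "NNS", "FACILITY"}]
path_b = [{"troops", "NNS", "PERSON"}, {"->"}, {"raided", "VBD"},
          {"<-"}, {"churches", "NNS", "FACILITY"}]
print(path_kernel(path_a, path_b))  # 2 * 1 * 1 * 1 * 2 = 4
```

The two paths share no words, yet the kernel still finds them similar through their POS tags and entity types, without an explicit feature vector ever being built.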


    Ontology

    An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties, and interrelationships of those concepts.

    In text mining, ontologies are often used to extract named entities, detect entity relations and conduct sentiment analysis. Commonly used ontologies are listed in the table below:

    Name | Creator | Description | Application
    WordNet | Princeton University | A large lexical database of English. | Word sense disambiguation; text summarization; text similarity analysis
    SentiWordNet | Andrea Esuli, Fabrizio Sebastiani | SentiWordNet is a lexical resource for opinion mining. | Sentiment analysis
    Linguistic Inquiry and Word Count (LIWC) | | |


    WordNet

    WordNet is an online lexical database in which English nouns, verbs, adjectives and adverbs are organized into sets of synonyms.

      Each word represents a lexicalized concept.
      Semantic relations link the synonym sets (synsets).

    WordNet contains more than 118,000 different word forms and more than 90,000 senses.

      Approximately 17% of the words in WordNet are polysemous (have more than one sense); 40% ...


    WordNet

    Six semantic relations are presented in WordNet because they apply broadly throughout English and because a user need not have advanced training in linguistics to understand them. The table below shows the included semantic relations.

    WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, text classification, text summarization, machine translation and semantic textual similarity analysis.

    Semantic Relation | Syntactic Category | Examples
    Synonymy (similar) | Noun, Verb, Adjective, Adverb | pipe, tube; rise, ascent; sad, unhappy; rapidly, speedily
    Antonymy (opposite) | Adjective, Adverb | wet, dry; powerful, powerless; rapidly, slowly
    Hyponymy (subordinate) | Noun | maple, tree; tree, plant
    Meronymy (part) | Noun | brim, hat; ship, fleet
    Troponymy (manner) | Verb | march, walk; whisper, speak
    Entailment | Verb | drive, ride; divorce, marry
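Hyponymy links form a hierarchy that applications such as similarity analysis walk upward. The sketch below illustrates this with a toy dictionary instead of the real WordNet database: the "maple, tree" and "tree, plant" links follow the examples in the table, while "oak", "rose" and the helper names are added purely for illustration.

```python
# Toy hyponymy ("is a kind of") links; a real system would read WordNet.
HYPERNYM = {"maple": "tree", "oak": "tree", "tree": "plant", "rose": "plant"}


def hypernym_chain(word):
    """Follow hyponymy links upward and return all ancestors of a word."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def lowest_common_hypernym(a, b):
    """Return the closest shared ancestor of two words, or None."""
    ancestors_a = [a] + hypernym_chain(a)
    ancestors_b = set([b] + hypernym_chain(b))
    for candidate in ancestors_a:
        if candidate in ancestors_b:
            return candidate
    return None


print(hypernym_chain("maple"))
print(lowest_common_hypernym("maple", "oak"))
print(lowest_common_hypernym("maple", "rose"))
```

The depth at which two words first share an ancestor is one simple signal for semantic similarity: "maple" and "oak" meet at "tree", while "maple" and "rose" only meet at "plant".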


    SentiWordNet

    SentiWordNet is a lexical resource explicitly devised for supporting sentiment analysis and opinion mining applications.

    SentiWordNet is the result of the automatic annotation of all the synsets of WordNet according to the notions of "positivity", "negativity" and "objectivity".

    Each of the "positivity", "negativity" and "objectivity" scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset.

    The figure above shows the graphical representation adopted by SentiWordNet for representing the opinion-related properties of a term sense.
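Because the three scores sum to 1.0 for each synset, storing positivity and negativity is enough to recover objectivity. The sketch below encodes just that constraint; the class name `SenseScores` and the example values are hypothetical and do not come from the SentiWordNet data files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseScores:
    """Opinion scores for one synset; the three values sum to 1.0."""
    positivity: float
    negativity: float

    def __post_init__(self):
        if not (0.0 <= self.positivity <= 1.0 and 0.0 <= self.negativity <= 1.0):
            raise ValueError("scores must lie in [0.0, 1.0]")
        if self.positivity + self.negativity > 1.0:
            raise ValueError("positivity + negativity cannot exceed 1.0")

    @property
    def objectivity(self):
        # The three scores sum to 1.0, so objectivity is the remainder.
        return 1.0 - (self.positivity + self.negativity)


sense = SenseScores(positivity=0.75, negativity=0.0)  # made-up values
print(sense.objectivity)
```

A fully objective sense would have positivity and negativity both equal to 0.0 and objectivity equal to 1.0.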


    SentiWordNet

    In SentiWordNet, different senses of the same term may have different opinion-related properties.

    The figure above shows the visualization of opinion-related properties of the term estimable in SentiWordNet (http://sentiwordnet.isti.cnr.it/search.php?q=estimable).

    [Figure labels: search term; Sense 1, Sense 2, Sense 3; positivity, objectivity and negativity score; synonym of estimable in this sense]

    Linguistic Inquiry and Word Count (LIWC)

    Linguistic Inquiry and Word Count (LIWC) is a text analysis program that looks for and counts words in psychology-relevant categories across text files.

    Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including to show attentional focus, emotionality, social relationships, thinking styles, and individual differences.

    LIWC is often adopted in NLP applications for sentiment analysis, affect analysis, deception detection, etc.


    Linguistic Inquiry and Word Count (LIWC)

    The LIWC program has two major components: the processing component and the dictionaries.

    Processing
      Opens a series of text files (posts, blogs, essays, novels, and so on)
      Each word in a given text is compared with the dictionary file.

    Dictionaries: the collection of words that define a particular category
      English dictionary: over 455,555 words across over 65 categories examined by human experts.
      Major categories: functional words, social processes, affective processes, positive emotion, negative emotion, cognitive processes, biological processes, relativity, etc.
      Multilingual: Arabic, Chinese, Dutch, French, German, Italian, Portuguese, Russian, Serbian, Spanish and Turkish.
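The processing step, comparing each word with the dictionary and counting hits per category, can be sketched as follows. The three-category mini dictionary is invented for illustration (it borrows only the category names from the list above), and the function name `category_percentages` is hypothetical; the actual LIWC dictionaries and software behave in ways this sketch does not attempt to reproduce.

```python
from collections import Counter

# Made-up mini dictionary; the real LIWC dictionaries are far larger.
CATEGORY_WORDS = {
    "positive emotion": {"happy", "glad", "love"},
    "negative emotion": {"sad", "worried", "hate"},
    "social processes": {"friend", "family", "talk"},
}


def category_percentages(text):
    """Return, per category, the percentage of words in the text that match."""
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return {}
    counts = Counter()
    for word in words:
        for category, vocabulary in CATEGORY_WORDS.items():
            if word in vocabulary:
                counts[category] += 1
    return {category: 100.0 * counts[category] / len(words)
            for category in CATEGORY_WORDS}


result = category_percentages("I am worried, but my family and my friend talk to me.")
for category, percent in result.items():
    print(category, round(percent, 2))
```

Reporting percentages rather than raw counts makes texts of different lengths comparable, which matters when contrasting a short forum post with a long essay.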


    Linguistic Inquiry and Word Count (LIWC)

    [Figure: LIWC categories, LIWC results from the input text, and LIWC results from personal text and formal writing for comparison]

    Input text: a post from a 40-year-old female member of the American Diabetes Association online community

    LIWC online demo: http://www.liwc.net/tryonlineresults.php

    Unified Medical Language System (UMLS)

    The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the US National Library of Medicine.

    UMLS integrates over 2.5 million names for 900,551 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts.

    Ontologies integrated in the UMLS Metathesaurus include the NCBI taxonomy, Gene Ontology (GO), the Medical Subject Headings (MeSH), Online Mendelian Inheritance in Man (OMIM), the University of Washington Digital Anatomist symbolic knowledge base (UWDA) and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT).


    Unified Medical Language System (UMLS)

    Name | Creator | Description | Application
    National Center for Biotechnology Information (NCBI) Taxonomy | National Library of Medicine | All of the organisms in the public sequence database | Identify organisms
    University of Washington Digital Anatomist Source Information (UWDA) | University of Washington Structural Informatics Group | Symbolic models of the structures and relationships that constitute the human body | Identify terms in anatomy
    Gene Ontology (GO) | Gene Ontology Consortium | Gene product characteristics and gene product annotation data | Gene product annotation
    Medical Subject Headings (MeSH) | National Library of Medicine | Vocabulary thesaurus used for indexing articles for PubMed | Cover terms in biomedical literature
    Online Mendelian Inheritance in Man (OMIM) | McKusick-Nathans Institute of Genetic Medicine | |


    Accessing UMLS data
      No fee associated, license agreement required
      Available for research purposes, restrictions apply for other kinds of applications

    UMLS related tools
      MetamorphoSys (command line program)
        UMLS installation wizard and customization tool
        Selecting concepts from a given sub-domain
        Selecting the preferred name of concepts
      MetaMap


    MedEffect

    MedEffect is the Canada Vigilance Adverse Reaction Online Database, which contains information about suspected adverse reactions to health products.

      Reports are submitted by consumers and health professionals
      Contains a complete list of medications, adverse reactions and drug indications (medical conditions for legitimate use of a medication)

    MedEffect is often used in healthcare research for annotating medications and adverse reactions in text (Leaman et al. 2010; Chee et al. 2011).


    Consumer Health Vocabulary (CHV)

    Consumer Health Vocabulary (CHV) is a lexicon linking UMLS standard medical terms to health consumer vocabulary.

      Laypeople use a different vocabulary from healthcare professionals to describe medical problems.
      CHV helps to bridge the communication gap between consumers and healthcare professionals by mapping the UMLS standard medical terms to consumer health language.

    It has been applied in prior studies to better understand and match user expressions for medical entity extraction in social media (Yang et al. 2012; Benton et al. 2011).


    FDA's Adverse Event Reporting System (FAERS)

    FDA's Adverse Event Reporting System (FAERS) documents adverse drug event reports and drug indications for all the medical products in the US market.

      Reports are submitted by consumers, health professionals, pharmaceutical companies and researchers.
      Contains a complete list of medical products in the United States and their suspected adverse reactions

    FAERS has been applied in healthcare research for medical named entity recognition and adverse drug event extraction (Bian et al. 2012; Liu et al. 2013).


    A LIST OF OPEN SOURCE NLP TOOLKITS


    Name | Main Features | Language | Creators | Website
    Antelope framework | Part-of-speech tagging, dependency parsing, WordNet lexicon | C#, VB.net | Proxem | [1]
    Apertium | Machine translation for language pairs from Spanish, English, French, Portuguese, Catalan and Occitan | C++, ... | |


    Name | Main Features | Language | Creators | Website
    Learning Based Java | POS tagger, chunking, coreference resolution, named entity recognition | | |


    Name | Main Features | Language | Creators | Website
    OpenNLP | Tokenization, sentence segmentation, POS tagging, named entity extraction, chunking, parsing, coreference resolution | | |


    Name | Main Features | Language | Creators | Website
    UIMA | Industry standard for content analytics; contains a set of rule based and machine learning annotators and tools | | |


    SHARED TASKS (COMPETITIONS) IN HEALTHCARE AND NATURAL LANGUAGE PROCESSING DOMAINS


    Introduction

    Shared task series in Natural Language Processing often represent community-wide trends and hot topics which have not been fully explored in the past.

    To keep up with state-of-the-art techniques and new research topics in the NLP community, we explore major conferences, workshops, and special interest groups belonging to the Association for Computational Linguistics (ACL).

    We organize our findings into two categories: ongoing shared tasks and a watch list.

    The ongoing list contains competitions that have already made task descriptions, data and schedules for 2014 publicly available.
      International Workshop on Semantic Evaluation (SemEval)
      CLEF eHealth Evaluation Lab

    The watch list contains competitions that haven't made content available but are relevant to our interests.
      Conference on Natural Language Learning (CoNLL) Shared Tasks


    SemEval

    Overview

    SemEval, the International Workshop on Semantic Evaluation, is an ongoing series of evaluations of computational semantic analysis systems. It evolved from the SensEval (word sense evaluation) series.

    SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics, is the umbrella organization for SemEval.

    SemEval-2014 will be the 8th workshop on semantic evaluation. The workshop will be co-located with the 25th International Conference on Computational Linguistics (COLING) in Dublin, Ireland.


    SemEval

    Past workshops

    Workshop | No. of Tasks | Areas of study | Languages of Data Evaluated
    Senseval-1 (1998) | 3 | Word Sense Disambiguation (WSD) - Lexical Sample WSD tasks | English, French, Italian
    Senseval-2 (2001) | 12 | Word Sense Disambiguation (WSD) - Lexical Sample, All Words, Translation WSD tasks | Czech, Dutch, English, Estonian, Basque, Chinese, Danish, Italian, Japanese, Korean, Spanish, Swedish
    Senseval-3 (2004) | 16 | Logic Form Transformation, Machine Translation (MT) Evaluation, Semantic Role Labeling, WSD | Basque, Catalan, Chinese, English, Italian, Romanian, Spanish
    SemEval-2007 | 19 | Cross-lingual, Frame Extraction, Information Extraction, Lexical Substitution, Lexical Sample, Metonymy, Semantic Annotation, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Time Expression, WSD | Arabic, Catalan, Chinese, English, Spanish, Turkish
    SemEval-2010 | 18 | Co-reference, Cross-lingual, Ellipsis, Information Extraction, Lexical Substitution, Metonymy, Noun Compounds, Parsing, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Textual Entailment, Time Expressions, WSD | Catalan, Chinese, Dutch, English, French, German, Italian, Japanese, Spanish
    SemEval-2012 | 8 | Common Sense Reasoning, Lexical Simplification, Relational Similarity, Spatial Role Labeling, Semantic Dependency Parsing, Semantic and Textual Similarity | Chinese, English
    SemEval-2013 | 14 | Temporal Annotation, Sentiment Analysis, Spatial Role Labeling, Noun Compounds, Phrasal Semantics, Textual Similarity, Response Analysis, Cross-lingual Textual Entailment, BioMedical Texts, Cross and Multi-lingual WSD, Word Sense Induction, and Lexical Sample | Catalan, French, German, English, Italian, Spanish


    SemEval-2014

    Task ID | Task Name | Description | Data
    1 | Evaluation of compositional distributional semantic models (CDSMs) on full sentences | Subtask A: predicting the degree of relatedness between two sentences. Subtask B: detecting the entailment relation holding between them. | 10,000 English sentence pairs, each annotated for relatedness score in meaning and the entailment relation (entail, contradiction, and neutral) between the two sentences.
    2 | Grammar Induction for Spoken Dialogue Systems | Creating clusters consisting of semantically similar fragments. For example, the two fragments "depart from <City>" and "fly out of <City>" are in the same cluster as they refer to the concept of departure city. | Training data will cover two domains: air travel and tourism. The data will be available in two languages: Greek and English.
    3 | Cross-level semantic similarity | Evaluating similarity across different sizes of text: paragraph to sentence, sentence to phrase, phrase to word and word to sense. | Information about data hasn't been released yet.
    4 | Aspect Based Sentiment Analysis | Subtask 1: Aspect term extraction. Subtask 2: Aspect term polarity. Subtask 3: Aspect category detection. Subtask 4: Aspect category polarity. | Two domain-specific datasets (restaurant reviews and laptop reviews), consisting of over 6,500 sentences with fine-grained aspect-level human-authored annotations, will be provided.
    5 | L2 writing assistant | Build a translation assistance system that concerns the translation of fragments of one language (L1), i.e. words or phrases, in a second language (L2) context. For example, input (L1 French, L2 English): I rentre à la ... | The data set covers the following L1 and L2 pairs: English-German, English-Spanish, French-English and Dutch-English. The trial data contains 500 sentences for each language pair. Information about ...


    SemEval-2014

    Task ID | Task Name | Description | Data
    6 | Spatial Robot Commands | Parse spatial robot commands using data from an annotated corpus, collected from a simplified 'blocks world' game (http://www.trainrobot.com) | In trial data, each natural language command is annotated into a robot command. "Move the blue block on top of the grey block." is labeled as (event: (action: move) (entity: (color: blue) (type: cube)) (destination: (spatial-relation: (relation: above) (entity: (color: gray) (type: cube)))))
    7 | Analysis of Clinical Text | Combine supervised methods for entity/acronym/abbreviation recognition and mapping to UMLS CUIs (Concept Unique Identifiers) with unsupervised discovery and sense induction of the entities/acronyms/abbreviations. | Information about data hasn't been released yet.
    8 | Broad-Coverage and Cross-Framework Semantic Dependency Parsing | This task seeks to stimulate more generalized semantic dependency parsing and give a more direct analysis of 'who did what to whom' from sentences. | In trial data, 198 sentences from WSJ are annotated with the desired semantic representation.
    9 | Sentiment Analysis for Twitter | Subtask A - Contextual Polarity Disambiguation: Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. Subtask B - Message Polarity Classification: Given a message, decide whether the message is of positive, negative, or neutral sentiment. | Training: 9,728 Twitter messages. Development: 1,654 Twitter messages (can be used for training as well). Development-test A: 3,814 Twitter messages (CANNOT be used for training). Development-test B: 2,094 SMS messages (CANNOT be used for training). The annotations and systems will use a ...

    SemEval-2014

    Important Dates
    Trial data ready: Oct. 30, 2013
    Training data ready: Dec. 15, 2013
    Test data ready: Mar. 10, 2014
    Evaluation end: Mar. 30, 2014
    Paper submission due: Apr. 30, 2014
    Paper reviews due: May 30, 2014
    Camera ready due

    CLEF eHealth Evaluation Lab

    Overview
    The CLEF Initiative (Conference and Labs of the Evaluation Forum) is a self-organized body whose main mission is to promote research, innovation, and development of information access systems with an emphasis on multilingual and multimodal information with various levels of structure.

    Started in 2000, CLEF aims to stimulate investigation and research in a wide range of key areas in the information retrieval domain, and has become well known in the international IR community. The results were traditionally presented and discussed at annual workshops in conjunction with the European Conference for Digital Libraries (ECDL), now called Theory and Practice on Digital Libraries (TPDL).

    CLEF eHealth Evaluation Lab

    Overview
    In 2013, CLEF started the eHealth Evaluation Lab, a shared task focused on natural language processing (NLP) and information retrieval (IR) for clinical care.

    The CLEF eHealth Evaluation Lab 2013 has three tasks:
    Annotation of disorder mention spans from clinical reports
    Annotation of acronym/abbreviation mention spans from clinical reports
    Information retrieval on medical related web documents

    CLEF eHealth 2014

    Task 1: Visual-Interactive Search and Exploration of eHealth Data
    Description:
    Subtask A: visualize the discharge summary together with the disorder standardization and shorthand expansion data in an effective and understandable way for laypeople.
    Subtask B: design a visual exploration approach that will provide an effective overview over a larger set of possibly relevant documents to meet the patient's information need.
    Data: de-identified discharge summaries and 50 real patient search queries generated from the discharge summary

    Task 2: Information extraction from clinical text
    Description: Develop annotated data, resources and methods that make clinical documents easier to understand from nurses' and patients' perspective.
    10 different attributes (Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression) should be captured from clinical text and classified into certain value slots.
    Data: A set of de-identified clinical reports is provided by the MIMIC II database.
    A training set of 300 reports and their disease/disorder mention templates with filled attribute: value slots will be provided.
    A test set of 200 reports and their disease/disorder mention templates with default-filled attribute: value slots will be provided for the Task 2 challenge one week before the run submission deadline.

    Task 3: User-centered health information retrieval
    Description:
    Subtask A: monolingual information retrieval task - retrieve the relevant medical documents for the user queries.
    Subtask B: multilingual information retrieval task - German, French and Czech.
    Data: A set of medical-related documents in four languages (English, German, French and Czech) is provided by the Khresmoi project (approximately 1 million medical documents for each language). 5 training queries and 50 test queries are provided.

    CLEF eHealth 2014

    Important Dates
    CLEF 2014 Lab registration opens: Nov. 2013
    Task data release begins: Nov. 15, 2013
    Participant submission deadline (final submission to be evaluated): May 1, 2014
    Results released

    CoNLL

    Overview
    CoNLL, the Conference on Natural Language Learning, is a yearly meeting of the Special Interest Group on Natural Language Learning (SIGNLL) of the Association for Computational Linguistics (started in 1997).

    Since 1999, CoNLL has included a shared task in which training and test data are provided by the organizers, which allows participating systems to be evaluated and compared in a systematic way. Descriptions of the systems and evaluations of their performance are presented both at the conference and in the proceedings.

    The last CoNLL was held in August 2013 in Sofia, Bulgaria. Information about CoNLL 2014 and its shared task will be released next month.

    CoNLL

    Recent shared tasks from CoNLL

    Year: 2013
    Task: Grammatical Error Correction
    Data: National University of Singapore Corpus of Learner English (NUCLE)
    Language: English

    Year: 2012
    Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
    Data: OntoNotes dataset from Linguistic Data Consortium
    Language: Arabic, Chinese, English

    Year: 2011
    Task: Modeling Unrestricted Coreference in OntoNotes
    Data: OntoNotes dataset from Linguistic Data Consortium
    Language: English

    Year: 2010
    Task:
    Subtask A: Learning to detect sentences containing uncertainty
    Subtask B: Learning to resolve the in-sentence scope of hedge cues
    Data:
    A: biological abstracts and full articles from the BioScope (biomedical domain) corpus
    B: paragraphs from Wikipedia possibly containing weasel information
    Language: English

    Year: 2009
    Task: Syntactic and Semantic Dependencies in Multiple Languages
    Data: Data with gold standard annotation of syntactic dependency, type of dependency, frame, role set and sense in multiple languages
    Language: English, Catalan, Chinese, Czech, German,

    *SEM

    Overview

    *SEM 2012 shared task:
    Description: Resolving the scope and the focus of negation.
    Data: Stories by Conan Doyle, and WSJ PropBank data (about 8,000 sentences in total). All occurrences of negation, their scope and focus are annotated.

    *SEM 2013 shared task:
    Description: Create a unified framework for the evaluation of semantic textual similarity modules and characterize their impact on NLP applications.
    Data: The data covers 5 areas: paraphrase sentence pairs (MSRpar), sentence pairs from video descriptions (MSRvid), MT evaluation sentence pairs (MTnews and MTeuroparl) and gloss pairs (OnWN).

    BioNLP

    Overview
    BioNLP shared tasks are organized by the ACL's Special Interest Group for biomedical natural language processing.

    BioNLP 2013 was the twelfth workshop on biomedical natural language processing, held in conjunction with the annual ACL or NAACL meeting.

    BioNLP shared tasks are a biennial event held with the BioNLP workshop, started in 2009. The next event will be held in 2015.

    BioNLP Past Shared Tasks

    Year: 2013 (data released Oct. 2012, end date Apr. 2013)
    1. Genia Event Extraction for NFkB knowledge base construction. Data: NFkB knowledge base
    2. Cancer Genetics. Data: PubMed literature
    3. Pathway Curation. Data: PubMed abstracts
    4. Corpus Annotation with Gene Regulation Ontology. Data: PubMed literature
    5. Bacteria Biotopes. Data: Webpage documents with general information about bacteria species
    7. Gene Regulation Network in Bacteria. Data: PubMed abstracts

    Year: 2011 (data released Dec. 2010, end date Apr. 2011)
    1. GENIA. Data: PubMed abstracts
    2. Epigenetics and Post-translational Modifications. Data: PubMed abstracts
    3. Infectious Diseases. Data: PubMed abstracts
    4. Bacteria Biotopes. Data: PubMed abstracts
    5. Bacteria Interactions. Data: PubMed abstracts
    6. Co-reference. Data: PubMed abstracts
    7. Gene/Protein Entity Relations. Data: PubMed abstracts
    8. Gene renaming. Data: PubMed abstracts

    Year: 2009 (data released Dec. 15, 2008, end date Mar. 30, 2009)
    1. Core event extraction (identify events concerning the given proteins). Data: PubMed abstracts
    2. Event enrichment. Data: PubMed abstracts

    i2b2 Challenges

    Informatics for Integrating Biology and the Bedside (i2b2) is an NIH funded National Center for Biomedical Computing (NCBC).

    The i2b2 center organizes data challenges to motivate the development of scalable computational frameworks to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health.

    i2b2 challenge workshops are held in conjunction with the Annual Meeting of the American Medical Informatics Association.

    Previous i2b2 Challenges

    Year | Task | Data | Release Date | End Date
    2012 | Temporal relation extraction | EHR

    APPLYING TEXT MINING IN HEALTH SOCIAL MEDIA RESEARCH: AN EXAMPLE

    Extracting Adverse Drug Events from Health Social Forums

    Online patient forums can provide valuable supplementary information on drug effectiveness and side effects.
    Those forums cover a large and diverse population and contain data directly from patients.

    Patient forum ADE reports can serve as an economical alternative to expensive and time-consuming patient-oriented drug safety data collection projects.
    They can help to generate new clinical hypotheses, cross-validate the adverse drug events detected from other data sources, and conduct comparison studies.

    Post ID | Post Content | Contains ADE? | Report source
    9043 | I had horrible chest pain [Event] under Actos [Treatment]. | ADE | Patient
    12200 | From what you have said, it seems that Lantus [Treatment] has had some negative side effects related to depression [Event] and mood swings [Event]. | ADE | Hearsay
    25139 | I never experienced fatigue [Event] when using Zocor [Treatment]. | Negated ADE | Patient
    34188 | When taking Zocor [Treatment], I had headaches [Event] and bruising [Event]. | ADE | Patient
    63828 | Another study of people with multiple risk factors for stroke [Event] found that Lipitor [Treatment] reduced the risk of stroke [Event] by 26% compared to those taking a placebo, the company said. | Drug Indication | Diabetes research

    Test Bed

    Forum Name | Number of Posts | Number of Topics | Number of Member Profiles | Time Span | Total Number of Sentences
    American Diabetes Association | 184,874 | 2,084 | 6,544 | 2009.2-2012.11 | [illegible]
    Diabetes Forums | [illegible] | 45,830 | 12,075 | 2002.2-2012.11 | 3,303,804
    Diabetes Forum | 7,444 | 6,474 | 3,007 | 2007.2-2012.11 | 422,355

    Discussion about disease and medical problems
    Discussion about disease monitoring and medical products

    Extracting Adverse Drug Events from Health Social Forums

    Challenges
    Topics in patient social media cover various sources, including news and research, hearsay (stories of other people) and patients' experience. Redundant and noisy information often masks patient-experienced ADEs.

    Currently, extracting adverse event and drug relations in patient comments results in low precision due to confounding with drug indications (legitimate medical conditions a drug is used for) and negated ADEs (contradiction or denial of experiencing ADEs) in sentences.

    Solutions
    Develop a relation extractor for recognizing and extracting adverse drug event relations.

    Develop a text classifier to extract adverse drug event reports based on patient experience.

    Extracting Adverse Drug Events from Health Social Forums

    Patient Forum Data Collection: collect patient forum data through a web crawler.

    Data Preprocessing: remove noisy text including URLs, duplicated punctuation, etc., and separate each post into individual sentences.

    Medical entity extraction: identify treatments and adverse events discussed in the forum.

    Adverse drug event extraction: identify drug-event pairs indicating an adverse drug event based on the results of medical entity extraction.

    Report source classification: classify the source of reported events as either patient experience or hearsay.
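The data preprocessing step above can be sketched in a few lines. This is only an illustration under stated assumptions: the regular expressions, the function name `preprocess_post` and the naive sentence splitter are hypothetical and are not taken from the slides, which do not say how the cleaning was implemented.

```python
import re

# Assumed patterns (not from the slides): URLs, runs of the same
# punctuation mark, and whitespace that follows sentence-final punctuation.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
REPEATED_PUNCT = re.compile(r"([!?.,;:])\1+")
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def preprocess_post(post):
    """Remove URLs, collapse duplicated punctuation, split into sentences."""
    text = URL_PATTERN.sub(" ", post)
    text = REPEATED_PUNCT.sub(r"\1", text)    # "!!!" becomes "!"
    text = re.sub(r"\s+", " ", text).strip()  # normalise whitespace
    return [s for s in SENTENCE_END.split(text) if s]


if __name__ == "__main__":
    example = ("I started Lantus last week!!! "
               "See http://example.com/info for details... Any advice??")
    for sentence in preprocess_post(example):
        print(sentence)
```

A production pipeline would more likely rely on a trained sentence splitter, since forum text often lacks reliable punctuation.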

    Medical Entity Extraction

    Initialize the medical entity extraction with MetaMap to match terms related to drugs and ADEs in forum discussion.

    Filter the terms extracted by MetaMap that never appear in FAERS reports.

    Query the Consumer Health Vocabulary for consumer preferred terms of the entities extracted by MetaMap and look up those consumer vocabularies in the discussions.

    MetaMap is a

    Adverse Drug Event Extraction

    Kernel based statistical learning
    Feature generation: Generate representations of the relation instances.
    Syntactic and semantic classes mapping: Categorize lexical features into syntactic and semantic classes to reduce the feature sparsity.
    Shortest dependency path kernel: Compute the similarity score between two relation instances.

    Semantic filtering
    Drug indications from FAERS: Incorporate medical domain knowledge for differentiating drug indications from adverse events.
    NegEx (https://code.google.com/p/negex/): Incorporate linguistic knowledge to identify negated adverse drug events.
    Semantic templates: Form filtering templates using the knowledge from FAERS and NegEx.

    Rule based classification
    Adverse Drug Event Extraction

    Feature generation
    We utilized the Stanford Parser (http://nlp.stanford.edu/software/stanford-dependencies.shtml) for dependency parsing.

    The figure above shows the dependency tree of a sentence. In this sentence, hypoglycemia is an adverse event and Lantus is a diabetes treatment. Grammatical relations between words are illustrated in the figure. For instance, "cause" and "hypoglycemia" have a relation "dobj" as "hypoglycemia" is the direct object of "cause". In this relation, "cause" is the governor and "hypoglycemia" is the dependent.

    Adverse Drug Event Extraction

    Syntactic and Semantic Classes Mapping

    To reduce the data sparsity and increase the robustness of our method, we expand the shortest dependency path by categorizing words on the path into syntactic and semantic classes with varying degrees of generality.

    Word classes include part-of-speech (POS) tags and generalized POS tags.

    POS tags are extracted with the Stanford CoreNLP packages. We generalized the POS tags with the Penn Tree Bank guidelines for POS tags. Semantic types (Event and Treatment) are also used for the two ends of the shortest path.

    Syntactic and Semantic Classes Mapping from dependency graph

    The relation instance in the figure above is represented as a sequence of features x = [x1, x2, x3, x4, x5, x6, x7], where x1 = {Hypoglycemia, NN, Noun, Event}, x2 = {->}, x3 = {cause, VB, Verb}, x4 = {<-}, x5 = {action, NN, Noun}, x6 = {<-}, x7 = {Lantus, NN, Noun, Treatment}.
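The path extraction and class mapping described above can be sketched as follows. The example sentence is not reproduced in the slides, so the dependency edges in `EDGES` are an assumption chosen to be consistent with the feature sequence x; the `WORD_CLASSES` lookup and both function names are likewise hypothetical. Arrows point from the dependent to its governor.

```python
from collections import deque

# Assumed (governor, dependent) edges; not taken from the slides.
EDGES = [
    ("cause", "Hypoglycemia"),  # dobj
    ("cause", "action"),        # nsubj
    ("cause", "may"),           # aux
    ("action", "Lantus"),       # prep_of
]

# Hypothetical lookup: POS tag, generalized POS tag, optional semantic type.
WORD_CLASSES = {
    "Hypoglycemia": ["NN", "Noun", "Event"],
    "cause": ["VB", "Verb"],
    "action": ["NN", "Noun"],
    "Lantus": ["NN", "Noun", "Treatment"],
}


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over the undirected dependency graph."""
    neighbours = {}
    for governor, dependent in edges:
        neighbours.setdefault(governor, []).append(dependent)
        neighbours.setdefault(dependent, []).append(governor)
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            break
        for nxt in neighbours.get(node, []):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    if target not in previous:
        return []
    nodes = []
    node = target
    while node is not None:
        nodes.append(node)
        node = previous[node]
    nodes.reverse()
    directed = set(edges)
    path = [nodes[0]]
    for left, right in zip(nodes, nodes[1:]):
        # "->" when the right-hand word governs the left-hand word.
        path.append("->" if (right, left) in directed else "<-")
        path.append(right)
    return path


def map_to_classes(path):
    """Attach the word classes to every token on the path."""
    return [[token] + WORD_CLASSES.get(token, []) for token in path]


if __name__ == "__main__":
    sdp = shortest_dependency_path(EDGES, "Hypoglycemia", "Lantus")
    print(map_to_classes(sdp))
```

Words are used as node identifiers here for brevity; a sentence that repeats a word would need token indices instead.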

    Adverse Drug Event Extraction

    Shortest Dependency Path Kernel function

    If x = x1 x2 ... xm and y = y1 y2 ... yn are two relation examples, where xi denotes the set of word classes corresponding to position i, the kernel function is computed as in the equation below (Bunescu et al. 2005).

    $$K(x, y) = \begin{cases} 0, & m \neq n \\ \prod_{i=1}^{n} c(x_i, y_i), & m = n \end{cases}$$

    where $c(x_i, y_i) = |x_i \cap y_i|$ is the number of common word classes between xi and yi.

    Relation instance x = [Hypoglycemia, NN, Noun, Event, ->, cause, VB, Verb, <-, action, NN, Noun, <-, Lantus, NN, Noun, Treatment].

    Relation instance y = [depression, NN, Noun, Event, ->, indicate, VBP, Verb, <-, effect, NN, Noun, <-, Lantus, NNP, Noun, Treatment].

    K(x, y) can be computed as the product of the number of common features of xi and yi in position i.

    K(x, y) = 3 × 1 × 1 × 1 × 2 × 1 × 3 = 18.
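The worked example above can be checked with a short sketch of the shortest dependency path kernel. The function names are hypothetical, and the two instances are transcribed from the slide as lists of per-position word-class lists.

```python
def common_classes(x_i, y_i):
    """c(x_i, y_i): number of word classes shared at one path position."""
    return len(set(x_i) & set(y_i))


def sdp_kernel(x, y):
    """K(x, y): 0 for paths of different length, else the product of c."""
    if len(x) != len(y):
        return 0
    score = 1
    for x_i, y_i in zip(x, y):
        score *= common_classes(x_i, y_i)
    return score


# Relation instances x and y from the slide.
X = [
    ["Hypoglycemia", "NN", "Noun", "Event"], ["->"],
    ["cause", "VB", "Verb"], ["<-"],
    ["action", "NN", "Noun"], ["<-"],
    ["Lantus", "NN", "Noun", "Treatment"],
]
Y = [
    ["depression", "NN", "Noun", "Event"], ["->"],
    ["indicate", "VBP", "Verb"], ["<-"],
    ["effect", "NN", "Noun"], ["<-"],
    ["Lantus", "NNP", "Noun", "Treatment"],
]

if __name__ == "__main__":
    print(sdp_kernel(X, Y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18
```

Because the kernel is a product, a single position with no shared class (for example, opposite arrow directions) drives the similarity to zero.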

    Adverse Drug Event Extraction

    SVM Classification
    Many SVM software tools have been developed and commercialized.

    Among them, the SVM-light package and LIBSVM are two of the most widely used tools. Both are free of charge and can be downloaded from the Internet.
    SVM-light is available at http://svmlight.joachims.org/
    LIBSVM can be found at http://www.csie.ntu.edu.tw/~cjlin/libsvm/

    Adverse Drug Event Extraction

    SVM-light

    Adverse Drug Event Extraction

    ALGORITHM 1. STATISTICAL LEARNING FOR ADVERSE DRUG EVENT EXTRACTION

    Input: all the relation instances with a pair of related drug and medical events, R(drug, event)

    Output: whether the instances have a pair of related drug and event

    Procedure:
    1. For each relation instance R(drug, event):
        Generate dependency tree T of R(drug, event)
        Features = Shortest Dependency Path Extraction (T, R)
        Features = Syntactic and Semantic Classes Mapping (Features)
    2. Separate relation instances into training set and test set
    3. Train an SVM classifier C with the shortest dependency path kernel function based on the training set
    4. Use the SVM classifier C to classify instances in the test set into two classes: R(drug, event) = True and R(drug, event) = False.

    Adverse Drug Event Extraction

    ALGORITHM 2. SEMANTIC FILTERING ALGORITHM

    Input: a relation instance i with a pair of related drug and medical events, R(drug, event)

    Output: The relation type.

    If drug exists in FAERS:
        Get indication list for drug;
        For indication in indication list:
            If event = indication:
                Return R(drug, event) = "Drug Indication";
    For rule in NegEx:
        If relation instance i matches rule:
            Return R(drug, event) = "Negated Adverse Drug Event";
    Return R(drug, event) = "Adverse Drug Event";
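The filtering procedure above translates almost line for line into code. In the sketch below, `DRUG_INDICATIONS` is a tiny hypothetical stand-in for the FAERS indication data (its one entry mirrors the Lipitor example post shown earlier), and `NEGATION_RULES` is a simplified trigger list standing in for NegEx, whose real rule set is larger and scope-aware.

```python
import re

# Hypothetical stand-in for the FAERS drug indication lists.
DRUG_INDICATIONS = {"lipitor": {"stroke"}}

# Simplified stand-in for NegEx: negation triggers before the event.
NEGATION_RULES = [
    re.compile(r"\b(never|no|not|without|denies|denied)\b", re.IGNORECASE),
]


def is_negated(sentence, event):
    """True when a negation trigger precedes the event mention."""
    position = sentence.lower().find(event.lower())
    scope = sentence if position < 0 else sentence[:position]
    return any(rule.search(scope) for rule in NEGATION_RULES)


def semantic_filter(sentence, drug, event):
    """Return the relation type of R(drug, event)."""
    indications = DRUG_INDICATIONS.get(drug.lower())
    if indications is not None and event.lower() in indications:
        return "Drug Indication"
    if is_negated(sentence, event):
        return "Negated Adverse Drug Event"
    return "Adverse Drug Event"


if __name__ == "__main__":
    print(semantic_filter(
        "I never experienced fatigue when using Zocor.", "Zocor", "fatigue"))
```

Checking the indication list before the negation rules follows the order of the algorithm, so a negated mention of a known indication is still labeled as a drug indication.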

    Report Source Classification

    In order to classify the report source of adverse drug events, we developed a feature-based classification model to distinguish patient reports from hearsay, based on prior studies.

    We adopted BOW features and Transductive Support Vector Machines in SVM-light for classification.
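As a rough illustration of the bag-of-words (BOW) representation mentioned above, the sketch below turns sentences into term-count vectors and writes them in the sparse `target index:value` line format that SVM-light reads. The tokenizer, the function names and the toy sentences are assumptions; the slides do not describe the actual feature extraction. In SVM-light's transductive mode, unlabeled examples carry the target 0.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens; a deliberately simple assumed tokenizer."""
    return re.findall(r"[a-z']+", sentence.lower())


def build_vocabulary(sentences):
    """Map each distinct token to a column index, in order of appearance."""
    vocabulary = {}
    for sentence in sentences:
        for token in tokenize(sentence):
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def bow_vector(sentence, vocabulary):
    """Term-count vector of one sentence over a fixed vocabulary."""
    vector = [0] * len(vocabulary)
    for token, count in Counter(tokenize(sentence)).items():
        if token in vocabulary:
            vector[vocabulary[token]] = count
    return vector


def to_svmlight_line(label, vector):
    """Format as '<target> <index>:<value> ...' with 1-based indices."""
    target = {1: "+1", -1: "-1", 0: "0"}[label]
    features = [f"{i + 1}:{v}" for i, v in enumerate(vector) if v]
    return " ".join([target] + features)


if __name__ == "__main__":
    sentences = ["I had pain", "you had pain pain"]
    vocab = build_vocabulary(sentences)
    print(to_svmlight_line(1, bow_vector(sentences[1], vocab)))
```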


    Evaluation on Medical Entity Extraction

    The performance of our system (F-measure) surpasses the best performance in prior studies (F-measure 73.9%), which was achieved by applying UMLS and MedEffect to extract adverse events from DailyStrength (Leaman et al., 2010). There may be several causes for our approach to outperform prior work.
    Combination of multiple lexicons improves precision.
    DailyStrength is a general health social website where users may have a more diverse health vocabulary and develop more linguistic creativity. Extracting medical named entities there could be more difficult than from our data source.

    [Figure: Results of Medical Entity Extraction (precision, recall, F-measure)]

    Evaluation on Adverse Drug Event Extraction

    Compared to the co-occurrence based approach (CO), statistical learning (SL) contributed to the increase of precision from around 40% to above 60%, while the recall dropped from 100% to around 60%. The F-measure of SL is better than that of the CO method.

    Semantic filtering (SF) further improved the precision in extraction from 60% to about 80% by filtering drug indications and negated ADEs.
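The trade-off described above can be made concrete with the balanced F-measure, the harmonic mean of precision and recall. The slides do not state which F-measure variant was used, so the balanced form is an assumption here, and the inputs are the rounded figures quoted in the text rather than exact chart values.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Rounded figures quoted above, for illustration only.
    print(f"CO: {f_measure(0.40, 1.00):.3f}")  # 40% precision, 100% recall
    print(f"SL: {f_measure(0.60, 0.60):.3f}")  # 60% precision, 60% recall
```

With these rounded inputs CO scores about 0.571 and SL about 0.600, which agrees with the statement that SL has the better F-measure despite its lower recall.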

    [Figure: Results of Adverse Drug Event Extraction. Precision, recall and F-measure of CO, SL and SL+SF on American Diabetes Association, Diabetes Forums and Diabetes Forum.]

    Evaluation on Report Source Classification

    Without report source classification (RSC), the performance of extraction is heavily affected by noise in the discussion.
    The precision ranged from 51% to 62% without RSC.
    Overall performance (F-measure) ranged from 68% to 76%.

    After report source classification, the precision and F-measure significantly improved.
    The precision increased from 51% up to 84%.
    The overall performance (F-measure) increased from 68% to above 80%.

    [Figure: Results of Report Source Classification (precision, recall, F-measure)]

    Contrast of Our Proposed Framework to Co-occurrence based approach

    There are a large number of false adverse drug events which couldn't be filtered out by the co-occurrence based approach.
    Based on our approach, only 35% to 40% of all the relation instances contain adverse drug events.
    Among them, about 50% come from patient reports.

    Forum | Total Relation Instances | Adverse Drug Events | Patient Reported ADEs
    American Diabetes Association | 100% | 35.97% | 21.94%
    Diabetes Forums | 100% | 37.98% | 19.74%
    Diabetes Forum | 100% | 39.27% | 18.10%

    American Diabetes Association counts: 2,972 relation instances, 1,069 adverse drug events, 652 patient reported ADEs.

    References

    *SEM: http://ixa2.si.ehu.es/starsem/
    CoNLL: http://ifarm.nl/signll/conll/
    SemEval: http://alt.qcri.org/semeval2014/
    CLEF eHealth: http://clefehealth2014.dcu.ie/home
    BioNLP: http://2013.bionlp-st.org/
    i2b2: https://www.i2b2.org/

    Benton A., Ungar L., Hill S., Hennessy S., Mao [...] Kernel for Relation Extraction. In: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724-731.

    Chee B. W., Berlin R., & Schatz B. (2011). Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings Vol. 2011, pp. 217-226

    Culotta, A., & Sorensen,