Lexical networks, lexical centrality, and text mining

Computational Linguistics And Information Retrieval @ Michigan

Dragomir R. Radev
Associate Professor
University of Michigan CSE, SI, LING
[email protected]

The CLAIR gang

• This talk is based on joint work with
– Vahed Qazvinian
– Joshua Gerrish

• With contributions by
– Arzucan Özgür
– Güneş Erkan
– Ahmed Awadallah
– Bryan Gibson
– Thuy Vu
– Xiaodong Shi
– Mark Joseph
– Sean Gerrish
– Alejandro C de Baca
– Jahna Otterbacher
– Benjamin Nash
– Alex Gonopolskiy

Natural Language Processing

Social data

• Blog postings
• News stories
• Speeches in Congress
• Query logs
• Movie and book reviews
• Scientific papers
• Financial reports
• Encyclopedia entries
• Email
• Chat room discussions
• Social networking sites

WHAT DO ALL OF THESE HAVE IN COMMON?

Natural language processing

• Part of speech tagging
• Prepositional phrase attachment
• Parsing
• Word sense disambiguation
• Document indexing
• Text summarization
• Machine translation
• Question answering
• Information retrieval
• Social network extraction
• Topic modeling

Talk outline

• Lexical networks

• Lexical centrality

• Latent networks

• Conclusion

Networks

Peri et al., Nucleic Acids Res. 2004 January 1; 32(Database issue): D497–D501. doi: 10.1093/nar/gkh070.

Interleukin-2 receptor pathway protein interaction network (from HPRD).

The New York Times

May 21, 2005

Lexical networks

Lexical networks

• A special case of networks where nodes are words or documents and edges link semantically related nodes

• Other examples:
– Words used in dictionary definitions
– Names of people mentioned in the same story
– Words that translate to the same word

Semantic network

Dependency network

[Figure: dependency network over the words Meredith, bought, green, apples, yesterday]

Dependency network

Lexical Centrality

What happened?

• Red Sox Are World Champs (again)

How many ways to say it?

Red Sox Win the World Series, More Titles Might Follow
Back in the Red: World Series crown returns to Boston as Rockies hit the canvas
Fans celebrate Red Sox win
Rockies Vanish In Thin Air
Police Arrest Dozens After Red Sox World Series Win
Red Sox 4, Rockies 3 Boston Sweeps World Series Again
Victory walk leads to dynasty talk
Red Sox Win Baseball's World Series Title by Sweeping Rockies
Boston enjoys sweep smell of success
Sox sweep Rockies to win Series
World Series: Red Sox sweep Rockies
Red Sox cruise to World Series title
Red Sox Are World Champs
Boston sweep Colorado to win World Series
Red Sox Sweep Colorado in World Series
Red Sox Take World Series
How sweep it is! Red Sox breeze to second World Series title
Boston owners dedicate triumph to Red Sox Nation
Red Sox win World Series
Short wait for bosox this time
World Series: Red Sox complete sweep of Rockies
It's Leap Year
Boston Red Sox blank Rockies to clinch World Series
Red Sox Sweep Rockies 4-3 In Game 4
Red Sox claim World Series title
Boston Red Sox win World Series, lose image
Sox sweep Rockies for 2nd title in 4 seasons
Red Sox Complete Sweep Of Rockies For World Series Victory
Red sox wrap up world series rout
Rockies feel the pain, but not the shame
Red Sox, tarnished for so long, now baseball's gold standard
Red sox take title
Boston Celebrates World Series Win
Bad news AL, Red Sox built to last
Boston sweeps it Heartbreak is history for Red Sox
Fans Celebrate Red Sox World Series Win
From cursed to charmed: Red Sox sweep World Series
Red Sox Sweep Rockies To Win World Series
Red Sox make Boston jump for joy, Series Champs
Crowds fill streets after Red Sox win
Papelbon, Timlin savoring Series win
Red Sox scale the Rockies
Even Sox fan losing passion for winning
Rookies respond in first crack at the big time
Red Sox Get 2nd World Series Sweep In 4 Years
Red Sox looking like a dynasty
Boston Red Sox are America's team
Red Sox Win 2007 World Series
Believing Pays Off! Sox World Series Champs
Red Sox go from cursed to blessed
Boston lowers the broom
We Are the Champions: Red Sox 4, Rockies 3
Sweep and red sox for everybody
Two titles four years apart impossible to compare
Red Sox cash in
The Boston Red Sox swept the Colorado Rockies to win the World Series
Red Sox "do little things" to win second title in four years
World Series victory for Red Sox
Boston reigns supreme
Rockies' heads held high despite loss
Rockies just failed to execute
Boston Fans Fill Streets To Celebrate Series Sweep
It's easy to embrace these Red Sox
Young stars lead Red Sox to World Series title
They spend money in Boston, but they win
Boston fans celebrate World Series win; 37 arrests reported
Soxcess started upstairs
Boston sweep is complete
Red Sox sweep 2007 World Series in Denver
Rookies rise to occasion!
Sox sweep World Series again
Poor pitching, poorer hitting doom Rockies
Red Sox top Rockies and sweep to second World Series championship ...
Sweeping off to Boston
Another in a Series of sweeps
Putting an end to baseball as we've known it
Red Sox Sweep Rockies to Capture Second Title in Four Years
Red Sox accomplished the expected, unbelievable
Rockies Find Being Good Isnt Enough
Boston sweep-walks to title
Red Sox play party crashers in Denver
Unhappy ending for Colorado - MLB
Sox on, Rocks off in sweeping win
Timlin gets to ring up another one
Red Sox complete World Series sweep
Wild celebrations in Boston after World Series win
Red Sox seal sweep of Rockies
Red Sox Win World Series
Sox are kings of diamond
Rockies: Sweep, sweep, swept
Red Sox sweep World Series
Monsters of Beantown: Red Sox win Series
Red Sox claim World Series glory
How sweep it is
Red Sox: Dynasty in the making
Red Sox sweep upstart Rockies
Boston Sweeps Colorado To Win World Series
Red Sox sweep Rockies, take World Series
What curse? Red Sox win Series again

List of topics

• Red Sox
• Sweep
• World Series
• Rockies
• Celebrate
• Boston
• Colorado
• Dynasty
• 2007
• Four years
• Second time
• 4:3 score
• Easy
• Expectations
• No curse
• Young players
• Timlin
• not (baseball)

LexRank – Centrality in Text Graphs

Vertices: units of text (sentences or documents)

Edges: pairwise similarity between text units (tf-idf cosine or language model)
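As a rough illustration of how such a text graph could be built, the sketch below computes tf-idf cosine similarities between a few of the headlines. The whitespace tokenizer, the smoothed idf formula, and the function names (`tfidf_vectors`, `cosine`, `similarity_graph`) are assumptions made for this example; they are not taken from the talk.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one sparse tf-idf vector (dict: term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption of this sketch, not the talk's formula).
        vectors.append(
            {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t in tf}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_graph(texts):
    """Weighted adjacency matrix: entry [i][j] is the similarity of texts i and j."""
    vecs = tfidf_vectors(texts)
    n = len(texts)
    return [
        [cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]


headlines = [
    "Red Sox sweep Rockies to win World Series",
    "Red Sox sweep World Series",
    "Rockies vanish in thin air",
]
graph = similarity_graph(headlines)
```

Here the first two headlines share most of their words and get a high edge weight, while the second and third share none and get weight zero.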

LexRank – Centrality in Text Graphs

Intuition

LexRank score is propagated through edges.

Central vertices are those that are similar to other central vertices.

LexRank – Centrality in Text Graphs

Recurrence Relation

Can guarantee a solution by allowing a "jump" probability d/N.

[Figure: example similarity graph with edge weights 0.5, 0.3, 0.8, 0.2, 0.1, 0.3, 0.9, 0.2, 0.4]
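The recurrence itself did not survive on this slide. The sketch below assumes the form commonly given for LexRank, p(u) = d/N + (1 - d) Σ_v [w(u,v) / Σ_z w(z,v)] p(v), and solves it by power iteration. The value d = 0.15, the uniform handling of vertices without edges, and the small symmetric weight matrix are all assumptions made for illustration.

```python
def lexrank(weights, d=0.15, tol=1e-10, max_iter=1000):
    """Power iteration for p(u) = d/N + (1 - d) * sum_v [w(u,v) / sum_z w(z,v)] * p(v).

    `weights` is assumed to be a symmetric similarity matrix with a zero diagonal.
    """
    n = len(weights)
    # Total similarity attached to each vertex; normalizes what it passes on.
    strength = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Vertices without edges spread their score uniformly (an assumption).
        isolated = sum(scores[v] for v in range(n) if strength[v] == 0)
        updated = []
        for u in range(n):
            received = sum(
                weights[v][u] / strength[v] * scores[v]
                for v in range(n)
                if strength[v] > 0
            )
            updated.append(d / n + (1 - d) * (received + isolated / n))
        change = sum(abs(a - b) for a, b in zip(updated, scores))
        scores = updated
        if change < tol:
            break
    return scores


# Hypothetical 4-vertex similarity graph (not the graph drawn on the slide).
example_weights = [
    [0.0, 0.5, 0.3, 0.0],
    [0.5, 0.0, 0.8, 0.2],
    [0.3, 0.8, 0.0, 0.1],
    [0.0, 0.2, 0.1, 0.0],
]
scores = lexrank(example_weights)
```

The jump term d/N keeps every vertex reachable, which is what makes the iteration settle on a single score vector. In this toy graph the second vertex, which has the strongest links to the others, ends up with the highest score.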

0.01718 Red Sox Win Baseball's World Series Title by Sweeping Rockies
0.01712 Red Sox Sweep Rockies To Win World Series
0.01647 World Series: Red Sox sweep Rockies
0.01630 Red Sox sweep Rockies, take World Series
0.01608 Red Sox 4, Rockies 3 Boston Sweeps World Series Again
0.01597 World Series: Red Sox complete sweep of Rockies
0.01584 Red Sox sweep World Series
0.01579 Red Sox Sweep Colorado in World Series
0.01573 Red Sox Complete Sweep Of Rockies For World Series Victory
0.01531 Red Sox complete World Series sweep
...
0.01057 Boston Red Sox blank Rockies to clinch World Series
0.01052 Red Sox: Dynasty in the making
0.01037 Sox sweep Rockies for 2nd title in 4 seasons
0.01034 Police Arrest Dozens After Red Sox World Series Win
0.01027 Rookies respond in first crack at the big time
0.01018 Rockies: Sweep, sweep, swept
0.01016 Sweeping off to Boston
0.01013 Rookies rise to occasion!
0.01012 Fans celebrate Red Sox win
0.01010 Short wait for bosox this time
...
0.00441 Sox are kings of diamond
0.00414 Rockies just failed to execute
0.00408 Rockies Find Being Good Isnt Enough
0.00407 Rockies' heads held high despite loss
0.00391 Boston lowers the broom
0.00390 Rockies Vanish In Thin Air
0.00390 Poor pitching, poorer hitting doom Rockies
0.00390 Rockies feel the pain, but not the shame
0.00375 Two titles four years apart impossible to compare
0.00362 Boston reigns supreme

NLP and network analysis

Dependency parsing

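[Figure: dependency parsing of "John likes green apples." shown as a sequence of graphs over the following node sets]

1. <root>, John, likes, green, apples, <period>
2. <root>, John/likes, green, apples, <period>
3. <root>, John/likes/apples, green, <period>
4. <root>, John/likes/apples/green, <period>
5. <root>, John/likes/apples/green/<period>
6. <root>, John, likes, green, apples, <period>

[McDonald et al. 2005]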

Part of speech tagging [Biemann 2006]

... , sagte der Sprecher bei der Sitzung .
... , rief der Vorsitzende in der Sitzung .
... , warf in die Tasche aus der Ecke .

C1: sagte, warf, rief
C2: Sprecher, Vorsitzende, Tasche
C3: in
C4: der, die

Word sense disambiguation [Mihalcea et al 2004]

Document indexing [Mihalcea et al 2004]

Subjectivity analysis [Pang and Lee 2004]

Semantic class induction [Widdows and Dorow 2002]

Passage retrieval [Otterbacher, Erkan, Radev 05]

[Figure labels: Q, relevance, inter-similarity]

MavenRank – Centrality in Speech Graphs

Vertices: speech transcripts from a given topic

Edges: tf-idf cosine similarity (with threshold)

Hypothesis: key speakers will have speeches with high centrality.

MavenRank: Example

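[Figure: example graph of eight speeches, numbered 1 to 8, grouped into Speaker 1 Speeches, Speaker 2 Speeches, and Speaker 3 Speeches]

Speech Scores

Speech  Score
1       0.13
2       0.13
3       0.10
4       0.19
5       0.10
6       0.14
7       0.08
8       0.13

Speaker Scores (mean speech score)

Speaker  Score
1        0.12
2        0.15
3        0.12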
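The aggregation step can be sketched in a few lines: a speaker's score is the mean of the centrality scores of that speaker's speeches. The speech scores below are the rounded values from the slide. The assignment of speeches to speakers is not recoverable from the extracted figure, so the grouping used here (speeches 1 to 3, 4 to 5, and 6 to 8) is an assumption.

```python
from statistics import mean

# Rounded speech scores as listed on the slide.
speech_scores = {
    1: 0.13, 2: 0.13, 3: 0.10, 4: 0.19,
    5: 0.10, 6: 0.14, 7: 0.08, 8: 0.13,
}

# Hypothetical grouping of speeches by speaker (an assumption, see above).
speeches_by_speaker = {
    1: [1, 2, 3],
    2: [4, 5],
    3: [6, 7, 8],
}

# Speaker score = mean centrality of that speaker's speeches.
speaker_scores = {
    speaker: mean(speech_scores[s] for s in speeches)
    for speaker, speeches in speeches_by_speaker.items()
}

for speaker, score in sorted(speaker_scores.items()):
    print(f"Speaker {speaker}: {score:.3f}")
```

Because the inputs are already rounded to two decimals, the means computed this way can differ slightly from the speaker scores shown on the slide.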

Joint work with Kevin Quinn, Burt Monroe, Michael Colaresi and Michael Crespin

Gosnell Prize 2005

Feature Extraction from Dependency Trees

Path1: KaiC – nsubj – interacts – obj – SasA

Path2: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiA

Path3: KaiC – nsubj – interacts – obj – SasA – conj_and – KaiB

Path4: SasA – conj_and – KaiA

Path5: SasA – conj_and – KaiB

Path6: KaiA – prep_with – SasA – conj_and – KaiB

"The results demonstrated that KaiC interacts rhythmically with KaiA, KaiB, and SasA."

Path Edit Kernel

Character-based edit distance --> modified to be word-based

Minimum number of operations (insertion, deletion, or substitution of a single word) needed to transform the first string into the second

Ex:
1. KaiC - subj - interacts - obj - SasA
2. KaiC - subj - interacts - obj - SasA - conj - KaiA

Edit distance = 2 (2 insertions)

Normalized edit distance: divide by the length of the longer path

2/7 = 0.286

Integrate SVM with path edit kernel

Higher performance than results reported so far in the literature
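A minimal sketch of the word-level edit distance and its normalization follows. It assumes the first example path is the five-word path KaiC - subj - interacts - obj - SasA, so that two insertions produce the seven-word path. The function names are illustrative, and how the normalized distance is turned into a kernel value for the SVM is not specified here, so that step is left out.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two word sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the first i-1 words of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cost = 0 if word_a == word_b else 1
            cur.append(min(prev[j] + 1,          # delete word_a
                           cur[j - 1] + 1,       # insert word_b
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance divided by the length of the longer path."""
    longest = max(len(a), len(b))
    return word_edit_distance(a, b) / longest if longest else 0.0


path1 = "KaiC subj interacts obj SasA".split()
path2 = "KaiC subj interacts obj SasA conj KaiA".split()

distance = word_edit_distance(path1, path2)
normalized = normalized_distance(path1, path2)
print(distance, round(normalized, 3))
```

With these two paths the distance is 2 and the normalized value is 2/7, matching the arithmetic on the slide.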

AIMED

Best F-score (59.96%) --> TSVM with Path Edit Kernel (higher than previously reported results)

Inferring Genes Related to Prostate Cancer

Hypothesis: Genes that interact with many genes known to be related to prostate cancer are likely to be related to prostate cancer themselves.

Approach:
• Automatically extract from the literature the interaction network of genes (seed genes) that are known to be related to prostate cancer
• Infer new genes related to prostate cancer from the network topology
• Use eigenvector centrality to rank gene-prostate cancer associations

Hypothesis restatement: Genes central in the constructed network are most probably related to prostate cancer.
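To make the ranking step concrete, here is a minimal sketch of eigenvector centrality computed by power iteration on a toy interaction graph. The node names and edges are hypothetical placeholders, not extracted interactions, and the shift by the identity matrix (adding each node's own score) is a convergence aid chosen for this sketch.

```python
import math


def eigenvector_centrality(adj, iterations=500, tol=1e-12):
    """Power iteration on (A + I); it has the same eigenvectors as the adjacency matrix A."""
    score = {node: 1.0 for node in adj}
    for _ in range(iterations):
        # Each node keeps its own score (the +I shift) and adds its neighbours' scores.
        updated = {n: score[n] + sum(score[m] for m in adj[n]) for n in adj}
        norm = math.sqrt(sum(v * v for v in updated.values()))
        updated = {n: v / norm for n, v in updated.items()}
        change = sum(abs(updated[n] - score[n]) for n in adj)
        score = updated
        if change < tol:
            break
    return score


# Hypothetical undirected interaction graph: three seed genes and two candidates.
interactions = {
    "seed_A": {"candidate_X", "candidate_Y"},
    "seed_B": {"candidate_X"},
    "seed_C": {"candidate_X"},
    "candidate_X": {"seed_A", "seed_B", "seed_C"},
    "candidate_Y": {"seed_A"},
}

centrality = eigenvector_centrality(interactions)
ranking = sorted(centrality, key=centrality.get, reverse=True)
print(ranking)
```

In this toy graph the candidate that interacts with all three seeds ranks first, while the candidate linked to a single seed ranks last, which is the behaviour the hypothesis relies on.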

Approach

Corpus: PMCOA (PubMed Central Open Access) – full text articles

• Articles in PMCOA split into sentences and sentences tagged with GeniaTagger
• Compile seed list of genes known to be related to prostate cancer
– 20 genes compiled from OMIM (Online Mendelian Inheritance in Man) database
• Extend seed gene list with synonyms from HGNC (HUGO Gene Nomenclature Committee) database
• Use the automatic interaction extraction pipeline to extract the interaction network of the seed genes and their neighbors (genes interacting with the seed genes)

Seed Genes

Gene     Description
AR       androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease)
BRCA2    breast cancer 2, early onset
MSR1     macrophage scavenger receptor 1
EPHB2    EPH receptor B2
KLF6     Kruppel-like factor 6
MAD1L1   MAD1 mitotic arrest deficient-like 1 (yeast)
TUSC3    tumor suppressor candidate 3
HIP1     huntingtin interacting protein 1
CBX8     chromobox homolog 8 (Pc class homolog, Drosophila)
CD82     CD82 molecule
ZFHX3    zinc finger homeobox 3
ELAC2    elaC homolog 2 (E. coli)
MXI1     MAX interactor 1
PTEN     phosphatase and tensin homolog (mutated in multiple advanced cancers 1)
RNASEL   ribonuclease L (2',5'-oligoisoadenylate synthetase-dependent)
HPC1     hereditary prostate cancer 1
CHEK2    CHK2 checkpoint homolog (S. pombe)
HPCX     hereditary prostate cancer, X-linked
PCAP     predisposing for prostate cancer
PRCA1    prostate cancer 1

20 genes that are reported in OMIM to be related to prostate cancer

Interactions of the seed genes (gene names normalized to their HGNC symbols)

Sample Extracted Interaction Sentences

• A study by Jin et al. [20] indicated that the association of Tax with hsMAD1, a mitotic spindle checkpoint (MSC) protein, led to the translocation of both MAD1 and MAD2 to the cytoplasm.

• PTEN is transcriptionally regulated by transcription factors such as p53, Egr-1, NFκB and SMADs, while protein levels and activity are modulated by phosphorylation, oxidation, subcellular localisation, phospholipid binding and protein stability [29].

• Interestingly, one of these, HPC1, is linked to RNASEL [10,11].

• In response to DNA damage, the cell-cycle checkpoint kinase CHEK2 can be activated by ATM kinase to phosphorylate p53 and BRCA1, which are involved in cell-cycle control, apoptosis, and DNA repair [1,2].

• The interactions of RAD51 with TP53, RPA and the BRC repeats of BRCA2 are relatively well understood (see Discussion).

• The interaction of BRCA2 with HsRad51 is significantly more different to both RadA and RecA (Figure 2c).

• Max interactor protein, MXI1 (gene L07648) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance.

• Mad2 binds to Cdc20, an activator of the anaphase-promoting complex (APC), to inhibit APC activity and arrest cells in metaphase in response to checkpoint activation.

Inferred Genes (evaluation of top-20 scoring genes)

• 6 are seed genes; 14 genes are inferred to be related to prostate cancer (check GeneGo Pathway database; if no evidence there, check PubMed literature)
– 9 genes: marked as being related to prostate cancer by GeneGo Pathway Database
– 1 gene: evidence found in PubMed that the gene is related to prostate cancer
– 4 genes: no evidence found

Gene     Description                                                                  Evidence
TP53     tumor protein p53 (Li-Fraumeni syndrome)                                     GeneGo
BRCA1    breast cancer 1, early onset                                                 GeneGo
EREG     epiregulin                                                                   no
AKT1     v-akt murine thymoma viral oncogene homolog 1                                GeneGo
MAPK1    mitogen-activated protein kinase 1                                           no
TNF      tumor necrosis factor (TNF superfamily, member 2)                            GeneGo
CCND1    cyclin D1                                                                    GeneGo
MYC      v-myc myelocytomatosis viral oncogene homolog (avian)                        GeneGo
APC      adenomatosis polyposis coli                                                  PubMed
CDKN1B   cyclin-dependent kinase inhibitor 1B (p27, Kip1)                             GeneGo
MAPK8    mitogen-activated protein kinase 8                                           GeneGo
NR3C1    nuclear receptor subfamily 3, group C, member 1 (glucocorticoid receptor)    no
VEGFA    vascular endothelial growth factor A                                         GeneGo
MDM2     mouse double minute 2, human homolog of; p53-binding protein                 no

GIN - Article View

Interaction sentences from this article

Citation information

Other networks

• Diabetes Type I

• Diabetes Type II

• Bipolar Disorder

Properties of lexical networks

Dependency network

Random network

Analyzing networks

• Properties of networks
– Clustering coefficient
  • Watts/Strogatz cc = #triangles/#triples
– Power law coefficient
– Diameter (longest shortest path)
– Average shortest path (ASP)

• Properties of nodes
– Centrality: degree, closeness, betweenness, eigenvector
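The network-level measures listed above can be computed with a few lines of code. The sketch below uses breadth-first search for the diameter and the average shortest path, and reads the clustering coefficient as the average, over nodes, of the fraction of neighbour pairs that are themselves linked (closed triples over all triples centred on a node). That reading, the toy graph, and the function names are assumptions made for this example.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def path_statistics(adj):
    """Return (diameter, average shortest path) of a connected undirected graph."""
    lengths = []
    for source in adj:
        dist = bfs_distances(adj, source)
        lengths.extend(d for target, d in dist.items() if target != source)
    # Every pair is counted twice, which leaves both the max and the mean unchanged.
    return max(lengths), sum(lengths) / len(lengths)


def clustering_coefficient(adj):
    """Average over nodes of (linked neighbour pairs) / (all neighbour pairs)."""
    total = 0.0
    for node, neighbours in adj.items():
        pairs = list(combinations(neighbours, 2))
        if pairs:
            closed = sum(1 for a, b in pairs if b in adj[a])
            total += closed / len(pairs)
    return total / len(adj)


# Hypothetical toy graph: a triangle (0, 1, 2) with a two-edge tail (2-3-4).
toy = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3},
}

diameter, asp = path_statistics(toy)
cc = clustering_coefficient(toy)
print(diameter, asp, cc)
```

For this toy graph the longest shortest path runs from node 0 (or 1) to node 4 and has length 3, and only the triangle contributes to the clustering coefficient.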

Types of networks

• Regular networks
– Uniform degree distribution

• Random networks
– Memoryless
– Poisson degree distribution: $P(k) = \frac{e^{-\bar{k}}\,\bar{k}^{k}}{k!}$
– Characteristic value: $\bar{k} = pN$
– Low clustering coefficient
– Large ASP

• Small world networks
– High transitivity
– Presence of hubs (memory)
– High clustering coefficient (e.g., 1000 times higher than random)
– Small ASP
– Power law degree distribution: $P(k) \sim k^{-\gamma}$ (typical value of $\gamma$ between 2 and 3)
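The practical difference between the two degree distributions is in the tail: a Poisson distribution makes very high degrees essentially impossible, while a power law leaves room for hubs. The sketch below compares the two at degree 50. It takes the mean degree as 2M/n with n = 5563 and M = 14440 from the random-graph column of the comparison table, and uses 2.19 as the power-law exponent. Reading M as the number of edges and 2.19 as the exponent, as well as the cutoff `k_max = 1000`, are assumptions of this sketch.

```python
import math


def poisson_pmf(k, mean_degree):
    """P(k) = exp(-mean) * mean**k / k! for a random (Poisson) graph."""
    return math.exp(-mean_degree) * mean_degree ** k / math.factorial(k)


def power_law_pmf(k, gamma, k_max):
    """P(k) proportional to k**(-gamma), normalized over degrees 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm


mean_degree = 2 * 14440 / 5563   # about 5.19, assuming M counts edges
gamma = 2.19                     # assumed power-law exponent

p_hub_poisson = poisson_pmf(50, mean_degree)
p_hub_power_law = power_law_pmf(50, gamma, k_max=1000)
print(p_hub_poisson, p_hub_power_law)
```

Under these assumptions a node of degree 50 is many orders of magnitude more likely in the power-law case than in the Poisson case.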

Comparing the dependency graph to a random (Poisson) graph

           Random    Actual
n          5563      5584
M          14440     14472
Diameter   21        13
ASP        8.788     4.01
W/S cc     0.00062   0.092
γ          n/a       2.19

Properties of lexical networks

• Entries in a thesaurus [Motter et al. 2002]
– c/c0 = 260 (n=30,000)

• Co-occurrence networks [Dorogovtsev and Mendes 2001, Sole and Ferrer i Cancho 2001]
– c/c0 = 1,000 (n=400,000)

• Mental lexicon [Vitevitch 2005]
– c/c0 = 278 (n=19,340)

[Figure: example network with the nodes letter, actor, character, nature, universe, world]

[Figure: syntactic dependency degree distribution (log-log scale); both axes run from 0 to 8]

Experimental data

Based on [Mehler 2007]

Statistics

Latent networks


Semantic similarity distributions


Simulations

• D = number of documents: d1…dD

• V = vocabulary size ~ Zipf()

• W = size of document in words
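A minimal sketch of how such a simulated corpus could be generated, assuming each of the D documents holds exactly W word tokens drawn independently from a vocabulary of V word types with Zipfian rank weights. The exponent value, the function names, and the use of integer word IDs are illustrative assumptions; the slide does not specify them.

```python
import random
from collections import Counter


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipf weights: the word of rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]


def simulate_corpus(num_docs, vocab_size, doc_len, exponent=1.0, seed=0):
    """Return num_docs documents, each a list of doc_len word IDs in [0, vocab_size).

    Word ID 0 is the most frequent type. The exponent is an assumed
    parameter, not a value taken from the slides.
    """
    rng = random.Random(seed)
    vocab = list(range(vocab_size))
    weights = zipf_weights(vocab_size, exponent)
    return [rng.choices(vocab, weights=weights, k=doc_len) for _ in range(num_docs)]


# D = 100 documents, V = 1000 word types, W = 50 words per document.
corpus = simulate_corpus(num_docs=100, vocab_size=1000, doc_len=50)
counts = Counter(word for doc in corpus for word in doc)
```

Fixing the seed makes runs repeatable, so network statistics computed on the simulated documents can be compared across settings of D, V, and W.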


[Teufel and van Halteren 2004]


Future directions


Machine translation

1 Mr. Speaker, I rise, on this first full sitting day of the 36th Parliament, to reiterate my call to the Ontario government for an independent public inquiry into the July Plastimet fire in Hamilton.
2 Conservative Premier Mike Harris and his environment and health ministers have backtracked, flip-flopped on their pledges for an inquiry, citing the pathetic excuse of the need for evidence of wrongdoing.
3 Is it right that the local MPP had to awaken the provincial environment minister at 3 a.m. before the premier would dispatch air monitoring equipment to the toxic fire site?
4 Why did the province first refuse and then later accept federal government assistance?
5 There are questions of compliance with the Ontario fire code, inventory lists, security, and locating a recycling plant near a hospital, schools and a high density residential area.
6 Frustrated with the Harris government smokescreen, my constituents demand an independent public inquiry to clear the smoke and to produce recommendations which might prevent an environmental tragedy like the Plastimet fire from ever happening again.

1 Monsieur le Président, je profite de cette première journée complète de séance de la 36e législature pour réclamer de nouveau au gouvernement ontarien une enquête publique indépendante sur l'incendie de Plastimet, survenu à Hamilton en juillet.
2 Le premier ministre conservateur Mike Harris et ses ministres de l'Environnement et de la Santé ont fait volte-face, après s'être engagés à faire une enquête, sous le prétexte pathétique qu'il fallait des preuves de méfait.
3 Est-il normal que le député provincial de l'endroit ait dû réveiller le ministre de l'Environnement à 3 heures du matin pour que le premier ministre envoie sur les lieux un équipement de surveillance de la qualité de l'air?
4 Pourquoi la province a-t-elle d'abord refusé, avant de finir par l'accepter, l'aide du gouvernement fédéral?
5 Il y a lieu de se poser des questions sur le respect du code ontarien des incendies, les listes d'inventaire, la sécurité, et la décision d'implanter une usine de recyclage à proximité d'un hôpital, d'écoles et d'une zone résidentielle à forte densité.
6 Exaspérés par l'écran de fumée derrière lequel le gouvernement Harris se retranche, mes électeurs réclament une enquête publique indépendante pour dissiper tout ce qu'il y a de trouble dans cette affaire et formuler des recommandations qui aideront peut-être à prévenir d'autres catastrophes écologiques comme l'incendie de Plastimet.


[Figure: nodes for sentences 1–6 and for the words fire, incendie, premier (appearing twice), ministre, abord, first, smokescreen, écran, fumée.]
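One possible way to turn such a word-to-sentence network into translation candidates is to score each English/French word pair by how many aligned sentences they share. The sketch below is only an illustration under stated assumptions: the sentence sets were read off the two passages by case-insensitive exact-token matching (splitting on apostrophes, so "incendies" in sentence 5 is not counted), and the Dice coefficient is a choice made here, not a method stated in the slides.

```python
# Indices (1-6) of the aligned sentences in which each word occurs,
# read off the English and French passages by exact token match.
english = {
    "fire": {1, 3, 5, 6},
    "premier": {2, 3},
    "first": {1, 4},
    "smokescreen": {6},
}
french = {
    "incendie": {1, 6},
    "premier": {2, 3},
    "ministre": {2, 3},
    "abord": {4},
    "écran": {6},
    "fumée": {6},
}


def dice(a, b):
    """Dice coefficient between two sets of sentence indices."""
    return 2 * len(a & b) / (len(a) + len(b))


def ranked_candidates(word):
    """French words ranked by sentence overlap with an English word (ties by name)."""
    source = english[word]
    scores = {fr: dice(source, sentences) for fr, sentences in french.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


best_for_fire = ranked_candidates("fire")[0][0]
best_for_first = ranked_candidates("first")[0][0]
```

With only six sentence pairs the scores are coarse and ties are common (for example, "premier" overlaps equally with "premier" and "ministre"), so a realistic system would need a much larger parallel corpus.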


Final notes


Funding sources

• BlogoCenter: Infrastructure for Collecting, Mining and Accessing Blogs
– NSF, joint project with UCLA

• DHB: The dynamics of Political Representation and Political Rhetoric
– NSF, joint project with Harvard, Michigan State U., Penn. State U., U. of Georgia

• National center for integrative bioinformatics
– NIH

• RI: iOPENER—A Flexible Framework to Support Rapid Learning in Unfamiliar Research Domains
– NSF, joint project with Maryland

• Representing and Acquiring Knowledge of Genome Regulation
– NIH (NLM)

• Probabilistic and link-based Methods for Exploiting Very Large Textual Repositories
– NSF

• Collaborative research: semantic entity and relation extraction from Web-scale text document collections
– NSF (Human Languages and Communications Program)

• ITR/IM: Information Fusion Across Multiple Text Sources: A Common Theory
– NSF (Information Technology Research Program)


URL

• http://www.eecs.umich.edu/~radev

• http://www.clairlib.org


Clairlib


Clairlib: The Clair Library

• Native
– Tokenization
– Summarization
– LexRank
– Biased LexRank
– Document Clustering
– Document Indexing
– PageRank
– Web Graph and Network Analysis

• Imported
– Parsing
– Stemming
– Sentence segmentation
– Web Page Download
– Web Crawling


Computational Linguistics And Information Retrieval @ Michigan

THANK YOU!