
Constructing Structured Information Networks from Massive Text Corpora

Part I: Quality Phrase Mining

Effort-Light StructMine: Methodology

[Figure: pipeline overview. A text corpus plus knowledge bases feed data-driven text segmentation (SIGMOD'15, WWW'16), producing entity names & context units and a partially-labeled corpus; a corpus-specific model is then learned (KDD'15, KDD'16, EMNLP'16, WWW'17) to extract structures from the remaining unlabeled data.]

Quality Phrase Mining

• Quality phrase mining seeks to extract a ranked list of phrases with decreasing quality from a large collection of documents
• Examples:
  • Scientific Papers, expected results: data mining; machine learning; information retrieval; …; support vector machine; …; the paper; …
  • News Articles, expected results: US President; Anderson Cooper; Barack Obama; …; Obama administration; …; a town; …

Why Phrase Mining?

• Phrase: minimal, unambiguous semantic unit; the basic building block for information networks and knowledge bases
• Unigrams vs. phrases
  • Unigrams (single words) are ambiguous
    • E.g., "United": United States? United Airlines? United Parcel Service?
  • Phrase: a natural, meaningful, unambiguous semantic unit
    • E.g., "United States" vs. "United Airlines"
• Mining semantically meaningful phrases
  • Transform text data from word granularity to phrase granularity
  • Enhance the power and efficiency of manipulating unstructured data using database technology

Application Scenarios

• Natural Language Processing (NLP)
  • Document analysis
• Information Retrieval (IR)
  • Indexing in search engines
• Text Mining
  • Keyphrases for topic modeling

What Kind of Phrases Are of "High Quality"?

• Popularity
  • "information retrieval" > "cross-language information retrieval"
• Concordance
  • "strong tea" > "powerful tea"
  • "active learning" > "learning classification"
• Informativeness
  • "this paper" (frequent but not discriminative, not informative)
• Completeness
  • "support vector machine" > "vector machine"

Three Families of Methods

• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly/Distantly Supervised

Supervised Phrase Mining

• Phrase mining originated in the NLP community
• How to use linguistic analyzers to extract phrases?
  • Parsing (e.g., Stanford NLP parsers)
  • Noun Phrase (NP) Chunking
• How to rank extracted phrases?
  • C-value [Frantzi et al.'00]
  • TextRank [Mihalcea et al.'04]
  • TF-IDF

Linguistic Analyzer – Parsing

[Figure: full-text parsing maps a raw text sentence (string) to a full parse tree (grammatical analysis). Example sentence: "The chef cooks the soup."]

• Minimal grammatical segments ↔ phrases
• Phrases: "the chef", "the soup"

Inefficiencies of Parsing

• Difficult to directly apply pre-trained parsers to new domains (e.g., Twitter, biomedical, Yelp)
  • Unless sophisticated, manually curated, domain-specific training data are provided
• Computationally slow
  • Cannot be applied on web-scale data to support emerging applications
• We need "shallow" phrase mining techniques

Linguistic Analyzer – Chunking

• Noun phrase chunking is a lightweight version of parsing
  1. Apply tokenization and part-of-speech (POS) tagging to each sentence
  2. Search for noun phrase chunks (see the sketch below)
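A minimal NP-chunking sketch using NLTK. The tag pattern is a common textbook noun-phrase grammar rather than a canonical one, and the example sentence is assumed for illustration.

# Minimal NP-chunking sketch using NLTK (assumes nltk plus its
# tokenizer and POS-tagger models are installed).
import nltk

sentence = "The chef cooks the soup in the new restaurant."

# Step 1: tokenization and part-of-speech (POS) tagging.
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Step 2: search for noun phrase chunks with a simple tag pattern:
# an optional determiner, any adjectives, then one or more nouns.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# -> "The chef", "the soup", "the new restaurant"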


Drawbacks of NP Chunking

• Pre-trained models may not be transferable to new domains
  • Scientific domains, query logs, social media (e.g., Yelp, Twitter)
• No use of corpus-level information
  • Noun phrases sometimes cannot meet the requirements of quality phrases

Ranking – C-value

• Given a set of phrases, for a given phrase p
  • f(p) is the raw frequency
  • |p| is the number of tokens in p
• If no phrase contains p as a substring
  • C-value(p) = log₂|p| · f(p)
• Else
  • C-value(p) = log₂|p| · (f(p) − avg_{q : p is a substring of q} f(q))
• Prefers "maximal" phrases
  • Popularity & Completeness
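The two cases above translate directly into code. Below is a minimal sketch that assumes candidate phrases are given as token tuples with raw counts; the toy counts are made up to show the completeness effect.

import math
from collections import defaultdict

def c_value(freq):
    """C-value over candidate phrases; freq maps a phrase
    (tuple of tokens) to its raw frequency f(p)."""
    # Collect, for each phrase p, the frequencies of longer
    # candidates q that contain p as a contiguous substring.
    nested_in = defaultdict(list)
    for q in freq:
        for p in freq:
            if len(p) < len(q) and any(q[i:i + len(p)] == p
                                       for i in range(len(q) - len(p) + 1)):
                nested_in[p].append(freq[q])

    scores = {}
    for p, f_p in freq.items():
        if nested_in[p]:  # p appears inside longer candidates
            f_p = f_p - sum(nested_in[p]) / len(nested_in[p])
        scores[p] = math.log2(len(p)) * f_p
    return scores

# Toy counts: "vector machine" mostly occurs inside the longer phrase,
# so its C-value is heavily penalized (completeness).
counts = {("support", "vector", "machine"): 90, ("vector", "machine"): 95}
print(c_value(counts))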


Ranking – TextRank

• Construct a network of phrases & unigrams
• Compute the importance of vertices
  • Similar to PageRank
• Popularity & Informativeness


[Figure: TextRank word graph built from the sample text: "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. …"]
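A minimal TextRank-style sketch over that sample text, assuming networkx is available; the window size and the lack of POS filtering are simplifications of the original method.

# TextRank-style keyword ranking sketch (requires networkx).
import networkx as nx

text = ("compatibility of systems of linear constraints over the set "
        "of natural numbers")
words = text.split()
window = 2  # co-occurrence window size (a tunable simplification)

graph = nx.Graph()
for i, w in enumerate(words):
    for v in words[i + 1:i + 1 + window]:
        if w != v:
            graph.add_edge(w, v)  # edge for words co-occurring in a window

scores = nx.pagerank(graph)  # vertex importance, as in PageRank
for w, s in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{w}: {s:.3f}")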

Ranking – TF-IDF

• Term Frequency
  • E.g., raw frequency
  • Rewards frequent phrases
• Inverse Document Frequency
  • E.g., log((# of all documents) / (# of documents containing the phrase))
  • Rewards "rare" phrases
• Popularity & Informativeness
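A minimal sketch combining the two signals to rank phrases; the documents below are assumed toy inputs already reduced to candidate phrases.

# Minimal TF-IDF ranking sketch over toy phrase-level documents.
import math
from collections import Counter

docs = [
    ["data mining", "machine learning", "the paper"],
    ["machine learning", "information retrieval", "the paper"],
    ["data mining", "the paper"],
]

tf = Counter(p for doc in docs for p in doc)        # corpus-level frequency
df = Counter(p for doc in docs for p in set(doc))   # document frequency

n_docs = len(docs)
tfidf = {p: tf[p] * math.log(n_docs / df[p]) for p in tf}
for p, score in sorted(tfidf.items(), key=lambda kv: -kv[1]):
    print(f"{p}: {score:.3f}")
# "the paper" occurs in every document, so its IDF (and score) is 0.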


Three Families of Methods

• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly/Distantly Supervised

Unsupervised Phrase Mining

• Statistics based on massive text corpora
  • Popularity
    • Raw frequency
    • Frequency distribution based on Zipfian ranks [Deane'05]
  • Concordance
    • Significance score [Church et al.'91] [El-Kishky et al.'14]
  • Completeness
    • Comparison to super-/sub-sequences [Parameswaran et al.'10]

Raw Frequency

• Frequent contiguous pattern mining
  • If "AB" is frequent, "AB" is likely to be a phrase
• It prefers
  • "Stop phrases"
  • Shorter phrases
    • E.g., freq(vector machine) ≥ freq(support vector machine)
• Raw frequency alone cannot reflect the quality of phrases

Raw Frequency (improved)

• Combine with topic modeling
  • Merge adjacent unigrams of the same topic [Blei & Lafferty'09]
  • Frequent pattern mining within the same topic [Danilevsky et al.'14]
• Limitations
  • Tokens in the same phrase may be assigned to different topics
    • E.g., "knowledge discovery using least squares support vector machine classifiers…"

Frequency Distribution

• Idea: ranks in a Zipfian frequency distribution are more reliable than raw frequency
• Heuristic: Actual Rank / Expected Rank
• Example:
  • Given a phrase like "east end"
  • Actual Rank: the rank of "east end" among all occurrences of "east" (e.g., "east end", "east side", "the east", "towards the east", etc.)
  • Expected Rank: the rank of "__ end" among all contexts of "east" (e.g., "__ end", "__ side", "the __", "towards the __", etc.)

Significance score

• Significance score [Church et al.'91]
  • A.k.a. Z-score
• ToPMine [El-Kishky et al.'14]
  • If a phrase can be decomposed into two parts
    • P = P₁ ● P₂
    • α(P₁, P₂) ≈ (f(P₁●P₂) − µ₀(P₁, P₂)) / √f(P₁●P₂), where µ₀(P₁, P₂) is the expected frequency of P₁●P₂ under an independence assumption


Significance score (cont'd)

• Merge adjacent unigrams greedily if their significance score is above the threshold (see the sketch below).
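A minimal sketch of this greedy merging, assuming a precomputed table of raw counts over a corpus of `total` tokens; the counts and the threshold below are toy values, not ones from the ToPMine paper.

import math

def significance(f_merged, f_left, f_right, total):
    # alpha(P1, P2) = (f(P1.P2) - mu0(P1, P2)) / sqrt(f(P1.P2)),
    # where mu0 is the expected count under independence.
    mu0 = f_left * f_right / total
    return (f_merged - mu0) / math.sqrt(f_merged)

def merge_phrases(tokens, count, total, threshold=5.0):
    """Greedily merge the adjacent pair with the highest significance
    score until no pair clears the threshold."""
    segs = [(t,) for t in tokens]
    while len(segs) > 1:
        candidates = []
        for i in range(len(segs) - 1):
            merged = segs[i] + segs[i + 1]
            f_m = count.get(merged, 0)
            if f_m > 0:
                score = significance(f_m, count.get(segs[i], 0),
                                     count.get(segs[i + 1], 0), total)
                candidates.append((score, i))
        if not candidates:
            break
        best, i = max(candidates)
        if best < threshold:
            break
        segs[i:i + 2] = [segs[i] + segs[i + 1]]  # merge the best pair
    return segs

# Toy counts over a hypothetical 100k-token corpus.
count = {("support",): 500, ("vector",): 600, ("machine",): 800,
         ("support", "vector"): 450,
         ("support", "vector", "machine"): 400}
print(merge_phrases(["support", "vector", "machine"], count, 100_000))
# -> [('support', 'vector', 'machine')]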


Comparison to super/sub-sequences

• Frequency ratio between an n-gram phrase and its two (n-1)-gram sub-phrases
• Example
  • Pre-confidence of "San Antonio": 2385 / 14585
  • Post-confidence of "San Antonio": 2385 / 2855
• Expand / terminate based on thresholds (see the sketch after the table)

Phrase         Raw frequency
San            14585
Antonio        2855
San Antonio    2385
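A sketch of the confidence computation from the table above; the expand/terminate threshold is an assumed toy value.

freq = {"San": 14585, "Antonio": 2855, "San Antonio": 2385}

def confidences(phrase, freq):
    """Frequency ratios against the two (n-1)-gram sub-phrases."""
    prefix = phrase.rsplit(" ", 1)[0]   # drop the last token
    suffix = phrase.split(" ", 1)[1]    # drop the first token
    return freq[phrase] / freq[prefix], freq[phrase] / freq[suffix]

pre, post = confidences("San Antonio", freq)
print(f"pre-confidence={pre:.3f}, post-confidence={post:.3f}")
# pre-confidence=0.164, post-confidence=0.835

threshold = 0.5  # assumed; expand only if at least one ratio is high
print("expand" if max(pre, post) >= threshold else "terminate")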

Comparison to super/sub-sequences (cont'd)

• Assumption: for an n-gram quality phrase, at least one of its two (n-1)-gram sub-phrases is not a quality phrase.
• Anti-example
  • "relational database system" is a quality phrase.
  • Both "relational database" and "database system" can be quality phrases.

Limitations of Statistical Signals

• The thresholds must be carefully chosen.
• Only a subset of the quality-phrase requirements is considered.
• Combining different signals in an unsupervised manner is difficult.
  • Introducing some supervision may help!

Three Families of Methods

• Supervised (linguistic analyzers)
• Unsupervised (statistical signals)
• Weakly/Distantly Supervised

Weakly / Distantly Supervised Phrase Mining Methods

• SegPhrase [Liu et al.'15]
  • Weakly supervised
• AutoPhrase [Shang et al.'17]
  • Distantly supervised

SegPhrase

[Figure: SegPhrase pipeline. An input raw corpus (e.g., Document 1: "Citation recommendation is an interesting but challenging research problem in data mining area."; Document 2: "In this study, we investigate the problem in the context of heterogeneous information networks using data mining technique."; Document 3: "Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.") is processed by phrase mining to produce quality phrases and by phrasal segmentation to produce a segmented corpus.]

• Outperforms all of the above methods on domain-specific corpora (e.g., Yelp reviews)

Quality Estimation

• Weakly supervised
  • Labels: whether a phrase is a quality one or not
    • "support vector machine": 1
    • "the experiment shows": 0
  • For a ~1GB corpus, only 300 labels
• Pros
  • Binary annotations are easy
• Cons
  • The selection of hundreds of varying-quality phrases from millions of candidates must be done carefully.
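As an illustration of this weakly supervised setup, here is a minimal sketch that trains a binary classifier on a few hundred labeled phrases and scores the remaining candidates. The features and all numbers are assumptions for illustration, not SegPhrase's actual feature set or classifier configuration.

# Minimal weakly supervised quality-estimation sketch
# (requires numpy and scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Each row: [raw frequency, concordance score, idf, length] of a phrase.
X_labeled = np.array([
    [1200.0, 25.0, 3.1, 3],   # "support vector machine" -> quality (1)
    [5000.0,  0.5, 0.2, 3],   # "the experiment shows"   -> not quality (0)
    # ... up to ~300 labeled phrases in total
])
y_labeled = np.array([1, 0])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_labeled, y_labeled)

# Score the millions of remaining unlabeled candidates.
X_candidates = np.array([[900.0, 18.0, 2.7, 2]])  # e.g., "machine learning"
print(clf.predict_proba(X_candidates)[:, 1])      # estimated quality score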


Phrasal Segmentation

• Phrasal segmentation can tell which phrase is more appropriate
  • Ex: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...
• Effects on quality re-estimation (real data)
  • "np hard in the strong sense"
  • "np hard in the strong" (not counted towards the rectified frequency)
  • "database management system"
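A minimal dynamic-programming sketch of phrasal segmentation: choose the segmentation whose segments maximize the total quality score, so occurrences inside a better segmentation are not counted towards a phrase's rectified frequency. The scores below are assumed for illustration only.

quality = {  # toy quality scores (assumptions)
    ("feature", "vector"): 0.9,
    ("machine", "learning"): 0.9,
    ("vector", "machine"): 0.8,   # a good phrase, but a worse fit here
}

def segment(tokens, quality, max_len=3, unigram_score=0.1):
    n = len(tokens)
    best = [0.0] + [float("-inf")] * n   # best[i]: best score of tokens[:i]
    back = [0] * (n + 1)                 # backpointer to the segment start
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = tuple(tokens[j:i])
            if len(seg) == 1:
                score = quality.get(seg, unigram_score)
            else:
                score = quality.get(seg, float("-inf"))
            if best[j] + score > best[i]:
                best[i], back[i] = best[j] + score, j
    segments, i = [], n
    while i > 0:                         # recover segments right-to-left
        segments.append(tuple(tokens[back[i]:i]))
        i = back[i]
    return segments[::-1]

print(segment("a standard feature vector machine learning setup".split(),
              quality))
# -> [('a',), ('standard',), ('feature', 'vector'),
#     ('machine', 'learning'), ('setup',)]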

From the Titles and Abstracts of SIGMOD

Query: SIGMOD

Rank   SegPhrase                Chunking (TF-IDF & C-value)
1      database                 data base
2      database system          database system
3      relational database      query processing
4      query optimization       query optimization
5      query processing         relational database
…      …                        …
51     sql server               database technology
52     relational data          database server
53     data structure           large volume
54     join query               performance study
55     web service              web service
…      …                        …
201    high dimensional data    efficient implementation
202    location based service   sensor network
203    xml schema               large collection
204    two phase locking        important issue
205    deep web                 frequent itemset
…      …                        …

(Highlighting in the original slide marks phrases found only in SegPhrase or only in Chunking.)

From the Titles and Abstracts of SIGKDD

Query: SIGKDD

Rank   SegPhrase                 Chunking (TF-IDF & C-value)
1      data mining               data mining
2      dataset                   association rule
3      association rule          knowledge discovery
4      knowledge discovery       frequent itemset
5      time series               decision tree
…      …                         …
51     association rule mining   search space
52     rule set                  domain knowledge
53     concept drift             important problem
54     knowledge acquisition     concurrency control
55     gene expression data      conceptual graph
…      …                         …
201    web content               optimal solution
202    frequent subgraph         semantic relationship
203    intrusion detection       effective way
204    categorical attribute     space complexity
205    user preference           small set
…      …                         …

(Highlighting in the original slide marks phrases found only in SegPhrase or only in Chunking.)

Reported by TripAdvisor (Find "Interesting" Collections of Hotels)

AutoPhrase

• No label selection and annotation effort
• Smoothly supports multiple languages

AutoPhrase vs. Previous Work

[Figure: comparisons with previous work across different domains and different languages.]

AutoPhrase’s Example Results


References

Deane, P., 2005, June. A nonparametric method for extraction of candidate phrasal terms. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics (pp. 605-613). Association for Computational Linguistics.

Koo, T., Carreras Pérez, X. and Collins, M., 2008. Simple semi-supervised dependency parsing. In 46th Annual Meeting of the Association for Computational Linguistics (pp. 595-603).

Xun, E., Huang, C. and Zhou, M., 2000, October. A unified statistical model for the identification of English baseNP. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics (pp. 109-116). Association for Computational Linguistics.

Zhang, Z., Iria, J., Brewster, C. and Ciravegna, F., 2008, May. A comparative evaluation of term recognition algorithms. In LREC.

Park, Y., Byrd, R.J. and Boguraev, B.K., 2002, August. Automatic glossary extraction: beyond terminology identification. In Proceedings of the 19th international conference on Computational linguistics-Volume 1 (pp. 1-7). Association for Computational Linguistics.

Witten, I.H., Paynter, G.W., Frank, E., Gutwin, C. and Nevill-Manning, C.G., 1999, August. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries (pp. 254-255). ACM.

Liu, Z., Chen, X., Zheng, Y. and Sun, M., 2011, June. Automatic keyphrase extraction by bridging vocabulary gap. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning (pp. 135-144). Association for Computational Linguistics.

Evans, D.A. and Zhai, C., 1996, June. Noun-phrase analysis in unrestricted text for information retrieval. In Proceedings of the 34th annual meeting on Association for Computational Linguistics (pp. 17-24). Association for Computational Linguistics.


Frantzi, K., Ananiadou, S. and Mima, H., 2000. Automatic recognition of multi-word terms: the C-value/NC-value method. International Journal on Digital Libraries, 3(2), pp.115-130.

Mihalcea, R. and Tarau, P., 2004, July. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Blei, D.M. and Lafferty, J.D., 2009. Topic models. Text mining: classification, clustering, and applications, 10(71), p.34.

Danilevsky, M., Wang, C., Desai, N., Ren, X., Guo, J. and Han, J., 2014, April. Automatic construction and ranking of topical keyphrases on collections of short documents. In Proceedings of the 2014 SIAM International Conference on Data Mining (pp. 398-406). Society for Industrial and Applied Mathematics.

Church, K., Gale, W., Hanks, P. and Hindle, D., 1991. Using statistics in lexical analysis. Lexical acquisition: exploiting on-line resources to build a lexicon, 115, p.164.

El-Kishky, A., Song, Y., Wang, C., Voss, C.R. and Han, J., 2014. Scalable topical phrase mining from text corpora. Proceedings of the VLDB Endowment, 8(3), pp.305-316.

Parameswaran, A., Garcia-Molina, H. and Rajaraman, A., 2010. Towards the web of concepts: Extracting concepts from large datasets. Proceedings of the VLDB Endowment, 3(1-2), pp.566-577.

Liu, J., Shang, J., Wang, C., Ren, X. and Han, J., 2015, May. Mining quality phrases from massive text corpora. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (pp. 1729-1744). ACM.

Shang, J., Liu, J., Jiang, M., Ren, X., Voss, C.R. and Han, J., 2017. Automated Phrase Mining from Massive Text Corpora. arXiv preprint arXiv:1702.04457.
