Automatic Keyphrase Extraction by Bridging Vocabulary Gap
Xinxiong Chen, Tsinghua University
2013-04-26
THUNLP, Tsinghua University, http://nlp.csai.tsinghua.edu.cn


Main Idea
  Vocabulary gap: appropriate keyphrases are not always statistically significant in the given document, and some do not appear in it at all.
  Use word alignment models from statistical machine translation to learn translation probabilities between the words in documents and the words in keyphrases.


Introduction – Keyphrase
  What is a keyphrase?
    A set of terms selected from a document as a short summary of the document.


Introduction – Keyphrase Extraction

  Why keyphrase extraction?
    Digital libraries
    Information retrieval
  Goal: automatically extract keyphrases from documents
    Unsupervised


Example
  A news article (translated from Chinese):

    Title: Israeli Military Claims Iran Can Produce Nuclear Bombs and Considering Military Action against Iran
    Summary: …
    Content: …
    Keywords: Israeli, Iran, Nuclear bombs, Nuclear weapon


Example

  Existing unsupervised methods:
    TFIDF: Nuclear bombs, Iran, Israeli, enriched uranium, speech
    TextRank: Iran, Israeli, chief, Nuclear bombs, Military
      Use a fixed-size window to build a word graph
      Use PageRank to decide which words are more important
  Term frequency (TF): Israeli (6), Iran (6), intelligence (5), nuclear bombs (4), enriched uranium (3), …, nuclear weapon (1)
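The two TextRank steps named above (a fixed-size window to build the word graph, then PageRank over it) can be sketched roughly as follows. This is a minimal illustration, not the presenter's implementation: the function name `textrank`, the window size, the damping factor, and the toy token list are all assumptions.

```python
from collections import defaultdict

def textrank(words, window=3, damping=0.85, iterations=50):
    """Rank words with PageRank over a fixed-window co-occurrence graph."""
    # Link every pair of distinct words that co-occur within `window` tokens.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    vocab = list(neighbors)
    if not vocab:
        return {}
    # Power iteration on the undirected, unweighted graph.
    scores = {w: 1.0 / len(vocab) for w in vocab}
    for _ in range(iterations):
        updated = {}
        for w in vocab:
            incoming = sum(scores[u] / len(neighbors[u]) for u in neighbors[w])
            updated[w] = (1 - damping) / len(vocab) + damping * incoming
        scores = updated
    return scores

# Hypothetical toy input, loosely based on the example article.
tokens = "iran nuclear bombs israeli military iran nuclear weapon iran israeli".split()
scores = textrank(tokens)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{score:.4f}")
```

Words that never co-occur with the rest of the article cannot be ranked this way, which is one face of the vocabulary gap the talk addresses.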


Example
  LDA: Iran, England, America, Nation, Speech
    Learn topics from documents
  ExpandRank: Iran, enriched uranium, Israeli, atomic energy, Lebanon
    Find the k nearest neighbor documents to build word graphs


Idea - Association
  If a word is mentioned, it reminds people of other words.
    iPhone – Apple
    Nuclear bombs – Nuclear Weapon
  What is the probability of association between “Nuclear bombs” and “Nuclear Weapon”?


Idea – SMT for Keyphrase Extraction

  Both the content and the keyphrases are parallel summaries of a news article.
  Unsupervised: use the title or the summary in place of keyphrases.
  Estimate the translation probabilities between the words in the content and the words in the title with word alignment models.
  [Diagram: a news article has Content and Title (Summarization); translation links the two.]


Translation Probability
  Example: translation probabilities for “Nuclear bombs”

    Nuclear bombs       0.515757
    Liquid              0.0871815
    Nuclear Weapon      0.0808868
    Military Action     0.0239178
    Israeli Military    0.0215988
    Miniaturization     0.0118
    Possible            0.0113688
    enriched uranium    0.0100252


Keyphrase Extraction Using WAM

  Given a news article, rank keyphrases by computing their scores.
  Iran, Israeli, chief, Nuclear bombs, Military, …
  Iran, Israeli, chief, Nuclear bombs, Nuclear weapon, Military, speech


Word Trigger Method (WTM)
  Three steps:
    1. Preparing translation pairs
    2. Learning a translation model (IBM Model-1)
    3. Extracting keyphrases given a resource
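Step 2 can be illustrated with a small EM trainer for IBM Model 1 over content-title pairs. This is a simplified sketch under stated assumptions: it omits the NULL source word, the function name `train_ibm_model1` and the three toy pairs are hypothetical, and a real system would more likely rely on an existing alignment toolkit.

```python
from collections import defaultdict

def train_ibm_model1(pairs, iterations=10):
    """EM training of IBM Model 1 word translation probabilities.

    pairs: list of (content_words, title_words) tuples.
    Returns {(title_word, content_word): Pr(title_word | content_word)}.
    Simplification: no NULL source word is used.
    """
    title_vocab = {t for _, title in pairs for t in title}
    uniform = 1.0 / len(title_vocab)
    prob = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (t, w) alignments
        total = defaultdict(float)  # expected count of alignments involving w
        for content, title in pairs:
            for t in title:
                # E-step: spread one unit of mass for t over the content words.
                norm = sum(prob[(t, w)] for w in content)
                for w in content:
                    frac = prob[(t, w)] / norm
                    count[(t, w)] += frac
                    total[w] += frac
        # M-step: renormalize per content word.
        prob = defaultdict(float, {(t, w): c / total[w] for (t, w), c in count.items()})
    return dict(prob)

# Hypothetical toy corpus of (content words, title words).
pairs = [
    (["iran", "nuclear", "bombs", "uranium"], ["iran", "nuclear", "weapon"]),
    (["israeli", "military", "iran"], ["israeli", "iran"]),
    (["nuclear", "bombs", "test"], ["nuclear", "weapon"]),
]
table = train_ibm_model1(pairs)
for (t, w), p in sorted(table.items(), key=lambda kv: -kv[1])[:5]:
    print(f"Pr({t} | {w}) = {p:.3f}")
```

Swapping the two sides of each pair trains the model in the opposite direction.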


Translation Pairs
  Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource


Content-Title Pairs
  Length imbalance problem
  Unable to list all tags on the annotation side
  Tags may have different importance for the resource
  Sampling method
    Tag weighting type: TF_t, TF-IRF_t
    Length ratio
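One plausible reading of the sampling method is to draw content words in proportion to a weight (plain term frequency here) until the sampled content reaches a chosen length ratio relative to the title. The slides do not spell out the procedure, so everything in this sketch is an assumption: the function name `sample_content`, sampling with replacement, and the use of TF rather than TF-IRF as the weight.

```python
import random
from collections import Counter

def sample_content(content_words, title_words, length_ratio=1.0, seed=0):
    """Sample content words by term frequency to balance a content-title pair.

    The sampled length is length_ratio * len(title_words), rounded, at least 1.
    Sampling is with replacement (an assumption, not taken from the slides).
    """
    rng = random.Random(seed)
    tf = Counter(content_words)
    vocab = list(tf)
    weights = [tf[w] for w in vocab]
    target_len = max(1, round(length_ratio * len(title_words)))
    return rng.choices(vocab, weights=weights, k=target_len)

# Hypothetical toy pair.
content = ["iran"] * 6 + ["israeli"] * 6 + ["intelligence"] * 5 + ["uranium"] * 3
title = ["israeli", "military", "iran", "nuclear", "bombs"]
sampled = sample_content(content, title, length_ratio=2.0)
print(len(sampled), sampled)
```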


Learning Translation Probabilities
  IBM Model-1 as the WAM algorithm
  Asymmetric: Pr_d2a(t|w) (content → title) and Pr_a2d(t|w) (title → content)
  Linear combination of Pr_d2a(t|w) and Pr_a2d(t|w), weighted by λ
  When λ = 1 or λ = 0, it simply uses model Pr_d2a(t|w) or Pr_a2d(t|w), respectively.
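Taking "linear combination" literally, and matching the stated behavior at λ = 1 and λ = 0, the combined probability would be Pr(t|w) = λ · Pr_d2a(t|w) + (1 − λ) · Pr_a2d(t|w). The sketch below assumes that form; the function name `combine` and the probability values are made up for illustration.

```python
def combine(pr_d2a, pr_a2d, lam):
    """Linearly interpolate two direction-specific translation tables.

    Both tables map (t, w) to a probability; missing entries count as 0.
    lam = 1 returns pr_d2a unchanged, lam = 0 returns pr_a2d unchanged.
    """
    keys = set(pr_d2a) | set(pr_a2d)
    return {k: lam * pr_d2a.get(k, 0.0) + (1 - lam) * pr_a2d.get(k, 0.0) for k in keys}

# Hypothetical probabilities keyed by (keyphrase word t, content word w).
pr_d2a = {("weapon", "bombs"): 0.08, ("iran", "iran"): 0.6}
pr_a2d = {("weapon", "bombs"): 0.20, ("iran", "iran"): 0.5}
combined = combine(pr_d2a, pr_a2d, lam=0.5)
print(combined)
```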


Tag Suggestion Using Triggered Words

  Given a description, rank tags by computing their scores.
  Trigger power of the word w in the content:
    TF-IRF_w
    TextRank
    Their product


Keyphrase Extraction Using Triggered Words

  Given a description, rank tags by computing their scores.
  Translation probabilities from words in the description to keyphrases
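The scoring formula itself did not survive in this transcript. A natural form, assumed here rather than taken from the slides, sums over content words the product of each word's trigger power and its translation probability to the candidate: score(t) = Σ_w Pr(t|w) · power(w). In the sketch below the "nuclear bombs" row reuses two probabilities from the translation example earlier in the talk; the trigger powers and the "iran" row are hypothetical, as is the function name.

```python
from collections import defaultdict

def rank_keyphrases(trigger_power, translation, top_k=5):
    """Score each candidate t as the sum over w of power(w) * Pr(t | w)."""
    scores = defaultdict(float)
    for w, power in trigger_power.items():
        for t, p in translation.get(w, {}).items():
            scores[t] += power * p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# translation[w][t] = Pr(t | w)
translation = {
    "nuclear bombs": {"nuclear bombs": 0.515757, "nuclear weapon": 0.0808868},
    "iran": {"iran": 0.6},  # hypothetical value
}
trigger_power = {"nuclear bombs": 0.4, "iran": 0.6}  # hypothetical values
ranked = rank_keyphrases(trigger_power, translation)
for phrase, score in ranked:
    print(f"{phrase}\t{score:.4f}")
```

In this toy input "nuclear weapon" has no trigger power of its own, yet it still receives a non-zero score through "nuclear bombs".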


Emphasize Tags Appearing In Content for WTM (EWTM)

  Emphasize tags appearing in the description
  I_t(w): indicator function that emphasizes the tags appearing in the content
    1 when t = w
    0 when t ≠ w
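How the indicator enters the score is not shown in this transcript. One possible form, assumed here, mixes it with the translation probability through a weight α: score(t) = Σ_w (α · I_t(w) + (1 − α) · Pr(t|w)) · power(w). The weight `alpha`, the function name, and all numeric inputs except the two "nuclear bombs" probabilities are hypothetical.

```python
from collections import defaultdict

def rank_keyphrases_ewtm(trigger_power, translation, alpha=0.5, top_k=5):
    """Score candidates with an extra bonus for words that occur in the content."""
    scores = defaultdict(float)
    for w, power in trigger_power.items():
        # Indicator term: I_t(w) is 1 only for the candidate t equal to w.
        scores[w] += alpha * power
        # Translation term, down-weighted by (1 - alpha).
        for t, p in translation.get(w, {}).items():
            scores[t] += (1 - alpha) * p * power
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# translation[w][t] = Pr(t | w); the "iran" row is a hypothetical value.
translation = {
    "nuclear bombs": {"nuclear bombs": 0.515757, "nuclear weapon": 0.0808868},
    "iran": {"iran": 0.6},
}
trigger_power = {"nuclear bombs": 0.4, "iran": 0.6}  # hypothetical values
ranked_ewtm = rank_keyphrases_ewtm(trigger_power, translation, alpha=0.5)
for phrase, score in ranked_ewtm:
    print(f"{phrase}\t{score:.4f}")
```

Setting `alpha=0` reduces this to the plain translation-based score.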


Experiments
  Datasets
    13,702 news articles from www.163.com
  Evaluation metrics
    Precision, recall and F-measure

  Dataset statistics
    Words in documents          72,900
    Words in keyphrases         12,405
    Length of documents         971.7 words
    Length of titles            11.6 words
    Length of summarizations    45.8 words
    Number of keyphrases        2.4
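For reference, the three metrics can be computed per document as below. The sketch assumes exact string matching between extracted and gold keyphrases and the balanced F-measure (F1); the talk does not state either choice. The inputs reuse the TFIDF output and the gold keywords from the example article.

```python
def precision_recall_f1(extracted, gold):
    """Precision, recall and balanced F-measure under exact matching."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["Israeli", "Iran", "Nuclear bombs", "Nuclear weapon"]
tfidf = ["Nuclear bombs", "Iran", "Israeli", "enriched uranium", "speech"]
p, r, f = precision_recall_f1(tfidf, gold)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Three of the five TFIDF outputs match the four gold keywords, giving precision 0.6, recall 0.75 and F of about 0.667.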


Experiment Results


Parameters – Length Ratio
  The length ratio: content/title


Application
  SINA APP (http://app.thunlp.org/weibo)
  Now we have more than 2 million registered users.


Thank you!
Q & A