How can we capture multiword expressions?

Seongmin Mun 1, Guillaume Desagulier 2, Anne Lacheret 3, Kyungwon Lee 4

1 Lifemedia Interdisciplinary Program, Ajou University, South Korea
1,3 UMR 7114 MoDyCo - CNRS, University Paris Nanterre, France
2 UMR 7114 MoDyCo - University Paris 8, CNRS, University Nanterre
4 Department of Digital Media, Ajou University, South Korea

Introduction

The topics in a text corpus carry its key features and information.

Analyzing these topics can improve a user’s understanding of the corpus.


Previous studies

Weiwei Cui, Shixia Liu, Zhuofeng Wu, Hao Wei: How hierarchical topics evolve in large text corpora. In IEEE Transactions on Visualization and Computer Graphics (2014), vol. 20, pp. 2281–2290.


Research background and purpose

Topics can be broadly divided into two categories: those expressed by a single word and those that require a combination of words.


“With profound gratitude and great humility, I accept your nomination for the presidency of the United States.”


“Gratitude”: a meaning that can be expressed in one word.


“United States”: a meaning that must be expressed using a combination of words.


How can we capture multiword expressions?

To this end, we designed an algorithm.


Data processing

Raw corpus → Processing → Topic candidate → Topic validation → Generate topics


Raw corpus (U.S. presidential speeches)

https://millercenter.org/the-presidency/presidential-speeches


Processing
• N-grams
• POS tagging

Pre-processing
• Cleaning with RegExp
• Lemmatization
• Tokenization
• Lowercasing
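The pre-processing steps listed above could be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression, the function name `preprocess`, the order in which the steps run, and the tiny `LEMMAS` lookup table (standing in for a real lemmatizer) are all assumptions.

```python
import re

# Hypothetical lemma lookup standing in for a real lemmatizer.
LEMMAS = {"flies": "fly", "states": "state", "speeches": "speech"}

def preprocess(text):
    """Clean, lowercase, tokenize, and lemmatize a raw string."""
    # Cleaning with RegExp: keep letters, apostrophes, and whitespace only.
    cleaned = re.sub(r"[^A-Za-z'\s]", " ", text)
    # Lowercasing.
    lowered = cleaned.lower()
    # Tokenization: split on runs of whitespace.
    tokens = lowered.split()
    # Lemmatization: map each token to its lemma when the table knows it.
    return [LEMMAS.get(tok, tok) for tok in tokens]

print(preprocess("Time flies like an arrow."))
```

In practice the lookup table would be replaced by a full lemmatizer; the sketch only shows how the four steps chain together.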

An n-gram is a contiguous sequence of N items from a given sequence of text.


“Time flies like an arrow.”


Unigrams: Time, flies, like, an, arrow
Bigrams: Time flies, flies like, like an, an arrow
Trigrams: Time flies like, flies like an, like an arrow
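The n-gram lists above can be reproduced with a short sliding-window function. This is a minimal sketch; the function name `ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "Time flies like an arrow".split()
for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence of five tokens yields five unigrams, four bigrams, and three trigrams, matching the example.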


Topic candidate extraction & filtering
• Frequency counting
• Filters:
  ✓ Stopwords
  ✓ Thresholds
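Frequency counting followed by the two filters could look like the sketch below. The stopword list, the threshold value, and the rule of rejecting n-grams that begin or end with a stopword are assumptions chosen for illustration; the slides do not specify them.

```python
from collections import Counter

# Hypothetical stopword list and threshold, chosen only for illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "for", "to"}
MIN_COUNT = 2

def candidate_topics(ngram_list, stopwords=STOPWORDS, min_count=MIN_COUNT):
    """Count n-grams, then drop those failing the stopword or threshold filter."""
    counts = Counter(ngram_list)
    kept = {}
    for gram, count in counts.items():
        words = gram.split()
        if not words:
            continue
        # Stopword filter: reject n-grams that start or end with a stopword.
        if words[0] in stopwords or words[-1] in stopwords:
            continue
        # Threshold filter: reject n-grams seen fewer than min_count times.
        if count < min_count:
            continue
        kept[gram] = count
    return kept

bigrams = ["united states", "of the", "united states", "the united", "great humility"]
print(candidate_topics(bigrams))
```

Here only "united states" survives: "of the" and "the united" fail the stopword filter, and "great humility" falls below the threshold.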


Topic validation
• Human annotation
• Matching with dictionaries

English dictionaries
• THE DEVIL'S DICTIONARY ((C)1911 Released April 15 1993)
• Easton's 1897 Bible Dictionary
• Elements database 20001107
• The Free On-line Dictionary of Computing (27 SEP 03)
• U.S. Gazetteer (1990)
• The Collaborative International Dictionary of English v.0.44
• Hitchcock's Bible Names Dictionary (late 1800's)
• Jargon File (4.3.1, 29 June 2001)
• Virtual Entity of Relevant Acronyms (Version 1.9, June 2002)
• WordNet (r) 2.0
• CIA World Factbook 2002
• User Dictionary
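Matching candidates against dictionaries can be sketched as a simple set lookup. The function name `validate` and the sample entries are hypothetical; how the headwords are actually loaded from the dictionaries listed above is not described in the slides.

```python
def validate(candidates, dictionary_entries):
    """Split candidates into those found in the dictionary and those not found."""
    # Compare case-insensitively so "United States" matches "united states".
    known = {entry.lower() for entry in dictionary_entries}
    matched = [c for c in candidates if c.lower() in known]
    unmatched = [c for c in candidates if c.lower() not in known]
    return matched, unmatched

# Hypothetical entries standing in for headwords loaded from the dictionaries.
entries = ["United States", "gratitude", "wake up"]
matched, unmatched = validate(["united states", "states with"], entries)
print(matched)
print(unmatched)
```

Candidates that no dictionary recognizes could then be passed on to human annotation, the other validation route named on the slide.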


Visual system

http://ressources.modyco.fr/sm/MultiwordVis/


Ambiguous sentence

“Shall I wake him up?”


We cannot extract “wake up” if we use only the n-gram algorithm, because “wake” and “up” are not contiguous in this sentence.
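The limitation is easy to check with a small sketch. The helper `ngrams` and the lowercased whitespace tokenization are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "shall i wake him up".split()
bigrams = ngrams(tokens, 2)
print(bigrams)
print("wake up" in bigrams)
```

The bigrams are "shall i", "i wake", "wake him", and "him up", so the membership test prints `False`: the object "him" separates the verb from its particle.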


Dependency tag

Dependency tags provide a simple description of the grammatical relationships in a sentence.
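As a sketch of how dependency tags can recover a discontinuous expression, the example below uses a hand-written parse of “Shall I wake him up?”. The parse itself, the function name `verb_particle_pairs`, and the relation labels (which follow Universal Dependencies, where `compound:prt` marks a phrasal-verb particle) are assumptions; the slides do not say which parser or tag set was used.

```python
# Hand-written dependency parse of "Shall I wake him up?" (an assumption,
# not parser output). Each entry is (index, word, head_index, relation);
# head_index 0 marks the root.
PARSE = [
    (1, "shall", 3, "aux"),
    (2, "i", 3, "nsubj"),
    (3, "wake", 0, "root"),
    (4, "him", 3, "obj"),
    (5, "up", 3, "compound:prt"),
]

def verb_particle_pairs(parse, particle_relation="compound:prt"):
    """Join each particle with the head word it depends on."""
    words = {index: word for index, word, _, _ in parse}
    return [
        f"{words[head]} {word}"
        for _, word, head, relation in parse
        if relation == particle_relation
    ]

print(verb_particle_pairs(PARSE))
```

Because the particle "up" points to its head "wake" regardless of the words in between, the pair "wake up" is recovered even though the two words are not adjacent.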


Improving the algorithm

N-grams + dependency tags


Data processing

Raw corpus → Processing → Topic candidate → Topic validation → Generate topics

Distinguish sentence

Storing results

Processing
• N-grams
• Dependency tag
• POS tagging

Pre-processing
• Cleaning with RegExp
• Lemmatization
• Tokenization
• Lowercasing


Q&A

Thank you.
