Discovering and Describing Coherent and Meaningful Topics from a Text Collection

Henry Anaya-Sanchez

IR&NLP-UNED

May 7th, 2013

Content

1 Introduction

2 Discovering topics from term pairs

3 Methodology

4 Evaluation

5 Conclusions

Motivation

i. Information systems and users need to analyze, structure, and summarize large collections of text documents according to the main subject themes that run through the collection (i.e., their topics).

ii. Traditional approaches to discovering and describing topics, based on clustering and Probabilistic Topic Modeling (PTM), are insufficient to always provide end-users with coherent and meaningful topics.

Motivation

Clustering and PTM are unsupervised learning techniques that have been widely used for topic discovery from documents.

i. Clustering methods aim at generating document groups or clusters, each one representing a different topic.

ii. PTM approaches represent the topics by learning a set of word distributions from which each document in the collection can be generated.

Motivation

However, the obtained clusters/distributions do not necessarily correspond to actual topics of interest:

i. They do not always correlate with human judgements, so they do not always provide end-users with semantically coherent (interpretable, subject-based) and meaningful (main theme vs. background) topics that summarize the content of a text collection.

Motivation

However, the obtained clusters/distributions do not necessarily correspond to actual topics of interest:

i. They are actually clusters/probability distributions of words with a statistical meaning that is sometimes difficult for humans to interpret and explain, since the information they convey is in many cases not related to a subject at all.

Motivation

On the other hand,

i. Clustering methods do not provide descriptions that summarize the clusters' contents (so that users can judge their relevance).

ii. The descriptions provided by PTM approaches are currently limited to listing the most probable (frequent) terms under each distribution.

Motivation

This talk presents an approach for discovering topics focused on:

1 how to discover the semantically coherent and meaningful (interpretable, subject-heading-like) topics comprised in a text collection, and

2 how to simultaneously provide an appropriate description for each topic so that humans can easily judge its relevance.

A topic discovery methodology based on term pairs

The methodology is closely related to a series of works, namely FIHC (Fung et al., 2003), CFWS (Li et al., 2008) and the method proposed by Malik and Kender (2006), that aim at obtaining both the coverage of a topic and its description simultaneously, by means of a new clustering criterion based on the concept of frequent term set (i.e., a set of terms that co-occur in at least a minimum number of documents in the text collection).

In these works, document clusters and their descriptions are determined by the frequent term sets of the document collection.
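
As a rough illustration of the frequent term set notion, the sketch below enumerates the term sets that co-occur in at least a minimum number of documents. The function name `frequent_term_sets`, the parameters `min_support` and `max_size`, and the toy documents are assumptions made for this example; it is a brute-force enumeration, not the mining procedure used by FIHC, CFWS or Malik and Kender.

```python
from collections import Counter
from itertools import combinations

def frequent_term_sets(docs, min_support, max_size=2):
    """Return {term set: document count} for sets found in >= min_support documents."""
    counts = Counter()
    for doc in docs:
        terms = sorted(set(doc))  # a document counts at most once per term set
        for size in range(1, max_size + 1):
            for combo in combinations(terms, size):
                counts[frozenset(combo)] += 1
    return {ts: n for ts, n in counts.items() if n >= min_support}

# Hypothetical toy collection: each document is a list of terms.
docs = [
    ["oil", "price", "opec"],
    ["oil", "price", "market"],
    ["election", "vote"],
]
result = frequent_term_sets(docs, min_support=2)
```

With `min_support=2`, only `{oil}`, `{price}` and `{oil, price}` survive, and the documents containing `{oil, price}` would form the candidate cluster described by that set.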

A topic discovery methodology based on term pairs

Similarly, our approach relies on highly probable term pairs generated from the collection. However, we use these pairs only as a guide to explore the possible topics of the collection.

Topics and their descriptions are generated from term pairs deemed to be representative of a collection topic.

We introduce the concept of homogeneity of a document set, which is aimed at checking if a set of documents is cohesive enough to represent a topic.

A topic discovery methodology based on term pairs

We define the probability of generating a pair of terms $\{t_i, t_j\} \in P$ from $C$ as:

$$P(\{t_i, t_j\} \mid C) = \sum_{d \in C} P(t_i \mid d)\, P(t_j \mid d)\, P(d \mid C) \qquad (1)$$
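
A minimal sketch of Equation (1) follows. The slides do not say how $P(t \mid d)$ and $P(d \mid C)$ are estimated, so the code assumes a maximum-likelihood estimate (term count divided by document length) and a uniform distribution over documents. The function name `pair_probability` and the toy collection are hypothetical, and documents are assumed to be non-empty.

```python
from collections import Counter

def pair_probability(ti, tj, docs):
    """Estimate P({ti, tj} | C) as the sum over d in C of P(ti|d) P(tj|d) P(d|C)."""
    p_doc = 1.0 / len(docs)  # assumption: uniform P(d|C)
    total = 0.0
    for doc in docs:
        counts = Counter(doc)
        n = len(doc)
        # assumption: maximum-likelihood estimate P(t|d) = count(t, d) / |d|
        total += (counts[ti] / n) * (counts[tj] / n) * p_doc
    return total

# Hypothetical toy collection: each document is a non-empty list of terms.
docs = [
    ["oil", "price", "oil", "opec"],
    ["oil", "market"],
    ["election", "vote"],
]
p = pair_probability("oil", "price", docs)
```

Only documents that contain both terms contribute to the sum, so in this toy collection the whole probability mass of the pair comes from the first document.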

A topic discovery methodology based on term pairs

We propose a novel way to estimate the homogeneity of a set of documents (specifically, for the support set of a given term pair) by analyzing its possible content coverage.

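It relies on the concept of pure entropy of a partition $\Theta = \{\Theta_1, \ldots, \Theta_q\}$:

$$H(\Theta) = -\sum_{i=1}^{q} P(\Theta_i \mid \Theta) \log_2 P(\Theta_i \mid \Theta) \qquad (2)$$

where $P(\Theta_i \mid \Theta)$ can be estimated as $|\Theta_i| / \sum_{j=1}^{q} |\Theta_j|$.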
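
Equation (2) can be sketched as below, using the size-based estimate of $P(\Theta_i \mid \Theta)$ given above. How the partition of a support set is obtained, and how the entropy is turned into a homogeneity decision, is not specified on this slide, so the example simply takes a partition as given. The function name `partition_entropy` and the document identifiers are hypothetical.

```python
import math

def partition_entropy(partition):
    """Entropy H of a partition, with P(block) estimated as |block| / sum of block sizes."""
    total = sum(len(block) for block in partition)
    entropy = 0.0
    for block in partition:
        if not block:
            continue  # empty blocks contribute nothing to the sum
        p = len(block) / total
        entropy -= p * math.log2(p)
    return entropy

# Hypothetical partitions of a four-document support set.
balanced = partition_entropy([{"d1", "d2"}, {"d3", "d4"}])
single = partition_entropy([{"d1", "d2", "d3", "d4"}])
```

A partition with a single block has entropy 0, while an even split into two blocks has entropy 1 bit, so lower values indicate documents concentrated in fewer blocks.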

A topic discovery methodology based on term pairs

Additionally, we provide larger descriptions based on the likelihood ratio score (Dunning, 1993).

This score has been widely used for estimating the correlation of terms with respect to a target topic (Lin and Hovy, 2000; Harabagiu and Lacatusu, 2005).
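
The slides do not give the formula, so the sketch below uses the standard formulation of Dunning's log-likelihood ratio ($G^2$) over a 2x2 contingency table. The function name `log_likelihood_ratio` and the layout of the table (term vs. other terms, topic documents vs. the rest of the collection) are assumptions for illustration; the way the talk builds the table and selects terms for a description may differ.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table of counts.

    Assumed layout: k11 = term in topic, k12 = other terms in topic,
    k21 = term outside topic, k22 = other terms outside topic.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes 0 (limit of x ln x)
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Hypothetical counts: a term over-represented in the topic vs. an independent one.
associated = log_likelihood_ratio(30, 70, 5, 895)
independent = log_likelihood_ratio(10, 10, 10, 10)
```

Terms whose score is high for a topic's documents would be candidates for the larger description, while a term distributed independently of the topic scores 0.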

A topic discovery methodology based on term pairs

To evaluate this approach we have used three benchmark collections: AFP, TDT2 version 4.0 and Reuters-21578. The Reuters dataset is composed of stories that have been tagged with the attribute TOPICS = YES and include a BODY part.

These collections differ in number of topics, topic sizes, number of dimensions and document distribution.

Conclusions

- A new methodology for discovering and describing the coherent and meaningful topics comprised in a text collection has been introduced.

- The proposed algorithm provides a novel parameter-less method for discovering the topics of the collection while attaching suitable descriptions to the discovered topics.

- The experiments carried out over the TDT2 English corpus, the AFP Spanish collection and Reuters-21578 validate our proposal and show significant improvements over existing methods in terms of the standard macro- and micro-averaged F1 measures.
