
Textual Alignment of News and Blogs

Author: Carlos A. Morales Garduno

Supervisor: Dr. Mark R. Stevenson

A report submitted in fulfilment of the requirements for the degree of MSc in Advanced Computer Science in the Department of Computer Science

September 7, 2016


Declaration

All sentences or passages quoted in this report from other people's work have been specifically acknowledged by clear cross-referencing to author, work and page. Any illustrations which are not the work of the author of this report have been used with the explicit permission of the originator and are specifically acknowledged by clear cross-referencing to author, work and page.

I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this project and the degree examination as a whole.

Name: Carlos Alberto Morales Garduno

Date: September 7, 2016

Signature:


Abstract

The number of blog posts published each day is estimated to be above 3 million, and more people are using the blogosphere as their source of information. This change has presented a challenge for traditional media, which needs to leverage the new opportunities offered by the medium. A key step for news researchers, the automated identification of links between news and blogs, would enable further analytical processing. Tracking public response, identifying new leads, following stories after print, and new forms of public engagement are among the motivations for this. Experiments were conducted using a subset of the 2005 New York Times Annotated Corpus and the TREC Blogs06 collection.

This project explores different techniques to identify the links. A baseline was defined using cosine distance. Then, a procedure was developed to extrapolate document-to-document comparisons for the exploration of topical relations. Later, an LDA implementation was explored to improve on the baseline. A further improvement is introduced by performing a second comparison with Kullback-Leibler divergence. Precision and Recall were used to compare performance. Additionally, a gold standard was created which will enable further research in this area.


Acknowledgements

I would like to thank my supervisor, Dr Mark R. Stevenson, for taking me under his guidance for this Master's dissertation. Without his consistent support and shared knowledge, it would not have been possible to complete this project.

I am also thankful to my personal tutor, Dr Robert Gaizauskas, for the feedback and advice he provided during this project and throughout the academic year.

I would also like to thank my wife Imelda Gonzalez for her continuous support and companionship on this project, especially during some critical moments.

This project was developed with the support of the Mexican National Council for Science and Technology's grant number 411174.


Contents

1 Introduction
1.1 Project Aims and Goals
1.1.1 Project Aims
1.1.2 Goals
1.2 Overview of the Report

2 Literature Survey
2.1 Topic Modelling
2.1.1 Latent Dirichlet Allocation
2.1.2 Perplexity
2.1.3 Gibbs Sampling
2.1.4 Variational Bayes
2.1.5 Topic Detection and Tracking
2.1.6 Tree-Based Topic Models
2.1.7 Symbolic Aggregate ApproXimation SAX
2.1.8 Hierarchical Relational Models for Topic Networks
2.2 Document Similarity
2.2.1 Shingling
2.2.2 Rabin Fingerprints
2.2.3 Jaccard Coefficient
2.3 Boilerplate
2.3.1 jusText
2.3.2 The GoldMiner algorithm
2.4 Summary

3 Methodology
3.1 Project Aim
3.2 The Datasets
3.2.1 The New York Times Annotated Corpus
3.2.2 TREC Blogs 06
3.3 Creating the Gold Standard
3.4 Text Preprocessing
3.5 Processing
3.5.1 Baseline - Cosine Distance
3.5.2 Word overlap
3.5.3 LDA
3.6 Summary

4 Experiments and Results
4.1 Experimental Strategy
4.1.1 Metrics for this evaluation
4.2 Experiments
4.2.1 Removing boilerplate
4.3 Results
4.3.1 Cosine distance and word overlap
4.3.2 LDA model
4.3.3 Observations on the Gold Standard
4.3.4 Topic Coherence
4.3.5 Additional observations
4.4 Summary

5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work

Appendices
A Blogs 06 full XML sample
B NYT 06 Annotated Corpus full XML sample
C Gold Standard file
D Topic Evolution During Training

List of Figures

2.1 Sample webpage showing the relevant content highlighted in green and boilerplate in red.
3.1 Sample XML content from the NYT Annotated Corpus dataset.
3.2 Sample XML content from the TREC Blogs 06 dataset.
3.3 Screenshot showing the original data split of the blogs corpus.
4.1 Perplexity during training
4.2 Visual analysis of the inferred topics
4.3 Visualization of duplicated topics.
4.4 The topics being duplicated seemed to increase as we targeted a higher number of them during training.

List of Tables

1.1 Linking Example between Blog: BLOG06-20051207-043-0003056710, and News Article: 1649918.xml
3.1 Precision and Recall pseudo-contingency table.
3.2 Example of the topic linking per document, as specified in our gold standard
4.1 Number of blog posts to process, with and without filtering boilerplate
4.2 Number of tokens to process in two documents, with and without filtering boilerplate
4.3 Baseline and first features compared.
4.4 Top 10 frequent words in the first 15 topics of the 25-topic LDA model
4.5 Performance for the LDA model trained on 5 topics
4.6 Performance for the LDA model trained on 8 topics
4.7 Performance for the LDA model trained on 10 topics
4.8 General precision and recall achieved by each LDA model.
1 Topic evolution through training iterations

1 Introduction

Today, Web 2.0 applications are a common tool for sharing content online. Every second, individuals and organised institutions alike use them to contribute to a broad range of topics. Obeying the guidelines governing each platform, their formats range from short sentences or multimedia, as is the case of micro-blogs, to longer texts interleaved with miscellaneous types of multimedia, as the most permissive format of blogs allows.

Since they first appeared back in 1999, blogs have shown steady growth in both content and user base. The exact size of the blogosphere is difficult to assess; according to the statistics at wordpress.com, more than 60 million blogs are published each month; a similar estimate by technorati.com proposes that around 4 million are created each day.

With topics ranging from the purely personal to covering breaking world news, blogs are attractive from a data mining perspective. To properly exploit this potential, traditional NLP tools need to be adapted to cope with the diversity of writing styles common in blogs. Otherwise, it would not be feasible to make sense of all the information available.

This kind of textual analysis has interested a group of researchers at Thomson Reuters. They would like to gain additional insight into how their traditionally generated content relates to the blogosphere. As Newman suggests, the way traditional and social media interact has changed. By linking the blog posts related to news articles, not only can they assess public reaction; additionally, journalists can expand their stories or look to obtain new leads [Newman, 2009].


To perform this initial task, a custom framework is required which will realise the needed comparisons and generate a user-interpretable output with the result.

After the described initial approach is completed, the gained insight would enable an informed decision to be made on how to proceed with deeper analysis as required.

To ground this project, our implementation will focus on two datasets. Firstly, from the New York Times Annotated Corpus [Sandhaus, 2008], a subset corresponding to the year 2005 will be used to train our model. Then, aiming to identify which blog posts link with those news articles, the TREC Blogs 06 [Macdonald and Ounis, 2006] corpus will be processed.

Key factors when selecting these datasets were the timeframe coverage, textual diversity, availability, and ease of access. We were aiming for the highest number of news articles and blog posts published during a year. Few other blog datasets fulfilled these requirements: the TREC Blogs 08 and the ICWSM-2009 Spinn3r dataset both had a premium price tag. Another one, made available by the Common Crawl[1] organisation, required extra analysis and processing to isolate the blogs they collect. With respect to the news dataset, the Reuters Corpora[2] seemed promising but was not available at the time.

[1] http://commoncrawl.org/
[2] http://trec.nist.gov/data/reuters/reuters.html

1.1 Project Aims and Goals

1.1.1 Project Aims

The core aim of the project is to generate a collection of links between news stories published online and blog posts covering the same topic. This collection would then allow subsequent analytical processing. Topic coverage and trend changes are among the metrics to explore.

1.1.2 Goals

After reviewing literature related to our work, we envisaged the following approach to realise the proposed goals.

1. Gold Standard: Currently, no dataset of this kind is available. In order to perform the comparisons we propose, one must be manually crafted. Given the datasets' size, their mutual differences, and the effort required to generate a quality product, this creation is considered not only a fundamental step for this project, but a major contribution. Such a resource will enable future comparisons of competing models without the burden of creating a similar dataset.

2. Topic Modelling: There is the need to consider that one or more news articles can be covering the same topic and that a single article can span two or more topics. Also, the timeframe for event coverage can span from one day to a couple of weeks. Relevant articles for a topic can be identified and linked to blog posts covering the same topic.

3. Topic Linking: Lastly, of special interest for this work is the exploration of means leading to the successful detection of links between news and blogs. Given the inherent differences between our datasets, this promises to be a challenging task. To motivate this point, Table 1.1 shows a blog post and a news article covering the same topic. The reader should at least notice the syntactic and grammatical differences present.

1.2 Overview of the Report

In order to visualise the structure of this report, we now provide a high-level description of each chapter.

Chapter 2 presents a review of previous works and an introduction to the techniques used in the project.

Chapter 3 discusses the methodology followed; it also introduces our technical and risk analysis.

Chapter 4 contains our experiments and observed results.

Chapter 5 gives way to our conclusion and future work.


Blog Post:

CNN Clarifies Remarks by News Exec. Eason Jordan

Two weeks ago at the World Economic Forum in Switzerland, CNN's chief news executive, Eason Jordan, raised eyebrows when he suggested that some of the 63 journalists who have been killed in Iraq had been targeted by US troops. Although Jordan quickly tempered the remarks, a controversy has been building over them on the web. CNN has responded, issuing a statement clarifying Jordan's comments. It reads as follows: "While the majority of the 63 journalists killed in Iraq have been killed by insurgents, the Pentagon has acknowledged that the US military on occasion has killed people who turned out to be journalists," the CNN statement said. "Mr. Jordan emphatically does not believe that the US military intended to kill journalists and now believes that the US military mistakenly identified the journalists as Iraqi civilians."

# posted by **** **** @ 2:23:48 PM

News Article:

CNN Executive Resigns Post Over Remarks

Eason Jordan, a senior executive at CNN who was responsible for coordinating the cable network's Iraq coverage, resigned abruptly last night, citing a journalistic tempest he touched off during a panel discussion at the World Economic Forum in Davos, Switzerland, late last month, in which he appeared to suggest that United States troops had deliberately aimed at journalists, killing some. Though no transcript of Mr. Jordan's remarks at Davos on Jan. 27 has been released, the panel's moderator, David Gergen, editor at large of U.S. News & World Report, said in an interview last night that Mr. Jordan had initially spoken of soldiers, "on both sides," who he believed had been "targeting" some of the more than four dozen journalists killed in Iraq.

Table 1.1: Linking Example between Blog: BLOG06-20051207-043-0003056710, and News Article: 1649918.xml

2 Literature Survey

We present an introduction to core topics for the project with the intention to familiarise the readers with the field and ease their way through this document.

2.1 Topic Modelling

As Blei explains in his probabilistic topic models review, topic modelling algorithms are statistical methods capable of analysing a text corpus to identify what themes run through it, how they connect, and how they change over time [Blei, 2012]. It is important to mention that, as unsupervised models, they infer those hidden parameters on their own with no need for pre-classified or pre-labelled texts.

The simplest of these models, Latent Dirichlet Allocation, was introduced in 2003 and is the focus of our work.

2.1.1 Latent Dirichlet Allocation

Classified as a generative probabilistic topic model, LDA assumes that data is generated from a joint probability distribution of observed and hidden random variables [Blei, 2012], considering the topics and their distributions as the hidden variables, and the words of the documents as the observed variables.

To formally describe this model, we present its fundamental formula in equation 2.1, as shown in Blei's review, where the topics are represented by β_{1:K}, the topic proportions for the dth document are θ_d, the topic assignments for the dth document are z_d, and the observed words for document d are w_d.

\[
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \tag{2.1}
\]

As explained by Blei, this model has a number of dependencies. Namely, the topic assignments z_{d,n} depend on the per-document topic proportions θ_d, and the observed words w_{d,n} depend both on the topic assignment z_{d,n} and on the available topics β_{1:K} [Blei, 2012].

An intuition of the concept underpinning this model rests on the assumption that each topic has an inherently strong bond with a previously defined collection of words. These words are then sampled to generate the documents. On this basis, statistical analysis of the textual content of the documents in a corpus allows inferring its topics.

Two common approximations to said inference are categorised as sampling approaches or optimisation approaches. Both are described later in this document.

In addition to the implementations and extensions of LDA covered by Blei's review [Blei, 2012], we briefly introduce six recent publications involving this model. Firstly, [Han et al., 2013] present an extension of LDA used to rank web forum posts. Secondly, [Qi et al., 2015] describe an LDA version able to handle briefly appearing topics. Thirdly, [Vorontsov and Potapenko, 2014] added linguistic assumptions to the model. Fourthly, [Madani et al., 2015] implemented real-time detection of trending topics on Twitter. Fifthly, [Tristan et al., 2015] propose an alternative training method using Graphics Processing Units. Finally, [Hu et al., 2013] propose a method that allows non-experts in machine learning to provide feedback to the models generated by LDA.

Apart from being covered as part of the initial exercise of familiarising ourselves with topic models, these papers are mentioned here to attest to the prevalent popularity of this model within the research community.

A highly relevant issue for our project and a key principle of LDA, the previously mentioned a priori definition of topic distributions, can turn into a difficulty when trying to identify these distributions with no prior knowledge of the topics. An active area of research has been addressing this problem.


As explained by Battisti et al., defining the correct number of topics is not a trivial task. There is a tradeoff between setting a number large enough to cover all the topics present in the corpus, while keeping it as low as possible to match the number of topics identified by domain experts [Battisti et al., 2015]. To aid this task, a measure has been introduced under the name of perplexity. This process is then approached by trying to approximate the adequate number via a variant of Expectation Maximisation.

2.1.2 Perplexity

Defined as the reciprocal geometric mean of the likelihood of a test corpus given the model, it measures the ability of a model to generalise to unseen data [Madani et al., 2015]. In other words, it gives a measure of how well the model describes or represents data it has not seen before. Equation 2.2 formalises this concept, where W is the test corpus and n^{(jd)} is the number of occurrences of word j in document d.

\[
PP(W) = \exp\left\{ \frac{-\log p(W)}{\sum_{d=1}^{D} \sum_{j=1}^{V} n^{(jd)}} \right\} \tag{2.2}
\]

When evaluating LDA models, a lower perplexity score means the model better represents the words in a test document given the inferred topics.
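As an illustration of how this measure is obtained in practice, the following is a minimal sketch using gensim, the package used later in this project. The toy corpus and variable names are our own, not part of the original experiments; per gensim's documentation, log_perplexity returns a per-word likelihood bound, with perplexity given by 2^(-bound).

    from gensim import corpora, models

    # Toy corpora purely for illustration; the project trained on NYT articles.
    train_texts = [["economy", "market", "stocks"], ["election", "vote", "senate"]]
    test_texts = [["market", "economy", "vote"]]

    dictionary = corpora.Dictionary(train_texts)
    train_corpus = [dictionary.doc2bow(t) for t in train_texts]
    test_corpus = [dictionary.doc2bow(t) for t in test_texts]

    lda = models.LdaModel(train_corpus, id2word=dictionary, num_topics=2)

    # Per-word likelihood bound on the held-out corpus; lower perplexity is better.
    bound = lda.log_perplexity(test_corpus)
    print("held-out perplexity:", 2 ** (-bound))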

With respect to the inference task, as previously stated, common approaches in this area are categorised as sampling or optimisation. Based on their relevance for our project, we describe two instances, one from each class.

2.1.3 Gibbs Sampling

A widely disseminated sampling technique for LDA, used to discover which topics run through the dataset, the topic distributions, and how each word corresponds to a given topic [Battisti et al., 2015].

It is common practice to perform Gibbs sampling by integrating out both the per-document distributions over topics and the topic distributions over words, and sampling only the per-word topic assignment [Battisti et al., 2015; Hu et al., 2013; Vorontsov and Potapenko, 2014; Yan et al., 2013]. This technique is formalised by the following formula,

\[
p(z_{d,n} = k \mid \mathbf{Z}, w_{d,n}) \propto (\alpha_k + n_{k|d}) \, \frac{\beta + n_{w_{d,n}|k}}{\beta V + n_{\cdot|k}}, \tag{2.3}
\]


where d is the document index and n is the token index in that document; n_{k|d} is topic k's count in document d; α_k is topic k's prior; Z are the topic assignments excluding the current token w_{d,n}; β is the prior for word w_{d,n}; n_{w_{d,n}|k} is the count of tokens with word w_{d,n} assigned to topic k; V is the vocabulary size; and n_{·|k} is the count of all tokens assigned to topic k.
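To make the update in equation (2.3) concrete, here is a minimal collapsed Gibbs sampler sketch. It is illustrative only, since the project itself uses gensim's Variational Bayes (described below); the function name, priors and toy data are ours.

    import numpy as np

    def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, iters=100):
        """Collapsed Gibbs sampling for LDA, following equation (2.3)."""
        rng = np.random.default_rng(0)
        n_kd = np.zeros((len(docs), K))   # n_{k|d}: topic counts per document
        n_wk = np.zeros((V, K))           # n_{w|k}: word counts per topic
        n_k = np.zeros(K)                 # n_{.|k}: total tokens per topic
        z = [rng.integers(K, size=len(doc)) for doc in docs]
        for d, doc in enumerate(docs):    # initialise counts
            for n, w in enumerate(doc):
                k = z[d][n]
                n_kd[d, k] += 1; n_wk[w, k] += 1; n_k[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for n, w in enumerate(doc):
                    k = z[d][n]           # remove the current token's counts
                    n_kd[d, k] -= 1; n_wk[w, k] -= 1; n_k[k] -= 1
                    # equation (2.3), evaluated for all k at once
                    p = (alpha + n_kd[d]) * (beta + n_wk[w]) / (beta * V + n_k)
                    k = rng.choice(K, p=p / p.sum())
                    z[d][n] = k
                    n_kd[d, k] += 1; n_wk[w, k] += 1; n_k[k] += 1
        return n_kd, n_wk

    # Toy run: vocabulary of 6 word ids, 2 topics.
    docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 1]]
    theta_counts, phi_counts = gibbs_lda(docs, K=2, V=6)
    print(theta_counts)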

Given that optimisation methods have generally been found to be faster than, and as accurate as, Gibbs sampling at inference when dealing with large datasets, they are favoured over the sampling methods.

2.1.4 Variational Bayes

Methods in this class aim at optimising a simplified parametric distribution to make it closer, divergence-wise, to the posterior. An online stochastic version of Variational Bayes is used in this project. Although a core strength of this method is its ability to handle large volumes of data with no need for local storage, we leverage only its convergence speed [Hoffman et al., 2010].

2.1.5 Topic Detection and Tracking

A programme created in 1999 by the National Institute of Standards and Technology (NIST) with the aim of developing technologies to search, organise and structure multilingual news materials. Its introduction defined two fundamental ideas in the area: the continuous processing of data as it is being collected, and the concept of a topic. In its first versions, it defined five research tasks: Topic Tracking, Link Detection, Topic Detection, First Story Detection and Story Segmentation [Fiscus and Doddington, 2002].

It is worth mentioning that several of the NLP technologies available nowadays were motivated by this programme.

2.1.6 Tree-Based Topic Models

In their work, Hu et al. mention a weakness of LDA: using a symmetric prior for all words ignores the relations between them. To overcome this limitation, they propose tree-based topic models, arguing that trees are a natural formalism for encoding lexical information [Hu et al., 2013].


They replace the multinomial distributions with tree-structured equivalents for each topic, associating each leaf node with a word. To encode correlations in a tree, the prior for each word needs to be known. Starting from the root node, every correlated word having the same prior is linked to it. Then, for each pair of linked words that are correlated, their previous links are removed and they are connected together under a new parent node; this new joint node is linked to the root, thus merging the links for correlated words.

Later, to generate each topic distribution, one needs to traverse the tree: starting from the root node, take a child node; if it is not a leaf node, take a child again, and repeat until reaching a leaf node. The described process is equivalent to sampling once from a multinomial distribution.

2.1.7 Symbolic Aggregate ApproXimation SAX

An algorithm for event discovery based on transforming word temporal series into strings of symbols [Stilo and Velardi, 2016]. It works in conjunction with regular expressions and a clustering algorithm to detect events in social media streams.

Although this work does not deal with topic identification, the authors justify the comparison of their model against a variation of LDA because the latter is foundational for competing models in topic identification.

2.1.8 Hierarchical Relational Models for Topic Networks

This work proposes to use techniques from Network Analysis, considering each document a node on the network and identifying relations or links between documents based only on each document's attributes [Wang et al., 2015]. Said attributes could be the document's author, its collaborators, and so on. If a trained model is presented with a node and its links, it will predict a distribution of node attributes.

An improvement over competing network-based models, this one handles both the structure of the network and the attributes of its nodes at the same time.

As in traditional LDA, this model considers documents as being generated from topic distributions, while the links between documents are represented as binary variables and are treated as dependent on the topics that generated each document.


2.2 Document Similarity

Nowadays it is estimated that as much as 40% of the web pages on the internet are duplicates [Manning et al., 2008]. There is a variety of reasons for that, from simple redundancy policies to copycat websites trying to trick users. This causes an increased need for resources like storage, bandwidth, hardware maintenance and so on. For that reason, it has become a keen topic for the search and storage research community. Many methods have emerged trying to address the different issues involved; still, as mentioned by Manning [2008], this phenomenon represents an open problem which requires addressing.

It is appropriate now to present the Vector Space Model, a key concept when dealing with document representations and similarities.

Vector Space Model. Each document is represented as a vector in an n-dimensional space. To do so, a technique known as Bag-of-Words is used to hold the term frequency count of each document's elements. To perform a query against this space, the query is converted to its vector space model representation. This conversion is equivalent to placing the query in its corresponding location in the vector space. Then, the vectors located closest to the query correspond to the most similar documents [Salton et al., 1975].

There are several options to measure this similarity: the inner product of the vectors is one of them, or a measure of the angle between them can be computed. We opted for the cosine distance.

Cosine distance. A measure of similarity between two vectors, obtained by computing the cosine of the angle between them. For non-negative term vectors it is bounded between 0 and 1, and it is defined by the following equation [Singhal, 2001],

\[
\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\, \sqrt{\sum_{i=1}^{n} B_i^2}}
\]

When applied to document similarity, it yields a value of 1 for documents with exactly the same content, a value of 0 for documents not sharing any content, and a number between 0 and 1 for other cases.
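As a minimal illustration over toy term-frequency vectors (not project data), the measure can be computed as follows:

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between two term-frequency vectors."""
        a, b = np.asarray(a, float), np.asarray(b, float)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Two toy bag-of-words vectors over the same vocabulary.
    print(cosine_similarity([1, 2, 0, 1], [2, 1, 1, 0]))  # ~0.67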

Among the methods to detect duplicated websites, a naive one is to process the page's characters in order to generate a brief representation of it, its fingerprint. This shorter version of each document is used for comparison purposes. When another website's fingerprint matches, their whole content is compared at the character level. If there is an exact match, they are flagged as duplicates and only one of them is preserved. Simple as this method might seem, it is well established and is core to several others.

Another approach to detecting duplicates, as expressed by Zhang, generates an abstract for each document or website by using two strings plus an MD5 hash for comparison [Zhang and Zhang, 2012]. To form the abstract, it uses the syntax-free word-count representation of each document, known as bag-of-words. The first string is generated by concatenating the top representative words by their counts. The second string is generated by concatenating the top representative words again, although sorted in alphabetical order. These two strings are then paired with the MD5 hash of the document.

A weakness of these and similar approaches appears when dealing not with exact duplicates but with highly similar documents presenting just slight variations in their content. To exemplify this, consider the case of two instances of the same document which differ only in the date or time of publication. Both instances would produce distinct MD5 hashes or fingerprints, while in fact their meaningful content should be considered duplicated.

2.2.1 Shingling

An alternative solution to deal with this type of close similarity was published by Broder as a technique known as shingling [Broder, 2000]. He proposes that each document be defined by its k-shingles, or in other words, by the set of consecutive sequences of k terms taken from the document. To help the reader get acquainted with this technique, an example taken from his work is reproduced here.

Consider a very succinct document containing exactly the words (a, rose, is, a, rose, is, a, rose). Its corresponding 4-shingling would be {(a, rose, is, a), (rose, is, a, rose), (is, a, rose, is)}.

Once the shingles are computed, the documents can be compared by computing their Jaccard coefficient, defined by equation 2.4; if their score goes above an established threshold, they can be flagged as duplicates.

\[
j(A,B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}. \tag{2.4}
\]
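A minimal sketch of shingling plus the Jaccard comparison, reproducing Broder's example above (illustrative code, not the project's implementation):

    def shingles(tokens, k=4):
        """Set of k-shingles: consecutive k-term sequences of a token list."""
        return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    doc = "a rose is a rose is a rose".split()
    print(shingles(doc))                          # the three 4-shingles above
    print(jaccard(shingles(doc), shingles(doc)))  # identical documents -> 1.0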


In terms of nomenclature, when dealing with alphabet letters, shingles are also known as q-grams, while w-shingling is the accepted way to refer to the association of a document with its set of shingles of size w.

As Broder explains, this technique can also be used for categorisation tasks. A limitation of this approach is the requirement for each cluster to be clearly defined and separable from the rest. A document would then be categorised as belonging to the cluster with which it shares the most shingles. For our approach, this seems to be a less desirable method given the writing style differences between news and blog posts, and the intuition of non-separability for some topics.

2.2.2 Rabin Fingerprints

Initially introduced by Rabin in 1981 as an efficient fingerprinting technique [Rabin, 1981], this method provides core functionality for the shingling algorithm as proposed by Broder [Broder, 2000].

Rabin proposed using randomly chosen irreducible polynomials to generate file fingerprints to identify unauthorised local file modifications. According to this method, when a file is created, its fingerprint F is computed and safely stored by a guardian. Another fingerprint F1 is computed whenever file integrity needs to be assessed. Then, by comparing both values, it is simple to know if a file was altered without proper authorisation: there would be a mismatch in the fingerprints, F ≠ F1.
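The following toy sketch captures the idea of such integrity checks with a simple polynomial rolling hash modulo a prime. Real Rabin fingerprints use randomly chosen irreducible polynomials over GF(2); this code deliberately simplifies that and is not the original scheme.

    # Treat the bytes as coefficients of a polynomial evaluated at P, mod a prime.
    P, MOD = 257, (1 << 61) - 1

    def fingerprint(data: bytes) -> int:
        f = 0
        for byte in data:
            f = (f * P + byte) % MOD
        return f

    original = fingerprint(b"file contents")
    tampered = fingerprint(b"file content!")
    print(original != tampered)  # a mismatch reveals the modification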

2.2.3 Jaccard Coefficient

Commonly used to measure the similarity between two sets, this coefficient is the Jaccard index previously introduced. As shown in equation 2.4, it is the result of dividing the number of features common to both sets by the total number of features in either set. The Jaccard distance, the complement of the Jaccard coefficient, can in turn be computed as in equation 2.5 [Suphakit Niwattanakul and Wanapu, 2013].

\[
j_d(A,B) = 1 - j(A,B) = \frac{|S_A \cup S_B| - |S_A \cap S_B|}{|S_A \cup S_B|}. \tag{2.5}
\]

Alongside document similarity detection, another key concept when dealing with web content is boilerplate.


2.3 Boilerplate

Common content found on websites and blogs is made up of diverse functional parts accompanying the central content. Headers, footers, programmatic scripts, and navigational aids are commonly referred to as boilerplate. They are characterised by seemingly uniform content across the website they belong to. Although useful to ease human access, their prevalence makes them irrelevant for text processing tasks. Some approaches to dealing with them rely on statistical measures to characterise them on a site-by-site basis. Two models successful in this area are the GoldMiner algorithm and jusText.

2.3.1 jusText

A freely available[1] tool to remove boilerplate. It aims to preserve relevant textual content by analysing sentence length and the ratio of links and tags in a block of content[2]. It is described as "well suited for creating linguistic resources such as Web corpora."

By implementing jusText in our project, we were able to avoid processing unnecessary content. Figure 2.1 shows a representation of boilerplate in a webpage[3]. Section 4.2.1 contrasts the results of using this algorithm to remove the boilerplate from one of the 13,227 individual blog sites in this project.
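Usage of the library is straightforward; the snippet below is a minimal sketch with a made-up HTML string (a page this small may not exercise jusText's heuristics the way a real blog page does):

    import justext

    html = b"""<html><body>
    <div>Home | Archive | Login</div>
    <p>The actual post content: several full sentences with few links, which
    jusText tends to keep because of their length and low link density.</p>
    </body></html>"""

    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    main_text = [p.text for p in paragraphs if not p.is_boilerplate]
    print(main_text)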

2.3.2 The GoldMiner algorithm

Presented as an improvement over the previously introduced jusText, its gains are due to the use of machine learning to better characterise boilerplate content [Endredy and Novak, 2013].

Although its published performance seems promising, at the time of writing this dissertation we could not find a publicly available Python implementation of GoldMiner. For this reason, we decided to use jusText.

[1] https://github.com/miso-belica/jusText
[2] http://corpus.tools/wiki/Justext/Algorithm
[3] Sample image taken from the project website at http://corpus.tools/attachment/wiki/Justext/nlp_jusText_fi.jpg


Figure 2.1: Sample webpage showing the relevant content highlighted in green and boilerplate in red.

2.4 Summary

In this chapter, we presented core concepts for our project. Topic modelling, Latent Dirichlet Allocation, Gibbs sampling, Topic Detection and Tracking, document similarity, and shingling are among the techniques underpinning this project.

We now proceed to present the methodology.

3 Methodology

In this chapter we describe our approach, first reviewing the aim of the project, then detailing each step involved.

3.1 Project Aim

As mentioned in Chapter 1, the core aim of the project is to generate a collection of links between online published news stories and blog posts covering the same topic or event.

This linking will enable further analytical processing; the basic metrics evaluated are precision, recall, and topic coverage.

To this end, after studying the publications relevant for the literature survey, we envisaged the following approach for this project:

1. Gold Standard. Given that no gold standard dataset exists, creating one was paramount. Due to its high relevance for the project, section 3.3 is dedicated to detailing how this process was performed. The resulting gold standard is available in Appendix C.

2. Inferring latent topics. To work with topic modelling techniques, we relied on the LDA implementation available in the Python package gensim [Rehurek and Sojka, 2010].


                  Relevant     Not Relevant
    Retrieved     A ∩ B        Ā ∩ B           B
    Not Retrieved A ∩ B̄        Ā ∩ B̄           B̄
                  A            Ā

Table 3.1: Precision and Recall pseudo-contingency table.

Model Setup. As previously stated, this model was trained on the NYT Annotated Corpus. For this, a subset of the dataset was used; corresponding to the months of February and March, it contained 14,748 news articles.

Number of Topics. Given that the actual number of topics is unknown to us, we decided to vary the number of target topics for each run, from 5 to 200 topics. Although we are aware of the active research being conducted on automatically identifying the number of topics, its application was judged out of scope for this project.

Hyperparameters for LDA. Alpha and Eta are related to the sparsity of the per-document topic distributions and the per-topic token distributions, respectively. Both were initialised with an unbiased prior of one over the number of target topics[1]. The inference method was online Variational Bayes. Finally, the number of iterations was in all cases set to ten. A sketch of this setup appears after this list.

3. Evaluation. To test our model's performance, we defined a baseline using cosine similarity, then experimented with how variations in different parameters affected the results. The metrics used for our exploration are Precision and Recall [van Rijsbergen, 1979]. To illustrate the concept, table 3.1 shows the pseudo-contingency table presented by van Rijsbergen [1979].

Precision. The proportion of retrieved documents that are actually relevant: what percentage of the documents retrieved and linked to topic n coincide with the corresponding list of the gold standard. It is shown in equation 3.1, referencing elements from table 3.1.

\[
\text{Precision} = \frac{|A \cap B|}{|B|} \tag{3.1}
\]

[1] More details on these parameters are available at: https://radimrehurek.com/gensim/models/ldamodel.html

Recall. The proportion of relevant documents that were retrieved: what percentage of the documents in topic n from the gold standard were correctly retrieved and linked. It is shown in equation 3.2, referencing elements from table 3.1.

\[
\text{Recall} = \frac{|A \cap B|}{|A|} \tag{3.2}
\]

4. De-duplication. Although we intended to apply the shingling algorithm with the aim of filtering highly similar blog posts prior to their linking, and thus decreasing the processing time required for the linking step, we gave preference to the exploration of topic linking.
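The sketch below shows how the setup described in step 2 can be expressed with gensim's LdaModel. The toy corpus and variable names are illustrative, not the project code; the parameters mirror the description above.

    from gensim import corpora, models

    K = 25  # one of the target topic counts explored (5 to 200)

    # `train_texts` stands in for the preprocessed NYT articles of Feb-Mar 2005.
    train_texts = [["market", "stocks", "economy"], ["senate", "vote", "election"]]
    dictionary = corpora.Dictionary(train_texts)
    corpus = [dictionary.doc2bow(t) for t in train_texts]

    lda = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=K,
        alpha=1.0 / K,      # unbiased prior of one over the number of topics
        eta=1.0 / K,
        iterations=10,      # ten iterations, as described above
        update_every=1,     # online Variational Bayes updates
    )
    print(lda.show_topics(num_topics=3))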

3.2 The Datasets

After familiarising ourselves with the characteristics of the original file formats of our datasets, we decided to take advantage of their common XML container format. To simplify our task, we implemented custom readers to extract the information we needed. For this, we leveraged the functionality of BeautifulSoup4[2] to process XML and HTML content.

We then experimented with two approaches. The first was based on generating individual feature files for each news article and blog post. This was a convenient format to allow direct access to each file, but each additional feature added about 4,000 files for the blog posts and 14,000 files for the news articles. This number of small-sized files introduced technical issues when trying to use the Iceberg HPC cluster, or other computers, for processing; network errors and I/O overhead were among the most common issues which throttled our process.

For the second approach, we opted for a file-based self-contained database, SQLite[3]. This allowed us to keep just one large file per dataset, with one record per document and a column per feature. To add a new feature, we just needed to create a new column to contain it. This was a key move that enabled us to test additional features.
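An illustrative schema in this spirit (table and column names are ours, assumed for the example):

    import sqlite3

    con = sqlite3.connect("blogs06.db")  # one self-contained file per dataset
    con.execute("""CREATE TABLE IF NOT EXISTS documents (
                       docno TEXT PRIMARY KEY,   -- one record per document
                       text  TEXT)""")

    # Each new feature becomes a new column on the same table.
    con.execute("ALTER TABLE documents ADD COLUMN lemmas TEXT")

    con.execute("INSERT INTO documents VALUES (?, ?, ?)",
                ("BLOG06-20051206-000-0000026837", "raw text ...", "lemma list ..."))
    con.commit()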

We now present the specific details of each dataset and how it was processed.

[2] https://www.crummy.com/software/BeautifulSoup/
[3] SQLite homepage: https://sqlite.org/


3.2.1 The New York Times Annotated Corpus

The original corpus spans 20 years of the historical archive of the New York Times. It includes almost all content published from January 01, 1987 to June 19, 2007, and is presented in cleanly formatted XML files enriched with metadata following version 3.3 of the News Industry Text Format (NITF) specification[4] [Sandhaus, 2008]. Our project focuses on the subset corresponding to the content generated in 2005.

Figure 3.1: Sample XML content from the NYT Annotated Corpus dataset.

For this specific corpus, a custom reader was developed to extract the parts relevant to our goals, the news article content, which was identified to be stored inside the tag '<block class="full_text">'.

Aside from the main article, other features were also extracted:

• The publication date, stored inside the tag <pubdata date.publication="" />
• The headlines, using the tag <hl1></hl1>
• Taxonomic keywords, added by the NYT editors inside the tags <classifier class="" type="descriptor">

Although these features were planned to be used for additional comparisons, time restrictions hindered this. We describe a plan to do so in section 5.2.

[4] Version 3.3 of the specification is available at: https://www.iptc.org/std/NITF/3.3/documentation/
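A condensed sketch of such a reader follows, using BeautifulSoup4 as in the project. The tag and attribute names are those described above; the function name is ours, and parsing with the "xml" parser assumes lxml is installed.

    from bs4 import BeautifulSoup

    def read_nyt_article(xml_string):
        """Extract body text, date, headline and keywords from one NITF file."""
        soup = BeautifulSoup(xml_string, "xml")
        body = soup.find("block", {"class": "full_text"})
        pubdata = soup.find("pubdata")
        return {
            "text": body.get_text(" ", strip=True) if body else "",
            "date": pubdata.get("date.publication") if pubdata else None,
            "headline": soup.hl1.get_text(strip=True) if soup.hl1 else None,
            "keywords": [c.get_text(strip=True) for c in
                         soup.find_all("classifier", {"type": "descriptor"})],
        }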

3.2.2 TREC Blogs 06

Originally created by the University of Glasgow as a test collection for a TREC task, its aim was to be a realistic snapshot of the blogosphere over a period long enough for events to be recognisable. It covers topics such as news, sports, politics, and health [Macdonald and Ounis, 2006].

Figure 3.2: Sample XML content from the TREC Blogs 06 dataset.


Its data format differs from that of the NYT Annotated Corpus in three fundamental aspects:

Figure 3.3: Screenshot showing the original data split of the blogs corpus.

1. It uses a 3-part data split to represent each blog, as shown in figure 3.3: one part stores the blog's static pages, the second the dynamic content of the blog site, and the third an RSS[5] feed for each site.

2. The content language is diverse and not identified.

3. Its content is in raw, non-normalised format. By storing the full HTML files, the creators preserved the originally crawled sites with complete embedded scripts, style sheets, spam posts and comments. For our task this represents a copious amount of boilerplate to remove from each blog post.

Additionally, we identified the following:

• Each feed, homepage and permalink was assigned a unique identifier which was kept during the whole period.
• Permalinks are identified by a DOCNO of the form BLOG06-20051206-000-0000026837.
• The documentation also states each permalink was retrieved more than once. We filtered the duplicates using the DOCNO identifier.

As a result, loading and extracting only the relevant information from each blog post, its textual content, required a more sophisticated reader than the one used for the NYT Annotated Corpus.

[5] RSS: Rich Site Summary


3.3 Creating the Gold Standard

Although we anticipated this would be a challenging task, the actual effort required to produce a representative dataset largely surpassed our expectations. This section presents two of the approaches we employed, selected for being the most heavily used versions of the process within our project. They enabled the collection of 181 news articles and 161 blog posts.

After an initial exploration of the news articles, and looking to simplify this process, we decided to focus our efforts on a list of broad topics: books, business, arts, education, health, sexual preferences, social theories, human rights, journalism, civism, sports, technology, terrorism, Valentine's Day, military, politics, the Middle East.

Initial approach.

1. First, we decided on the timeframe to analyse: the months of February and March.

2. From the NYT corpus, we copied the documents corresponding to this time period to a new location.

3. We randomly chose one of those news articles as our candidate for matching and manually analysed its content. We tried to identify the target topics covered and looked for keywords in the text portions or in its headers. We also extracted its date of publication.

4. We extracted a subset of the blogs corpus covering the week prior to the date we just defined.

5. Using POSIX terminal tools, we searched for blogs containing the identified keywords or topic-related words commonly associated with the observed topics.

6. The matching blog posts were opened in a web browser to observe their content. This was necessary to dismiss spam posts or cases where the match was found only in the comments.

7. Finally, each reference to a blog post or news article was saved in a list of documents per topic, using a text file for this.

This procedure was highly time-consuming and the number of collected samples was minimal. We found that most of the tentative links were in the form of spam comments or boilerplate content. For that reason, a number of improvements were introduced until we established a better-performing process.


    Document   LDA 5 Topics   LDA 8 Topics
    1646589    Topic 3        Topic 5
    1646894    Topic 3        Topic 5
    1647470    Topic 1        Topic 5
    1654653    Topic 1        Topic 5

    File representation:
    {5:3,8:5}
    1646589
    1646894
    {5:1,8:5}
    1647470
    1654653

Table 3.2: Example of the topic linking per document, as specified in our gold standard

Improved approach.

1. Using jusText, we extracted the non-boilerplate text from the blog posts generated in the previously selected time period.

2. Then, we generated a new feature: using an implementation of the RAKE algorithm, we looked to identify the most relevant keywords for each blog post [Rose et al., 2010].

3. From the news articles produced in the months of February and March, we chose one as a candidate. Its content was manually analysed in the previously described way.

4. Using POSIX terminal tools again, we searched for matches in the recently generated keywords.

5. The matches were individually analysed using a text reader.

6. Finally, the reference to each blog post and news article was stored in the gold standard index file.

By doing this, we streamlined our process by effectively reducing the number of non-relevant tentative links that needed evaluation. After several iterations, the number of documents in the gold standard increased considerably.

Once we finished this process, we confronted another problem: how to reconcile our identified topics with those inferred by LDA.

Given the unsupervised nature of LDA, we decided to settle for the simplest approach: to observe the most relevant words in each of the LDA-identified topics and align them with the most closely resembling gold standard topics. By doing this, we also needed to generate custom gold standards to more closely match the number of target topics in LDA. The reason for this is discussed in section 4.3.3.


The implemented solution was based on key-value pairs, where the keys relate to the number of target topics for LDA, while the values specify the corresponding topic ID for the documents. This allows a single document list to be used for the multiple scenarios needed. Table 3.2 shows an example of the topic linking per document and its representation in the file format of our solution: four documents are linked to three different topics depending on the LDA model used. The flexibility gained with this approach should be easily noticeable to the reader.

Once we had a gold standard representative of the datasets, we were able to move

forward in the process. In order to get the data ready for its processing, it is fairly

common among NLP projects to introduce a preprocessing step.

3.4 Text Preprocessing

Our initial approach was to perform the following set of tasks on our data sources using scikit-learn. However, we experienced difficulties trying to generate features like n-grams or k-skip-n-grams. For this reason, we switched to gensim; the rest of the project was implemented based on this Python package unless otherwise specified.

1. Source file reading. Each dataset file was traversed and processed according to its peculiarities. After a series of steps involving data reading, target element identification, and text extraction, we ended up with a one-line representation of each document.

2. Tokenising. Each document's content is separated into its minimal meaningful representational elements, known as tokens. For this, we used NLTK to identify the sentences and their tokens.

3. Stop-words. Terms which appear in the majority of the documents in a corpus lose relevancy compared to less frequent ones, so it is best to remove them, keeping only the most relevant. We used NLTK's English stop-word list for this.

4. Lemmatising. Removing inflectional word endings to obtain the proper form of a word helps to achieve a degree of text normalisation. For this, we used the WordNet Lemmatizer.

5. Bag-of-Words. Each document's tokens are converted to their bag-of-words representation. This is the format required for further processing.

Before continuing, an additional step should be noted: the lemmatised tokens for each document were saved to our database in order to use them in subsequent processing tasks. The same was done for the n-grams and skip-grams. A condensed sketch of this pipeline follows.
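The sketch below condenses steps 2 to 5 with NLTK and gensim, assuming the relevant NLTK resources (punkt, stopwords, wordnet) are downloaded; names and the token filter are illustrative simplifications.

    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from gensim import corpora

    stop = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(document):
        """Steps 2-4: tokenise, drop stop-words and non-words, lemmatise."""
        tokens = [w.lower() for s in sent_tokenize(document)
                  for w in word_tokenize(s)]
        return [lemmatizer.lemmatize(w) for w in tokens
                if w.isalpha() and w not in stop]

    docs = [preprocess("This blog will chart the most popular British blogs.")]
    dictionary = corpora.Dictionary(docs)
    bows = [dictionary.doc2bow(d) for d in docs]   # step 5: bag-of-words
    print(bows[0])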

Now, an example of the transformation undergone by each document in our corpus is shown here using one of the blog posts. The original text:

[”This blog will chart the most popular British blogs, based on their stat

page figures. This means that those blogs who don’t have publicly accessible

stat pages don’t make the chart. Which means Samizdata, reputedly the

most popular British blog, won’t appear, and tons of others won’t either.

All I can say to them is, if you want to be in the Top Ten, put your stats

up, or make them publicly accessible. (Try Sitemeter, for example). If you

know of any British sites that have bigger figures than this, which I’ve missed,

let me know.Mail: britishblogs-at-yahoo.com The point of this site, I should

add, is not so much to create a fuss about viewing figures, because popularity

isn’t everything, but to provide a convenient and interesting place where new

UK blog readers might start from. Note: Although I couldn’t access The

Policeman’s Blog or Pootergeek’s Sitemeter figures, these figures were listed

at Truth Laid Bear. Speaking of Truth Laid Bear, why not use the ranking-

by-links method that Truth Laid Bear uses? Two reasons: (1) That system

seems to unfairly favour those who got into the blogosphere early. In fact,

many dead sites are still ranked highly on this method, because estalished

bloggers rarely update their blogrolls. (2) I can’t be bothered.”]

After applying all the preprocessing steps previously listed, the document becomes:

[blog, chart, popular, british, blog, base, stat, figure, blog, publicly, acces-

sible, stat, chart, samizdata, reputedly, popular, british, blog, win, appear,

ton, win, stats, publicly, accessible, sitemeter, example, british, site, bigger,

figure, mail, britishblogs, yahoo, site, add, create, fuss, view, figure, popu-

larity, provide, convenient, blog, reader, start, note, access, policeman, blog,

pootergeek, sitemeter, figure, figure, list, truth, laid, bear, speak, truth, laid,

bear, ranking, link, method, truth, laid, bear, reason, unfairly, favour, blogo-


sphere, dead, site, rank, highly, method, estalished, blogger, rarely, update,

blogrolls, bother]

Finally, the bag-of-words representation gives the following sparse vector:

[(593, 1), (866, 1), (1083, 1), (1183, 1), (1533, 1), (2392, 1), (4024, 1),

(4284, 1), (5188, 1), (6209, 1), (6241, 1), (6887, 1), (7656, 1), (8526, 1),

(8637, 1), (8855, 1), (8905, 1), (9104, 1), (9199, 1), (10991, 3), (11524, 1),

(12183, 3), (12774, 6), (13211, 1), (14079, 3), (15356, 1), (16322, 3), (16434,

2), (16504, 1), (16726, 2), (16738, 2), (17137, 3), (17150, 1), (17201, 2),

(17348, 1), (17567, 1), (18157, 1), (18207, 1), (19032, 2), (19390, 1), (19497,

1), (19586, 5), (19707, 2), (19755, 1), (19939, 1), (20203, 1)]

Once our content is in this format we can continue with its processing.

3.5 Processing

Using the generated Bag-of-Words representations of our documents, we performed the

following evaluations.

3.5.1 Baseline - Cosine Distance

As previously stated, the project baseline was created using the cosine distance between

documents. To do so, we set a threshold of 50% to identify a pair of documents as linked.

Then, if a blog post is linked to more than one news article, we sort the links based on

the computed score. That effectively creates a list of links between blog posts and news

articles. Finally, we save the results of these computations to a text file.
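A minimal sketch of this baseline, assuming the dictionary and the bag-of-words lists from the preprocessing step, plus illustrative news_ids and blog_ids lists:

    from gensim.similarities import MatrixSimilarity

    # Index every news article; queries against it return cosine similarities.
    index = MatrixSimilarity(news_bows, num_features=len(dictionary))

    links = {}
    for blog_id, bow in zip(blog_ids, blog_bows):
        sims = index[bow]                      # similarity to each news article
        candidates = [(float(s), news_ids[i])
                      for i, s in enumerate(sims) if s >= 0.5]   # 50% threshold
        links[blog_id] = sorted(candidates, reverse=True)        # best first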

The process used to identify the topics using these linked pairs is detailed later in this

section. We were also interested in exploring how other features performed on the task.

3.5.2 Word overlap

With the intuition that documents covering the same topic share at least a slight ratio of

word overlap, we set a 10% threshold on the number of shared tokens for any pair of

news article and blog post to be considered linked.


Following the same intuition but aiming to add lexical features to our comparison, we

explored k-skip-n-grams. These are token constructions similar to common bigrams and

trigrams, but they add an extra skip distance between their elements. We experimented

with values of k=[0,3] and n=[2,3], setting a threshold of 5% for every variant.6

To estimate the overlap ratio we used equation 3.3:

Overlap = (2 × |common features|) / (|news article features| + |blog post features|)    (3.3)
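A sketch of this comparison, where nltk.util.skipgrams builds the k-skip-n-gram features; treating the feature collections as sets is an assumption here:

    from nltk.util import skipgrams

    def overlap(news_features, blog_features):
        # Equation 3.3: shared features relative to both documents' features
        common = set(news_features) & set(blog_features)
        return 2 * len(common) / (len(set(news_features)) + len(set(blog_features)))

    # Example: 3-skip-bigrams (n=2, k=3) over the preprocessed token lists
    news_feats = list(skipgrams(news_tokens, 2, 3))
    blog_feats = list(skipgrams(blog_tokens, 2, 3))
    linked = overlap(news_feats, blog_feats) >= 0.05   # the 5% threshold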

From document similarity to topic links. Given that the previously described fea-

tures directly compare only a pair of documents at a time, and not their topics, we

envisaged the following procedure in an effort to extrapolate this approach and explore

topic linking.

Using the saved results from each evaluated feature, we go over the list of linked

documents, reading only the paired news articles corresponding to a single blog post.

The read pairs are stored in a temporary location and ranked in descending order by

their computed score. Then, the highest-ranked pair of IDs is individually looked up in

the gold standard. If both are found under the same topic we consider them a topical

match; using an auxiliary documents-per-topics index, the pair gets assigned to the topic

ID. We then clear the temporary storage and repeat the process by fetching the next blog

post with its corresponding news articles. Otherwise, if there is no match, we read the

next news article ID from the temporary store and compare again in the same form. This

procedure is repeated until there are no pairs left in the results file to compare. At this

point, the mentioned auxiliary documents-per-topics index is compared against the gold

standard to compute the precision and recall values.
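The following sketch captures the core of this procedure; links (the score-ranked news articles per blog post) and topic_of (a lookup from document ID to gold standard topic) are illustrative names, not the project's actual data structures:

    docs_per_topic = {}   # the auxiliary documents-per-topics index

    for blog_id, ranked_news in links.items():
        for score, news_id in ranked_news:        # highest-scoring pair first
            blog_topic = topic_of.get(blog_id)
            if blog_topic is not None and blog_topic == topic_of.get(news_id):
                # Both IDs fall under the same gold standard topic: a topical match
                docs_per_topic.setdefault(blog_topic, set()).update({blog_id, news_id})
                break                              # move on to the next blog post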

3.5.3 LDA

In the case of LDA, as the model directly infers the hidden topics in the news dataset, the

approach was slightly different. First, the saved lemmatized representation of the news

articles was loaded. This needed to be prepared in the format required by the algorithm:

a custom dictionary and a tokens-per-document matrix were generated. These are used

to train the LDA model according to the parameters shown earlier.

6 We also experimented with other threshold values; those presented here yielded the best results.


We experimented with trimmed versions of the generated dictionary. The intuition

behind the trimmed version is to discard tokens present in more than 50% of the documents

and words with fewer than 5 appearances, this way keeping only the most relevant

tokens.
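A sketch of this setup with Gensim, where the trimming corresponds to filter_extremes; the num_topics and passes values below merely stand in for the parameters reported earlier:

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(news_docs)        # lemmatized token lists, loaded back
    dictionary.filter_extremes(no_below=5,    # drop words with < 5 appearances
                               no_above=0.5)  # drop tokens in > 50% of documents
    corpus = [dictionary.doc2bow(doc) for doc in news_docs]
    lda = LdaModel(corpus, num_topics=5, id2word=dictionary, passes=20)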

After the LDA model was trained, it was used to obtain the tentative topics for the blog

posts and news articles. To do so, each document was evaluated against the trained model.

To identify the topics in a document, the LDA algorithm determines which of the inferred

per-topic token distributions optimise the probability of the document under the pos-

terior given the parameters. It then returns the topic index or indexes which maximised

this probability, along with the estimated probability.
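With Gensim this evaluation amounts to querying the trained model with a document's bag-of-words vector; the minimum_probability value below is illustrative:

    bow = dictionary.doc2bow(blog_tokens)    # a preprocessed blog post
    topic_probs = lda.get_document_topics(bow, minimum_probability=0.01)
    best_topic, best_prob = max(topic_probs, key=lambda t: t[1])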

The results obtained were saved to a text file. Using this file we were able to compare

what our model identified against the gold standard dataset. Again, to measure the

performance of our models we used the precision and recall metrics.

3.6 Summary

In this chapter, we described our project approach, reviewed our main aim and detailed

each step involved. We also described our datasets, the processing performed and how we

planned to compare the results.

We now proceed to show the experiments performed and their results.


4 Experiments and Results

This chapter describes the experiments conducted, the observed results, and our findings

and interpretations. To set the tone of this chapter, we introduce a description

of what we expected to observe in our testing.

4.1 Experimental Strategy

How can we identify documents covering the same topic? We explored dif-

ferent linking strategies. Firstly, we designed a special procedure aimed at scaling the

approaches based on document similarity to identify topic links. Secondly, in the case of

LDA, its natural approach to identifying topics makes it a logical fit for this task.

What is an adequate method for topic identification? Judging by the number of

publications available, and the variety of research areas where it has been applied, LDA

is one of the most used models for this kind of task.

How to decide if a candidate link is relevant or not? From a quantitative per-

spective, we can define minimum thresholds in order to discriminate less relevant links.

From a qualitative perspective, our intuition was that a level of expertise in the area is

required to judge the validity of each link identified.

Is it possible to identify topical links if the documents of interest have a

largely different vocabulary or writing style? Our intuition was that there needs

to be a high similarity in the content of documents covering the same topic; otherwise,

the automated methods are not useful for this. We experimented with different approaches

to explore this feasibility.


                                      Full Content   No Boilerplate

Number of blog posts in time period   4,201          2,124

Table 4.1: Number of blog posts to process, with and without filtering boilerplate

The rest of the chapter is devoted to presenting the explorations performed trying to

answer these questions.

4.1.1 Metrics for this evaluation

As previously stated, we approached the evaluations using precision and recall.

Precision The proportion of retrieved documents that are actually relevant: what

percentage of the documents retrieved and linked to topic n coincide with the correspond-

ing list of the gold standard, as shown in equation 3.1, referencing elements from table 3.

Recall The proportion of relevant documents that were retrieved: what percentage of

the documents in topic n from the gold standard were correctly retrieved and linked, as

shown in equation 3.2, referencing elements from table 3.
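For a single topic, with retrieved and relevant as sets of document IDs, both metrics reduce to the following sketch, mirroring equations 3.1 and 3.2:

    def precision_recall(retrieved, relevant):
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        return precision, recall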

4.2 Experiments

We now introduce the experiments conducted.

4.2.1 Removing boilerplate

By removing the boilerplate we effectively reduced the amount of non-relevant content

which was processed. We analyse the impact this filtering has on the performance of our

approaches. Using Tables 4.1 and 4.2 we present an initial quantitative comparison

of the impact this approach has on our linking efforts.

We also reproduce the content classification of a blog post with each line of its content

labelled as BP for boilerplate and TX for text.

BP—smaur (tangledwood) wrote,@ 2005-02-15 01:38:00

BP—Current mood:

BP—blown away


Blog post ID                     Full Content   No Boilerplate

BLOG06-20051206-048-0010769450   506            336

BLOG06-20051206-048-0010794908   153            22

Table 4.2: Number of tokens to process in two documents, with and without filtering boilerplate

BP—Current music:

BP—Motion City Soundtrack - the Future Freaks Me Out

BP—who is john galt?

BP—Oh my god.

BP—Atlas Shrugged is just ...

BP—Read it. Now.

BP—(Post a new comment)

BP—ciarcimrene2005-02-15 06:55 am UTC(link)

BP—It really is a mind blowing book, isn’t it.

BP—I read it a year or two ago, in one night. Ayn Rand was quite a person.

BP—Haven’t heard from you in a while, hope you’re doing good and writing lots! :)

BP—(Reply to this) (Thread)

BP—tangledwood2005-02-15 12:13 pm UTC(link)

TX—It took me a week to read, mostly due to school and homework and stupid things

like that. But – wow, it was one heckuva book. I haven’t read any of her other books

(although I’m starting Anthem some time soon). But she’s a genius.

TX—Also: I haven’t been online a lot, due to copious amounts of schoolwork (semester

one just finished). And I need to get cracking on the writing thing.

BP—(Reply to this) (Parent) (Thread)

BP—ciarcimrene2005-02-15 01:49 pm UTC(link)

BP—Yes, you do need to! :P

BP—(Reply to this) (Parent)

BP—Log in now.

BP—(Create account, or use OpenID)

BP—forget your login?

BP—remember me

BP—Search: Category:

BP—Create an Account

BP—Update Your Journal

BP—Gift Shop


BP—Paid Accounts

BP—General Info

BP—Site News

BP—Paid Accounts

BP—Ask a Question

BP—Lost Password

BP—Site Map

BP—Browse Options

BP—Contact Info

BP—Terms of Service

BP—Privacy Policy

BP—Legal Information

BP—Site Map

BP—Browse Options

Now, to compare the impact this filtering has on the textual content, we present the result

of preprocessing the full textual content.

[’tangledwood’, ’john’, ’galt’, ’smaur’, ’tangledwood’, ’write’, ’2005’, ’current’, ’mood’,

’blow’, ’current’, ’music’, ’motion’, ’city’, ’soundtrack’, ’future’, ’freak’, ’john’, ’galt’,

’god’, ’atlas’, ’shrugged’, ’...’, ’wow’, ’read’, ’post’, ’comment’, ’ciarcimrene’, ’2005’, ’utc’,

’link’, ’mind’, ’blow’, ’book’, ’read’, ’night’, ’ayn’, ’rand’, ’person’, ’heard’, ’hope’, ’write’,

’lot’, ’reply’, ’thread’, ’tangledwood’, ’2005’, ’utc’, ’link’, ’week’, ’read’, ’school’, ’home-

work’, ’stupid’, ’wow’, ’heckuva’, ’book’, ’read’, ’book’, ’start’, ’anthem’, ’time’, ’genius’,

’online’, ’lot’, ’copious’, ’schoolwork’, ’semester’, ’finish’, ’crack’, ’write’, ’reply’, ’par-

ent’, ’thread’, ’ciarcimrene’, ’2005’, ’utc’, ’link’, ’reply’, ’parent’, ’log’, ’create’, ’account’,

’openid’, ’username’, ’password’, ’forget’, ’login’, ’remember’, ’search’, ’category’, ’site’,

’username’, ’site’, ’username’, ’mail’, ’region’, ’aol’, ’icq’, ’yahoo’, ’msn’, ’username’, ’jab-

ber’, ’navigate’, ’login’, ’create’, ’account’, ’update’, ’journal’, ’english’, ’espa’, ’deutsch’,

’’, ’search’, ’random’, ’region’, ’advanced’, ’school’, ’gift’, ’shop’, ’gift’, ’merchandise’,

’paid’, ’account’, ’add’, ’ons’, ’info’, ’press’, ’download’, ’site’, ’news’, ’paid’, ’account’,

’help’, ’question’, ’lost’, ’password’, ’faq’, ’site’, ’map’, ’browse’, ’option’, ’contact’, ’info’,

’term’, ’service’, ’privacy’, ’policy’, ’legal’, ’site’, ’map’, ’browse’, ’option’]

In comparison, the boilerplate-free content after preprocessing becomes:


Feature           Precision   Recall

Cosine distance   19%         19%

Word overlap      9%          10%

k-skip-n-grams    3%          5%

Table 4.3: Baseline and first features compared.

[’week’, ’read’, ’school’, ’homework’, ’stupid’, ’wow’, ’heckuva’, ’book’, ’read’, ’book’,

’start’, ’anthem’, ’time’, ’genius’, ’online’, ’lot’, ’copious’, ’schoolwork’, ’semester’, ’finish’,

’crack’, ’write’].

Upon initial observation, we argue removing boilerplate has a large effect on the

data. Tables 4.4, 4.5, 4.6, and 4.7 compare the difference in performance that removing

boilerplate had on LDA.

4.3 Results

4.3.1 Cosine distance and word overlap

Our baseline was defined by the cosine distance measure. We were surprised

by the low number of matches found by the k-skip-n-grams and higher-order n-grams;

however, this seems logical given the differences in grammar and writing style observed

between the blog posts and the news articles. The obtained performance is shown in Table 4.3.

4.3.2 LDA model

For LDA we tested the following scenarios:

- Using only the values computed by the Gensim implementation of LDA, taking the

highest ranked topic as the correct one, regardless of its value.

- Setting a threshold of 30% to keep the number of documents with multiple links low.

- Setting a threshold of 15% as the minimum value to consider multiple links. Then,

using Kullback-Leibler divergence as a second metric, the linkings are evaluated again;

if the highest scoring value is different from the original, both are kept. This was the

best performance achieved and is the one shown in the different tables comparing LDA.

As is known, Kullback-Leibler divergence computes the cost of encoding with a

distribution P while the real distribution is Q. In our case, that would be the cost of

using distribution P to generate a document instead of using distribution Q, the

"real" one.
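A sketch of this divergence over two topic distributions, smoothed so that zero-probability topics do not produce infinities (the smoothing constant is an assumption):

    import numpy as np

    def kl_divergence(p, q, eps=1e-12):
        # D(P || Q) = sum_i p_i * log(p_i / q_i)
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))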


Figure 4.1 shows how the perplexity evolved during training. The observed evolution

could be explained as starting with less relevant, or more mixed, topics which, through

the iterations, become more specialised or meaningful.

Figure 4.1: Perplexity during training

Appendix D shows how the most frequent words in a topic change over training

iterations.

4.3.3 Observations on the Gold Standard

Our version of the gold standard was targeted at 5 topics. As we progressed to other

numbers of target topics, the precision and recall values dropped, so we needed to create

different versions of the gold standard, each focused on matching the number of target

topics. This represented more work while also increasing the chance of errors.

For this reason it would be wise to involve more people to improve the quality of the

gold standard.

4.3.4 Topic Coherence

A naive approach to analysing topic coherence was performed by simple visualisation

of the topics, using LDAvis [Carson Sievert, 2014]. An observed issue

was the seeming duplication of topics as we increased the number of target topics;

Figures 4.3 and 4.4 show this phenomenon. This is one of the common errors described by

[Boyd-Graber et al., 2014].

Another issue observed in some of the intermediate trainings performed was what

[Hu et al., 2014] describe as "bad" topics, which do not make sense from the users'

perspective: "These bad topics can confuse two or more themes into one topic; two differ-

ent topics can be (near) duplicates; and some topics make no sense at all."


Topic 1 Topic 2 Topic 3 Topic 4 Topic 5

city 0.021 march 0.012 company 0.021 tax 0.015 000 0.039

percent 0.010 2005 0.011 executive 0.012 security 0.014 bedroom 0.019

york 0.009 family 0.010 business 0.009 budget 0.012 house 0.016

mayor 0.007 love 0.009 chief 0.007 social 0.011 building 0.016

plan 0.006 wife 0.009 news 0.006 bush 0.011 market 0.016

stadium 0.006 york 0.009 medium 0.005 plan 0.009 bathroom 0.015

project 0.005 service 0.008 network 0.005 cut 0.009 tax 0.014

bloomberg 0.005 beloved 0.007 president 0.005 program 0.008 list 0.013

price 0.005 friend 0.007 television 0.005 benefit 0.008 broker 0.013

official 0.004 father 0.006 time 0.004 billion 0.007 week 0.012

Topic 6 Topic 7 Topic 8 Topic 9 Topic 10

police 0.013 art 0.013 american 0.010 court 0.027 jackson 0.015

child 0.010 museum 0.012 war 0.010 judge 0.023 boy 0.011

hospital 0.008 artist 0.007 military 0.008 law 0.014 police 0.007

medical 0.008 book 0.007 army 0.007 lawyer 0.013 day 0.006

family 0.007 design 0.005 official 0.007 federal 0.010 family 0.006

officer 0.006 york 0.004 report 0.007 justice 0.008 time 0.005

people 0.006 collection 0.004 iraq 0.007 schiavo 0.007 sign 0.005

doctor 0.006 time 0.004 intelligence 0.006 trial 0.006 woman 0.005

health 0.005 gallery 0.003 soldier 0.005 prosecutor 0.006 mother 0.004

care 0.005 painting 0.003 united 0.005 legal 0.006 rhp 0.004

Topic 11 Topic 12 Topic 13 Topic 14 Topic 15

government 0.012 win 0.017 president 0.011 church 0.027 airline 0.019

official 0.010 tournament 0.013 bush 0.011 book 0.014 flight 0.013

american 0.010 play 0.011 republican 0.009 god 0.008 plane 0.009

iraq 0.010 race 0.010 political 0.007 religious 0.007 travel 0.009

iraqi 0.009 lead 0.009 party 0.007 life 0.007 passenger 0.008

country 0.008 team 0.009 democrat 0.007 christian 0.006 airport 0.007

election 0.007 time 0.007 house 0.005 oil 0.006 ticket 0.007

united 0.007 victory 0.006 issue 0.005 religion 0.005 air 0.007

minister 0.006 final 0.006 leader 0.005 people 0.005 fly 0.007

shiite 0.005 seed 0.006 democratic 0.005 write 0.004 seat 0.006

Table 4.4: Top 10 frequent words in the first 15 topics of the 25 topics LDA model


                   Full Text                      Non Boilerplate

Topic   Gold Std   Ret   Rel   Prec   Rec        Ret   Rel   Prec   Rec

1       16         33    0     0      0          44    0     0      0

2       8          87    0     0      0          86    0     0      0

3       9          150   2     1.3    22.2       134   2     1.4    22.2

4       220        170   74    43.5   33.6       158   63    39.8   28.6

5       51         7     2     28.5   3.9        7     2     28.5   3.9

Table 4.5: Performance for the LDA model trained on 5 topics

                   Full Text                      Non Boilerplate

Topic   Gold Std   Ret   Rel   Prec   Rec        Ret   Rel   Prec   Rec

1       -

2       18         171   4     2.3    22.2       157   3     1.9    16.6

3       35         49    1     2.0    2.8        43    1     2.3    2.8

4       168        42    8     19.0   4.7        49    14    28.5   8.3

5       -

6       70         32    9     28.1   12.8       37    9     24.3   12.8

7       29         166   3     1.8    10.3       156   2     1.2    6.8

8       19         39    6     15.3   31.5       46    8     17.3   42.1

Table 4.6: Performance for the LDA model trained on 8 topics

                   Full Text                      Non Boilerplate

Topic   Gold Std   Ret   Rel   Prec   Rec        Ret   Rel   Prec   Rec

1       9          9     0     0      0          14    0     0      0

2       9          34    0     0      0          37    0     0      0

3       110        40    8     20.0   7.2        40    10    25     9.0

4       36         126   6     4.7    16.6       124   8     6.4    22.2

5       13         22    0     0      0          24    0     0      0

6       25         63    1     1.5    4.0        60    0     0      0

7       -

8       28         80    7     8.7    25.0       76    7     9.2    25

9       25         55    1     1.8    4          58    1     1.7    4

10      55         94    15    15     27.2       77    6     7.7    10.9

Table 4.7: Performance for the LDA model trained on 10 topics


         Full Text              Non Boilerplate

Topics   Precision   Recall     Precision   Recall

5        17.4%       25.6%      15.6%       22.0%

8        6.2%        9.1%       6.5%        10.9%

10       7.2%        12.2%      6.2%        10.3%

Table 4.8: General precision and recall achieved by each LDA model.

Figure 4.2: Visual analysis of the inferred topics

4.3.5 Additional observations

When using the multithreaded implementation of the LDA model to attempt training

on the full news corpus, which activates batch training using Gibbs sampling, we

constantly ran into memory problems: there was an abnormally high RAM demand of

over 45 GB. Those issues forced us to abandon this type of training.

4.4 Summary

In this chapter, we presented our experimental design and the results accomplished. Per-

formance metrics and observations were also shown.


Figure 4.3: Visualization of duplicated topics.

Now we proceed to present our conclusion and future work.


Figure 4.4: The number of duplicated topics seemed to increase as we targeted a higher

number of topics during training.


5 Conclusion and Future Work

In this final chapter we present the conclusions of this report and mention possible

future work.

5.1 Conclusion

Our experiments show LDA is an algorithm well suited to this kind of task. Although

it is possible to use this method to link the blog posts to the news articles, the performance

achieved suggests it would be necessary to use additional features to improve on what we

achieved; otherwise, the results remain far from perfect.

With the generated gold standard we hope to contribute to the advancement of this

research area.

5.2 Future Work

Based on the exploration we have done, we propose to expand this work by using the

extra features extracted from the corpus: namely the news headlines, obtained from the

<hl1></hl1> tags, and the taxonomic keywords added by the NYT editors, stored

inside the <classifier class="" type="descriptor"> tags.

It would also be relevant to explore using the syntactical parses of the documents.

That should enable retrieving more relevant documents.

As we also said in Chapter 4, the gold standard should be improved by adding multiple

topic links per document and by improving the categorisation of the links, involving

multiple taggers or evaluators.


Another relevant approach would be to evaluate based on multiple features combined,

perhaps by linking based on a voting committee.


Bibliography

[Battisti et al., 2015] Francesca Battisti, Alfio Ferrara, and Silvia Salini. A decade of

research in statistics: a topic model approach. Scientometrics, 103(2):413–433, 2015.

[Blei, 2012] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84,

Apr 2012.

[Boyd-Graber et al., 2014] Jordan Boyd-Graber, David Mimno, and David Newman.

Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC

Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida, 2014.

[Broder, 2000] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In

Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, COM

'00, pages 1-10, London, UK, 2000. Springer-Verlag.

[Carson Sievert, 2014] Carson Sievert and Kenneth E. Shirley. LDAvis: A method for visual-

izing and interpreting topics. In Proceedings of the Workshop on Interactive Language

Learning, Visualization, and Interfaces, ILLVI 2014, pages 63-70, Baltimore, Mary-

land, USA, 2014.

[Endredy and Novak, 2013] Istvan Endredy and Attila Novak. More Effective Boiler-

plate Removal - the GoldMiner Algorithm. Polibits - Research journal on Computer

science and computer engineering with applications, (48):79–83, 2013.


[Fiscus and Doddington, 2002] Jonathan G. Fiscus and George R. Doddington. Topic

detection and tracking: Event-based information organization. chapter Topic Detection

and Tracking Evaluation Overview, pages 17–31. Springer US, Boston, MA, 2002.

[Han et al., 2013] Xiaohui Han, Jun Ma, Yun Wu, and Chaoran Cui. A novel machine

learning approach to rank web forum posts. Soft Computing, 18(5):941–959, 2013.

[Hoffman et al., 2010] Matthew D. Hoffman, David M. Blei, and Francis Bach. Online

learning for latent dirichlet allocation. In In NIPS, 2010.

[Hu et al., 2013] Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith.

Interactive topic modeling. Machine Learning, 95(3):423–469, 2013.

[Hu et al., 2014] Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith.

Interactive topic modeling. Machine Learning, 95:423–469, 2014.

[Macdonald and Ounis, 2006] Craig Macdonald and Iadh Ounis. The trec blog06 collec-

tion: Creating and analysing a blog test collection. Technical report, Department of

Computing Science, University of Glasgow, Scotland, United Kingdom, 2006.

[Madani et al., 2015] Amina Madani, Omar Boussaid, and Djamel Eddine Zegour. Real-

time trending topics detection and description from twitter content. Social Network

Analysis and Mining, 5(1):1–13, 2015.

[Manning et al., 2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich

Schutze. Introduction to Information Retrieval. Cambridge University Press, New

York, NY, USA, 2008.

[Newman and of Oxford. Reuters Institute for the Study of Journalism, 2009] N. New-

man and University of Oxford. Reuters Institute for the Study of Journalism. The

Rise of Social Media and Its Impact on Mainstream Journalism: A Study of how

Newspapers and Broadcasters in the UK and US are Responding to a Wave of Partic-

ipatory Social Media, and a Historic Shift in Control Towards Individual Consumers.

Working paper (Reuters Institute for the Study of Journalism). University of Oxford,

Reuters Institute for the Study of Journalism, 2009.

[Qi et al., 2015] Xiang Qi, Yu Huang, Ziyan Chen, Xiaoyan Liu, Jing Tian, Tinglei

Huang, and Hongqi Wang. Burst-lda: A new topic model for detecting bursty topics

from stream text. Journal of Electronics (China), 31(6):565–575, 2015.


[Rabin, 1981] M.O. Rabin. Fingerprinting by Random Polynomials. Center for Research

in Computing Technology: Center for Research in Computing Technology. Center for

Research in Computing Techn., Aiken Computation Laboratory, Univ., 1981.

[Rehurek and Sojka, 2010] Radim Rehurek and Petr Sojka. Software framework for topic

modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New

Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.

[Rose et al., 2010] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. Auto-

matic keyword extraction from individual documents. In Michael W. Berry and Jacob

Kogan, editors, Text Mining. Applications and Theory, pages 1–20. John Wiley and

Sons, Ltd, 2010.

[Salton et al., 1975] G. Salton, A. Wong, and C. S. Yang. A vector space model for

automatic indexing. Commun. ACM, 18(11):613–620, November 1975.

[Sandhaus, 2008] E. Sandhaus. The new york times annotated corpus. Linguistic Data

Consortium, Philadelphia, 6(12), 2008.

[Singhal, 2001] Amit Singhal. Modern Information Retrieval: A Brief Overview. Bulletin

of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–42,

2001.

[Stilo and Velardi, 2016] Giovanni Stilo and Paola Velardi. Efficient temporal mining of

micro-blog texts and its application to event discovery. Data Mining and Knowledge

Discovery, 30(2):372–402, 2016.

[Suphakit Niwattanakul and Wanapu, 2013] Suphakit Niwattanakul, Jatsada Singthongchai,

Ekkachai Naenudorn, and Supachanun Wanapu. Using of Jaccard coefficient for keywords

similarity. In Proceedings of the International MultiConference of Engineers and Computer

Scientists, IMECS 2013, March 13-15, 2013.

[Tristan et al., 2015] Jean-Baptiste Tristan, Joseph Tassarotti, and Guy L. Steele Jr.

Efficient training of lda on a gpu by mean-for-mode estimation. In Francis R. Bach

and David M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 59–68.

JMLR.org, 2015.

[van Rijsbergen, 1979] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.

[Vorontsov and Potapenko, 2014] Konstantin Vorontsov and Anna Potapenko. Additive

regularization of topic models. Machine Learning, 101(1):303–323, 2014.


[Wang et al., 2015] Chi Wang, Jialu Liu, Nihit Desai, Marina Danilevsky, and Jiawei

Han. Constructing topical hierarchies in heterogeneous information networks. Knowl-

edge and Information Systems, 44(3):529–558, 2015.

[Yan et al., 2013] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm

topic model for short texts. In Proceedings of the 22Nd International Conference on

World Wide Web, WWW ’13, pages 1445–1456, New York, NY, USA, 2013. ACM.

[Zhang and Zhang, 2012] Yong-Heng Zhang and Feng Zhang. In De-Shuang Huang, Jian-

hua Ma, Kang-Hyun Jo, and M. Michael Gromiha, editors, Intelligent Computing The-

ories and Applications: 8th International Conference, ICIC 2012, Huangshan, China,

July 25-29, 2012. Proceedings, chapter Research on New Algorithm of Topic-Oriented

Crawler and Duplicated Web Pages Detection, pages 35–42. Springer Berlin Heidel-

berg, Berlin, Heidelberg, 2012.


Appendices


A Blogs 06 full XML sample

<DOC>

<DOCNO>BLOG06-20051206-005-0002264497</DOCNO>

<DATE_XML>2005-02-04T17:17:57+0000</DATE_XML>

<FEEDNO>BLOG06-feed-000428</FEEDNO>

<FEEDURL>http://changeforkentucky.com/index.rdf#</FEEDURL>

<BLOGHPNO>BLOG06-bloghp-000428</BLOGHPNO>

<BLOGHPURL>http://changeforkentucky.com/#</BLOGHPURL>

<PERMALINK>http://changeforkentucky.com/archives/000184.html#</PERMALINK>

<DOCHDR>

http://changeforkentucky.com/archives/000184.html# 0.0.0.0 20051220205822 8569

Date: Tue, 20 Dec 2005 20:58:21 GMT

Accept-Ranges: bytes

ETag: "1b12d4-2005-8688dc00"

Server: Apache/2.0.50 (FreeBSD) DAV/2 SVN/1.0.5

Content-Length: 8197

Content-Type: text/html; charset=ISO-8859-1

Last-Modified: Fri, 25 Feb 2005 16:36:00 GMT

Client-Date: Tue, 20 Dec 2005 20:58:22 GMT

Client-Response-Num: 1

Proxy-Connection: close

X-Cache: MISS from paya.dcs.gla.ac.uk

</DOCHDR>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

<title>Change for Kentucky: Change for Kentucky Day at the Capitol</title>

<link rel="stylesheet" href="http://changeforkentucky.com/styles-site.css"

type="text/css" />

<link rel="alternate" type="application/rss+xml" title="RSS"

href="http://changeforkentucky.com/index.rdf" />

<link rel="start" href="http://changeforkentucky.com/" title="Home" />

<link rel="prev" href="http://changeforkentucky.com/archives/000183.html"

title="From the Ground Up" />

<link rel="next" href="http://changeforkentucky.com/archives/000185.html"

title="Meet Up is Wednesday, March 2, 2005" />

<script type="text/javascript" language="javascript">


<!--

function OpenTrackback (c) {

window.open(c,

’trackback’,

’width=480,height=480,scrollbars=yes,status=yes’);

}

var HOST = ’changeforkentucky.com’;

// Copyright (c) 1996-1997 Athenia Associates.

// http://www.webreference.com/js/

// License is granted if and only if this entire

// copyright notice is included. By Tomer Shiran.

function setCookie (name, value, expires, path, domain, secure) {

var curCookie = name + "=" + escape(value) + ((expires) ? "; expires=" +

expires.toGMTString() : "") + ((path) ? "; path=" + path : "") + ((domain)

? "; domain=" + domain : "") + ((secure) ? "; secure" : "");

document.cookie = curCookie;

}

function getCookie (name) {

var prefix = name + ’=’;

var c = document.cookie;

var nullstring = ’’;

var cookieStartIndex = c.indexOf(prefix);

if (cookieStartIndex == -1)

return nullstring;

var cookieEndIndex = c.indexOf(";", cookieStartIndex + prefix.length);

if (cookieEndIndex == -1)

cookieEndIndex = c.length;

return unescape(c.substring(cookieStartIndex + prefix.length, cookieEndIndex));

}

function deleteCookie (name, path, domain) {

if (getCookie(name))

document.cookie = name + "=" + ((path) ? "; path=" + path : "") + ((domain) ?

"; domain=" + domain : "") + "; expires=Thu, 01-Jan-70 00:00:01 GMT";

}

function fixDate (date) {

var base = new Date(0);

var skew = base.getTime();

if (skew > 0)

date.setTime(date.getTime() - skew);

}

function rememberMe (f) {


var now = new Date();

fixDate(now);

now.setTime(now.getTime() + 365 * 24 * 60 * 60 * 1000);

setCookie(’mtcmtauth’, f.author.value, now, ’’, HOST, ’’);

setCookie(’mtcmtmail’, f.email.value, now, ’’, HOST, ’’);

setCookie(’mtcmthome’, f.url.value, now, ’’, HOST, ’’);

}

function forgetMe (f) {

deleteCookie(’mtcmtmail’, ’’, HOST);

deleteCookie(’mtcmthome’, ’’, HOST);

deleteCookie(’mtcmtauth’, ’’, HOST);

f.email.value = ’’;

f.author.value = ’’;

f.url.value = ’’;

}

//-->

</script>

<!--

<rdf:RDF xmlns="http://web.resource.org/cc/"

xmlns:dc="http://purl.org/dc/elements/1.1/"

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<Work rdf:about="http://changeforkentucky.com/archives/000184.html">

<dc:title>Change for Kentucky Day at the Capitol</dc:title>

<dc:description>Join fellow Change for Kentucky members from around the state

for a day at the state capital in Frankfort, Thursday, February 17, 2005.

The day will be designed to make us more effective advocates. We will learn

our way around,...</dc:description>

<dc:creator>Allison Webster</dc:creator>

<dc:date>2005-02-04T12:17:57-05:00</dc:date>

<license rdf:resource="http://creativecommons.org/licenses/by-sa/1.0/" />

</Work>

<License rdf:about="http://creativecommons.org/licenses/by-sa/1.0/">

<requires rdf:resource="http://web.resource.org/cc/Attribution" />

<requires rdf:resource="http://web.resource.org/cc/Notice" />

<requires rdf:resource="http://web.resource.org/cc/ShareAlike" />

<permits rdf:resource="http://web.resource.org/cc/Reproduction" />

<permits rdf:resource="http://web.resource.org/cc/Distribution" />

<permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />

</License>

</rdf:RDF>

-->


</head>

<body>

<div id="banner">

<h1><a href="http://changeforkentucky.com/" accesskey="1">Change for

Kentucky</a></h1>

<span class="description">Change for Kentucky</span>

</div>

<div id="container">

<div class="blog">

<div id="menu">

<a href="http://changeforkentucky.com/archives/000183.html">&laquo; From the

Ground Up</a> |

<a href="http://changeforkentucky.com/">Main</a>

| <a href="http://changeforkentucky.com/archives/000185.html">Meet Up is

Wednesday, March 2, 2005 &raquo;</a>

</div>

</div>

<div class="blog">

<h2 class="date">February 04, 2005</h2>

<div class="blogbody">

<h3 class="title">Change for Kentucky Day at the Capitol</h3>

<p>Join fellow Change for Kentucky members from around the state for a day<br />

at the state capital in Frankfort, Thursday, February 17, 2005. </p>

<p><img src="http://changeforkentucky.com/archives/frankcap.gif" width="275"

height="205" border="0" /></p>

<p>The day will be designed to make us more effective advocates. We will<br />

learn our way around, observe committee meetings scheduled for the day,<br />

meet with legislators and watch the Senate and House in session from<br />

the chamber galleries. </p>

<p>Please be thinking about whether you might be able to join us for at<br />

least part of the day. It will be a good opportunity for you to meet<br />

other Change for Kentucky supporters from across the state.</p>

<p>Contact <a href="mailto:[email protected]">Libby Marshall</a> or <a

href="mailto:[email protected]">Lacey McNary</a> from CFK/Frankfort for

more information, of if you would like to volunteer to be a host or

guide.</p>

<p><br />

</p>

<a name="more"></a>

<span class="posted">Posted by Allison Webster at February 4, 2005 12:17 PM

<br /></span>


</div>

<div class="comments-head"><a name="comments"></a>Comments</div>

<div class="comments-head">Post a comment</div>

<div class="comments-body">

<form method="post" action="http://changeforkentucky.com/mt/mt-comments.cgi"

name="comments_form" onsubmit="if (this.bakecookie[0].checked)

rememberMe(this)">

<input type="hidden" name="static" value="1" />

<input type="hidden" name="entry_id" value="184" />

<div style="width:180px; padding-right:15px; margin-right:15px; float:left;

text-align:left; border-right:1px dotted #bbb;">

<label for="author">Name:</label><br />

<input tabindex="1" id="author" name="author" /><br /><br />

<label for="email">Email Address:</label><br />

<input tabindex="2" id="email" name="email" /><br /><br />

<label for="url">URL:</label><br />

<input tabindex="3" id="url" name="url" /><br /><br />

</div>

<!-- Security Code Check -->

<label for="code">Security Code:</label><br />

<input type="hidden" id="code" name="code" value="42" />

<img border="0" src="http://changeforkentucky.com/mt/mt-scode.cgi?code=42

"><br />

<input tabindex=3 id="scode" name="scode" /><br /><br />

<!-- end of Security Code Check -->

Remember personal info?<br />

<input type="radio" id="bakecookie" name="bakecookie" /><label

for="bakecookie">Yes</label><input type="radio" id="forget"

name="bakecookie" onclick="forgetMe(this.form)" value="Forget Info"

style="margin-left: 15px;" /><label for="forget">No</label><br

style="clear: both;" />

<label for="text">Comments:</label><br />

<textarea tabindex="4" id="text" name="text" rows="10" cols="50"></textarea><br

/><br />

<input type="submit" name="preview" value="&nbsp;Preview&nbsp;" />

<input style="font-weight: bold;" type="submit" name="post"

value="&nbsp;Post&nbsp;" /><br /><br />

</form>

<script type="text/javascript" language="javascript">

<!--

document.comments_form.email.value = getCookie("mtcmtmail");


document.comments_form.author.value = getCookie("mtcmtauth");

document.comments_form.url.value = getCookie("mtcmthome");

if (getCookie("mtcmtauth")) {

document.comments_form.bakecookie[0].checked = true;

} else {

document.comments_form.bakecookie[1].checked = true;

}

//-->

</script>

</div>

</div>

</div>

</body>

</html>

</DOC>

B NYT 06 Annotated Corpus full XML sample

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE nitf SYSTEM

"http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">

<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD

NITF 3.3//EN">

<head>

<title>National Briefing | Washington: Energy Secretary Confirmed</title>

<meta content="MB012055" name="slug"/>

<meta content="1" name="publication_day_of_month"/>

<meta content="2" name="publication_month"/>

<meta content="2005" name="publication_year"/>

<meta content="Tuesday" name="publication_day_of_week"/>

<meta content="National Desk" name="dsk"/>

<meta content="14" name="print_page_number"/>

<meta content="A" name="print_section"/>

<meta content="3" name="print_column"/>

<meta content="U.S." name="online_sections"/>

<docdata>

<doc-id id-string="1646630"/>

<doc.copyright holder="The New York Times" year="2005"/>

<identified-content>


<org class="indexing_service">Energy Department</org>

<person class="indexing_service">Hulse, Carl</person>

<person class="indexing_service">Bodman, Samuel W</person>

<classifier class="online_producer"

type="taxonomic_classifier">Top/News/U.S.</classifier>

</identified-content>

</docdata>

<pubdata date.publication="20050201T000000"

ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9A03E6D9123BF932A35751C0A9639C8B63"

item-length="168" name="The New York Times" unit-of-measure="word"/>

</head>

<body>

<body.head>

<hedline>

<hl1>National Briefing | Washington: Energy Secretary Confirmed</hl1>

</hedline>

<byline class="print_byline">By Carl Hulse (NYT)</byline>

<byline class="normalized_byline">Hulse, Carl</byline>

<abstract>

<p>Senate confirms new Energy Secretary Samuel W Bodman by voice vote; photo

(S)</p>

</abstract>

</body.head>

<body.content>

<block class="lead_paragraph">

<p>The Senate confirmed the new energy secretary, Samuel W. Bodman, left, by

voice vote, completing a smooth trip through the Senate by a man who

previously held senior positions in the Commerce and Treasury Departments.

A one-time chemical engineering professor, Mr. Bodman replaces Spencer

Abraham as the head of the agency responsible for a energy issues and

research. He is expected to play a role as the House and Senate try once

again to come to terms on wide-ranging energy legislation and decide

whether to allow drilling in the Arctic National Wildlife Refuge.</p>

<p>Carl Hulse (NYT)</p>

</block>

<block class="full_text">

<p>The Senate confirmed the new energy secretary, Samuel W. Bodman, left, by

voice vote, completing a smooth trip through the Senate by a man who

previously held senior positions in the Commerce and Treasury Departments.

A one-time chemical engineering professor, Mr. Bodman replaces Spencer

Abraham as the head of the agency responsible for a energy issues and


research. He is expected to play a role as the House and Senate try once

again to come to terms on wide-ranging energy legislation and decide

whether to allow drilling in the Arctic National Wildlife Refuge.</p>

<p>Carl Hulse (NYT)</p>

</block>

</body.content>

</body>

</nitf>

C Gold Standard file

;# books

# {8:6,10:4}

1647594

1647608

1652294

1654651

1654657

1654660

1654664

1660017

1660654

BLOG06-20051207-015-0004669977 | 20050219T140600

BLOG06-20051207-027-0003414242 | 20050303T201400

BLOG06-20051206-053-0022922844 | 20050325T003546

BLOG06-20051206-053-0022554608 | 20050331T213525

; business

# {5:2,8:1,10:0}

1647482

1652396

1656251

1656311

1658134

1661166

BLOG06-20051207-000-0018149515 | 20050209T144119

BLOG06-20051206-005-0012663571 | 20050211T042100

BLOG06-20051207-033-0011353957 | 20050321T120100


; arts/plays, movies, music

# {8:1,10:1}

; concerts | gigs

BLOG06-20051206-012-0011643299 | 20050323T162200

;# movies

1646773

1649017

1649475

1652996

1653775

1655061

BLOG06-20051206-017-0012074606 | 20050306T224722

;# music

1660506

; education 5

# {5:4,8:5,10:8}

1647660

1651293

BLOG06-20051207-033-0016323399 | 20050207T082032

BLOG06-20051207-033-0016307364 | 20050207T082040

;# health

# {5:4,8:5,10:7}

1646954

1652377

1653182

1658953

1660654

1660672

BLOG06-20051206-022-0010119325 | 20050209T012300

BLOG06-20051207-041-0014474902 | 20050209T090454

BLOG06-20051206-022-0010088187 | 20050210T021800

BLOG06-20051206-019-0007672170 | 20050210T131400

BLOG06-20051206-019-0007656300 | 20050210T135700

BLOG06-20051207-099-0004345988 | 20050211T140051

;# health | abortion

1646854

1647555

1647660

1647942


1647944

1647946

1648099

1648281

1649117

1649475

1650385

1650596

1654325

1654971

BLOG06-20051207-041-0014368762 | 20050209T152507

BLOG06-20051207-062-0017788843 | 20050311T102600

; sexual preferences

;# marriage/legislation/rights

# {5:3,8:5,10:5}

1646589

1646894

1647017

1647318

1647415

1647594

1647608

1647969

1648292

1648402

BLOG06-20051206-008-0019851299 | 20050202T155600

BLOG06-20051207-056-0016211866 | 20050204T174106

BLOG06-20051207-056-0015976383 | 20050207T135042

BLOG06-20051207-056-0015924528 | 20050207T185742

BLOG06-20051207-056-0015685587 | 20050208T141601

BLOG06-20051207-056-0015613582 | 20050208T144553

BLOG06-20051207-041-0014325770 | 20050210T191520

; social theories, religion

# {5:1,8:5,10:5}

1647470

1654653

1658064

1658327

BLOG06-20051207-000-0018070715 | 20050225T135600


BLOG06-20051207-000-0018043942 | 20050301T151800

;# religion

BLOG06-20051206-003-0021250584 | 20050326T145800

BLOG06-20051206-003-0021215450 | 20050327T125300

; human rights

# {5:3,8:3,10:8}

1658961

1659080

1660940

1660972

BLOG06-20051207-056-0016282311 | 20050204T145144

; journalism

1648754

1649276

1649918

1650628

BLOG06-20051207-037-0018070824 | 20050201T164829

BLOG06-20051207-043-0003056710 | 20050208T142348

BLOG06-20051207-093-0014990657 | 20050208T211923

BLOG06-20051207-030-0004949018 | 20050215T193400

BLOG06-20051207-037-0017099181 | 20050216T044100

;# law | civics | city | court, judge, major | human rights | same sex marriage

# {5:3,8:3,10:8}

1649714

1652379

1659115

1660723

BLOG06-20051207-025-0018688681 | 20050306T205200

BLOG06-20051206-014-0006802692 | 20050317T023921

BLOG06-20051206-014-0006708448 | 20050323T163334

; military and politics go here

; sports - baseball

# {5:0,8:6,10:6}

1650417

1652433

1658187


BLOG06-20051207-067-0014499680 | 20050328T155000

BLOG06-20051207-067-0014432379 | 20050331T183300

; sports - super bowl

1647036

1648054

1648065

1648072

1648133

1648171

1648179

1648182

1658363

BLOG06-20051207-056-0016187067 | 20050204T183914

BLOG06-20051207-056-0016081103 | 20050206T193146

;# technology | mobile phones | blogs

# {8:5}

1658775

1661144

BLOG06-20051207-090-0016847958 | 20050316T094649

BLOG06-20051207-090-0016817643 | 20050316T095358

; blogs

1648060

1648733

1649545

1652417

1656964

BLOG06-20051207-100-0025856345 | 20050209T174600

BLOG06-20051207-010-0001842471 | 20050227T110200

BLOG06-20051207-001-0010918632 | 20050301T050900

BLOG06-20051206-003-0021128252 | 20050330T025100

; terrorism

;# politics | terrorism

# {5:3,8:2,10:9}

; torture

1656303

1656306

; terror

1660940

1661155


1661163

BLOG06-20051207-056-0015738134 | 20050208T072956

; terrorism, academic freedom

1647047

1648116

1649108

BLOG06-20051207-060-0014521124 | 20050215T205800

;# politics | terrorism | 9-11

1646478

1646846

1646997

BLOG06-20051207-056-0016624520 | 20050203T090913

BLOG06-20051207-043-0002928883 | 20050210T195900

;# tsunami - UN

1647460

1650434

1652579

BLOG06-20051207-041-0014632987 | 20050205T160909

; v-day 17

# {5:4,8:7,10:3}

1649546

1649585

1649599

1649602

1649960

BLOG06-20051207-095-0004262868 | 20050207T122639

BLOG06-20051207-027-0011471287 | 20050208T225724

BLOG06-20051207-091-0028667535 | 20050211T155337

BLOG06-20051207-095-0011162242 | 20050213T024426

BLOG06-20051207-061-0009882858 | 20050215T000100

BLOG06-20051207-061-0009860129 | 20050215T023000

BLOG06-20051207-100-0003458648 | 20050215T034800

BLOG06-20051207-099-0032891589 | 20050215T062844

BLOG06-20051207-013-0002722984 | 20050215T155600

; work - check what topic gets assigned

1655377

1655432


1661228

BLOG06-20051207-041-0014516901 | 20050208T145144

BLOG06-20051207-102-0013289154 | 20050206T173727

; military 8

# {5:3,8:3,10:3}

1647074

1647107

1647374

1647889

1649226

1650339

1650826

BLOG06-20051207-056-0016578672 | 20050203T112616

BLOG06-20051207-000-0018200248 | 20050203T131249

BLOG06-20051207-056-0015771538 | 20050207T223644

BLOG06-20051207-060-0014589107 | 20050210T174114

BLOG06-20051207-060-0014575760 | 20050211T164100

BLOG06-20051206-022-0018529402 | 20050211T222400

BLOG06-20051207-000-0018123655 | 20050212T113100

BLOG06-20051207-033-0012313484 | 20050214T000913

BLOG06-20051206-022-0018492433 | 20050307T152200

BLOG06-20051206-056-0021994743 | 20050316T055441

;# military | iraq war

# {5:3,8:3,10:9}

1658201

1658202

BLOG06-20051207-043-0003351020 | 20050202T082933

BLOG06-20051207-056-0016843191 | 20050202T100403

BLOG06-20051207-056-0016812432 | 20050202T101502

BLOG06-20051207-056-0016460322 | 20050203T222103

BLOG06-20051207-043-0003098865 | 20050207T161258

BLOG06-20051207-056-0015716097 | 20050208T085557

BLOG06-20051207-041-0014433731 | 20050209T134551

BLOG06-20051207-043-0002971527 | 20050209T150007

BLOG06-20051207-043-0002844624 | 20050211T102500

BLOG06-20051207-060-0014549178 | 20050213T093700

;# military torture

1648322

1649394


1650883

1651927

1652830

1658077

BLOG06-20051207-023-0011378054 | 20050213T223800

BLOG06-20051207-056-0020623436 | 20050215T211300

;------------------

; politics

# {5:3,8:3,10:2}

;# pollitics | budget |

1653919

1658088

BLOG06-20051207-023-0011215456 | 20050228T195000

BLOG06-20051207-023-0010994236 | 20050319T120400

BLOG06-20051207-023-0010973646 | 20050319T120600

;# bush, state of the union

1647017

1647318

;--------------

1647471

1647558

1653861

1656274

1656282

1658176

1661152

1661174

1661182

1661222

1661239

BLOG06-20051206-008-0019893981 | 20050201T165600

BLOG06-20051206-008-0019872653 | 20050201T171200

BLOG06-20051207-036-0000970906 | 20050201T172234

BLOG06-20051207-049-0037430772 | 20050201T212306

BLOG06-20051206-008-0019829436 | 20050202T155900

BLOG06-20051207-056-0016673321 | 20050202T222544

BLOG06-20051206-008-0019765331 | 20050203T153000

BLOG06-20051207-056-0016548794 | 20050203T153628


BLOG06-20051207-056-0016337498 | 20050204T072922

BLOG06-20051207-000-0018175453 | 20050204T134200

BLOG06-20051207-036-0000871271 | 20050204T142814

BLOG06-20051207-056-0016258997 | 20050204T154017

BLOG06-20051206-005-0002273537 | 20050204T170331

BLOG06-20051207-043-0003267613 | 20050204T202015

BLOG06-20051206-008-0019721798 | 20050205T121900

BLOG06-20051207-043-0003182847 | 20050207T110933

BLOG06-20051207-056-0016059286 | 20050207T112650

BLOG06-20051207-056-0016030245 | 20050207T113652

BLOG06-20051207-056-0015800767 | 20050207T203945

BLOG06-20051207-060-0014642721 | 20050208T161421

BLOG06-20051207-060-0014629128 | 20050208T163118

BLOG06-20051207-060-0014615876 | 20050208T165852

BLOG06-20051207-006-0019619373 | 20050208T222338

BLOG06-20051207-056-0010698413 | 20050209T115825

BLOG06-20051207-041-0014454358 | 20050209T133514

BLOG06-20051207-041-0014413462 | 20050209T135114

BLOG06-20051206-008-0019592002 | 20050210T112300

BLOG06-20051207-037-0017857431 | 20050211T083600

BLOG06-20051206-008-0019548498 | 20050211T134600

BLOG06-20051207-037-0017787605 | 20050212T073800

BLOG06-20051207-060-0014562172 | 20050212T124900

BLOG06-20051206-008-0019526540 | 20050212T130200

BLOG06-20051206-003-0021156589 | 20050330T020800

BLOG06-20051207-023-0011399192 | 20050213T223100

;# politics | scandals

BLOG06-20051207-030-0004949018 | 20050215T193400

;# politics | scandals | gomery

BLOG06-20051206-008-0019634846 | 20050203T150200

BLOG06-20051206-008-0019700507 | 20050206T141100

BLOG06-20051206-008-0019808106 | 20050209T111200

;# politics | Bush

;# {5:3,8:2,10:2}

1647434

1647456

1647458

1647476

1647478

1647484

1647500


1647506

1647507

1647511

1647542

1647556

1653834

1653862

1653871

1656244

1656258

1656259

1656277

1658066

1658114

1658119

1658122

1658140

1658155

1661153

1661176

1661177

1661210

1661211

1661262

BLOG06-20051207-056-0016788974 | 20050202T112449

BLOG06-20051207-043-0003140787 | 20050207T134831

BLOG06-20051207-041-0014538284 | 20050207T151421

BLOG06-20051207-056-0015865902 | 20050207T190733

BLOG06-20051207-043-0003014765 | 20050209T140651

BLOG06-20051207-043-0002886450 | 20050210T213000

BLOG06-20051207-043-0002802327 | 20050211T104800

BLOG06-20051207-041-0014280422 | 20050215T082038

BLOG06-20051207-023-0011236141 | 20050228T194700

BLOG06-20051207-010-0002082101 | 20050225T154400

BLOG06-20051207-023-0010911305 | 20050328T200000

;# politics | europe | uk | Condoleezza Rice

; # {5:3,8:2,10:2}

1647434

1658130

;# politics | europe

1658076


BLOG06-20051207-033-0011010927 | 20050329T101500

BLOG06-20051207-033-0011859456 | 20050310T092200

BLOG06-20051207-033-0012150165 | 20050221T080600

BLOG06-20051207-033-0011916972 | 20050309T125700

;# politics | world | europe | royals

BLOG06-20051206-008-0019613577 | 20050210T111700

BLOG06-20051206-008-0019570424 | 20050211T110000

;# politics | world | europe | russia

1647413

1653831

1661232

;# middle east

# {5:3,8:2,10:9}

1647458

BLOG06-20051207-023-0011035213 | 20050316T200100

BLOG06-20051207-056-0020548713 | 20050220T212800

BLOG06-20051207-055-0002997060 | 20050303T161616

;# middle east | Iraq

# {5:3,8:2,10:9}

1658201

1658202

BLOG06-20051207-043-0003351020 | 20050202T082933

BLOG06-20051207-056-0016843191 | 20050202T100403

BLOG06-20051207-056-0016812432 | 20050202T101502

BLOG06-20051207-056-0016460322 | 20050203T222103

BLOG06-20051207-043-0003098865 | 20050207T161258

BLOG06-20051207-056-0015716097 | 20050208T085557

BLOG06-20051207-041-0014433731 | 20050209T134551

BLOG06-20051207-043-0002971527 | 20050209T150007

BLOG06-20051207-043-0002844624 | 20050211T102500

BLOG06-20051207-060-0014549178 | 20050213T093700
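
The listing above reproduces the gold standard file as used in the experiments. For readers who want to reuse it, the following is a minimal parsing sketch, not part of the project's tooling, assuming the conventions visible in the listing: lines starting with ";" are comments, ";#" lines carry topic labels (treated here as starting a new group), "#" lines carry count annotations, bare integers are New York Times Annotated Corpus article identifiers, and "BLOG06-... | timestamp" lines are TREC BLOG06 permalink identifiers. The name parse_gold_standard and the grouping rule are illustrative assumptions.

import re
from dataclasses import dataclass, field
from typing import List, Tuple

# Matches e.g. "BLOG06-20051207-043-0002928883 | 20050210T195900"
BLOG_LINE = re.compile(r"^(BLOG06-\S+)\s*\|\s*(\d{8}T\d{6})$")

@dataclass
class Group:
    # One labelled block: a topic label, the NYT article IDs,
    # and the aligned BLOG06 posts with their timestamps.
    label: str = ""
    news_ids: List[int] = field(default_factory=list)
    blog_posts: List[Tuple[str, str]] = field(default_factory=list)

def parse_gold_standard(path: str) -> List[Group]:
    groups: List[Group] = []
    current = Group()
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line:
                continue
            if line.startswith(";#"):
                # A ';#' topic label is treated here as opening a new group.
                if current.news_ids or current.blog_posts:
                    groups.append(current)
                current = Group(label=line[2:].strip())
            elif line.startswith((";", "#")):
                continue  # other comments and '# {...}' annotations
            elif line.isdigit():
                current.news_ids.append(int(line))  # NYT article ID
            else:
                m = BLOG_LINE.match(line)
                if m:
                    current.blog_posts.append((m.group(1), m.group(2)))
    if current.news_ids or current.blog_posts:
        groups.append(current)
    return groups
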

D Topic Evolution During Training


T1              T2              T3              T4              T5
0.004 time      0.004 time      0.004 time      0.004 time      0.005 time
0.004 day       0.004 street    0.004 music     0.004 street    0.004 street
0.004 music     0.004 music     0.004 street    0.004 music     0.004 music
0.003 street    0.003 play      0.004 play      0.004 play      0.004 art
0.003 play      0.003 day       0.004 art       0.004 art       0.004 play
0.003 people    0.003 art       0.003 day       0.003 day       0.003 film
0.003 art       0.003 people    0.003 people    0.003 film      0.003 day
0.003 york      0.003 york      0.003 film      0.003 people    0.003 people
0.003 book      0.003 film      0.003 york      0.003 theater   0.003 theater
0.003 theater   0.003 theater   0.003 life      0.003 life      0.003 life

T6              T7              T8              T9              T10
0.005 time      0.005 time      0.005 time      0.005 time      0.005 time
0.004 street    0.004 street    0.004 street    0.004 art       0.004 art
0.004 music     0.004 art       0.004 art       0.004 street    0.004 street
0.004 art       0.004 music     0.004 music     0.004 music     0.004 music
0.004 play      0.004 play      0.004 play      0.004 play      0.004 play
0.003 film      0.003 film      0.004 film      0.004 film      0.004 film
0.003 day       0.003 theater   0.003 theater   0.003 theater   0.003 theater
0.003 theater   0.003 day       0.003 life      0.003 life      0.003 life
0.003 life      0.003 life      0.003 day       0.003 day       0.003 book
0.003 people    0.003 people    0.003 people    0.003 people    0.003 people

T11             T12             T13             T14             T15
0.005 time      0.005 time      0.005 time      0.005 time      0.005 time
0.004 art       0.005 art       0.004 art       0.005 art       0.005 art
0.004 music     0.004 music     0.004 music     0.005 music     0.004 music
0.004 play      0.004 street    0.004 play      0.004 film      0.004 film
0.004 street    0.004 play      0.004 film      0.004 play      0.004 play
0.004 film      0.004 film      0.004 street    0.004 street    0.004 street
0.003 theater   0.003 theater   0.003 theater   0.003 theater   0.003 theater
0.003 life      0.003 life      0.003 life      0.003 life      0.003 life
0.003 book      0.003 book      0.003 book      0.003 book      0.003 book
0.003 people    0.003 people    0.003 people    0.003 people    0.003 include

Table 1: Topic evolution through training iterations; each column Tn lists the ten highest-probability terms of one topic after iteration n.
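
The code that produced this log is not reproduced in the report. As a hedged sketch of how such a per-iteration record of top topic terms could be kept, the fragment below uses the gensim library (an assumption; the project's own LDA implementation may differ), with a small toy corpus standing in for the pre-processed NYT articles and topic 0 standing in for the topic shown above:

from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy documents standing in for the pre-processed NYT subset.
texts = [
    ["time", "street", "music", "play"],
    ["art", "film", "theater", "life"],
    ["time", "music", "art", "people"],
    ["street", "day", "york", "book"],
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# One initial pass, then further single-pass updates; after each update
# the top terms of topic 0 are recorded, mirroring columns T1..T15.
model = LdaModel(corpus=corpus, id2word=dictionary,
                 num_topics=2, passes=1, random_state=0)
history = []
for _ in range(15):
    model.update(corpus)
    # show_topic returns (word, probability) pairs in gensim >= 1.0
    history.append([(round(p, 3), w) for w, p in model.show_topic(0, topn=10)])

for i, top_words in enumerate(history, start=1):
    print(f"T{i}:", top_words)

After each extra pass the top terms are appended to history, which can then be printed column by column to build a table in the format of Table 1.
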
