Textual Alignment of News and Blogs
Author:
Carlos A. Morales Garduno
Supervisor:
Dr. Mark R. Stevenson
A report submitted in fulfilment of the requirements
for the degree of MSc in Advanced Computer Science
in the
Department of Computer Science
September 7, 2016
Declaration
All sentences or passages quoted in this report from other people’s work have
been specifically acknowledged by clear cross-referencing to author, work
and page. Any illustrations which are not the work of the author of this
report have been used with the explicit permission of the originator and
are specifically acknowledged by clear cross-referencing to author, work
and page.
I understand that failure to do this amounts to plagiarism and will be considered grounds for failure in this project and the degree examination as a whole.
Name: Carlos Alberto Morales Garduno
Date: September 7, 2016
Signature:
Abstract
The number of blog posts published each day is estimated to be above 3 million, as more people use the blogosphere as their source of information. This change represents a challenge for traditional media, which needs to leverage the new opportunities offered by the medium. A key step for news researchers, the automated identification of links between news and blogs, would enable further analytical processing. Tracking public response, identifying new leads, following stories after print, and enabling new forms of public engagement are among the motivations for this. Experiments were conducted using the 2005 subset of the New York Times Annotated Corpus and the TREC Blogs06 collection.
This project explores different techniques to identify the links. A baseline was defined using cosine distance. Then, a procedure was developed to extrapolate document-to-document comparisons for the exploration of topical relations. Later, an LDA implementation was explored to improve the baseline. A further improvement was introduced by performing a second comparison with Kullback-Leibler divergence. Precision and recall were used to compare performance. Additionally, a gold standard was created which will enable further research in this area.
Acknowledgements
I would like to thank my supervisor, Dr Mark R. Stevenson, for taking me under his
guidance to do this Master’s Dissertation. Without his consistent support and shared
knowledge, it would not have been possible to complete this project.
Also, I am thankful to my personal tutor, Dr Robert Gaizauskas, for the feedback and advice provided during this project and throughout the academic year.
I would also like to thank my wife Imelda Gonzalez for her continuous support and companionship on this project, especially during some critical moments.
This project was developed with the support of the Mexican National Council for
Science and Technology’s grant number 411174.
Contents
1 Introduction
1.1 Project Aims and Goals
1.1.1 Project Aims
1.1.2 Goals
1.2 Overview of the Report
2 Literature Survey
2.1 Topic Modelling
2.1.1 Latent Dirichlet Allocation
2.1.2 Perplexity
2.1.3 Gibbs Sampling
2.1.4 Variational Bayes
2.1.5 Topic Detection and Tracking
2.1.6 Tree-Based Topic Models
2.1.7 Symbolic Aggregate ApproXimation SAX
2.1.8 Hierarchical Relational Models for Topic Networks
2.2 Document Similarity
2.2.1 Shingling
2.2.2 Rabin Fingerprints
2.2.3 Jaccard Coefficient
2.3 Boilerplate
2.3.1 jusText
2.3.2 The GoldMiner algorithm
2.4 Summary
3 Methodology
3.1 Project Aim
3.2 The Datasets
3.2.1 The New York Times Annotated Corpus
3.2.2 TREC Blogs 06
3.3 Creating the Gold Standard
3.4 Text Preprocessing
3.5 Processing
3.5.1 Baseline - Cosine Distance
3.5.2 Word overlap
3.5.3 LDA
3.6 Summary
4 Experiments and Results
4.1 Experimental Strategy
4.1.1 Metrics for this evaluation
4.2 Experiments
4.2.1 Removing boilerplate
4.3 Results
4.3.1 Cosine distance and word overlap
4.3.2 LDA model
4.3.3 Observations on the Gold Standard
4.3.4 Topic Coherence
4.3.5 Additional observations
4.4 Summary
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
Appendices
A Blogs 06 full XML sample
B NYT 06 Annotated Corpus full XML sample
C Gold Standard file
D Topic Evolution During Training
List of Figures
2.1 Sample webpage showing the relevant content highlighted in green and boilerplate in red.
3.1 Sample XML content from the NYT Annotated Corpus dataset.
3.2 Sample XML content from the TREC Blogs 06 dataset.
3.3 Screenshot showing the original data split of the blogs corpus.
4.1 Perplexity during training
4.2 Visual analysis of the inferred topics
4.3 Visualization of duplicated topics.
4.4 The topics being duplicated seemed to increase as we targeted a higher number of them during training.
List of Tables
1.1 Linking Example between Blog: BLOG06-20051207-043-0003056710, and News Article: 1649918.xml
3.1 Precision and Recall pseudo-contingency table.
3.2 Example of the topic linking per document, as specified in our gold standard
4.1 Number of blog posts to process, with and without filtering boilerplate
4.2 Number of tokens to process in two documents, with and without filtering boilerplate
4.3 Baseline and first features compared.
4.4 Top 10 frequent words in the first 15 topics of the 25-topic LDA model
4.5 Performance for the LDA model trained on 5 topics
4.6 Performance for the LDA model trained on 8 topics
4.7 Performance for the LDA model trained on 10 topics
4.8 General precision and recall achieved by each LDA model.
1 Topic evolution through training iterations
1 Introduction
Today, Web 2.0 applications are a common tool for sharing content online. Every
second, individuals and organised institutions alike use them to contribute to a broad
range of topics. Obeying the guidelines governing each platform, their formats range
from only short sentences or multimedia, as is the case of micro-blogs, to longer texts
interleaved with miscellaneous types of multimedia as the most permissive format of blogs
allows.
Since they first appeared back in 1999, blogs have shown a steadily growing trend in both content and user base. The exact size of the blogosphere is difficult to assess; according to the statistics at wordpress.com, more than 60 million blogs are published each month, while a similar estimate by technorati.com proposes that around 4 million are created each day.
With topics ranging from the purely personal to breaking world news, blogs are attractive from a data mining perspective. To properly exploit this potential, traditional NLP tools need to be adapted to handle the diversity of writing styles common in blogs. Otherwise, it would not be feasible to make sense of all the information available.
This kind of textual analysis has interested a group of researchers at Thomson Reuters. They would like to gain additional insight into how their traditionally generated content relates to the blogosphere. As Newman suggests, the way traditional and social media interact has changed [Newman, 2009]. By linking blog posts to related news articles, not only can they assess public reaction, but journalists can also expand their stories or look for new leads.
To perform this initial task, a custom framework is required to carry out the comparisons needed and generate user-interpretable output with the results.
After the described initial approach is completed, the insight gained would enable an informed decision on how to proceed with deeper analysis as required.
To ground this project, our implementation will focus on two datasets. Firstly, a subset of the New York Times Annotated Corpus [Sandhaus, 2008] corresponding to the year 2005 will be used to train our model. Then, aiming to identify which blog posts link to those news articles, the TREC Blogs 06 [Macdonald and Ounis, 2006] corpus will be processed.
Key factors when selecting these datasets were the timeframe coverage, textual diversity, availability, and ease of access. We were aiming for the highest number of news articles and blog posts published during a year. Of the few other blog datasets fulfilling these requirements, the TREC Blogs 08 and the ICWSM-2009 Spinn3r datasets both carried a premium price tag. Another one, made available by the Common Crawl1 organisation, required extra analysis and processing to isolate the blogs they collect. With respect to the news dataset, the Reuters Corpora2 seemed promising but was not available at the time.

1 http://commoncrawl.org/
2 http://trec.nist.gov/data/reuters/reuters.html
1.1 Project Aims and Goals
1.1.1 Project Aims
The core aim of the project is to generate a collection of links between news stories
published online and blog posts covering the same topic. This collection would then
allow subsequent analytical processing. Topic coverage and trend changes are among the
metrics to explore.
1.1.2 Goals
After reviewing literature related to our work, we envisaged the following approach to
realise the proposed goals.
1. Gold Standard: Currently, no dataset of this kind is available. In order to perform the comparisons we propose, one must be manually crafted. Given the datasets' size, their mutual differences, and the effort required to generate a quality product, this creation is considered not only a fundamental step for this project, but a major contribution. Such a task would enable future comparisons of competing models without the burden of creating a similar dataset.
2. Topic Modelling: We need to consider that one or more news articles can cover the same topic, and that a single article can span two or more topics. Also, the timeframe for event coverage can span from one day to a couple of weeks. Relevant articles for a topic can be identified and linked to blog posts covering the same topic.
3. Topic Linking: Lastly, of special interest for this work is the exploration of means leading to the successful detection of links between news and blogs. Given the inherent differences between our datasets, this promises to be a challenging task. To motivate this point, Table 1.1 shows a blog post and a news article covering the same topic. The reader should at least notice the syntactic and grammatical differences present.
1.2 Overview of the Report
In order to visualise the structure of this report we now provide a high-level description
of each chapter.
Chapter 2 presents a review of previous work and an introduction to the techniques used in the project.
Chapter 3 discusses the methodology followed and introduces our technical and risk analysis.
Chapter 4 contains our experiments and observed results.
Chapter 5 presents our conclusion and future work.
Blog Post:

CNN Clarifies Remarks by News Exec. Eason Jordan
Two weeks ago at the World Economic Forum in Switzerland, CNN's chief news executive, Eason Jordan, raised eyebrows when he suggested that some of the 63 journalists who have been killed in Iraq had been targeted by US troops. Although Jordan quickly tempered the remarks, a controversy has been building over them on the web. CNN has responded, issuing a statement clarifying Jordan's comments. It reads as follows: "While the majority of the 63 journalists killed in Iraq have been killed by insurgents, the Pentagon has acknowledged that the US military on occasion has killed people who turned out to be journalists," the CNN statement said. "Mr. Jordan emphatically does not believe that the US military intended to kill journalists and now believes that the US military mistakenly identified the journalists as Iraqi civilians."
# posted by **** **** @ 2:23:48 PM

News Article:

CNN Executive Resigns Post Over Remarks
Eason Jordan, a senior executive at CNN who was responsible for coordinating the cable network's Iraq coverage, resigned abruptly last night, citing a journalistic tempest he touched off during a panel discussion at the World Economic Forum in Davos, Switzerland, late last month in which he appeared to suggest that United States troops had deliberately aimed at journalists, killing some. Though no transcript of Mr. Jordan's remarks at Davos on Jan. 27 has been released, the panel's moderator, David Gergen, editor at large of U.S. News & World Report, said in an interview last night that Mr. Jordan had initially spoken of soldiers, "on both sides," who he believed had been "targeting" some of the more than four dozen journalists killed in Iraq.

Table 1.1: Linking Example between Blog: BLOG06-20051207-043-0003056710, and News Article: 1649918.xml
2 Literature Survey
We present an introduction to the core topics of the project with the intention of familiarising the reader with the field and easing their way through this document.
2.1 Topic Modelling
As Blei explains in his probabilistic topic models review, topic modelling algorithms are
statistical methods capable of analysing a text corpus to identify what themes go through
it, how they connect and change over time [Blei, 2012]. It is important to mention that,
as unsupervised models, they infer those hidden parameters on their own with no need
for pre-classified or pre-labeled texts.
The simplest of these models, Latent Dirichlet Allocation, was introduced in 2003 and is the focus of our work.
2.1.1 Latent Dirichlet Allocation
Classified as a generative probabilistic topic model, it assumes that data is generated
from a joint probability distribution of observed and hidden random variables [Blei, 2012].
Considering the topics and their distributions as the hidden variables, and the words of
the documents as the observed variables.
To formally describe this model, we present its fundamental formula, equation 2.1, as shown in Blei's review, where the topics are represented by β_{1:K}, the topic proportions for the dth document are θ_d, the topic assignments for the dth document are z_d, and the observed words for document d are w_d.
p(\beta_{1:K}, \theta_{1:D}, z_{1:D}, w_{1:D}) = \prod_{i=1}^{K} p(\beta_i) \prod_{d=1}^{D} p(\theta_d) \left( \prod_{n=1}^{N} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{1:K}, z_{d,n}) \right) \quad (2.1)
As explained by Blei, this model has a number of dependencies. Namely, the topic assignments z_{d,n} depend on the per-document topic proportions θ_d. The observed words w_{d,n} depend both on the topic assignment z_{d,n} and on the topics available β_{1:K} [Blei, 2012].
An intuition of the concept underpinning this model rests on the assumption that each topic has an inherently strong bond with a previously defined collection of words. These words are then sampled to generate the documents. On this basis, statistical analysis of the textual content of the documents in a corpus allows inferring its topics.
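To make the assumption concrete, the following is a toy sketch of this generative process in Python; all sizes are made up for illustration, and it is not part of our implementation:

    import numpy as np

    # toy dimensions for illustration only
    K, V, N = 3, 10, 8            # topics, vocabulary size, words per document
    alpha, eta = 0.1, 0.01        # Dirichlet concentration parameters
    rng = np.random.default_rng(0)

    beta = rng.dirichlet([eta] * V, size=K)     # topics beta_{1:K}
    theta = rng.dirichlet([alpha] * K)          # topic proportions theta_d
    z = rng.choice(K, size=N, p=theta)          # topic assignments z_{d,n}
    w = [rng.choice(V, p=beta[k]) for k in z]   # observed words w_{d,n}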
Two common approximations to this inference are categorised as sampling approaches or optimisation approaches. Both are described later in this document.
In addition to the implementations and extensions of LDA covered by Blei's review [Blei, 2012], we briefly introduce six recent publications involving this model. Firstly, [Han et al., 2013] present an extension of LDA used to rank web forum posts. Secondly, [Qi et al., 2015] describe an LDA version able to handle briefly appearing topics. Thirdly, [Vorontsov and Potapenko, 2014] added linguistic assumptions to the model. Fourthly, [Madani et al., 2015] did an implementation to detect trending topics in real time on Twitter. Fifthly, [Tristan et al., 2015] propose an alternative training method using Graphical Processing Units. Finally, [Hu et al., 2013] propose a method to allow non-experts in machine learning to provide their feedback to the models generated by LDA.
Apart from being covered as part of the initial exercise of familiarising ourselves with topic models, these papers are mentioned here to attest to the prevalent popularity of this model within the research community.
A highly relevant issue for our project and a key principle of LDA, the previously mentioned a-priori definition of topic distributions, can turn into a difficulty when trying to identify these distributions with no prior knowledge of the topics. An active area of research has been addressing this problem.
As explained by Battisti et al., defining the correct number of topics is not a trivial task. There is a tradeoff between setting a number large enough to cover all the topics present in the corpus, while keeping it as low as possible to match the number of topics identified by domain experts [Battisti et al., 2015]. To aid this task, a measure has been introduced under the name of perplexity. The problem is then approached by trying to approximate the adequate number via a variant of Expectation Maximisation.
2.1.2 Perplexity
Defined as the reciprocal geometric mean of the likelihood of a test corpus given the model, perplexity measures the ability of a model to generalise to unseen data [Madani et al., 2015]. In other words, it gives a measure of how well the model describes or represents data it has not seen before. Equation 2.2 formalises this concept, where W is the test corpus and n^{(jd)} denotes the count of word j in document d,

PP(W) = \exp\left\{ \frac{-\log p(W)}{\sum_{d=1}^{D} \sum_{j=1}^{V} n^{(jd)}} \right\} \quad (2.2)
When evaluating LDA models, a lower perplexity score means the model better represents the words in a test document given the inferred topics.
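For reference, gensim (the implementation used later in this project) exposes a per-word likelihood bound from which this score can be recovered. A sketch, assuming a trained model lda and a held_out corpus, and noting that gensim reports this bound in log base 2:

    # per-word likelihood bound (log base 2) on a held-out corpus
    bound = lda.log_perplexity(held_out)
    perplexity = 2 ** (-bound)   # lower is better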
With respect to the inference task, as previously stated, common approaches in this area are categorised as sampling or optimisation. Based on their relevance to our project, we describe one instance of each class.
2.1.3 Gibbs Sampling
Gibbs sampling is a widely used sampling technique for LDA to discover the topics that run through the dataset, the topic distributions, and how each word corresponds to a given topic [Battisti et al., 2015].
It is a common practice to perform Gibbs sampling by integrating out both the per-document distribution over topics and the topic distributions over words, and sampling only the per-word topic assignment [Battisti et al., 2015; Hu et al., 2013; Vorontsov and Potapenko, 2014; Yan et al., 2013]. This technique is formalised by the following formula,

p(z_{d,n} = k \mid \mathbf{Z}, w_{d,n}) \propto (\alpha_k + n_{k|d}) \frac{\beta + n_{w_{d,n}|k}}{\beta V + n_{\cdot|k}} \quad (2.3)
where d is the document index and n is the token index in that document; n_{k|d} is topic k's count in document d; α_k is topic k's prior; Z are the topic assignments excluding the current token w_{d,n}; β is the prior for word w_{d,n}; n_{w_{d,n}|k} is the count of tokens with word w_{d,n} assigned to topic k; V is the vocabulary size; and n_{·|k} is the count of all tokens assigned to topic k.
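To clarify the bookkeeping involved, the following is a toy sketch of one pass of this sampler, assuming the count matrices n_kd (topics × documents), n_wk (vocabulary × topics) and n_k (topics) have been initialised from a random assignment:

    import numpy as np

    def gibbs_pass(docs, z, n_kd, n_wk, n_k, alpha, beta, V, rng):
        # docs: list of token-id lists; z: current per-token topic assignments
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current token from all counts
                n_kd[k, d] -= 1; n_wk[w, k] -= 1; n_k[k] -= 1
                # equation 2.3: unnormalised conditional over topics
                p = (alpha + n_kd[:, d]) * (beta + n_wk[w]) / (beta * V + n_k)
                k = rng.choice(len(n_k), p=p / p.sum())
                z[d][i] = k
                n_kd[k, d] += 1; n_wk[w, k] += 1; n_k[k] += 1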
Given that optimisation methods have generally been found to be faster than, and as accurate as, Gibbs sampling at inference when dealing with large datasets, they are favoured over sampling methods.
2.1.4 Variational Bayes
Methods in this class aim at optimising a simplified parametric distribution in order to make it closer to the posterior, divergence-wise. An online stochastic version of Variational Bayes is used in this project. Although a core strength of this method is its ability to handle large volumes of data with no need for local storage, we leverage only its convergence speed [Hoffman et al., 2010].
2.1.5 Topic Detection and Tracking
Topic Detection and Tracking is a programme created in 1999 by the National Institute of Standards and Technology (NIST) with the aim of developing technologies to search, organise and structure multilingual news materials. Its introduction defined two fundamental ideas in the area: the continuous processing of data as it is being collected, and the concept of a topic. In its first versions, it defined five research tasks: Topic Tracking, Link Detection, Topic Detection, First Story Detection and Story Segmentation [Fiscus and Doddington, 2002].
It is worth mentioning that several of the NLP technologies available nowadays were motivated by this programme.
2.1.6 Tree-Based Topic Models
In their work, Hu et al. mention a weakness of LDA: using a symmetric prior for all words ignores the relations between them. To overcome this limitation, they propose tree-based topic models, arguing that trees are a natural formalism for encoding lexical information [Hu et al., 2013].
They replace the multinomial distributions with tree-structured equivalents for each topic, associating each leaf node with a word. To encode correlations in a tree, the prior for each word needs to be known. Starting from the root node, every correlated word having the same prior is linked to it. Then, for each pair of linked words that are correlated, their previous links are removed and they are connected together under a new parent node, which is in turn linked to the root, thus merging the links for correlated words.
Later, to generate each topic distribution, one needs to traverse the tree: starting from the root node, take a child node; if it is not a leaf node, take a child again and repeat until reaching a leaf node. The described process is equivalent to sampling once from a multinomial distribution.
2.1.7 Symbolic Aggregate ApproXimation SAX
An algorithm for event discovery based on transforming word temporal series into strings of symbols [Stilo and Velardi, 2016]. It works in conjunction with regular expressions and a clustering algorithm to detect events in social media streams.
Although this work does not deal with topic identification, the authors justify the comparison of their model against a variation of LDA because the latter is foundational for competing models in topic identification.
2.1.8 Hierarchical Relational Models for Topic Networks
This work proposes to use techniques from network analysis to consider each document as a node in a network and to identify relations or links between documents based only on each document's attributes [Wang et al., 2015]. Said attributes could be the document's author, its collaborators, and so on. If a trained model is presented with a node and its links, it predicts a distribution of node attributes.
An improvement over competing network-based models, it handles both the structure of the network and the attributes of its nodes at the same time.
As in traditional LDA, this model considers documents as being generated from topic distributions, while the links between documents are represented as binary variables and are treated as dependent on the topics that generated each document.
2.2 Document Similarity
Nowadays it is estimated that as much as 40% of web pages on the internet are duplicated [Manning et al., 2008]. There is a variety of reasons for this, from simple redundancy policies to copycat websites trying to trick users. This causes an increased need for resources like storage, bandwidth, hardware maintenance and so on. For that reason, it has become a keen topic for the search and storage research community. Many methods have emerged trying to address the different issues involved; still, as mentioned by Manning et al. [2008], this phenomenon represents an open problem which requires addressing.
It is appropriate now to present the Vector Space Model, a key concept when dealing with document representations and similarities.
Vector Space Model Each document is represented as a vector in an n-dimensional space. To do so, a technique known as Bag-of-Words is used to hold the term frequency count of each document's elements. To perform a query against this space, the query is converted to its vector space model representation. This conversion is equivalent to placing the query in its corresponding location in the vector space. Then, the vectors located closest to the query correspond to the most similar documents. [Salton et al., 1975]
There are several options to measure this similarity: the inner product of the vectors can be computed, or a measure of the angle between them. We opted for the cosine distance.
Cosine distance A measure of similarity between two vectors, it works by measuring the cosine of the angle between them. It is bounded between 0 and 1 and is defined by the following equation [Singhal, 2001],

\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\, \sqrt{\sum_{i=1}^{n} B_i^2}}
When applied to document similarity, it yields a value of 1 for documents with exactly the same content, a value of 0 for documents not sharing any content, and a number in between for other cases.
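As a minimal sketch, for two term-frequency vectors of equal length:

    import numpy as np

    def cosine_similarity(a, b):
        # the equation above applied to two term-frequency vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine_similarity([1, 2, 0], [1, 2, 0]))  # 1.0, identical content
    print(cosine_similarity([1, 0, 0], [0, 1, 1]))  # 0.0, nothing shared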
Among the methods to detect duplicated websites, a naive one is to process the page's characters in order to generate a brief representation of it, its fingerprint. This shorter version of each document is used for comparison purposes. When another website's fingerprint matches, the whole content of both is compared at the character level. If there is an exact match they are flagged as duplicates and only one of them is preserved. Simple as this method might seem, it is well established and is core to several others.
Another approach to detecting duplicates, as expressed by Zhang, generates an abstract for each document or website using two strings plus an md5 hash for comparison [Zhang and Zhang, 2012]. To form the abstract it uses the syntax-free word-count representation of each document, known as bag-of-words. The first string is generated by concatenating the top representative words by their counts. The second string is generated by concatenating the top representative words again, but sorted in alphabetical order. These two strings are then paired with the md5 hash of the document.
A weakness of these and similar approaches appears when dealing not with exact duplicates but with highly similar documents presenting just slight variations in their content. Consider, for example, two instances of the same document which differ only in the date or time of publication. Both instances would produce distinct md5 hashes or fingerprints while, in fact, their meaningful content should be considered duplicated.
2.2.1 Shingling
An alternative solution to deal with this type of close similarity was published by Broder as a technique known as shingling [Broder, 2000]. He proposes that each document be defined by its k-shingles, or in other words, by the set of consecutive sequences of k terms taken from the document. To help the reader get acquainted with this technique, an example taken from his work is reproduced here.
For this example we consider a very succinct document containing exactly the words (a,rose,is,a,rose,is,a,rose). Its corresponding 4-shingling would be {(a,rose,is,a), (rose,is,a,rose), (is,a,rose,is)}.
Once the shingles are computed, the documents can be compared by computing their Jaccard coefficient, defined by equation 2.4; if their score goes above an established threshold, they can be flagged as duplicates.

j(A,B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \quad (2.4)
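A small sketch reproducing Broder's example and equation 2.4:

    def shingles(tokens, k=4):
        # the set of consecutive k-term sequences defining a document
        return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

    def jaccard(a, b):
        # equation 2.4 over two shingle sets
        return len(a & b) / len(a | b)

    doc = "a rose is a rose is a rose".split()
    print(shingles(doc))   # the three 4-shingles listed above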
In terms of nomenclature, when dealing with alphabet letters, shingles are also known as q-grams, while w-shingling is the accepted term for associating a document with its set of shingles of size w.
As Broder explains, this technique can also be used for categorisation tasks. A limitation of this approach is the requirement for each cluster to be clearly defined and separable from the rest. A document would then be categorised as belonging to the cluster with which it shares the most shingles. For our approach, this seems a less desirable method given the writing style differences between news and blog posts, and the intuition of non-separability for some topics.
2.2.2 Rabin Fingerprints
Initially introduced by Rabin in 1981 as an efficient fingerprinting technique [Rabin,
1981], this method provides core functionality for the shingling algorithm as proposed by
Broder [Broder, 2000].
Rabin proposed to use randomly chosen irreducible polynomials to generate file fingerprints to identify unauthorised local file modifications. According to this method, when a file is created, its fingerprint F is computed and safely stored by a guardian. Another fingerprint F1 is computed whenever file integrity needs to be assessed. Then, by comparing both values it is simple to know if the file was altered without proper authorisation: there would be a mismatch between the fingerprints, F ≠ F1.
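The following toy sketch conveys the idea; a genuine Rabin fingerprint uses a randomly chosen irreducible polynomial over GF(2), whereas this simplified stand-in uses an ordinary polynomial hash modulo a large prime:

    def fingerprint(data, x=256, p=(1 << 61) - 1):
        # polynomial hash of a byte string, a stand-in for F
        h = 0
        for byte in data:
            h = (h * x + byte) % p
        return h

    F = fingerprint(b"original file contents")
    F1 = fingerprint(b"tampered file contents")
    print(F != F1)   # True: the mismatch reveals the modification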
2.2.3 Jaccard Coefficient
Commonly used to measure the similarity between two datasets, this coefficient is based on the Jaccard index, previously introduced. As shown in equation 2.4, it is the result of dividing the number of features common to both datasets by the total number of features in either. The Jaccard distance can also be computed as the complement of the Jaccard coefficient [Suphakit Niwattanakul and Wanapu, 2013].

j_d(A,B) = 1 - j(A,B) = \frac{|S_A \cup S_B| - |S_A \cap S_B|}{|S_A \cup S_B|} \quad (2.5)
Alongside document similarity detection, another key concept when dealing with web
content is Boilerplate.
2.3 Boilerplate
Content found on websites and blogs is made up of diverse functional parts accompanying the central content. Headers, footers, programmatic scripts and navigational aids are commonly referred to as boilerplate. They are characterised by seemingly uniform content across the website they belong to. Although useful for easing human access, their prevalence makes them irrelevant for text processing tasks. Some approaches to dealing with them rely on statistical measures to characterise them on a site-by-site basis. Two models successful in this area are the GoldMiner algorithm and jusText.
2.3.1 jusText
A freely available1 tool to remove boilerplate, jusText aims to preserve relevant textual content by analysing sentence lengths and the ratio of links and tags in a block of content2. It is described as "well suited for creating linguistic resources such as Web corpora."
By using jusText in our project, we were able to avoid processing unnecessary content. Figure 2.1 shows a representation of boilerplate in a webpage3. Section 4.2.1 contrasts the results of using this algorithm to remove the boilerplate from one of the 13,227 individual blog sites in this project.
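As an illustration, a minimal usage sketch of the tool, assuming the raw HTML of a crawled blog post is already in memory:

    import justext

    # html: raw bytes of a crawled blog post page
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)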
2.3.2 The GoldMiner algorithm
Presented as an improvement over the previously introduced jusText, its achievement was due to the use of machine learning to better characterise boilerplate content [Endredy and Novak, 2013].
Although its published performance seems promising, at the time of writing this dissertation we could not find a publicly available Python implementation of GoldMiner. For this reason, we decided to use jusText.
1 https://github.com/miso-belica/jusText
2 http://corpus.tools/wiki/Justext/Algorithm
3 Sample image taken from the project website at http://corpus.tools/attachment/wiki/Justext/nlp_jusText_fi.jpg
Figure 2.1: Sample webpage showing the relevant content highlighted in green and boilerplate in red.
2.4 Summary
In this chapter, we presented core concepts for our project. Topic modelling, Latent Dirichlet Allocation, Gibbs sampling, Topic Detection and Tracking, document similarity, and shingling are among the techniques underpinning this project.
We now proceed to present the methodology followed.
3 Methodology
In this chapter we describe our approach, first reviewing the aim of the project, then
detailing each step involved.
3.1 Project Aim
As mentioned in Chapter 1, the core aim of the project is to generate a collection of
links between online published news stories and blog posts covering the same topic or
event.
This linking will enable further analytical processing; the basic metrics evaluated are
precision, recall, and topic coverage.
To this end, after studying the publications relevant for the literature survey, we en-
visaged the following approach for this project:
1. Gold Standard. Given that no gold standard dataset exists, creating one was paramount. Due to its high relevance for the project, section 3.3 is dedicated to detailing how this process was performed. The resulting gold standard is available in Appendix C.
2. Inferring latent topics. To work with topic modelling techniques, we relied on the LDA implementation available in the Python package gensim [Rehurek and Sojka, 2010].
              Relevant   Not Relevant
Retrieved     A ∩ B      Ā ∩ B          B
Not Retrieved A ∩ B̄      Ā ∩ B̄          B̄
              A          Ā

Table 3.1: Precision and Recall pseudo-contingency table.
Model Setup. As previously stated, this model was trained on the NYT Annotated Corpus. For this, a subset of the dataset was used: corresponding to the months of February and March, it contained 14,748 news articles.
Number of Topics. Given that the actual number of topics is unknown to us, we decided to explore varying the number of target topics for each run, from 5 to 200 topics. Although we are aware of the active research being conducted on automatically identifying the number of topics, its application was judged out of scope for this project.
Hyperparameters for LDA. Alpha and eta control the sparsity of the per-document topic distributions and the per-topic token distributions, respectively. Both were initialised with an unbiased prior of one over the number of target topics1. The inference method was online Variational Bayes. Finally, the number of iterations in all cases was set to ten.
3. Evaluation. To test our model's performance, we defined a baseline using cosine similarity, then experimented with how variations in different parameters affected the results. The metrics used for our exploration are Precision and Recall [van Rijsbergen, 1979]. To illustrate the concept, table 3.1 shows the pseudo-contingency table presented by van Rijsbergen [1979].
Precision The proportion of retrieved documents that are actually relevant: what percentage of the documents retrieved and linked to topic n coincide with the corresponding list in the gold standard. It is shown in equation 3.1, referencing elements from table 3.1.

Precision = \frac{|A \cap B|}{|B|} \quad (3.1)
1 More details for these parameters can be read at: https://radimrehurek.com/gensim/models/ldamodel.html
Recall The proportion of relevant documents that were retrieved: what percentage of the documents in topic n from the gold standard were correctly retrieved and linked. It is shown in equation 3.2, referencing elements from table 3.1. A short code sketch of both metrics is given after this list.

Recall = \frac{|A \cap B|}{|A|} \quad (3.2)
4. De-duplication. Although we intended to apply the shingling algorithm to filter highly similar blog posts prior to linking, thus decreasing the processing time required for the linking step, we gave preference to the exploration of topic linking.
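Expressed over sets of document IDs, both evaluation metrics reduce to a few lines. A minimal sketch, assuming retrieved and relevant are Python sets for a given topic:

    def precision_recall(retrieved, relevant):
        # equations 3.1 and 3.2: A = relevant set, B = retrieved set
        tp = len(retrieved & relevant)
        return tp / len(retrieved), tp / len(relevant)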
3.2 The Datasets
After familiarising ourselves with the characteristics of the original file formats of our
datasets, we decided to take advantage of their common XML container format. To
simplify our task, we implemented custom readers to extract the information we needed.
For this, we leveraged the functionality of BeautifulSoup42 to process XML and HTML
content.
We then experimented with two approaches. The first was based on generating individual feature files for each news article and blog post. This was a convenient format that allowed direct access to each file, but each additional feature added about 4,000 files for the blog posts and 14,000 files for the news articles. This number of small files introduced technical issues when trying to use the Iceberg HPC cluster or other computers for processing; network errors and I/O overhead were among the most common issues which throttled our process.
For the second approach, we opted for a file-based self-contained database, SQLite3.
This allowed us to keep just one large file per dataset with one record per document, and
a column per feature. Now, to add new features we just needed to create a new column
to contain it. This was a key move that enabled us to test additional features.
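As an illustration, a sketch of this layout (the table and column names here are hypothetical):

    import sqlite3

    conn = sqlite3.connect("blogs06.db")
    # one record per document, one column per feature
    conn.execute("CREATE TABLE IF NOT EXISTS docs"
                 " (docno TEXT PRIMARY KEY, text TEXT)")
    # testing a new feature only requires adding a column
    conn.execute("ALTER TABLE docs ADD COLUMN keywords TEXT")
    conn.commit()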
We now present the specific details about each dataset and how they were processed.
2 https://www.crummy.com/software/BeautifulSoup/
3 SQLite homepage URL: https://sqlite.org/
3.2.1 The New York Times Annotated Corpus
The original corpus spans 20 years of the historical archive of the New York Times. It includes almost everything they published from January 01, 1987 to June 19, 2007, and is presented in cleanly formatted XML files enriched with metadata following version 3.3 of the News Industry Text Format (NITF) specification4. [Sandhaus, 2008]
Our project focuses on the subset corresponding to their content generated in 2005.
Figure 3.1: Sample XML content from the NYT Annotated Corpus dataset.
For this specific corpus, a custom reader was developed to extract the parts relevant to our goals, the news articles' content. This was identified to be stored inside the tag <block class="full_text">.
Aside from the main article, other features were also extracted.
• The publication date, stored inside the tag <pubdata date.publication=”” />
• Its headlines, using the tag <hl1></hl1>
4 Version 3.3 of the specification is available at: https://www.iptc.org/std/NITF/3.3/documentation/
• Taxonomic keywords, added by the NYT editors inside the tags <classifier class=””
type=”descriptor”>
Although these features were planned to be used for additional comparisons, time re-
strictions hindered this. We describe a plan to do so in section 5.2.
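A condensed sketch of such a reader is shown below; it assumes BeautifulSoup with an XML parser and one NITF file per article, and the helper name is ours:

    from bs4 import BeautifulSoup

    def read_nyt_article(path):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "xml")
        block = soup.find("block", {"class": "full_text"})
        headline = soup.find("hl1")
        pubdata = soup.find("pubdata")
        return {
            "text": block.get_text(" ", strip=True) if block else "",
            "headline": headline.get_text(strip=True) if headline else "",
            "date": pubdata.get("date.publication", "") if pubdata else "",
        }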
3.2.2 TREC Blogs 06
Originally created by The University of Glasgow as a Test Collection for a TREC task,
its aim was to be a realistic snapshot of the blogosphere long enough for events to be
recognisable. It covers topics such as news, sports, politics, and health. [Macdonald and
Ounis, 2006]
Figure 3.2: Sample XML content from the TREC Blogs 06 dataset.
Its data format differs from that of the NYT Annotated Corpus in three fundamental
aspects:
Figure 3.3: Screenshot showing the original data split of the blogs corpus.
1. It uses a three-part data split to represent each blog, as shown in figure 3.3: one part stores the blog's static pages, the second the dynamic content of the blog site, and the third an RSS5 feed for each site.
2. The content language is diverse and not identified.
3. Its content is in raw, non-normalised format. By storing the full HTML files, they preserved the originally crawled sites with complete embedded scripts, style sheets, spam posts and comments. For our task this represents a copious amount of boilerplate to remove from each blog post.
Additionally, we identified the following:
• Each feed, homepage and permalink were assigned a unique identifier which was
kept during the whole period.
• Permalinks are identified by a DOCNO of the form: BLOG06-20051206-000-0000026837.
• Their documentation also states each permalink was retrieved more than once. We
filtered the duplicates using the DOCNO identifier.
As a result, loading and extracting only the relevant information from each blog post,
its textual content, required a more sophisticated reader than the one used for the NYT
Annotated Corpus.
5 RSS: Rich Site Summary
3.3 Creating the Gold Standard
Although we anticipated this would be a challenging task, the actual effort required to produce a representative dataset largely surpassed our expectations. This section presents two of the approaches we followed, selected for being the most heavily used versions of the process within our project. They enabled the collection of 181 news articles and 161 blog posts.
After an initial exploration of the news articles, and looking to simplify this process, we decided to focus our efforts on a list of broad topics: books, business, arts, education, health, sexual preferences, social theories, human rights, journalism, civism, sports, technology, terrorism, Valentine's Day, military, politics, the Middle East.
Initial approach.
1. First, we decided on the timeframe to analyse, the months of February and March.
2. from the NYT corpus, we copied the documents corresponding to this time period
to a new location.
3. We randomly choose one of those news articles as our candidate for matching and
manually analysed its content. We tried to identify target topics covered and looked
for keywords on the text portions or on its headers. We also extracted its date of
publication.
4. Subtracted a subset of the blogs corpus covering one week prior to the date we just
defined.
5. Using POSIX terminal tools we searched for blogs containing the identified key-
words or topic-related words commonly associated with the observed topics.
6. The matching blog posts were opened in a web browser to observe their content.
This was necessary to dismiss spam posts or cases where the match was found only
in the comments.
7. Finally, each reference to a blog post or news article was saved in a list of documents
per topic, using a text file for this.
This procedure was highly time-consuming and the number of collected samples was
minimal. We found that most of the tentative links were in the form of spam comments
or boilerplate content. For that reason, a number of improvements were introduced until
we established a better performing process.
Document ID   LDA 5 Topics   LDA 8 Topics        Gold standard file
1646589       Topic 3        Topic 5             {5:3,8:5}
1646894       Topic 3        Topic 5             1646589
1647470       Topic 1        Topic 5             1646894
1654653       Topic 1        Topic 5             {5:1,8:5}
                                                 1647470

Table 3.2: Example of the topic linking per document, as specified in our gold standard
Improved approach
1. Using jusText, we extracted the non-boilerplate text from the blog posts generated in the previously selected time period.
2. Then, we generated a new feature: using an implementation of the RAKE algorithm, we looked to identify the most relevant keywords for each blog post [Rose et al., 2010]. A sketch of this step is shown after this list.
3. From the news articles produced in the months of February and March, we chose one as a candidate. Its content was manually analysed in the previously described way.
4. Using POSIX terminal tools again, we searched for matches on the recently gener-
ated keywords.
5. The matches were individually analysed using a text reader.
6. Finally, the reference to each blog post and news article was stored in the gold
standard index file.
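For step 2, a sketch using rake-nltk, one freely available implementation of the RAKE algorithm (whether this exact package matches our implementation is an assumption; blog_text stands for the cleaned post):

    from rake_nltk import Rake

    rake = Rake()                                # uses NLTK English stop-words
    rake.extract_keywords_from_text(blog_text)   # blog_text: cleaned post text
    keywords = rake.get_ranked_phrases()[:10]    # top-ranked candidate phrases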
By doing this, we streamlined our process by effectively reducing the number of non-
relevant tentative links that needed evaluation. After several iterations, the number of
documents in the gold standard increased considerably.
Once we finished this process we confronted another problem: how to reconcile our identified topics with those inferred by LDA.
Given the unsupervised nature of LDA, we decided to settle for the simplest approach: observe the most relevant words in each of the LDA-identified topics and align them with the most closely resembling gold standard topics. In doing this we also needed to generate custom gold standards to more closely match the number of target topics in LDA. The reason for this is discussed in section 4.3.3.
The implemented solution was based on key-value pairs, where the keys relate to the number of target topics for LDA, while the values specify the corresponding topic ID for the documents. This allows a single document list to be used for the multiple scenarios needed. Table 3.2 shows an example of the topic linking per document and its representation in the file format of our solution: four documents are linked to three different topics depending on the LDA model used. The flexibility gained with this approach should be easily noticeable to the reader.
The completed gold standard is available in Appendix C.
Once we had a gold standard representative of the datasets, we were able to move forward in the process. In order to get the data ready for processing, it is fairly common among NLP projects to introduce a preprocessing step.
3.4 Text Preprocessing
Our initial approach was to perform the following set of tasks on our data sources using scikit-learn. However, we experienced difficulties trying to generate features like n-grams or k-skip-n-grams. For this reason, we resorted to Gensim; the rest of the project was implemented with this Python package unless otherwise specified.
1. Source file reading. Each dataset file was traversed and processed according to
its peculiarities. After a series of steps involving data reading, target element
identification, and text extraction, we ended up with a one-line representation of
each document.
2. Tokenising. Each document's content is separated into its minimal meaningful representational elements, known as tokens. For this, we used NLTK to identify the sentences and their tokens.
3. Stop-words. Terms which appear in the majority of the documents in a corpus lose relevancy compared to less frequent ones. It is best to remove them, keeping only the most relevant terms. We used NLTK's English stop-word list for this.
4. Lemmatizing. Removing inflectional word endings to obtain the base form of a word helps to achieve a degree of text normalisation. For this, we use the WordNet Lemmatizer.
5. Bag-of-Words. Each document's tokens are converted to their bag-of-words representation. This is the format required for further processing.
Before continuing, an additional step was performed which should be noted. The
lemmatized tokens for each document were saved to our database in order to use these in
subsequent processing tasks. The same saving was done for the n-grams and skip-grams.
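A condensed sketch of the pipeline up to this point, assuming documents holds the one-line texts produced by our readers (lower-casing is inferred from the example below):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from gensim import corpora

    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # sentence- then word-tokenise, drop stop-words, lemmatise
        tokens = [t.lower() for s in nltk.sent_tokenize(text)
                  for t in nltk.word_tokenize(s)]
        return [lemmatizer.lemmatize(t) for t in tokens
                if t.isalpha() and t not in stop_words]

    docs = [preprocess(d) for d in documents]
    dictionary = corpora.Dictionary(docs)           # token -> integer id
    bows = [dictionary.doc2bow(d) for d in docs]    # sparse (id, count) pairs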
Now, an example of the transformation undergone by each document in our corpus is shown here, using one of the blog posts. The original text:
[”This blog will chart the most popular British blogs, based on their stat
page figures. This means that those blogs who don’t have publicly accessible
stat pages don’t make the chart. Which means Samizdata, reputedly the
most popular British blog, won’t appear, and tons of others won’t either.
All I can say to them is, if you want to be in the Top Ten, put your stats
up, or make them publicly accessible. (Try Sitemeter, for example). If you
know of any British sites that have bigger figures than this, which I’ve missed,
let me know.Mail: britishblogs-at-yahoo.com The point of this site, I should
add, is not so much to create a fuss about viewing figures, because popularity
isn’t everything, but to provide a convenient and interesting place where new
UK blog readers might start from. Note: Although I couldn’t access The
Policeman’s Blog or Pootergeek’s Sitemeter figures, these figures were listed
at Truth Laid Bear. Speaking of Truth Laid Bear, why not use the ranking-
by-links method that Truth Laid Bear uses? Two reasons: (1) That system
seems to unfairly favour those who got into the blogosphere early. In fact,
many dead sites are still ranked highly on this method, because estalished
bloggers rarely update their blogrolls. (2) I can’t be bothered.”]
After applying all the preprocessing steps previously listed, the document becomes:
[blog, chart, popular, british, blog, base, stat, figure, blog, publicly, acces-
sible, stat, chart, samizdata, reputedly, popular, british, blog, win, appear,
ton, win, stats, publicly, accessible, sitemeter, example, british, site, bigger,
figure, mail, britishblogs, yahoo, site, add, create, fuss, view, figure, popu-
larity, provide, convenient, blog, reader, start, note, access, policeman, blog,
pootergeek, sitemeter, figure, figure, list, truth, laid, bear, speak, truth, laid,
bear, ranking, link, method, truth, laid, bear, reason, unfairly, favour, blogosphere, dead, site, rank, highly, method, estalished, blogger, rarely, update,
blogrolls, bother]
Finally, the bag-of-words representation gives the following sparse vector:
[(593, 1), (866, 1), (1083, 1), (1183, 1), (1533, 1), (2392, 1), (4024, 1),
(4284, 1), (5188, 1), (6209, 1), (6241, 1), (6887, 1), (7656, 1), (8526, 1),
(8637, 1), (8855, 1), (8905, 1), (9104, 1), (9199, 1), (10991, 3), (11524, 1),
(12183, 3), (12774, 6), (13211, 1), (14079, 3), (15356, 1), (16322, 3), (16434,
2), (16504, 1), (16726, 2), (16738, 2), (17137, 3), (17150, 1), (17201, 2),
(17348, 1), (17567, 1), (18157, 1), (18207, 1), (19032, 2), (19390, 1), (19497,
1), (19586, 5), (19707, 2), (19755, 1), (19939, 1), (20203, 1)]
Once our content is in this format we can continue with its processing.
3.5 Processing
Using the generated Bag-of-Words representations of our documents, we performed the
following evaluations.
3.5.1 Baseline - Cosine Distance
As previously stated the project baseline was created using the cosine distance between
documents. To do so, we set a threshold of 50% to identify a pair of documents as linked.
Then, if a blog post is linked to more than one news article, we sort the links based on
the computed score. That effectively creates a list of links between blogs post and news
article. Then, we save the results of these computations to a text file.
The process used to identify the topics using these linked pairs is detailed later in this section. We were also interested in exploring how other features performed for the task.
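A sketch of this baseline using gensim's similarity index, assuming news_bows, blog_bows and dictionary from the preprocessing step:

    from gensim import similarities

    index = similarities.SparseMatrixSimilarity(
        news_bows, num_features=len(dictionary))

    links = []
    for blog_id, bow in enumerate(blog_bows):
        for news_id, score in enumerate(index[bow]):
            if score >= 0.5:            # the 50% threshold described above
                links.append((blog_id, news_id, float(score)))
    # rank each blog post's candidate news articles by score
    links.sort(key=lambda t: (t[0], -t[2]))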
3.5.2 Word overlap
With the intuition that documents covering the same topic have at least a slight ratio of word overlap, we set a 10% threshold on the number of shared tokens for any pair of news article and blog post to be considered linked.
Following the same intuition but aiming to add lexical features to our comparison, we explored k-skip-n-grams. These are token constructions similar to common bigrams and trigrams, but they allow an extra skip distance between their elements. We experimented with values of k=[0,3] and n=[2,3], setting a threshold of 5% for every variant6.
To estimate the overlap ratio we used equation 3.3:

Overlap = \frac{2 \times |\text{common features}|}{|\text{news article features}| + |\text{blog post features}|} \quad (3.3)
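A sketch of both features follows; skip_ngrams enumerates the n-token subsequences spanning at most n + k consecutive positions, and overlap implements equation 3.3 (the exact enumeration used in our code may differ):

    from itertools import combinations

    def skip_ngrams(tokens, n, k):
        # n-grams whose elements span at most n + k consecutive positions
        grams = set()
        for i in range(len(tokens)):
            window = tokens[i:i + n + k]
            grams.update(g for g in combinations(window, n)
                         if g[0] == window[0])
        return grams

    def overlap(a, b):
        # equation 3.3: ratio of common features to total features
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 0.0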
From document similarity to topic links Given that the previously described features directly compare only a pair of documents at a time, and not their topics, we envisaged the following procedure in an effort to extrapolate this approach to the exploration of topic linking.
Using the saved results for each feature evaluated, we go over the list of linked documents, reading only the paired news articles corresponding to a single blog post. The pairs read are stored in a temporary location and ranked in descending order by their computed score. Then, the IDs of the highest-ranked pair are individually looked up in the gold standard. If both are found under the same topic, we consider them a topical match; using an auxiliary documents-per-topic index, the pair gets assigned to the topic ID. We then clear the temporary storage and repeat the process by fetching the next blog post with its corresponding news articles. Otherwise, if there is no match, we read the next news article ID from the temporary store and compare again in the same fashion. This procedure is repeated until there are no pairs left to compare in the results file. At this point, the mentioned auxiliary documents-per-topic index is compared against the gold standard to compute the precision and recall values.
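A sketch of this procedure, assuming links holds the saved (blog_id, news_id, score) triples and gold maps a document ID to its gold standard topic:

    from collections import defaultdict

    def assign_topics(links, gold):
        by_blog = defaultdict(list)
        for blog_id, news_id, score in links:
            by_blog[blog_id].append((score, news_id))

        topics = defaultdict(set)
        for blog_id, pairs in by_blog.items():
            # walk the ranked news articles until a topical match appears
            for _, news_id in sorted(pairs, reverse=True):
                t_blog, t_news = gold.get(blog_id), gold.get(news_id)
                if t_blog is not None and t_blog == t_news:
                    topics[t_blog].update({blog_id, news_id})
                    break
        return topics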
3.5.3 LDA
In the case of LDA, as the model directly infers the hidden topics in the news dataset, the approach was slightly different. First, the saved lemmatized representation of the news articles was loaded. This needed to be prepared in the format required by the algorithm: a custom dictionary and a tokens-per-document matrix were generated. These are used to train the LDA model according to the parameters shown earlier.
6 We also experimented with other threshold values; those presented here yielded the best results.
We experimented with trimmed versions of the generated dictionary. The intuition behind the trimmed version is to discard tokens present in more than 50% of the documents and words with fewer than 5 appearances, this way keeping only the most relevant tokens.
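A sketch of the training setup with gensim, assuming lemmatized_news holds the saved token lists; the trimming thresholds follow the description above and the hyperparameters follow section 3.1:

    from gensim import corpora
    from gensim.models import LdaModel

    dictionary = corpora.Dictionary(lemmatized_news)
    # trimmed version: drop very rare and very common tokens
    dictionary.filter_extremes(no_below=5, no_above=0.5)
    corpus = [dictionary.doc2bow(doc) for doc in lemmatized_news]

    num_topics = 25
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics,
                   alpha=1.0 / num_topics, eta=1.0 / num_topics,
                   iterations=10)   # trained with online variational Bayes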
After the LDA model was trained, it was used to get the tentative topics for each blog post and news article. To do so, each document was evaluated against the trained model. To identify the topics in a document, the LDA algorithm determines which of the per-topic token distributions optimise the probability of the document under the inferred posterior given the parameters. It then returns the topic index or indexes which maximised this probability, along with the estimated probability.
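Querying the trained model for a single document then amounts to the following (a sketch, reusing the preprocess function from section 3.4; document_text is any news article or blog post):

    bow = dictionary.doc2bow(preprocess(document_text))
    topics = lda.get_document_topics(bow)   # [(topic_id, probability), ...]
    best = max(topics, key=lambda t: t[1])  # index maximising the probability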
The results obtained were saved to a text file. Using this file we were able to compare
what our model identified against the gold standard dataset. Again, to measure the
performance of our models we use the precision and recall metrics.
3.6 Summary
In this chapter, we described our project approach, reviewed our main aim and detailed each step involved. We also described our datasets, the processing realised and how we planned to compare the results.
We now proceed to show the experiments performed and their results.
4 Experiments and Results
This chapter describes the experiments conducted, the observed results, and our findings and interpretations. In order to set the tone of this chapter, we introduce a description of what we expected to observe in our testing.
4.1 Experimental Strategy
How can we identify documents covering the same topic? We explored dif-
ferent linking strategies. Firstly, we designed a special procedure aimed at scaling the
approaches based on document similarity to identify topic links. Secondly, in the case of
LDA, its natural orientation towards identifying topics makes it a logical fit for this task.
What is an adequate method for topic identification? Judging by the number of
publications available, and the variety of research areas where it has been applied, LDA
is one of the most used models for this kind of task.
How to decide if a candidate link is relevant or not? From a quantitative per-
spective, we can define minimum thresholds in order to discriminate less relevant links.
From a qualitative perspective, our intuition was that a level of expertise in the area is
required to judge the validity of each link identified.
Is it possible to identify topical links if the documents of interest have a
largely different vocabulary or writing style? Our intuition was that there needs
to be a high similarity in the content of documents covering the same topic; otherwise,
automated methods are not useful for this. We experimented with different approaches
to explore this feasibility.
                                       Full Content   No Boilerplate
Number of blog posts in time period    4,201          2,124
Table 4.1: Number of blog posts to process, with and without filtering boilerplate
The rest of the chapter is devoted to presenting the experiments performed in trying to
answer these questions.
4.1.1 Metrics for this evaluation
As previously stated, we approached the evaluations using precision and recall.
Precision The proportion of retrieved documents that are actually relevant: what
percentage of the documents retrieved and linked to topic n coincide with the correspond-
ing list in the gold standard, as shown in equation 3.1, referencing elements from table
3.
Recall The proportion of relevant documents that were retrieved: what percentage of
the documents in topic n from the gold standard were correctly retrieved and linked, as
shown in equation 3.2, referencing elements from table 3.
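A minimal sketch of these two metrics for a single topic, assuming retrieved is the set of document IDs our system linked to topic n and relevant is the corresponding gold-standard set:

def precision_recall(retrieved, relevant):
    # True positives: documents both retrieved and listed in the gold standard.
    tp = len(retrieved & relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall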
4.2 Experiments
We now introduce the experiments conducted.
4.2.1 Removing boilerplate
By removing the boilerplate we effectively reduced the amount of non-relevant content
that was processed. We analyse the impact this filtering has on the performance of our
approaches. Using Tables 4.1 and 4.2 we present an initial quantitative comparison
of the impact this approach has on our linking efforts.
We also reproduce the content classification of a blog post with each line of its content
labelled as BP for boilerplate and TX for text.
BP—smaur (tangledwood) wrote,@ 2005-02-15 01:38:00
BP—Current mood:
BP—blown away
Blog post ID                       Full Content   No Boilerplate
BLOG06-20051206-048-0010769450     506            336
BLOG06-20051206-048-0010794908     153            22
Table 4.2: Number of tokens to process in two documents, with and without filtering
boilerplate
BP—Current music:
BP—Motion City Soundtrack - the Future Freaks Me Out
BP—who is john galt?
BP—Oh my god.
BP—Atlas Shrugged is just ...
BP—Read it. Now.
BP—(Post a new comment)
BP—ciarcimrene2005-02-15 06:55 am UTC(link)
BP—It really is a mind blowing book, isn’t it.
BP—I read it a year or two ago, in one night. Ayn Rand was quite a person.
BP—Haven’t heard from you in a while, hope you’re doing good and writing lots! :)
BP—(Reply to this) (Thread)
BP—tangledwood2005-02-15 12:13 pm UTC(link)
TX—It took me a week to read, mostly due to school and homework and stupid things
like that. But – wow, it was one heckuva book. I haven’t read any of her other books
(although I’m starting Anthem some time soon). But she’s a genius.
TX—Also: I haven’t been online a lot, due to copious amounts of schoolwork (semester
one just finished). And I need to get cracking on the writing thing.
BP—(Reply to this) (Parent) (Thread)
BP—ciarcimrene2005-02-15 01:49 pm UTC(link)
BP—Yes, you do need to! :P
BP—(Reply to this) (Parent)
BP—Log in now.
BP—(Create account, or use OpenID)
BP—forget your login?
BP—remember me
BP—Search: Category:
BP—Create an Account
BP—Update Your Journal
BP—Gift Shop
BP—Paid Accounts
BP—General Info
BP—Site News
BP—Paid Accounts
BP—Ask a Question
BP—Lost Password
BP—Site Map
BP—Browse Options
BP—Contact Info
BP—Terms of Service
BP—Privacy Policy
BP—Legal Information
BP—Site Map
BP—Browse Options
To compare the impact this filtering has on the textual content, we now present the result
of preprocessing the full textual content.
[’tangledwood’, ’john’, ’galt’, ’smaur’, ’tangledwood’, ’write’, ’2005’, ’current’, ’mood’,
’blow’, ’current’, ’music’, ’motion’, ’city’, ’soundtrack’, ’future’, ’freak’, ’john’, ’galt’,
’god’, ’atlas’, ’shrugged’, ’...’, ’wow’, ’read’, ’post’, ’comment’, ’ciarcimrene’, ’2005’, ’utc’,
’link’, ’mind’, ’blow’, ’book’, ’read’, ’night’, ’ayn’, ’rand’, ’person’, ’heard’, ’hope’, ’write’,
’lot’, ’reply’, ’thread’, ’tangledwood’, ’2005’, ’utc’, ’link’, ’week’, ’read’, ’school’, ’home-
work’, ’stupid’, ’wow’, ’heckuva’, ’book’, ’read’, ’book’, ’start’, ’anthem’, ’time’, ’genius’,
’online’, ’lot’, ’copious’, ’schoolwork’, ’semester’, ’finish’, ’crack’, ’write’, ’reply’, ’par-
ent’, ’thread’, ’ciarcimrene’, ’2005’, ’utc’, ’link’, ’reply’, ’parent’, ’log’, ’create’, ’account’,
’openid’, ’username’, ’password’, ’forget’, ’login’, ’remember’, ’search’, ’category’, ’site’,
’username’, ’site’, ’username’, ’mail’, ’region’, ’aol’, ’icq’, ’yahoo’, ’msn’, ’username’, ’jab-
ber’, ’navigate’, ’login’, ’create’, ’account’, ’update’, ’journal’, ’english’, ’espa’, ’deutsch’,
’’, ’search’, ’random’, ’region’, ’advanced’, ’school’, ’gift’, ’shop’, ’gift’, ’merchandise’,
’paid’, ’account’, ’add’, ’ons’, ’info’, ’press’, ’download’, ’site’, ’news’, ’paid’, ’account’,
’help’, ’question’, ’lost’, ’password’, ’faq’, ’site’, ’map’, ’browse’, ’option’, ’contact’, ’info’,
’term’, ’service’, ’privacy’, ’policy’, ’legal’, ’site’, ’map’, ’browse’, ’option’]
In comparison, the boilerplate-free content after preprocessing becomes:
Feature           Precision   Recall
Cosine distance   19%         19%
Word overlap      9%          10%
k-skip-n-grams    3%          5%
Table 4.3: Baseline and first features compared.
[’week’, ’read’, ’school’, ’homework’, ’stupid’, ’wow’, ’heckuva’, ’book’, ’read’, ’book’,
’start’, ’anthem’, ’time’, ’genius’, ’online’, ’lot’, ’copious’, ’schoolwork’, ’semester’, ’finish’,
’crack’, ’write’].
Upon initial observation, we argue that removing boilerplate has a large effect on the
data. Tables 4.4, 4.5, 4.6, and 4.7 compare the difference in performance that removing
boilerplate had on LDA.
4.3 Results
4.3.1 Cosine distance and word overlap
Our baseline was defined by the word overlap measure. We were surprised by
the low number of matches found by the k-skip-n-grams and higher n-grams, although it
seems logical given the differences in grammar and writing style observed between the
blog posts and the news articles. The obtained performance is shown in Table 4.3.
4.3.2 LDA model
For LDA we tested the following scenarios:
- Using only the values computed by the gensim implementation of LDA, taking the
highest-ranked topic as the correct one, regardless of its value.
- Setting a threshold of 30% to keep the number of documents with multiple links low.
- Setting a threshold of 15% as the minimum value to consider multiple links. Then,
using Kullback-Leibler divergence as a second metric, the links are evaluated again;
if the highest-scoring topic differs from the original, both are kept.
The last of these scenarios achieved the best performance and is the one shown in the
different tables comparing LDA; a sketch of it is given after the divergence is introduced
below.
As is known, Kullback-Leibler divergence computes the cost of encoding with a
distribution P while the real distribution is Q. In our case, that would be the cost of
using distribution P to generate a document instead of using distribution Q, which is the
"real" one.
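For reference, for discrete distributions the divergence from Q to P can be written as:

\[
D_{\mathrm{KL}}(Q \,\|\, P) = \sum_{x} Q(x) \log \frac{Q(x)}{P(x)}
\]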
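A sketch of the third scenario under these definitions; the data structures (per-document topic probabilities, per-topic word distributions and the document's normalised word distribution, all over a shared vocabulary) are our own illustration:

from scipy.stats import entropy  # entropy(p, q) computes D_KL(p || q)

def rescore_links(doc_topics, topic_word_dists, doc_word_dist, threshold=0.15):
    # Keep only topics whose inferred probability passes the 15% threshold.
    candidates = [(t, p) for t, p in doc_topics if p >= threshold]
    if not candidates:
        return set()
    # Highest-probability topic under the LDA posterior.
    best_by_prob = max(candidates, key=lambda c: c[1])[0]
    # Topic whose word distribution diverges least from the document's.
    best_by_kl = min(candidates,
                     key=lambda c: entropy(doc_word_dist, topic_word_dists[c[0]]))[0]
    # If the two metrics disagree, both links are kept.
    return {best_by_prob, best_by_kl}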
Figure 4.1 shows how the perplexity evolved during training. The observed evolution
could be explained as starting with less relevant, or more mixed, topics which then become
more specialised or meaningful through the iterations.
Figure 4.1: Perplexity during training
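As a rough sketch of how this curve can be produced with gensim, continuing the earlier example (perplexity is two to the negative per-word likelihood bound, so lower is better):

import numpy as np

bound = lda.log_perplexity(corpus)   # per-word likelihood bound on the corpus
perplexity = np.exp2(-bound)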
Appendix D shows how the most frequent words in a topic change over training
iterations.
4.3.3 Observations on the Gold Standard
Our version of the gold standard was targeted at 5 topics. As we progressed to other
numbers of target topics, the precision and recall values dropped, so we needed to create
different versions of the gold standard matching the number of target topics. This
represented more work while also increasing the chance of errors.
For this reason it would be wise to involve more people to improve the quality of the
gold standard.
4.3.4 Topic Coherence
A naive approach to analysing topic coherence was taken by simple visualisation
of the topics, using LDAvis [Carson Sievert, 2014]. An observed issue
was the seeming duplication of topics as we increased the number of target topics.
Figures 4.3 and 4.4 show this phenomenon. This is one of the common errors described by
[Boyd-Graber et al., 2014].
Another issue observed in some of the intermediate trainings was that explained by
[Hu et al., 2014] as "bad" topics which do not make sense from the users' perspective:
"These bad topics can confuse two or more themes into one topic; two different topics
can be (near) duplicates; and some topics make no sense at all."
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5
city 0.021 march 0.012 company 0.021 tax 0.015 000 0.039
percent 0.010 2005 0.011 executive 0.012 security 0.014 bedroom 0.019
york 0.009 family 0.010 business 0.009 budget 0.012 house 0.016
mayor 0.007 love 0.009 chief 0.007 social 0.011 building 0.016
plan 0.006 wife 0.009 news 0.006 bush 0.011 market 0.016
stadium 0.006 york 0.009 medium 0.005 plan 0.009 bathroom 0.015
project 0.005 service 0.008 network 0.005 cut 0.009 tax 0.014
bloomberg 0.005 beloved 0.007 president 0.005 program 0.008 list 0.013
price 0.005 friend 0.007 television 0.005 benefit 0.008 broker 0.013
official 0.004 father 0.006 time 0.004 billion 0.007 week 0.012
Topic 6 Topic 7 Topic 8 Topic 9 Topic 10
police 0.013 art 0.013 american 0.010 court 0.027 jackson 0.015
child 0.010 museum 0.012 war 0.010 judge 0.023 boy 0.011
hospital 0.008 artist 0.007 military 0.008 law 0.014 police 0.007
medical 0.008 book 0.007 army 0.007 lawyer 0.013 day 0.006
family 0.007 design 0.005 official 0.007 federal 0.010 family 0.006
officer 0.006 york 0.004 report 0.007 justice 0.008 time 0.005
people 0.006 collection 0.004 iraq 0.007 schiavo 0.007 sign 0.005
doctor 0.006 time 0.004 intelligence 0.006 trial 0.006 woman 0.005
health 0.005 gallery 0.003 soldier 0.005 prosecutor 0.006 mother 0.004
care 0.005 painting 0.003 united 0.005 legal 0.006 rhp 0.004
Topic 11 Topic 12 Topic 13 Topic 14 Topic 15
government 0.012 win 0.017 president 0.011 church 0.027 airline 0.019
official 0.010 tournament 0.013 bush 0.011 book 0.014 flight 0.013
american 0.010 play 0.011 republican 0.009 god 0.008 plane 0.009
iraq 0.010 race 0.010 political 0.007 religious 0.007 travel 0.009
iraqi 0.009 lead 0.009 party 0.007 life 0.007 passenger 0.008
country 0.008 team 0.009 democrat 0.007 christian 0.006 airport 0.007
election 0.007 time 0.007 house 0.005 oil 0.006 ticket 0.007
united 0.007 victory 0.006 issue 0.005 religion 0.005 air 0.007
minister 0.006 final 0.006 leader 0.005 people 0.005 fly 0.007
shiite 0.005 seed 0.006 democratic 0.005 write 0.004 seat 0.006
Table 4.4: Top 10 frequent words in the first 15 topics of the 25 topics LDA model
                  Full Text                    No Boilerplate
Topic  Gold Std   Ret   Rel   Prec%   Rec%     Ret   Rel   Prec%   Rec%
1 16 33 0 0 0 44 16 0 0
2 8 87 0 0 0 86 8 0 0
3 9 150 2 1.3 22.2 134 9 1.4 22.2
4 220 170 74 43.5 33.6 158 220 39.8 28.6
5 51 7 2 28.5 3.9 7 51 28.5 3.9
Table 4.5: Performance for the LDA model trained on 5 topics
                  Full Text                    No Boilerplate
Topic  Gold Std   Ret   Rel   Prec%   Rec%     Ret   Rel   Prec%   Rec%
1 -
2 18 171 4 2.3 22.2 157 3 1.9 16.6
3 35 49 1 2.0 2.8 43 1 2.3 2.8
4 168 42 8 19.0 4.7 49 14 28.5 8.3
5 -
6 70 32 9 28.1 12.8 37 9 24.3 12.8
7 29 166 3 1.8 10.3 156 2 1.2 6.8
8 19 39 6 15.3 31.5 46 8 17.3 42.1
Table 4.6: Performance for the LDA model trained on 8 topics
                  Full Text                    No Boilerplate
Topic  Gold Std   Ret   Rel   Prec%   Rec%     Ret   Rel   Prec%   Rec%
1 9 9 0 0 0 14 0 0 0
2 9 34 0 0 0 37 0 0 0
3 110 40 8 20.0 7.2 40 10 25 9.0
4 36 126 6 4.7 16.6 124 8 6.4 22.2
5 13 22 0 0 0 24 0 0 0
6 25 63 1 1.5 4.0 60 0 0 0
7 -
8 28 80 7 8.7 25.0 76 7 9.2 25
9 25 55 1 1.8 4 58 1 1.7 4
10 55 94 15 15 27.2 77 6 7.7 10.9
Table 4.7: Performance for the LDA model trained on 10 topics
          Full Text              No Boilerplate
Topics    Precision   Recall    Precision   Recall
5         17.4%       25.6%     15.6%       22.0%
8         6.2%        9.1%      6.5%        10.9%
10        7.2%        12.2%     6.2%        10.3%
Table 4.8: General precision and recall achieved by each LDA model.
Figure 4.2: Visual analysis of the inferred topics
4.3.5 Additional observations
When using the multithreaded implementation of the LDA model to attempt training
on the full news corpus, which activates batch training using Gibbs sampling, we
constantly ran into memory problems: there was an abnormally high RAM demand of
over 45 GB. Those issues forced us to abandon this type of training.
4.4 Summary
In this chapter, we presented our experimental design and the results accomplished.
Performance metrics and observations were also shown.
Figure 4.3: Visualization of duplicated topics.
We now proceed to present our conclusions and future work.
Figure 4.4: The number of duplicated topics seemed to increase as we targeted a higher
number of topics during training.
5 Conclusion and Future Work
In this final chapter of the document we present our conclusions to this report and
mention possible future work.
5.1 Conclusion
Our experiments show LDA is an algorithm well suited to this kind of task. Although
it is possible to use this method to link blog posts to news articles, the performance
achieved suggests it would be necessary to use additional features to improve on what we
achieved; otherwise, the results remain far from perfect.
With the generated gold standard we hope to contribute to the advancement of this
research area.
5.2 Future Work
Based on the exploration we have done, we propose to expand this work by using the
extra features extracted from the corpus, namely the news headlines, obtained from the
<hl1></hl1> tags, and the taxonomic keywords, added by the NYT editors and stored
inside the <classifier class=”” type=”descriptor”> tags.
It would also be relevant to explore using syntactic parses of the documents, which
should enable retrieving more relevant documents.
As we also noted in Chapter 4, the gold standard should be improved by adding multiple
topic links per document and by improving the categorisation of the links through the
involvement of multiple annotators.
Another relevant approach would be to evaluate links based on multiple features,
perhaps linking via a voting committee.
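As an illustration of the idea rather than work carried out here, a simple majority vote over the per-feature link proposals could look like:

from collections import Counter

def committee_vote(feature_links):
    # feature_links maps feature name -> news article ID proposed for a blog post.
    winner, votes = Counter(feature_links.values()).most_common(1)[0]
    # Accept the link only if a strict majority of the features agree on it.
    return winner if votes > len(feature_links) / 2 else None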
Bibliography
[Battisti et al., 2015] Francesca Battisti, Alfio Ferrara, and Silvia Salini. A decade of
research in statistics: a topic model approach. Scientometrics, 103(2):413–433, 2015.
[Blei, 2012] David M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84,
Apr 2012.
[Boyd-Graber et al., 2014] Jordan Boyd-Graber, David Mimno, and David Newman.
Care and Feeding of Topic Models: Problems, Diagnostics, and Improvements. CRC
Handbooks of Modern Statistical Methods. CRC Press, Boca Raton, Florida, 2014.
[Broder, 2000] Andrei Z. Broder. Identifying and filtering near-duplicate documents. In
Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, COM
’00, pages 1–10, London, UK, 2000. Springer-Verlag.
[Carson Sievert, 2014] Carson Sievert and Kenneth E. Shirley. LDAvis: A method for visual-
izing and interpreting topics. In Proceedings of the Workshop on Interactive Language
Learning, Visualization, and Interfaces, ILLVI 2014, pages 63–70, Baltimore, Mary-
land, USA, 2014.
[Endredy and Novak, 2013] Istvan Endredy and Attila Novak. More Effective Boiler-
plate Removal - the GoldMiner Algorithm. Polibits - Research journal on Computer
science and computer engineering with applications, (48):79–83, 2013.
[Fiscus and Doddington, 2002] Jonathan G. Fiscus and George R. Doddington. Topic
detection and tracking: Event-based information organization. chapter Topic Detection
and Tracking Evaluation Overview, pages 17–31. Springer US, Boston, MA, 2002.
[Han et al., 2013] Xiaohui Han, Jun Ma, Yun Wu, and Chaoran Cui. A novel machine
learning approach to rank web forum posts. Soft Computing, 18(5):941–959, 2013.
[Hoffman et al., 2010] Matthew D. Hoffman, David M. Blei, and Francis Bach. Online
learning for latent dirichlet allocation. In NIPS, 2010.
[Hu et al., 2014] Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith.
Interactive topic modeling. Machine Learning, 95:423–469, 2014.
[Macdonald and Ounis, 2006] Craig Macdonald and Iadh Ounis. The trec blog06 collec-
tion: Creating and analysing a blog test collection. Technical report, Department of
Computing Science, University of Glasgow, Scotland, United Kingdom, 2006.
[Madani et al., 2015] Amina Madani, Omar Boussaid, and Djamel Eddine Zegour. Real-
time trending topics detection and description from twitter content. Social Network
Analysis and Mining, 5(1):1–13, 2015.
[Manning et al., 2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich
Schutze. Introduction to Information Retrieval. Cambridge University Press, New
York, NY, USA, 2008.
[Newman, 2009] N. Newman. The Rise of Social Media and Its Impact on Mainstream
Journalism: A Study of how Newspapers and Broadcasters in the UK and US are
Responding to a Wave of Participatory Social Media, and a Historic Shift in Control
Towards Individual Consumers. Working paper, Reuters Institute for the Study of
Journalism, University of Oxford, 2009.
[Qi et al., 2015] Xiang Qi, Yu Huang, Ziyan Chen, Xiaoyan Liu, Jing Tian, Tinglei
Huang, and Hongqi Wang. Burst-lda: A new topic model for detecting bursty topics
from stream text. Journal of Electronics (China), 31(6):565–575, 2015.
[Rabin, 1981] M.O. Rabin. Fingerprinting by Random Polynomials. Technical report,
Center for Research in Computing Technology, Aiken Computation Laboratory,
Harvard University, 1981.
[Rehurek and Sojka, 2010] Radim Rehurek and Petr Sojka. Software framework for topic
modelling with large corpora. In Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
[Rose et al., 2010] Stuart Rose, Dave Engel, Nick Cramer, and Wendy Cowley. Auto-
matic keyword extraction from individual documents. In Michael W. Berry and Jacob
Kogan, editors, Text Mining. Applications and Theory, pages 1–20. John Wiley and
Sons, Ltd, 2010.
[Salton et al., 1975] G. Salton, A. Wong, and C. S. Yang. A vector space model for
automatic indexing. Commun. ACM, 18(11):613–620, November 1975.
[Sandhaus, 2008] E. Sandhaus. The new york times annotated corpus. Linguistic Data
Consortium, Philadelphia, 6(12), 2008.
[Singhal, 2001] Amit Singhal. Modern Information Retrieval: A Brief Overview. Bulletin
of the IEEE Computer Society Technical Committee on Data Engineering, 24(4):35–42,
2001.
[Stilo and Velardi, 2016] Giovanni Stilo and Paola Velardi. Efficient temporal mining of
micro-blog texts and its application to event discovery. Data Mining and Knowledge
Discovery, 30(2):372–402, 2016.
[Suphakit Niwattanakul and Wanapu, 2013] Suphakit Niwattanakul, Jatsada Singthongchai,
Ekkachai Naenudorn, and Supachanun Wanapu. Using of jaccard coefficient for keywords
similarity. In Proceedings of the International MultiConference of Engineers and
Computer Scientists, IMECS 2013, March 13–15, 2013.
[Tristan et al., 2015] Jean-Baptiste Tristan, Joseph Tassarotti, and Guy L. Steele Jr.
Efficient training of lda on a gpu by mean-for-mode estimation. In Francis R. Bach
and David M. Blei, editors, ICML, volume 37 of JMLR Proceedings, pages 59–68.
JMLR.org, 2015.
[van Rijsbergen, 1979] C.J. van Rijsbergen. Information Retrieval. Butterworths, London, 2nd edition, 1979.
[Vorontsov and Potapenko, 2014] Konstantin Vorontsov and Anna Potapenko. Additive
regularization of topic models. Machine Learning, 101(1):303–323, 2014.
[Wang et al., 2015] Chi Wang, Jialu Liu, Nihit Desai, Marina Danilevsky, and Jiawei
Han. Constructing topical hierarchies in heterogeneous information networks. Knowl-
edge and Information Systems, 44(3):529–558, 2015.
[Yan et al., 2013] Xiaohui Yan, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. A biterm
topic model for short texts. In Proceedings of the 22Nd International Conference on
World Wide Web, WWW ’13, pages 1445–1456, New York, NY, USA, 2013. ACM.
[Zhang and Zhang, 2012] Yong-Heng Zhang and Feng Zhang. Research on New Algorithm
of Topic-Oriented Crawler and Duplicated Web Pages Detection. In De-Shuang Huang,
Jianhua Ma, Kang-Hyun Jo, and M. Michael Gromiha, editors, Intelligent Computing
Theories and Applications: 8th International Conference, ICIC 2012, Huangshan, China,
July 25–29, 2012, Proceedings, pages 35–42. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2012.
Appendices
A Blogs 06 full XML sample
<DOC>
<DOCNO>BLOG06-20051206-005-0002264497</DOCNO>
<DATE_XML>2005-02-04T17:17:57+0000</DATE_XML>
<FEEDNO>BLOG06-feed-000428</FEEDNO>
<FEEDURL>http://changeforkentucky.com/index.rdf#</FEEDURL>
<BLOGHPNO>BLOG06-bloghp-000428</BLOGHPNO>
<BLOGHPURL>http://changeforkentucky.com/#</BLOGHPURL>
<PERMALINK>http://changeforkentucky.com/archives/000184.html#</PERMALINK>
<DOCHDR>
http://changeforkentucky.com/archives/000184.html# 0.0.0.0 20051220205822 8569
Date: Tue, 20 Dec 2005 20:58:21 GMT
Accept-Ranges: bytes
ETag: "1b12d4-2005-8688dc00"
Server: Apache/2.0.50 (FreeBSD) DAV/2 SVN/1.0.5
Content-Length: 8197
Content-Type: text/html; charset=ISO-8859-1
Last-Modified: Fri, 25 Feb 2005 16:36:00 GMT
Client-Date: Tue, 20 Dec 2005 20:58:22 GMT
Client-Response-Num: 1
Proxy-Connection: close
X-Cache: MISS from paya.dcs.gla.ac.uk
</DOCHDR>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Change for Kentucky: Change for Kentucky Day at the Capitol</title>
<link rel="stylesheet" href="http://changeforkentucky.com/styles-site.css"
type="text/css" />
<link rel="alternate" type="application/rss+xml" title="RSS"
href="http://changeforkentucky.com/index.rdf" />
<link rel="start" href="http://changeforkentucky.com/" title="Home" />
<link rel="prev" href="http://changeforkentucky.com/archives/000183.html"
title="From the Ground Up" />
<link rel="next" href="http://changeforkentucky.com/archives/000185.html"
title="Meet Up is Wednesday, March 2, 2005" />
<script type="text/javascript" language="javascript">
<!--
function OpenTrackback (c) {
window.open(c,
’trackback’,
’width=480,height=480,scrollbars=yes,status=yes’);
}
var HOST = ’changeforkentucky.com’;
// Copyright (c) 1996-1997 Athenia Associates.
// http://www.webreference.com/js/
// License is granted if and only if this entire
// copyright notice is included. By Tomer Shiran.
function setCookie (name, value, expires, path, domain, secure) {
var curCookie = name + "=" + escape(value) + ((expires) ? "; expires=" +
expires.toGMTString() : "") + ((path) ? "; path=" + path : "") + ((domain)
? "; domain=" + domain : "") + ((secure) ? "; secure" : "");
document.cookie = curCookie;
}
function getCookie (name) {
var prefix = name + ’=’;
var c = document.cookie;
var nullstring = ’’;
var cookieStartIndex = c.indexOf(prefix);
if (cookieStartIndex == -1)
return nullstring;
var cookieEndIndex = c.indexOf(";", cookieStartIndex + prefix.length);
if (cookieEndIndex == -1)
cookieEndIndex = c.length;
return unescape(c.substring(cookieStartIndex + prefix.length, cookieEndIndex));
}
function deleteCookie (name, path, domain) {
if (getCookie(name))
document.cookie = name + "=" + ((path) ? "; path=" + path : "") + ((domain) ?
"; domain=" + domain : "") + "; expires=Thu, 01-Jan-70 00:00:01 GMT";
}
function fixDate (date) {
var base = new Date(0);
var skew = base.getTime();
if (skew > 0)
date.setTime(date.getTime() - skew);
}
function rememberMe (f) {
var now = new Date();
fixDate(now);
now.setTime(now.getTime() + 365 * 24 * 60 * 60 * 1000);
setCookie(’mtcmtauth’, f.author.value, now, ’’, HOST, ’’);
setCookie(’mtcmtmail’, f.email.value, now, ’’, HOST, ’’);
setCookie(’mtcmthome’, f.url.value, now, ’’, HOST, ’’);
}
function forgetMe (f) {
deleteCookie(’mtcmtmail’, ’’, HOST);
deleteCookie(’mtcmthome’, ’’, HOST);
deleteCookie(’mtcmtauth’, ’’, HOST);
f.email.value = ’’;
f.author.value = ’’;
f.url.value = ’’;
}
//-->
</script>
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<Work rdf:about="http://changeforkentucky.com/archives/000184.html">
<dc:title>Change for Kentucky Day at the Capitol</dc:title>
<dc:description>Join fellow Change for Kentucky members from around the state
for a day at the state capital in Frankfort, Thursday, February 17, 2005.
The day will be designed to make us more effective advocates. We will learn
our way around,...</dc:description>
<dc:creator>Allison Webster</dc:creator>
<dc:date>2005-02-04T12:17:57-05:00</dc:date>
<license rdf:resource="http://creativecommons.org/licenses/by-sa/1.0/" />
</Work>
<License rdf:about="http://creativecommons.org/licenses/by-sa/1.0/">
<requires rdf:resource="http://web.resource.org/cc/Attribution" />
<requires rdf:resource="http://web.resource.org/cc/Notice" />
<requires rdf:resource="http://web.resource.org/cc/ShareAlike" />
<permits rdf:resource="http://web.resource.org/cc/Reproduction" />
<permits rdf:resource="http://web.resource.org/cc/Distribution" />
<permits rdf:resource="http://web.resource.org/cc/DerivativeWorks" />
</License>
</rdf:RDF>
-->
</head>
<body>
<div id="banner">
<h1><a href="http://changeforkentucky.com/" accesskey="1">Change for
Kentucky</a></h1>
<span class="description">Change for Kentucky</span>
</div>
<div id="container">
<div class="blog">
<div id="menu">
<a href="http://changeforkentucky.com/archives/000183.html">« From the
Ground Up</a> |
<a href="http://changeforkentucky.com/">Main</a>
| <a href="http://changeforkentucky.com/archives/000185.html">Meet Up is
Wednesday, March 2, 2005 »</a>
</div>
</div>
<div class="blog">
<h2 class="date">February 04, 2005</h2>
<div class="blogbody">
<h3 class="title">Change for Kentucky Day at the Capitol</h3>
<p>Join fellow Change for Kentucky members from around the state for a day<br />
at the state capital in Frankfort, Thursday, February 17, 2005. </p>
<p><img src="http://changeforkentucky.com/archives/frankcap.gif" width="275"
height="205" border="0" /></p>
<p>The day will be designed to make us more effective advocates. We will<br />
learn our way around, observe committee meetings scheduled for the day,<br />
meet with legislators and watch the Senate and House in session from<br />
the chamber galleries. </p>
<p>Please be thinking about whether you might be able to join us for at<br />
least part of the day. It will be a good opportunity for you to meet<br />
other Change for Kentucky supporters from across the state.</p>
<p>Contact <a href="mailto:[email protected]">Libby Marshall</a> or <a
href="mailto:[email protected]">Lacey McNary</a> from CFK/Frankfort for
more information, of if you would like to volunteer to be a host or
guide.</p>
<p><br />
</p>
<a name="more"></a>
<span class="posted">Posted by Allison Webster at February 4, 2005 12:17 PM
<br /></span>
</div>
<div class="comments-head"><a name="comments"></a>Comments</div>
<div class="comments-head">Post a comment</div>
<div class="comments-body">
<form method="post" action="http://changeforkentucky.com/mt/mt-comments.cgi"
name="comments_form" onsubmit="if (this.bakecookie[0].checked)
rememberMe(this)">
<input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="184" />
<div style="width:180px; padding-right:15px; margin-right:15px; float:left;
text-align:left; border-right:1px dotted #bbb;">
<label for="author">Name:</label><br />
<input tabindex="1" id="author" name="author" /><br /><br />
<label for="email">Email Address:</label><br />
<input tabindex="2" id="email" name="email" /><br /><br />
<label for="url">URL:</label><br />
<input tabindex="3" id="url" name="url" /><br /><br />
</div>
<!-- Security Code Check -->
<label for="code">Security Code:</label><br />
<input type="hidden" id="code" name="code" value="42" />
<img border="0" src="http://changeforkentucky.com/mt/mt-scode.cgi?code=42
"><br />
<input tabindex=3 id="scode" name="scode" /><br /><br />
<!-- end of Security Code Check -->
Remember personal info?<br />
<input type="radio" id="bakecookie" name="bakecookie" /><label
for="bakecookie">Yes</label><input type="radio" id="forget"
name="bakecookie" onclick="forgetMe(this.form)" value="Forget Info"
style="margin-left: 15px;" /><label for="forget">No</label><br
style="clear: both;" />
<label for="text">Comments:</label><br />
<textarea tabindex="4" id="text" name="text" rows="10" cols="50"></textarea><br
/><br />
<input type="submit" name="preview" value=" Preview " />
<input style="font-weight: bold;" type="submit" name="post"
value=" Post " /><br /><br />
</form>
<script type="text/javascript" language="javascript">
<!--
document.comments_form.email.value = getCookie("mtcmtmail");
document.comments_form.author.value = getCookie("mtcmtauth");
document.comments_form.url.value = getCookie("mtcmthome");
if (getCookie("mtcmtauth")) {
document.comments_form.bakecookie[0].checked = true;
} else {
document.comments_form.bakecookie[1].checked = true;
}
//-->
</script>
</div>
</div>
</div>
</body>
</html>
</DOC>
B NYT 06 Annotated Corpus full XML sample
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE nitf SYSTEM
"http://www.nitf.org/IPTC/NITF/3.3/specification/dtd/nitf-3-3.dtd">
<nitf change.date="June 10, 2005" change.time="19:30" version="-//IPTC//DTD
NITF 3.3//EN">
<head>
<title>National Briefing | Washington: Energy Secretary Confirmed</title>
<meta content="MB012055" name="slug"/>
<meta content="1" name="publication_day_of_month"/>
<meta content="2" name="publication_month"/>
<meta content="2005" name="publication_year"/>
<meta content="Tuesday" name="publication_day_of_week"/>
<meta content="National Desk" name="dsk"/>
<meta content="14" name="print_page_number"/>
<meta content="A" name="print_section"/>
<meta content="3" name="print_column"/>
<meta content="U.S." name="online_sections"/>
<docdata>
<doc-id id-string="1646630"/>
<doc.copyright holder="The New York Times" year="2005"/>
<identified-content>
<org class="indexing_service">Energy Department</org>
<person class="indexing_service">Hulse, Carl</person>
<person class="indexing_service">Bodman, Samuel W</person>
<classifier class="online_producer"
type="taxonomic_classifier">Top/News/U.S.</classifier>
</identified-content>
</docdata>
<pubdata date.publication="20050201T000000"
ex-ref="http://query.nytimes.com/gst/fullpage.html?res=9A03E6D9123BF932A35751C0A9639C8B63"
item-length="168" name="The New York Times" unit-of-measure="word"/>
</head>
<body>
<body.head>
<hedline>
<hl1>National Briefing | Washington: Energy Secretary Confirmed</hl1>
</hedline>
<byline class="print_byline">By Carl Hulse (NYT)</byline>
<byline class="normalized_byline">Hulse, Carl</byline>
<abstract>
<p>Senate confirms new Energy Secretary Samuel W Bodman by voice vote; photo
(S)</p>
</abstract>
</body.head>
<body.content>
<block class="lead_paragraph">
<p>The Senate confirmed the new energy secretary, Samuel W. Bodman, left, by
voice vote, completing a smooth trip through the Senate by a man who
previously held senior positions in the Commerce and Treasury Departments.
A one-time chemical engineering professor, Mr. Bodman replaces Spencer
Abraham as the head of the agency responsible for a energy issues and
research. He is expected to play a role as the House and Senate try once
again to come to terms on wide-ranging energy legislation and decide
whether to allow drilling in the Arctic National Wildlife Refuge.</p>
<p>Carl Hulse (NYT)</p>
</block>
<block class="full_text">
<p>The Senate confirmed the new energy secretary, Samuel W. Bodman, left, by
voice vote, completing a smooth trip through the Senate by a man who
previously held senior positions in the Commerce and Treasury Departments.
A one-time chemical engineering professor, Mr. Bodman replaces Spencer
Abraham as the head of the agency responsible for a energy issues and
research. He is expected to play a role as the House and Senate try once
again to come to terms on wide-ranging energy legislation and decide
whether to allow drilling in the Arctic National Wildlife Refuge.</p>
<p>Carl Hulse (NYT)</p>
</block>
</body.content>
</body>
</nitf>
C Gold Standard file
;# books
# {8:6,10:4}
1647594
1647608
1652294
1654651
1654657
1654660
1654664
1660017
1660654
BLOG06-20051207-015-0004669977 | 20050219T140600
BLOG06-20051207-027-0003414242 | 20050303T201400
BLOG06-20051206-053-0022922844 | 20050325T003546
BLOG06-20051206-053-0022554608 | 20050331T213525
; business
# {5:2,8:1,10:0}
1647482
1652396
1656251
1656311
1658134
1661166
BLOG06-20051207-000-0018149515 | 20050209T144119
BLOG06-20051206-005-0012663571 | 20050211T042100
BLOG06-20051207-033-0011353957 | 20050321T120100
; arts/plays, movies, music
# {8:1,10:1}
; concerts | gigs
BLOG06-20051206-012-0011643299 | 20050323T162200
;# movies
1646773
1649017
1649475
1652996
1653775
1655061
BLOG06-20051206-017-0012074606 | 20050306T224722
;# music
1660506
; education 5
# {5:4,8:5,10:8}
1647660
1651293
BLOG06-20051207-033-0016323399 | 20050207T082032
BLOG06-20051207-033-0016307364 | 20050207T082040
;# health
# {5:4,8:5,10:7}
1646954
1652377
1653182
1658953
1660654
1660672
BLOG06-20051206-022-0010119325 | 20050209T012300
BLOG06-20051207-041-0014474902 | 20050209T090454
BLOG06-20051206-022-0010088187 | 20050210T021800
BLOG06-20051206-019-0007672170 | 20050210T131400
BLOG06-20051206-019-0007656300 | 20050210T135700
BLOG06-20051207-099-0004345988 | 20050211T140051
;# health | abortion
1646854
1647555
1647660
1647942
1647944
1647946
1648099
1648281
1649117
1649475
1650385
1650596
1654325
1654971
BLOG06-20051207-041-0014368762 | 20050209T152507
BLOG06-20051207-062-0017788843 | 20050311T102600
; sexual preferences
;# marriage/legislation/rights
# {5:3,8:5,10:5}
1646589
1646894
1647017
1647318
1647415
1647594
1647608
1647969
1648292
1648402
BLOG06-20051206-008-0019851299 | 20050202T155600
BLOG06-20051207-056-0016211866 | 20050204T174106
BLOG06-20051207-056-0015976383 | 20050207T135042
BLOG06-20051207-056-0015924528 | 20050207T185742
BLOG06-20051207-056-0015685587 | 20050208T141601
BLOG06-20051207-056-0015613582 | 20050208T144553
BLOG06-20051207-041-0014325770 | 20050210T191520
; social theories, religion
# {5:1,8:5,10:5}
1647470
1654653
1658064
1658327
BLOG06-20051207-000-0018070715 | 20050225T135600
BLOG06-20051207-000-0018043942 | 20050301T151800
;# religion
BLOG06-20051206-003-0021250584 | 20050326T145800
BLOG06-20051206-003-0021215450 | 20050327T125300
; human rights
# {5:3,8:3,10:8}
1658961
1659080
1660940
1660972
BLOG06-20051207-056-0016282311 | 20050204T145144
; journalism
1648754
1649276
1649918
1650628
BLOG06-20051207-037-0018070824 | 20050201T164829
BLOG06-20051207-043-0003056710 | 20050208T142348
BLOG06-20051207-093-0014990657 | 20050208T211923
BLOG06-20051207-030-0004949018 | 20050215T193400
BLOG06-20051207-037-0017099181 | 20050216T044100
;# law | civics | city | court, judge, major | human rights | same sex marriage
# {5:3,8:3,10:8}
1649714
1652379
1659115
1660723
BLOG06-20051207-025-0018688681 | 20050306T205200
BLOG06-20051206-014-0006802692 | 20050317T023921
BLOG06-20051206-014-0006708448 | 20050323T163334
; military and politics go here
; sports - baseball
# {5:0,8:6,10:6}
1650417
1652433
1658187
BLOG06-20051207-067-0014499680 | 20050328T155000
BLOG06-20051207-067-0014432379 | 20050331T183300
; sports - super bowl
1647036
1648054
1648065
1648072
1648133
1648171
1648179
1648182
1658363
BLOG06-20051207-056-0016187067 | 20050204T183914
BLOG06-20051207-056-0016081103 | 20050206T193146
;# technology | mobile phones | blogs
# {8:5}
1658775
1661144
BLOG06-20051207-090-0016847958 | 20050316T094649
BLOG06-20051207-090-0016817643 | 20050316T095358
; blogs
1648060
1648733
1649545
1652417
1656964
BLOG06-20051207-100-0025856345 | 20050209T174600
BLOG06-20051207-010-0001842471 | 20050227T110200
BLOG06-20051207-001-0010918632 | 20050301T050900
BLOG06-20051206-003-0021128252 | 20050330T025100
; terrorism
;# politics | terrorism
# {5:3,8:2,10:9}
; torture
1656303
1656306
; terror
1660940
1661155
1661163
BLOG06-20051207-056-0015738134 | 20050208T072956
; terrorism, academic freedom
1647047
1648116
1649108
BLOG06-20051207-060-0014521124 | 20050215T205800
;# politics | terrorism | 9-11
1646478
1646846
1646997
BLOG06-20051207-056-0016624520 | 20050203T090913
BLOG06-20051207-043-0002928883 | 20050210T195900
;# tsunami - UN
1647460
1650434
1652579
BLOG06-20051207-041-0014632987 | 20050205T160909
; v-day 17
# {5:4,8:7,10:3}
1649546
1649585
1649599
1649602
1649960
BLOG06-20051207-095-0004262868 | 20050207T122639
BLOG06-20051207-027-0011471287 | 20050208T225724
BLOG06-20051207-091-0028667535 | 20050211T155337
BLOG06-20051207-095-0011162242 | 20050213T024426
BLOG06-20051207-061-0009882858 | 20050215T000100
BLOG06-20051207-061-0009860129 | 20050215T023000
BLOG06-20051207-100-0003458648 | 20050215T034800
BLOG06-20051207-099-0032891589 | 20050215T062844
BLOG06-20051207-013-0002722984 | 20050215T155600
; work - check what topic gets assigned
1655377
1655432
1661228
BLOG06-20051207-041-0014516901 | 20050208T145144
BLOG06-20051207-102-0013289154 | 20050206T173727
; military 8
# {5:3,8:3,10:3}
1647074
1647107
1647374
1647889
1649226
1650339
1650826
BLOG06-20051207-056-0016578672 | 20050203T112616
BLOG06-20051207-000-0018200248 | 20050203T131249
BLOG06-20051207-056-0015771538 | 20050207T223644
BLOG06-20051207-060-0014589107 | 20050210T174114
BLOG06-20051207-060-0014575760 | 20050211T164100
BLOG06-20051206-022-0018529402 | 20050211T222400
BLOG06-20051207-000-0018123655 | 20050212T113100
BLOG06-20051207-033-0012313484 | 20050214T000913
BLOG06-20051206-022-0018492433 | 20050307T152200
BLOG06-20051206-056-0021994743 | 20050316T055441
;# military | iraq war
# {5:3,8:3,10:9}
1658201
1658202
BLOG06-20051207-043-0003351020 | 20050202T082933
BLOG06-20051207-056-0016843191 | 20050202T100403
BLOG06-20051207-056-0016812432 | 20050202T101502
BLOG06-20051207-056-0016460322 | 20050203T222103
BLOG06-20051207-043-0003098865 | 20050207T161258
BLOG06-20051207-056-0015716097 | 20050208T085557
BLOG06-20051207-041-0014433731 | 20050209T134551
BLOG06-20051207-043-0002971527 | 20050209T150007
BLOG06-20051207-043-0002844624 | 20050211T102500
BLOG06-20051207-060-0014549178 | 20050213T093700
;# military torture
1648322
1649394
1650883
1651927
1652830
1658077
BLOG06-20051207-023-0011378054 | 20050213T223800
BLOG06-20051207-056-0020623436 | 20050215T211300
;------------------
; politics
# {5:3,8:3,10:2}
;# pollitics | budget |
1653919
1658088
BLOG06-20051207-023-0011215456 | 20050228T195000
BLOG06-20051207-023-0010994236 | 20050319T120400
BLOG06-20051207-023-0010973646 | 20050319T120600
;# bush, state of the union
1647017
1647318
;--------------
1647471
1647558
1653861
1656274
1656282
1658176
1661152
1661174
1661182
1661222
1661239
BLOG06-20051206-008-0019893981 | 20050201T165600
BLOG06-20051206-008-0019872653 | 20050201T171200
BLOG06-20051207-036-0000970906 | 20050201T172234
BLOG06-20051207-049-0037430772 | 20050201T212306
BLOG06-20051206-008-0019829436 | 20050202T155900
BLOG06-20051207-056-0016673321 | 20050202T222544
BLOG06-20051206-008-0019765331 | 20050203T153000
BLOG06-20051207-056-0016548794 | 20050203T153628
BLOG06-20051207-056-0016337498 | 20050204T072922
BLOG06-20051207-000-0018175453 | 20050204T134200
BLOG06-20051207-036-0000871271 | 20050204T142814
BLOG06-20051207-056-0016258997 | 20050204T154017
BLOG06-20051206-005-0002273537 | 20050204T170331
BLOG06-20051207-043-0003267613 | 20050204T202015
BLOG06-20051206-008-0019721798 | 20050205T121900
BLOG06-20051207-043-0003182847 | 20050207T110933
BLOG06-20051207-056-0016059286 | 20050207T112650
BLOG06-20051207-056-0016030245 | 20050207T113652
BLOG06-20051207-056-0015800767 | 20050207T203945
BLOG06-20051207-060-0014642721 | 20050208T161421
BLOG06-20051207-060-0014629128 | 20050208T163118
BLOG06-20051207-060-0014615876 | 20050208T165852
BLOG06-20051207-006-0019619373 | 20050208T222338
BLOG06-20051207-056-0010698413 | 20050209T115825
BLOG06-20051207-041-0014454358 | 20050209T133514
BLOG06-20051207-041-0014413462 | 20050209T135114
BLOG06-20051206-008-0019592002 | 20050210T112300
BLOG06-20051207-037-0017857431 | 20050211T083600
BLOG06-20051206-008-0019548498 | 20050211T134600
BLOG06-20051207-037-0017787605 | 20050212T073800
BLOG06-20051207-060-0014562172 | 20050212T124900
BLOG06-20051206-008-0019526540 | 20050212T130200
BLOG06-20051206-003-0021156589 | 20050330T020800
BLOG06-20051207-023-0011399192 | 20050213T223100
;# pollitics | scandals
BLOG06-20051207-030-0004949018 | 20050215T193400
;# politics | scandals | gomery
BLOG06-20051206-008-0019634846 | 20050203T150200
BLOG06-20051206-008-0019700507 | 20050206T141100
BLOG06-20051206-008-0019808106 | 20050209T111200
;# pollitics | Bush
;# {5:3,8:2,10:2}
1647434
1647456
1647458
1647476
1647478
1647484
1647500
1647506
1647507
1647511
1647542
1647556
1653834
1653862
1653871
1656244
1656258
1656259
1656277
1658066
1658114
1658119
1658122
1658140
1658155
1661153
1661176
1661177
1661210
1661211
1661262
BLOG06-20051207-056-0016788974 | 20050202T112449
BLOG06-20051207-043-0003140787 | 20050207T134831
BLOG06-20051207-041-0014538284 | 20050207T151421
BLOG06-20051207-056-0015865902 | 20050207T190733
BLOG06-20051207-043-0003014765 | 20050209T140651
BLOG06-20051207-043-0002886450 | 20050210T213000
BLOG06-20051207-043-0002802327 | 20050211T104800
BLOG06-20051207-041-0014280422 | 20050215T082038
BLOG06-20051207-023-0011236141 | 20050228T194700
BLOG06-20051207-010-0002082101 | 20050225T154400
BLOG06-20051207-023-0010911305 | 20050328T200000
;# politics | europe | uk | Condolesa Rice
; # {5:3,8:2,10:2}
1647434
1658130
;# politics | europe
1658076
BLOG06-20051207-033-0011010927 | 20050329T101500
BLOG06-20051207-033-0011859456 | 20050310T092200
BLOG06-20051207-033-0012150165 | 20050221T080600
BLOG06-20051207-033-0011916972 | 20050309T125700
;# politics | world | europe | royals
BLOG06-20051206-008-0019613577 | 20050210T111700
BLOG06-20051206-008-0019570424 | 20050211T110000
;# politics | world | europe | russia
1647413
1653831
1661232
;# middle east
# {5:3,8:2,10:9}
1647458
BLOG06-20051207-023-0011035213 | 20050316T200100
BLOG06-20051207-056-0020548713 | 20050220T212800
BLOG06-20051207-055-0002997060 | 20050303T161616
;# middle east | Iraq
# {5:3,8:2,10:9}
1658201
1658202
BLOG06-20051207-043-0003351020 | 20050202T082933
BLOG06-20051207-056-0016843191 | 20050202T100403
BLOG06-20051207-056-0016812432 | 20050202T101502
BLOG06-20051207-056-0016460322 | 20050203T222103
BLOG06-20051207-043-0003098865 | 20050207T161258
BLOG06-20051207-056-0015716097 | 20050208T085557
BLOG06-20051207-041-0014433731 | 20050209T134551
BLOG06-20051207-043-0002971527 | 20050209T150007
BLOG06-20051207-043-0002844624 | 20050211T102500
BLOG06-20051207-060-0014549178 | 20050213T093700
D Topic Evolution During Training
T1 T2 T3 T4 T5
0.004 time 0.004 time 0.004 time 0.004 time 0.005 time
0.004 day 0.004 street 0.004 music 0.004 street 0.004 street
0.004 music 0.004 music 0.004 street 0.004 music 0.004 music
0.003 street 0.003 play 0.004 play 0.004 play 0.004 art
0.003 play 0.003 day 0.004 art 0.004 art 0.004 play
0.003 people 0.003 art 0.003 day 0.003 day 0.003 film
0.003 art 0.003 people 0.003 people 0.003 film 0.003 day
0.003 york 0.003 york 0.003 film 0.003 people 0.003 people
0.003 book 0.003 film 0.003 york 0.003 theater 0.003 theater
0.003 theater 0.003 theater 0.003 life 0.003 life 0.003 life
T6 T7 T8 T9 T10
0.005 time 0.005 time 0.005 time 0.005 time 0.005 time
0.004 street 0.004 street 0.004 street 0.004 art 0.004 art
0.004 music 0.004 art 0.004 art 0.004 street 0.004 street
0.004 art 0.004 music 0.004 music 0.004 music 0.004 music
0.004 play 0.004 play 0.004 play 0.004 play 0.004 play
0.003 film 0.003 film 0.004 film 0.004 film 0.004 film
0.003 day 0.003 theater 0.003 theater 0.003 theater 0.003 theater
0.003 theater 0.003 day 0.003 life 0.003 life 0.003 life
0.003 life 0.003 life 0.003 day 0.003 day 0.003 book
0.003 people 0.003 people 0.003 people 0.003 people 0.003 people
T11 T12 T13 T14 T15
0.005 time 0.005 time 0.005 time 0.005 time 0.005 time
0.004 art 0.005 art 0.004 art 0.005 art 0.005 art
0.004 music 0.004 music 0.004 music 0.005 music 0.004 music
0.004 play 0.004 street 0.004 play 0.004 film 0.004 film
0.004 street 0.004 play 0.004 film 0.004 play 0.004 play
0.004 film 0.004 film 0.004 street 0.004 street 0.004 street
0.003 theater 0.003 theater 0.003 theater 0.003 theater 0.003 theater
0.003 life 0.003 life 0.003 life 0.003 life 0.003 life
0.003 book 0.003 book 0.003 book 0.003 book 0.003 book
0.003 people 0.003 people 0.003 people 0.003 people 0.003 include
Table 1: Topic evolution through training iterations