Making a topic dictionary for semi-supervised classification of the UN speeches

Kohei Watanabe (Waseda University)

Yuan Zhou (Kobe University)

Paper presented at Quantitative Text Analysis Dublin Workshop, 2019

Abstract

There is growing interest in quantitative analysis of large corpora among International Relations (IR) scholars, but many find it difficult to use machine learning models to perform analysis that is consistent with existing theoretical frameworks and thereby further develop the field. To solve this problem, we developed a set of techniques that utilize a dictionary-based semi-supervised model, allowing researchers to classify documents into pre-defined categories efficiently. We propose a dictionary-making procedure that uses a new entropy-based diagnostic tool to avoid including words that are likely to confuse the model and deteriorate classification accuracy. In our experiments, we classify sentences of the UN General Assembly speeches into six pre-defined categories using two semi-supervised models (seeded LDA and Newsmap), trained with a small “seed dictionary” that we create following the procedure. The results show that, while simple dictionary classification leaves 75% of sentences unclassified, Newsmap can classify over 60% of them accurately; its accuracy exceeds 70% when contextual information is taken into consideration by kernel smoothing of topic probabilities. We argue that, once a topic dictionary is created by the IR community, semi-supervised models can be an alternative to unsupervised topic classification models.

Discourse analysis has been widely used in studies of International Relations (IR) since the ‘Third Great Debate’ (Banta, 2013; Holzscheiter, 2014; Lundborg & Vaughan-Williams, 2015; Milliken, 1999), in which constructivist scholars challenged the dominant status of realism and liberalism. Aiming to understand the relationship between discourses and international politics, researchers have analyzed an extensive range of issues such as terrorism (Jackson, 2005), the Bosnian War (Hansen, 2006), and Japanese nationalism (Suzuki, 2015) in official speeches, government statements, newspaper articles and academic works.


While discourse analysis is conducted qualitatively in those studies, an increasing number of IR scholars are embarking on quantitative analysis of large textual data, such as debates at the United Nations General Assembly (UNGA), employing natural language processing techniques (Baturo, Dasandi, & Mikhaylov, 2017; Gurciullo & Mikhaylov, 2017; Schoenfeld, Eckhard, Patz, & van Meegdenburg, 2018). Speeches at the UNGA are a very useful source of information for IR scholars, because the UNGA is the forum in which representatives express their countries’ foreign policy on various issues such as development, terrorism and human rights. The speeches are approximately 15 minutes long and are transcribed and translated into English. Their content enables IR scholars to infer and compare the foreign policies of the UN member states over the years. Baturo et al. (2017) collected all the speech transcripts from 1970 to 2017 and made them available as the UN General Debate corpus.

The corpus was analyzed to measure similarity of foreign policies between states by Baturo et al. (2017) and Gurciullo and Mikhaylov (2017) using quantitative text analysis techniques, but their classification or measurement schemes lacked consistency with earlier content analysis that was conducted manually (e.g. Brunn, 1999), due to the limitations of the machine learning models: Latent Dirichlet Allocation (LDA) is an unsupervised model widely used in legislative or electoral studies to classify documents into topics, but it is usually impossible for researchers to match the topics identified by an unsupervised LDA with those defined by human coders; supervised models such as naïve Bayes, Random Forest and support-vector machines (SVMs) allow researchers to define topics, but a large training set is often too expensive for resource-strapped researchers to create.

We aim to solve these problems by developing a set of techniques to identify pre-defined topics of documents using a semi-supervised machine learning model. In the following sections, we will point out the problems of existing methods for topic classification, explain our semi-supervised document classification techniques, and demonstrate their effectiveness in a series of experiments. We will employ two semi-supervised models, Newsmap (Watanabe, 2018b) and semi-supervised LDA (Lu, Ott, Cardie, & Tsou, 2011), both of which take seed words as weak supervision. The former is a variant of the naïve Bayes classifier trained on documents that contain seed words, while the latter is an LDA model fitted with word priors weighted by seed words. The central piece of this paper is therefore the technique to find good seed words for topics.

In the experiments, we will classify sentences of the UNGA speeches into six topics (‘greeting’, ‘UN’, ‘security’, ‘human rights’, ‘democracy’, ‘development’) that are considered important in the IR literature. We choose sentences as the unit of analysis because we believe sentence classification is difficult but natural: paragraphs or quasi-sentences are easier to classify due to the greater number of words in a unit, but paragraphs contain multiple topics and quasi-sentences lack objective boundaries (Däubler, Benoit, Mikhaylov, & Laver, 2012; Grimmer & Stewart, 2013). The results will show that we can achieve more than 60% accuracy in the classification of sentences into topics with Newsmap, while accuracy is considerably lower with the semi-supervised LDA. Classification accuracy is highest when the entropy-based diagnostic tool is used to identify words that, if included, would make the definitions of topics more ambiguous. Furthermore, the classification accuracy for sentences can exceed 70% if we exploit the contextual information of speeches by kernel smoothing of topic probabilities.

Problems of existing methods

Dictionary Analysis

Dictionaries are sets of keywords in pre-defined categories corresponding to certain concepts, and are often used in quantitative analysis as a robust theory-driven method. For example, the Lexicoder Topic Dictionary (Albaugh, Sevenans, & Soroka, 2013) contains 1,387 keywords under 28 topics (e.g. macroeconomics, civil rights, healthcare, agriculture) based on the Policy Agendas Project’s coding scheme. Since keywords for categories are manually selected, dictionary analysis allows researchers to assess not only the “criterion validity” but also the “content validity” of the measurement (c.f. Adcock & Collier, 2001; Sartori & Pasini, 2007).

However, dictionaries are context-dependent, and repurposing existing dictionaries can produce false results (Grimmer & Stewart, 2013). Therefore, if researchers wish to study documents in fields or languages that are new to quantitative text analysis, they have to create a dictionary from scratch, but it is not easy to collect hundreds of words manually (King, Lam, & Roberts, 2017). The number of keywords required for a dictionary depends on the concepts to be captured and the documents to be analyzed, but it tends to be very large when the concepts are complex and the relevant words are sparse.

Even if dictionaries are available, researchers can only produce frequency counts in dictionary analysis. Researchers obtain frequency counts of dictionary words in documents and divide them by the documents’ length (total number of words) to gauge the salience of the concepts, but this does not offer topic probabilities, which are available in statistical models. Therefore, keyword-based classification is only relative to other categories, and there is no theoretically grounded threshold, such as a likelihood ratio, for the reliability of classification results.

Unsupervised Topic Models

A variety of unsupervised topic models such as LDA (Blei, Ng, & Jordan, 2003), the Correlated Topic Model (CTM) (Blei & Lafferty, 2007), and the Bayesian Hierarchical Topic Model (Grimmer, 2010) have been used to identify topics in documents. Roberts et al. (2016) developed the structural topic model (STM), which allows researchers to discover the relationship between topics and documents’ attributes (e.g. country and year) by incorporating the attributes into topic identification. However, these unsupervised models almost always produce topics that are difficult to interpret from the theoretical point of view.

To highlight the problem of unsupervised topic models, we fitted an unsupervised LDA model to discover 6 topics in the UN General Debate speeches in the post-Cold War period. Table 1 shows the words with the highest probability in each topic, but it is very difficult to interpret them substantively. For example, “united nations” relates to ‘UN’, but it appears in all topics but Topic 3; “peace”, “conflict” and “security” are words relevant to ‘security’, but all the topics contain at least one of them. There is no word that clearly indicates ‘democracy’ or ‘greeting’. We might find these topics in LDA models with a larger number of topics, but many of the topics would remain difficult or impossible to interpret.
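Such a model can be fitted in a few lines of R with the quanteda and topicmodels packages. The following is a minimal sketch, not the exact code behind Table 1; the corpus object `corp` and the preprocessing choices are illustrative assumptions.

```r
library(quanteda)
library(topicmodels)

# corp: a hypothetical quanteda corpus of UN General Debate speeches
sents <- corpus_reshape(corp, to = "sentences")      # sentence-level documents
dfmt <- dfm(tokens(sents, remove_punct = TRUE)) |>
  dfm_remove(stopwords("en")) |>
  dfm_trim(min_termfreq = 5)                         # drop very rare features

# Unsupervised LDA with six topics
lda <- LDA(convert(dfmt, to = "topicmodels"), k = 6, control = list(seed = 1234))
terms(lda, 10)                                       # ten highest-probability words per topic
```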

Table 1: Topics and words in unsupervised LDA

Topic 1: world, united nations, peace, people, year, states, community, global, countries, support
Topic 2: international, development, united nations, peace, country, must, session, organization, economic, social
Topic 3: countries, must, states, can, conflict, also, work, us, republic, security
Topic 4: world, united nations, countries, new, efforts, also, peace, community, must, us
Topic 5: development, united nations, international, people, new, states, us, peace, need, also
Topic 6: international, united nations, development, countries, economic, security council, security, country, human rights, community

Semi-supervised topic classification techniques

Computer scientists have developed various types of semi-supervised techniques, which exploit not only labeled but also unlabeled documents in training models (Chapelle, Schölkopf, & Zien, 2010). For example, Blum and Mitchell’s (1998) “co-training” technique extracts features from a small number of labeled documents and uses these features to expand the training set for greater classification accuracy. In their experiment, they iteratively expanded the training set by including the most reliable classification results from a naïve Bayes classifier, and achieved significantly lower prediction errors than a model without the expansion. Zelikovitz and Hirsh (2000) proposed a technique to expand training sets by nearest-neighbor classification of unlabeled documents, and demonstrated that the technique is most effective when labeled documents are short and scarce. More recently, external data such as Wikipedia has been used as a source of universal knowledge to improve classification accuracy (Banerjee, Ramanathan, & Gupta, 2007; Schönhofen, 2009). In the same vein, Phan, Nguyen and Horiguchi (2008) trained topic models (LDA and LSA) on external data to accurately classify short documents using a supervised model.

While the above models aim to increase the efficiency of supervised learning, seeded models aim to improve the interpretability of results by using prior lexical knowledge. Lu, Ott, Cardie and Tsou (2011) developed a technique to weight the prior distributions of topics over words in order to detect sentences that mention specific issues. In their semi-supervised LDA model¹, pseudo counts are added to user-defined topic keywords before an LDA model is fitted. Jagarlamudi, Daumé, and Udupa (2012) modeled topics as a mixture of user-defined topics and other topics by adding extra parameters. Watanabe (2018b) developed Newsmap, a semi-supervised model for geographical classification of documents, by combining dictionary analysis and the naïve Bayes classifier. The model was also used for semi-supervised topic classification (Watanabe, 2018a).

¹ The seeded LDA model is available in the R package topicmodels (Hornik & Grün, 2011).

Naïve Bayes models can classify sentences into topics accurately because each sentence usually has only one topic. However, creating a seed dictionary for topic classification is more difficult than for geographical classification, since there are many more potential seed words for topics than for countries. Seed word selection is a two-step process in which analysts collate candidate words that potentially help to define the categories and then choose only the words with little ambiguity as seed words. Usually, short documents require a greater number of seed words than long documents to achieve the same level of classification accuracy, because there is a smaller chance that seed words occur in each item. However, seed words that match unrelated documents significantly deteriorate classification accuracy, because false positive cases confound the true association between categories and words in the machine learning process (Jagarlamudi et al., 2012). We call words that match intended documents “true seed words” and those that do not “false seed words”. The goal of dictionary making for semi-supervised models is therefore to maximize the number of true seed words while minimizing the number of false seed words.

Semi-supervised Classification

Newsmap is a semi-supervised model originally created to classify short news summaries according to their geographical focus (Watanabe, 2018b). Unlike fully supervised models, Newsmap requires not a manually coded training set but only a seed dictionary. First, the model searches the entire corpus for words in the seed dictionary in order to assign labels to documents; second, the labels are used to train a multinomial naïve Bayes model which learns the association between the labels and features. This dictionary-based learning is advantageous because (1) training new models does not require any manual coding, (2) searching a corpus for dictionary words is not computationally intensive, and (3) dictionaries can be ported to different projects without any modification.
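These two steps can be written compactly with the quanteda and newsmap R packages. The sketch below is illustrative rather than the exact code used in this paper: `sents` is a hypothetical sentence-level corpus, and the dictionary shows only two of the six topics from Table 3.

```r
library(quanteda)
library(newsmap)

# Abbreviated seed dictionary with glob patterns (full sets in Table 3)
seed_dict <- dictionary(list(
  greeting = c("greet*", "thank*", "congratulat*", "sir", "express*"),
  development = c("develop*", "market*", "investment*")
  # ... remaining topics omitted
))

toks <- tokens(sents, remove_punct = TRUE)   # sents: hypothetical sentence corpus

# Step 1: label sentences by dictionary lookup
dfmt_label <- dfm(tokens_lookup(toks, seed_dict))

# Step 2: train a multinomial naive Bayes model on the labeled sentences
dfmt_feat <- dfm(toks) |> dfm_remove(stopwords("en"))
model <- textmodel_newsmap(dfmt_feat, dfmt_label)
pred <- predict(model)                       # most likely topic per sentence
```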

Keyword Selection

A common approach to evaluating the contribution of keywords to classification accuracy is to use a manually labeled ‘gold standard’ (King et al., 2017), but this risks overfitting the dictionary to the sample, making it less portable to other projects. Using multiple samples for evaluation reduces this risk, but manual coding of large numbers of documents is expensive. Therefore, we devised an alternative approach that does not require manually coded documents for seed selection. In this approach, analysts use coverage and entropy as diagnostic statistics to judge whether a seed word would contribute positively or negatively to classification accuracy. Since these statistics are computed without manually coded documents, their diagnosis is only probabilistic, but our experiments will demonstrate that they offer analysts good guidance in dictionary making.

Keyword coverage

Keyword coverage is the proportion of documents in which keywords occur. In simple dictionary-based topic classification, it indicates the proportion of documents that can be classified, regardless of accuracy; in dictionary-based semi-supervised learning, it indicates the proportion of documents that can be used to train the model. Higher keyword coverage is therefore more desirable, but sentence-level classification requires many keywords to achieve it, because each sentence contains only a small number of words.
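Keyword coverage is straightforward to compute. A sketch, reusing the hypothetical `toks` and `seed_dict` objects from the training example above:

```r
library(quanteda)

# Seed-word matches per sentence; coverage is the share of sentences
# with at least one match.
n_hits <- ntoken(tokens_lookup(toks, seed_dict))
coverage <- mean(n_hits > 0)
```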

Average feature entropy

Average feature entropy (AFE) measures the distribution of features across documents labeled by a seed dictionary:

$$\mathrm{AFE} = \frac{1}{n} \sum_{i=1}^{n} H(f_i + 1), \qquad f_i = [c_1, c_2, c_3, \dots, c_k]$$

where $n$ is the total number of features, $H$ is an entropy function and $f_i$ is the vector of frequencies of feature $i$ co-occurring with seed words in the $k$ categories. Feature frequencies are smoothed by adding one before computing entropy.

AFE measures the randomness of co-occurrences between dictionary words and other words (features) in the corpus: it becomes low when features are concentrated in specific categories, and high when features are evenly distributed across them. To compute AFE, we first apply the dictionary to the corpus and aggregate feature frequencies by topic label, then compute the entropy of each feature.
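A sketch of this computation follows, again using the hypothetical `dfmt_feat` (sentence-by-feature counts) and `dfmt_label` (sentence-by-topic seed matches) objects from the training example; the base of the logarithm is an arbitrary choice that only rescales AFE.

```r
library(quanteda)

# Keep sentences with at least one seed match and take their top-scoring topic
labeled <- rowSums(dfmt_label) > 0
topic <- colnames(dfmt_label)[max.col(as.matrix(dfmt_label), ties.method = "first")]

# Aggregate feature frequencies by topic label: a k-by-n matrix of counts
freq <- as.matrix(dfm_group(dfm_subset(dfmt_feat, labeled), groups = topic[labeled]))

# Entropy of a frequency vector; add-one smoothing keeps all probabilities positive
entropy <- function(x) { p <- x / sum(x); -sum(p * log(p)) }
afe <- mean(apply(freq + 1, 2, entropy))   # average over the n features
```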

Contextual Smoothing

A trained Newsmap model can recognize a very large number of features with which to classify sentences, but some sentences lack topic indicators altogether. However, we can still classify those sentences correctly by considering the topics of surrounding sentences, as we did in our manual coding. Contextual smoothing does not require us to modify Newsmap’s algorithm, because we can simply post-process the likelihoods of the topics: we smooth the likelihoods with a kernel smoother and re-classify sentences into the topics with the highest smoothed probability.
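Because the smoothing is pure post-processing, it takes only a few lines of R. In this sketch, `prob` is a hypothetical sentences-by-topics matrix of likelihoods from the trained model for a single speech, and the Daniell kernel with m = 3 (the window size used in Experiment 4 below) comes from the stats package.

```r
# Smooth each topic's likelihood across adjacent sentences of one speech
kern <- kernel("daniell", m = 3)                 # moving average, 3 sentences each side
smoothed <- apply(prob, 2, kernapply, k = kern)  # note: kernapply trims m rows at each end

# Re-classify each remaining sentence into the topic with the highest smoothed probability
topic_smoothed <- colnames(prob)[max.col(smoothed, ties.method = "first")]
```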

Experiments

The data for our experiments are a subset of the United Nations General Debate corpus (Baturo et al., 2017). After excluding speeches from the Cold War period to ensure the consistency of topics, we sampled one speech per year, each from a different country (Table 2). In choosing the sample countries, we balanced their international influence, geographical location, and level of industrialization. We segmented these speeches into sentences and manually classified them into five topics (‘UN’, ‘security’, ‘human rights’, ‘democracy’, ‘development’). These topics are based on previous studies of the UN (Smith, 2006; Zanotti, 2005), but we added ‘greeting’ as a category for speakers’ opening remarks, although such remarks are unlikely to be of interest to IR scholars.

Table 2: Speeches chosen for the experiment

Year  Country            Year  Country              Year  Country
1991  Australia          2000  United States        2009  Turkmenistan
1992  Libya              2001  Germany              2010  Palau
1993  Ukraine            2002  Russia               2011  China
1994  Sweden             2003  Afghanistan          2012  Norway
1995  Japan              2004  Myanmar              2013  Slovenia
1996  Brazil             2005  Tunisia              2014  Tuvalu
1997  Zimbabwe           2006  Colombia             2015  France
1998  Mexico             2007  United Kingdom       2016  Saudi Arabia
1999  Chad               2008  Equatorial Guinea    2017  Estonia

Manual coding

We employ the manually classified sentences as the gold standard against which we test our diagnostic tools and analytical techniques. However, sentence-level classification is not an easy task even for human coders, because sentences often (1) contain more than one topic indicator or (2) lack a clear indication of topic. If a sentence contains multiple topics, we classify it into the most salient one; if a sentence mentions the UN’s role in issues such as human rights protection or security assurance, we classify it into the substantive topic. If a sentence does not contain clear topic indicators, we resort to the topics of surrounding sentences to classify it.²

² An example of a sentence without a clear indication of topic is “the most immediate area of concern for the international community is the situation in Yugoslavia”.

Seed word selection

Seed word selection is of critical importance in dictionary-based semi-supervised models. However, humans can judge the relevance of keywords in a list but cannot easily create such a list themselves (King et al., 2017). Therefore, we constructed two sets of seed words, one based on knowledge and one on frequency. Knowledge-based seed words are smaller sets of words selected on the basis of researchers’ background knowledge of the field, while frequency-based seed words are larger sets chosen from the most frequent words in the corpus. We consider knowledge-based seed words superior to frequency-based seed words because they offer operational definitions of the concepts in the analysis.

For our experiment, we read the debate transcripts carefully and consulted glossaries and indices of relevant books to select the knowledge-based seed words; we then obtained a list of the 300 most frequent words in the corpus and manually classified the relevant ones into topics as the frequency-based seed words. Where the knowledge-based and frequency-based sets had common elements, we removed them from the former, except for the words used as names of topics, which anchor the definitions (Table 3). After stemming the collected seed words to create glob patterns, we trained a Newsmap model and tested its classification accuracy: overall, the knowledge-based seed words (F1 = 0.53) achieved slightly higher accuracy than the frequency-based seed words (F1 = 0.52), and a larger set combining the two performed best (F1 = 0.57). With the knowledge-based seed words, classification accuracy is highest for ‘development’ (F1 = 0.65) and lowest for ‘democracy’ (F1 = 0.36) and ‘human rights’ (F1 = 0.37); with the frequency-based seed words, ‘greeting’ (F1 = 0.23), ‘human rights’ (F1 = 0.25) and ‘democracy’ (F1 = 0.27) are even lower (Table 4).

Table 3: Seed words

GREETING
  Knowledge-based: greet*, thank*, congratulat*, sir, express*
  Frequency-based: great*, mr, wish*, hop*, contribut*, anniversar*, welcom*

UN
  Knowledge-based: united nations, international court*
  Frequency-based: security council, general assembly, organization*, reform*, secretary-general, resolution*, permanent member*, charter*, session*, conference*

SECURITY
  Knowledge-based: secur*, kill*, attack*, dispute*, victim*
  Frequency-based: peac*, terror*, weapon*, nuclear*, conflict*, war*, disarmament*, threat*, cris*, solution*, settlement*, force*, destruction*, militar*, violence*, arm*, fight*

HUMAN RIGHTS
  Knowledge-based: human rights, violat*, race*, dignit*, protect*, citizen*, educat*
  Frequency-based: humanitarian, child*, women, refugee*, communit*, people, respect*, responsib*, food*, health*

DEMOCRACY
  Knowledge-based: democra*, autocra*, dictator*, vote*, represent*, elect*, leader*
  Frequency-based: president*, party, institution*, government*, law*, republic*, free*, leadership*, legal*

DEVELOPMENT
  Knowledge-based: develop*, market*, investment*
  Frequency-based: econom*, climate change, assistance*, sustain*, povert*, trade*, grow*, social*, environment*, prosperit*, progress*, financ*, cooperation*

Total number of words: 29 knowledge-based, 66 frequency-based.

Table 4: Classification accuracy of Newsmap (F1 scores)

Topic          Knowledge-based  Frequency-based  All
GREETING       0.495            0.235            0.343
UN             0.500            0.481            0.585
SECURITY       0.563            0.638            0.627
HUMAN RIGHTS   0.370            0.255            0.359
DEMOCRACY      0.362            0.276            0.349
DEVELOPMENT    0.654            0.642            0.678
Overall        0.535            0.520            0.570
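The F1 scores in Tables 4 and 6 can be computed per topic from predicted and hand-coded labels. A sketch, where `pred` and `gold` are hypothetical character vectors of predicted and manually coded topics:

```r
# Per-topic F1: harmonic mean of precision and recall
f1 <- function(topic, pred, gold) {
  prec <- sum(pred == topic & gold == topic) / sum(pred == topic)
  rec <- sum(pred == topic & gold == topic) / sum(gold == topic)
  2 * prec * rec / (prec + rec)
}

topics <- sort(unique(gold))
sapply(topics, f1, pred = pred, gold = gold)   # one score per topic
```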

We also fitted a semi-supervised LDA with the knowledge-based seed words for comparison and obtained a more interpretable model than the unsupervised LDA (Table 5).³ There are words that are strongly associated with unexpected topics (e.g. “peace” and “security” in ‘democracy’), but many of the words in our frequency-based seed word set are correctly identified as topic words (e.g. “sustainable” and “poverty” in ‘development’). However, the classification accuracy of the seeded LDA is only F1 = 0.40 overall, even when all the seed words are used (Table 6). Contrary to Newsmap, the seeded LDA performed better with the frequency-based set than with the knowledge-based set, and combining the two sets changed its classification accuracy only very little.

³ We set the pseudo counts for the seed words to 500 (roughly 0.1% of the total number of sentences), but the choice of value did not affect the results much.


Table 5: Topics and words in seeded LDA

GREETING: also, general assembly, session, work, like, year, president, take, government, years
UN: international, united nations, community, states, support, security council, organization, role, member, reform
SECURITY: people, many, terrorism, war, end, situation, conflicts, security, however, still
HUMAN RIGHTS: world, must, can, us, new, human rights, one, today, challenges, human
DEMOCRACY: peace, cooperation, political, process, country, security, region, state, efforts, africa
DEVELOPMENT: development, countries, economic, global, social, developing, sustainable, poverty, resources, trade

Table 6: Classification accuracy of seeded LDA (F1 scores)

Topic          Knowledge-based  Frequency-based  All
GREETING       0.324            0.332            0.315
UN             0.310            0.421            0.433
SECURITY       0.388            0.464            0.472
HUMAN RIGHTS   0.190            0.158            0.187
DEMOCRACY      0.092            0.117            0.147
DEVELOPMENT    0.539            0.567            0.560
Overall        0.348            0.400            0.407

Experiment 1: Keyword coverage

We randomly drew 1 to 7 seed words from each topic and trained a Newsmap model on each draw to understand the relationship between keyword coverage and classification accuracy. Figure 1 shows that keyword coverage is strongly correlated with classification accuracy: F1 scores are around 0.4 when the coverage is about 4%, rising to 0.5 when the coverage reaches 10%. Furthermore, the coverage is roughly a function of the number of seed words: it is less than 5% when each topic has 1-2 seed words, but rises above 10% when each has 6-7 seed words.
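The design of this experiment can be sketched as follows, reusing the hypothetical objects from the earlier examples; `full_dict` stands for the complete seed dictionary and `overall_f1()` for a helper built from the per-topic scores shown earlier.

```r
set.seed(123)
results <- lapply(1:7, function(n) {
  # Draw up to n seed words per topic and retrain Newsmap
  drawn <- lapply(as.list(full_dict), function(w) sample(w, min(n, length(w))))
  dict <- dictionary(drawn)
  labs <- dfm(tokens_lookup(toks, dict))
  model <- textmodel_newsmap(dfmt_feat, labs)
  c(coverage = mean(ntoken(tokens_lookup(toks, dict)) > 0),
    f1 = overall_f1(predict(model), gold))
})
```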

The strong positive correlation between the number of seed words and classification accuracy suggests that the most effective strategy for classification with Newsmap is to add more seed words to the dictionary. However, the improvement in accuracy slows as the dictionary grows, because newly added words often produce false positive matches, as discussed in Experiment 3.

Experiment 2: Average feature entropy

We incrementally expanded the dictionary by randomly drawing seed words from the candidate set to reveal the relationship between AFE and classification accuracy (Figure 2). Starting with only the knowledge-based seed words, we sampled the frequency-based seed words, added them one by one (“weapon*”, “econom*”, “war*”, etc.) to the dictionary, and computed the AFE and F1 at each step. Classification accuracy gradually increased as we added more seed words, but F1 fell sharply, coinciding with a rise in AFE, when we included “people” (20th) and “hop*” (30th). Changes of F1 and AFE in opposite directions can also be found for “legal*” (5th), “women” (6th), “respect*” (12th), “responsib*” (15th), “institution*” (19th), “anniversar*” (21st) and “government*” (27th).

We repeated the same procedure 100 times to characterize the relationship between AFE and F1. Figure 3 shows that changes in AFE and changes in F1 are negatively correlated (r = −0.49, p < 0.001): F1 is roughly twice as likely to decrease as to increase when a seed word raises AFE. This result indicates that changes in AFE can be used to identify risky seed words that would lower classification accuracy if included.

Experiment 3: Selection criteria

The results of Experiments 1 and 2 suggest that the AFE statistic can be used to reduce the risk of adding seed words that are likely to deteriorate classification accuracy. To test this possibility, we computed the change in AFE when each of the frequency-based seed words was added to a dictionary containing only the knowledge-based seed words.

The results show that 7 words in ‘greeting’, 10 words in ‘human rights’, and 6 words in ‘democracy’ are risky seed words (Table 7). The risky seed words are concentrated in the three topics in which the frequency-based seed words deteriorated classification accuracy (Table 4), indicating that AFE correctly flagged them. When the frequency-based seed words were added to the dictionary excluding the risky seed words, accuracy improved to F1 = 0.61, significantly higher than with only the knowledge-based seed words (F1 = 0.53) or with all the seed words (F1 = 0.57). Conversely, when only the risky seed words were added to the dictionary, classification accuracy decreased significantly (F1 = 0.44), showing that the risky seed words identified by AFE are indeed false seed words. Changes in classification accuracy caused by the risky words are shown in Figure 4.


Table 7: Risky seed words identified by AFE

GREETING: great*, mr, wish*, hop*, contribut*, anniversar*, welcom*
HUMAN RIGHTS: humanitarian, child*, women, refugee*, communit*, people, respect*, responsib*, food*, health*
DEMOCRACY: party, institution*, law*, republic*, free*, legal*

Experiment 4: Contextual smoothing

We constructed a dictionary excluding the risky seed words and trained a Newsmap model to achieve the highest classification accuracy, but the model still cannot classify sentences that lack topic indicators altogether. Human coders can resort to surrounding topics to classify such sentences, but Newsmap models cannot. To mimic manual classification, we predicted the topics of individual sentences and reclassified them based on topic probabilities smoothed by a Daniell kernel with window size m = 3. With this window size, we take the topics of the three preceding and three succeeding sentences into consideration when identifying the topic of the current sentence. Figure 5 shows that the smoothed probabilities capture the natural transition of topics in the speech made by the Minister for Foreign Affairs of Ukraine in 1993.

Contextual smoothing improved the overall classification accuracy from F1 = 0.61 to 0.72. Table 8 shows that, although the smoothing had a negative effect for Tunisia (−0.15), Turkmenistan (−0.10) and Estonia (−0.02), it had a substantial positive effect in other cases, especially Ukraine (+0.28), Mexico (+0.28), Russia (+0.23) and France (+0.17). Although the classification accuracy for the United States (F1 = 0.57) and France (F1 = 0.56) is still low in absolute terms, owing to low accuracy in ‘human rights’ and ‘democracy’, topics that speakers from these two countries tend to emphasize, the results clearly show the effectiveness of contextual smoothing in sentence classification.


Table 8: Classification accuracy by country

Country            Code  Year  # of sentences  Raw (F1)  Smoothed (F1)  Change
Australia          AUS   1991  139             0.662     0.727           0.064
Libya              LBY   1992  152             0.579     0.724           0.144
Ukraine            UKR   1993  144             0.596     0.875           0.279
Sweden             SWE   1994  118             0.654     0.814           0.160
Japan              JPN   1995   97             0.722     0.784           0.061
Brazil             BRA   1996   94             0.642     0.691           0.050
Zimbabwe           ZWE   1997   71             0.663     0.704           0.042
Mexico             MEX   1998   82             0.606     0.890           0.284
Chad               TCD   1999  105             0.564     0.695           0.131
United States      USA   2000  101             0.531     0.574           0.043
Germany            DEU   2001  110             0.680     0.773           0.092
Russia             RUS   2002   75             0.609     0.840           0.231
Afghanistan        AFG   2003   71             0.566     0.662           0.096
Myanmar            MMR   2004  102             0.509     0.529           0.021
Tunisia            TUN   2005   20             0.750     0.600          −0.150
Colombia           COL   2006   52             0.578     0.808           0.230
United Kingdom     GBR   2007   86             0.633     0.709           0.077
Equatorial Guinea  GNQ   2008   34             0.761     0.794           0.033
Turkmenistan       TKM   2009   60             0.629     0.533          −0.096
Palau              PLW   2010   72             0.512     0.556           0.044
China              CHN   2011  124             0.765     0.823           0.058
Norway             NOR   2012   63             0.560     0.556          −0.004
Slovenia           SVN   2013   43             0.618     0.837           0.219
Tuvalu             TUV   2014   80             0.641     0.740           0.099
France             FRA   2015  102             0.386     0.559           0.173
Saudi Arabia       SAU   2016   45             0.825     0.822          −0.002
Estonia            EST   2017   84             0.594     0.573          −0.021

Proposed dictionary-making procedure

Our experiments suggest that the best strategy in dictionary making for semi-supervised models is to increase the number of seed words so as to maximize keyword coverage while avoiding the false seed words that damage classification accuracy. It is easy for researchers to increase the number of seed words by collecting frequent words from the corpus, but they often include false seed words in the process. However, researchers can employ the AFE statistic to reduce the risk of selecting false seed words.

We therefore propose a dictionary-making procedure by which researchers can perform theory-driven content analysis of large corpora at minimal cost:

1. Define the categories.
2. Select words that are important from the theoretical point of view (the knowledge-based set) for the respective categories.
   a. External sources (thesauri, glossaries and indices of books) can be used to collect words.
   b. Words should be selected regardless of their frequency in the corpus.
3. Select words that are frequent in the corpus and equivalent to those selected in step 2 (the frequency-based set).
   a. A list of the most frequent words (by term or document frequency) in the corpus should be used to collect words.
4. Add all the words in the knowledge-based set to the dictionary.
   a. Use the AFE of the knowledge-based set as the benchmark.
5. Add the words in the frequency-based set to the dictionary incrementally.
   a. Remove words that increase AFE (a sketch of this screening loop follows the list).
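Steps 4 and 5 can be implemented as a simple screening loop. In this sketch, `knowledge_dict` is the knowledge-based dictionary, `candidates` is a hypothetical data frame of (topic, word) pairs drawn from the frequent-word list, and `compute_afe()` is a wrapper around the AFE code shown earlier.

```r
dict <- as.list(knowledge_dict)
afe_cur <- compute_afe(dictionary(dict))     # step 4a: benchmark AFE

for (i in seq_len(nrow(candidates))) {
  trial <- dict
  trial[[candidates$topic[i]]] <- c(trial[[candidates$topic[i]]], candidates$word[i])
  afe_new <- compute_afe(dictionary(trial))
  if (afe_new <= afe_cur) {                  # step 5a: drop words that raise AFE
    dict <- trial
    afe_cur <- afe_new
  }
}
```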

Conclusions

Compared to unsupervised topic models and simple dictionary analysis, the advantages of semi-supervised classification are clear in IR research: (1) researchers can define the categories of classification in dictionaries using seed words, making their analysis consistent with existing theoretical frameworks, and (2) they only need to choose a small number of seed words for each category to classify documents accurately. In the experiments, we demonstrated that Newsmap can classify over 50% of sentences accurately with only a few seed words per topic, and that its accuracy rises above 60% when more seed words are added. If simple dictionary analysis is performed instead, 75% of sentences remain unclassified, because the keywords appear in only a small percentage of sentences.

The semi-supervised LDA underperformed Newsmap, presumably because the model’s assumptions are inappropriate for sentence classification and its complexity is excessive given the small number of seed words available. LDA models assume that documents contain multiple topics, but sentences of speeches usually have only one topic. The complexity of the model also demands a greater amount of weak supervision for sufficient learning; this greater training cost is apparent in the seeded LDA performing better with the frequency-based seed words than with the knowledge-based seed words. Nevertheless, we admit that further investigation is required before making conclusive remarks on semi-supervised LDA models. We expect the model to outperform Newsmap when documents are longer and it is trained with a sufficiently large number of seed words.


The semi-supervised classification technique is particularly useful when researchers apply quantitative text analysis in new fields, because training sets for fully supervised models are very expensive to produce, and dictionaries are often unavailable outside comparative politics or political communication and nearly non-existent for non-European languages. Although creating a large dictionary from scratch is not feasible for researchers who are strapped for resources, dictionary-based semi-supervised models allow them to perform quantitative analysis with only a small seed dictionary.

Semi-supervised document classification is also an alternative approach to multilingual text analysis, which has become increasingly popular in recent years. Multilingual analysis can be achieved by translating a large dictionary (Proksch, Lowe, Wäckerle, & Soroka, n.d.) or an entire corpus (De Vries, Schoonvelde, & Schumacher, 2018), but it can also be done by translating a small set of seed words if semi-supervised models are used. In fact, Newsmap performs multilingual geographical classification of news using functionally equivalent seed dictionaries in multiple languages (e.g. English, German, Spanish, French, Japanese, Russian, Chinese).

However, we recognize that the difficulty of making even a small dictionary is an obstacle to researchers adopting dictionary-based semi-supervised models. This is why we created the AFE statistic as a diagnostic tool that does not require manually coded documents as a gold standard. We believe dictionaries should be constructed primarily on the basis of theory, but high classification accuracy can be achieved by also including frequent words from the corpus. Adding frequent false seed words to a dictionary severely harms classification accuracy, but researchers can reduce this risk by paying attention to changes in AFE.

The AFE statistic allowed us to achieve higher classification accuracy by removing many of the frequency-based seed words in ‘greeting’, ‘human rights’ and ‘democracy’ in our experiment, but it is still likely that some of them were true seed words that could have improved classification accuracy if included. This did not happen in our experiment, probably because our knowledge-based seed words were not well chosen, so some of the true frequent seed words were wrongly identified as risky. This shows how difficult it is to operationalize abstract concepts such as human rights and democracy in words, but we believe it is possible as a collective endeavor: researchers who employ semi-supervised models should publish their dictionaries so that others can refine them by adding or removing seed words; words that are commonly used across their dictionaries will become knowledge-based seed words that are independent of particular corpora.


Contextual smoothing of topic probabilities is an auxiliary technique in this study, not directly tied to the semi-supervised classification technique, but we believe it can help in many cases. We expect contextual smoothing to improve the classification accuracy of sentences under various machine learning models without changing model assumptions or specifications. This is the primary reason we decided to train the Newsmap model on individual sentences instead of groups of adjacent sentences, despite the fact that training on both labeled sentences and unlabeled surrounding sentences is closer to semi-supervised learning techniques in computer science (Zelikovitz & Hirsh, 2000).

Finally, we hope that readers of this paper come to understand the strengths and weaknesses of dictionary-based semi-supervised models vis-à-vis popular methods such as topic models and simple dictionary analysis. Using semi-supervised models requires knowledge of both methodology and substance, but there will be an increasing number of young political scientists who understand both. We hope our work will encourage them to embark on quantitative analysis of documents in under-studied subjects and languages at the frontier of quantitative text analysis.


Figures

Figure 1: Correlation between keyword coverage and classification accuracy. Classification accuracy (F1 scores) is strongly correlated with keyword coverage.


Figure 2: Changes in the keyword coverage, AFE and F1 by frequency-based seed words.


Figure 3: Association between AFE and classification accuracy changes.


Figure 4: Changes in F1 by knowledge-based seed words.

Figure 5: Smoothed topic probabilities (Ukraine in 1993).


Figure 6: Classification accuracy of individual speeches.

Bibliography

Adcock, R., & Collier, D. (2001). Measurement Validity: A Shared Standard for Qualitative and Quantitative Research. The American Political Science Review, 95(3), 529–546.

Albaugh, Q., Sevenans, J., & Soroka, S. (2013). Lexicoder Topic Dictionaries, June 2013 versions. Retrieved from http://www.lexicoder.com/docs/LTDjun2013.zip

Banerjee, S., Ramanathan, K., & Gupta, A. (2007). Clustering Short Texts Using Wikipedia. Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 787–788. https://doi.org/10.1145/1277741.1277909

Banta, B. (2013). Analysing discourse as a causal mechanism. European Journal of International Relations, 19(2), 379–402.

Baturo, A., Dasandi, N., & Mikhaylov, S. J. (2017). Understanding state preferences with text as data: Introducing the UN General Debate corpus. Research & Politics, 4(2). https://doi.org/10.1177/2053168017712821

Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Blum, A., & Mitchell, T. (1998). Combining Labeled and Unlabeled Data with Co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, 92–100. https://doi.org/10.1145/279943.279962

Brunn, S. D. (1999). The worldviews of small states: A content analysis of 1995 UN speeches. Geopolitics, 4(1), 17–33.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2010). Semi-Supervised Learning. The MIT Press.

Däubler, T., Benoit, K., Mikhaylov, S., & Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. https://doi.org/10.1017/S0007123412000105

De Vries, E., Schoonvelde, M., & Schumacher, G. (2018). No Longer Lost in Translation: Evidence that Google Translate Works for Comparative Bag-of-Words Text Applications. Political Analysis, 26(4), 417–430. https://doi.org/10.1017/pan.2018.26

Grimmer, J. (2010). A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases. Political Analysis, 18(1), 1–35. https://doi.org/10.1093/pan/mpp034

Grimmer, J., & Stewart, B. M. (2013). Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts. Political Analysis, 21(3), 267–297. https://doi.org/10.1093/pan/mps028

Gurciullo, S., & Mikhaylov, S. (2017). Topology Analysis of International Networks Based on Debates in the United Nations. ArXiv:1707.09491 [Cs, Math, Stat]. Retrieved from http://arxiv.org/abs/1707.09491

Hansen, L. (2006). Security as Practice: Discourse Analysis and the Bosnian War. Routledge.

Holzscheiter, A. (2014). Between communicative interaction and structures of signification: Discourse theory and analysis in international relations. International Studies Perspectives, 15(2), 142–162.

Jackson, R. (2005). Writing the War on Terrorism: Language, Politics and Counter-terrorism. Manchester University Press.

Jagarlamudi, J., Daumé, H., III, & Udupa, R. (2012). Incorporating Lexical Priors into Topic Models. Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, 204–213. Retrieved from http://dl.acm.org/citation.cfm?id=2380816.2380844

King, G., Lam, P., & Roberts, M. E. (2017). Computer-Assisted Keyword and Document Set Discovery from Unstructured Text. American Journal of Political Science, 61(4), 971–988. https://doi.org/10.1111/ajps.12291

Lu, B., Ott, M., Cardie, C., & Tsou, B. K. (2011). Multi-aspect sentiment analysis with topic models. 2011 IEEE 11th International Conference on Data Mining Workshops, 81–88. IEEE.

Lundborg, T., & Vaughan-Williams, N. (2015). New Materialisms, discourse analysis, and International Relations: a radical intertextual approach. Review of International Studies, 41(1), 3–25.

Milliken, J. (1999). The study of discourse in international relations: A critique of research and methods. European Journal of International Relations, 5(2), 225–254.

Phan, X.-H., Nguyen, L.-M., & Horiguchi, S. (2008). Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. Proceedings of the 17th International Conference on World Wide Web, 91–100. https://doi.org/10.1145/1367497.1367510

Proksch, S.-O., Lowe, W., Wäckerle, J., & Soroka, S. (n.d.). Multilingual Sentiment Analysis: A New Approach to Measuring Conflict in Legislative Speeches. Legislative Studies Quarterly. https://doi.org/10.1111/lsq.12218

Roberts, M. E., Stewart, B. M., & Airoldi, E. M. (2016). A model of text for experimentation in the social sciences. Journal of the American Statistical Association, 111(515), 988–1003.

Sartori, R., & Pasini, M. (2007). Quality and Quantity in Test Validity: How can we be Sure that Psychological Tests Measure what they have to? Quality & Quantity, 41(3), 359–374. https://doi.org/10.1007/s11135-006-9006-x

Schoenfeld, M., Eckhard, S., Patz, R., & van Meegdenburg, H. (2018). Discursive Landscapes and Unsupervised Topic Modeling in IR: A Validation of Text-As-Data Approaches through a New Corpus of UN Security Council Speeches on Afghanistan. ArXiv:1810.05572 [Cs]. Retrieved from http://arxiv.org/abs/1810.05572

Schönhofen, P. (2009). Identifying Document Topics Using the Wikipedia Category Network. Web Intelligence and Agent Systems, 7(2), 195–207. https://doi.org/10.3233/WIA-2009-0162

Smith, C. B. (2006). Politics and Process at the United Nations: The Global Dance. Boulder, CO: Lynne Rienner.

Suzuki, S. (2015). The rise of the Chinese ‘Other’ in Japan’s construction of identity: Is China a focal point of Japanese nationalism? The Pacific Review, 28(1), 95–116.

Watanabe, K. (2018a). Conspiracist propaganda: How Russia promotes anti-establishment sentiment online? Paper presented at the ECPR General Conference, Hamburg.

Watanabe, K. (2018b). Newsmap: A semi-supervised approach to geographical news classification. Digital Journalism, 6(3), 294–309. https://doi.org/10.1080/21670811.2017.1293487

Zanotti, L. (2005). Governmentalizing the post–Cold War international regime: The UN debate on democratization and good governance. Alternatives, 30(4), 461–487.

Zelikovitz, S., & Hirsh, H. (2000). Improving short text classification using unlabeled background knowledge to assess document similarity. Proceedings of the 17th International Conference on Machine Learning.