
A WEB SCRAPER AND ENTITY RESOLVER FOR CONVERTING PUBLIC EPIDEMIC REPORTS INTO LINKED DATA

By

MATTHEW A. DILLER

A THESIS PRESENTED TO THE GRADUATE SCHOOL

OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2018


© 2018 Matthew A. Diller


To my Mom


ACKNOWLEDGMENTS

First and foremost, I would like to thank my mother whose support and dedication

to my education has doubtlessly been the biggest contributing factor to my academic

achievements thus far in life. I truly believe that, had you not fostered in me a passion

for learning when I was younger, I probably would not have decided to become a

scientist.

I would also like to thank my advisor and mentor, Bill Hogan, who first introduced

me to the field of biomedical informatics back in 2014. Had it not been for that chance

encounter between you and Mitch, I probably would never have had the opportunity to

dive into this field, which I have grown to love over the last few years. There have been

many times in the last few years that I’ve hit a brick wall of self-doubt only to have it

quashed by your encouragement and praise; for this reason (and many others), you

have truly been a great mentor.

In addition, I would like to thank both Jiang Bian and Amanda Hicks for their

mentorship and assistance, which have been invaluable to me throughout this journey.

In addition to having benefited from your wisdom, I have also learned from you the

value of mentorship to a student and will try to emulate it for any students that I get to

mentor in the future.

Finally, I would like to thank my best friend, Todd Sahagian, for his support and

his assistance with the validation step of corpus annotation. I couldn’t have done it

without you, man.


TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

1 INTRODUCTION

2 METHODS
  Statement of Purpose
  Data Source
  Components
  Web Scraper
  Named-entity Recognition
  Entity Resolution

3 RESULTS
  Web Scraper
  Named-entity Recognition
  Entity Resolution

4 DISCUSSION
  Limitations
  Future Work

APPENDIX

ANNOTATION GUIDELINE

LIST OF REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

3-1 Baseline set of features selected for the first round of training from the Stanford NER NERFeatureFactory

3-2 Summary of the CRF model performance for the first and second rounds of training, and the final round of testing

3-3 Additional features added to the baseline features for the second round of model training


LIST OF FIGURES

2-1 Flow chart of methods.

2-2 Example web page from the National Wildlife Health Center’s Avian Influenza News Archive.

2-3 Images that illustrate heterogeneity in the formatting and use of punctuation in headings of reports.

2-4 Example output file of NLP pre-processing step.

3-1 Number of avian influenza epidemics by country from 2006 to 2017.

3-2 Number of avian influenza epidemics by year from 2005 to 2017.

3-3 Number of avian influenza epidemics by host from 2006 to 2017.

3-4 Number of avian influenza epidemics by influenza pathogen from 2006 to 2017.


LIST OF ABBREVIATIONS

API Application programming interface

BIO2 Beginning-inside-outside format 2

brat Brat rapid annotation tool

CDC Centers for Disease Control and Prevention

CMM Conditional Markov model

CRF Conditional random field

ER Entity resolution

FN False negative

FP False positive

GPHIN Global Public Health Intelligence Network

HMM Hidden Markov model

HTML Hypertext Markup Language

IAA Inter-annotator agreement

IO Inside-outside format

IRI Internationalized Resource Identifier

ISO International Organization for Standardization

JSON JavaScript Object Notation

MedISys Medical Information System

NCBI National Center for Biotechnology Information

NER Named-entity recognition

NLP Natural language processing

RDFS Resource Description Framework Schema

SARS Severe Acute Respiratory Syndrome


SPARQL SPARQL Protocol and Resource Description Framework Query Language

SQL Structured Query Language

TP True positive

UAE United Arab Emirates

UK United Kingdom

US United States

USGS United States Geological Survey

WHO World Health Organization


Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

A WEB SCRAPER AND ENTITY RESOLVER FOR CONVERTING PUBLIC EPIDEMIC REPORTS INTO LINKED DATA

By

Matthew A. Diller

August 2018

Chair: William R. Hogan
Major: Medical Sciences

Public health information systems that detect epidemics rapidly by extracting information from textual Web-based reports, such as online news articles, as they are published in real time have shown promising results for disease surveillance. Unfortunately, current resources that rely on Web-based reports for disease surveillance are not designed to automatically extract epidemiological data from these reports, and therefore do not take advantage of the full potential of the data contained in each.

One of the challenges to extracting such data from unstructured, textual Web-

based reports is identifying and linking data that are about individual epidemics in

multiple reports that have been published over a period of time. Therefore, the focus of

this work is to develop and evaluate a set of tools that use state-of-the-art informatics

technologies to extract data about avian influenza epidemics from textual Web-based

reports and use this data to identify and link reports that are about the same epidemics.

The online data source that I use is the Avian Influenza News Archive, which is

maintained by the United States Geological Survey. This archive contains serially-

published reports about avian influenza epidemics from November 7, 2006 to


September 28, 2017 and is publicly available. The first tool that I developed is a web

scraper that extracts the report text from each web page and stores it locally. The

second tool is a named-entity recognizer that labels named entities in each report that

refer to or denote locations, dates, host organisms, and influenza pathogens. Finally,

the third tool is a rule-based entity resolver that uses the labeled terms to identify

mentions of individual epidemics in each report and link them to epidemics identified in

other reports.

In total, the scraper extracted the text of 1,963 epidemic reports from which the

entity resolver identified 1,144 individual epidemics. Despite having a small training

dataset for NER model training, the overall results for all four named entities were

satisfactory (precision=0.9220, recall=0.7821, F-score=0.8463). Across all of the

identified epidemics, China was the most common location (18.62% of all epidemics);

H5N1 was the most common influenza subtype (65.25% of all epidemics); and birds

were the most common host (46.77% of all epidemics). Taken together, this work

illustrates the feasibility of applying entity resolution to textual Web-based reports to

identify and link reports that are about the same epidemic.


CHAPTER 1 INTRODUCTION

Recent epidemics of various infectious diseases—such as Ebola, severe acute

respiratory syndrome (SARS), Zika, and avian influenza—have demonstrated the need

for public health information systems that can detect epidemics rapidly, assess the

regional risk of epidemics with a high degree of specificity and sensitivity, and update

decision makers in real time as epidemics evolve. These systems require timely and

highly accurate input data, so that epidemiologists know precisely where the epidemic is

occurring and where it might spread and so that responders can develop well-informed

plans on how to address the epidemic. Historically, disease surveillance and forecasting

of disease outbreaks have relied on reports from sentinel healthcare providers and

laboratory results. These data are typically not available to physicians, researchers,

and the general public until weeks after the data were first collected. One of the

consequences of this situation is that computational epidemiologists—whose

mathematical models of disease transmission often inform public health officials and

policy makers on how to mitigate the spread of a disease—are sometimes unable to

react to an outbreak until well after the critical point at which it has evolved into an

epidemic.

To address the need for early detection of and ongoing real-time updates about

epidemics, researchers have studied the use of disparate Internet-based sources of

information, such as Google search queries [1–4], Wikipedia page access logs [5,6],

Twitter [3,7,8], online grey literature [9–11], and online news reports [9–11]. The

advantages of these sources of information are that data can be obtained in real- or near-

real time at a high degree of spatial resolution, and that the resource cost of obtaining


the data is generally low. However, online data tend to be either unstructured or semi-

structured, thus making efforts to aggregate and interpret data from multiple sources

challenging. In addition, certain online data sources are often lacking in specificity

and/or sensitivity, as was the case for Google Flu Trends [4], which can pose practical,

economic, and ethical concerns regarding their utility to public health officials who

cannot afford to waste precious resources on false positive warnings of outbreaks.

Despite these drawbacks, Internet-based resources have proven valuable for the

early detection and modeling of epidemics. Between 2013 and 2014, digital reports of

polio cases preceded official reports by the World Health Organization (WHO) for all

seven outbreaks that occurred within that time period [12]. Likewise, in 2014, online

news reports about an outbreak of a Lassa-fever-like hemorrhagic fever in eight people

in Guinea were published one week ahead of an official case report released by the

WHO [12]. Online news and official reports on both MERS and Ebola epidemics have

also been shown to be useful for estimating the basic reproduction number and other

disease parameters that epidemiologists rely on for assessing the risk for an outbreak

and for determining the appropriate control measure(s) to implement [13,14].

Meanwhile, Paul et al. have demonstrated that combining data scraped from Twitter

with historic incidence data improves the performance of influenza forecasting models

relative to using historical incidence data alone. Taken together, these findings illustrate

that online data sources are capable of providing epidemiologists with actionable

information on possible outbreaks in a timely manner, and that aggregating them with

more traditional data sources can result in improved estimates of outbreak

characteristics.


There still remains a considerable amount of work to be done before the full

potential of these online resources is realized. Much of this work will involve the

programmatic extraction of data from the text of online outbreak reports, which is

challenging due to the unstructured or semi-structured nature of many of these reports,

the heterogeneity in which the data is presented in the text, the possibility of formatting

changes to the text or the Hypertext Markup Language (HTML) documents that contain

them, and the lack of a guarantee that some sources will be available in the future.

Indeed, some tools have adopted hybrid approaches, which consist of computational

and manual methods for extracting epidemiological data from online reports. Cleaton et

al. [13] and Cauchemez et al. [14] decided to forego using computational methods

altogether, and instead manually extracted the data. Such an approach is labor

intensive and can take a long time to complete if a large volume of reports is to be

utilized.

At present, the available epidemiological tools that derive their data from official

or news-based online reports—the Global Public Health Intelligence Network (GPHIN),

HealthMap, the Medical Information System (MedISys), and ProMED-mail—are

designed to simply detect outbreaks as they emerge [10,11,15,16]. Thus, a lot of

epidemiological data, such as data about transmission patterns or the affected host

species, are not extracted from these reports. As I have mentioned, these data can

have a lot of value to epidemiologists for studying and developing responses to current

epidemics. However, without an automated way of identifying reports that are about the

same epidemic—an initial step to extracting epidemiological data from them—a great

amount of time and effort is needed to do so manually. This can be problematic when


dealing with rapidly developing epidemics where any delays in the development and

implementation of control measures can be disastrous.

Therefore, if the full potential of online reports as a data source for infectious

disease epidemiology is to be realized, part of that effort will involve developing

computational tools that can draw links between data about the same epidemic as it is

being reported on over time. Ideally, these tools would extract data from several

sources (e.g., ProMED-mail, online news articles, and Centers for Disease Control and

Prevention (CDC) reports), identify data that are about the same epidemic in each of

these sources, and then link those data together using entity resolution. These tools

would then be able to provide users with data about individual epidemics that are

occurring simultaneously and update these data as new reports are published by one or

more sources. With these data, epidemiologists and public health professionals would

then be able to develop more well-informed responses to these epidemics.

The focus of this thesis is to create and evaluate a set of tools that extracts data

from a set of Web-based reports and links multiple reports over time to the individual

epidemics they are about. The first tool in this set is a web scraper that fetches textual

reports that describe avian influenza epidemics that occurred across the globe in

various host populations from 2006 to 2017. The second tool is a named-entity

recognition classifier for identifying specific information in the text that identifies

individual epidemics. The third and final tool is a rule-based entity resolver that

generates data about individual epidemics within each report and then connects these

data to other data about the same individual epidemic identified in earlier reports. The


net result is that for each distinct epidemic, we can trace sequentially the reports that

describe it over time to see how it evolved.


CHAPTER 2 METHODS

Statement of Purpose

The goal of this project is to create query-able, structured data about avian

influenza epidemics from unstructured, web-based, narrative reports about these

epidemics. These epidemic reports are released approximately every two weeks by the

US Geological Survey (USGS). I will classify fragments of text in this corpus of USGS

epidemic reports as referring to or denoting an influenza pathogen, a location, a host

population, or a date in order to 1) perform entity resolution to identify unique epidemics

in epidemic reports that serially update information about ongoing epidemics, and 2)

distinguish among multiple different epidemics that are occurring simultaneously and

that may be in proximity to one another.

Data Source

I used the Avian Influenza News Archive of the United States Geological

Survey’s National Wildlife Health Center as the source of influenza epidemic information

[17]. As of May 15, 2018, this archive contained 1,963 reports on avian influenza epidemics, spread across 442 Web pages published from November 7, 2006 to September 28, 2017.

Each web page typically consists of at most three subsections: avian influenza in

wild animals, avian influenza in domesticated animals, and avian influenza in humans.

Each subsection typically contains one or more epidemic reports for a host–country–

influenza-subtype combination. For example, the September 9, 2016 web page consists

of two subsections—“Avian Influenza in Poultry” and “Avian Influenza in Humans”—with

four epidemic reports under “Avian Influenza in Poultry” and one under “Avian Influenza


in Humans”. Specifically, the web page contains reports of (1) H9N2 influenza in

humans in China, (2) H5N6 influenza in poultry in China, (3) H5N1 in birds in Ghana, (4)

H5N2 in ostriches in South Africa, and (5) H5N8 in ducks and chickens in South Korea

(Figure 2-2).

Components

This project consists of three main tasks: 1) creating a web scraping module to

fetch the Web pages and identify the epidemic reports contained in them, 2) training an

NLP module for named entity recognition (NER) to identify the four types of named

entities (that I describe below) in the reports, and 3) building an entity resolution module

to identify unique influenza epidemics from the output of the NLP step (Figure 2-1). Each

of these tasks consists of a set of subtasks, as I shall discuss.

Web Scraper

The web scraper is a Python version 3.5 [18] module that comprises two sub-modules for data extraction. The first sub-module (html_scraper.py) sequentially

fetches each web page (including the one shown in Figure 2-2) from the USGS Avian

Influenza News Archive and saves it locally as an HTML file. The second sub-module

(scraper.py) then iterates over each HTML file and creates a parse tree from its HTML,

using the Beautiful Soup Python package [19]. From there, it extracts the text heading

and content of each epidemic report contained in the HTML file and saves them to a

structured JSON file. Accordingly, each JSON file corresponds to a single web page,

and contains the text heading and body of one or more epidemic reports.
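
To make the parsing step concrete, the following is a minimal sketch of what the extraction logic in scraper.py might look like, assuming each report is a paragraph (<p>) whose heading is wrapped in a <b> tag; the function name and JSON layout here are illustrative assumptions rather than the exact implementation:

    import json
    from bs4 import BeautifulSoup

    def parse_page(html_path, json_path):
        # Build a parse tree from the locally saved HTML file.
        with open(html_path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        reports = []
        # Assume each report is a <p> whose heading sits in a <b> tag.
        for p in soup.find_all("p"):
            heading_tag = p.find("b")
            if heading_tag is None:
                continue
            heading = heading_tag.get_text(strip=True)
            # The paragraph text minus the heading approximates the body.
            body = p.get_text(" ", strip=True)[len(heading):].strip()
            reports.append({"heading": heading, "content": body})
        # One JSON file per web page, holding one or more reports.
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(reports, f, indent=2)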

Manual extraction was necessary for (1) three web pages that displayed some of the

data in tabular form, rather than as textual reports, and (2) 24 pages that


contained duplicates or errors. In addition, three web pages were exact duplicates of the

previous page and were therefore excluded. After excluding the three duplicate web

pages, there were 439 web pages to process. Of these 439 pages, 27 (6.14%) required

manual extraction.

In addition, a small number of the reports within the web pages contain

information on recently published studies, health policy changes, and ecological reports

that relate to avian influenza and its various host species. This made the task of web

scraping more difficult since the headers of these latter types of reports may be

formatted or worded differently from the epidemic reports themselves. In addition, it

adds complexity to the task of entity resolution as it potentially creates a false indication

of an influenza outbreak.

Named Entity Recognition

One of the main problems with extracting data from unstructured natural

language text is how to determine if a particular word or phrase is relevant to the task at

hand. One approach to determining relevance is to manually read through the text and

select each word or phrase that meets a set of pre-established criteria of relevance.

Since one of the goals of this task is to extract data about individual epidemics, the

criteria that I selected are based on four essential properties of epidemics that are

described in the definition of ‘epidemic’ from the World Health Organization (WHO).

According to the WHO, an epidemic is (emphasis added):

The occurrence in a community or region of cases of an illness, specific health-related behaviour, or other health-related events clearly in excess of normal expectancy. The community or region and the period in which the cases occur are specified precisely. The number of cases indicating the presence of an epidemic varies according to the agent, size, and type of population exposed,


previous experience or lack of exposure to the disease, and time and place of occurrence.[20]

Because this definition explicitly states that the location and temporal period of disease

cases are vital to determining whether an outbreak qualifies as an epidemic, I selected

locations and dates as two categories of interest for this task. Similarly, because agent

(i.e., pathogen) and type of host population are also listed as being important, I included

them as well.

Since it often is not feasible for humans to manually extract words or phrases

from large textual datasets, it is common to use computational methods that

automatically segment and classify the words or phrases in the text. Indeed, given that

the dataset I am using consists of 1,963 textual reports in total, manually extracting

each word or phrase that is about some epidemic would be labor intensive and time-

consuming. Therefore, for this task I used NER, which is a computational method that

segments and classifies words or phrases that represent named entities in a text

according to pre-selected categories. This method is used extensively for information

extraction tasks in the biomedical and public health domains [21–25].

I define a named entity as a thing that exists in reality and that can be referred to

or denoted by a noun or proper noun. For this project, the named entities in which I am

interested fall into the four categories that I described above (host, pathogen, date/time,

and location). As an example of how these named entities appear in a report, consider

the sentence, “Three pigs were confirmed to have been infected with H1N1 influenza in

Mexico on August 4.” This sentence contains references to four named entities: host

organisms (‘pigs’), an influenza pathogen (‘H1N1 influenza’), a location (‘Mexico’), and a

date (‘August 4’).


There are two main approaches to NER—the rule-based approach and the

learning-based approach—that I will describe here, each of which has its own

advantages and drawbacks. The rule-based approach relies on a set of grammar-based

rules that are either hand-crafted or bootstrapped from previously-developed rule sets,

which are then implemented throughout the entire data extraction process to extract

structured data from unstructured text. An example of what one of these rules might

look like informally is, “identify a match of a location type (e.g., ‘district of’) followed by a

match of a location name (e.g., ‘Shushan’).” This approach has the advantage of

achieving higher precision than the learning-based approach [26]. However, the

performance of systems that use this approach is entirely dependent on how comprehensive the information extraction rules are, and these rule sets therefore often require a substantial amount of manual effort to construct, test, and refine. In addition, information extraction systems that use this approach are limited to the domains that their rule sets cover.

The second main approach that is used for NER is the learning-based approach.

This approach uses a statistical model that, at the beginning of the extraction task, automatically derives a set of discriminative features of words (e.g., capitalization, or a sequence of digits that forms a year), which the model then uses to carry out the rest of the extraction process. The advantages of

this approach are the relatively higher recall that learning-based systems achieve

compared to rule-based systems and the avoidance of having to manually develop a

comprehensive set of rules, which can take several months [27]. Due to time constraints


and to recent improvements in the performance of learning-based systems for NER, I

elected to use the learning-based approach for this task.

Implementations of this approach typically rely on one of three techniques:

supervised learning, semi-supervised learning, or unsupervised learning [28]. The most

common of the three for NER, supervised learning, uses data that people have

manually annotated for positive (and sometimes negative) examples of named entities.

Semi-supervised learning uses a technique called “bootstrapping” that takes a small

number of example names of named entities, called “seeds,” searches for sentences

within a text that contain these names, and then uses their surrounding context to

identify other names elsewhere. Once new names are identified, the process is

repeated using the newly identified names. The third technique, which happens to be

relatively new for NER tasks, is unsupervised learning, which clusters groups of words

that denote or refer to named entities based on similarities in the context in which these

words occur.

By far, the most successful technique of the three for NER is supervised learning

[26,28], which is why I selected it for this task.

Conditional Random Fields

As I mentioned previously, supervised learning techniques use annotated data,

called “training data,” to develop a set of rules for classifying words or phrases in an

unannotated sequence (e.g., a sentence) of text. However, rather than treating each

word in a sentence as an individual token, these models take into account the

dependencies that exist between words in the sentence. Take, for example, the words

that form the term ‘bird flu virus’. In this example, the words that form the term ‘bird flu


virus’ have individual meanings (i.e., ‘bird’ refers to a type of flying animal that has a

beak, ‘flu’ refers to a type of infectious disease, etc.), but when they appear together like

this in a sentence, they are taken to refer to a particular type of virus. Using these

dependencies and the set of rules for distinguishing certain words (i.e., that were

generated at the beginning of the extraction task during the training step), the model

then predicts the most likely sequence of named entity labels for each word. Models

that take this approach to predicting the labels of sequential data are sometimes called

“sequential labeling models.”

One such model that is commonly used for learning-based NER tasks is the

conditional random fields (CRF) model [29]. To arrive at a sequence of labels for the

words in a sentence, during the training step of NER, a CRF model will first generate

feature functions for each labeled word in the sentence based on the following inputs:

the sentence (s), the position of a word in the sentence (i), the label of the current word (l_i), and the label of the previous word (l_{i-1}). The output of each feature function is typically binary (either 0 or 1). Through this process, the conditional dependence of l_i on s is defined by the model as a set of feature functions, such that the probability of each possible value for l_i is partially determined by the feature functions. In

the next step, the model assigns each feature function a weight and then sums the

weighted features over all of the words in the sentence. It then uses this sum to predict

the most likely sequence of labels for the sentence.
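
As an illustration, a single (hypothetical) binary feature function of the form f(s, i, l_i, l_{i-1}) might look like the sketch below; the function name and the LOCATION label are assumptions for the example:

    def f_capitalized_location(sentence, i, label, prev_label):
        # Fires (returns 1) when the word at position i is capitalized and
        # the candidate label for that position is LOCATION; otherwise 0.
        word = sentence[i]
        return 1 if word[0].isupper() and label == "LOCATION" else 0

    # The model scores a candidate label sequence by summing the weighted
    # outputs of all such feature functions over every position i.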

One of the benefits of CRFs is that they are capable of taking advantage of non-

local information in natural language text. One example of a non-local dependency that

is valuable in NER is the consistent labeling of two instances of the same word that


occur far from each other in a text. Similarly, it is also advantageous to have consistent

labeling of two words that are very similar to each other. For instance, if ‘United States’

is assigned a location label, I would also like for ‘United States of America’ to be

assigned a location label. Many other sequence labeling models, such as hidden

Markov models (HMMs) and conditional Markov models (CMMs), cannot account for

these non-local dependencies since they only capture local dependencies of each word

in a sentence (i.e., the label of a word and the label of the previous word). Because of

this capability and the high degree of success that CRFs have seen in NER tasks, I

decided to use CRFs for the NER task of this project.

To achieve the goal of correctly classifying all words and phrases that refer to

influenza pathogens, locations, dates, and host organisms in the epidemic reports, I

used the Stanford Named Entity Recognizer (NER) toolkit for named-entity recognition.

This Java-based toolkit uses a CRF sequence model for classifying named entities

[22,30]. It includes a pre-trained linear chain CRF model for classifying names of

people, organizations, or locations in English texts. In addition, Stanford NER provides

the functionality for users to train, test, and run their own models either

programmatically or from the command line.

Overview of NER task

The NER task of this project consisted of a pre-processing step, corpus

annotation, training and evaluation, and a final data extraction step, each of which is

described below.


Pre-processing step

The pre-processing step (1) cleans the data stored in the JSON files created by the web scraper and (2) breaks them out further into a set of text files, each of which contains exactly one report from a JSON file. For each JSON file, the text content string and respective heading string of each report are taken as input by a module (print_datasets.py) written in Python 3.5 that first removes any whitespace characters from the beginning or end of each string and any extraneous punctuation from the beginning of each string. My motivation for doing this was to prevent any future issues with the software I used for tokenizing and annotating the reports, and also to clean up the text that contains the body of each report so that it begins with the first word of the report.

This step was necessary due to the formatting of some web pages that

appended periods to the end of some headers, but in a different tag than what the rest

of the text was located in. For example, some reports had a heading that denoted the

location of an outbreak (e.g., ‘Egypt.’), such that the alphabetical characters of the

heading (i.e., ‘Egypt’) were stored in a <b> tag, while ‘.’ was stored outside of the <b>

tag in the paragraph (<p>) tag that the entire heading and text content were wrapped in

(Figure 3). As a result, this period would end up being prepended to the content string of

each report that followed this heading format.

In addition to removing whitespace from the beginning and end of the string, the

module also replaced any newline (‘\n’), horizontal tab (‘\t’), or return (‘\r’) escape

sequences that may appear within the content string with a single space. These

sequences occasionally appeared in the text as apparent artifacts of poorly formatted


HTML text strings. Since I am only interested in retaining the text of the reports, and not

any of these unexplained escape sequences, I elected to remove them.

The module then writes the report heading string to the first line of a text file and

the content string to the second line to produce a single text file for each report (Figure 2-4). Finally, it concatenates the positional index of the report in the web page to the date

of the web page, and assigns this as the file name (e.g., the text file containing the first

report from the page for November 7, 2006 would be named “2006_11_07_1”).
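
A minimal sketch of this cleaning and file-naming logic follows; it assumes the page date is available as a datetime.date and that reports are indexed from 1, and the helper names are hypothetical:

    import os
    import string

    def clean(text):
        # Replace stray escape sequences with single spaces, then trim
        # surrounding whitespace and extraneous leading punctuation.
        for seq in ("\n", "\t", "\r"):
            text = text.replace(seq, " ")
        return text.strip().lstrip(string.punctuation + " ")

    def write_report(out_dir, page_date, index, heading, content):
        # e.g., the first report from the November 7, 2006 page is written
        # to a file named "2006_11_07_1": heading on line 1, body on line 2.
        name = "{}_{}".format(page_date.strftime("%Y_%m_%d"), index)
        with open(os.path.join(out_dir, name), "w", encoding="utf-8") as f:
            f.write(clean(heading) + "\n" + clean(content) + "\n")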

Corpus annotation

One of the obvious disadvantages to supervised learning is that it requires a

manually-annotated training dataset. As this task is often tedious and time-consuming

(although usually less so than developing rule sets), it is common to use pre-

annotated datasets that have been released by others [31–33]. Unfortunately, to my

knowledge there are no available pre-annotated corpora of text that consist of influenza

epidemic reports, so I had to create a training dataset manually.

I selected thirty web pages randomly from the set of scraped web pages, and

then annotated the individual reports contained in each using the brat rapid annotation

tool (brat) [34]. Brat is a web-based text annotation tool that allows users to create

structured annotations for text corpora. It stores the annotations in “standoff format”

(i.e., in a separate file from the text report being annotated) in an individual annotation

file (.ann). This annotation file takes the base name of its corresponding input text file

such that each input file remains unedited.

Of these thirty web pages, I randomly assigned five to the development set,

twenty to the training set, and another five to the testing set. Because these web pages


contain multiple reports, which I will henceforth refer to as ‘datasets’, the total

number of development datasets was 24, the total number of training datasets was 94,

and the total number of testing datasets was 22.

As mentioned above, I selected four key named entities of interest—date,

location, influenza pathogen, and host organism—for annotation in accordance with

certain inclusion and exclusion criteria, which I describe next.

I annotated an alphanumeric term as a date entity if the term referred to or denoted a calendar day, month, year, holiday, or season (e.g., fall), or if it consisted of a relative temporal term, such as ‘last week’ or ‘today’. If an alphanumeric term was

synonymous with ‘influenza virus’ or any of its serotypes, it was given an influenza

pathogen annotation. Terms referring to the pathogenicity of the influenza virus (e.g.,

‘highly pathogenic’ in ‘highly pathogenic H5N1 avian influenza’) were not included in the

annotation.

I assigned location annotations to alphanumeric terms that denoted a

geographical entity (e.g., Mt. Everest) or a geopolitical entity (e.g., United States or

Beijing). I excluded from annotation any descriptor terms that precede or follow a

location term (e.g., ‘state of’ or ‘town of’) that were not part of the proper name of the

entity. For example, I would not annotate ‘state of’ in ‘state of California’ since it is not

part of the proper name of California, whereas I would annotate ‘State of’ in ‘State of

Palestine’ since it is part of the official name of Palestine. I also excluded terms for

cardinal directions (e.g., ‘north’ or ‘southern’) from annotation unless they were part of

the proper name of the location entity. For example, I would not annotate ‘north’ in

‘north Switzerland’, but I would annotate ‘North’ in ‘North America’. Additionally, if a


series of words contained location terms that were sequentially ordered (e.g.,

‘Gainesville, Florida’), I made sure to separately annotate each term that denoted an

individual location, rather than annotating the entire string as a single location (e.g., in

the previous example, ‘Gainesville’ and ‘Florida’ would each receive individual

annotations). In addition, I excluded from annotation those homonyms of a location term

that denoted a non-location entity, because annotating them would be inaccurate. For

example, I would not annotate the term ‘Washington’ as a location if it was used to

denote the federal government of the United States, rather than to denote the state of

Washington or Washington, DC. Finally, I also excluded from annotation those location

terms that belong to the proper name of a different type of entity, such as a private or

government organization, for example ‘Brazil’ in ‘Ministry of Health of Brazil’.

I assigned host organism annotations to alphanumeric terms that were 1) nouns,

and 2) referred to or denoted one or more organisms infected by influenza virus.

Because the Avian Influenza News Archive contains reports on avian influenza

outbreaks in wild and domesticated animals, as well as humans, the host organism

terms tended to be heterogeneous. I did not annotate host terms that were not in their

noun form (e.g., ‘poultry’ in ‘poultry farm’). In addition, I excluded from annotation any

descriptors that preceded the host term (e.g., ‘cage-free’ in ‘cage-free chickens’).

Finally, if a host term was followed by its scientific name in parenthesis (e.g., ‘pigs (Sus

scrofa)’), I annotated each term individually, rather than as a single host organism.

As is best practice for any NER task, I evaluated how well-defined the

annotation step is. To do so, I recruited a second annotator to annotate 15 randomly-

selected datasets that I had annotated previously. I then compared the annotation


results from both annotators and calculated an inter-annotator agreement (IAA) score from them

using Cohen’s Kappa coefficient.
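
For reference, Cohen’s kappa compares the observed agreement between the two annotators ($p_o$) with the agreement expected by chance ($p_e$):

$\kappa = \frac{p_o - p_e}{1 - p_e}$

A value of 1 indicates perfect agreement, while a value of 0 indicates agreement no better than chance.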

Model Training

Training a model is a preliminary step for any task that uses supervised statistical

learning and is defined as the process through which a machine learning algorithm

infers a function, given an annotated input dataset, for accurately predicting particular

output values in unannotated datasets. For NER, this process involves training an

algorithm on a corpus that has been manually annotated with the target feature set,

called a “gold standard corpus,” which the algorithm then uses to predict the class

labels of words or phrases in an unannotated corpus.

I used the Stanford NER toolkit for model training, evaluation, and classification

[30]. Because the output file of the brat rapid annotation tool is in the brat standoff

format, which is not accepted by the Stanford NER toolkit, I first converted the

annotated training dataset to inside-outside (IO) annotation format, which labels each

word with either an entity tag or an ‘O’ tag, using the open-source standoff2corenlp.py

Python 3 script [35]. I then converted the training dataset from the IO format to the

beginning-inside-outside format 2 (BIO2) for training and testing by specifying the

entitySubclassification property in the properties file and setting its value to ‘IOB2’ [36].

This format annotates each word in a corpus with a ‘B’, ‘I’, or ‘O’ according to the

following set of rules: 1) if a word is inside a noun phrase that represents a named entity, it receives an ‘I’ tag; 2) if a word is at the beginning of such a noun phrase, it receives a ‘B’ tag; and 3) if it is outside of a noun phrase, it receives an ‘O’ tag.


In many tasks involving the use of statistical models to make predictions,

hyperparameters are distinguished from the standard model parameters that are arrived

at during model training. These hyperparameters are not directly learned from the data

by the model, but rather the user usually sets their values prior to training. They include

a variety of different properties of the model, such as the complexity of the model or the

number of passes that the model should take on the training data. For this task, I used

the default hyperparameter values that are selected by Stanford NER.

To train a classifier, I used the following command on the command line:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop flu.prop

Here, the location of the properties file—flu.prop—was specified by the ‘-prop’

flag. After training a classifier, Stanford NER then serializes the model to a

location specified in the properties file.
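
For illustration, a minimal flu.prop might look like the sketch below. The file paths are hypothetical, and the feature flags shown are a typical subset of the NERFeatureFactory options rather than the exact set used in this project (the actual features appear in Tables 3-1 and 3-3):

    # Locations of the training data and the serialized output model
    trainFile = ner-crf-training-data.tsv
    serializeTo = flu-ner-model.ser.gz
    # Column 0 holds the word, column 1 holds the gold label
    map = word=0,answer=1

    # Emit BIO2 (IOB2) labels rather than plain IO labels
    entitySubclassification = IOB2

    # A typical subset of NERFeatureFactory feature flags
    useClassFeature = true
    useWord = true
    useNGrams = true
    maxNGramLeng = 6
    usePrev = true
    useNext = true
    useSequences = true
    usePrevSequences = true
    wordShape = chris2useLC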

Model Evaluation

To assess the performance of the NER model in assigning labels to the words in

unannotated text, researchers commonly use three measures: precision (the fraction of

assigned labels that are accurate), recall (the fraction of all entities that were accurately

labeled), and the F-score (the harmonic average of the precision and recall). Taken

together, these measures help evaluate how well a model performs at correctly labeling

words or phrases that refer to or denote named entities and non-entities, and how well it

performs at recognizing words or phrases that do not represent named entities of

interest.

To compute these measures, I first calculated the number of true positives (TP),

false positives (FP), and false negatives (FN). For any NER task, a true positive is any


named entity label that was correctly assigned to a word or phrase in the corpus, a false

positive is any entity label that was inaccurately assigned to a word or phrase in the

corpus, and a false negative is a word or phrase that refers to or denotes an entity of

interest that was incorrectly labeled with an outside tag.

Precision (p) is the ratio of the number of true positives to the sum of true positives and false positives:

$p = \frac{TP}{TP + FP}$

Recall (r) is the ratio of the number of true positives to the sum of true positives and false negatives:

$r = \frac{TP}{TP + FN}$

Finally, the F-score is the harmonic mean of precision (p) and recall (r):

$F_1 = 2 \cdot \frac{p \cdot r}{p + r}$
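
As a quick worked example, these measures can be computed directly from the raw counts; the counts below are made up for illustration:

    def evaluate(tp, fp, fn):
        # Precision, recall, and F-score from raw counts.
        p = tp / (tp + fp)
        r = tp / (tp + fn)
        f1 = 2 * p * r / (p + r)
        return p, r, f1

    # Hypothetical counts: 100 true positives, 8 false positives, and
    # 28 false negatives give p ~ 0.926, r ~ 0.781, and F1 ~ 0.847.
    print(evaluate(100, 8, 28))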

I evaluated model performance using the following command, which calculates

the precision, recall, F-score, and the total numbers of true positives, false positives,

and false negatives:

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier flu-ner-model.ser.gz -testFile ner-cfr-testing-data.tsv

Here, ‘-loadClassifier’ specifies the location of the trained model and ‘-testFile’ specifies

the location of the annotated testing dataset.

Entity Resolution

Entity resolution (ER) is the process of identifying and linking different mentions

of the same real-world entity that occur across multiple data sources. For example,


for a given influenza epidemic, there may be multiple articles about it that were serially

published over time in several different online news sources, along with multiple articles

published about other epidemics. The goal of an entity resolver for this task might

therefore be to first identify different, unique epidemics in this set of news articles, and

then link together references that are about one such epidemic. Thus, the end product

is one or more sets of linked articles (or epidemiological data if data extraction was

performed prior to the ER task), where each set of articles is about one particular

epidemic.

For any ER task, as with NER, the approach used to identify and link entities will

typically be either (1) rule-based, in which a set of conditions must be fulfilled in order to

determine whether two sets of data refer to the same entity, or (2) learning-based, in

which statistical learning algorithms are used to predict whether two sets of data refer to

the same entity. While learning-based approaches are useful for data that are

heterogeneous and unstructured, a rule-based approach using a set of rules that

leverage domain-specific knowledge is sometimes easier to implement and can produce

results that are sufficient for the goal(s) of the application.

Because the rules for ER for this task were a much smaller and easier set to

specify than the rules for NER would have been, I decided to use a rule-based

approach for the entity resolver. In particular, I constructed a set of rules for (1)

identifying all the unique epidemics that a given avian influenza epidemic report

mentions, and (2) determining whether each such epidemic is the same one mentioned

by previous or later reports.


Overall, I broke the ER task into five separate steps. The first step is a pre-

processing step that takes as input the tagged corpus from the NER task and

determines where every sentence in each report begins and ends, and then extracts the tagged

words and their associated labels. It then standardizes all date entities to the

International Organization for Standardization (ISO) 8601 format, all country acronyms

to the official country name, and certain host terms from their colloquial name(s) to their

scientific name. The second step involves a term look-up that determines the scientific

names of each tagged host entity. The third step is a term look-up that determines

whether each location entity is a country, a level 1 administrative subdivision of a country (e.g., a United States (US) state), a level 2 administrative subdivision (e.g., a US county or parish), or a city. In the fourth step, I generate epidemic tuples for every

epidemic identified in each report. The final step takes the epidemic tuples and performs

epidemic ER.

Pre-processing Step

For sentence boundary detection, the ER module determines the ending of a

sentence by a period, which is the only full stop punctuation that appears in each

epidemic report. Because each report consists of both a heading and a body, the

headings are treated as the equivalent of a sentence in the sentence boundary

detection sub-step. As such, a full stop for a heading can be either a period, a colon, or

a right parenthesis. However, occasionally the body of a report contains a colon or right

parenthesis in the middle of a sentence, which should not be treated as a full stop. To

overcome this, I only considered a colon or right parenthesis to be a full stop if it was

the first full stop identified in the report since this would indicate the end of the heading.


For each sentence within the report, upon determining its ending I extracted each

tagged word from the sentence, if it contained any, along with its respective label.
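
The sketch below illustrates this heading-aware segmentation rule; the function name is hypothetical and the logic is simplified (it does not, for instance, handle periods inside abbreviations):

    def split_report(text):
        # The first '.', ':', or ')' ends the heading; after that, only a
        # period is treated as a full stop.
        sentences, current, heading_done = [], [], False
        for ch in text:
            current.append(ch)
            is_stop = ch == "." if heading_done else ch in ".:)"
            if is_stop:
                sentences.append("".join(current).strip())
                current, heading_done = [], True
        if current:
            sentences.append("".join(current).strip())
        return sentences

    # split_report("Egypt: Two outbreaks were reported. Both were contained.")
    # -> ["Egypt:", "Two outbreaks were reported.", "Both were contained."]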

For the second sub-step of the pre-processing step, I devised a set of rules for

standardizing certain named entity terms. For dates, this set of rules ultimately

produced a date term that was formatted according to the ISO 8601 standard date

format (i.e., yyyy-mm-dd), which made it easier to compute the distance between two

dates in the final step of the ER task. In constructing this set of rules, I tried to account

for dates that were missing a year and/or a day term. For those date terms that were

missing a year, I usually assigned the year in which the report was published. In some cases, however, a report described an epidemic that occurred the previous year (e.g., a report published in January); in those cases I assigned the previous year instead, since using the year of publication would place the date in the future relative to the date of the epidemic report. For date terms that contained a month

and year, but no day, if the month in the date term was the same as that of the report,

then I applied the following rules: if the day of the report fell within the 1st to the 15th of

the month, ‘01’ was assigned as the day; if the day of the report fell after the 15th of the

month, then ‘15’ was assigned as the day. Finally, if the year within the date term was

different than that of the date of the epidemic report, then the date term was ignored, as

it is unlikely that this particular report is providing information relevant to a past

epidemic. The only exception to this rule is if the report was published at the beginning

of the year and the year within the date term denotes the previous year.
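
As a hedged sketch of these rules, assume each parsed date term arrives as a (year, month, day) triple with None for missing parts; this representation, and the default day for months other than the report's, are my own simplifications rather than the thesis code:

```python
import datetime

def standardize_date(year, month, day, report_date):
    """Apply the rules above; return an ISO 8601 string or None to ignore."""
    if year is None:
        year = report_date.year
        # A month later than the report's would place the date in the
        # future, so the term must refer to the previous year.
        if month > report_date.month:
            year -= 1
    elif year != report_date.year:
        # Ignore past-year terms unless the report was published at the
        # beginning of the year and the term denotes the previous year.
        if not (report_date.month == 1 and year == report_date.year - 1):
            return None
    if day is None:
        if month == report_date.month:
            day = 1 if report_date.day <= 15 else 15
        else:
            day = 1  # assumed default for other months
    return '{:04d}-{:02d}-{:02d}'.format(year, month, day)
```

For example, standardize_date(None, 12, 4, datetime.date(2017, 1, 10)) returns '2016-12-04', since a December date in a January report refers to the previous year.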

To account for any issues caused by the set of labels for a country not always

containing the country abbreviation, I decided to convert the abbreviations of countries


into the official name of the country before conducting the location term look-up. There

were only three countries I found that were so affected, and thus I included in this step

the United States (US, USA, etc.), the United Kingdom (UK), and the United Arab

Emirates (UAE). Similarly, for the host term look-up step, I converted host terms that

were in their colloquial form to terms that are accepted by the database I used. In

particular, I converted all terms referring to humans (e.g., man, girl, patient) to ‘humans’,

the term ‘poultry’ (which was commonly used throughout this corpus) to ‘Galliformes’

(i.e., the name of the taxonomical order that contains most species of poultry), and the

terms ‘piglet’ and ‘piglets’ to ‘Sus scrofa’ (i.e., the species name for domesticated pigs).
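
These conversions amount to two small lookup tables, sketched below; the entries shown are the ones named above, and the human-synonym list in particular is abridged:

```python
# Country abbreviations mapped to official country names.
COUNTRY_ABBREVIATIONS = {
    'US': 'United States', 'USA': 'United States',
    'UK': 'United Kingdom',
    'UAE': 'United Arab Emirates',
}

# Colloquial host terms mapped to database-accepted names.
HOST_SYNONYMS = {
    'man': 'humans', 'girl': 'humans', 'patient': 'humans',
    'poultry': 'Galliformes',   # order containing most poultry species
    'piglet': 'Sus scrofa', 'piglets': 'Sus scrofa',
}

def normalize(term, table):
    """Return the standardized form of a term, or the term unchanged."""
    return table.get(term, term)
```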

Host Population and Pathogen Term Look-up Step

For the second step of the ER component, I queried the NCBI Taxonomy

database to get the scientific name of each host population and pathogen term that was

extracted in the pre-processing step, as well as their respective NCBI Taxonomy

identifiers. The NCBI Taxonomy database is a repository of taxonomical information

about organisms, including organism names and lineage, for the organism data in the

NCBI sequence databases [37,38]. It contains information only on those species of

organisms for which sequence data exist in another NCBI database (about 10% of

known species worldwide). Each organism classification has a standard numeric

identifier that was curated by the NCBI Taxonomy database. The database is

accessible through the NCBI Entrez Application Programming Interface (API), which

provides a set of protocols and tools that allows outside developers to communicate

with its various components. To programmatically connect to and query the database, I

used the BioPython Entrez library version 1.68 [39]. If multiple identifiers were returned


for a single query, I selected the one that was located at the lowest level in the

taxonomical lineage (no queries returned a list of identifiers for organism classifications

that were located in disjoint lineages). Although it contains only 10% of known species,

I found the NCBI Taxonomy’s coverage to be 100% for the host and pathogen species

discussed in the epidemic reports.
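
The look-up itself is straightforward with Bio.Entrez; the sketch below is illustrative rather than my exact code, the email address is a placeholder, and using lineage length to pick the lowest-level classification is one way to implement the tie-break described above:

```python
from Bio import Entrez

Entrez.email = 'you@example.org'  # required by NCBI; placeholder address

def lookup_taxon(name):
    """Return (taxonomy id, scientific name) for a host or pathogen term."""
    handle = Entrez.esearch(db='taxonomy', term=name)
    id_list = Entrez.read(handle)['IdList']
    handle.close()
    if not id_list:
        return None
    handle = Entrez.efetch(db='taxonomy', id=','.join(id_list), retmode='xml')
    records = Entrez.read(handle)
    handle.close()
    # Prefer the classification lowest in the taxonomical lineage.
    record = max(records, key=lambda r: len(r.get('Lineage', '')))
    return record['TaxId'], record['ScientificName']
```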

Location Term Look-up Step

For the third step of the ER component, I used the tagged location entity terms to

query Wikidata. Wikidata is an open-access knowledge base that contains structured

data extracted from numerous resources (e.g., Wikipedia) that the Wikimedia

Foundation maintains [40]. One benefit of using Wikidata is that it provides a SPARQL endpoint that can be accessed through a web

interface or programmatically via their API. To access this endpoint and submit queries

to Wikidata, I used the Python SPARQLWrapper library version 1.8.2 [41].

As I mentioned previously, one reason for querying Wikidata in this step was to

determine whether a given location term denotes a location that is a country, an

administrative country level 1 subdivision, administrative country level 2 subdivision, a

city, or none of these. This is important because a given sentence might contain

multiple location terms that denote different administrative subdivisions where one is

part of the other. For example, if a sentence describes an epidemic that occurred in the

city of Gainesville and then later mentions the state of Florida, it is important to know

that Gainesville is a city that is part of Florida so that the entity resolver does not

incorrectly infer that two different epidemics—one in Florida and one in Gainesville—

occurred, when only one epidemic in Gainesville occurred.


Another reason for querying Wikidata is to resolve multiple references to the

same location to a single identifier. For instance, if the US is written in the text as both

‘United States’ and ‘United States of America’, I want to resolve both of these text

strings to the same country identifier. This consistency is important because, when I get

to the epidemic ER step, I will need to group epidemics by country. Doing so is much

easier with resolved country identifiers.

To determine the type of administrative unit to which the location text refers

(country, admin 1, admin 2, or city), I executed separate SPARQL queries. If the query

found a match between a location term and the Resource Description Framework

Schema label (rdfs:label) of a Wikidata location individual, it returned the English

rdfs:label for the individual and its Internationalized Resource Identifier (IRI). Thus, if the

individual was a country, this first query returned the rdfs:label and IRI for that

individual, and the ER module then moved on to the next location term.

Otherwise, it submitted a second query to determine whether the location term denotes an administrative country level 1 subdivision. If this query found a match, the rdfs:label and IRI of the individual, as well as the rdfs:label and IRI of the country that it is part of, were returned in the query results.

If this query did not return a result, the ER module submitted a third query to

determine whether the location term denotes an administrative country level 2

subdivision. If a match was found, then the rdfs:label and IRI of the individual, the rdfs:label and IRI of the country, and the rdfs:label and IRI of the administrative country level 1 subdivision that it is part of were all returned in the query results.


If this third query still did not find a match, then the ER module submitted a final

query to determine whether the term denotes a city, with a successful match returning

the rdfs:label and IRI of the individual, the rdfs:label and IRI of the country, and the

rdfs:label and IRI of the administrative country level 1 subdivision that the city is part of.

If no results were returned, then the module moved on to the next location term.
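
The sketch below illustrates the first query in this cascade (the country look-up) with SPARQLWrapper; the Wikidata identifiers used here (P31 for 'instance of' and Q6256 for the country class) are standard, but my actual queries may have been phrased differently:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = 'https://query.wikidata.org/sparql'

COUNTRY_QUERY = '''
SELECT ?place ?label WHERE {
  ?place wdt:P31 wd:Q6256 ;          # instance of: country
         rdfs:label ?label .
  FILTER(LANG(?label) = "en" && STR(?label) = "%s")
}
'''

def lookup_country(term):
    """Return (rdfs:label, IRI) for a country term, or None if no match."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(COUNTRY_QUERY % term)
    sparql.setReturnFormat(JSON)
    bindings = sparql.query().convert()['results']['bindings']
    if bindings:
        return bindings[0]['label']['value'], bindings[0]['place']['value']
    return None  # fall through to the admin-1, admin-2, and city queries
```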

Epidemic Tuple Generation

In the fourth step of this component, I generated epidemic tuples using the data

from the look-up steps. To accomplish this, I used a set of rules that I constructed from

knowledge of how each epidemic report is structured. In particular, I know that most, if

not all, epidemic reports are about one or more epidemics in one country that were

caused by one type of avian influenza pathogen.

From this knowledge, I constructed the following set of rules for generating

epidemic tuples: (1) for any given epidemic report, there is at most one influenza

pathogen type that is responsible for all epidemics described therein; (2) for any given

epidemic report, there is at most one country in which each epidemic described therein is

located; (3a) if a sentence in an epidemic report contains one or more host population

entity terms and one or more terms denoting a location that is within the country that the

report is about, assign these terms to the same epidemic; (3b) if a sentence does not

contain any location entity terms but does contain one or more host population entity

terms that are distinct from those in the previous sentence, assign those host terms to

the epidemic identified in the previous sentence; (4) if a sentence contains terms

referring to one type of host population entity and terms denoting multiple location

entities, generate an epidemic tuple for each location entity and assign them the host


term that was mentioned; and (5) if no date entity is identified in the report, assign to the

epidemic tuple the 1st of the month if the report was published between the 1st and the 15th

of the month, or the 15th day of the month if the report was published after the 15th of

the month. In addition, I assigned a “report identifier” to each tuple, which I curated

based on the date of the epidemic report and the positional index of the report in the

web page (e.g., the third report for the update published on September 28, 2017 would

receive ‘2017_09_28_3’ as an identifier).

Each epidemic tuple therefore consists of a report identifier, a date in ISO 8601

format, an influenza pathogen identifier, a location identifier, a country identifier, and a

host population identifier.
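
As an illustration, such a tuple could be represented as follows; the field names are my own, chosen for readability rather than taken from the thesis code:

```python
from collections import namedtuple

EpidemicTuple = namedtuple('EpidemicTuple', [
    'report_id',  # e.g., '2017_09_28_3' (report date plus position)
    'date',       # ISO 8601 string, e.g., '2017-09-28'
    'pathogen',   # influenza pathogen identifier
    'location',   # Wikidata IRI of the location
    'country',    # Wikidata IRI of the country
    'hosts',      # host population identifiers
])
```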

Epidemic Entity Resolution

In the fifth and final step of the ER component, I removed all epidemic tuples that

were an exact duplicate of another tuple. For the deduplication step, I first took each of

the epidemic tuples and removed any that contained an exact match with another tuple

for report identifier, influenza pathogen, location, and country, and whose host

population terms were all listed in the other tuple. In addition, I merged tuples that had

exact matches on report identifier, influenza pathogen, location, and country with

another tuple, but had different host population terms.

I then loaded all the remaining tuples into an SQLite database, and ordered them

by epidemic date, influenza pathogen, country, and location (in that order). Using the

output of this ordering, I programmatically grouped the tuples by influenza pathogen and

country, such that each tuple in a group had the same influenza pathogen term and

country term and had an epidemic date that was less than 60 days apart from at least


one other tuple in the group. The final output of this step was a set of tuple groups, each consisting of one or more tuples that are ordered by date and that (ideally) all refer to a single epidemic.
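
A minimal sketch of this chaining rule is shown below, assuming the tuples arrive already sorted by pathogen, country, and date as described, with dates as ISO 8601 strings:

```python
import datetime
import itertools

def group_epidemics(sorted_tuples, window_days=60):
    """Chain date-sorted tuples with the same pathogen and country into
    groups, starting a new group whenever a 60-day gap appears."""
    groups = []
    by_pathogen_country = lambda t: (t.pathogen, t.country)
    for _, chunk in itertools.groupby(sorted_tuples, key=by_pathogen_country):
        group, prev_date = [], None
        for t in chunk:
            date = datetime.datetime.strptime(t.date, '%Y-%m-%d').date()
            if prev_date is not None and (date - prev_date).days >= window_days:
                groups.append(group)  # gap too large: a new epidemic begins
                group = []
            group.append(t)
            prev_date = date
        if group:
            groups.append(group)
    return groups
```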

Figure 2-1. Flow chart of methods.


Figure 2-2. Example web page from the National Wildlife Health Center’s Avian Influenza News Archive. Accessed: 2018 May 29.


Figure 2-3. Images that illustrate heterogeneity in the formatting and use of punctuation in headings of reports. A) Text content of the first influenza report taken from the June 7, 2011 update. B) HTML content for the report in (A) that shows the report heading to be encased in a <b> tag with the exception of the terminal period in the heading, which is outside of this tag at the beginning of the report content. C) Text content of the first influenza report taken from the December 14, 2012 update. D) HTML content for the report in (C) that shows the report heading to be encased in a <b> tag along with its terminal period.

Figure 2-4. Example output file of NLP pre-processing step. The first line contains the heading as it appeared in the original epidemic report, while the lines below it contain the text content of the report.


CHAPTER 3 RESULTS

Web Scraper

The web scraper consists of two sub-modules. The first sub-module

(html_scraper.py), which contains 89 lines of code, downloads the HTML content of the

web pages containing avian influenza epidemic reports and stores them locally. The

second sub-module (scraper.py), which contains 245 lines of code, parses out the text

content of each report and stores it in structured JSON files. The web scraper ultimately

parsed 442 web pages in total. I excluded three pages that were exact duplicates of

previous reports.

Named-entity Recognition

Annotation Guideline Validation

Based on two annotators using the annotation guideline to annotate 15 reports,

the initial round of validation produced an inter-annotator agreement value of 0.8798. Because the value was

high, no re-validation was necessary. However, there are two disagreements that arose

between both annotators that are worth mentioning. The first is related to how to

annotate multiple location or host organism names that occur sequentially in a

sentence. Consider, for example, the sentence, “Avian influenza was isolated from

samples taken from two pigs (Sus scrofa) that died on a farm in Gainesville in Alachua

County, Florida.” With respect to location names, the disagreement centered on

whether to annotate the entire string ‘Gainesville in Alachua County, Florida’ or each

location name that was in the string separately (i.e., ‘Gainesville’, ‘Alachua County’, and

‘Florida’). Similarly, for host organism names, the disagreement was on whether to

annotate the entire string ‘pigs (Sus scrofa)’ or each host name individually (i.e., ‘pigs’


and ‘Sus scrofa’). Because I want the NER module to identify each individual location

name within a report, I revised the guideline to clarify that each individual location and

host organism name needs to be annotated in such circumstances.

The second disagreement was on whether the word ‘virus’ as the singular

reference to an influenza pathogen in a report should be annotated or not. For example,

one sentence might say, “Lab testing confirmed that the virus was of the highly

pathogenic form.” Ultimately, I decided to clarify in the annotation guideline that it should

not be annotated. This is because extracting this term in the data extraction process

provides little to no benefit since I assume that each epidemic report is going to be

about some avian influenza virus.

CRF Model Training

For the initial round of training, I used a baseline set of features (Table 3-1). I

then tested the trained model against the development datasets, and calculated the

number of true positives, false positives, and false negatives for each named entity, and

subsequently the precision (p), recall (r), and F-score of the model (Table 3-2).
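
For reference, a model with these features can be trained by writing them to a Stanford NER properties file and invoking the CRFClassifier class; the sketch below does this from Python, with the training file, model name, and jar path as placeholders rather than the names I actually used:

```python
import subprocess

PROPERTIES = '''\
trainFile = train.tsv
serializeTo = flu-ner-model.ser.gz
map = word=0,answer=1
maxLeft = 1
useClassFeature = true
useWord = true
useNGrams = true
maxNGramLeng = 2
usePrev = true
useNext = true
useWordPairs = true
useSequences = true
usePrevSequences = true
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
'''

with open('ner.prop', 'w') as f:
    f.write(PROPERTIES)

# Train the CRF model using the Stanford NER jar.
subprocess.check_call(['java', '-cp', 'stanford-ner.jar',
                       'edu.stanford.nlp.ie.crf.CRFClassifier',
                       '-prop', 'ner.prop'])
```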

Of the four named entities, the model performed best—as measured by F1—at

classifying influenza terms (p=0.9250, r=0.9136, F1=0.9193), while its worst performance

was on locations (p=0.7500, r=0.6316, F1=0.6857). The second and third best

performances, respectively, were on host organisms (p=0.8375, r=0.8272, F1=0.8323)

and dates (p=0.9167, r=0.7097, F1=0.8000).

For the second round of training, I added three features to the set of baseline

features (Table 3-3). The second model performed considerably better at classifying

locations (p=0.8182, r=0.8289, F1=0.8235), and had modest improvements in precision


for dates (p=0.9565). However, this came at a small cost to its performance with respect

to classifying host organisms and influenza (p=0.8472, r=0.7531, F1=0.7974; and

p=0.8916, r=0.9136, F1=0.9024, respectively).

CRF Model Testing

I selected the second trained model for the NER task due to the considerable

improvement to performance with classifying locations, relative to the first trained

model. For testing, I ran the model on the testing datasets and evaluated its

performance by calculating the precision (p), recall (r), F-score, and number of true

positives, false positives, and false negatives for each named entity (Table 3-2).

Most notably, there was a considerable increase in the precision for host

organisms from that of the second round of development (0.8472 to 1.0000). One

possible explanation for this is that the training data may have contained more unique

host organism names as a proportion of the total number of host organism names in the

data, despite random sampling. The small size of the overall dataset likely would have

contributed to this effect. Despite this, the overall metrics showed little change to the F-

score (0.8397 to 0.8463), a modest increase in precision (0.8627 to 0.9220), and a

modest decrease in recall (0.8178 to 0.7821), when compared to the results for the

training data (Table 3-2). The increase in precision to 1.0 for dates and hosts was likely

offset by a reduction in the recall for locations (0.8289 to 0.7632) such that the overall

effect on the F-score was minimal.

Entity Resolution

I performed entity resolution on the tagged corpus that was generated by the

NER module. From the 1,963 reports that the web scraper module extracted, the entity


resolution module produced at least one epidemic tuple for 1,192, generating 3,461

epidemic tuples in total. After the deduplication and merge steps, 2,524 epidemic tuples

remained. From these 2,524 tuples, the entity resolution module identified 1,144

epidemic individuals, each of which contains the identifiers of the reports that described

it, the range of dates in which the epidemic occurred, the influenza pathogen that was

its cause, the locations where it occurred, the country in which it occurred, and the host

organisms that were infected.

From these data, I was able to further analyze the number of avian influenza

epidemics that occurred in each country from November 7, 2006 to September 28, 2017

(Figure 3-1), the number of epidemics that occurred in each year (Figure 3-2), the

number of times each host organism was involved in an epidemic (Figure 3-3), and the

number of times each influenza subtype was the cause of an epidemic (Figure 3-4).
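
Each of these counts follows directly from the grouped output of the entity resolver; for example, the per-country tally behind Figure 3-1 can be computed with a sketch like this, which relies on all tuples in a group sharing a country identifier:

```python
from collections import Counter

def epidemics_per_country(epidemic_groups):
    """Tally one count per resolved epidemic toward its country."""
    # Every tuple in a group has the same country, so the first suffices.
    return Counter(group[0].country for group in epidemic_groups).most_common()
```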

In total, there were 68 countries that were the location of an epidemic from

November 7, 2006 to September 28, 2017, 52 different types of host organisms that

were a participant in an epidemic, and 17 different avian influenza subtypes that were

the cause of epidemics in this time period. China was the most common location

(n=213 identified epidemics or 18.62%); birds were the most common host (n=535

identified epidemics or 46.77%); and H5N1 was the most common influenza subtype

(n=735 identified epidemics or 65.25%).


Table 3-1. Baseline set of features selected for the first round of training from the Stanford NER NERFeatureFactory.

Feature             Selected Value
maxLeft             1
useClassFeature     true
useWord             true
useNGrams           true
maxNGramLeng        2
usePrev             true
useNext             true
useWordPairs        true
useSequences        true
usePrevSequences    true
useTypeSeqs         true
useTypeSeqs2        true
useTypeySequences   true
wordShape           chris2useLC


Table 3-2. Summary of the CRF model performance for the first and second rounds of training, and the final round of testing.

Round 1
Entity           P       R       F1      TP   FP   FN
Date             0.9167  0.7097  0.8000   22    2    9
Host organism    0.8375  0.8272  0.8323   67   13   14
Influenza        0.9250  0.9136  0.9193   74    6    7
Location         0.7500  0.6316  0.6857   48   16   28
Totals           0.8508  0.7844  0.8162  211   37   58

Round 2
Entity           P       R       F1      TP   FP   FN
Date             0.9565  0.7097  0.8148   22    1    9
Host organism    0.8472  0.7531  0.7974   61   11   20
Influenza        0.8916  0.9136  0.9024   74    9    7
Location         0.8182  0.8289  0.8235   63   14   13
Totals           0.8627  0.8178  0.8397  220   35   49

Final
Entity           P       R       F1      TP   FP   FN
Date             1.0000  0.6957  0.8205   16    0    7
Host organism    1.0000  0.7083  0.8293   68    0   28
Influenza        0.9516  0.9516  0.9516   59    3    3
Location         0.8056  0.7632  0.7838   58   14   18
Totals           0.9220  0.7821  0.8463  201   17   56

P=precision; R=recall; F1=F-score; TP=true positive; FP=false positive; FN=false negative.

Table 3-3. Additional features added to the baseline features for the second round of model training.

Feature                   Selected Value
entitySubclassification   "IOB2"
noMidNGrams               true
useDisjunctive            true


Figure 3-1. Number of avian influenza epidemics by country from 2006 to 2017 (top 15 shown).

Figure 3-2. Number of avian influenza epidemics by year from 2005 to 2017.


Figure 3-3. Number of avian influenza epidemics by host from 2006 to 2017 (top 15 shown).

Figure 3-4. Number of avian influenza epidemics by influenza pathogen from 2006 to 2017.


CHAPTER 4 DISCUSSION

My goal for this project was to transform unstructured text data in web reports

about avian influenza outbreaks into query-able, structured data about individual

epidemics. I developed a method to extract data about dates, locations, host

populations, and influenza pathogens from these reports, and then use those data to

link multiple reports about the same epidemic over a period of time. This method

identified 1,144 individual epidemics that were reported in the USGS Avian Influenza

Archive from November 7, 2006 to September 28, 2017 and that occurred globally in

human, avian, and other animal species.

Given the small size of the training dataset, the overall performance of the NER

classifier on the testing dataset for all four named entities is noteworthy (p=0.9220,

r=0.7821, F1=0.8463). Both date and host organism had high scores for precision

(p=1.0000 for each), however this came at the cost of recall for both (r=0.6957 and

r=0.7083, respectively). For date, the number of true positives was 16 and the number

of false negatives was seven. Of these seven false negatives, three were for relative

terms for dates (e.g., ‘last week’, ‘this year’). Meanwhile, for host organism the number

of true positives was 68 and the number of false negatives was 28. Most of these false

negatives occurred with host terms that were uncommon in the corpus (e.g., ‘cat’,

‘Oriental White eyes’, ‘black-crowned night heron’).

Of the four named entities, influenza pathogen had the best results (p=0.9516,

r=0.9516, F1=0.9516). This was probably due to the lack of heterogeneity in the terms

used to refer to avian influenza and avian influenza subtypes. Of the three false

negatives, two were for terms that were rarely used in the entire corpus (i.e., ‘swine flu’


and ‘seasonal influenza’). Meanwhile, the model mistakenly labeled ‘H5N1’ in ‘H5N1

vaccine’ and ‘2009’ in ‘H1N1 (2009)’ as influenza pathogens.

On the other hand, location had the poorest results of the four named entities

(p=0.8056, r=0.7632, F1=0.7838). Of the 14 false positives, four occurred on the

commas separating two location names, as in the string ‘Detroit, Michigan’, while the

rest were on names of people, organizations, or dates. Most of the false negatives

occurred on terms denoting villages, cities, and continents that were mentioned fewer

than two times in the training and testing datasets.

Taken together, although the performance of the NER model was satisfactory,

the results suggest that a larger dataset may contribute to improvements in

performance. Nevertheless, that my model achieved an F-score of 0.8463 with such a

small training dataset has promising implications for the application of CRF models to

epidemiological data extraction tasks, especially ones that involve names of pathogens

that are new or rare for which few text-based datasets exist.

Another interesting result is the number of reports in which an epidemic individual

was identified (1,192). This means that out of 1,963 total reports, the entity resolver was

only able to identify complete mentions of epidemics (date, location, pathogen, host) in

60.72% of all reports. One reason for this result might be that many of the reports in the

web pages were not about epidemics, but instead were about research breakthroughs

and initiatives, ecological reports, and public health guidelines, to name a few. It is

possible, then, that many of these reports did not contain terms denoting locations or

host organisms. Because an epidemic tuple is not generated if one or both of these

named entities is not identified, this would have partly contributed to this discrepancy.


Classification errors in the NER step also likely contributed, as the misclassification or

omission of a label for a named entity term would also affect whether a host organism

and location are identified in a report. In addition, spelling mistakes in the host organism

and location terms would cause the term look-up steps for each to fail for those terms,

which results in an epidemic tuple not being generated even if the host organism or

location named entities were correctly labeled in the NER step. Finally, because the

entity resolver only performed a location look-up for certain types of geographical

regions (countries, administrative level 1 country subdivisions, administrative level 2

country subdivisions, and cities), many location terms that denoted things like towns

and villages were overlooked. Including these types of locations in the future may

therefore lead to an increase in the number of identified epidemics.

The end result of the toolset I developed and its application to the USGS avian

influenza reports is a dataset about avian influenza epidemics. One end goal of creating

this dataset of discrete data about unique epidemics—with links to the reports that

discuss them—was to enable the application of standard data analysis techniques

(including epidemiological analysis) to understand the burden of avian influenza better.

I demonstrated that my final avian influenza epidemic dataset enables these

capabilities by conducting a basic analysis of the locations, hosts, pathogens, and dates

of the epidemics.

The 1,144 identified epidemics occurred in 68 countries, with the majority being

clustered within Southeast Asia. By far, China had the largest number of identified

epidemics at 213, while Indonesia had the second highest number at 120. These results

are not surprising, given that avian influenza is endemic to Southeast Asia.


With respect to the number of epidemics by year, the results show three peaks

during the time period covered by the reports. The first peak is in 2009, which coincides

with the H1N1 swine influenza pandemic. Indeed, the overwhelming majority of H1N1

epidemics that were identified in the reports (13 out of 16) occurred in 2009, although

this hardly accounts for why the number of identified epidemics (114 epidemics) was so

high for this year. The second peak was in 2011, which had 134 identified avian

influenza epidemics. The third peak (also the highest peak) occurred in 2015 and 2016,

with 149 epidemics each. Interestingly, Chatziprodromidou et al. [42] performed a

systematic review of the scientific literature and ProMED reports about avian influenza

epidemics from 2010 to 2016 and found that 2016 had the highest number of avian

influenza epidemics in that time period (144 in total), while 2015 had the second highest

number (142 in total). My results are remarkably consistent with these numbers.

The host populations mentioned in the epidemic reports mainly fell into three

groups—birds, humans, and pigs. With respect to birds, chickens (Gallus gallus), ducks

(Anas), and geese (Anser) were the most common hosts mentioned in the epidemic

reports (Aves is the class for birds, and Galliformes is the order that contains most

species of poultry). This result is not surprising since chickens are some of the most

common birds grown on poultry farms, which frequently are the source of avian

influenza epidemics [43]. Furthermore, waterfowl, which include wild ducks and geese,

are considered to be common reservoir hosts for low pathogenic avian influenza

subtypes [44]. Interestingly, as with the results for the number of epidemics by year,

these findings align with those of Chatziprodromidou et al. [42], which found that avian


influenza epidemics from 2010 to 2016 affected commercial poultry more than any other

type of host.

The breakdown of the number of epidemics by avian influenza subtype is

interesting in that the overwhelming majority (65.2%) of identified epidemics involve the

H5N1 subtype (735 identified epidemics). Although Chatziprodromidou et al. [42] also

found that H5N1 was the most common subtype in reported epidemics, they calculated

that it was the causal pathogen in only 38.2% of avian influenza epidemics from 2010 to

2016. This percentage differs considerably from my result (65.2%). Even when limiting

my epidemic dataset to the time period 2010 to 2016, the discrepancy is still large:

56.9% vs. 38.2%. One possible reason for this discrepancy might be that the USGS

placed greater emphasis on curating epidemic reports for H5N1 due to the public health

risk that it poses and to the high economic burden it places on the poultry industry within

the United States and globally, relative to other avian influenza subtypes. Much of the

concern surrounding H5N1 has been placed particularly on the highly pathogenic form,

which has the capability to infect several different host species, including humans, and

has a high mortality rate.

Limitations

Taken together, my results illustrate the feasibility of identifying and linking

individual epidemics that are reported by multiple epidemic reports over time using the

methods that I have described here. However, there are several limitations to this work

that need to be considered if these methods are to be extended to other online data

sources or to other types of pathogens.


One notable limitation is the small size of the training dataset, which may have

negatively impacted the performance of the NER model. Whereas I used 94 datasets for

training a model, many NLP tasks typically have a training dataset that is two or more

orders of magnitude larger. Despite this limitation, the NER model that I trained still

achieved good overall results for precision (0.9220), recall (0.7821), and F1 score

(0.8463). One explanation for this performance might be that a well-defined and clear

annotation guideline contributed to consistent and precise labeling of named entities in

the training dataset, which therefore contributed to a better-trained model. Because

inconsistencies in how certain named entities, such as locations, are labeled can

negatively affect a model’s ability to identify instances of them in text, it is possible that

minimizing such inconsistencies helped to boost the overall performance of the model.

Also, because the reports were almost entirely constrained to describing avian influenza

epidemics (and I excluded many reports that did not), it is likely that there is significant

regularity to the text in the reports that makes the task simpler than other NER tasks.

Whether these results would transfer to other online sources like ProMED-mail, which

for example is not restricted to influenza pathogens and to the subset of hosts that they

infect, is uncertain.

In addition, online text-based data may be more prone to spelling and

typographical errors, relative to official sources, that can make the process of textual

data extraction more challenging and negatively impact results. With respect to the

methods used here, spelling mistakes in location or host names might have resulted in

those terms not returning results in their respective look-up steps in the entity resolver.

For example, in one report there were two mentions of Kaohsiung City—one that had


the correct spelling and one that incorrectly spelled it as ‘Kaosiung City’. A separate

report misspelled it as ‘Kaoshiung City’. The most likely consequence of misspellings

(and the fact that I did not attempt to apply any spelling correction techniques) is that my

tools would not identify the epidemic in that location or in that host population. The

effect of misspellings overall would be to decrease the sensitivity of my toolset for

detecting epidemics.

Additional limitations exist with respect to the generalizability of the methods and

tools that I have described. Due to the unique formatting of the web pages that contain

the epidemic reports, the web scraper that I have described here would not likely be

able to extract the text content of epidemic reports from other sources of data. In

addition, it has been previously shown that there are issues with the generalizability of

learning-based NER models, especially in diverse domains with limited training data

[45]. Given the diversity in the online data sources that are used for disease

surveillance, the NER model that I trained on a subset of these epidemic reports likely

would not perform as well on a different corpus, especially a more general one like

ProMED-mail, which is unrestricted with respect to pathogens and hosts (as opposed to

the pathogens and hosts involved with avian influenza). As such, if one were to extend

these methods to other data sources, it would be necessary to develop, train, and

evaluate a new web scraper and NER model.

Another limitation is that any classification errors that arise in the NER task

almost certainly affect the performance of the entity resolver later on. Although such

errors are unavoidable, the misidentification of certain named entities as being another

type of named entity or as not being any of the four types of named entities likely


resulted in several epidemics being omitted from the results. With respect to host

organism and location, this is especially true. For date entities, this type of error would,

at best, produce an epidemic date that is off by several days and, at worst, would

likely cause an epidemic tuple to not be linked to the epidemic that it contains data on.

Likewise, imperfections in the entity resolution task might also significantly skew

the results, as insufficient resolution could lead to over-counting epidemics and over-

resolution could lead to under-counting epidemics. With respect to the latter, if multiple

epidemics were reported to have occurred within 60 days of each other in countries that

have a large geographical area, it is possible that they may have been resolved to a

single epidemic, even if they occurred far from each other and therefore are different

epidemics. Because I did not measure the performance of the entity resolution task,

there is no way of knowing the exact degree to which my results truly reflect the number

of epidemics that were mentioned by the epidemic reports. Nevertheless, I am

reasonably confident that the epidemics that my entity resolver identified do represent

real epidemics, especially given how well my results align with those of

Chatziprodromidou et al. [42]. Even so, it would benefit future extensions

of this work to properly evaluate the performance of the entity resolver to ensure that

the most accurate count of epidemics is being achieved.

Future Work

There are several different ways in which this work can be improved upon and

extended in the future. As I already mentioned, one way is to evaluate the performance

of the entity resolver at identifying and linking epidemic individuals. One way to achieve

this task is to recruit a domain expert to manually extract each individual epidemic that


is mentioned by a set of epidemic reports, and then compare the expert’s results to the

results of the entity resolver on that particular set of reports.

In addition, rather than using a rule-based approach to entity resolution, I could

test more sophisticated approaches that employ graph similarity measures or machine

learning, and then compare the results to those that I achieved here. The extent to

which any improvement to the results would be noticeable, however, is uncertain,

especially if the results that I arrived at here come close to the number of epidemics that

were reported on.

One task that I did not attempt is the extraction of epidemiological data, such as

the number of cases or the mode or frequency of interspecies transmission, from the

reports. As these data are frequently of high value to epidemiologists, the ability to

extract them in an automated fashion would be beneficial.

Another approach that might increase the utility of this work for others would be

to store the extracted data about these epidemics as linked data on the Semantic Web.

This would help to improve the machine-readability and standardization of the data, the

latter of which is noticeably lacking in the field of computational epidemiology.

Moreover, this might also help to improve the ability to integrate the dataset with other

resources in the future. Nevertheless, my final epidemic data set does use standard

identifiers for host and pathogen, formats dates according to international standards

(ISO 8601), and uses a popular Semantic Web database (i.e., Wikidata) for locations.

Finally, it would be interesting to extend these methods to other data sources,

such as ProMED-mail or online news aggregators, to see if similar results can be

achieved. Doing so would be the ultimate test of generalizability, as it would illustrate


the effectiveness of these methods at extracting data from multiple data sources that

differ greatly in format.


APPENDIX ANNOTATION GUIDELINE

The purpose of this guideline is to provide a standard, to be used by the annotator as a reference, for annotating the given text corpus. For this particular task, the annotator will receive a set of 15 “datasets” that each comprise a brief report on an influenza outbreak. It is important that the annotator read through each report carefully so as not to miss or misidentify any of the four entities that will be outlined below. If at any point the annotator is unsure of whether a term deserves an annotation or not, they can refer back to this guideline.

Entities of interest:

• Date

• Location

• Influenza pathogen

• Host organism

For each of the aforementioned entities, the following rules for annotation apply:

General

• Always annotate abbreviations that refer to or denote one of the above entities.

• For terms that consist of more than one word, always annotate the entire set of words as one term. For example, ‘high pathogenicity H5N1 avian influenza virus’ would receive one annotation.

• The plural form of a word can be annotated.

Date

• Any alphanumeric term that refers to or denotes a day, month, year, season, or holiday, as well as relative temporal terms, such as “last week” or “today.”

Location

• Any alphanumeric term that refers to or denotes a geographical entity, such as ‘Lake Erie’ or ‘Mt. Everest’.

• Any alphanumeric term that refers to or denotes a geopolitical entity, such as ‘United States of America’ or ‘Paris’.

o Include countries, states/provinces, cities, towns, etc.

• DO NOT annotate any descriptors that precede a term if they are not part of the proper name of that term, for example ‘state of’ in ‘state of California’ would not be annotated, but ‘State of’ in ‘State of Palestine’ would.

• DO NOT annotate any cardinal directions, such as ‘northwest’ or ‘southern’, unless they are part of the proper name of that term. For example, ‘north’ in ‘north Switzerland’ would not be annotated, but ‘North’ in ‘North America’ would.

• DO NOT annotate any homonyms of a location term that denote some other entity that does not meet the first two criteria (this requires that you pay careful attention to the context that a term is used in). For example, ‘Washington’ can denote either Washington, DC or Washington state, in which case the term would


be annotated; however, it can also be used to denote the federal government of the United States, and therefore would not receive one.

• DO NOT annotate a term that belongs to the proper name of a different type of entity, such as the name of a company, private organization, or government organization or branch. For example, ‘Canada’ in ‘Ministry of Health of Canada’ would not be annotated.

• DO NOT annotate a series of locations that occur in sequence to one another (e.g., ‘Gainesville, FL’) as a single location. Instead, annotate each location name separately.

Influenza pathogen

• Any alphanumeric term that is synonymous with ‘influenza virus’ or any of its serotypes.

• Any alphanumeric term that refers to the pathogenicity of the virus (i.e., ‘high pathogenicity’ or ‘low pathogenicity’, or any variant thereof).

Host organism

• Any alphanumeric term that is in the noun form that refers to or denotes one or more host organisms. For example, ‘poultry’ in ‘H7N9 was responsible for the death of 8 poultry’ would be annotated, but ‘poultry’ in ‘poultry farm’ would not.

• DO NOT annotate any descriptors that precede a term that do not have any taxonomical meaning, such as ‘cage-free’ in ‘cage-free chickens’ or ‘wild’ in ‘wild birds’.

• DO NOT annotate a series of host organisms that occur in sequence to one another (e.g., ‘pigs (Sus scrofa)’) as a single host organism. Instead, annotate each organism name separately.


LIST OF REFERENCES

1 Ginsberg J, Mohebbi MH, Patel RS, et al. Detecting influenza epidemics using search engine query data. Nature 2009;457:1012–4. doi:10.1038/nature07634

2 Polgreen PM, Chen Y, Pennock DM, et al. Using internet searches for influenza surveillance. Clin Infect Dis Off Publ Infect Dis Soc Am 2008;47:1443–8. doi:10.1086/593098

3 Santillana M, Nguyen AT, Dredze M, et al. Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance. PLoS Comput Biol 2015;11:e1004513. doi:10.1371/journal.pcbi.1004513

4 Lazer D, Kennedy R, King G, et al. Big data. The parable of Google Flu: traps in big data analysis. Science 2014;343:1203–5. doi:10.1126/science.1248506

5 Hickmann KS, Fairchild G, Priedhorsky R, et al. Forecasting the 2013-2014 influenza season using Wikipedia. PLoS Comput Biol 2015;11:e1004239. doi:10.1371/journal.pcbi.1004239

6 McIver DJ, Brownstein JS. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS Comput Biol 2014;10:e1003581. doi:10.1371/journal.pcbi.1003581

7 Nagar R, Yuan Q, Freifeld CC, et al. A case study of the New York City 2012-2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. J Med Internet Res 2014;16:e236. doi:10.2196/jmir.3416

8 Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS Curr 2014;6. doi:10.1371/currents.outbreaks.90b9ed0f59bae4ccaa683a39865d9117

9 Chowell G, Cleaton JM, Viboud C. Elucidating Transmission Patterns From Internet Reports: Ebola and Middle East Respiratory Syndrome as Case Studies. J Infect Dis 2016;214:S421–6. doi:10.1093/infdis/jiw356

10 Mykhalovskiy E, Weir L. The Global Public Health Intelligence Network and early warning outbreak detection: a Canadian contribution to global public health. Can J Public Health Rev Can Sante Publique 2006;97:42–4.

11 Freifeld CC, Mandl KD, Reis BY, et al. HealthMap: Global Infectious Disease Monitoring through Automated Classification and Visualization of Internet Media Reports. J Am Med Inform Assoc JAMIA 2008;15:150–7. doi:10.1197/jamia.M2544

12 Anema A, Kluberg S, Wilson K, et al. Digital surveillance for enhanced detection and response to outbreaks. Lancet Infect Dis 2014;14:1035–7. doi:10.1016/S1473-3099(14)70953-3


13 Cleaton JM, Viboud C, Simonsen L, et al. Characterizing Ebola Transmission Patterns Based on Internet News Reports. Clin Infect Dis 2016;62:24–31. doi:10.1093/cid/civ748

14 Cauchemez S, Fraser C, Van Kerkhove MD, et al. Middle East respiratory syndrome coronavirus: quantification of the extent of the epidemic, surveillance biases, and transmissibility. Lancet Infect Dis 2014;14:50–6. doi:10.1016/S1473-3099(13)70304-9

15 Health Threats Unit at Directorate General Health and Consumer Affairs of the European Commission. MedISys (Medical Intelligence System). http://medisys.newsbrief.eu/medisys/homeedition/en/home.html (accessed 30 Jun 2018).

16 Yu VL, Madoff LC. ProMED-mail: An Early Warning System for Emerging Diseases. Clin Infect Dis 2004;39:227–32. doi:10.1086/422003

17 National Wildlife Health Center. USGS National Wildlife Health Center - Avian Influenza News Archive. https://www.nwhc.usgs.gov/disease_information/avian_influenza/ (accessed 5 Jul 2018).

18 The Python Language Reference — Python 3.5.5 documentation. https://docs.python.org/3.5/reference/ (accessed 29 May 2018).

19 Beautiful Soup website. https://www.crummy.com/software/BeautifulSoup/? (accessed 11 Jun 2018).

20 WHO | Definitions: emergencies. WHO. http://www.who.int/hac/about/definitions/en/ (accessed 11 Jun 2018).

21 Collier N, Doan S, Kawazoe A, et al. BioCaster: detecting public health rumors with a Web-based text mining system. Bioinformatics 2008;24:2940–1. doi:10.1093/bioinformatics/btn534

22 Jimeno A, Jimenez-Ruiz E, Lee V, et al. Assessment of disease named entity recognition on a corpus of annotated sentences. BMC Bioinformatics 2008;9:S3. doi:10.1186/1471-2105-9-S3-S3

23 Leser U, Hakenberg J. What makes a gene name? Named entity recognition in the biomedical literature. Brief Bioinform 2005;6:357–69. doi:10.1093/bib/6.4.357

24 Tsai RT-H, Sung C-L, Dai H-J, et al. NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006;7:S11. doi:10.1186/1471-2105-7-S5-S11


25 Zhao S. Named entity recognition in biomedical texts using an HMM model. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications. Association for Computational Linguistics 2004. 84–7. doi:10.3115/1567594.1567613

26 Mohit B. Named Entity Recognition. In: Natural Language Processing of Semitic Languages. Springer, Berlin, Heidelberg 2014. 221–45. doi:10.1007/978-3-642-45358-8_7

27 Kapetanios E, Tatar D, Sacarea C. Natural Language Processing: Semantic Aspects. In: Natural Language Processing: Semantic Aspects. CRC Press 2013. 298.

28 Nadeau D, Sekine S. A survey of named entity recognition and classification. Lingvisticae Investig 2007;30:20.

29 Lafferty J, McCallum A, Pereira FCN. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. 2001:10.

30 Finkel JR, Grenager T, Manning C. Incorporating non-local information into information extraction systems by Gibbs sampling. Association for Computational Linguistics 2005. 363–70. doi:10.3115/1219840.1219885

31 Chinchor N. Message Understanding Conference (MUC) 7 LDC2001T02. 2001. https://catalog.ldc.upenn.edu/LDC2001T02 (accessed 6 Jul 2018).

32 Ratinov L, Roth D. Design challenges and misconceptions in named entity recognition. Association for Computational Linguistics 2009. 147. doi:10.3115/1596374.1596399

33 Doğan RI, Leaman R, Lu Z. NCBI disease corpus: A resource for disease name recognition and concept normalization. J Biomed Inform 2014;47:1–10. doi:10.1016/j.jbi.2013.12.006

34 Stenetorp P, Pyysalo S, Topić G, et al. brat: a Web-based Tool for NLP-Assisted Text Annotation. In: Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics. Avignon, France: Association for Computational Linguistics 2012. 102–107. http://www.aclweb.org/anthology/E12-2021 (accessed 30 May 2018).

35 standoff2corenlp.py. https://gist.github.com/thatguysimon/6caa622be083f97b8c5c9a10478ba058 (accessed 30 May 2018).

36 Ratnaparkhi A. Maximum Entropy Models For Natural Language Ambiguity Resolution. 1998. https://repository.upenn.edu/ircs_reports/60


37 Federhen S. The NCBI Taxonomy database. Nucleic Acids Res 2012;40:D136–43. doi:10.1093/nar/gkr1178

38 Sayers EW, Barrett T, Benson DA, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2009;37:D5-15. doi:10.1093/nar/gkn741

39 Bio.Entrez module. https://biopython.org/DIST/docs/api/Bio.Entrez-module.html (accessed 9 Jul 2018).

40 Vrandečić D, Krötzsch M. Wikidata: A Free Collaborative Knowledgebase. http://korrekt.org/page/Wikidata:_A_Free_Collaborative_Knowledgebase (accessed 9 Jul 2018).

41 Fernández S, Tejo C, Herman I, et al. SPARQLWrapper: SPARQL Endpoint interface to Python. https://rdflib.github.io/sparqlwrapper/ (accessed 9 Jul 2018).

42 Chatziprodromidou IP, Arvanitidou M, Guitian J, et al. Global avian influenza outbreaks 2010–2016: a systematic review of their distribution, avian species and virus subtype. Syst Rev 2018;7. doi:10.1186/s13643-018-0691-z

43 Sims LD, Domenech J, Benigno C, et al. Origin and evolution of highly pathogenic H5N1 avian influenza in Asia. Vet Rec 2005;157:159–64.

44 Garamszegi LZ, Møller AP. Prevalence of avian influenza and host ecology. Proc R Soc Lond B Biol Sci 2007;274:2003–12. doi:10.1098/rspb.2007.0124

45 Augenstein I, Derczynski L, Bontcheva K. Generalisation in named entity recognition: A quantitative analysis. Comput Speech Lang 2017;44:61–83. doi:10.1016/j.csl.2017.01.012


BIOGRAPHICAL SKETCH

Matt’s major was in medical sciences with a concentration in biomedical

informatics. He graduated with a Master of Science in the summer of 2018 while

working as a research assistant for the Informatics Services Group of the Models of

Infectious Disease Agent Study.