
SIDEFFECTIVE - SYSTEM TO MINE PATIENT REVIEWS: SIDE EFFECT EXTRACTION

BY Sangeetha Rajagopalan

A thesis submitted to the

Graduate School—New Brunswick

Rutgers, The State University of New Jersey

in partial fulfillment of the requirements

for the degree of

Master of Science

Graduate Program in Computer Science

Written under the direction of

Prof. Tomasz Imielinski

and approved by

New Brunswick, New Jersey

May, 2011


Abstract of the Thesis

Sideffective - System to mine patient reviews: Side Effect Extraction

by Sangeetha Rajagopalan

Thesis Director: Prof. Tomasz Imielinski

Sideffective is a system to crawl, rank, and analyze patient testimonials about side effects from common medications. Since the wealth of any mining model is its data corpus, the data collection phase involved extensive crawling of massive medical websites comprising user forums on the internet. Subsequently, the raw files were subjected to site-specific parsing routines, yielding outputs conforming to a well-defined data model. Currently, the system holds close to 400,000 user testimonials pertaining to more than 2500 drugs/medicines. Sideffective aims to gather and aggregate this wealth of information, build useful associations, and present interesting observations and numeric validations, all in a user-friendly interface. The important issues that we have tried to tackle are: extracting side effects without relying on pre-built lists, aggregating the distribution of different side effects for a given drug, site-specific search, ranking, and determining the negativity of reviews.

The main focus of this thesis is the extraction and discovery of side effects from a user's review of a drug. Apache Lucene's Shingle Analyzer, which extracts terms and their frequencies, was used to generate more than 7 million phrases, out of which the top 25,000 terms with frequencies greater than 100 were chosen for discovering side effects. After eliminating the syntactically incorrect phrases, our method calculates the frequency of occurrence of each term in a medical websites domain versus a purely non-medical websites domain, which proves to be highly effective in extracting side effects. Using this technique, more than 600 unique side effects reported by users have been discovered without using any fixed lists. The extracted list is also used to mine and summarize patients' reviews. The aggregation and distribution tables we built effectively determine the top reactions exhibited by various drugs, and the reverse mapping of the same demonstrates the symptom-to-drug associations. Our system also eliminates synonymous side effects, as well as cures falsely appearing as possible side effects.


Acknowledgements

I would like to thank Prof. Tomasz Imielinski for all the valuable support and

constant encouragement throughout the period of my graduate study. He has

been exceptionally motivational at every step and provided the right guidance in

achieving results. I have learnt a great deal from him in my time here, and I am

very grateful to him for having given me this opportunity to work on this thesis

project.

I would like to thank Prof. Apostolos Gerasoulis for his invaluable inputs and

suggestions. His guidance during the early stages of the project helped me to steer

the work in the right direction. I would also like to thank Prof. Alex Borgida for

his sharp and insightful ideas during our discussion. I feel greatly privileged to

have had an opportunity to interact with him.

Finally, I would like to thank my project partner Deepak Yalamanchi for all the

support and encouragement. I am also greatly thankful to my parents, sister and

my friends at Rutgers University for being my pillars of strength.


Dedication

This work is dedicated to my parents and sister for their constant encouragement

and invaluable support.


Contents

Abstract
Acknowledgements
Dedication
List of Figures
List of Tables

1 Introduction
  1.1 Problem Scope
  1.2 Problem Statement
  1.3 Related Work

2 Data Collection
  2.1 Choice of Websites
  2.2 Crawling
    2.2.1 What is Web Crawling?
    2.2.2 HTTrack Web-Crawler
    2.2.3 Gathering Data for Sideffective
  2.3 Parsing HTML
    2.3.1 Why parse raw files?
    2.3.2 HTML Parser JAVA Library

3 Data Model and Associations
  3.1 Extraction of useful data
  3.2 Creating Association and dependency models
    3.2.1 Defining Database Tables
    3.2.2 Populating database tables

4 Data Corpus Harvesting
  4.1 Side effect discovery
    4.1.1 Building n-grams
    4.1.2 Preliminary Filtering of phrases
    4.1.3 Term Extraction Algorithm
  4.2 Determining Top side effects
    4.2.1 Section I: Top Side-effects
    4.2.2 Section II: Graphical Representation
    4.2.3 Section III: User Testimonials
  4.3 Determining Top Drugs for each Symptom

5 Discussion
  5.1 Challenges & Solutions
    5.1.1 Eliminating Cures
    5.1.2 Synonymous Side-effects
  5.2 Profoundness Score

6 Conclusions and Future Work

7 Bibliography

List of Figures

1 Basic web-crawler architecture
2 A sample screenshot demonstrating the first step in HTTrack Crawler, where the website to be crawled and the action to be performed are specified
3 A sample screenshot demonstrating the options screen of HTTrack
4 Sample screenshot demonstrating the final step of the HTTrack web crawler
5 A portion of the webpage on the site www.medications.com from which user reviews have been extracted
6 A portion of the webpage on the site www.askapatient.com from which user reviews have been extracted, with a different page structure
7 Code snippet demonstrating the Parser class to eliminate HTML tags and metadata
8 Process-flow overview of Side-effect Extraction
9 Hierarchy of Lucene classes used in creating the index list
10 Sample screenshot showing the various features represented for the drug Xanax
11 Sample pie chart representing the distribution of side-effects for Xanax
12 Pie chart distribution for drugs reporting Dizziness as a side-effect

List of Tables

1 Sample websites chosen to create the medical and non-medical data domains
2 Sample subset of phrases with corresponding Google counts from the medical and non-medical web domains
3 Top 20 side effects reported by patients for Xanax and their corresponding frequency percentages
4 Sample reverse mapping from Side-effect → Drug for Dizziness
5 Sample drugs and their top side-effects which are indicators of issue 1 faced
6 Top 10 side-effects of Zoloft and their corresponding Pf-scores
7 Top common side-effects for drugs in the Anti-Depressants category
8 Other major side-effects of the drugs in the Anti-Depressants category

1 Introduction

1.1 Problem Scope

Today's generation is often termed the Digital Era. Before the advent of the Internet, the traditional doctor-patient relationship was considered the most reliable source of information for a patient. However, recent studies show a shift in the role of the patient from passive recipient to active consumer of health information, with the Internet acting as the catalyst of this shift. Patients no longer depend solely on certified doctors for treatment advice.

The use of the Internet as a source of medical information has become increasingly

popular as more patients go online. According to a recent United States survey,

52 million adults have used the World Wide Web to obtain health or medical

information. Back in 2005, an estimated 88.5 million adults were using the Internet

to research health information and/or health-related products and to communicate

with providers. This number has increased by leaps and bounds today. Access to large

amounts of medical information is available through an estimated 20,000 to 100,000

health-related Web sites.

In another such study, of the 1289 patients participating, 65% reported

access to the internet; age, sex, race, education, and income were each significantly

associated with internet access. It was found that a total of 74% of those with

access had used the Internet to find health information for themselves or family

members. This clearly is a huge shift from conventional ways where the patient-

provider relationships will probably change, and medical providers will face new

challenges as patients obtain health information from the Internet, share only

some of this with their physicians, or potentially turn to the Internet instead

of consulting a health care provider. The World Wide Web, in all its vastness,

has contributed to extensive rise of online forums and discussion groups that has

Page 11: SIDEFFECTIVE - SYSTEM TO MINE PATIENT REVIEWS: SIDE …

2

changed the perspective of these ’e-patients’, thereby responsible for this paradigm

shift.

1.2 Problem Statement

As patients increasingly turn to the internet for advice, the number of websites providing such information is also growing exponentially. More often than not, this deluge of information is unstructured and overwhelming to an average internet user. This Master's thesis project therefore aims to gather and aggregate this wealth of information available on the web, build useful associations, and present interesting observations and numeric validations, all in a user-friendly interface.

Our main focus was to gather as much data as possible about various drugs and medications available in the market and design an algorithm to automatically extract the side-effects reported for each of them, based solely on patient testimonials.

The goal is to provide an unbiased/non-opinionated aggregation of information. In

the process, we have also focused on developing a friendly user interface to present

our observations and interesting results in a manner most appealing to the general audience. The following are some of the major contributions of this thesis:

• A side-effect extraction algorithm focusing on a "most frequent" metric rather than the usual "most serious" consideration.

• Building associations between various drugs and their side-effects in a manner most relevant for presentation.

• Creating distribution models based on the associations to show top side-effects

for each drug and the reverse mapping of top drugs reporting a particular

side-effect.

• Defining a metric called the Profoundness Score to perform comparative analysis between various drugs and categories of drugs.

Apart from the above, we also discovered a set of drawbacks while developing this kind of system. While we provide solutions to a couple of challenges, we believe

that some of the setbacks act as foundations for future research on this topic.

1.3 Related Work

Terminology mining, term extraction, term recognition, or glossary extraction,

is a subtask of information extraction. The goal of terminology extraction is to

automatically extract relevant terms from a given corpus. There are mainly two

categories of approaches in Term Extraction:

• Linguistic approach

• Statistical and Machine Learning approach

Early approaches to automatic term extraction focused on information-theoretic measures based on mutual information for detecting collocations [Manning and Schuetze, 1999]. Collocations are expressions composed of two or more words whose meaning is not easy to guess from the meanings of the component words. There are nuances in the detection of collocations that require linguistic criteria to resolve [Justeson and Katz, 1995].

It is a common practice to extract candidate terms using a part-of-speech (POS)

tagger and an automaton (a program extracting word sequences corresponding to

predefined POS patterns). Usually, those patterns are manually handcrafted and

target noun phrases, since most of the terms of interest are noun phrases [Justeson

and Katz, 1995]. Typical examples of such patterns can be found in [Jacquemin,

2001]. As pointed out in [Justeson and Katz, 1995], relying on a POS tagger and

legitimate pattern recognition is error prone, since taggers are not perfect. This

might be especially true for very domain-specific texts, where a tagger is likely to be more erratic. Another key problem is that of nesting, where subsets of consecutive words of multi-word terms would satisfy the statistical criteria for "termhood" but would not themselves be called terms.

Secondly, purely statistical systems [Church & Hanks, 1990; Dunning, 1993; Smadja,

1993; Shimohata, 1997] extract discriminating multiword terms from text corpora

by means of association measure regularities. Although highly effective, these

methods depend on training data sets with manually identified terms or patterns.

Apart from these two, hybrid methodologies [Enguehard, 1993; Justeson, 1993;

Daille, 1995; Heid, 1999] define co-occurrences of interest in terms of syntactical

patterns and statistical regularities. However, by reducing the searching space to

groups of words that correspond to pre-defined syntactical patterns, such systems

do not deal with a great proportion of terms and introduce noise in the retrieval

process.

Since we deal with the very specific domain of a patient-testimonial-driven medical data corpus, our aim was to extract side effects from the reviews without a standard list of symptoms collected from various sources. Purely linguistic techniques would not be effective either, since we do not restrict our term extraction to specific parts of speech (nouns, adjectives, or verbs). Also, most of the existing methods are restricted to unigrams or, at most, bigrams in terminology identification. We believe that going beyond that restriction yields a set of interesting and rare side effects reported by patients that can be valuable.

The rest of the thesis is organized as follows: Chapter 2 explains in detail our data collection process and the methodology and tools used for the same. In Chapter 3, we discuss the data model created from the collected data and the related associations built from it; this forms the foundation for all subsequent phases of the project. The most important part is Chapter 4, where the actual side-effect extraction algorithm is discussed, followed by our distribution model. Chapter 5 deals with some of the challenges faced and the solutions implemented to tackle them. Finally, we provide concluding remarks and future work in Chapter 6.


2 Data Collection

The wealth of any mining model is the corpus of data. Since the early 90s, the

World Wide Web has grown to become one of the largest repositories of human

knowledge. It is an inter-connected document network of content conforming to

different formats, topics and types. Although the web data is highly heterogeneous,

unstructured and often redundant, it is one of the most commonly used reference repositories, mainly because of its broad variety of information and easy accessibility.

Since our work is based on user reviews of various medications, we mainly target

online discussion forums as our primary source of data.

An online discussion forum is a web community that allows people to discuss com-

mon topics, exchange ideas, and share information in a certain domain, such as

sports, movies, medicines, politics, travel, cars and so on. In most of these forums,

users either start new threads to begin discussion or reply to existing threads.

Large repositories of archived threads and reply records in online discussion fo-

rums contain a great deal of human knowledge on many topics. Although forums

contain rich first-hand information from the reviewers, it is noteworthy that they

are highly unstructured and do not conform to any fixed grammatical or syntactical standards of language. We aim to collect such user-oriented data and extract useful information from it in a straightforward yet highly effective manner.

The data collection phase involves three major tasks, namely:

• Choice of websites

• Web Crawling

• Parsing of crawled files

All three stages are of equal importance as they form the basis of our data model.

We examine each of these in greater detail in the following sections.


2.1 Choice of Websites

Choosing the right websites to collect information from is of utmost importance, as it determines the quality of data used in the experiments and also has a huge impact on the results observed. As internet websites are the source of our data, there were a few main considerations in choosing the websites to crawl, namely:

1. Contain User testimonials

2. Large volume of the sites

3. High traffic to the sites

Based on the above factors, the following are some of the sites which were crawled:

• www.drugs.com

• www.medications.com

• www.askapatient.com

• www.dailystrength.org

• www.rxlist.com

• www.drugratingz.com

The main consideration is to find sites which carry patient testimonials. This is in accordance with our goal of determining the side-effects experienced by most patients rather than what is reported by the drug manufacturing companies. Therefore,

we surveyed the internet extensively to find sites conforming to this requirement.

The second consideration is to pick sites which are larger in volume and have more pages, which is an indirect indication of how much data the site collects and how active it is in general. To determine the volume of a site, we perform

a simple Google site-search (a query of the form "site:drugs.com") and get a rough estimate of the number of pages in

that site. Although not all the pages carry patient testimonials, it is still a good

metric to determine the volume of the site. Some examples are:


• www.medications.com: 187,00 pages

• www.drugs.com: 554,000 pages

• www.askapatient.com: 258,000 pages

• www.drugratingz.com: 100,000 pages

• www.rxlist.com: 320,000 pages

Finally, as a last factor, each site's estimated monthly traffic from Quantcast was used to determine the sites with high popularity among internet users. Higher traffic also implies that the content is more recent and fresh. Quantcast is a media measurement service that lets advertisers view audience reports including

traffic, demographics, geography, site affinity and categories of interest on millions

of websites and services. Some examples for the sites of our interest include:

• www.medications.com: 65,000 hits/month

• www.drugs.com: 7.2M hits/month

• www.askapatient.com: 352,000 hits/month

• www.drugratingz.com: 16,500 hits/month

• www.rxlist.com: 3.5M hits/month

Therefore, considering all the above factors, a set of 12 websites were chosen and

crawled for patient testimonials. The upcoming sections describe the next steps in

detail.

2.2 Crawling

2.2.1 What is Web Crawling?

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated manner and bulk-downloads web


pages. Other terms for Web crawler are ants, automatic indexers, bots, Web spiders

or Web robots. This process is called Web crawling or spidering. Many sites, in

particular search engines, use spidering as a means of providing up-to-date data.

Web crawlers are mainly used to create a copy of all the visited pages for later

processing by a search engine that will index the downloaded pages to provide fast

searches. Crawlers can also be used for automating maintenance tasks on a Web

site, such as checking links or validating HTML code. Also, crawlers can be used

to gather specific types of information from Web pages, such as harvesting e-mail

addresses (usually for spam).

Predominantly, crawlers today are used for the following purposes:

• They are one of the main components of web search engines, systems that

assemble a corpus of web pages, index them, and allow users to issue queries

against the index and find the web pages that match the queries.

• Web archiving, where large sets of web pages are periodically collected and

archived for posterity.

• Web data mining, where web pages are analyzed for statistical properties, or

where data analytics is performed on them.

Our work is an example of the third category, where crawlers collect the data necessary for extracting useful information from the dataset.

In general, a crawler starts with a list of URLs to visit, called the seeds. As the

crawler visits these URLs, it identifies all the hyperlinks in the page and adds

them to the list of URLs to visit, called the crawl frontier. URLs from the frontier

are recursively visited according to a set of policies. The policies specific to this project are discussed in the following paragraphs.
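The seed-and-frontier loop just described can be sketched in Java using only the standard library. This is an illustrative simplification of our own (the class name, the regex-based link extraction, and the pluggable fetch function are all inventions for this sketch), not the crawler actually used in this project; a real crawler would also normalize URLs, respect robots.txt, and throttle its requests.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal sketch of the seed/frontier crawl loop. The fetch step is passed in
// as a function so the loop itself stays independent of networking.
public class CrawlSketch {

    // Matches href="..." or href='...' and captures the link target.
    private static final Pattern HREF =
            Pattern.compile("href=[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    // Extract candidate hyperlinks from an HTML page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Visit URLs starting from the seeds: pop a URL from the frontier, fetch
    // its page, and push every newly discovered link back onto the frontier,
    // until the frontier empties or the page budget runs out.
    static Set<String> crawl(List<String> seeds, Function<String, String> fetch, int maxPages) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>(seeds);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already crawled this URL
            }
            String page = fetch.apply(url);
            for (String link : extractLinks(page)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

In our setting, the frontier would additionally be filtered by site-specific URL patterns so that only forum pages carrying patient reviews are downloaded.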

Figure 1 depicts a typical Web crawler architecture, which schedules, queues, and downloads web pages from the internet cloud.


Figure 1: Basic web-crawler architecture

2.2.2 HTTrack Web-Crawler

The crawler used in the data collection phase of this project is HTTrack, a free and open-source Web crawler and offline browser developed by Xavier Roche and licensed under the GNU General Public License. It allows one to download

World Wide Web sites from the Internet to a local computer. By default, HTTrack

arranges the downloaded site by the original site’s relative link-structure. The

downloaded (or ”mirrored”) website can be browsed by opening a page of the

site in a browser. HTTrack can also update an existing mirrored site and resume

interrupted downloads. HTTrack is fully configurable by options and by filters

(include/exclude), and has an integrated help system. HTTrack uses a Web crawler

to download a website. Some parts of a website may not be downloaded by default due to the robots exclusion protocol, unless this behavior is disabled in the program's options.

HTTrack is very flexible and provides a wide range of options which can be set

to suit the requirements. Since sites are usually massive, these options come in

quite handy to allow crawling of selected pages which carry the data needed. If

you ask it to, and have enough disk space, it will try to make a copy of the whole Internet on the local computer.

Figure 2: A sample screenshot demonstrating the first step in HTTrack Crawler, where the website to be crawled and the action to be performed are specified.

Hence it is important to understand and use these

options provided by the Crawler in the right manner. The following section briefly

describes some of the features of HTTrack which have proved useful in our work.

2.2.3 Gathering Data for Sideffective

The first step is to identify the websites which need to be crawled. The internet hosts a plethora of medical websites, both government-approved and non-governmental third-party ones. But since we were interested only in user-generated forums, the web had to be scanned extensively to hand-pick a list of sites which hosted discussion forums on different medications.

The next step is to observe the URLs of each site and its corresponding forums, and identify patterns in these URLs to be given as input to HTTrack. The software also provides the flexibility of choosing among different action options, including:

• Download websites

• Continue interrupted download

• Update existing download


The last option proved particularly crucial, since websites are dynamic in nature and get updated constantly. Periodically running the crawler with this action option gives the advantage of updating the content and maintaining content freshness. The engine will go through the complete structure, checking each

downloaded file for any updates on the web site. Figure 2 shows a sample screen

snapshot of this actions page.

The next step involves setting a range of options according to the project require-

ments. Some of the important options are discussed below:

a Proxy Options: The engine can use a default HTTP proxy for all FTP transfers. Most proxies allow this, and if you are behind a firewall, this option will allow you to easily catch all FTP links. Besides, FTP transfers managed by the proxy are more reliable than the engine's default FTP client. This option is checked by default.

b Scan Rules: Filters (scan rules) are the most important and powerful option

that can be used: one can exclude or accept subdirectories, skip certain types

of files, and so on.

c Limits Options: A very important section which can set maximum mirroring

depth, maximum external depth, maximum transfer rate, maximum connec-

tions per second and a whole set of other such options.

d Flow Control Options: This is used to set the number of connections, timeout, retries, minimum number of connections, etc.

e Spider Options: This can be set to accept cookies, follow or ignore robots.txt rules, check document types, etc.

The above are only some of the most important options; numerous other settings can be modified as well, such as Log, Index, and Cache options, MIME options, Browser ID options, and Build and Link options.


Figure 3: A sample screenshot demonstrating the options screen of HTTrack

Finally, once all the required options are set, the Crawler begins to download the

pages to the specified folder on the local machine. The project download can be

aborted anytime by hitting the cancel button. Also, the download can be resumed

by starting the project again and picking "Continue interrupted download" from

the menu on the Mirroring Mode page.

2.3 Parsing HTML

2.3.1 Why parse raw files?

The output from a crawler is massive amounts of raw HTML files. These files

contain HTML tags and a lot of other metadata. To make this usable, these files

have to be processed and cleansed to extract the required information, and also structured to conform to a data model.

Firstly, it is useful to understand the structure of the HTML pages which have

been crawled. Different sites use different layout architectures on their websites.


Figure 4: Sample screenshot demonstrating the final step of HTTrack web crawler

As an example, the webpage snippets represented in Fig. 5 and Fig. 6 show portions of the webpages of two different sites which we have crawled.

The snapshot shown in Fig. 5 is merely a section of the entire webpage. The page contains a lot of other site-specific data which is not useful for our purposes and hence must be eliminated. Another such example is provided in Fig. 6, which is from a different site (www.askapatient.com) and has an entirely different layout structure to organize the patient reviews.

The challenge, therefore, lies in taking these raw HTML files and converting them into a usable format, irrespective of the site-specific layout structures. For this purpose, we have used Java's HTML Parser library, which provides a wide range of options to parse these files. The next section briefly deals with the Java library and its

functionalities.


Figure 5: A portion of the webpage on the site www.medications.com from which user reviews have been extracted.

Figure 6: A portion of the webpage on the site www.askapatient.com from which user reviews have been extracted, with a different page structure.


2.3.2 HTML Parser JAVA Library

HTML Parser is a Java library used to parse HTML in either a linear or nested

fashion. Primarily used for transformation or extraction, it features filters, visitors,

custom tags, and easy-to-use JavaBeans. It is a fast, robust, and well-tested package. The two fundamental features provided by this library are Extraction and

Transformation. Extraction encompasses text extraction, link extraction, screen

scraping, resource extraction, line checking and site monitoring, while Transfor-

mation URL rewriting, site capture, censorship, HTML cleanup, ad removal and

conversion to XML. Transformation preserves the output file format as HTML

while Extraction does not.

The library provides an HTML Lexer and an HTML Parser API. The lexer provides

low level access to generic string, remark and tag nodes on the page in a linear,

flat, sequential manner, whereas the parser provides access to a page as a sequence

of nested differentiated tags containing string, remark and other tag nodes. The

parser attempts to balance opening tags with ending tags to preserve the structure

of the page, while the lexer simply spits out nodes. As our application requires

knowledge of the nested structure of the page, we use the HTML parser to generate

clean text files.

The code snippet in Fig 7 shows the usage of the Parser class:

Lines 1-4 iterate over all the HTML files from the crawler and perform a preliminary check to verify whether each is indeed a file or a directory. Lines 5-10 read each file and append every line to a string. Lines 11-13 are the part where the HTML Parser library is used to create a new Parser object from the HTML string. An HtmlPage object is created using the parser object of line 11. Finally, the parser executes the visitAllNodesWith() method, which performs a depth-first traversal of each node in the page to eliminate the HTML tags.


Figure 7: Code Snippet demonstrating the Parser class to eliminate HTML tags and metadata
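The routine of Figure 7 depends on the HTML Parser jar; the following is a rough, stdlib-only sketch of the same pipeline (iterate over the crawled files, strip the tags, write clean text files), in which a regex-based stripTags stands in for the library's Parser/visitAllNodesWith() traversal. The directory layout and file extensions are assumptions for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.stream.Stream;

// Stdlib-only approximation of the Figure 7 routine: the actual thesis code
// builds an org.htmlparser.Parser per file and strips tags with a depth-first
// visitAllNodesWith() traversal; here a regex stands in for that step so the
// sketch runs without the HTML Parser jar.
public class TagStripper {

    // Remove script/style blocks and all tags, then collapse whitespace.
    public static String stripTags(String html) {
        String text = html
            .replaceAll("(?is)<script.*?</script>", " ")
            .replaceAll("(?is)<style.*?</style>", " ")
            .replaceAll("(?s)<[^>]+>", " ");
        return text.replaceAll("\\s+", " ").trim();
    }

    // Iterate over every crawled .html file under crawlDir (the file vs.
    // directory check of lines 1-4 is handled by Files::isRegularFile)
    // and emit one clean .txt file per page.
    public static void stripDirectory(Path crawlDir, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        try (Stream<Path> paths = Files.walk(crawlDir)) {
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.toString().endsWith(".html"))
                 .forEach(p -> {
                     try {
                         String html = Files.readString(p);
                         Files.writeString(outDir.resolve(p.getFileName() + ".txt"),
                                           stripTags(html));
                     } catch (IOException e) {
                         throw new UncheckedIOException(e);
                     }
                 });
        }
    }
}
```

Unlike the library's visitor-based traversal, a regex cannot handle malformed nesting, which is one reason the thesis relies on a real parser.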

Therefore, by the end of this phase, the following goals were met:

• Identified websites with large repositories of user generated reviews for drugs.

• Used HTTrack Crawler to gather all web pages in these identified sites which

carry the user testimonials.

• Implemented Java's HtmlParser API to parse raw HTML files, thereby generating a massive repository of clean text files.

The following sections describe at length the subsequent steps involved in post-processing this data further to create a data model.


3 Data Model and Associations

In the previous section we discussed at length the process of data Crawling and

Parsing. In this chapter we examine the next logical steps involving post-processing

raw text files and creating database tables using them. This data model forms the

building block for the User Interface and its features which are explained in the

next chapter.

There are two main categories of discussion in creating the data model viz.,

• Extraction of information for Database model

• Creating Associations and Dependencies

We discuss each of these in the following sections.

3.1 Extraction of useful data

Data for this work has been collected from more than 10 medical review websites

on the internet. As illustrated in Fig. 5 and Fig. 6, each website has a different representation of the data it carries. This makes it very difficult to run a common

routine to extract the necessary information from the webpage. The extraction

routine has to be customized specifically for each website after careful examination

of the data organization structure that it holds.

We first examine a few examples of raw text files from 3 different websites. The

methodology for creating database tables from them is described based on these

examples. The following are some sample snippets:

• www.askapatient.com

• www.drugratingz.com

• www.medications.com


In each of these snippets, the relevant portion is only the part where patients describe their experience with the drug. In the first example, askapatient.com structures its reviews as a set of fields: Ratings, Reason, Side Effect, Comments, Sex, Age, Time Taken and Date Added. Of all these fields, we identify and extract only those corresponding to side effects and comments. Observing the pattern in the second snippet, taken from drugratingz.com, the site presents 4 different ratings followed by the patient testimonial, 2 links and a Date Added field. In the final example, from medications.com, the website presents Date, Year, Time, Review and Link fields. Additionally, it is important to note that the examples presented above are only snippets from the actual raw text files. Each file therefore carries a lot of metadata about the site before and after these snippets, and it is essential to clean this metadata as well. Finally, it is essential to correlate each raw text file to the drug name associated with it.

The aim of this phase is to create a relevant Database model based on the infor-

mation extracted from these crawled websites. The first table we create, which is


the Master table has the following specification:

drugname varchar(50): As the name suggests, this holds the name of the drug to which the review corresponds.

user varchar(150): Most websites crawled for the data were forums wherein users have a unique username. This field does not carry any special functionality and has been included for ease of implementation.

link varchar(500): This field carries the link to the webpage from which the particular review was extracted. A small detail to note is that the links for some reviews may no longer resolve, as users and administrators constantly update the websites, posing the problem of outdated links.

review varchar(2000): This field is the most vital attribute, as it holds the actual testimonial written by the user for the specific drug. A noteworthy feature here is that the text is highly unstructured, with no specific format. Therefore, cleansing the data sufficiently before using it for any analysis is very important.

Considering all the above factors, the following steps were performed to translate

the text files to the database table described above.

Step I: Identify a fixed recurring pattern in text files corresponding to each website.

Step II: Extract only the recurring patterns into an array, say A. Based on the

website, copy the fields in the array corresponding to the user testimonials, website

links and user name (if applicable).

Step III: Identify the location of the drug name specified in each file, which is usually in the HTML header of the webpage.

Step IV: Once all the fields are identified, insert them into the Master table. After our experiments, the Master table contained nearly 400,000 unique patient reviews collected from over 10 websites, with information about nearly 2500 drugs. This table forms the foundation for the next steps of creating associations


and dependencies between the drugs and their side effects.
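Steps I-IV above can be sketched as follows. The "Drug:" header line and the pipe-separated row pattern are hypothetical stand-ins for the real, hand-tuned per-site recurring patterns, and Step IV's database insert is replaced here by simply collecting the rows; the Review class mirrors the Master table columns.

```java
import java.util.*;
import java.util.regex.*;

// Sketch of Steps I-IV for one (hypothetical) site layout.
public class MasterTableLoader {

    // One Master-table row: drugname / user / link / review.
    public static final class Review {
        public final String drugName, user, link, review;
        public Review(String drugName, String user, String link, String review) {
            this.drugName = drugName; this.user = user;
            this.link = link; this.review = review;
        }
    }

    // Step I/II: a made-up recurring pattern "user | link | testimonial".
    private static final Pattern ROW =
        Pattern.compile("^(?<user>[^|]+)\\|(?<link>[^|]+)\\|(?<review>.+)$");

    // Step III: assume the drug name sits on a "Drug: <name>" header line.
    private static final Pattern DRUG = Pattern.compile("^Drug:\\s*(?<name>.+)$");

    // Step IV would INSERT each record into the Master table; here we collect.
    public static List<Review> parseFile(List<String> lines) {
        String drug = "unknown";
        List<Review> rows = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            Matcher d = DRUG.matcher(line);
            if (d.matches()) { drug = d.group("name").trim(); continue; }
            Matcher m = ROW.matcher(line);
            if (m.matches())
                rows.add(new Review(drug, m.group("user").trim(),
                                    m.group("link").trim(), m.group("review").trim()));
        }
        return rows;
    }
}
```

Each real site required its own ROW pattern, chosen after inspecting that site's raw text files.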

3.2 Creating Association and dependency models

The stages till now describe the way websites were processed to extract webpages,

parse HTML to extract text files and further subject the raw text files to certain

site-specific routines to create the Master Table for the Drug database. In this sec-

tion, we discuss the way we use this Master table to define more useful associations

which are Drug-centered in nature. At the end of this phase, we aim to achieve a well-structured database model that fuels our Graphical User Interface in a way that is both efficient and user-friendly. First, we describe the structure

of the Database tables desired. Next we describe the routine for populating these

tables.

3.2.1 Defining Database Tables

Any web-based application is supported in the backend by a database schema

holding all the data. It is of utmost importance to design these tables in a way that is most efficient for querying and retrieval. Some of our design considerations

were:

• Ease of understanding the schema

• Ease of querying the tables

• Simplicity of the schema

• Striking the right balance between static and dynamic calculations


Keeping in mind the above considerations, the schema consists of the following

tables:

UnigramSE This table has the various drugs and, correspondingly, the different unigram side effects found in their user reviews. This table also has the frequency of each unigram for each drug. The structure of the table is as follows:

DrugName: varchar(200) : This is the field with the name of the drug.

Effect: varchar(200) : This field holds the unigram side effect itself.

Frequency: int(50) : Frequency stands for the number of times a particular unigram has been mentioned as a side effect in the context of a particular drug.

NGramSE This table is very similar to the UnigramSE table described above. Just like unigrams, this table holds n-gram side effects.

DrugName: varchar(200) : This is the field with the name of the drug.

Effect: varchar(200) : This field holds the n-gram side effect itself.

Frequency: int(50) : Frequency stands for the number of times a particular n-gram has been mentioned as a side effect in the context of a particular drug.

Categories This table is a simple collection of records with just 2 fields.

DrugName: varchar(200): This represents the name of the medicine

Category: varchar(200): This represents the class to which the drug belongs. Ex:

Anti-depressants, Analgesics, Beta blockers etc.

3.2.2 Populating database tables

Having defined the tables, we now examine the way these tables are populated with

relevant data. The UnigramSE and NGramSE tables are populated by iterating

over the Master Table and identifying the side-effects mentioned for each drug in

every testimonial. This process requires a list of all side-effects which can be used


to cross-reference across the reviews. The way this fixed list is built is of utmost importance to this work and is discussed at length in Chapter 5. For the purpose of the discussion in this section, we assume that we are provided with a nearly comprehensive list of all possible side-effects, say List L.

The module fetches all the rows of (drugname, review) from the Master table and

iterates over every row, processing the review to extract medical terms/side effects. The first step involves the removal of words that do not add any semantic value to the sentence. For this purpose, there is a stop list (%stop_list) of words, which is used to eliminate the common words of English, which are definitely deemed non-medical.

There is another list of words, List L, which is a collection of medical side effects and symptoms collected using the extraction algorithm described in later sections. This list is used as a reference against each review to bring out the drug-to-side-effect mapping for that drug. A drug-to-side-effect mapping in a particular review is counted only once, even if the review contains multiple occurrences of the same mapping.

Only n-gram side effects with n ≤ 4 are considered in all experiments.

• The mappings with unigrams in them are stored in the table UnigramSE.

• The mappings with longer n-grams (2 ≤ n ≤ 4) in them are stored in the table NGramSE.

Now to look at the methodology:

For each review, the first step is to remove all punctuation marks, convert all letters to lower case and remove leading spaces. This is done using the regular expression features of the Perl programming language. The next loop considers every word of the review and checks whether it is a stop-list word. If it is, it is ignored. If not, the first sub-step checks whether the word exists in the unigram list. If it does, the next check is whether the word already exists in the table. If


it does, it is ignored; otherwise, the word is inserted into the unigram table with a frequency of 1. The second sub-step is the exact same procedure for bigrams, inserting into the bigram table. The only difference is that every two consecutive words are considered instead of single words. The above procedure takes place every time a new drug row comes into consideration.

Other than the first occurrence of a drug, all other rows are handled a little differently. The initial clean-up and pre-processing remain the same. The difference is that every word is checked to see whether it has already been inserted from a previous review. If yes, a second check is made to see whether the word has already appeared in the same review. If yes, it is ignored; if no, the frequency is incremented by 1. The same procedure takes place with bigrams as well, although two consecutive words are considered at a time.
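The population loop for unigrams can be sketched as follows (the thesis implements it in Perl; this is a Java sketch of the same logic, with `sideEffects` standing in for the unigram portion of List L and `stopWords` for the %stop_list hash):

```java
import java.util.*;

// Count a drug-to-side-effect mapping at most once per review, accumulating
// the frequency across all reviews of that drug.
public class FrequencyCounter {

    public static Map<String, Map<String, Integer>> countUnigrams(
            List<String[]> drugReviewRows,          // each row: {drugname, review}
            Set<String> sideEffects,                // unigram portion of List L
            Set<String> stopWords) {
        Map<String, Map<String, Integer>> freq = new HashMap<>();
        for (String[] row : drugReviewRows) {
            String drug = row[0];
            // pre-processing: strip punctuation, lower-case, split on whitespace
            String[] words = row[1].toLowerCase()
                                   .replaceAll("[^a-z\\s]", " ")
                                   .trim().split("\\s+");
            Set<String> seenInThisReview = new HashSet<>();
            for (String w : words) {
                if (stopWords.contains(w) || !sideEffects.contains(w)) continue;
                if (!seenInThisReview.add(w)) continue;   // count once per review
                freq.computeIfAbsent(drug, k -> new HashMap<>())
                    .merge(w, 1, Integer::sum);
            }
        }
        return freq;
    }
}
```

The bigram procedure is identical except that it slides a window of two consecutive words over the cleaned review.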

Finally, the Categories table is populated by crawling a couple of sites providing

information about Drug Categories. This table is quite simple in its structure and

has no hidden complexities involved in building it. It was built to cater to a specific

feature of the User interface involving Drug Categories and comparisons, which is

discussed in Section 6.


4 Data Corpus Harvesting

In this section, we discuss in detail the main contributions of this thesis work,

including Side-effect Extraction without using fixed-lists, mining domain specific

knowledge with respect to each Drug and each side-effect reported by patients in

their testimonials.

4.1 Side effect discovery

Terminology extraction is the task of automatically extracting relevant terms from a corpus. The extracted terms are used to build domain-specific ontologies and associations, also acting as the data dictionary for the later steps. In general, techniques for term extraction involve either constructing linguistic rules based on the corpus or statistical metrics evaluating the probability of a phrase being a valid term in context. The other approach is to use Machine Learning algorithms, where a small set of terms is manually validated and used as a training set to classify other terms.

In this project, we aim to extract a unique list of side-effects, derived purely from patient testimonials, without using a standard list of FDA/NIH-specified side effects. Our approach uses a simple but effective search-engine validation technique to separate medical phrases from non-medical phrases. We also target n-gram phrases of size up to 4, rather than just unigrams and bigrams, in an effort to build a more comprehensive set of side effects.

Figure 8 gives an overview of the process flow. The following are the phases in building the side-effect library from the medical data corpus:

• Index all patient testimonials to create Master List of n-grams

• Apply preliminary filters to preprocess the master list of terms


Figure 8: Process-flow overview of Side-effect Extraction

• Apply the Google Search Validation algorithm to extract medical terms from the corpus

We discuss each of these steps in detail in the subsequent sections.

4.1.1 Building n-grams

The data corpus has close to 400,000 patient testimonials. It is therefore an important first step to build an index spanning the entire set. Before we delve into the details of building the n-gram index, it is useful to understand a few key concepts, described below:

N-Gram: An n-gram is a subsequence of n items from a given sequence of items. The items in question can be anything: letters, numbers or words, though most commonly n-grams are made up of character or word/token sequences. By this definition, the sub-sequence "experience severe" from the sequence "I experience severe muscle cramps" is an n-gram.

Different kinds of n-grams have received their own notations:

• Unigram: A unigram is when the n-gram size is 1. That is, there is only one

item in the sub-sequence.

• Bigram: A bigram is when the n-gram size is 2. Following the previous pattern, there are two items in the sub-sequence.

• Trigram: A trigram is when the n-gram size is 3. As before, there are three items in the sub-sequence.

• N-gram: The generic term n-gram is typically used when the size is 4 or above. That is, there are 4 or more items in the sub-sequence.

Shingles: A shingle is just a word-based n-gram, as opposed to a character-based

n-gram. They are widely used to create pseudo-phrases during the indexing process

since the shingle ends up being a single token, which is then subject to the normal

TF-IDF scoring. In many cases, searching for phrases yields relevance improvements, but finding phrases at query-time can be more expensive than normal term

queries, so in such cases it is a common practice to use shingles. Non-trivial word

n-grams (aka shingles) extracted from a document can also be useful for document

clustering.

Apache Lucene: It is an open-source search and indexing technology framework,

and has become quite popular within the last couple of years. In the initial stage,

only one implementation existed in the Java programming language. There now

exist several ports to other programming languages including PHP.

Apache Lucene is not a single search application which can be installed and executed. It is a complete search engine library containing all the necessary functions to both index and search a document collection. With this library one can create a search engine which complies with user-defined requirements and other special needs. The library is very simple to use and, despite its simplicity, very fast and efficient. It provides out-of-the-box default values so that one can quickly index and search. However, it is quite simple to extend the library to enhance indexing performance, search domains, document analyzers, etc.

A few quick definitions in the context of Lucene are as follows:


Tokenizer Tokenization is the process of breaking a stream of text up into meaningful elements. A class/function that parses an input stream into tokens is called a Tokenizer.

TokenFilter A TokenFilter is a TokenStream whose input is another token stream; it applies a set of rules to the stream to produce a desired output conforming to a specific format.

Analyzer An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text. Typical implementations first build a Tokenizer, which breaks the stream of characters from the Reader into raw Tokens. One or more TokenFilters may then be applied to the output of the Tokenizer.

Lucene's n-gram support is provided through the NGramTokenizer class, which tokenizes an input String into character n-grams and can be useful when building character n-gram models from text. NGramTokenFilter, EdgeNGramTokenizer and EdgeNGramTokenFilter are the other supporting classes providing the same functionality. Word n-gram statistics or models, on the other hand, are built using the ShingleFilter or ShingleMatrixFilter classes. ShingleAnalyzerWrapper wraps the ShingleFilter around another Analyzer.

Based on the above concepts, the patient testimonials are subjected to certain routines to yield a comprehensive index of all the phrases in the data corpus. The program which builds the index has the following specifications:

a Input: Nearly 400,000 patient testimonials read one after the other in a loop.

b Processing: This step involves three main tasks, namely

• Create an IndexWriter object. An IndexWriter creates and maintains the index; the constructor of the class determines whether to create a new index or update an existing one.


• Iterate over each testimonial and create a Document object from it. Doc-

uments are the unit of indexing and search. A Document is a set of fields.

Each field has a name and a textual value. A field may be stored with the

document thereby uniquely identifying each document.

• The last step is to actually create the index using various Analyzers and

filters configured according to the requirement.

This final step is architected using a set of Lucene Analyzers and filters, described as follows:

ShingleAnalyzerWrapper wraps a ShingleFilter around another analyzer. In our case, the chosen analyzer is the StandardAnalyzer. The shingle wrapper is used to specify the maximum shingle size, which for the purpose of our experiments is size = 4.

StandardAnalyzer filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.

LowerCaseFilter normalizes token text to lower case.

StopFilter removes stop words from a token stream.

StandardFilter normalizes tokens extracted with StandardTokenizer.

StandardTokenizer: A grammar-based tokenizer constructed with JFlex. This should be a good tokenizer for most European-language documents:

• Splits words at punctuation characters, removing punctuation. However, a

dot that’s not followed by whitespace is considered part of a token.

• Splits words at hyphens, unless there’s a number in the token, in which case

the whole token is interpreted as a product number and is not split.

• Recognizes email addresses and internet hostnames as one token.

Figure 9: Hierarchy of Lucene classes used in creating the index list

As an example, the sentence "The quick brown fox jumps over the lazy dog" would yield the following tokens:

Unigrams: quick, brown, fox, jumps, over, lazy, dog

Bigrams: quick brown, brown fox, fox jumps, jumps over, over lazy, lazy dog

Trigrams: quick brown fox, brown fox jumps, jumps over lazy, over lazy dog

4-gram: quick brown fox jumps, brown fox jumps over, fox jumps over lazy,

jumps over lazy dog
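The token lists above can be approximated with a small, stdlib-only word-shingle routine; this is a simplified sketch of what StandardAnalyzer plus ShingleAnalyzerWrapper (maximum shingle size 4) produce, not the Lucene implementation itself. In particular, Lucene leaves positional "holes" where stop words were removed, which this version ignores.

```java
import java.util.*;

// Lower-case, strip punctuation, drop stop words, then emit all word n-grams
// (shingles) of the given size over the remaining token stream.
public class Shingler {

    public static List<String> shingles(String text, int n, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase()
                            .replaceAll("[^a-z\\s]", " ")
                            .trim().split("\\s+"))
            if (!stopWords.contains(w)) tokens.add(w);
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++)
            out.add(String.join(" ", tokens.subList(i, i + n)));
        return out;
    }
}
```

Calling shingles(sentence, 2, Set.of("the")) on the example sentence reproduces the bigram list above; because stop-word holes are ignored, the intermediate n-gram lists may differ slightly from Lucene's output.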

Figure 9 depicts the overall class hierarchy of the Lucene modules and their associations.

c Output: The output from this stage is two lists:

• Term list without Stop-Words consisting of nearly 7 million terms

• Term list with Stop-Words consisting of nearly 15 million terms

4.1.2 Preliminary Filtering of phrases

The output from the first phase is a Master list of terms belonging to all categories, i.e., not restricted to medical terminology. Due to the massiveness of the master lists, we subject them to a set of preliminary filters to


reduce the index set. This facilitates a better implementation of our extraction algorithm and also helps in a more effective evaluation of the final list of medical terms. The two preprocessors used in our experiments are discussed below:

a Top Frequency Filter: The first filter is used to reduce the size of the experimental data set. Lucene's Analyzer not only indexes the testimonials, but also calculates the frequency of occurrence of each term in the document corpus. This frequency metric is used to filter the Master lists; the rule applied was to select only those terms which have a minimum support count of 100. This step is an effort towards fulfilling one of our primary aims in this work, which is:

Sideffective, unlike other medical data miners, focuses on bringing forth the most frequent side-effects of a drug, rather than the most serious (or non-serious) side-effects.

In accordance with this, a side effect like Headache, which occurs 75,000 times across all the testimonials, is given a higher ranking than a very rare side-effect like Death, reported by only one patient. We believe that users of this system would be more interested in knowing the most frequent, and thereby most common, effects of taking a drug, rather than an effect experienced by a single patient.

The top 25,000 phrases of each of the lists (with and without stop words) were chosen as the output of the first filter. It is noticeable that the list without stop words was a good source of unigram and bigram side effects, while the list with stop words was the source of longer n-gram side effects.

This is a fairly obvious conclusion due to the grammatical rules of the English language, which govern that when trigrams or longer n-grams are formed using nouns or verbs, they are connected by prepositions or conjunctions and preceded by adjectives. These grammatical rules are also the building


blocks for the next filter, discussed in the following paragraphs.
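The Top Frequency Filter can be sketched as follows (a Java sketch of the rule just described; the thesis computes the frequencies with Lucene and applies the cut-off in post-processing):

```java
import java.util.*;
import java.util.stream.Collectors;

// Drop every term whose corpus frequency is below the minimum support count
// (100 in our experiments), then keep the topK (25,000 in our experiments)
// most frequent survivors.
public class TopFrequencyFilter {

    public static List<String> filter(Map<String, Integer> termFreq,
                                      int minSupport, int topK) {
        return termFreq.entrySet().stream()
            .filter(e -> e.getValue() >= minSupport)
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .limit(topK)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```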

b Semantic Filter: The main aim of this extraction process is to identify side effects automatically from the data corpus. The index built by Apache's Lucene, with the corresponding term frequencies, no doubt contains a lot of noise. Even the list which has been filtered with stop-word analyzers does not necessarily provide a 100% noise-free set of phrases.

For example, phrases like "in the morning", "I am now" and "doctor prescribed this" are semantically correct but do not contribute towards the extraction of side effects. Therefore, it is necessary to eliminate such contextually meaningless terms. The rule of the filter is:

A phrase is considered contextually meaningless if and only if the phrase begins or ends with a stop word, and is therefore eliminated from the index list.

The reason the filter is built around this rule is that side effects longer than a bigram fall under one of these categories:

i Adjective followed by one or more nouns

ii Nouns and verbs joined by a preposition or conjunction

The following are some examples of category (i):

Lowered Blood pressure

Severe joint pains

Decreased sex drive

Extreme mood swings

While examples of category (ii) include:

Ringing in ears

Stiffness in neck


A third set of phrases, which do not conform to either of the above two categories, are the ones which begin or end with conjunctions, prepositions or, in general, any stop words. This filter does the task of eliminating such phrases. Some examples of such phrases are:

in the morning

am so

time and i am now
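The filter rule reduces to a simple boundary check; a minimal sketch:

```java
import java.util.*;

// Semantic Filter rule: a phrase is contextually meaningless, and is dropped
// from the index list, iff its first or last word is a stop word.
public class SemanticFilter {

    public static boolean isMeaningless(String phrase, Set<String> stopWords) {
        String[] words = phrase.toLowerCase().trim().split("\\s+");
        return stopWords.contains(words[0])
            || stopWords.contains(words[words.length - 1]);
    }
}
```

Phrases of both category (i) and category (ii) pass the check, since they begin and end with adjectives or nouns, while phrases like "in the morning" are caught at the boundary.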

The output from this filter is a set of clean phrases, both medical and non-medical. The non-stop-word list consists of nearly 8000 phrases, while the one with stop words is left with around 15,000 phrases. The next section describes the term extraction algorithm formulated to pick out unique side-effects from the index list.

4.1.3 Term Extraction Algorithm

This is the final step of the side-effect identification algorithm. The process begins with a Master index, which is a massive list of phrases, followed by a couple of filters which effectively clean and reduce the size of the experimental data set. The final step is to separate the medical phrases from the non-medical phrases, thereby yielding a subset of unique side-effects.

The methodology employed for this separation process is a simple technique based on Google Search Engine results. Here we examine a couple of Google concepts that are useful in this context.

Number of results:

When a Google search is performed on a search query, the results are often displayed with the information "Results 1 - 10 of about XXXX". This XXXX number is the estimated total number of results that exist for that query. The


estimated number may be either higher or lower than the actual number of results that exist. Google's calculation of this total is only an approximate ballpark figure, provided to make search faster rather than to report the exact number. But we believe that this number is quite valuable when used in conjunction with Google's Site Search feature, which thereby gives a fair idea of the frequency of occurrence of a term in any domain of websites.

Site-specific search:

Google allows one to specify that the search results must come from a given website. For example, the query [nausea site:www.drugs.com] will return pages about nausea, but only from drugs.com. A simpler query like [nausea from Drugs.com] will usually be just as good, though it might return results from other sites that mention Drugs.com in association with nausea. One can also specify a whole class of sites; for example, [nausea site:.gov] will return results only from the .gov domain. It is also possible to search multiple domains simultaneously using the keyword OR. Therefore a query [nausea site:drugs.com OR site:medications.com] returns results from either of the two sites for the search key nausea.
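Assembling such a domain-restricted query for a candidate phrase is mechanical; a small sketch:

```java
import java.util.List;
import java.util.stream.Collectors;

// Build a site-restricted Google query for a candidate phrase by OR-ing the
// site: operators together, as in the example queries above.
public class SiteQueryBuilder {

    public static String query(String phrase, List<String> sites) {
        return phrase + " " + sites.stream()
            .map(s -> "site:" + s)
            .collect(Collectors.joining(" OR "));
    }
}
```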

Based on these concepts, we define our side-effect extraction rule as follows:

Separate medical from non-medical terms by counting the frequency of occurrence of each term in a domain of medical websites (X) versus purely non-medical websites (Y), and retain only the terms where X exceeds Y by at least a factor n.

As discussed before, the frequency of occurrence of a term in a domain is estimated using Google's search result approximation number. We chose the domains of medical and non-medical sites considering the following three factors:


• Contains user testimonials

• Large volume of data on the sites

• High traffic to the sites

Based on this, the following sites were chosen in each of the domains:

Medical Websites Domain      Non-medical Websites Domain
www.medications.com          www.finalgear.com
www.drugs.com                www.travelblog.org
www.rxlist.com               www.virtualtourist.com
www.askapatient.com          www.nj.com
www.dailystrength.org

Table 1: Sample websites chosen to create the medical and non-medical data domains

The medical sites are the same as the ones chosen for crawling testimonials. The non-medical sites were selected such that they contain large volumes of user testimonials, to maintain uniformity in the pattern of data and in the formation of sentences by users. A sample search query in the medical domain would be:

[muscle cramps site:drugs.com OR site:medications.com OR site:rxlist.com OR site:askapatient.com OR site:dailystrength.org]

Similarly, a query in the non-medical site domain would be:

[some people site:finalgear.com OR site:nj.com OR site:travelblog.org OR site:virtualtourist.com]

We tabulate these results for all the phrases collected and, by trial and error, determine the factor n by which medical phrases are separated from non-medical ones.

A sample table with about 15 phrases is shown below:

Determining factor n:

According to our rule defined, we extract only those phrases from the list

that have a search count in medical websites by a factor n more than the

one in non-medical web domain. To determine this n, we use a trial and

Page 47: SIDEFFECTIVE - SYSTEM TO MINE PATIENT REVIEWS: SIDE …

38

 #  Phrase                     Medical   Non-Medical
 1  anxiety attacks              9,370           250
 2  severe depression           37,100           381
 3  acid reflux                  4,670           878
 4  lethargic                    2,000           450
 5  lyrica                       8,390         1,030
 6  Depakote                     6,680           120
 7  permanent                    9,410        39,500
 8  got worse                   19,200        37,100
 9  would recommend             26,700        55,200
10  doesnt work                  7,660         4,080
11  taste in my mouth            4,260         3,280
12  extremely painful cramps     4,770         1,310
13  horrible headaches           4,200           913
14  on the pill                181,000         3,760
15  is getting worse            17,300        11,300

Table 2: Sample subset of phrases with corresponding Google counts from the medical and non-medical web domains.

error approach starting with n = 1. We then perform an editorial audit of

the resultant final list to certify the quality of the extracted terms based on

the false positive rate. After several trials, we settle on n = 4 as the best

value of this factor. We examine the various n values tried on the

experimental set and the associated problems in the result set in the following

paragraphs.

Factor n < 4 - For smaller values of n, fewer phrases get filtered out,

which makes the resultant final list noisier.

From Table 2, it can be seen that using an n-value of 1, 2 or 3 identifies

medical phrases like rows 1-6, but at the same time it also

incorrectly flags rows like 10 and 11 as medical. The false positive rate

when using a smaller n-value is very high.

Factor n > 4 - Using a larger n-value eliminates many noisy phrases.

This is an advantage, although a new problem appears in the resultant

list: phrases of a more common nature, i.e., more likely to occur

in both the medical and non-medical domains, get eliminated. These can be

valuable side-effects which should not be discarded. Classic examples


are rows 12 and 13 of Table 2, whose frequencies in the two data sets are very

close. It is not uncommon to find phrases like horrible headache or terrible

cramps on a travel or tourism website where users write testimonials about their

experiences.

Considering all the above issues, an n-value of 4 yields the lowest false

positive rate while generating the maximum number of unique side effects.

Despite these advantages, we observe that the final list from this scenario still

contains a few noisy phrases which cannot be categorized as purely medical. The final

list after all the phases and filters contains nearly 2000 terms which fall under

the following 3 categories:

• Side effects

• Drug Names

• Noisy terms

Finally, the following 2 steps are performed to obtain the list of unique side

effects:

Step I: Eliminate drug names by cross-referencing with the drug names from

the database.

Step II: Run a manual audit of the list from the previous step to obtain the unique

side effects.

4.2 Determining Top side effects

We examined in great detail the side effect extraction process in the previous

section which is our primary contribution in this project. Once the side effects

are extracted, it is important to represent all the valuable data collected in a

manner that is most useful as well as appealing to the target audience, who are

patients using various drugs and medications in day-to-day life. Hence, our second


important contribution is aggregation and data-distribution representation, which

are discussed in the following two sections.

Data aggregation is any process in which information is gathered and expressed

in a summary form, for purposes such as statistical analysis. We use aggrega-

tion and its strength to represent distributional data in a user friendly interface

designed and developed specifically for this undertaking. The details of the sys-

tem implementation are discussed at length in Section 7. As discussed in Section

4, we built database tables for Unigram and N-gram side effects with their fre-

quencies, indexed by Drug names, in tables UnigramSE and NGramSE respectively.

These tables were built using the side-effects identified from our extraction routine.

Additionally, there exists the Master table called DrugTable which holds all the

reviews/testimonials collected from various websites, also indexed by Drug names.

In this section, we try to answer the question lurking in most patients' minds:

What are the top side-effects reported for a medicine?

For the rest of this discussion, we explain the various features of the distribution

model using the example of drug Xanax and its side effects. There are 3 main

features represented for each Drug:

• Top Side effects with percentage distributions

• Graphical representation of the above distribution

• Patient testimonials in the corresponding category

The screenshot in Fig 10 shows all the above features represented for drug Xanax.

We examine each of these in the upcoming paragraphs.


Figure 10: Sample screenshot showing the various features represented for drug Xanax

4.2.1 Section I: Top Side-effects

The UnigramSE and NGramSE tables contain all the side effects for a specific

drug. Their frequencies are extracted and the percentage calculated as follows:

If f = frequency of occurrence of the side-effect as reported in the DB table and

T = total number of reviews reported for the particular drug,

then the Percentage frequency PerD(X), where X is any side-effect for drug D, is:

PerD(X) = (f / T) ∗ 100     (1)

Based on this calculation, the top 20 most frequently reported side effects are

displayed in column I. Table 3 shows the percentage frequency calculated for

Xanax and its 20 most reported side-effects.
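The ranking can be sketched as follows; the frequencies and review total are hypothetical stand-ins for the UnigramSE/NGramSE values, and the function name is ours:

```python
# Minimal sketch of the percentage-frequency ranking of equation (1);
# the frequencies below are illustrative, not actual database values.
def top_side_effects(freqs, total_reviews, k=20):
    """Rank side-effects by Per_D(X) = (f / T) * 100 and keep the top k."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(se, round(f / total_reviews * 100, 2)) for se, f in ranked]

xanax = {"drowsiness": 103, "depression": 97, "memory loss": 74, "insomnia": 65}
print(top_side_effects(xanax, 10000, k=3))
# [('drowsiness', 1.03), ('depression', 0.97), ('memory loss', 0.74)]
```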


Side Effect           Percentage Frequency
drowsiness            1.03%
depression            0.97%
Memory loss           0.74%
insomnia              0.65%
fear                  0.59%
tiredness             0.27%
dizziness             0.27%
anger                 0.22%
Dry mouth             0.22%
Weight gain           0.20%
seizures              0.20%
sensitive             0.16%
nausea                0.16%
Mood swings           0.13%
Vivid dreams          0.07%
Increased appetite    0.07%
Muscle weakness       0.05%
Muscle spasms         0.05%
Heart palpitations    0.05%
Chest pain            0.05%

Table 3: Top 20 side effects reported by patients for Xanax and the corresponding frequency percentages

4.2.2 Section II: Graphical Representation

Visual representations are usually more appealing and easily perceptible to patients

reviewing large volumes of data. This motivated us to represent the top side effects

for a drug in the form of a pie chart on our interface. Pie charts make a good

representation of data when the categories illustrate a proportion of the total.

Following the same example as before, Fig 11 represents the Pie distribution for

Xanax and its side effects.

4.2.3 Section III: User Testimonials

A recent survey conducted on a small group of patients shows that patients are

always looking to read the experiences of others who are on the same medication as

themselves, to feel more reassured about their own condition. Therefore, the final

section of the webpage displays user testimonials for every drug selected.

These reviews are unfiltered, first-hand information from patients who have used


Figure 11: Sample Pie chart representing the distribution of side-effects for Xanax

and experienced side-effects from various drugs. Although not verified by any cer-

tified medical resource, these reviews are quite valuable to the average person who

regards the Internet as a great source of a second medical opinion. Addition-

ally, selecting any of the side effects in the right column displays the testimonials

specific to that particular symptom.

4.3 Determining Top Drugs for each Symptom

Patients who take multiple medications are generally interested in knowing which

drugs cause a specific side-effect. This section of our work provides the necessary

reverse information. We examine an example of this utility by considering the

side-effect Dizziness. Table 4 represents the Top 20 Drugs for which patients have

reported Dizziness as one of the side-effects. The table also presents the percentage

distribution of each drug for that specific symptom. Fig 12 shows a pie chart

distribution of the same.
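The reverse mapping can be sketched as follows; the drugs, counts and review totals are illustrative, not corpus values:

```python
# Reverse mapping sketch: side-effect -> top drugs reporting it.
drug_se_counts = {   # drug -> {side-effect -> number of reviews reporting it}
    "Yasmin":   {"dizziness": 338, "nausea": 120},
    "Levaquin": {"dizziness": 270},
    "Lipitor":  {"muscle pain": 90},
}
total_reviews = {"Yasmin": 10000, "Levaquin": 10000, "Lipitor": 8000}

def top_drugs_for(symptom, k=20):
    """Percentage of each drug's reviews reporting the symptom, descending."""
    rows = [(drug, round(ses[symptom] / total_reviews[drug] * 100, 2))
            for drug, ses in drug_se_counts.items() if symptom in ses]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:k]

print(top_drugs_for("dizziness"))  # [('Yasmin', 3.38), ('Levaquin', 2.7)]
```

The same per-drug percentage formula as before is simply evaluated across drugs instead of across side-effects.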


Drug Name        % of patients reporting Dizziness
Yasmin           3.38%
Effexor XR       2.90%
Levaquin         2.70%
Mirena           2.69%
Lisinopril       2.33%
Cymbalta         2.14%
Effexor          1.95%
Lamictal         1.89%
Lyrica           1.69%
Toprol-XL        1.59%
Flagyl           1.50%
Wellbutrin       1.42%
Lexapro          1.36%
Paxil            1.27%
Coumadin         1.24%
Lipitor          1.23%
Buspar           1.23%
Zoloft           1.16%
WELLBUTRIN XL    1.14%
Topamax          1.13%

Table 4: Sample reverse mapping from Side-effect -> Drug for Dizziness

Figure 12: Pie Chart distribution for Drugs reporting Dizziness as a side-effect


5 Discussion

5.1 Challenges & Solutions

In the previous sections we examined in detail all the phases involved in side-

effect extraction as well as the representation of the extracted information in the user

interface. In this chapter, we focus on a few important issues that were

encountered and the solutions implemented to alleviate them. Although

there were several technical implementation difficulties, we restrict our discussion

to the two major challenges which pertain to the actual domain in context, which

is medical data. The topics of discussion include:

• Elimination of Cures appearing as top side-effect

• Synonym side-effects

The next two sections deal with the above two issues at length.

5.1.1 Eliminating Cures

In Section 4.2, we discussed the methodology behind determining the top side-

effects exhibited by a particular drug. The top side-effects are determined based

on the frequency of occurrence of the term in all the reviews associated with the

drug. This proves to be an essential consideration, since patients and users of the

internet show great interest in knowing what is most common rather than

what is the most serious side-effect when on any medication. The frequency-based

calculation for this feature posed one of the main challenges in this work. We

first examine a few examples in Table.5 which are indicators for the issue at hand.

These examples were some of the first few samples that we had noticed during our

experimental process that led to uncovering the issue and therefore can provide a

good insight into the problem.


Depakote - Top Side-effects:
  Seizures – 15.8%
  Weight gain – 7.47%
  Hair loss – 2.74%
  Depression – 1.93%
  Mood Swings – 1.58%

Singulair - Top Side-effects:
  Difficulty Breathing – 15.1%
  Depression – 8.82%
  Allergy – 8.80%
  Anxiety – 8.59%
  Mood Swings – 4.96%

Cymbalta - Top Side-effects:
  Anxiety – 11.8%
  Nausea – 3.20%
  Weight gain – 2.35%
  Insomnia – 2.14%
  Sweating – 2.12%

Zoloft - Top Side-effects:
  Depression – 15.2%
  Weight gain – 2.62%
  Insomnia – 1.41%
  Nausea – 1.31%
  Dry mouth – 1.22%

Table 5: Sample drugs and their top side-effects, which are indicators of issue 1 faced

In Table 5, the condition Seizures is reported by 15.8% of patients using Depakote,

when the fact remains that the drug is in fact used to cure/prevent seizures. Similarly,

Zoloft treats depression, Cymbalta treats anxiety and Singulair is used to ease

breathing difficulties.

It is also easy to spot that the top side-effect

reported for each of the drugs is significantly higher in frequency than the others

following it. This immediately raised a red flag in our experiments, and we

went on to investigate further. The interesting finding, after observing the trend

across several drugs, was:

The side-effect appearing as the top reported effect was in most cases the condition

that the specific drug is expected to cure.

Considering the fact that our calculations of top reported side-effects are frequency

based, this was not an uncommon result. Patients writing testimonials about any

drug are bound to use the words representing the conditions that the drug cures.

To solve this issue, the approach that we implemented was a 3-step process:

1. Crawl the official NIH (National Institutes of Health) website and extract a list

of drug-to-cures mappings.


2. For each drug, examine the list of side-effects generated and detect outliers

which have a significantly high frequency percentage compared to others in

the list.

3. Finally, cross reference each of the outliers to the list obtained in step I to

detect the possible cures. These symptoms are then flagged in the database

as a ’cure’ and are restricted from appearing as a side-effect.
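Steps 2 and 3 can be sketched as follows. The cure list, the percentages and the z-score cut-off of 1.5 are illustrative assumptions, not the system's actual data or threshold:

```python
import statistics

# Stand-in for the drug-to-cures mapping crawled from the NIH site (step 1).
cures_by_drug = {"Depakote": {"seizures"}, "Zoloft": {"depression"}}

def flag_cures(drug, se_pct, z_cut=1.5):
    """Steps 2-3: detect outlier side-effects (far above the drug's mean
    reporting percentage) and flag those that match a known cure."""
    pcts = list(se_pct.values())
    mu, sigma = statistics.mean(pcts), statistics.pstdev(pcts)
    return {se for se, pct in se_pct.items()
            if sigma > 0 and (pct - mu) / sigma > z_cut
            and se in cures_by_drug.get(drug, set())}

# Percentages taken from the Depakote column of Table 5.
depakote = {"seizures": 15.8, "weight gain": 7.47, "hair loss": 2.74,
            "depression": 1.93, "mood swings": 1.58}
print(flag_cures("Depakote", depakote))  # {'seizures'}
```

Only a side-effect that is both a statistical outlier and a known cure is suppressed, so a genuinely over-reported side-effect that is not a cure survives the filter.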

The reason behind having a list of cures is to ensure that a genuinely highly

reported side-effect for a drug does not get eliminated merely because its significantly

large frequency makes it the outlier. The only drawback of this ap-

proach is that there exists a set of drugs known to cause the same side-effect

as the one the drug is expected to cure. To illustrate: if Drug X is

administered to treat a condition Y but instead ends up causing a higher degree

of condition Y, this gets eliminated and does not show up as a potential side-effect

despite being very highly reported by patients.

5.1.2 Synonymous Side-effects

The second major issue was the presence of synonymous side-effects in the top

20 most frequent effects returned for each drug. Although this does not pose any

technical errors, it is still considered a noisy result since the same effect appears in

multiple forms. For example, Headache, Headaches and Ache in Head are all con-

sidered the same from the perspective of a user/patient. But the system considers

each of them a unique side-effect and displays all three in the list that it

generates. The solution we use here is to incorporate a synonym generator API

called Big Huge Labs which provides multiple versions of a particular word. Based

on the results returned in the top side-effects, only one of the versions is chosen

and displayed.
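The collapsing step can be sketched as follows. The lookup table below is a stub standing in for the thesaurus API response, and the frequencies are illustrative:

```python
# Stubbed synonym lookup: each variant maps to a chosen canonical form.
SYNONYMS = {"ache in head": "headache"}

def collapse_synonyms(freqs, synonyms=SYNONYMS):
    """Merge synonymous side-effects under one canonical form,
    summing their frequencies so only one version is displayed."""
    merged = {}
    for effect, f in freqs.items():
        canon = synonyms.get(effect.lower(), effect.lower())
        merged[canon] = merged.get(canon, 0) + f
    return merged

freqs = {"Headache": 40, "Ache in Head": 12, "nausea": 9}
print(collapse_synonyms(freqs))  # {'headache': 52, 'nausea': 9}
```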

Although this solution solves the problem in most cases, there still are a few


exceptions. The API only generates valid synonyms, but not singular and plural

forms. This leads to the case where Headache can be a synonym for Ache in Head

but not for Headaches. This is an area of possible future work where the problem

can be tackled by examining root words rather than the entire word in context.
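A deliberately naive sketch of this root-word idea: stripping a trailing "s" already merges the singular and plural forms that the synonym API misses. A real implementation would use a proper stemmer (e.g. Porter's algorithm) rather than this single rule:

```python
def naive_stem(word):
    """Crude root-word normalization: lowercase and drop a trailing 's'.
    Illustrative only; a real stemmer handles many more suffixes."""
    w = word.lower()
    return w[:-1] if w.endswith("s") and len(w) > 3 else w

print(naive_stem("Headaches"))  # headache
```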

5.2 Profoundness Score

During the literature review phase, one of the main goals was to identify what

patients look for in medical websites. From extensively surveying the internet for

patient reviews, we found a common requirement: patients want to obtain

comparative analyses based on drugs or side-effects. Motivated by this, our

system defines a new parameter called the 'Profoundness Score' to project the impor-

tance of a side-effect in the context of a particular drug.

Profoundness Score for a side-effect is defined as:

”The Z-score or Standard score calculated over a population of data corresponding

to a specific category of drugs or entire corpus of drugs.”

Before exploring this term further, we discuss the concept of the Z-score in a general

statistical context. A common statistical way of standardizing data on one scale

so that comparisons can take place is the z-score. The z-score is like a common

yardstick for all types of data. Each z-score corresponds to a point in a normal

distribution and as such is sometimes called a normal deviate, since a z-score

describes how much a point deviates from a mean or specification point. It is

a dimensionless quantity derived by subtracting the population mean from an

individual raw score and then dividing the difference by the population standard

deviation. This conversion process is called standardizing or normalizing.

Therefore, Profoundness Score Pf for a side-effect SE in the context of Drug D is

calculated as:

PfD(SE) = (f − μ) / σ     (2)

where f = frequency of occurrence of SE in the reviews of drug D, μ = mean

frequency of SE over the chosen population and σ = population standard deviation

of that frequency.

The term 'population' in this context can refer either to the entire set of drugs in our

corpus or to drugs belonging to only a specific category.
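The calculation can be sketched as follows; the population frequencies are illustrative, not corpus values:

```python
import statistics

def pf_score(f, population_freqs):
    """Profoundness score: (raw frequency - population mean) /
    population standard deviation, i.e. a plain z-score."""
    mu = statistics.mean(population_freqs)
    sigma = statistics.pstdev(population_freqs)
    return (f - mu) / sigma

# e.g. frequency of one side-effect across five drugs, with drug D at 261
population = [261, 40, 55, 32, 48]
print(round(pf_score(261, population), 3))  # 1.992
```

A positive score means the side-effect is reported more often for this drug than is typical across the population; a negative score means less often.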

We examine two examples in the next section where Zoloft is the drug under

consideration, calculating the Profoundness score over two different populations.

Population 1: Complete Drug corpus and Pf-score of top side-effects of

Zoloft

Table 6 shows the calculated Profoundness scores of the top side-effects ex-

hibited by Zoloft. The calculations are based on the formula explained above.

The frequency of occurrence of each side-effect is recorded in the UnigramSE and

NGramSE tables and is therefore used as the datum in the individual calculations.

In order to interpret these scores, consider the example of Weight Gain for

Zoloft. The frequency of occurrence of weight gain in Zoloft-related testimonials

by patients is 261. Based on this, the Pf-score is calculated as 2.871. This means

that weight gain as a side-effect is reported around 2.9 standard deviations

above the mean frequency of occurrence of weight gain in the entire corpus. In other

words, weight gain for Zoloft is over-represented compared to other drugs,

implying that the side-effect is markedly more profound for Zoloft. Similarly,


Side-Effect     Frequency   Profoundness Score
weight gain        261         2.871
insomnia           141         0.815
nausea             131         2.938
dry mouth          122         1.992
dizziness           75         1.755
mood swings         68         0.643
diarrhea            66         1.463
headache            59         1.517
yawning             56         0.435
sweating            45         0.803
vivid dreams        42         0.28
weight loss         42         0.388

Table 6: Top side-effects of Zoloft and their corresponding Pf-scores

insomnia for Zoloft is under-represented and is less profound compared to other

drugs.

Population 2: Comparative Analysis based on Drug Category

The corpus-wide analysis above provides a good understanding of the impor-

tance of a side-effect for a drug. Although statistically meaningful,

it fails to provide good insight for patients who want to compare

various drugs within the same category. For this specific reason, the second set of ex-

periments restricted the population to drugs of the same category.

We examine an example where the category is Anti-Depressants and the drugs in

context are Cymbalta, Prozac and Zoloft.

The system determines the top side-effects for each drug and then calculates an

intersection set which contains the side-effects reported for all three drugs and their

corresponding Profoundness scores. Table 7 consolidates this example.
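The intersection step can be sketched as follows, with illustrative top-side-effect sets rather than the real drug data:

```python
# drug -> its set of top side-effects (illustrative).
tops = {
    "Cymbalta": {"nausea", "insomnia", "sweating", "constipation"},
    "Prozac":   {"nausea", "insomnia", "weight loss"},
    "Zoloft":   {"nausea", "insomnia", "dry mouth"},
}

def common_side_effects(tops):
    """Side-effects present in every drug's top list, sorted for display."""
    return sorted(set.intersection(*tops.values()))

print(common_side_effects(tops))  # ['insomnia', 'nausea']
```

The Pf-score of each common effect can then be reported per drug, exactly as Table 7 does.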

This analysis provides a way to compare side-effects caused by various drugs in

the same category, thereby giving the users an opportunity to explore the options

before making choices. The system also provides a feature to display the other

major side-effects for each drug as shown in table 8.


Side-effect     Pf-Score Cymbalta   Pf-Score Prozac   Pf-Score Zoloft
nausea               2.395              0.239             0.663
weight gain          1.08               0.062             1.618
Insomnia             1.805              0.386             1.133
Sweating             2.79               0.273             0.143
Dizziness            1.884              0.471             0.6
Constipation         3.536              0.627             0.119
Dry mouth            1.594              0.164             1.495
Headache             1.717              0.318             0.761
Weight loss          1.176              0.027             0.279
Vivid dreams         1.532              0.555             0.419

Table 7: Top common side-effects for drugs under the Anti-Depressants category

The other side effects of CYMBALTA: fever, bleeding, ear pain, muscle twitching, light headedness, stupor, infection, muscle weakness, flushing, delusions

The other side effects of PROZAC: anxiety, bulimia, hiccups, dry throat, muscle weakness, pelvic pain, stomach irritation, reduced sleep, rigid muscles, abdominal discomfort

The other side effects of ZOLOFT: bleeding, spotting, dehydration, bulimia, light headedness, sores, dread, bruxism, coma, stomach irritation

Table 8: Representation of other major side-effects of the drugs in the Anti-depressants category


6 Conclusions and Future Work

In this project, we set out to solve the almost impossible task of interpreting human

reviews to extract relevant terms in our domain. A set of sequential

tasks was accomplished in order:

• Starting with data collection, we gathered nearly 400,000 user reviews about

nearly 2,500 drugs from the internet, which formed our medical data corpus. The

challenging part of this step was identifying the right sites in order to have a

quality dataset.

• Next, we parsed this massive collection of data.

The biggest road-block in this phase was not only identifying the pattern of

organization of data on every website, but also bringing these varied structures

together into a common data model. This involved treating each

site with a unique parsing routine specific to its layout architecture.

• In the next phase, we iteratively processed the reviews to extract certain patterns

and built our dependency models, which are essentially the database schema.

• Finally, we defined and implemented the side-effect extraction algorithm, which

gave us a unique list of nearly 900 side-effects extracted purely from patient

reviews, without relying on fixed lists. This process helped us uncover a

few rare side-effects which are not generally reported by pharmaceutical

companies or drug manufacturers.

At this point we have accomplished all the goals that we started out with and

even stumbled upon a few interesting road-blocks that were not anticipated

at the beginning. We proposed methodologies for the following:

• Tackling the problem of synonymous side-effects

• Solving the interesting case of cures appearing as top side-effects


Above all, one of the biggest goals we achieved in this undertaking is presenting the

results of the aggregations, distributions and much other useful data in a web

interface. Our target audience is patients who take drugs and medications, and

the assumption is that a patient using the interface has only basic knowledge of

using the internet to get the information he or she seeks. We therefore invested a lot

of time and effort into making the front-end as friendly as possible. The technical

details and algorithms are left to the thesis and are not expanded in the

interface.

This work is one of the first few undertakings in the direction of automatic side-

effect extraction in the medical domain. We believe that it is an excellent founda-

tion for future work focusing on some of the drawbacks of the system, namely:

• Integrating a Stemmer to find root words and determining side-effects based

on that factor.

• Using various Clustering and similarity metrics to determine the categories of

reviews and also side-effects. This kind of grouping can give a possible insight

into the ’Seriousness’ of the side-effect as well.

• To deal with the noise associated with user data, we believe that implementing

a spell check on the patient reviews could greatly reduce the number of ’non-

words’ and thereby generate an even bigger list of unique side-effects.

Our system would act as a solid foundation for other such user review-oriented

fields including mining for Top Rated movies, determining Top Teams in various

sports or even finding ’Top Friends’ on social networking sites!


7 Bibliography

1. Hiroshi Nakagawa and Tatsunori Mori. A Simple but Powerful Automatic

Term Extraction Method.

2. Beatrice Daille. Study and Implementation of Combined Techniques for Au-

tomatic Extraction of Terminology.

3. Sophie Aubin and Thierry Hamon (2006). Improving term extraction with

terminological resources.

4. Angelos Hliaoutakis, Kalliopi Zervanou and Euripides G.M. Petrakis. Au-

tomatic Document Indexing in Large Medical Collections.

5. E. Milios, Y. Zhang, B. He and L. Dong. Automatic Term Extraction and

Document similarity in special text corpora.

6. Wen-tau Yih, Joshua Goodman and Vitor R. Carvalho. Finding Advertising

Keywords on Web Pages.

7. Diana Maynard and Sophia Ananiadou. Identifying Contextual Information

for Multi-Word Term Extraction.

8. Wen-tau Yih, Po-hao Chang, Wooyoung Kim. Mining Online Deal Forums

for Hot Deals.

9. Gaël Dias, Sylvie Guilloré, Jean-Claude Bassano and José Gabriel Pereira Lopes.

Combining Linguistics with Statistics for Multiword Term Extraction: A

Fruitful Association?

10. Youngja Park, Roy J Byrd and Branimir K Boguraev. Automatic Glossary

Extraction: Beyond Terminology Identification.

11. Anette Hulth. Improved Automatic Keyword Extraction Given More Lin-

guistic Knowledge.


12. Jizhou Huang, Ming Zhou, Dan Yang. Extracting Chatbot Knowledge from

Online Discussion Forums.

13. Alexandre Patry and Philippe Langlais. Corpus-Based Terminology Extrac-

tion.

14. Jody Foo (2009). Term extraction using machine learning.

15. Christopher Olston and Marc Najork. Web Crawling.

16. Minqing Hu and Bing Liu. Mining Opinion Features in Customer Reviews.

17. Carlos Castillo. Effective Web Crawling.

18. Web-references:

• www.cs.cmu.edu/~wcohen/collab-filtering-tutorial.ppt

• http://en.wikipedia.org/wiki/Collaborative_filtering

• http://www.exinfm.com/pdffiles/intro_dm.pdf

• http://en.wikipedia.org/wiki/Web_crawler

• http://acl.ldc.upenn.edu/H/H05/H05-1043.pdf

• http://pages.stern.nyu.edu/~aghose/icec2007.pdf

• http://www.cs.uic.edu/~liub/publications/kdd04-revSummary.pdf

• http://www.ncbi.nlm.nih.gov/pubmed/16406474

• http://ezinearticles.com/?The-Power-of-Social-Media-For-the-E-Patient&id=4304655

• http://ezinearticles.com/?The-Doctor-Review-That-Tells-the-Truth&id=5881600

• http://htmlparser.sourceforge.net/

• http://httrack.kauler.com/help/Home

• http://www.ncbi.nlm.nih.gov/pubmed/9433730

• http://www.jabfm.org/cgi/content/full/19/1/39


• http://www.aclweb.org/anthology/H/H92/H92-1022.pdf

• http://rali.iro.umontreal.ca/Publications/urls/paper-tke-2005.pdf