
Creating a Computational Lexicon of Alcohol Consumption from Twitter

Developed by Evghenii Varghin
Supervised by Sophia Ananiadou

School of Computer Science
University of Manchester
Computer Science with Business and Management

April 2016


Contents

Part 1: Introduction
    1.1 Project overview
    1.2 Report structure
    1.3 Motivation
    1.4 Goals and objectives

Part 2: Background
    2.1 Social media
        2.1.1 Twitter API
    2.2 Text mining and normalization
        2.2.1 Computational lexicology
    2.3 Lexicon evaluation
        2.3.1 Related work

Part 3: Design
    3.1 Requirements
    3.2 Database structure
    3.3 Term finder
    3.4 Twitter search
    3.5 Cleaning and normalizing data
    3.6 Information extraction

Part 4: Implementation
    4.1 Collecting and storing tweets
    4.2 Data normalization
        4.2.1 Tokenization
        4.2.2 Slang and acronym normalization
    4.3 Part-of-speech tagging
    4.4 Named-entity recognition

Part 5: Results and evaluation
    5.1 Results
    5.2 Evaluation of the extracted corpus
        5.2.1 Search keywords comparison
        5.2.2 An alternative computational approach to corpus evaluation for future work

Part 6: Conclusions
    6.1 Work overview
    6.2 Future work
    6.3 Summary

References

Appendix 1


Acknowledgements

I would like to take this opportunity to thank my supervisor, Sophia Ananiadou, for her continuous support and valuable feedback throughout all stages of the project's development. This was a completely new field for me, and I would not have been able to make such progress without her guidance.

I would also like to thank my parents from the bottom of my heart for providing me with this amazing opportunity to study abroad. I cannot express how much this means to me, and what opportunities have opened up to me as a result of their decisions and support.


Part 1

Introduction

1.1 Project overview

Information is the key building block in today's world of Big Data. Extracting information and making sense of it has become a valuable process not only for large corporations, but also for scientific, social and other research. There are multiple sources to extract information from: for businesses it can be internal or market data, and for researchers it can be interviews or questionnaires. However, another source is quickly becoming one of the largest of all: social media. Social media has seen incredible growth over the past few years, with Twitter alone having almost 320 million monthly active users [1] who post, discuss, comment on and share information via the platform. Through these activities, users create an extraordinary pool of information that can be used for many different purposes. This project focuses on how to extract such information and build a lexicon on a specific topic from the complete pool of Twitter data.

1.2 Report structure

This report is broken down into six parts. The remainder of this part discusses the personal motivation for choosing this project and its main goals and objectives. The second part provides an overview and background of the technical concepts that were used during the creation of this program, or to which this project is related and which will be applied in the future.

The third and fourth parts focus on the design and the implementation of the technical tools in this program.

Finally, the "Results and evaluation" part outlines what was achieved, and the last part provides a reflective summary of the completed tasks, as well as the future of this project.


1.3 Motivation

I believe that information extraction and analysis already drive today's fast-paced world. As sources of information grow and their structure becomes more and more complicated, developers are coming up with ever more sophisticated data mining tools. Understanding the importance of such tools and their potential impact on the world, I wanted to create a project that would utilize some of these tools and techniques, in order to learn how they can be implemented to solve text mining tasks.

Another part of the motivation behind choosing this project was to become comfortable working in an area with which I was totally unfamiliar before the start of the project. I wanted to develop a skillset that would allow me to face any new tools and programs with confidence, having proven to myself that I can overcome steep learning curves to complete a task.

Finally, I wanted to contribute to a social project. Coming into this degree, and throughout my first years at university, I was always looking for ways in which the computer science skills I had developed could be applied to real-world problems. I was fascinated to see how I could apply my knowledge of programming to develop a project that could potentially help tackle some of the health and social problems in our society.

1.4 Goals and objectives

This project's main goal was to develop a computational lexicon of alcohol consumption from Twitter. The collected and normalized data would allow doctors and psychologists to analyse and study the corpus related to alcohol consumption in a structured and efficient manner. Such studies could uncover potential relationships between certain types of Twitter posts and alcoholism, giving doctors more information about the different causes of alcoholism and allowing them to take proactive action in helping people to prevent this disease.

Given the overall size of the task, this project was designed to provide a starting point for such research. The following list of objectives was developed for this project:

i. Develop a program that analyses data from a given section of a health forum in order to find valid keywords for the Twitter search.

ii. Collect data from Twitter and store it for further processing.

iii. Clean and normalize the collected data.

iv. Extract meaningful information from the normalized corpus.


v. Develop a flexible program that allows researchers to collect and analyse data from other domains.

vi. Outline the evaluation methods that are required to create the computational lexicon.


Part 2

Background

2.1 Social media

Social media has become one of the biggest internet sensations of the past decade. Not only has it had a great impact on people's lives, allowing them to stay better connected [2], but it has also become one of the largest sources of information worldwide. This report will outline the two social media sources that were used in the project: Twitter and health forums.

Twitter is an online social media platform that allows its users to send and read short (up to 140 characters) messages, a service referred to as microblogging. Twitter's user-facing metadata primarily consists of user-mentions ("@") and hashtags ("#"). Hashtags are often used to add sentiment to users' posts (e.g. #happy, #fun) [3], while user-mentions allow people to address their tweets to specific users or entities. In 2013, Twitter introduced an additional set of metadata, which includes geo-location, language, time and other information for each tweet [4] [5].

Health forums are another type of social media platform that allows people to share health practices with others and to seek advice, anonymously or identified, regarding any issues they might have. Such forums are typically monitored by qualified professionals to make sure any harmful content is promptly removed [6]. As health has become the most widely searched topic on the Internet [7], health forums contain large volumes of information that can be used for research purposes.

2.1.1 Twitter API

Twitter provides two different application programming interfaces (APIs) for developers to access its data: the Search API and the Streaming API.

The Search API is part of Twitter's REST API. It allows developers to access and collect historical Twitter data published in the past 7 days. The Search API provides a set of tools that allow developers to test, modify and validate search queries in order to get the most reliable results. It has a rate limit of 450 queries/requests per 15 minutes when using application-only authentication [8].

The Streaming API is used for collecting tweets in real time as they are published, as long as they match the search criteria [9]. The Streaming API can collect a much higher flow of tweets (180k tweets per hour vs. 72k tweets per hour for the Search API); however, it does not have the tools to create more advanced search queries, making the overall range of collected data narrower [10].

2.2 Text mining and normalization

Text mining is a sub-category of data mining and is primarily concerned with deriving high-quality information from unstructured text. This is usually achieved through the application of various computational tools and techniques, and can be applied to tasks of varied complexity: starting from a simple count of the number of occurrences of a specified word in a document, text mining tools can also identify and semantically classify named entities, as well as the different relationships between them [11].

One of the main objectives of text mining tools is text normalization before further processing. Text normalization ensures consistency of the generated output, dealing with all instances of ambiguity in the text. There are multiple natural language processing (NLP) techniques for solving lexical normalization tasks, some of which are listed below:

Tokenization – tokenization is often considered a required prerequisite before applying any other NLP tools and methods to the extracted corpus. It refers to splitting the input sequence of characters into words, punctuation symbols or other meaningful basic linguistic units [12]. Splitting a string is typically performed based on the occurrences of whitespace characters; however, many algorithms use more sophisticated methods (e.g. “shouldn’t” -> {“should”, “n’t”}).

Word stemming – word stemming is performed to achieve more consistency across the lexicon. The process can be summarized as “reducing inflected (or sometimes derived) words to their word stem, base or root form” [13]. Multiple stemming algorithms have been developed to solve this task for a complete corpus [14].

Part-of-speech tagging – POS tagging refers to the process of assigning an appropriate tag to each token in an input character sequence, based on its meaning and the context in which it appears [15]. Because a single word can act as different parts of speech, such tagging algorithms often use a probabilistic approach to assign the correct labels.


2.2.1 Computational lexicology

Computational lexicology is the branch of computational linguistics concerned with the use of computers in the study of the lexicon. It has contributed to a wide range of tasks related to how computers process, interpret and produce human language [16]. In doing so, studies in computational lexicology have also identified the current limitations of print dictionaries for computational purposes [17].

2.3 Lexicon evaluation

Evaluation of the results is one of the most important aspects of creating a computational lexicon. There are multiple techniques that can be used for lexicon evaluation to ensure the autonomy and reliability of any text mining work [18]. This report will explore the following lexicon evaluation methods:

Evaluation of the extracted corpus – Text mining techniques are typically evaluated against “gold standards” [19]. This part of the evaluation is concerned with the relevance of the data that has been retrieved. This project aimed to create a computational lexicon of alcohol consumption from Twitter; therefore, the extracted corpus was evaluated on its relevance to this topic, as opposed to generic alcohol-related discussion.

Evaluation against existing resources – A computational lexicon can be evaluated against an existing resource of the same or a similar domain. Such an evaluation can demonstrate how the newly created lexicon complements or adds to an existing one, improving its coverage of the domain.

Domain-specific evaluation – A computational lexicon can be evaluated by domain experts. This usually requires manual annotation of the gathered corpus, preferably by two or more domain experts. This task compares the annotations given by the text mining tools with the ones provided by the domain experts, thus evaluating the effectiveness and reliability of such tools.

2.3.1 Related work

Related work on lexicon evaluation can be found in almost any paper concerned with the creation of a computational lexicon, as it is an integral part of such research. This project consulted and studied the tools and evaluation methods used by Thompson et al. in the creation of the BioLexicon [20]. The evaluation methods used by Thompson et al. included the evaluation of normalization rules according to ambiguity and variability metrics, and the evaluation of the created lexicon against existing resources – WordNet [21] and the SPECIALIST lexicon [22]. All the evaluations performed demonstrated positive results, highlighting the effectiveness and efficiency of the text mining techniques used to create the BioLexicon.


Part 3

Design

3.1 Requirements

Due to the large scale of the project, the final requirements were narrowed down and simplified to accommodate the available time and knowledge constraints. One of the main goals and requirements for this project was to become familiar with the tasks of text mining and lexical normalization, so an initial high-priority requirement involved collecting and studying a number of related works in this domain.

Other requirements were gathered at different stages of project development. The final list of program requirements and their priorities is outlined in the table below.

#   Requirement                                                                            Priority

Term finder
1   Collect data from a health forum                                                       Low
2   Analyse data to identify correct search keywords                                       Low

Corpus extraction
3   Extract Twitter corpus based on identified search keywords                             High
4   Design a database structure for the extracted corpus                                   Medium

Text normalization
5   Clean extracted corpus of noisy data (non-ASCII characters, links, Twitter symbols)    Medium
6   Normalize slang and acronyms                                                           High
7   Identify named entities in text (names, locations, organizations)                      High

Information extraction
8   Extract verbs, nouns and adjectives to find the most frequent words                    High
9   Build visual graphs based on extracted information                                     Low


3.2 Database structure

The first stage of the design process was to develop an appropriate data structure in which to store the collected Twitter corpus. This included deciding what type of database to use and how the data would be stored inside that database.

A comma-separated values (CSV) file was chosen as the final database type. As the required program output was designed to be a computational lexicon, a CSV-type database added flexibility and an easy-to-read design to the data. A CSV file can be opened with Microsoft Excel or any similar software, providing a clear structure and the ability for the user to manually view and modify the content.

The database columns are designed as follows:

SourceTweet – the unmodified version of the extracted tweet. Stored for validation and evaluation purposes.

TweetID – the unique identifier of the extracted tweet. Allowed the program to remove any duplicate tweets that could have been collected using different search keywords.

Location – if the user has enabled the location-tracking option on their device, this field provides the geo-location of the tweet in the format GeoLocation{latitude=39.6503, longitude=-75.7923}. Stored for information extraction purposes.

Time – the timestamp of the tweet. It reflects the time zone of the location from which the tweet was sent, not where it was collected. Stored for information extraction purposes.

hasLink – boolean variable indicating whether a tweet contained a link. Stored for statistics-gathering purposes.

hasNonASCII – boolean variable indicating whether a tweet contained a non-ASCII character. Stored for statistics-gathering purposes.

hasSlang – boolean variable indicating whether a tweet contained a slang or acronym word. Stored for statistics-gathering purposes.

hasOOVWord – boolean variable indicating whether a tweet contained a word that could not be found in the English or slang dictionaries. Stored for statistics-gathering purposes.

hasNE – boolean variable indicating whether a tweet contained a Named Entity (name, location, organization). Stored for statistics-gathering purposes.
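For illustration, a single record under this scheme might look like the following (all values are hypothetical; as described in Part 4, the pipe character is used as the column delimiter):

    Got to drink the local beer while in Roswell.|697546944988809216|GeoLocation{latitude=39.6503, longitude=-75.7923}|Wed Feb 10 21:14:05 GMT 2016|false|false|false|false|true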

3.3 Term finder

The term finder tool was designed during the Twitter corpus collection stage. Its main purpose was to add autonomy to the previously selected search keywords, confirming their validity.

The approach taken was to develop a separate program that collected data and text from a resource appropriate to the alcohol consumption domain. For this purpose, a section of a health forum related to alcoholism was identified as a relevant resource [23]. The program was designed to loop through the last 30 pages of the forum section (600 topics), collecting all the text from topic discussions and replies. The most used words and terms from this corpus were analysed to identify reliable search keywords for the Twitter extraction.

3.4 Twitter search

As the primary objective of this project was creating a computational lexicon, the main focus of the program was to collect, normalize and evaluate historical Twitter data. To satisfy this requirement, the Twitter Search API was selected over the Twitter Streaming API as the primary method of searching for and collecting tweets.

The Search API was found to be useful and reliable, as there was enough available data to collect. Given this, the API's focus on search relevance made it possible to maximize the number of domain-specific tweets that were extracted.


The Twitter extraction was designed to search for tweets published in the past 7 days and store the results, in order to provide a good base Twitter corpus for further normalization and text analysis.

3.5 Cleaning and normalizing data

In order to create a computational lexicon, all collected data must be cleaned of noise and normalized to achieve maximum consistency across the text.

This part of the program was designed to complete the following tasks:

Cleaning the Twitter noise – Twitter data can contain a large amount of noise, because a wide range of characters is allowed in a tweet, including emoticons, symbols and characters from other languages. As only tweets written in English were considered relevant for this project, such characters (all referred to as non-ASCII characters) were designed to be excluded from the corpus. All links and pictures found in the tweet text were also designed to be excluded as part of this task.

Normalizing the Twitter text – performing text normalization was an important part of the project in order to achieve the final result. This included applying text mining methods to achieve consistency in letter capitalization, punctuation and other aspects of the text. This would produce a normalized corpus that could reliably be processed by natural language processing tools.

Normalizing slang and acronyms – this is an extension of Twitter text normalization, focused specifically on normalizing any detected occurrences of slang and acronyms. The use of slang and acronyms is popular among Twitter users due to the character limit of a single tweet, so this project was designed to address this task as effectively as possible in order to create a reliable and consistent Twitter corpus.

3.6 Information extraction

For the purposes of this project, a simple form of information extraction was designed. As per the requirements, this project aimed to perform part-of-speech tagging on the extracted Twitter corpus in order to collect all the adjectives, nouns and verbs from the text. The list of all unique parts of speech would then be stored in separate files, together with the count of occurrences of each word.

This task was designed to provide a better understanding of the created lexicon. Analysing the most used adjectives or verbs in the corpus could provide insight into the overall sentiment of the alcohol consumption lexicon. Furthermore, with additional functionality, this information extraction can be considered a starting point for further lexicon analysis, such as analysing the contextual use of selected parts of speech by bringing up all the tweets in which an analysed word has been used.


Part 4

Implementation

This chapter discusses the implementation methods and tools that were used to develop this project.

4.1 Collecting and storing tweets

As mentioned in the Design chapter, collecting and storing tweets was an integral part of the project. First, search keywords had to be identified by creating and running the term finder program.

This program was written in the Ruby programming language [24]. Ruby was selected for this task because of its efficient web-page handling libraries (gems) – Nokogiri and Watir-Webdriver. Using these gems, the Ruby program was able to loop through the given number of topics, collecting all relevant text and storing it in a file. The next part of this sub-task was to analyse the extracted text and identify correct search keywords based on a more sophisticated method than a simple word count. For this purpose, the TerMine tool, developed by the National Centre for Text Mining, was used. TerMine is a system for terminological management featuring term extraction and acronym recognition. Its term extraction employs the C-value method, which incorporates linguistic filters and statistics for recognising terms, making it ideal for this task [25].

The next step was to use the extracted terms to search for and collect data from Twitter. There are a number of libraries that applications can use to access Twitter's data. The Twitter4j Java library [26] was identified as the best for this purpose thanks to its good documentation and readability. A Java program using this library was developed to search for and collect Twitter data in a secure and effective manner.

Finally, the gathered tweets were stored locally for further processing. As defined in the design stage, a CSV file was chosen as the tweet storage method, using the pipe character ("|") to separate the columns of data. As any occurrence of this character would split the data into different columns, pre-processing was required to remove all occurrences of the pipe character from the tweets, in order to ensure a consistent and reliable database structure.
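As an illustrative sketch only (the project's actual code is not reproduced here), a Twitter4j search combined with pipe-delimited output might look as follows; the query phrase, output file name and the subset of columns written are assumptions:

    import java.io.FileWriter;
    import java.io.PrintWriter;
    import twitter4j.Query;
    import twitter4j.QueryResult;
    import twitter4j.Status;
    import twitter4j.Twitter;
    import twitter4j.TwitterFactory;

    public class TweetCollector {
        public static void main(String[] args) throws Exception {
            // Credentials are read from twitter4j.properties on the classpath.
            Twitter twitter = TwitterFactory.getSingleton();

            Query query = new Query("drinking beer");   // assumed search phrase
            query.setLang("en");                        // English tweets only
            query.setCount(100);                        // max results per request

            QueryResult result = twitter.search(query);

            try (PrintWriter out = new PrintWriter(new FileWriter("tweets.csv", true))) {
                for (Status status : result.getTweets()) {
                    // Strip the pipe character so it cannot break the column structure.
                    String text = status.getText().replace("|", " ");
                    // toString() yields the GeoLocation{latitude=..., longitude=...} format.
                    String location = status.getGeoLocation() == null
                            ? "" : status.getGeoLocation().toString();
                    // SourceTweet | TweetID | Location | Time (boolean flags added later).
                    out.println(text + "|" + status.getId() + "|"
                            + location + "|" + status.getCreatedAt());
                }
            }
        }
    }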


4.2 Data normalization

The first part of the data normalization task required cleaning all the tweets of noisy data. This included removing all non-ASCII characters, links and Twitter-specific characters from the extracted corpus. This task was performed using Regular Expressions (RegEx), which identified all occurrences of the selected characters and removed them from the text using a rule-based approach. Some examples of this are shown below.

Removing all non-ASCII characters ("[^\\x00-\\x7F]"):

Input: “The good Lord has changed water into wine, so how can drinking beer be a sin? –Belgium”
Output: “the good lord has changed water into wine so how can drinking beer be a sin ? belgium”

Removing all links (“https\\S+|http\\S+”):

Input: “Got to drink the local beer while in Roswell. - Drinking a Roswell Alien Amber Ale @ Billy Rays Bar - https://t.co/qY6OAmy703 #photo”
Output: “got to drink the local beer while in roswell drinking a roswell alien amber ale at billy rays bar photo”
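A minimal sketch of this rule-based cleaning step, built around the two patterns quoted above; the remaining rules (punctuation handling and the "@" -> "at" substitution visible in the examples) are simplified assumptions:

    public class TweetCleaner {
        // Simplified version of the rule-based cleaner; the exact rule set
        // and ordering used in the project are assumptions.
        public static String clean(String tweet) {
            return tweet
                    .replaceAll("https\\S+|http\\S+", "")  // remove links
                    .replaceAll("[^\\x00-\\x7F]", "")      // remove non-ASCII characters
                    .replace(" @ ", " at ")                // "@ Billy Rays Bar" -> "at billy rays bar"
                    .toLowerCase()
                    .replaceAll("[#@.,!:;\"-]", " ")       // drop Twitter symbols and punctuation
                    .replaceAll("\\s+", " ")               // collapse repeated whitespace
                    .trim();
        }

        public static void main(String[] args) {
            System.out.println(clean("Got to drink the local beer while in Roswell. "
                    + "- Drinking a Roswell Alien Amber Ale @ Billy Rays Bar "
                    + "- https://t.co/qY6OAmy703 #photo"));
        }
    }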

The next part of text normalization required tokenization and slang handling.

4.2.1 Tokenization

Tokenization is a necessary development step before more sophisticated NLP methods can be applied to text. For this part of the implementation, the Apache OpenNLP libraries [27] were selected over the other options (e.g. Stanford CoreNLP) due to their good documentation.

The Apache OpenNLP Tokenizer segments the input character sequence into tokens. As in most other systems, the process is performed in two stages: first, sentence boundaries are identified, then the tokens within each sentence are identified.

Input: “someone is drinking beer with a straw. stop him”
Output: {“someone”, “is”, “drinking”, “beer”, “with”, “a”, “straw”, “.”} {“stop”, “him”}
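A minimal sketch of this two-stage OpenNLP segmentation, assuming the pre-trained "en-sent.bin" and "en-token.bin" model files are available locally:

    import java.io.FileInputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;

    public class TweetTokenizer {
        public static void main(String[] args) throws Exception {
            // Load the pre-trained sentence and token models (assumed file paths).
            SentenceModel sentModel = new SentenceModel(new FileInputStream("en-sent.bin"));
            TokenizerModel tokModel = new TokenizerModel(new FileInputStream("en-token.bin"));

            SentenceDetectorME sentenceDetector = new SentenceDetectorME(sentModel);
            TokenizerME tokenizer = new TokenizerME(tokModel);

            // Stage 1: find sentence boundaries; Stage 2: tokenize each sentence.
            for (String sentence : sentenceDetector.sentDetect(
                    "someone is drinking beer with a straw. stop him")) {
                String[] tokens = tokenizer.tokenize(sentence);
                System.out.println(String.join(", ", tokens));
            }
        }
    }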


4.2.2 Slang and acronym normalization

Slang and acronym normalization was given the most focus in this project, as slang and acronyms occur frequently in social media and their handling is crucial for creating a consistent and reliable lexicon. It was important to ensure that all terms and acronyms with the same meaning are stored identically, so that NLP tools achieve the best efficiency in future analysis.

For this purpose, the most complete available lists of English words and slang were used for look-up operations. A local database of slang words and their meanings was created from an online resource (noslang.com) [28], containing more than 5,000 English slang terms and acronyms. Another database of more than 300,000 English words was also used in the look-up operations. If a word was not found in the English dictionary, a look-up against the slang dictionary was performed, swapping the acronym for its fully spelled-out meaning (e.g. “b4” -> “before”).

Input: “Bernie Sanders was drinking beer on national tv last night , for that doesn't prove that he should be president then idk what will .”
Output: “bernie sanders was drinking beer on national tv last night for that does not prove that he should be president then I don't know what will”

[Figure: slang normalization process pipeline]
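A minimal sketch of this look-up pipeline; the file names and formats (english_words.txt, slang.csv) are assumptions for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    public class SlangNormalizer {
        private final Set<String> englishWords = new HashSet<>();
        private final Map<String, String> slang = new HashMap<>();

        public SlangNormalizer() throws IOException {
            // Assumed formats: one word per line; "slang,meaning" per line.
            englishWords.addAll(Files.readAllLines(Paths.get("english_words.txt")));
            for (String line : Files.readAllLines(Paths.get("slang.csv"))) {
                String[] parts = line.split(",", 2);
                slang.put(parts[0], parts[1]);
            }
        }

        /** Replaces tokens absent from the English dictionary with their slang expansion. */
        public String normalize(String[] tokens) {
            StringBuilder out = new StringBuilder();
            for (String token : tokens) {
                if (!englishWords.contains(token) && slang.containsKey(token)) {
                    out.append(slang.get(token));   // e.g. "b4" -> "before"
                } else {
                    out.append(token);              // known word, or OOV: keep as-is
                }
                out.append(' ');
            }
            return out.toString().trim();
        }
    }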


4.3 Part-of-speech tagging

Following the data normalization and tokenization steps, the program implemented a POS tagger module. This module took a tokenized tweet as its input and marked every token with its corresponding word type, based on the context and the token itself. This was implemented using another Apache OpenNLP tool, which produces as its output an array of tags, each corresponding to a token of the input sequence.

The part-of-speech tagger uses a probabilistic model to predict the correct word tag out of the tag set. It can be further improved by adding a tag dictionary, which increases both tagging accuracy and runtime performance [27].

Finally, all the adjectives, nouns and verbs found with the tagger were extracted and stored separately in their respective files for future analysis.
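A minimal sketch of this module, assuming the pre-trained "en-pos-maxent.bin" model; the Penn Treebank tag prefixes JJ/NN/VB are used to pick out adjectives, nouns and verbs:

    import java.io.FileInputStream;
    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class PosExtractor {
        public static void main(String[] args) throws Exception {
            POSModel model = new POSModel(new FileInputStream("en-pos-maxent.bin"));
            POSTaggerME tagger = new POSTaggerME(model);

            String[] tokens = {"someone", "is", "drinking", "beer", "with", "a", "straw"};
            String[] tags = tagger.tag(tokens);   // one tag per input token

            for (int i = 0; i < tokens.length; i++) {
                // Penn Treebank tags: JJ* = adjective, NN* = noun, VB* = verb.
                if (tags[i].startsWith("JJ") || tags[i].startsWith("NN")
                        || tags[i].startsWith("VB")) {
                    System.out.println(tokens[i] + "\t" + tags[i]);
                }
            }
        }
    }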

4.4 Named-entity recognition

The final part of the program implementation required the development of a Named-entity (NE) recognition module. The module was designed to analyse each word in the Twitter corpus to determine whether it is a person name, a location or an organization. This was achieved by loading three separate Apache OpenNLP pre-trained NE models (the location, person and organization name finder models) and analysing each word of the Twitter data against these models. Any successful detection of such an instance marks the given tweet as containing a Named Entity.

[Figure: named-entity recognition process pipeline]
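A minimal sketch of this module, assuming the pre-trained OpenNLP person model "en-ner-person.bin" (the location and organization models are loaded the same way):

    import java.io.FileInputStream;
    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class NamedEntityChecker {
        public static void main(String[] args) throws Exception {
            TokenNameFinderModel model =
                    new TokenNameFinderModel(new FileInputStream("en-ner-person.bin"));
            NameFinderME personFinder = new NameFinderME(model);

            String[] tokens = {"Bernie", "Sanders", "was", "drinking", "beer"};
            Span[] spans = personFinder.find(tokens);   // spans of detected person names

            // Any detected span marks the tweet as containing a Named Entity (hasNE).
            boolean hasNE = spans.length > 0;
            System.out.println("hasNE = " + hasNE);

            personFinder.clearAdaptiveData();   // reset document-level context between tweets
        }
    }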


Part 5

Results and evaluation

This chapter outlines the results achieved as part of this project and the evaluation methods used to validate them.

5.1 Results

The program extracted a total of 84,632 tweets related to alcohol consumption. Due to the high computational load and time constraints, a total of 9,174 tweets were processed and normalized using the methods described in the Implementation chapter.

Four files were generated as the program's output: the main corpus file, containing the processed and normalized tweets, and three files containing the extracted adjectives, nouns and verbs respectively.

The distribution of the extracted parts of speech is shown in the figure below. A summary of the most used adjectives, nouns and verbs can be found in Appendix 1.

[Figure: distribution of extracted parts of speech – 8,785 adjectives, 40,008 nouns, 15,403 verbs]


5.2 Evaluation of the extracted corpus

For the purposes of this project, it was important to evaluate the extracted corpus on its relevance to the selected domain – alcohol consumption. For this task, a subset of 1,000 tweets was manually analysed and annotated according to its relationship to this topic. The results of the evaluation are shown in the figure below.

[Figure: relevance of the evaluated 1,000-tweet subset – 93% alcohol consumption, 7% generic alcohol-related]

929 (93%) of the tweets in the selected subset were identified as related to alcohol consumption.

Example of an alcohol-consumption-related tweet:

“One of the best ones I have ever tried :). - Drinking a Cusque Trigo Wheat Beer - *link* #photo”

Example of a generic alcohol-related tweet:

“No. I do not associate drinking beer with eating.”

While the second example contains the search keywords, it was not considered related to alcohol consumption and was marked as a False Positive (FP).

This overall positive evaluation of the extracted corpus can be attributed to the techniques used by the Twitter Search API. An important feature of the Search API is its focus on relevance rather than completeness [8], which means the extracted tweets do not necessarily fully match the search criteria; however, they are more likely to be relevant to the topic. For example, the tweets “Day 1: listening to David Bowie and Mac DeMarco on vinyl in my living room whilst drinking beer. Stay tuned for day 2.” and “Moderately high heat, great dark chocolate notes. - Drinking a Habanero Supernova @ Dogberry Brewing - *link*” were both extracted using the search phrase “drinking beer”.
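In standard information-retrieval terms, treating the manual annotation as the gold standard, this figure is the precision of the subset:

    \text{precision} = \frac{TP}{TP + FP} = \frac{929}{1000} = 92.9\% \approx 93\%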


5.2.1 Search keywords comparison

An important observation was made regarding the effectiveness of different search keywords. A further corpus evaluation was conducted on two subsets of tweets extracted with the keywords “ginger ale” and “party drinking”, in order to compare the results of different search phrases. The figure below shows the results of this comparison.

[Figure: True Positive (TP) and False Positive (FP) counts per 1,000-tweet subset for the search phrases “drinking beer”, “ginger ale” and “party drinking”]

The comparison shows that the best results were achieved when extracting tweets with the search keywords “drinking beer”, with 93% True Positive instances. Applying the search keywords “ginger ale” led to a significant decline in performance, with 32% False Positive results (4.5 times higher than the “drinking beer” subset): tweets were extracted that are not related to alcohol consumption.

This evaluation further highlighted the importance of careful selection of search keywords, particularly when working with the Twitter Search API, as it can have a big impact on the relevance of the extracted corpus to the selected domain.

5.2.2 An alternative computational approach to corpus evaluation for future work

An alternative corpus evaluation method will be developed as part of future work on this project. Its main focus will be the implementation of a Naïve Bayes classifier, which will use the already extracted and partially labelled data set as its training data.

Such an approach would create an automated method of corpus evaluation, making it faster and more efficient. It could also help in the evaluation of search keywords, identifying the most reliable and effective ones.
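As an illustration of the planned approach (not an implemented component of this project), a minimal multinomial Naïve Bayes sketch over tweet tokens might look like this:

    import java.util.HashMap;
    import java.util.Map;

    /** Minimal multinomial Naive Bayes over tweet tokens (illustrative sketch only). */
    public class NaiveBayesSketch {
        private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        private final Map<String, Integer> classCounts = new HashMap<>();
        private final Map<String, Integer> classTokenTotals = new HashMap<>();
        private int totalDocs = 0;
        // Assumed vocabulary size for Laplace smoothing (e.g. the English word list of 4.2.2).
        private final int vocabularySize = 300_000;

        /** Train on one labelled tweet, e.g. label "TP" (relevant) or "FP" (generic). */
        public void train(String label, String[] tokens) {
            classCounts.merge(label, 1, Integer::sum);
            classTokenTotals.putIfAbsent(label, 0);
            totalDocs++;
            Map<String, Integer> counts =
                    wordCounts.computeIfAbsent(label, k -> new HashMap<>());
            for (String token : tokens) {
                counts.merge(token, 1, Integer::sum);
                classTokenTotals.merge(label, 1, Integer::sum);
            }
        }

        /** Returns the label with the highest log-probability, using Laplace smoothing. */
        public String classify(String[] tokens) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (String label : classCounts.keySet()) {
                double score = Math.log(classCounts.get(label) / (double) totalDocs);
                for (String token : tokens) {
                    int count = wordCounts.get(label).getOrDefault(token, 0);
                    score += Math.log((count + 1.0)
                            / (classTokenTotals.get(label) + vocabularySize));
                }
                if (score > bestScore) { bestScore = score; best = label; }
            }
            return best;
        }
    }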


Part 6

Conclusions

This chapter provides a reflective overview of the accomplished results, as well as plans for future project development.

6.1 Work overview

The program developed for this project managed to deliver a sensibly normalized Twitter corpus of alcohol consumption. The results of this normalization can be seen by comparing the initially extracted corpus, which contained a lot of noisy and meaningless data, with the normalized one. The normalized corpus provides a good platform for the future tasks required to create a computational lexicon.

Additionally, some initial data analysis and corpus evaluation tasks were completed as part of this project. The extracted and analysed parts of speech serve as a starting reference point for anyone who wants a basic understanding of the content and structure of the alcohol consumption lexicon.

Finally, the term finder tool developed as part of this project provides researchers with extra functionality and flexibility. With further development and the integration of all parts into one system, this tool could be used to quickly analyse and compare lexicons of different domains and to find relationships between them (e.g. the alcoholism vs. alcohol consumption lexicons).

6.2 Future work

The process of creating a computational lexicon raised more potential program implementations than there was time to develop. There are exciting text mining and natural language processing tools and techniques that are yet to be implemented in order to create a fully functional computational lexicon. The next stages of development will focus on adding the following features and functionality:


Stemming and lemmatisation. This will be an important part of future development, as the implementation of such tools will greatly improve the consistency of the generated lexicon. At this point in time, the program incorrectly processes words like “listening” and “listened” separately, due to the lack of this functionality. Stemming and lemmatisation will increase the performance of the part-of-speech tagger module, producing more consistent and reliable results; a possible starting point for this step is sketched below.
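A minimal sketch using OpenNLP's built-in Porter stemmer (assuming a recent opennlp-tools version that includes it); lemmatisation would instead require a dictionary- or model-based lemmatizer:

    import opennlp.tools.stemmer.PorterStemmer;

    public class StemmerSketch {
        public static void main(String[] args) {
            PorterStemmer stemmer = new PorterStemmer();
            // Both inflected forms reduce to the same stem, so they would
            // be counted as one lexicon entry rather than two.
            System.out.println(stemmer.stem("listening"));  // listen
            System.out.println(stemmer.stem("listened"));   // listen
        }
    }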

Lexicon evaluation tools. As mentioned earlier in the report, corpus evaluation is one of the most important aspects of creating a computational lexicon. Due to the time constraints and the size of this task, it was difficult to implement any sensible evaluation tools at this stage of the project. One of the evaluation tools to be developed as part of future work is a GUI that researchers can use to label and annotate words, terms and their relationships inside the extracted and normalized corpus. This would be an important step towards the validation of the computational lexicon.

Contextual analysis of the extracted words. It would be a great addition to the project if researchers could analyse the context of the extracted adjectives, verbs and nouns. This can be achieved by linking every tagged word to a list of the source tweets that contained it. Adding Named Entities to this list of extracted words would also add more information and analysis points to the created computational lexicon.

6.3 Summary

Overall, this project has definitely been a great learning experience for me. To be honest, at the start of the project I was not even able to describe what its end result would be. It took many long hours of reading through related work in this domain to fully understand the final deliverables, but my clear goal of learning these tools, and the desire to better understand different Big Data analytics methods, kept me going and proved this project to be a good choice. The study of related work done by the National Centre for Text Mining in the medical text mining domain, and their creation of the BioLexicon, taught me that there is much more to creating a fully functional computational lexicon than I first imagined.

While the targeted computational lexicon was not fully completed, this project went beyond my initial expectations. Now that I have a more advanced understanding of text mining and natural language processing tools, together with a working program, I believe that through future development work I have a chance to contribute to the health and science world by developing a computational lexicon of alcohol consumption.


References

[1] “Twitter Company – About” (2016). Available at: https://about.twitter.com/company

[2] Morrow, M. (2014). “Social Media: Staying Connected”. Nursing Science Quarterly, Vol. 27(4), 340.

[3] Davidov, D., Tsur, O. and Rappoport, A. (2010). “Enhanced sentiment learning using twitter hashtags and smileys”. In Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 241–249. Association for Computational Linguistics.

[4] Dwoskin, E. (2014). “In a Single Tweet, as Many Pieces of Metadata as There Are Characters”. The Wall Street Journal. Available at: http://blogs.wsj.com/digits/2014/06/06/in-a-single-tweet-as-many-pieces-of-metadata-as-there-are-characters/

[5] “Introducing the new metadata for Tweets” (2013). Available at: https://blog.twitter.com/2013/introducing-new-metadata-for-tweets

[6] “Healthchannels forum”. Available at: http://www.healthcommunities.com/health/forums.shtml

[7] “Health Forums List” (2012). Available at: http://forumlist.info/health-forums-list/

[8] “Twitter Search API”. Available at: https://dev.twitter.com/rest/public/search

[9] “Twitter Streaming APIs”. Available at: https://dev.twitter.com/streaming/overview

[10] “Aggregating tweets: Search API vs. Streaming API”. Twitter API Consulting. Available at: http://140dev.com/twitter-api-programming-tutorials/aggregating-tweets-search-api-vs-streaming-api/

[11] Thompson, P., Batista-Navarro, R. T., Kontonatsios, G., Carter, J., Toon, E., McNaught, J., Timmermann, C., Worboys, M. and Ananiadou, S. (2016). “Text Mining the History of Medicine”. Available at: http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0144717#authcontrib

[12] “Tokenization”. Available at: http://searchsecurity.techtarget.com/definition/tokenization

[13] “Stemming”. Available at: https://en.wikipedia.org/wiki/Stemming

[14] Hull, D. A. (1996). “Stemming algorithms: A case study for detailed evaluation”. JASIS, Vol. 47(1), pp. 70–84.

[15] Feldman, R. and Sanger, J. (2007). The Text Mining Handbook: Advanced Approaches in Analysing Unstructured Data. Cambridge University Press.

[16] “Computational Lexicology”. Available at: https://en.wikipedia.org/wiki/Computational_lexicology

[17] Schubert, L. “Computational Linguistics”. The Stanford Encyclopedia of Philosophy (Spring 2015 Edition), Edward N. Zalta (ed.). Available at: http://plato.stanford.edu/archives/spr2015/entries/computational-linguistics

[18] Cohen, A. and Hersh, W. (2004). “A Survey of Current Work in Biomedical Text Mining”.

[19] Tekiroglu, S., Ozbal, G. and Strapparava, C. (2014). “A Computational Approach to Generate a Sensorial Lexicon”.

[20] Thompson, P., McNaught, J., Montemagni, S., Calzolari, N., del Gratta, R., Lee, V., Marchi, S., Monachini, M., Pezik, P., Quochi, V., Rupp, CJ, Sasaki, Y., Venturi, G., Rebholz-Schuhmann, D. and Ananiadou, S. (2011). “The BioLexicon: a large-scale terminological resource for biomedical text mining”. BMC Bioinformatics, 12:397.

[21] Fellbaum, C. (1998). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA.

[22] Browne, A. C., Divita, G., Aronson, A. R. and McCray, A. T. (2003). “UMLS language and vocabulary tools”. AMIA Annu Symp Proc., 798.

[23] “DailyStrength”. Available at: http://www.dailystrength.org/

[24] “Ruby Programming Language”. Available at: https://www.ruby-lang.org/en/

[25] Frantzi, K., Ananiadou, S. and Mima, H. (2000). “Automatic recognition of multi-word terms”. International Journal on Digital Libraries, 3(2), pp. 117–132.

[26] “Twitter4J”. Available at: http://twitter4j.org/en/index.html

[27] “Apache OpenNLP”. Available at: https://opennlp.apache.org/

[28] “NoSlang”. Available at: http://www.noslang.com/


Appendix 1

Verbs:

Word        Count
drinking    8302
want        523
does        481
is          615
'm          169
be          162
was         126
have        107
stop        101
do          97
are         88
aged        88
let         82
manning     78
get         68
sends       66
had         61
shop        59
're         55
've         52
am          50
watching    50
love        50
been        48
eating      46
enjoy       44
go          43
brewing     38
laughing    38
got         37


Adjectives:

Word        Count
good        366
pale        309
nice        212
black       201
red         183
imperial    136
great       133
old         129
double      128
big         125
white       117
brown       100
little      89
new         83
sweet       82
cold        81
last        78
hoppy       74
golden      74
delicious   74
dark        72
sour        70
bitter      70
best        66
bad         65
blue        59
irish       58
super       50
dry         48
first       48


Nouns:

Word        Count
photo       2727
beer        2567
ipa         1226
ale         1052
company     733
craft       599
peyton      522
stout       497
bud         438
brewing     313
porter      227
i           214
brewery     208
bar         207
hop         201
hopslam     179
lager       174
house       173
chocolate   142
coffee      135
wine        106
day         105
samuel      105
barrel      105
pub         104
co          102
time        100
night       94
love        92
winter      91