UNIVERSITY OF SURREY
DEPARTMENT OF COMPUTING

BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY

FINAL YEAR PROJECT 2002-2003

Text Classification of USENET messages for a Conversation Visualisation System

CS300: FINAL REPORT

JOLYON HUNTER
cs91jh@surrey.ac.uk
URN: 1930192

SUPERVISOR: Dr. Andrew Salway    DATE:
STUDENT: Jolyon Hunter    DATE:
BSC. (HONOURS) COMPUTING AND INFORMATION TECHNOLOGY
FINAL YEAR PROJECT: FINAL REPORT

ABSTRACT

Text Classification of USENET messages for a Conversation Visualisation System
The widespread use of computers, and of the Internet in particular, as a means of communication has seen more and more people connecting around the globe. These online discussions are usually text-based, fast, simple and global in scope, yet they lack the non-verbal elements, the extra levels of meaning and the emotion which face-to-face, person-to-person communication provides. Surely, with the technology now available, these meanings and emotions can be depicted visually?
This report details the investigation of how messages or conversations within USENET newsgroups
can be classified automatically as part of a system to visually represent online discussions. The project
aims and objectives are defined before the nature and history of USENET is described to provide
some context to the investigation. The current state-of-the-art in Visualisation Designs and Text
Classification methods is described before a period of observation is undertaken in an attempt to
define phenomena to visualise and features (i.e. particular words in the text) which characterise those
phenomena.
Once candidate features have been defined, a corpus of USENET messages from a range of
newsgroups is collected in order to conduct in-depth analysis into those features. This analysis
involves the use of the language engineering workbench System Quirk (to glean word frequencies
from the sample), and the clustering toolkit gCLUTO (to aid interpretation of the data). From this
analysis, example rules are defined which can be programmed into a system to classify the text
according to defined classes. An example conversation visualisation system is specified, and a system to classify text is then developed, implemented, tested and evaluated.
ACKNOWLEDGEMENTS

"If we knew what we were doing, it wouldn't be called research, would it?"
ALBERT EINSTEIN

Dr. Andrew Salway, for his continued support and advice with all aspects of this project. Also for keeping me sane, and making me realise that I did know what I was doing.

CodeZebra: Sara & Erik in particular, thanks for letting me participate.

The Smashing Pumpkins, Dave Matthews Band and the rest, for getting me through the tough times;

Starbucks Coffee Co. for the even tougher times;

Personal thanks must go to:

My Dad, and my sisters Eliane & Abigail, for their continuing support and love. I could not have done this without you.

And of course

Mum, always with me in spirit. Forever my inspiration in everything. This is dedicated to you.
TABLE OF CONTENTS
Abstract ...............................................................................................................................................................1
Acknowledgements ...........................................................................................................................................2
Table of Contents ...........................................................................................................................................3
1. Introduction .................................................................................................................................................5
1.1 Project Overview.............................................................................................................................5
1.2 A History of USENET ..................................................................................................................7
2. Systems to Visualise Conversation .....................................................................................................10
2.1 Visualisation Designs....................................................................................................................10
2.1.1 Loom ................................................................................................................................11
2.1.2 CodeZebra.......................................................................................................................14
2.1.3 Conversation Map..........................................................................................................16
2.1.4 Netscan ............................................................................................................................18
2.1.5 Visualisation Designs: How they compare.................................................................20
2.2 Methods for Classifying Text ......................................................................................................21
2.2.1 Smokey.............................................................................................................................22
2.2.2 WEBSOM .......................................................................................................................24
2.2.3 CLUTO............................................................................................................................26
2.3 Discussion ......................................................................................................................................28
3. Features for classifying USENET conversations...........................................................29
3.1 Introduction ...................................................................................................................................29
3.2 Initial Observations of USENET Newsgroups .......................................................................30
3.3 Statistical Corpus Analysis...........................................................................................................33
3.3.1 The Analysis Process .....................................................................................................35
3.3.2 Cyberspeak: Features for classification?.....................................................................37
3.3.3 Personal Pronouns: Features for classification?........................................................42
3.3.4 Synonyms of Agreement/Disagreement: Features for classification?...................47
3.4 Summary: Creating some Rules?.................................................................................................51
3.5 Further Analysis: Phenomena Classification ............................................................................54
3.5.1 Summary: Creating more rules?...................................................................................59
4. Developing a System to Classify text for use in a Conversation Visualisation System ......62
4.1 Requirements Analysis .................................................................................................................63
4.2 Design .............................................................................................................................................65
4.3 Implementation .............................................................................................................................67
4.4 Testing.............................................................................................................................69
4.5 Evaluation ......................................................................................................................71
5. Closing Remarks ......................................................................................................................................75
5.1 Achievements.................................................................................................................................76
5.2 Discussion & Future Work..........................................................................................................78
References.......................................................................................................................................................79
Bibliography...................................................................................................................................................83
Appendices .....................................................................................................................................................86
APPENDIX 1: Supporting Material for Chapter 3 Analysis .......................................................87
APPENDIX 2: gCLUTO Input files...............................................................................................99
APPENDIX 3: Rule-Based Processor PERL code.................................................................101
APPENDIX 4: System Testing & Evaluation Questionnaire....................................................103
APPENDIX 5: Ten Sample Messages...........................................................................................106
1. INTRODUCTION

"Cyberspace: A consensual hallucination experienced daily by billions of legitimate operators, in every nation... A graphic representation of data abstracted from the banks of every computer in the human system."
WILLIAM GIBSON, Neuromancer (1984:51)
1.1 Project Overview
Cyber-culture and social interaction on the Internet (or in Cyberspace) form a burgeoning area of research, not just for Computer Scientists, but for psychologists, sociologists and linguists alike. The principal means of communication and interaction on the Internet are online discussions, in the form of chat (IRC or Internet Relay Chat), Instant Messaging and newsgroup discussion (USENET).
Although newsgroups can be extremely valuable sources of information, they tend to be relatively inaccessible to outsiders or the general Internet user. For example, a discussion centred on a topic requiring a high degree of technical knowledge in, say, Computing, might be inaccessible to a student of Archaeology (yet the information contained in the discussion may be of relevance to that student; the problem is in accessing it). This has prompted some academics to investigate ways and means of visualising these online discussions, so as to convey the social qualities as well as the informative qualities of the discussion.
Pioneers in the field of visualising online discussions come from the Sociable Media Group (SMG) [40] at the Massachusetts Institute of Technology's Media Lab, the Social Technologies Group at UC Berkeley [41], and more recently from Microsoft Research [26], all of which will be covered in detail in Section 2.
Given the huge range of social interactions, concerning topics from pet care advice and recipes to debates about the origin of life, it is clear that opportunities for research exist in
exploring new ways to classify, and ultimately display, these interactions in a more intuitive way than is currently possible.
To facilitate this research aim, it will be necessary to carry out an in-depth examination of past and
current research in the area of visualising online conversations. As stated in the project inception document, preliminary investigations suggest that the focus of this research tends to be more on the visualisation process and the interface considerations of designing systems than on the actual means of classifying text. From these research papers it should be possible to determine what
phenomena others have attempted to visualise, and this information will be useful to bear in mind
when investigating USENET messages in greater detail.
This project deals with automatic classification of USENET messages, which requires an analysis of
textual features which exist in messages, and an investigation into whether these can be meaningfully represented in machine-readable form, i.e. coded into a program for automatic extraction.
One aim of this project is to make the automatic collection of such data much easier, and thereby to aid the design and implementation of a conversation visualisation system. Through greater automatic classification of the text contained within messages, it should be easier to use the data to construct meaningful visualisations: an ideal situation would be one where numerical representations could be turned into graphical ones, and back again, without losing accuracy. To facilitate this, a
corpus of text (a minimum of 250,000 words) will be collated from one or a number of USENET
newsgroups and subsequently analysed. The aim of this analysis will be to determine patterns and
heuristics that will enable the creation of a rule set. These rules will then be incorporated into a
system that will facilitate the automatic classification of messages and conversations.
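By way of illustration, the kind of rule-based classification this analysis is intended to produce might be sketched as follows. This is a hypothetical Python sketch with invented feature words; the actual rule-based processor developed for this project is written in Perl (see Appendix 3).

```python
from collections import Counter
import re

def word_frequencies(corpus_text):
    """Tokenise a corpus into lowercase words and count occurrences,
    as System Quirk is used to do over the collected sample."""
    words = re.findall(r"[a-z']+", corpus_text.lower())
    return Counter(words)

# Hypothetical rule set: a class is assigned when any of its
# feature words appears in a message.
RULES = {
    "agreement": {"agree", "yes", "exactly"},
    "disagreement": {"disagree", "no", "wrong"},
}

def classify(message):
    """Return, sorted, the classes whose feature words occur in the message."""
    tokens = set(re.findall(r"[a-z']+", message.lower()))
    return sorted(cls for cls, feats in RULES.items() if tokens & feats)

print(classify("No, I think that's wrong."))  # ['disagreement']
```

In practice the rules derived in Chapter 3 are more involved than single-word matches, but the principle (features observed in the corpus become machine-checkable conditions) is the same.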
The following report details the process of research, analysis and system development concerning the
classification of USENET messages for a conversation visualisation system. However, in order to
provide some context to the focus of this report, it is first necessary to understand more about
USENET itself.
1.2 A History of USENET
Throughout human existence, a gifted few visionaries have influenced thinking and furthered the development of our species; one might include Darwin, da Vinci or Einstein in this group, but in recent times one might consider the relatively unknown JCR Licklider. In his and Robert Taylor's The Computer as a Communication Device (1968) [25], Licklider foresaw many innovations which today we take for granted, such as team working via computers distributed over great distances, and video conferencing. It was when he became Director of the Information Processing Techniques Office (IPTO) at the Pentagon's Advanced Research Projects Agency (ARPA) that he put in place the financial priorities which would eventually lead to the development of what we know today as the Internet. Effectively he held the purse-strings, but at the same time created a working environment where graduate students ran a multi-million dollar research project.
The ARPANET developed out of this endeavour in the late 1960s and was originally intended as a military resource for sharing supercomputers across the United States. Academic institutions soon saw the benefits of this network for research and the sharing of knowledge, and in 1969 researchers at four universities connected their individual campuses to ARPANET, becoming the first "hosts" on the network. Within a couple of years, most US universities were connected and a phenomenon known as e-mail had become the most popular application of the network. Over the following 30 years, the network grew exponentially and various protocols were developed for particular types of communication between computers: protocols such as the Transmission Control Protocol/Internet Protocol (TCP/IP), the basis of the Internet, and the Unix-to-Unix Copy Protocol (UUCP), upon which USENET was developed.
USENET was born in 1979 when Duke University graduate students Tom Truscott and Jim Ellis implemented an idea they had about linking discussions within the UNIX community [37]. Using UUCP they managed to get computers to communicate via auto-dial modems: Steve Bellovin of the University of North Carolina created scripts which meant computers could automatically dial each other up and check the date stamps of files; if the date stamps differed, the files were copied from one computer to the other. Truscott and Ellis described their ideas and plans at the January 1980 USENIX (UNIX Users Group) conference; they emphasised the collaborative nature of their efforts and welcomed a collaborative development process. These basic concepts were built upon (principally by Stephen Daniel, a fellow Duke graduate) and developed using the C programming language. Daniel's program for accessing USENET, called A
News, debuted at the summer 1980 USENIX conference. USENET was seen as an alternative to the expensive, clique-like ARPANET. Essentially, Truscott, Ellis and Bellovin created a "poor man's ARPANET": an electronic community created cheaply and without the political problems associated with the actual ARPANET, and it grew exponentially over the following years [14].
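The date-stamp check underlying Bellovin's scripts (copy a file whenever its timestamp differs between the two machines) can be sketched in modern terms. The following Python fragment is purely illustrative; the originals were UNIX shell scripts operating over auto-dial modem links.

```python
import os
import shutil

def sync_changed_files(local_dir, remote_dir):
    """Copy files from remote_dir whose modification time differs
    from the local copy (or which do not exist locally), mimicking
    the date-stamp check used by the early USENET transfer scripts."""
    copied = []
    for name in sorted(os.listdir(remote_dir)):
        src = os.path.join(remote_dir, name)
        dst = os.path.join(local_dir, name)
        if not os.path.isfile(src):
            continue
        if not os.path.exists(dst) or os.path.getmtime(src) != os.path.getmtime(dst):
            shutil.copy2(src, dst)  # copy2 preserves the timestamp
            copied.append(name)
    return copied
```

Because the timestamp is preserved on copy, a second pass finds nothing to transfer, which is exactly what made the scheme workable over slow dial-up connections.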
USENET developed into one of the largest asynchronous communication media ever created. It is communication structured in turns, the major benefit being that people from across the globe can discuss or collaborate on a particular topic regardless of the restrictions of daily schedule or time zone. Essentially it is an informal worldwide network of discussion groups called newsgroups: a distributed database of messages, passed between clients and servers using the Network News Transfer Protocol (NNTP, which superseded UUCP). Because of its distributed nature, there is no central authority dictating boundaries or policing USENET. By technical definition, anarchy exists; however, order and structure do exist. USENET is organised into subject hierarchies, with the first few letters of a newsgroup's name indicating the major subject category and subtopic names representing sub-categories. For example, rec.music.jazz represents recreational (rec) discussions about music (music), with the further specific sub-topic of jazz music (jazz).
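The hierarchy encoded in a newsgroup's name can be decomposed mechanically; a small illustrative sketch (Python, with a hypothetical helper function):

```python
def parse_newsgroup(name):
    """Split a newsgroup name into its hierarchy levels, from the
    top-level category down to the most specific sub-topic."""
    parts = name.split(".")
    return {"hierarchy": parts[0], "subtopics": parts[1:]}

print(parse_newsgroup("rec.music.jazz"))
# {'hierarchy': 'rec', 'subtopics': ['music', 'jazz']}
```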
Figure 1: A typical Newsgroup Reader: messages are downloaded from a news server over NNTP. On the left is a list of subscribed newsgroups, with the top frame depicting the subject lines of messages and threads within that group. Below is a preview pane displaying the selected message.
Conferencing systems, of which USENET is the largest example, are essentially a refinement of e-mail discussion lists: a series of messages strung together and related through similar topics. A distinction can be made, though, in that e-mail is a "push" technology (people receive it regardless of whether they asked for it or actually want it), whereas newsgroups are a form of "pull" technology, where users can select the groups and messages they want and request only those. Figure 1 depicts a typical newsgroup reader downloading messages from a news server using NNTP. A more recent way of accessing USENET exists in the form of Google Groups, which provides a fully searchable archive of messages dating back to 1981, displayed in HTML format (see Figure 2). Google also provides a good overview of the associated terminology of USENET via its website [11]. This more user-friendly interface to USENET is a useful starting point for some initial observations into the structure and content of messages, and the language used in newsgroups, which will provide a basis for more in-depth investigation in the sections to come.
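The "pull" model maps directly onto the commands a newsreader sends over NNTP: select a group, then request only the wanted articles. The following sketch builds such a command sequence (RFC 977 syntax); it is illustrative only and is not run against a live server.

```python
def nntp_commands(group, article_numbers):
    """Build the NNTP command sequence a newsreader would send to
    pull selected articles from a news server (RFC 977 syntax)."""
    cmds = [f"GROUP {group}"]          # select the newsgroup
    for n in article_numbers:
        cmds.append(f"ARTICLE {n}")    # request one article by number
    cmds.append("QUIT")
    return cmds

for line in nntp_commands("rec.music.jazz", [101, 102]):
    print(line)
```

Nothing is transferred that the client did not explicitly ask for, which is the essence of the pull model described above.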
Figure 2: A view of "rec.music.jazz" from the Google Groups website (http://groups.google.com/), which provides searchable, archived newsgroups browsable in HTML format.
2. SYSTEMS TO VISUALISE CONVERSATION

"The soul never thinks without an image"
ARISTOTLE
2.1 Visualisation Designs
Having established an understanding of USENET, it is time to investigate what is essential in the
design of a system which can be used to classify text for use in a conversation visualisation system. It
is first pertinent to identify the relevant research and projects already undertaken in relation to this
topic. By examining the state-of-the-art it may be possible to further enhance or combine
techniques for the purposes of this project, and any future implementations.
This section attempts to compare and contrast the different systems which already exist, both in the
field of text classification and visualisation systems, and will distinguish the key areas related to this
project. Armed with this knowledge it should be possible to proceed and develop a method (or
system) for classifying the text of USENET messages.
It is perhaps worth mentioning that a good overview of designing graphical representations for persistent conversations can be found in the paper Visualizing Conversation [7] by Judith Donath, Karrie Karahalios and Fernanda Viégas.
2.1.1 Loom
The original Loom project began as a class project within MIT's Sociable Media Group, and was taken on by graduate student Karrie Karahalios [19, 20]. The underlying vision for the project was the development of a tool for visualising Usenet newsgroups. This was to be achieved by observing patterns in key events of the newsgroups: such events included the beginning and end of conversation threads, the tone of messages, and the entrances and exits made by participants in conversations, as well as observing "the path traversed by users as they create their social fabric" [20] (the concept of fabric also being used in a message thread visualisation metaphor).
The first implementation of Loom focused on visualisations of the mood or tone of messages. The system itself has a simple structure: newsgroup messages are collected, then filtered through what Karahalios calls a "general classifier". The designated mood clusters arising from this classification are then represented visually. (For the purposes of this system, four clusters comprising "angry", "peaceful", "news" and "other" were chosen; these are consequently represented in the visual output [Figure 3].) Clicking on each individual container enables the user to view the message being visually represented. Karahalios admits the limitations of the system in its infancy (please refer to the Loom website [20]), and apart from the obvious display improvements she recommends further work on classification, in particular breaking down the categories further.
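Although Karahalios does not publish her classifier in this detail, a "general classifier" of this four-way kind could in principle be as simple as keyword matching. The sketch below is hypothetical: the keyword lists are invented for illustration and are not Loom's actual features.

```python
# Hypothetical keyword lists for Loom's four mood clusters.
MOOD_KEYWORDS = {
    "angry":    {"idiot", "stupid", "hate", "flame"},
    "peaceful": {"thanks", "agree", "welcome", "peace"},
    "news":     {"announce", "released", "update", "report"},
}

def classify_mood(message):
    """Assign a message to the first mood cluster whose keywords it
    contains, falling back to 'other' (as in Loom's four-way split)."""
    words = set(message.lower().split())
    for mood, keywords in MOOD_KEYWORDS.items():
        if words & keywords:
            return mood
    return "other"

print(classify_mood("Thanks, I agree completely"))  # 'peaceful'
```

The weakness Karahalios notes, coarse categories, is visible even in this toy: most messages match no keyword list and fall into "other".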
Subsequent implementations focused on abstractions of message threads, and how they can be represented visually. Loom 2 is an evolution of the Loom project and is overseen by Karahalios's supervisor for Loom, the prominent SMG professor Judith Donath. It was Donath's article in the Communications of the ACM [8] which provided a starting point for this project. In it she notes how the nature of Usenet makes it a vast source of information, while the current means of accessing it lack visual appeal and tend to obscure many of the cues that aid us as human beings in our social interactions. Donath maintains that more focus on these cues, coupled with improved visualisation, could help the viewer perceive the online space more intuitively and create a legible social environment within which they can interact. Ultimately this is the aim of the Loom 2 project: to use the salient features of social interaction to create a legible interactive visualisation of Usenet.
Figure 3: The original "Loom", visualising mood. It is possible to identify the four classification clusters of angry, peaceful, news and other, represented here by the red, green, yellow and blue colours respectively. This implementation displays the individual postings as coloured containers over a grid which represents a calendar.
In order to achieve this aim, the project members set themselves numerous questions to answer in the development of their system, principally based in the domain of social structure and activity. With the focus of the Loom project being visualisation to aid social interaction, the data itself acted only as a starting point. At a fundamental level, Loom's data gathering technique comprises a subset of newsgroups collected from a news server, parsed, then stored in an Oracle database. This was just a small sample of Usenet and was restricted to English-language newsgroups with a minimum number of posts in the last month.
Another visualisation project developed by Donath, with Rebecca Xiong at MIT, is PeopleGarden [48]. This system uses the metaphor of flowers in a garden to represent people and their posts within a newsgroup (to create what they term "data portraits"). The abstract representations of a user's
interaction history are illustrated in Figure 4: Two groups are represented, the left-hand side being a
group with one dominant voice, the right being a more evenly distributed, democratic group.
Figure 4: "PeopleGarden". Each participant is represented by a flower. The stems of the flowers represent length of time in the newsgroup; the petals represent initial posts (blue) and replies (pink). Older posts are represented by less intense shades. This figure represents two groups: on the left-hand side a group with a single dominant voice, on the right a more democratic group.
PeopleGarden provides a good example of a visualisation metaphor based upon the activity and participation of users (much like Loom), but again it could be argued that more focus is needed on analysis of the content of messages, the text itself rather than just the participants, in order to provide another level of the context that Donath et al. are seeking. More information about PeopleGarden can be found in Xiong and Donath's paper, and at the project's website [48].
2.1.2 CodeZebra
CodeZebra [4] is an ongoing, international research project attempting to develop an entirely new kind of chat environment in visual 3-D space. In a similar vein to Loom, CodeZebra depicts navigation and relationships between groups, individuals, topics and conversations, using a central concept of animal print metaphors. Indeed, CodeZebra owes much to the natural world for this, as users can play games with each other, create conversational histories and generate their own patterns based on metaphors which permeate the CodeZebra world. For example, users who post long, considered responses might be classified as "zebras", a typically academic animal, whereas users who tend to frequent the same threads as others, and perhaps pick on other users, might be classed as "hyenas", typically pack-hunters in the real world. Similarly there are classifications of "butterfly", "ocelot", "cheetah" and "peacock" amongst others, each with their own unique characteristics.
The need for this system originates from an inherent weakness of existing chat media: once a large number of people interact with each other they generate a lot of text, but it is all 2-dimensional, and the social and emotional context is sometimes unclear or lost altogether. The aim of CodeZebra is to create a dynamic visual depiction of the underlying associations between issues and topics, through providing interactive games and discussion structures. The 3-D environment attempts to create a visual guide to what is being said, by whom, and with what emotional tones. Part of the driving force behind CodeZebra is the bringing together of the worlds of Art and Science; in essence, the project creators say, artists construct metaphors, yet there are problems of representation within science, therefore a move away from realism and further into abstraction is necessary to aid communication between these disciplines. The metaphor of patterns is also important, as patterns show links between topics, emotional characteristics of topics, and relationships between postings.
A graphical explanation of the second CodeZebra prototype can be found below in Figure 5, and the software itself can be accessed online at the CodeZebra website [4]. The third prototype has just been released, and the system underwent a globally interactive road test during the recent Dutch Electronic Art Festival (DEAF) in Rotterdam, the Netherlands [3].
The head of the project is Sara Diamond of the Banff New Media Institute [1] in Canada, and research partners include V2 Labs (NL) [43], the SMARTLab Centre (UK) [34] and the University of Surrey (in particular Dr. Andrew Salway). CodeZebra's links with the University of Surrey, and in particular with the supervisor of this project, will facilitate the sharing of information and the spirit of research
collaboration in the evolution of this project. With CodeZebra's emphasis being on visualisation, there is ultimately an essential need for the underlying system to analyse and classify messages accurately. Fundamentally, this is where it is hoped that this project will contribute to this field of research.
Figure 5: CodeZebra Prototype 2 explained
2.1.3 Conversation Map
The Conversation Map system is the brainchild of Warren Sack and formed the basis for his PhD thesis [30] at the Massachusetts Institute of Technology (MIT) in 2000. The system attempts to map message threads alongside social and semantic networks in the same application window (see Figure 6).
In "Discourse Diagrams: Interface Design for Very Large-Scale Conversations" [31], Sack classifies the internet conversations taking place between many thousands of people at any one time as Very Large-Scale Conversations. He argues that it is the sheer size and variety of these conversations, such as those conducted within newsgroups, which render them difficult to understand or critically reflect upon, for interested observers and participants alike. The system attempts to facilitate greater understanding by creating visualisations of the social and semantic networks in action in a particular thread of messages. In the aforementioned paper, Sack discusses the design criteria necessary to transform these social-scientific representations into what he terms "interface devices". The Conversation Map system he developed is such a device, and is essentially a browsing system for Very Large-Scale Conversations (VLSCs).
A social network is drawn as a set of nodes labelled with the names of people, with edges between the nodes representing interaction between those people. In a similar manner, semantic networks involve nodes and edges, with the nodes labelled with words or concepts and the edges representing the semantic relationships between those words or concepts. Social and semantic networks are usually used in the social sciences as analytic devices, as scientific models or representations of hypotheses; what Sack proposes is that they be used as generative devices, i.e. as a means of exploring VLSCs by being "an interface into the archive of a discussion" [32]. Effectively, this means they are diagrams of discourse.
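Used generatively, such a social network can be built directly from the reply structure of a thread. The sketch below (Python, with hypothetical reply pairs; not Sack's implementation) counts interactions between participants as weighted edges:

```python
from collections import defaultdict

def build_social_network(replies):
    """Build a social network from (author, replied_to_author) pairs:
    nodes are participants, edge weights count interactions between a pair."""
    edges = defaultdict(int)
    nodes = set()
    for author, target in replies:
        nodes.update((author, target))
        edges[frozenset((author, target))] += 1  # undirected interaction count
    return nodes, dict(edges)

nodes, edges = build_social_network([("ann", "bob"), ("bob", "ann"), ("cat", "ann")])
print(sorted(nodes))  # ['ann', 'bob', 'cat']
```

A semantic network could be built the same way, with nodes labelled by terms and edges weighted by co-occurrence rather than replies.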
Figure 6: The Conversation Map system. The upper left part of the screen shows the social network present within a group, and the upper right shows the semantic network. In between is a list of themes present in the group, and below is a representation of the message threads within the group.
Conversation Map can be used in a similar way to regular newsgroup readers like Eudora, Agent and Outlook Express, but has features which the others do not provide. It performs a series of computational-linguistic and sociological analyses on the messages of the newsgroup, and presents its results in the form of a graphical interface (see Figure 6).

In discussing his work (at an HCI seminar held at Stanford in 2000-2001 [32]), Sack notes how a newcomer to a VLSC might want access to previous discussions, as well as the social layout of the group, in order to determine which participants and threads are relevant to them. This also distinguishes which topics or themes bring participants together, or divide them as the case may be. Conversation Map supports this by focusing on summarisation, navigation and browsing rather than on information retrieval.
2.1.4 Netscan
Netscan began in 1994 as a graduate project at the University of California, Los Angeles (UCLA), conducted by Marc Smith. Today it continues in the industrial research arena under the banner of Microsoft Research. Compared to Loom and Conversation Map, Netscan has a much more grandiose aim: to map the entirety of Usenet.
Smith, Fiore and Tiernan [10] are quick to note that current newsgroup browsers offer very little (if any) information about the history of authors in a newsgroup: for example their activity, which other newsgroups they post to, or which other participants they converse with. They argue that it is these features which, if they were available in browsers, would provide the guide to social context and interactional history that participants in physical spaces typically rely on. Current browsers present information about the messages themselves (subject line, posting date and number of lines), which focuses the user's attention on the structure of the medium itself, rather than on the qualities or value of the participants.
One of Netscan's aims is to develop a newsgroup interface which allows the user to sort and search by salient behavioural features, such as the fraction of each author's messages that initiated conversations as opposed to replying to them, or the number of days on which each author contributed a message to that particular newsgroup. By combining these features with the regular features which depict structure and thread development, Smith, Fiore and Tiernan believe they can extract valuable content out of the large social cyberspaces that comprise Usenet.
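Behavioural features of this kind are straightforward to compute once each message is tagged with an author, a posting date and whether it initiated a thread. The sketch below is not Netscan's code, and the tuple input format is an assumption made for illustration.

```python
from collections import defaultdict

def author_metrics(messages):
    """Compute two of the behavioural features described above for
    each author: the fraction of their messages that initiated a
    thread (rather than replied to one), and the number of distinct
    days on which they posted. `messages` is an iterable of
    (author, date_string, initiated_thread) tuples, an invented
    input format for this sketch.
    """
    stats = defaultdict(lambda: {"total": 0, "initiated": 0, "days": set()})
    for author, day, initiated in messages:
        s = stats[author]
        s["total"] += 1
        s["initiated"] += int(initiated)
        s["days"].add(day)
    return {a: {"initiation_fraction": s["initiated"] / s["total"],
                "days_active": len(s["days"])}
            for a, s in stats.items()}

# Hypothetical data: alice posted 3 messages over 2 days, 1 initiating.
msgs = [("alice", "2003-01-27", True), ("alice", "2003-01-27", False),
        ("alice", "2003-01-28", False), ("bob", "2003-01-27", True)]
print(author_metrics(msgs)["alice"]["days_active"])  # 2
```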
Netscan connects to a news server that carries nearly 50,000 Usenet newsgroups and collects all of the messages in all the newsgroups. Information is gathered from message headers and stored and maintained in a database. The analyser part of Netscan is then able to read selected portions of the database and generate reports and analyses of selected newsgroups over a specific period of time. A demonstration of this can be found on the main page of the Netscan website [35], which showcases a few of the tools and interfaces that Smith has developed in order to try and map Usenet; Figure 7 depicts a boxplot representation of the rec.music hierarchy.
Figure 7: A Netscan boxplot of "rec.music". This shows the sub-groups present within the rec.music hierarchy; the size of each sub-group represents the number of posts to that group in the last month. The colours represent the increase (green) or decrease (red) in posting within that group compared to the previous month; the intensity of the colour indicates the level of increase or decrease.
It could be argued that Netscan, like the systems mentioned already, focuses upon the more sociological and visual representations of conversations, whereas there needs to be more research into automatic text analysis incorporating the content of the messages as well as the features that Netscan already extracts. Marc Smith notes how Netscan has succeeded in its initial aims, but that there is also scope for improvement with regard to content analysis [36]; Smith suggests that content analysis of message bodies could lead to the mapping of the diffusion of topics across USENET, and indeed other communications media. He further suggests that this could assist studies into informal communication networks and the transmission of folk beliefs, as well as the development of academic disciplines.
2.1.5 Visualisation Designs: How they compare
At a glance, here is how the various visualisation systems covered in the previous sections compare by feature. It is clear that they all offer something unique, yet there is a central theme to what they all require: more investigation into automatically classifying text.
Systems compared: Loom, PeopleGarden, CodeZebra, Conversation Map, Netscan.

Features compared (each "!" marks one system offering the feature; which column each mark belonged to was lost when the original table was flattened):

- Participation analysis: ! ! !
- Semantic structures represented: ! !
- Social structures represented: ! !
- Message information extraction: ! !
- Message content analysis: ! !
- Author profiling/representation: ! ! !
- Automatic classification of text: (none)
- Requires more research into automatic text classification: ! ! ! ! !
2.2 Methods for Classifying Text
It is clear from the previous section that visualisation systems are plentiful, and vary greatly in how they visually represent conversations. Before such conversations can be visualised, it is necessary to impose some form of classification on the content of messages or conversations. Some initial work has been conducted in the area of automatic text classification, and the following section introduces a number of different methods which could potentially be used, or adapted for use, in a future automatic text classification system for conversation visualisation.
Firstly, a rule-based method of classifying text exists in the form of Ellen Spertus's Smokey, a system which automatically classifies abusive "flame" messages. Next, Teuvo Kohonen's WEBSOM will be covered; this is a method for organising text documents and preparing visual maps of them to facilitate information retrieval, based on the Self-Organising Map (SOM) data-processing method, which has its roots in neural computing. Finally, there is George Karypis's CLUTO clustering toolkit. This software package enables the user to cluster low- and high-dimensionality data, and in its recently released form, gCLUTO, provides a user-friendly method for analysing the characteristics of various clusters. From a more detailed look at these methods and systems, it should be possible to draw inspiration for a future system to classify USENET messages for a conversation visualisation system.
2.2.1 Smokey
Smokey was the brainchild of MIT graduate Ellen Spertus [38]. The system focuses upon abusive email messages, otherwise known as "flames". It aims to identify abusive, insulting or offensive messages by looking for specific words, their context, and syntactic constructs. The prototype system builds a 47-element feature vector based on the semantics and syntax of each particular sentence. It combines these vectors for the sentences within a message, so that the resulting vector effectively represents all the constituent parts (sentences) of the message.
The messages are converted to one sentence per line and delimiters are put between messages (manually). The resulting text is then run through a parser developed by the Microsoft Natural Language Processing Research Group [38], and the parser output is converted by sed and awk scripts into Lisp s-expressions. The s-expressions are then processed through a set of rules developed in Emacs Lisp, producing the aforementioned feature vector for each message. The feature vectors were then evaluated using simple rules produced by Quinlan's C4.5 decision-tree generator.
Spertus reports that with a training set of 720 messages, and using the process detailed above, the system was able to correctly categorise 64% of flames and, in a separate 460-message set, 98% of non-flames. It could be argued that this success rate could be improved, and Spertus acknowledges the limitations of Smokey with regard to elements such as grammatical errors, innuendo and sarcasm. For example, Smokey could mistake the statement "I'm glad to see that your incessant name calling and whining hasn't stopped" for praise, because it starts with "I'm glad".
The messages used in the prototype came from the webmasters of a number of controversial websites, including The Right Side of the Web (a conservative resource), NewtWatch (a site which criticises Newt Gingrich) and Fairness and Accuracy in Reporting (FAIR, a media watch group best known for questioning the claims of Rush Limbaugh). All of the messages from these sources were email messages. Spertus recognises that there are many different types of flames in real-time communication, as facilitated by the WWW, mailing lists and Usenet newsgroups. However, she also notes that publicly posted messages, such as those found in newsgroups, tend to be more clever and less personal (i.e. more indirect) than the private email flames which Smokey focuses upon; thus, Spertus claims, publicly posted flame messages are harder to reliably detect. First instincts might tell a researcher to look for obscene expressions or vulgarities within the
messages, to identify them as flames. However, Spertus found that only 12% of the flames actually
contained vulgarities and over a third of the vulgar messages were not actually flames.
The most important element of Smokey is the set of rules used to process the textual input and classify the output. These are listed in full in Spertus's paper [38]; examples include imperative statements containing "Look" (e.g. "Look forward to hearing from you"), and phrases containing a negative word near "you". The clear advantage of a method such as this is that a set of rules is relatively easily converted into a programmable algorithm. However, the scope for such rules would depend upon an in-depth analysis of USENET messages to determine whether any patterns or features exist which could potentially be represented in rule form, and coded into a future system.
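Rules of this kind translate naturally into code. The sketch below implements two heuristics loosely modelled on the rule types quoted above; the word list, the three-token window and the idiom exception are all invented for illustration and are far cruder than Spertus's published rules.

```python
import re

# Invented word list for this sketch; Spertus's rules are far richer.
NEGATIVE_WORDS = {"idiot", "stupid", "clueless", "pathetic"}

def insult_rule(sentence):
    """Flag a negative word within three tokens of 'you' or 'your'."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok in ("you", "your"):
            window = tokens[max(0, i - 3):i + 4]
            if NEGATIVE_WORDS & set(window):
                return True
    return False

def look_rule(sentence):
    """Flag sentences starting with 'Look', except the polite idiom
    'Look forward ...'."""
    tokens = sentence.split()
    return (bool(tokens)
            and tokens[0].strip(",").lower() == "look"
            and (len(tokens) < 2 or tokens[1].lower() != "forward"))

print(insult_rule("you are a clueless idiot"))        # True
print(look_rule("Look forward to hearing from you"))  # False
print(look_rule("Look, just stop posting"))           # True
```

As with Smokey itself, such rules misfire on sarcasm and innuendo; they only score surface patterns.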
2.2.2 WEBSOM
The Self-Organising Map [24] (the "SOM" in WEBSOM) was developed by Teuvo Kohonen at the University of Helsinki, Finland. It is a method which facilitates the automatic organisation of complex datasets, and provides a means of creating visual maps of those datasets to aid the retrieval of information. It has been the basis of more than 3,000 published studies worldwide, with applications ranging from image analysis and speech recognition to medical diagnosis and telecommunications.
With WEBSOM [45], documents are automatically arranged into a visual map using the SOM method. Before this can take place, the textual content of the documents is converted into numerical form, with several hundred numeric values corresponding to each document. A certain combination of these values within a document's encoding corresponds to a word; the larger this combination, the more frequently the word occurs. Words are also weighted according to how common they are. A pre-processing stage removes punctuation marks, numbers and other non-standard characters, and the most common and the rarest words are excluded.
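The pre-processing and weighting steps described above can be illustrated in miniature. The following is not Kohonen's actual encoding (WEBSOM used its own dimensionality-reduction scheme); it is a simplified bag-of-words version of the same idea, with invented thresholds: drop the rarest and most common words, then weight the survivors by rarity.

```python
import math
import re
from collections import Counter

def vectorise(documents, min_df=2, max_df_ratio=0.5):
    """Turn documents into numeric vectors, echoing the pre-processing
    described above: strip non-letters, drop the rarest and the most
    common words, and weight each surviving word by how uncommon it
    is (an IDF-style weight). Thresholds are invented for this sketch.
    """
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    df = Counter(word for tokens in tokenised for word in set(tokens))
    n = len(documents)
    vocab = sorted(w for w, count in df.items()
                   if count >= min_df and count / n <= max_df_ratio)
    weight = {w: math.log(n / df[w]) for w in vocab}
    return vocab, [[Counter(tokens)[w] * weight[w] for w in vocab]
                   for tokens in tokenised]

docs = ["the cat sat", "the cat ran", "the dog ran", "the dog sat"]
vocab, vectors = vectorise(docs)
print(vocab)  # ['cat', 'dog', 'ran', 'sat'] ('the' is too common and dropped)
```

The resulting vectors are what a SOM would then organise onto its two-dimensional map.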
A zoomed view of WEBSOM can be found in Figure 8. Here, the user has already clicked on the map image with their mouse. This zoomed view shows white points, or nodes, which allow the user to investigate the contents of individual map units (threads of messages). The arrow button in the top left-hand corner allows the user to move to a nearby cluster. The map image contains labels which are examples of the core vocabulary of the area in question; the labels give a general idea of the topics in the document collection. The colouring of areas of the map represents the density of documents in that area: light areas contain more documents.
As the WEBSOM site demonstrates, this tool is especially good with large bodies of text such as USENET, and has great potential in the classification of USENET messages, right down to individual words. With the SOM base, there is also the potential for it to be adapted into a system incorporating an adaptive, learning neural network which could dynamically adjust to the ever-changing environment of USENET.
Figure 8: WEBSOM, a zoomed view. Each white dot is a map node. Colour denotes the density or the clustering tendency of the documents: white areas are clusters, and dark areas are the empty space, "ravines", between the clusters. The large text on the right-hand side denotes which topics are being represented.
2.2.3 CLUTO
This software facilitates the clustering of low- and high-dimensionality data, with application areas including customer purchasing transactions, Geographic Information Systems (GIS), general science and information retrieval. In information retrieval, clustering is based on the idea that objects (usually documents, but in this case keywords) which are clustered together, that is, closely associated with each other, tend to be relevant to the same query or topic domain. Formally, this is known as the Clustering Hypothesis:

"Closely associated documents tend to be relevant to the same requests" [29]
The CLUTO [21] package includes three different classes of clustering algorithms, which operate either in the objects' similarity space or in the objects' feature space. These algorithms are based on the partitional, agglomerative and graph-partitioning paradigms. Fundamentally, the software allows a user to input data in the form of a matrix; it then processes this data into clusters and enables the user to analyse the discovered clusters by looking at the relations between objects within each cluster and the relations between different clusters. CLUTO is particularly good at identifying the features which best describe or distinguish each cluster. Based upon studies [13, 22] conducted into the effectiveness of the algorithms, it is generally held that partitional clustering offers the best results.
CLUTO is freely available to download [21] for Linux, Sun and Windows platforms (the current version is 2.1). The package features two programs: vcluster (which takes as its input the multidimensional representation of the objects that need to be clustered) and scluster (which takes the similarity matrix/graph between these objects as its input). Essentially both programs perform the same functions, and are invoked via the command line using the following call:

vcluster [optional parameters] MatrixFile NClusters

where MatrixFile is the name of the file that stores the n objects to be clustered. With vcluster, each of those objects is considered to be a vector within m-dimensional space, and the collection of these objects is represented as an n x m matrix. NClusters is the number of clusters required (the default is usually 10). The optional parameters which can be specified are given in detail and
explained in the CLUTO manual; these include directing output to named files, setting different clustering methods, specifying the number of clusters, and so on.
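As a small illustration, a helper can prepare input for vcluster. The dense matrix format sketched here (a header line giving the row and column counts, then one whitespace-separated row per line) is our reading of the CLUTO manual and should be verified against the version in use; the file name and values below are invented.

```python
def write_dense_matrix(path, rows):
    """Write an n-object, m-feature matrix in what we read the CLUTO
    manual to describe as its dense format: a header line giving the
    number of rows and columns, then one whitespace-separated row per
    line. Verify against the manual for the CLUTO version in use.
    """
    with open(path, "w") as f:
        f.write(f"{len(rows)} {len(rows[0])}\n")
        for row in rows:
            f.write(" ".join(str(value) for value in row) + "\n")

# Three hypothetical objects in a 4-dimensional feature space.
write_dense_matrix("usenet.mat", [[1, 0, 2, 0], [0, 3, 0, 1], [1, 0, 1, 0]])
# The file could then be clustered into two groups with the call shown
# above: vcluster usenet.mat 2 (assuming vcluster is on the PATH).
```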
Not only is CLUTO a powerful and free piece of software, it can also be used within bespoke applications thanks to its availability as a set of C libraries; CLUTO can effectively become an integral part of a larger information retrieval or text classification system.
As of February 2003, version 0.5 (alpha release) of gCLUTO [21] became available. This is a GUI-based version of CLUTO, intended to make the clustering toolkit more user-friendly (see the screenshot in Figure 9). Essentially, this eliminates the command-line operation of CLUTO and turns it into a point-and-click interface: the command-line input depicted earlier is simply replaced by a standard Windows-style GUI "Browse" button for input files.
Even more recently, a web-based version was released: wCLUTO [21]. This is a web-enabled data clustering application designed for the clustering and analysis of gene-expression datasets, and is based on the CLUTO toolkit.
Figure 9: gCLUTO, a graphical clustering toolkit based on CLUTO. Here we see the clustering solution at the top of the screen, with the Matrix and Mountain visualisations of these clusters at the bottom left and right respectively.
Given all of these different implementations, and the fact that CLUTO's libraries can be accessed from within a bespoke C/C++ program, this toolkit is highly versatile and could easily be incorporated into a future implementation of a system to automatically classify text for a conversation visualisation system. Not only is it a very powerful statistical analysis tool, it also has massive potential as part of a future system: real-time clustering of text would be of immense use in identifying text with similar or dissimilar characteristics, and therefore in classifying that text.
2.3 Discussion
In this chapter the state-of-the-art in visualisation systems has been characterised. What is apparent from current systems is that the focus so far has generally been biased towards the actual visualisation requirements of the system, rather than the underlying processing and pre-processing needed to classify text for use in such a visualisation. Donath and her contemporaries at MIT have tended to focus more on the social context of the interactions involved in online communication, as has Warren Sack to a certain extent. Others, such as Smith et al., have focused more on the analytical aspects of such a system, yet there seems to have been little (if any) research into the bare-bones computational analysis underpinning a visualisation system. To a certain extent, the CodeZebra team are now pioneering this work, but it is hoped that this report will also be of value in this field.
When considering this topic from a computational-analysis viewpoint, it becomes clear that to successfully analyse the text involved (in order to classify it), one must have an idea of some techniques for analysing text. Therefore, rather than attempting to reinvent the wheel, it is only practical to assess the current state-of-the-art in text classification as well. Section 2.2 described three different techniques and systems for text analysis: the rule-based Smokey, the neural-network-inspired WEBSOM, and the CLUTO clustering toolkit. The example set by Smokey is an interesting one, and is a good foundation upon which to base a simple first system. WEBSOM, and the possible use of neural or Bayesian networks in a future system, would be an interesting and rewarding avenue of research, but is beyond the scope and timescale of this particular project. The CLUTO toolkit has great potential, and in the case of this project gCLUTO would provide an excellent means of analysing USENET text; however, implementing a fully functional system using the CLUTO libraries is probably also beyond the limitations of this project.
In conclusion, it would be worthwhile attempting to develop a simple rule-based classification system as a basis for this project, with the possibility that elements of CLUTO could be incorporated in some way. Future implementations could use CLUTO's libraries to provide on-the-fly clustering to aid classification, and similarly some form of Self-Organising Map might be used in a learning system. Before any system can be attempted, it is essential to conduct some in-depth analysis of USENET messages to see if any patterns or heuristics exist, and whether any rules can be derived from them. This analysis is described in the following chapter.
3. FEATURES FOR CLASSIFYING USENET CONVERSATIONS
"You and me are just different points of view"
CHANG-TZU
3.1 Introduction
It has been established that current visualisation systems are sophisticated, yet little research has been conducted into the underlying systems which classify text for those visualisations. After characterising the fields of visualisation systems and text classification methods, it is important to realise that any potential system development begins at an atomic level: with the text itself.
In Chapter 1, the immense size and scope of USENET was described, along with its historical context. With such a large source of data, it is necessary to form some idea of the structure and specific features which exist within newsgroups. To facilitate this understanding, some initial observations of USENET are documented in Section 3.2; these aim to determine what key features may exist, and which might be worthy of further in-depth investigation.

Following this period of observation, a corpus of USENET messages was collected at random and analysed using the processes documented in Section 3.3. The final sections of this chapter deal with the statistical analysis of the corpus, ending with the formulation of some example rules for a potential system implementation.
3.2 Initial Observations of USENET Newsgroups
In order to get a general feel for USENET messages and their constructs, a period of initial observation of USENET messages was undertaken. This consisted of examining various newsgroups, and the messages contained within them, in an attempt to identify key (or simply interesting) features of these messages. For the purposes of these observations, Google Groups provided an effective way of navigating and reading groups and messages. Google Groups archives USENET in its entirety back to 1981, and makes groups and articles (messages) available in HTML format, browsable and searchable via its website [11]. Deja News originally created this user-friendly interface for USENET discussions in 1995; it was acquired by Google Inc. in 2001.
Initially, the most striking aspect of USENET is the variety of groups which exist, each one facilitating an even more diverse range of topics (Figure 10 shows the Google Groups listing for the most popular top-level domains; there are hundreds more top-level and sub-level groups). This variety is also reflected in the composition and structure of the messages themselves. Posts can range from simple one-line questions with short replies, basic statements of opinion or thread-based opinion polls, to lengthy, thought-out responses which emphasise the knowledge and writing style of the author. Message replies frequently quote the original or previous message in their body: some people prefer to reply after the quote, some before, and some take the message one section at a time by interspersing their replies with quoted text.
Figure 10: The Google Groups main page at http://groups.google.com/. This site was used for initial observations of USENET newsgroups and the messages within them. Google Groups archives all of USENET back to 1981: more than 700 million messages.
In the more free-spirited discussion groups, personal opinions are highly valued by authors, with personal insults being the reward if an author's views or opinions are disagreeable to another author. Such insults, or "flames", are characterised by the use of profanity and capitalised words (which indicate shouting in the context of online conversation). Another feature which is
specific to communication over the internet is the popularity of acronyms and abbreviations for popular or frequently used words and phrases, sometimes known under the umbrella term "cyberspeak". Examples include Laughing-Out-Loud (LOL), Rolling-On-The-Floor-Laughing-My-Ass-Off (ROTFLMAO) and By-The-Way (BTW). These features are generally net-specific, so they would be an interesting feature to investigate further with respect to this project, to see how widespread their usage is and whether they can be used to classify text.
Also notable are the representations of emotion in the form of emotional icons, or "emoticons". These are typically constructed from two or more punctuation symbols; for example, a colon, a dash and a closing bracket, :-), represents a smiling face. These features are also specific to online communications, and as they provide some representation of facial expressions or facial actions (in this non-verbal medium), they could also be indicators of emotion or mood.
In some cases, there are also textual representations of expressions, such as grinning (*g*, for example), which might also be worth investigating. The use of asterisks (*) around words to provide emphasis is also a feature, for example:

"The Beatles were ok, but I *really* like the Rolling Stones' music"

Sometimes this is used interchangeably with all-capital letters, which in the context of online conversations represent shouting or heavy emphasis. Coupled with this, it might be worthwhile investigating the use of exclamation marks and other punctuation, to see if these can aid the classification of text.
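Surface features like these are amenable to simple pattern matching. The sketch below shows how they might be counted; the patterns are deliberately minimal and invented for illustration (a real wordlist of cyberspeak terms and emoticons would be far longer).

```python
import re

# Deliberately minimal, invented patterns for the surface features
# discussed above; a real cyberspeak/emoticon list would be far longer.
PATTERNS = {
    "emoticon":    re.compile(r"[:;]-?[)(DP]"),
    "cyberspeak":  re.compile(r"\b(?:lol|rotflmao|btw|imho)\b", re.IGNORECASE),
    "emphasis":    re.compile(r"\*\w+\*"),
    "shouting":    re.compile(r"\b[A-Z]{3,}\b"),
    "exclamation": re.compile(r"!+"),
}

def surface_features(text):
    """Count each net-specific surface feature in a message body."""
    return {name: len(pattern.findall(text))
            for name, pattern in PATTERNS.items()}

msg = "BTW I *really* mean it!!! STOP shouting :-)"
print(surface_features(msg)["shouting"])  # 2 (BTW and STOP)
```

Counts like these could feed directly into a feature vector of the kind Smokey builds per sentence.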
With regard to the overall content of USENET, the newsgroups concerning more academic or technical subjects tend to be populated by a more civilised readership, yet sometimes certain subjects are inaccessible to visiting outsiders due to their limited knowledge of the domain. Responses and initial posts in groups like these are usually well structured and fairly detailed, as indicated by sentence and paragraph length and structure.

In close-knit groups, such as highly technical academic groups, there is an element of prior knowledge which facilitates the telling of jokes based around that knowledge. Groups such as these, where posters are generally known to each other or articulate their views soundly, are on the whole more civil than other groups (where, for example, a new poster might get abusive replies for asking a valid question). See also Section 2.2.1 regarding Smokey [38], a classification system developed to aid the identification of abusive flame messages.
Each message posted to USENET also has a unique structure identifying it, in the form of header information which describes when the message was posted and the server it was posted to, and which uniquely identifies the user who posted it. Another feature of messages, this time dependent on the individual user, is the signature which users sometimes attach to their messages. These are usually text-based and can provide information about the poster (i.e. email, website or telephone details), but can also include elements of ASCII art (pictures made up from standard ASCII characters). Another popular trait is to include a favourite quote in the signature.
When users reply to messages, the previous message is often quoted to provide context to the reply (the standard message quotation starting with the > symbol, with multiple levels of quoting resulting in many of these). Such replies might start with or include questions, and some contain URLs referring the reader to extra information on websites. However, some threads, whilst lengthy (sometimes hundreds of messages long), can tend to be just idle chit-chat between two or more posters.
There are also more specific linguistic features worth investigating: the use of imperative statements ("Learn x before you talk about it" or "Get with the program!"), which also links to the use of personal pronouns. Directly addressing a fellow user as "you" could be an indicator of response tone (though this may depend upon some understanding of the context of the message or conversation).
From the features detailed above, there appear to be a number of potential phenomena to visualise, some of which have been attempted by the systems mentioned in Section 2. These include conversational tone, mood, emotion, themes across groups, contributor information (dominance, reliability, links to others, activity, etc.) and issues of formality or informality. The scope of these phenomena, and of the features to identify them, is potentially huge; therefore a select few of these features will be examined when analysing the corpus of USENET messages.
3.3 Statistical Corpus Analysis
The aim of this stage of analysis is to determine whether the features identified so far offer an insight into, or indication of, how text can be classified, through the derivation of a set of rules based upon any patterns which exist within the corpus. Such an insight might be supported, for example, by a significant distribution of a cue word across the sample corpus. The process used is documented in the following sections, but first it is necessary to describe the origin of the sample corpus used in this project.
The University of Surrey news feed (news.surrey.ac.uk) was used to collect a corpus of messages with which to conduct this research. The Tin [42] newsreader under UNIX has functionality which allows the archiving (saving) of all messages within subscribed newsgroups to a specified location, and so was utilised for this project. For the purposes of this analysis, newsgroups were picked at random from a variety of different topics. The list of archived newsgroups included alt.books, alt.music, alt.music.lyrics, alt.music.smash-pumpkins, alt.politics.democrats, alt.politics.usa, alt.politics.usa.republicans, sci.environment and uk.media. Tin stored each newsgroup in a folder named after the group, and each article (message) was uniquely numbered and stored in that folder.
The corpus amounted to a total of 6,005 messages, containing 4,303,358 words. It should be noted that message header information (information used by mail servers and clients to identify and route messages) was also included in this count; however, even if header information accounted for a third of this number, the total massively exceeds the anticipated minimum of 250,000 words stated in this project's Inception Document.
Using System Quirk [39], a language engineering workbench/tool developed by the University of
Surrey, it was possible to determine the frequency of certain words within the corpus (using the
Kontext tool). These keywords were specified in a separate file known as a Word List. Kontext
allows various functions to be performed on a corpus of text, from Indexing (counting the frequency
of words) to the more complicated Keyword In Context (KWIC) function, which allows a user-specified
number of words either side of the keyword to be displayed (thus displaying some of the
context in which the word is being used). For the purposes of this project, Kontext's frequency
counting functionality will be employed.
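Kontext's Indexing and KWIC functions can be illustrated with a short sketch. This is a simplified stand-in, not System Quirk's actual implementation; the function names and the sample sentence are invented for illustration:

```python
import re
from collections import Counter

def index_corpus(text, wordlist):
    """Count occurrences of each wordlist keyword in the text (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {w: counts[w.lower()] for w in wordlist}

def kwic(text, keyword, window=3):
    """Return each occurrence of keyword with `window` words of context either side."""
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append(" ".join(left + ["[" + tok + "]"] + right))
    return hits

sample = "I agree with you, but they do not agree at all."
print(index_corpus(sample, ["agree", "disagree"]))  # {'agree': 2, 'disagree': 0}
print(kwic(sample, "agree", window=2))
```

Indexing is just a token count against a fixed wordlist; KWIC additionally preserves a few words of surrounding context so the analyst can judge how the keyword is being used.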
As a precursor to this analysis, Monday 27th and Tuesday 28th January 2003 saw a collaborative
workshop take place between myself, Sotiris Rompas (UniS MSc student), Dr. Andrew Salway (UniS)
and Erik Kemperman, an AI programmer from V2 Labs [43] in the Netherlands and a key member of
the CodeZebra project team (see Section 2.1.2). The idea behind this workshop was to share ideas
and get a feel for possible classification cues existing in the text of USENET messages. For this
purpose, the corpus created for this project was used. Full details of the work undertaken were
documented [23], and included some basic work with cyberspeak and personal pronoun wordlists, and
also a look at the theme of agreement and disagreement using a wordlist comprised of synonyms of
"agree" and "disagree" obtained from Princeton University's WordNet thesaurus [45].
The following subsections continue and expand upon the initial groundwork established in the
workshop by examining the features of cyberspeak, personal pronouns and synonyms of
agreement and disagreement in greater detail through a comprehensive analysis of the entire corpus
(using System Quirk) and performing further clustering analysis using CLUTO. The analysis process
is detailed in the next section, with the following sections dealing with the individual phenomena to
be analysed.
3.3.1 The Analysis Process
In order to glean meaningful results from the data produced by System Quirk, the results (i.e. the
frequencies produced by indexing) were transferred into a Microsoft Excel spreadsheet. With figures
for the number of words within all messages in each group, it was also possible to calculate relative
frequencies of the keywords. All of this data is tabulated at the end of this report in Appendix 1, and
should be referred to as a means of complementing the discussion in the following sections.
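The relative-frequency step described above amounts to dividing each keyword's raw count by the group's word total. A minimal sketch follows; the group names and figures are illustrative, not results taken from the corpus:

```python
def relative_frequencies(raw_counts, group_sizes):
    """raw_counts: {group: {keyword: count}}; group_sizes: {group: total words}.
    Returns the relative frequency (count / total words) per keyword per group."""
    return {
        group: {kw: count / group_sizes[group] for kw, count in counts.items()}
        for group, counts in raw_counts.items()
    }

# Invented example figures, purely for illustration
raw = {"alt.politics.democrats": {"lol": 34}, "alt.books": {"lol": 2}}
sizes = {"alt.politics.democrats": 150_000, "alt.books": 100_000}
rel = relative_frequencies(raw, sizes)
print(f"{rel['alt.politics.democrats']['lol']:.2e}")  # 2.27e-04
```

Normalising by group size in this way is what makes frequencies comparable across newsgroups of very different sizes.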
As described earlier (in Section 2.2.3), the CLUTO clustering toolkit offers many functions and is
extremely versatile; therefore the decision was made to utilise some of the basic functionality of this
package in order to analyse the data obtained in the first stage of analysis. In theory, clustering these
results should enable us to discern whether or not the keywords contained in the System Quirk
wordlists are being used in a similar way (i.e. have associated characteristics such as similar relative
frequencies) across all or a portion of the newsgroups in question. For the purposes of this project it
was decided to use the newly released gCLUTO, the more user-friendly package developed by the
inventors of CLUTO [21]. Whilst gCLUTO is a separate application, it is still based heavily upon the
CLUTO toolkit, and as such it proved worthwhile experimenting with the original command-line
CLUTO and referencing the CLUTO manual when interpreting gCLUTO's output.
The input file for CLUTO and gCLUTO is in the form of a matrix (a .mat file). The internal
format of the file itself is straightforward, with the first line containing two numbers: the first being
the number of rows, the second the number of columns in the matrix. Below this line the
values for the cells of the matrix can be found (for an example see Appendix 2). Two other files are
needed (.rlabel and .clabel files) for labelling the rows and columns respectively, and these
contain the appropriate labels, one per line. In CLUTO, each of these files is specified using the
command described earlier, whereas in gCLUTO the user can select these files using a point-and-click
interface.
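The three-file input format just described can be generated mechanically from a frequency table. The sketch below writes a dense .mat file with matching label files; the file stem, matrix values and labels are invented for illustration:

```python
def write_cluto_input(matrix, row_labels, col_labels, stem="corpus"):
    """Write a dense CLUTO .mat file plus matching .rlabel and .clabel files.
    The first line of the .mat file holds: <number of rows> <number of columns>."""
    with open(stem + ".mat", "w") as f:
        f.write(f"{len(matrix)} {len(matrix[0])}\n")
        for row in matrix:
            f.write(" ".join(str(v) for v in row) + "\n")
    with open(stem + ".rlabel", "w") as f:   # one row label per line
        f.write("\n".join(row_labels) + "\n")
    with open(stem + ".clabel", "w") as f:   # one column label per line
        f.write("\n".join(col_labels) + "\n")

# Illustrative 2x2 matrix: keywords as rows, newsgroups as columns
write_cluto_input([[0.1, 0.0], [0.2, 0.3]],
                  ["lol", "btw"], ["alt.music", "uk.media"])
```

Because the label files carry one label per line in matrix order, the spreadsheet columns and rows can be exported directly without any further bookkeeping.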
The relative frequency data contained in the Excel spreadsheet was converted into the matrix input
format used by gCLUTO (and CLUTO), and fed into the program using the default setup for the
Repeated Bisection clustering method. This method involves the input matrix being split into two
clustered groups; one of those groups is then selected and further bisected, and this continues until the
specified number of clusters is produced (by default 10). CLUTO uses the Repeated Bisection
clustering method by default (others can be specified; please refer to CLUTO's manual for more
details [21]). It was felt that leaving everything at default would give a satisfactory analysis, as the purpose
of using CLUTO in this case is only to interpret/visualise the data rather than to perform in-depth analysis using
different clustering techniques. Of course, given the versatility and options of the
CLUTO toolkit and its derivatives, such in-depth analysis would not be difficult to achieve if so
desired.
This in theory provides a means of determining whether similarities or patterns exist within and
between the groups; if so, clustering these features should depict this. If similarities or
differences do exist, then it should facilitate the derivation of some rules upon which a suitable
automatic text classification system could be built.
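As a rough illustration of the repeated bisection idea (not CLUTO's actual implementation, which optimises criterion functions), the following sketch repeatedly bisects the largest cluster with a basic 2-means split until the requested number of clusters is reached; the data points are invented:

```python
import math
import random

def two_means(rows, seed=0):
    """Split rows into two groups using a basic 2-means (Lloyd's algorithm)."""
    rng = random.Random(seed)
    centres = rng.sample(rows, 2)
    for _ in range(20):
        groups = ([], [])
        for r in rows:
            d = [math.dist(r, c) for c in centres]
            groups[d.index(min(d))].append(r)
        if not groups[0] or not groups[1]:
            return groups
        # Recompute each centre as the mean of its assigned rows
        centres = [[sum(col) / len(g) for col in zip(*g)] for g in groups]
    return groups

def repeated_bisection(rows, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [rows]
    while len(clusters) < k:
        biggest = max(clusters, key=len)
        if len(biggest) < 2:
            break
        clusters.remove(biggest)
        clusters.extend(two_means(biggest))
    return clusters

data = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0], [9.1, 0.1]]
print(sorted(len(c) for c in repeated_bisection(data, 3)))  # [2, 2, 2]
```

The choice of which cluster to bisect next (here simply the largest) and the split criterion are exactly the knobs that CLUTO exposes through its command-line options.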
gCLUTO in particular produces some interesting visualisations of the data. Alongside the clustering
solution, the user can create a matrix visualisation and/or a mountain visualisation. The mountain
visualisation represents the relative similarity of clusters as well as their size, internal similarity and
internal deviation. Each cluster is represented by a peak in the terrain, with the shape of the peak
being a Gaussian curve giving a rough estimation of the distribution of data within that cluster. Its height is
proportional to the cluster's internal similarity, and its volume is proportional to the number of elements
within the cluster. The colour of the peaks is also important, with red indicating low deviation
blending into blue, which represents high deviation. Using the Show Features function, the three
most descriptive features of the cluster are displayed above the cluster itself, and detailed information
is available about each cluster by clicking on its number (for reasons of visual clarity, this feature was
turned off for the included screenshots). The matrix visualisation uses colours to graphically
represent the data in the matrix, with white representing values near zero, green representing
negative values and red representing positive values (different shades of red and green being
lighter or darker depending on how close to zero they are). With tree-building enabled, a column tree
and a row tree are generated; these trees effectively depict the relationships between the columns
and rows, showing the relationships between the discovered clusters. Examples of both of these
visualisations will be produced and explained in the following sections, and an example of each can
be seen in the gCLUTO screenshot in Figure 9. Clustered features are divided by horizontal black
lines across the matrix visualisation area.
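For intuition, a cluster's internal similarity can be approximated as the mean pairwise cosine similarity of its rows. This is only an illustrative proxy, not the measure gCLUTO actually reports, and the example vectors are invented:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def internal_similarity(cluster):
    """Mean pairwise cosine similarity of the rows in one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0  # a single-element cluster is trivially self-similar
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

tight = [[1.0, 0.0], [0.9, 0.1]]   # nearly parallel rows: high similarity
loose = [[1.0, 0.0], [0.0, 1.0]]   # orthogonal rows: zero similarity
print(internal_similarity(tight) > internal_similarity(loose))  # True
```

Under this reading, a tall red peak corresponds to a cluster like `tight` (high mean similarity, low spread), while the low red region discussed later corresponds to rows that are dissimilar yet close in magnitude.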
The following sub-sections contain details of all of the stages of analysis mentioned above. The
results tables and graphs being discussed are available in Appendix 1 and should be referred to whilst
reading these sections.
3.3.2 Cyberspeak: Features for classification?
Cyberspace is a virtual world which relies largely upon written rather than spoken communication. The
receiver (in this case the reader) is unable to see the speaker, hence non-verbal message cues are lost
and the meaning of some messages is distorted or lost altogether. With the reader unable to see a
speaker grin, chuckle, wink or smile, a means of communication unique to cyberspace evolved as a
means of enhancing meaning. This came to be known as "Cyberspeak" (sometimes called
"Netspeak").
This consists of a number of dynamic elements, which include emotional ASCII icons, otherwise
known as "emoticons" (see Figure 11), which can be used to represent facial expressions or actions.
These emoticons are one aspect of cyberspeak; another is the use of cyberspeak acronyms: short
words made up of the first letters of a lengthy phrase. For example, "by the way" is abbreviated to
BTW and "also known as" to AKA; in the case of non-verbal cues, "laughing out loud" can be
abbreviated to LOL. Comprehensive lists of these acronyms exist in various online dictionaries of
Netspeak [16, 17], and there are a number of good sites for the uninitiated [15, 28, 46].
Figure 11: Example of "Emoticons" representing smiling, winking and sticking out the tongue respectively.
From the initial observations in Section 3.2, it became clear that these acronyms could
possibly provide a feature for classifying phenomena such as mood or emotion. For
example, excessive use of LOL in a particular group or by a particular person could be
a good indicator of emotion. It is interesting to note the prominence of some of these
features in everyday life; for example, the LOL acronym can now be found in the
Oxford English Dictionary [2], perhaps a sign of just how much the Internet and
associated technologies have infiltrated modern-day society.
For the purposes of this analysis, a selection of acronyms was used as the wordlist for
System Quirk. These are given in Table 1, and the full results can be found in
Appendix 1b. From the results of the wordlist analysis it is plain to see that the
acronyms specified do exist, although not in large numbers. There appears to be no
clear pattern across all groups, with only LOL, BTW and AKA registering frequencies above ten
(note the smallest group contained more than 22,000 words). In order to get a more accurate view of
Table 1: Cyberspeak wordlist for System Quirk analysis: aka, bbl, brb, btw, lmao, lol, rotf, rotflmao, stfu, ttfn, ttyl, wtf.
the distribution of the words, the relative frequencies for each group were calculated (i.e. the frequency
of the feature divided by the number of words within the group; see Appendix 1b for these results).
It is noticeable that, with relative frequencies, one can determine a stark difference in the usage of
LOL across groups. The alt.politics.democrats newsgroup has the highest relative usage of any
group (0.227×10⁻³), even though this group is almost a third of the size of the largest group in the
corpus (which, somewhat contentiously, is alt.politics.usa.republican!).
This is perhaps a reflection of the characteristics of the newsgroups selected: it might be the
case that the participants of the alt.politics.democrats group are more light-hearted and jovial (or
"liberal", as a cynical Republican might suggest), or it might simply be a case of the threads captured at
that time being particularly humorous in nature. LOL can generally be assumed to be a positive
response if we discount the use of sarcasm, although making such an assumption would somewhat
defeat the point of this project.
Examining acronyms could aid the visualisation of phenomena such as formality/informality,
emotion and mood. For example, LOL would likely have positive or happy connotations;
similarly, the complete lack of cyberspeak phenomena might indicate a more formal message or
subdued group. The difficulty of examining cyberspeak from a language-use perspective is that it is a
feature of online communication spaces, and is a relatively new phenomenon which has arguably
been researched less than standard language features such as specific words or phrases. One
could argue that cyberspeak could warrant an entire research project in itself, but the fact that it is
cyber-specific and not used widely means it is difficult to derive generic rules for classification
without knowing more about the context of the rest of the message, and the rest of the entire
communication space.
Similarly, research into emoticons would be interesting; however, in this case System Quirk could not
recognise the combinations of punctuation marks and symbols which constitute emoticons, and therefore
this could not be pursued further. In any case, it might turn out that they are not a widely
used feature across USENET. There is a chance that emoticons may have been more prominent in
the past, when USENET was the domain of the highly technologically literate, but with the
proliferation of internet access in the early 21st century there has been a socio-demographic shift in
the user base of USENET: non-technical, everyday people now have access to, and utilise, all
aspects of internet communication, including USENET. Further research into this area would help
to support or refute this theory, but is outside the scope of this particular project.
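Although System Quirk's tokeniser could not handle them, emoticons are straightforward to count with a regular expression. The pattern below is only a minimal illustration covering a few common Western forms, not a comprehensive emoticon grammar:

```python
import re

# Matches a few common Western emoticons: :-) :) ;-) ;) :-( :( :-P :P :-D :D
EMOTICON = re.compile(r"[:;][-]?[)(PpDd]")

def count_emoticons(text):
    """Return the number of emoticon-like tokens found in the text."""
    return len(EMOTICON.findall(text))

msg = "Great post :-) but I disagree ;) sorry :("
print(count_emoticons(msg))  # 3
```

A counter of this kind could be run per newsgroup in exactly the same way as the acronym wordlist, yielding relative frequencies for emoticon use.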
The results produced from gCLUTO processing are useful for visualising the data and identifying the
clusters that exist. Despite the fact that CLUTO is generally more useful for larger datasets than the
results produced from the cyberspeak analysis, clusters did exist which show interesting relationships
between the newsgroups. The matrix visualisation produced can be seen in Figure 12, and it is clear to
see similarities between groups from the hierarchical column tree created at the top. On the left-hand
branch, the first leaf nodes clustered depict the relationship between alt.music.lyrics and alt.music, then
between those two groups and uk.media. On the right-hand branch, alt.politics.usa and
alt.politics.usa.republican are related, alt.politics and alt.music.smash-pumpkins are related, and sci.environment and
alt.books are related. These pairings can then be connected as you read up the tree. What this suggests
is that within the clusters discovered, the relative frequencies share similarities, which could be an
indicator that the words are being used in a similar way across the related newsgroups. If one glances
across the matrix visualisation it is clear to see that BTW is the most widespread word from the
list, occurring in all but two of the newsgroups. Interestingly, the darkest shades of red (i.e. the
largest values) all occur for keywords within the politics/debate groups (alt.politics.usa,
alt.politics.usa.republican, alt.politics.democrats and sci.environment). Alongside this, it is worth noting that
most of the cyberspeak words that did exist in the sample occurred in these same groups, perhaps
suggesting that their use within these groups is comparable.
The mountain visualisation depicted in Figure 13 clearly shows the discovered clusters (represented
by the peaks), but also displays an area marked in red which is not at the top of a peak. This indicates
a low deviation across the values in the cluster, but also a low level of internal similarity (i.e. the features
within that cluster aren't very similar, yet they do not differ greatly in value; this is to be expected
with very few values).
Figure 12: A gCLUTO Matrix Visualisation of clustered features for "Cyberspeak" words within the indicated nine newsgroups. Here it is clear to see similarities between groups from the hierarchical tree created at the top. On the left-hand branch, the first leaf nodes clustered depict similarities between alt.music.lyrics & alt.music, then between those two groups and uk.media. On the right-hand branch alt.politics.usa & alt.politics.usa.republican are related, alt.politics & alt.music.smash-pumpkins are related, and sci.environment & alt.books are related. These pairings can then be connected as you read up the tree.
Figure 13: A gCLUTO Mountain Visualisation of clustered features for "Cyberspeak". Each peak in the terrain represents a discovered cluster. Height is proportional to the cluster's internal similarity, and volume is proportional to the number of elements within the cluster, so it is possible to see here that there are several distinct clusters. Colour is also important, with red indicating low deviation blending into blue, which represents high deviation.
3.3.3 Personal Pronouns: Features for classification?
personal pronoun (n.)
A pronoun designating the person speaking (I, me, we, us), the person spoken to (you), or the
person or thing spoken about (he, she, it, they, him, her, them). [6]
The nature of personal pronouns means that they are more common in spoken-word conversations
and, in this case, in the more conversational style associated with internet-based communication. With
the above definition in mind, it is understandable why the usage of these pronouns could possibly be an
indicator of any number of phenomena within a conversation: tone, mood, emotion, inclusive
behaviour, and exclusivity (perhaps related to a sense of community or belonging which may or may
not exist within a newsgroup). It is these pronouns, constructed into phrase rules (by associating
other words and constructing key phrases), which form the basis of Ellen Spertus' Smokey [38]
project. For more details about Smokey, please see Section 2.2.1.
Substantial areas for future research exist in the positioning of personal pronouns
within a sentence; i.e. the words surrounding the pronoun and the sentence structure
can provide context to the pronoun's usage. However, in this case, the project is
approached from a computer science perspective and as such deals primarily with the
scope for automatically classifying text (and to do this at a computational level, it is
essential to start with the basics and look at individual words). With this in mind,
the frequencies and relative frequencies of the keywords in a personal pronoun
wordlist (see Table 2) were extracted in the same manner as detailed in Section 3.3.1.
These results can be found in Appendix 1c.
Table 2: Pronouns wordlist for System Quirk analysis: you, I, he, she, it, they, us, we, them, her, him, me.
Significant similarities exist in the usage of the pronoun "I" in six out of the nine
newsgroups, the exceptions being the music-related groups (alt.music, alt.music.lyrics and
alt.music.smash-pumpkins, which have higher relative frequencies) and alt.books, which has a lower relative
frequency. This suggests that "I" is used in a similar way across the political/debate newsgroups
within the sample. A similar trend is evident with the use of "we", "us", "them" and "me" in the
political/debate groups, with all of these keywords having similar relative frequencies for each of the
four groups (namely alt.politics.usa, alt.politics.democrats, alt.politics.usa.