Informal email in the workplace

EMAIL FORMALITY IN THE WORKPLACE

Kelly Peterson

Outline

Overview of the topic

Paper #1 - "The Dynamics of Electronic Mail as a Communication Medium"

Paper #2 - "tRuEcasIng"

Paper #3 - "Periods, capitalized words, etc."

Proposed system

Q & A


E-mail Formality in the Workplace

E-mail etiquette is an important issue for communication in an organization

Many etiquette guides warn that a poor choice of e-mail formality can create a negative perception

“People will notice you but for the wrong reasons”

At the same time, many people use an informal tone to demonstrate a level of trust, understanding or friendship

In addition, some guides claim that an informal tone may encourage a response


How do we define Formality?

There are many etiquette guides

Do certain rules have a larger negative “penalty”?

Do certain rules have a greater potential for gaining/establishing a relationship?

Are rules of formality global across all people?


Formality at Enron

Let’s look at a few examples of formality in the Enron database

First, we’ll look at a formal example


Formal e-mail example

Hello everyone,

My husband Ryan is going to be participating in the American Heart Walk on November 3. He is walking on behalf of our daughter, Sydney, who was born in May of this year with Congenital Heart Disease.

[...]

Please help this worthwhile cause if you can.

Thanks,

Vicki Versen


“Less” formal example

i suck-hope youve made more money in natgas last 3

weeks than i have...

mkt shudbe getting bearish feb forward-cuz we

already have the weather upon us-fuelswitching

and the rest shud invert the whole curve not just dec

cash to jan andfeb forward????

have a good weekend john


Ranges of formality in the data

I started becoming obsessed looking at these

So far, I’ve only scratched the surface

It is clear that there is a wide range of formality in the organization

It is also clear that there is a wide range of formality across e-mails from the same sender

There are examples very similar to the previous two which are from the same sender


Explanation

How can we account for these differences?

Can we find an explanation for this behavior?

Does this behavior tell us another side of the story that will enhance or change the findings of other research?

We’ll talk about these more after the papers


Paper #1 – Habil & Rafik-Galea

“The Dynamics of Electronic Mail as a Communication Medium”

An overview of how e-mail is used in the workplace

Discussion of formality and the differences between written and spoken communication

Not an extremely scientific paper

No CompLing techniques

However, I feel it serves a need to begin the discussion


Paper #1 (cont.)

Many of the statements in the paper do not seem to be well founded, but they help to frame differences in formality

A common use for e-mail is short notes and responses

“Senders of e-mail typically behave as if the medium is like speech”

The above statement likely cannot be globally applied, but there are certainly examples of this


Paper #1 – Features of Formality

Proper capitalization

Proper punctuation

Absence of ‘…’, exclamation marks, etc.

Absence of contractions

Absence of 1st and 2nd person pronouns

Absence of slang

Absence of informal tone

“you know”, “so”, “I mean”, “sort of”

More examples on the next Slide…


Paper #1 – Features (cont)

Absence of abbreviations

Complete sentences

Standard spellings

As opposed to “thru”, “cuz”, “thanx”, “thx”
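A few of the surface features listed above can be counted with simple pattern matching. The following is only a minimal sketch, not something from the paper: the word lists, the regular expressions, and the function name `informality_counts` are all assumptions made for illustration, and a real system would need much fuller lexicons.

```python
import re

# Hypothetical, deliberately tiny word lists (assumptions for illustration).
NONSTANDARD = {"thru", "cuz", "thanx", "thx"}
PERSONAL_PRONOUNS = {"i", "me", "my", "we", "us", "our", "you", "your"}

# Crude contraction pattern; it also matches possessives such as "John's"
# and only recognizes the straight apostrophe.
CONTRACTION = re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE)

# Crude sentence-start detector: start of text, or ., !, ? followed by space.
SENTENCE_START = re.compile(r"(?:^|[.!?]\s+)([A-Za-z])")


def informality_counts(text: str) -> dict:
    """Count a handful of informal surface features in one message."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]
    return {
        "tokens": len(tokens),
        "contractions": len(CONTRACTION.findall(text)),
        "ellipses": len(re.findall(r"\.{3,}|…", text)),
        "exclamations": text.count("!"),
        "nonstandard_spellings": sum(t in NONSTANDARD for t in lowered),
        "personal_pronouns": sum(t in PERSONAL_PRONOUNS for t in lowered),
        "lowercase_sentence_starts": sum(
            m.group(1).islower() for m in SENTENCE_START.finditer(text)
        ),
    }


if __name__ == "__main__":
    print(informality_counts("thx for the update... i'll call you cuz we're late!"))
```

Features such as slang, informal tone, and complete sentences are not covered here; they would likely need more than pattern matching.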


Paper #1 – Data

The data comes from two organizations in Malaysia

E-mails are intended to show communication which is horizontal (between employees of equal position) and vertical (sent to an employee of a higher or lower position in the company)


Paper #1 - Findings

The purpose of the paper as stated is to “identify and discuss instances of the email messages being ‘formal’ or ‘conversational’”

The paper does begin a discussion, yet little is discovered about “how and why?”

The data shown is merely 3 samples, formal and informal


Paper #1 – Summary

Potential explanations are discussed, but no hard numbers are given

This paper can help us get started on analysis and selecting features


Preview of Papers 2 and 3

Why were Papers 2 and 3 chosen?

Several other features of formality seemed simple to extract

Capitalization is an important issue and didn’t seem trivial

Since we are dealing with Enron, we want this to be robust within that domain

Ex : “PCE”, “Prentice”, etc.


Paper #2 – Lita et al

“tRuEcasIng”

Process of restoring case information to badly-cased or non-cased text

Statistical language model

Several other applications as well:

Corpora cleaning

Named Entity Recognition

Machine Translation

Automatic Speech Recognition


Paper #2 – Problems to solve

Ambiguity can be a significant problem

Several common words like “pond” and “now” might actually need to be uppercase

Examples :

“us rep. james pond showed up riding an it and going to a now meeting”

“US Rep. James Pond showed up riding an IT and going to a NOW meeting”


Paper #2 – Baseline and Approach

The baseline used is a simple unigram model

The approach builds a statistical language model

Probabilities include :

Trigrams, bigrams and unigrams

A trellis is constructed which is very similar to a Hidden Markov Model

Probabilities are computed at the sentence level
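To make the unigram baseline concrete, here is a minimal sketch of one way such a baseline could work: each lowercased word is mapped to the casing it was most often seen with in training text. This is an illustration under my own assumptions (whitespace tokenization, the names `train_unigram_truecaser` and `truecase`), not the implementation from the paper.

```python
from collections import Counter, defaultdict


def train_unigram_truecaser(sentences):
    """Map each lowercased token to its most frequent surface form.

    Assumes whitespace tokenization. Sentence-initial capitals are counted
    like any other occurrence, which skews the counts for common words.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {low: forms.most_common(1)[0][0] for low, forms in counts.items()}


def truecase(text, model):
    """Restore case word by word; unknown words are left unchanged."""
    return " ".join(model.get(tok.lower(), tok) for tok in text.split())


if __name__ == "__main__":
    training = [
        "the US economy grew",
        "analysts in the US agreed",
        "tell us more",
    ]
    model = train_unigram_truecaser(training)
    print(truecase("the us economy", model))
    print(truecase("tell us more", model))
```

Because the baseline ignores context, it picks “US” even in “tell us more”. That is the kind of ambiguity the bigram and trigram probabilities in the trellis are meant to resolve.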


Paper #2 – Additional conditions

Unknown words

Mixed casing


Paper #2 - Results

Tested against four different test sets

Significant reduction of error compared to the baseline (unigram model)

On current news stories, the accuracy is ~98%


Paper #2 – Future work

Could be applied to:

Accent marks

Punctuation

Additional features could be added or adapted for improvement


Paper #3 - Mikheev

“Periods, Capitalized Words, etc.”

Approach for several aspects of text normalization :

Sentence Boundary Disambiguation (SBD)

Disambiguation of capitalized words

Identification of abbreviations


Paper #3 – Preview of Approach

Before going any further, sorry for picking such a long paper

Coverage will be brief since time is short

Previous work has relied on local contexts

Mikheev proposes a Document-Centered Approach (DCA) in order to derive information from the entire document
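The document-centered idea can be illustrated with a small sketch: a capitalized word at the start of a sentence is ambiguous, so we look at how the same word is cased elsewhere in the same document, in positions where case is informative. This is a simplified illustration under my own assumptions (sentences are already split, whitespace tokenization, hypothetical function names), not Mikheev's actual cascade.

```python
from collections import defaultdict

PUNCT = ".,;:!?\"'()"


def document_evidence(sentences):
    """Record how each word is cased in non-initial positions of a document."""
    seen = defaultdict(set)
    for sentence in sentences:
        # Skip the first token: its capital letter carries no information.
        for token in sentence.split()[1:]:
            word = token.strip(PUNCT)
            if word:
                seen[word.lower()].add("upper" if word[0].isupper() else "lower")
    return seen


def classify_sentence_initial(word, evidence):
    """Guess whether a sentence-initial word is a proper name or a common word."""
    forms = evidence.get(word.lower(), set())
    if forms == {"upper"}:
        return "proper"
    if forms == {"lower"}:
        return "common"
    return "unknown"  # no evidence, or conflicting evidence


if __name__ == "__main__":
    doc = [
        "Prentice called today.",
        "We met Prentice at noon.",
        "Traders were worried.",
        "Many traders sold.",
    ]
    ev = document_evidence(doc)
    for w in ("Prentice", "Traders", "Many"):
        print(w, classify_sentence_initial(w, ev))
```

The “unknown” case is where the precompiled lists and the remaining strategies would have to take over.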


Paper #3 – Building Resources

To use this approach, support resources must be generated

These resources can be built from raw (unlabeled) texts

Development resources were created from the New York Times corpus, but these could also be scraped from the Internet


Paper #3 – Building resources (cont)

List 1 - Common word list

All lowercase words

Threshold was used to prevent source errors in spelling or capitalization

List 2 - Frequent sentence starter list of common words

Words that start sentences are added to the list if they also belong to the common word list

Not perfect, but it provides the 200 most frequent common words that start sentences


Paper #3 – Building resources (cont)

List 3 - Frequent proper names list

Single word proper names that also coincide with the list of common words

Captures words like ‘China’ which are also present as common words like ‘china’

Again, the 200 most frequent instances are on the list

List 4 - Abbreviations list

Collected by applying abbreviation guessing heuristics
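As a rough illustration of how Lists 1 and 2 could be derived from raw text, here is a minimal sketch. The threshold value, the tokenization, and the function name `build_lists` are my assumptions; the paper's actual procedure and thresholds may differ.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"


def build_lists(sentences, min_count=2, top_n=200):
    """Build a common word list and a frequent sentence starter list.

    min_count is a hypothetical frequency threshold meant to filter out
    one-off spelling or capitalization errors in the source text.
    """
    lowercase_counts = Counter()
    starter_counts = Counter()
    for sentence in sentences:
        tokens = [t.strip(PUNCT) for t in sentence.split()]
        tokens = [t for t in tokens if t]
        if not tokens:
            continue
        starter_counts[tokens[0].lower()] += 1
        # Only non-initial, all-lowercase tokens count as common-word evidence.
        for t in tokens[1:]:
            if t.islower():
                lowercase_counts[t] += 1
    common = {w for w, c in lowercase_counts.items() if c >= min_count}
    starters = [w for w, _ in starter_counts.most_common() if w in common][:top_n]
    return common, starters


if __name__ == "__main__":
    corpus = [
        "The market fell.",
        "Traders sold the market.",
        "Enron fell when the traders sold.",
    ]
    common, starters = build_lists(corpus)
    print(sorted(common), starters)
```

Lists 3 and 4 are left out of the sketch, since they depend on proper-name detection and the abbreviation guessing heuristics.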


Paper #3 - Strategies

A cascade of strategies is applied in a specific order

These strategies use the list resources

Each of these strategies provides different coverage :

Sequence strategy

Frequent-list lookup strategy

Single-Word Assignment

Quotes, Brackets and “After Abbr.” Heuristic


Paper #3 - Results

Results are competitive with other machine learning and rule-based systems when comparing SBD, capitalized words and abbreviations

Incorporating the DCA method into a POS tagger significantly reduced error rate

Robust with respect to domain shift and new lexica


Paper #3 - Limitations

Processing relies on “well behaved” (non-noisy) text

Not expected to perform well for single cased texts

Short documents -> Not enough clues

Long documents -> Too many clues

Potential solution for short documents is to make use of a “caching module” to propagate features from one document to the next


Paper #3 – Other testing

Tested on a corpus of Russian news

Different language

Short documents (1-2 paragraphs)


Paper #3 – Interesting quote

“We deliberately shaped our approach so that it largely does not rely on precompiled statistics, because the most interesting events are inherently infrequent and hence are difficult to collect reliable statistics for”


Proposed System


Questions

Using the dataset from Jabbari et al, are senders more likely to be formal in business emails (as opposed to personal)?

Are certain positions in the company more likely to be formal?

Are senders more likely to be formal when sending to a person of higher position?

Are senders more likely to be formal with more people on a thread (“Broadcast”)?


Questions (cont)

How likely are senders to use informal communication on first email contact?

What is the average number of emails before communication switches from formal to informal?

How often does communication “switch” from informal to formal?

Does formal communication become more or less prominent during the media coverage of the scandal?


Questions (cont)

Are senders likely to echo the style of the person they respond to?

Is there much shift over time in an individual’s formality?

Do “informal connections” support the findings of Social Network Analysis and other research?


Research Issues

It seems that the range of what is considered “maximum formality” or “minimum formality” is different across the company

Each sender has their own range of formality


Research Issues – Gold Standard?

Since each sender has a range, annotator agreement on “overall formality” seems impossible

Annotator cannot classify as “Formal/Informal”

Even a 5 point scale is not reasonable

Best annotation is likely a count of each “informal speech act”

Analysis of Formality

Content features to be used :

Capitalization issues

Punctuation issues (‘…’, exclamation points)

Contractions

Complete sentences

Q : Since email length differs, how can I normalize these features?

Q : Should each of these dimensions be tracked discretely or calculated into a full score?
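One possible answer to the length question is to turn raw counts into rates per 100 tokens, and to keep both the per-dimension rates and an optional weighted total. The sketch below assumes a dictionary of counts with a `tokens` entry; the function names and the default equal weights are hypothetical choices, not settled design.

```python
def per_100_tokens(counts, token_key="tokens"):
    """Convert raw feature counts into rates per 100 tokens."""
    total = counts.get(token_key, 0)
    if total == 0:
        return {k: 0.0 for k in counts if k != token_key}
    return {k: 100.0 * v / total for k, v in counts.items() if k != token_key}


def combined_score(rates, weights=None):
    """Weighted sum of the rates; equal weights unless others are given."""
    if weights is None:
        weights = {k: 1.0 for k in rates}
    return sum(weights.get(k, 0.0) * v for k, v in rates.items())


if __name__ == "__main__":
    short_mail = {"tokens": 10, "contractions": 2, "exclamations": 1}
    long_mail = {"tokens": 100, "contractions": 20, "exclamations": 10}
    print(per_100_tokens(short_mail))
    print(per_100_tokens(long_mail))
    print(combined_score(per_100_tokens(short_mail)))
```

With this normalization a short and a long e-mail with the same density of informal features get the same rates. Keeping the dimensions separate and computing the total only at the end leaves both options from the second question open.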


Analysis of formality (cont)

How to capture capitalization issues?

Possibly create a hybrid solution of both papers (Lita et al, Mikheev)

Might be able to create DCA resources both from a “clean” corpus and also the “cleanest” data in Enron

Domain-specific capitalization will likely be critical

Ex : ‘Lay’ (Kenneth) vs. ‘lay’


Analysis of Formality (cont)

Some features seem more difficult to normalize since they can occur at most once :

Greeting

Sign-off

Q : Should these be used to determine formality or used for comparison after analysis has been completed?


Comparison metrics

Once I can quantify formality, two important metrics will be needed:

Average formality across the organization

Average formality across each sender

Comparing each e-mail against these averages with respect to standard deviation will help us determine messages which are more or less formal

Significant differences across sender formality will be used for most questions
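Comparing against an average “with respect to standard deviation” can be read as a z-score, $z = (x - \mu) / \sigma$, computed once against the whole organization and once against the sender's own messages. The sketch below assumes each e-mail already has a single numeric formality score; the function names and the sample data are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def z_score(value, population):
    """Standard score of value within population (0.0 if there is no spread)."""
    sd = pstdev(population)
    return 0.0 if sd == 0 else (value - mean(population)) / sd


def score_emails(emails):
    """emails: list of (sender, formality_score) pairs."""
    by_sender = defaultdict(list)
    for sender, score in emails:
        by_sender[sender].append(score)
    all_scores = [score for _, score in emails]
    return [
        {
            "sender": sender,
            "score": score,
            "z_org": z_score(score, all_scores),            # vs. organization
            "z_sender": z_score(score, by_sender[sender]),  # vs. own range
        }
        for sender, score in emails
    ]


if __name__ == "__main__":
    sample = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 10.0)]
    for row in score_emails(sample):
        print(row)
```

The per-sender z-score addresses the point that each sender has their own range of formality: a message can be unremarkable for the organization but unusual for its author.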


Data capture and Results

Metrics of formality will be stored in a new database table so that relationships can easily be analyzed against other data (times, recipients, business vs. personal, etc)

The data members of this table will capture each selected dimension of formality and possibly a total score

Should be simple to start generating initial reports and answering research questions
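A table along these lines could look like the following sketch. SQLite is used only because it ships with Python; the table name, the column names, and the `message_id` key are all hypothetical, and the real schema would have to match whatever database holds the Enron data.

```python
import sqlite3

# Hypothetical schema: one row per message, one column per dimension.
SCHEMA = """
CREATE TABLE IF NOT EXISTS formality_metrics (
    message_id               TEXT PRIMARY KEY,
    capitalization_rate      REAL,
    punctuation_rate         REAL,
    contraction_rate         REAL,
    incomplete_sentence_rate REAL,
    total_score              REAL
)
"""

COLUMNS = (
    "capitalization_rate",
    "punctuation_rate",
    "contraction_rate",
    "incomplete_sentence_rate",
    "total_score",
)


def store_metrics(conn, message_id, metrics):
    """Insert or update the metrics row for one message.

    Dimensions missing from metrics are stored as NULL.
    """
    conn.execute(SCHEMA)
    values = [metrics.get(col) for col in COLUMNS]
    conn.execute(
        "INSERT OR REPLACE INTO formality_metrics "
        "(message_id, " + ", ".join(COLUMNS) + ") VALUES (?, ?, ?, ?, ?, ?)",
        [message_id, *values],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_metrics(conn, "msg-1", {"contraction_rate": 2.5, "total_score": 7.0})
    print(conn.execute("SELECT * FROM formality_metrics").fetchall())
```

Keying the table on a message identifier is what would allow joins against sender, recipient, and timestamp data for the research questions.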


Questions? ANSWERS???

Questions?

Ideas on how to quantify/normalize this notion of “formality”?

