How Many Folders Do You Really Need? Classifying Email into a Handful of Categories

Date: 2015/07/08
Author: Mihajlo Grbovic, Guy Halawi, Zohar Karnin, Yoelle Maarek
Source: CIKM '14
Advisor: Jia-ling Koh
Speaker: LIN, CI-JIE


Outline

  • Introduction
  • Method
  • Experiment
  • Conclusion


Introduction

  • Email classification is still a mostly manual task.
  • Recently, automatic classification offering the same categories to all users has started to appear in some Web mail clients.
  • Today's commercial Web mail traffic is dominated by machine-generated messages (social networks, e-commerce sites, etc.).

Goal

  • Automatically distinguish between personal and machine-generated email.
  • Classify messages into latent categories, without requiring users to have defined any folder.


Overview

[Pipeline diagram. Stages shown: email raw data, LDA cluster, latent categories, feature extraction, aggregation, training data generation, test data.]


DISCOVERING LATENT CATEGORIES

  • Retrieved the most “popular” folders created by users, ignoring system folders (e.g., “trash”, “spam”).
  • Applied LDA to these document folders in order to discover a set of latent topics; the latent topics map into “latent categories”.
  • The topics were obtained for K = 6, as this value exposed a good balance between total and individual coverage.
  • The email traffic coverage at K = 6 was 70%.

[Figure labels: machine generated, human generated.]
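To make the topic-discovery step more concrete, here is a minimal sketch of LDA fitted with collapsed Gibbs sampling in plain Python. It is only an illustration: the slides do not say which LDA implementation or inference method was used, and the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy folder documents are all assumptions.

```python
import random

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of token lists (here, one hypothetical "document" per folder).
    Returns (theta, top_words): per-document topic proportions and the
    three most frequent words of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: i for i, w in enumerate(vocab)}
    v = len(vocab)
    ndk = [[0] * k for _ in docs]       # document-topic counts
    nkw = [[0] * v for _ in range(k)]   # topic-word counts
    nk = [0] * k                        # tokens assigned to each topic
    z = []                              # topic assignment of every token
    for d, doc in enumerate(docs):
        assignments = []
        for w in doc:
            t = rng.randrange(k)
            assignments.append(t)
            ndk[d][t] += 1
            nkw[t][wid[w]] += 1
            nk[t] += 1
        z.append(assignments)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t, wi = z[d][i], wid[w]
                # Remove the token, then resample its topic given all others.
                ndk[d][t] -= 1
                nkw[t][wi] -= 1
                nk[t] -= 1
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][wi] + beta) / (nk[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                ndk[d][t] += 1
                nkw[t][wi] += 1
                nk[t] += 1
    theta = [
        [(c + alpha) / (len(doc) + k * alpha) for c in ndk[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(v), key=lambda i: -nkw[t][i])[:3]]
        for t in range(k)
    ]
    return theta, top_words

# Toy folder documents (hypothetical); the slides use K = 6 on real folders.
folders = [
    ["order", "receipt", "shipping", "order"],
    ["flight", "hotel", "booking", "flight"],
    ["receipt", "shipping", "order"],
    ["hotel", "booking", "flight"],
]
theta, top_words = lda_gibbs(folders, k=2, iters=50)
```

Each row of `theta` is one folder's mixture over topics; a folder's dominant topic is what would be read as its latent category.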


Extracting Features

Content features
  • words extracted from the subject line and message body
  • the subject character length and body character length
  • the number of URLs occurring in the body

Address features
  • features extracted from the sender email address
  • the subdomains (e.g. .edu, .gov) and subnames (e.g. billing, noreply)
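The content and address features listed above could be extracted roughly as follows. This is a sketch under assumptions: the function names, the tokenization regex, the URL regex, and the choice of separators for splitting the local part of the address are all hypothetical and not taken from the slides.

```python
import re

URL_RE = re.compile(r"https?://\S+")
WORD_RE = re.compile(r"[a-z0-9']+")

def content_features(subject, body):
    """Content features for a single message."""
    text = (subject + " " + body).lower()
    return {
        "words": sorted(set(WORD_RE.findall(text))),
        "subject_len": len(subject),            # subject character length
        "body_len": len(body),                  # body character length
        "num_urls": len(URL_RE.findall(body)),  # URLs occurring in the body
    }

def address_features(sender):
    """Split a sender address into domain labels and local-part tokens."""
    local, _, domain = sender.lower().partition("@")
    return {
        "subdomains": [p for p in domain.split(".") if p],
        "subnames": [p for p in re.split(r"[._+\-]", local) if p],
    }
```

For an address such as `billing.noreply@mail.example.edu` (a made-up example), the subnames would be `billing` and `noreply`, and the subdomains would include `edu`.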

Behavioral features
  • weekly and monthly volumes of sent messages
  • volumes of messages sent as a reply
  • volumes of messages sent as a forward (with FW: in the subject line)
  • volume of messages received by the sender
  • volume of messages received as a reply
  • volume of messages received as a forward
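A minimal sketch of the sent-side volume counts follows. The slide only states that forwards carry FW: in the subject line; detecting replies by a Re: prefix, accepting Fwd: as a forward marker, and the tuple layout of the input are assumptions made for illustration. Received-side volumes would follow the same pattern keyed on the recipient.

```python
from collections import Counter
from datetime import datetime

def message_kind(subject):
    """Classify a message as forward, reply or new from its subject prefix."""
    s = subject.strip().lower()
    if s.startswith(("fw:", "fwd:")):
        return "forward"
    if s.startswith("re:"):      # assumption: replies carry a Re: prefix
        return "reply"
    return "new"

def sent_volumes(messages):
    """Count sent messages per sender, per ISO week and per month.

    messages: iterable of (sender, sent_at, subject) with sent_at a datetime.
    Keys are (sender, year, week_or_month, kind), where kind is "all",
    "reply", "forward" or "new".
    """
    weekly, monthly = Counter(), Counter()
    for sender, sent_at, subject in messages:
        kind = message_kind(subject)
        iso_year, iso_week, _ = sent_at.isocalendar()
        for k in ("all", kind):
            weekly[(sender, iso_year, iso_week, k)] += 1
            monthly[(sender, sent_at.year, sent_at.month, k)] += 1
    return weekly, monthly

# Hypothetical messages for illustration only.
example = [
    ("a@x.example", datetime(2014, 11, 3, 9), "FW: sale"),
    ("a@x.example", datetime(2014, 11, 4, 9), "Re: hi"),
    ("a@x.example", datetime(2014, 11, 5, 9), "hello"),
]
weekly, monthly = sent_volumes(example)
```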

Temporal behavior features
  • record whether a sender sends more than X messages in an hour
  • X takes the values 10, 60, 80, 100, 120
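The burst flags described above can be sketched as below. The slides do not say whether "in an hour" means a clock hour or any 60-minute span; this sketch assumes a sliding one-hour window over timestamps given in seconds, and the function and key names are hypothetical.

```python
THRESHOLDS = (10, 60, 80, 100, 120)  # values of X listed on the slide

def max_messages_per_hour(timestamps, window=3600):
    """Largest number of messages falling in any sliding one-hour window."""
    ts = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(ts):
        # Shrink the window until it spans strictly less than one hour.
        while t - ts[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best

def burst_flags(timestamps):
    """One boolean feature per threshold X: more than X messages in an hour."""
    peak = max_messages_per_hour(timestamps)
    return {f"more_than_{x}_per_hour": peak > x for x in THRESHOLDS}
```

A sender with 70 messages spread over about 35 minutes would set the flags for X = 10 and X = 60 but not for X = 80 and above.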


Aggregation

[Figure: aggregation example, labeled Financial.]


TRAINING DATA

  • Consider 3 types of labeling techniques:
    - manual
    - heuristic-based
    - automatic
  • 6 latent categories: human, career, shopping, travel, finance, social

Manual labeling
  • Human editors assign labels to specific examples.

Heuristic labeling
  • Used mostly for differentiating between human and machine senders.
  • Identify corporate machine senders:
    - senders such as “mailer-daemon” or “no-reply”
    - repeating occurrences of words such as “unsubscribe” in message headers
    - SMTP domain information
  • Identify human senders:
    - addresses of the form <first name>.<last name>@
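The heuristic rules above can be sketched as a small labeling function. This is an illustration only: the function name, the exact token list, the regular expression for the first-name.last-name pattern, and the `min_unsubscribe_hits` threshold are assumptions, and the SMTP domain check is left out because the slides give no detail on it.

```python
import re

MACHINE_TOKENS = ("mailer-daemon", "no-reply", "noreply")
# Assumed pattern for <first name>.<last name>@ addresses.
HUMAN_RE = re.compile(r"^[a-z]+\.[a-z]+@")

def heuristic_label(sender, header_text="", min_unsubscribe_hits=2):
    """Return "machine", "human", or None when no rule fires."""
    address = sender.strip().lower()
    local = address.split("@", 1)[0]
    if any(token in local for token in MACHINE_TOKENS):
        return "machine"
    # "Repeating occurrences" is read here as at least two hits (assumption).
    if header_text.lower().count("unsubscribe") >= min_unsubscribe_hits:
        return "machine"
    if HUMAN_RE.match(address):
        return "human"
    return None
```

Returning `None` for senders that match no rule keeps them out of the heuristic training set instead of forcing a guess.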

Automatic labeling: folder-based majority voting

Sender: [email protected]

  Folder        Count  Category
  purchase      55     Shopping
  ebay          4      Shopping
  credit cards  1      Finance
  hotel         6      Travel

Votes: Shopping 55 + 4 = 59, Finance 1, Travel 6

Category: Shopping, since 59 > 50 (threshold) and the number of folders > 1 (threshold)
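The worked example above can be reproduced with a short sketch. The function name and parameter names are hypothetical, and one point is an interpretation: the slide does not say which folders the "number of folders > 1" test counts, so this sketch assumes it means the distinct folder names that voted for the winning category.

```python
from collections import Counter, defaultdict

def majority_vote(folder_counts, folder_category, min_votes=50, min_folders=1):
    """Label a sender by summing folder counts per category.

    folder_counts: {folder name: count for this sender}
    folder_category: {folder name: category}
    Returns the winning category, or None if either threshold is not exceeded.
    """
    votes = Counter()
    supporting = defaultdict(set)
    for folder, count in folder_counts.items():
        category = folder_category.get(folder)
        if category is None:
            continue  # folder not mapped to any latent category
        votes[category] += count
        supporting[category].add(folder)
    if not votes:
        return None
    category, top = votes.most_common(1)[0]
    if top > min_votes and len(supporting[category]) > min_folders:
        return category
    return None

counts = {"purchase": 55, "ebay": 4, "credit cards": 1, "hotel": 6}
mapping = {"purchase": "Shopping", "ebay": "Shopping",
           "credit cards": "Finance", "hotel": "Travel"}
label = majority_vote(counts, mapping)  # Shopping: 59 votes from 2 folders
```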

Automatic labeling: folder-based LDA voting

Sender: [email protected]

  Folder        Topic distribution
  purchase      Shopping 70%, Finance 20%
  ebay          Shopping 60%, Social 10%
  credit cards  Finance 90%, Shopping 10%
  hotel         Travel 74%, Finance 15%

Summed votes:
  • Shopping: 0.7 + 0.6 + 0.1 = 1.4
  • Finance: 0.2 + 0.9 + 0.15 = 1.25
  • Travel: 0.74
  • Social: 0.1

Category: Shopping
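A minimal sketch of the soft vote in this example: each folder contributes its topic proportions, the proportions are summed per category, and the largest total wins. The function name and the dictionary layout are assumptions; the numbers are the ones from the slide.

```python
from collections import defaultdict

def lda_vote(folder_topics):
    """Sum per-folder topic proportions and return (winner, totals)."""
    totals = defaultdict(float)
    for distribution in folder_topics.values():
        for category, weight in distribution.items():
            totals[category] += weight
    winner = max(totals, key=totals.get)
    return winner, dict(totals)

folder_topics = {
    "purchase": {"Shopping": 0.70, "Finance": 0.20},
    "ebay": {"Shopping": 0.60, "Social": 0.10},
    "credit cards": {"Finance": 0.90, "Shopping": 0.10},
    "hotel": {"Travel": 0.74, "Finance": 0.15},
}
winner, totals = lda_vote(folder_topics)
```

Unlike majority voting, a folder with a mixed topic profile splits its vote across categories, which is why Finance ends up close behind Shopping here.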


CLASSIFICATION MECHANISM

Online lightweight classification
  • consists of hard-coded rules designed to classify quickly
  • finds the top 100 senders that cover a significant percentage of the total traffic and are category consistent
  • categorizes all reply/forward messages as human
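One way the list of top senders could be selected is sketched below. The slides do not define "category consistent" or say where the per-message categories come from, so the `min_purity` threshold, the labeled `(sender, category)` input, and the function name are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def top_consistent_senders(labeled_messages, n=100, min_purity=0.95):
    """Pick up to n high-volume senders whose messages share one category.

    labeled_messages: iterable of (sender, category) pairs.
    Returns (table, coverage): sender -> category for the selected senders,
    and the fraction of all messages sent by those senders.
    """
    per_sender = defaultdict(Counter)
    for sender, category in labeled_messages:
        per_sender[sender][category] += 1
    ranked = sorted(per_sender.items(), key=lambda kv: -sum(kv[1].values()))
    table = {}
    for sender, categories in ranked:
        if len(table) == n:
            break
        category, hits = categories.most_common(1)[0]
        # Keep the sender only if one category dominates its traffic.
        if hits / sum(categories.values()) >= min_purity:
            table[sender] = category
    total = sum(sum(c.values()) for c in per_sender.values())
    covered = sum(sum(per_sender[s].values()) for s in table)
    return table, (covered / total if total else 0.0)
```

The returned coverage figure gives a quick check on whether the selected senders really account for a significant share of traffic.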

Online sender-based classification
  • looks up the sender in a lookup table containing senders with known categories

  Sender             Category
  [email protected]  shopping
  [email protected]  travel
  [email protected]  shopping
  [email protected]  finance

Example: looking up [email protected] returns shopping.

Offline creation of classified senders table
  • use the training set to train a logistic regression model
  • train a separate model for each category in a one-vs-all manner
  • the classification process is run periodically to account for new senders
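A minimal sketch of one-vs-all logistic regression and of building a sender table from it, in plain Python. The toy two-dimensional feature vectors, the sender address, the learning rate and the epoch count are assumptions, and batch gradient descent is used only for illustration; the slides do not describe the optimizer or the feature encoding.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_binary(X, y, epochs=500, lr=0.5):
    """Batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def train_one_vs_all(X, labels):
    """One binary model per category: that category against all the others."""
    return {
        c: train_binary(X, [1.0 if label == c else 0.0 for label in labels])
        for c in sorted(set(labels))
    }

def predict(models, x):
    """Return the category whose model gives the highest probability."""
    scores = {
        c: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
        for c, (w, b) in models.items()
    }
    return max(scores, key=scores.get)

def build_sender_table(models, sender_features):
    """Classify every sender once, offline, to fill the lookup table."""
    return {s: predict(models, x) for s, x in sender_features.items()}

# Toy sender features (hypothetical): [mentions a price, mentions a flight].
X = [[1, 0], [1, 0], [0, 1], [0, 1], [0, 0], [0, 0]]
labels = ["shopping", "shopping", "travel", "travel", "human", "human"]
models = train_one_vs_all(X, labels)
table = build_sender_table(models, {"deals@shop.example": [1, 0]})
```

Rerunning `build_sender_table` on a schedule is one simple way to pick up senders that appeared since the last run.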

[Figure: a new email is scored by logistic regression against the categories human, shopping, finance, travel, social and career; in the example, the resulting sender category is finance.]

Online heavy-weight classification
  • email messages whose sender does not appear in the classified senders table are sent to a heavy-weight, message-based classifier
  • uses all relevant features pertaining to the message body, subject line and sender name
  • employs a logistic regression classifier
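The three online tiers can be read as a cascade that stops at the first tier able to answer. The sketch below assumes that reading; the function signature, the message dictionary layout, the order of the two lightweight rules, and the stand-in `heavy_classifier` callable are all hypothetical.

```python
def classify(message, top_senders, sender_table, heavy_classifier):
    """Return (category, tier) for a message dict with 'sender' and 'subject'.

    top_senders and sender_table map sender address -> category;
    heavy_classifier is any callable taking the message and returning a category.
    """
    subject = message["subject"].strip().lower()
    sender = message["sender"].strip().lower()
    # Tier 1: hard-coded lightweight rules.
    if subject.startswith(("re:", "fw:", "fwd:")):
        return "human", "lightweight"
    if sender in top_senders:
        return top_senders[sender], "lightweight"
    # Tier 2: lookup in the offline-built classified senders table.
    if sender in sender_table:
        return sender_table[sender], "sender"
    # Tier 3: fall back to the message-level classifier.
    return heavy_classifier(message), "heavy-weight"
```

Ordering the tiers from cheapest to most expensive means the message-level model only runs for senders that neither table knows.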

Offline training of the message-level classifier
  • a logistic regression model is trained for each category in a one-vs-all manner
  • the training process is quite similar to that of the sender classifier, except for the training set, which contains messages rather than senders


Experiment

  • Experimental evaluation was performed on more than 500 billion messages received during a period of six months by users of the Yahoo mail service.


  • AUC (one-vs-rest classification)
  • Performance on different feature subsets: content features (email body, subject, etc.)
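For reference, one-vs-rest AUC for a single category can be computed from classifier scores as the probability that a randomly chosen positive example outranks a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise definition; the function names and the score layout are assumptions, and the quadratic loop is only suitable for small samples.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(score_rows, true_labels, category):
    """AUC of one category's scores, treating every other category as negative.

    score_rows: one {category: score} dict per example.
    """
    return auc(
        [row[category] for row in score_rows],
        [label == category for label in true_labels],
    )
```

Computing this once per category and per feature subset gives the kind of comparison the slide refers to.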


CONCLUSION

  • Presented a Web-scale categorization approach:
    - offline learning
    - online classification
  • Discovered latent categories
  • Categories cover more than 70% of both email traffic and email search queries

Thanks for listening