Understanding Email Traffic (talk @ E-Discovery NL Symposium)

  • Understanding email traffic David Graus, University of Amsterdam d.p.graus@uva.nl @dvdgrs
  • 3 Recipient recommendation: given a sender, an email, and all possible recipients (in an enterprise), predict which recipient(s) are most likely to receive the email.
  • 4 Why? Understanding communication in, and the structure of, an enterprise. Applications in: enterprise search, expert finding, community detection, spam classification, anomaly detection.
  • 5 How? Gmail: whom do you frequently co-address (ego network). Related work: Social Network Analysis (SNA); email content. Us: SNA + email content.
  • 6 Part 1: Social Network Analysis? d.p.graus@uva.nl z.ren@uva.nl derijke@uva.nl
  • 7 image by Calvinius - Creative Commons Attribution-Share Alike 3.0
  • 8 SNA for predicting recipients? 1. Importance of a node in the network: more important people are more likely to be the recipient of an email. 2. Strength of connection between two nodes: given the sender of the email, the people that sender addresses frequently are more likely to be the recipients.
  • 9 SNA for predicting recipients? 1. Importance of a node in the network: (a) number of received emails, (b) PageRank score of the node. 2. Strength of connection between two nodes: (a) number of emails sent between the nodes, (b) number of times the two nodes are addressed together.
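The four signals on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation behind the talk: the toy `emails` list, the function names, and the PageRank settings (damping 0.85, 50 iterations, edge weights equal to email counts) are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (sender, [recipients]) pairs, not taken from the talk.
emails = [
    ("ann", ["bob", "cat"]),
    ("ann", ["bob"]),
    ("bob", ["ann", "cat"]),
    ("cat", ["bob"]),
]

def received_counts(emails):
    """Importance (a): number of emails each node received."""
    counts = Counter()
    for _, recipients in emails:
        counts.update(recipients)
    return counts

def pagerank(emails, damping=0.85, iterations=50):
    """Importance (b): PageRank on the weighted sender -> recipient graph."""
    nodes = sorted({s for s, _ in emails} | {r for _, rs in emails for r in rs})
    out_edges = {n: Counter() for n in nodes}
    for sender, recipients in emails:
        out_edges[sender].update(recipients)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for node in nodes:
            total = sum(out_edges[node].values())
            if total == 0:
                # Node that never sent mail: spread its rank evenly.
                for n in nodes:
                    new_rank[n] += damping * rank[node] / len(nodes)
                continue
            for target, weight in out_edges[node].items():
                new_rank[target] += damping * rank[node] * weight / total
        rank = new_rank
    return rank

def sent_counts(emails):
    """Connection (a): number of emails sent from one node to another."""
    counts = Counter()
    for sender, recipients in emails:
        for recipient in recipients:
            counts[(sender, recipient)] += 1
    return counts

def co_addressed_counts(emails):
    """Connection (b): number of times two nodes appear as recipients together."""
    counts = Counter()
    for _, recipients in emails:
        for a, b in combinations(sorted(set(recipients)), 2):
            counts[(a, b)] += 1
    return counts
```

Each function returns a dictionary-like object, so a candidate recipient's score for any signal is a single lookup, for example `sent_counts(emails)[("ann", "bob")]`.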
  • 10 Part 2: Email content. Statistical Language Models (LMs): assign a probability to a sequence of words; compute models for different corpora. Used in lots of places: Information Retrieval, Machine Translation, Speech Recognition.
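To make "assign a probability to a sequence of words" concrete, here is a minimal unigram language model sketch. The slides do not say which model order or smoothing was used; the unigram assumption, the add-alpha smoothing, and the names `train_unigram` and `log_prob` are choices made for this illustration only.

```python
import math
from collections import Counter

def train_unigram(texts):
    """Count word occurrences over a corpus (a list of strings)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def log_prob(words, counts, vocab_size, alpha=1.0):
    """Log-probability of a word sequence under an add-alpha smoothed unigram LM."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + alpha) / (total + alpha * vocab_size))
        for w in words
    )

# Hypothetical toy corpora, one model per corpus.
legal = train_unigram(["the contract is signed", "sign the contract"])
social = train_unigram(["lunch at noon", "see you at lunch"])
vocab_size = len(set(legal) | set(social))

query = "the contract".split()
more_legal = log_prob(query, legal, vocab_size) > log_prob(query, social, vocab_size)
```

Computing one model per corpus and comparing the scores of the same sequence is what lets a language model act as a profile: here `more_legal` is true because the query words are frequent in the first corpus and unseen in the second.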
  • 11 Language Models Language models as communication profiles
  • 12 Language Models Language models as communication profiles 1. Incoming LM (how people talk to user)
  • 13 Language Models Language models as communication profiles 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people)
  • 14 Language Models Language models as communication profiles 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people) 3. Interpersonal LM (how node1 talks with node2)
  • 16 Language Models Language models as communication profiles 1. Incoming LM (how people talk to user) 2. Outgoing LM (how user talks to people) 3. Interpersonal LM (how node1 talks with node2) 4. Corpus LM (how everyone talks)
  • 17 Why language models? Comparisons between communication profiles: Find nodes with most similar communication
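A sketch of how the four communication profiles could be built and compared. The slides do not name a similarity measure; Jensen-Shannon divergence is used here as a stand-in because it needs no smoothing. Treating the interpersonal profile as an unordered pair of nodes, the toy `emails`, and all function names are likewise assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: (sender, [recipients], text) triples.
emails = [
    ("ann", ["bob"], "please review the contract"),
    ("bob", ["ann"], "contract review done"),
    ("ann", ["cat"], "lunch friday"),
]

def build_profiles(emails):
    """Collect word counts for incoming, outgoing, interpersonal and corpus LMs."""
    incoming = defaultdict(Counter)       # how people talk to a user
    outgoing = defaultdict(Counter)       # how a user talks to people
    interpersonal = defaultdict(Counter)  # how node1 talks with node2
    corpus = Counter()                    # how everyone talks
    for sender, recipients, text in emails:
        words = text.lower().split()
        corpus.update(words)
        outgoing[sender].update(words)
        for recipient in recipients:
            incoming[recipient].update(words)
            interpersonal[frozenset((sender, recipient))].update(words)
    return incoming, outgoing, interpersonal, corpus

def to_distribution(counts):
    """Turn raw counts into a maximum-likelihood unigram distribution."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint profiles."""
    words = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    def kl(a):
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

incoming, outgoing, interpersonal, corpus = build_profiles(emails)
ann_bob = to_distribution(interpersonal[frozenset(("ann", "bob"))])
ann_cat = to_distribution(interpersonal[frozenset(("ann", "cat"))])
distance = js_divergence(ann_bob, ann_cat)
```

Finding the nodes with the most similar communication then amounts to sorting candidates by ascending divergence from the profile of interest.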
  • 18 SNA: 1. Importance of a node in the network 2. Strength of connection between nodes. Email content: 1. Incoming LM 2. Outgoing LM 3. Interpersonal LM 4. Corpus-based LM
  • 19 Approach: time-based. t=0: 1 email, 2 addresses; t=1: 2 emails, 2 addresses; t=2: 3 emails, 4 addresses; t=3: 4 emails, 5 addresses; etc.; t=n: 607,011 emails, 2,068 addresses.
  • 20 At some time interval t: given the email, sender, and network, remove the recipients from the email and rank all nodes in the network by computing, for each candidate (recipient) node: 1. Importance of candidate 2. Strength of connection between sender and candidate 3. Similarity between sender and candidate LMs
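The ranking step can be sketched as below. The slides do not say how the three scores are combined or which features were finally used; the equal-weight sum, the max-normalisation, cosine similarity against the interpersonal word counts, and the name `rank_candidates` are all assumptions for this illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_candidates(sender, text, history, weights=(1.0, 1.0, 1.0)):
    """Rank every known node except the sender as a recipient of `text`.

    `history` holds the (sender, [recipients], text) emails seen before time t.
    """
    received, sent, interpersonal = Counter(), Counter(), {}
    for s, recipients, body in history:
        words = body.lower().split()
        for r in recipients:
            received[r] += 1
            sent[(s, r)] += 1
            interpersonal.setdefault(frozenset((s, r)), Counter()).update(words)
    candidates = ({s for s, _, _ in history} | set(received)) - {sender}
    max_received = max(received.values(), default=0) or 1
    max_sent = max((c for (s, _), c in sent.items() if s == sender), default=0) or 1
    query = Counter(text.lower().split())
    scores = {}
    for c in candidates:
        importance = received[c] / max_received          # 1. importance of candidate
        strength = sent[(sender, c)] / max_sent          # 2. sender-candidate connection
        profile = interpersonal.get(frozenset((sender, c)), Counter())
        similarity = cosine(query, profile)              # 3. LM similarity (stand-in)
        scores[c] = (weights[0] * importance
                     + weights[1] * strength
                     + weights[2] * similarity)
    return sorted(candidates, key=lambda c: (-scores[c], c))

# Hypothetical toy history, oldest first.
history = [
    ("ann", ["bob"], "contract draft attached"),
    ("ann", ["cat"], "lunch on friday"),
    ("bob", ["ann"], "contract looks fine"),
    ("cat", ["ann", "bob"], "friday lunch works"),
]
```

With this toy history, `rank_candidates("ann", "updated contract draft", history)` puts `bob` first, while `rank_candidates("ann", "lunch friday", history)` puts `cat` first: the content signal outweighs bob's higher importance score.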
  • 22 Findings: what works for predicting recipients? Importance of node: number of received emails of the node. Strength of connection: number of emails between nodes. LM similarity: the interpersonal LM is most important.
  • 23 Findings: SNA vs email content. SNA: SNA signals deteriorate over time; SNA signals are most informative on highly active users. Email content: the LM signal improves over time; the LM signal does worse with highly active users.
  • 24 Finally: combining Social Network Analysis with Language Modeling is better than using either alone.
  • 25 Why for E-Discovery? Anomaly detection: given a working prediction model, identify unexpected communication. Language models for communication: for a node, find the most different interpersonal communication (friends/family vs colleagues?); find communication that differs from the corpus-based communication.
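The anomaly-detection idea can be sketched as: predict the likely recipients, then flag the actual recipients the model did not expect. The predictor below is a deliberately simple frequency-based stand-in, not the combined SNA + LM model from the talk; `frequency_ranker`, `unexpected_recipients`, and the `top_k` cut-off are hypothetical.

```python
from collections import Counter

def frequency_ranker(history):
    """Build a stand-in predictor: rank people by how often the sender mailed them."""
    sent = Counter()
    for sender, recipients in history:
        for recipient in recipients:
            sent[(sender, recipient)] += 1
    people = sorted({s for s, _ in history} | {r for _, rs in history for r in rs})

    def rank(sender):
        others = [p for p in people if p != sender]
        return sorted(others, key=lambda p: (-sent[(sender, p)], p))

    return rank

def unexpected_recipients(sender, recipients, rank, top_k=2):
    """Return the actual recipients that fall outside the model's top-k prediction."""
    expected = set(rank(sender)[:top_k])
    return [r for r in recipients if r not in expected]

# Hypothetical toy history of (sender, [recipients]) pairs.
history = (
    [("ann", ["bob"])] * 3
    + [("ann", ["cat"])] * 2
    + [("bob", ["dan"])]
)
rank = frequency_ranker(history)
flagged = unexpected_recipients("ann", ["bob", "dan"], rank)
```

Here `flagged` contains only `dan`, whom `ann` has never mailed before. Any ranking function with the same interface could be swapped in for `frequency_ranker`.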