
Spam Sinkholing

Nick Feamster

Introduction

• Goal: Identify bots (and botnets) by observing second-order effects
  – Observe “application” behavior that’s likely to contain bot activity (spam is a good candidate: > 85% of spam was coming from bots as of 4Q 2005)

• Advantages:
  – Direct observation of behavior
  – Potentially very wide lens
  – Passive

• Disadvantage: No ground truth

Spam Collection Overview

• Trap mail sent to “dead” domains

• Log IPs

• Perform active and passive measurements
  – Traceroute
  – Passive SYN fingerprints
  – DNSBL lookups, etc.
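As a rough illustration of the DNSBL lookup step, the sketch below builds the DNS name that is conventionally queried for an IPv4 address: the octets are reversed and prepended to the blacklist zone. The function name and the zone `dnsbl.example.org` are placeholders chosen for this example, not names from the slides, and the actual DNS query (for example with `socket.gethostbyname`) is left out so the sketch runs without network access.

```python
import ipaddress


def dnsbl_query_name(client_ip: str, zone: str) -> str:
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone.

    Both the function name and the zone passed in are illustrative.
    """
    # Validate the address; raises ValueError on malformed input.
    addr = ipaddress.IPv4Address(client_ip)
    # DNSBLs are queried with the octets in reverse order.
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


# Placeholder zone; a real deployment would use an actual blacklist zone.
query = dnsbl_query_name("192.0.2.99", "dnsbl.example.org")
print(query)
```

Resolving the resulting name and getting an address record back conventionally means the IP is listed; a name-not-found answer means it is not.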

Data Collection Overview

[Diagram: spammers perform MX lookups via DNS, which resolve to the sinkhole. Other labels: Mail Avenger, sendmail, Blowtorch (GTISC), dynamo, rsync, (schema on wiki).]

• O(100k) pieces of spam per week

• Hundreds of domains

Sample Mail Avenger Header

Highly configurable SMTP server that collects many useful statistics

Database Schema Sample

CREATE TABLE spamtrap_email (
    entrytime        timestamp with time zone default NULL,
    trap_domain      text default NULL,
    client_ip        ip4 default NULL,
    client_port      smallint default NULL,
    traceroute_time  timestamp with time zone default NULL,
    to_              text default NULL,
    delivered_to     text default NULL,
    subject          text default NULL,
    xmailer          text default NULL,
    from_            text default NULL,
    emailid          serial,
    -- the dnsbl_id column referenced below is not shown in this sample
    FOREIGN KEY (dnsbl_id) REFERENCES spamtrap_dnsbl (dnsbl_id)
) tablespace dataspace;
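To make the header-derived columns concrete, here is a minimal sketch of how one trapped message might be mapped onto a row of `spamtrap_email`. The helper `to_row`, the sample message, and the mapping of `xmailer` and `delivered_to` to the `X-Mailer` and `Delivered-To` headers are assumptions made for illustration; the slides do not describe the actual loader.

```python
from email import message_from_string


def to_row(raw_message: str, client_ip: str, client_port: int,
           trap_domain: str) -> dict:
    """Map one raw message onto the header-derived schema columns.

    Hypothetical helper: the header-to-column mapping is assumed.
    """
    msg = message_from_string(raw_message)
    return {
        "trap_domain": trap_domain,
        "client_ip": client_ip,
        "client_port": client_port,
        # Header lookups are case-insensitive; missing headers yield None,
        # which matches the schema's "default NULL" columns.
        "to_": msg.get("To"),
        "delivered_to": msg.get("Delivered-To"),
        "subject": msg.get("Subject"),
        "xmailer": msg.get("X-Mailer"),
        "from_": msg.get("From"),
    }


# Made-up message addressed to a placeholder "dead" domain.
sample = (
    "From: a@example.com\n"
    "To: b@dead-domain.example\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
row = to_row(sample, "192.0.2.99", 2525, "dead-domain.example")
print(row)
```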

Uses for Data

• Identification: Low-confidence list of likely bot IPs

• Bootstrapping: Use as a starter set for some “intractable” analysis problems
  – Use this low-confidence list to prune DNSBL graph mining
  – Feed this information back to ISPs to focus mining

• Second-order effects
  – Analysis of hosting sites for URLs
  – Clustering

Analysis Within Spam Dataset

• Clustering to identify groups (coordination suggests likely bot)
  – Temporal-based correlation
  – Content-based correlation
    • Based on URLs

• Analysis of hosting URLs: Perhaps useful for identifying phishing sites
  – Where hosted?
  – Transience?
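One simple way to picture URL-based content correlation is to group sending IPs by the host names of the URLs their messages advertise. The sketch below is an illustrative assumption rather than the method used in this work: the regular expression, the function name, and the two-IP threshold are all made up for the example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Deliberately loose URL pattern; real spam needs more careful extraction.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def cluster_by_url_host(messages, min_ips=2):
    """Group client IPs by the host of each URL found in message bodies.

    messages: iterable of (client_ip, body) pairs.
    Returns {host: set of client IPs}, keeping hosts seen from >= min_ips IPs.
    """
    clusters = defaultdict(set)
    for client_ip, body in messages:
        for url in URL_RE.findall(body):
            host = urlsplit(url).hostname
            if host:
                clusters[host].add(client_ip)
    return {h: ips for h, ips in clusters.items() if len(ips) >= min_ips}


# Made-up sample data using documentation address space.
messages = [
    ("192.0.2.1", "buy http://shop.example/a now"),
    ("192.0.2.2", "see http://shop.example/b"),
    ("192.0.2.3", "visit http://other.example/"),
]
clusters = cluster_by_url_host(messages)
print(clusters)
```

The same host-keyed grouping could also feed the hosting analysis: the keys are the sites whose hosting location and lifetime one would then examine.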

Correlation: Across Datasets

• DNSBL datasets require bootstrapping
  – As per SRUTI paper
  – Use spam dataset as a graph pruning mechanism

• Possibility: Use spam sinkhole as a source for malware. Strip attachments.
  – Likely already being done by lots of others

• Get information about exfiltration email addresses and domains from binary analysis
  – Look for those appearing in sinkhole to build confidence and monitor ongoing activity
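The cross-check against binary-analysis results could be as simple as matching known exfiltration addresses and domains against the address fields of sinkhole records. The following is a minimal sketch under that assumption; the function name, the choice of fields, and the sample records are hypothetical.

```python
from email.utils import parseaddr


def match_exfil_indicators(rows, exfil_addresses, exfil_domains):
    """Return (emailid, field, address) for rows matching an indicator.

    rows: dicts with spamtrap_email-style keys (from_, to_, delivered_to).
    Hypothetical helper; the field choice is an assumption.
    """
    addresses = {a.lower() for a in exfil_addresses}
    domains = {d.lower() for d in exfil_domains}
    hits = []
    for row in rows:
        for field in ("from_", "to_", "delivered_to"):
            # parseaddr strips display names such as "Bob <bob@host>".
            _, addr = parseaddr(row.get(field) or "")
            addr = addr.lower()
            if not addr:
                continue
            domain = addr.rpartition("@")[2]
            if addr in addresses or domain in domains:
                hits.append((row.get("emailid"), field, addr))
    return hits


# Made-up records and indicators for illustration only.
rows = [
    {"emailid": 1, "from_": "Bob <Drop@Evil.example>", "to_": "x@trap.example"},
    {"emailid": 2, "from_": "c@ok.example", "to_": None},
]
hits = match_exfil_indicators(rows, ["drop@evil.example"], [])
print(hits)
```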
