21
© 2008 Carnegie Mellon University Detecting Spam and Spam Responses FloCon 2009 Timothy J. Shimeall, Ph.D. CERT/Network Situational Awareness Group

Detecting Spam and Spam Responses

  • Upload
    seth

  • View
    51

  • Download
    3

Embed Size (px)

DESCRIPTION

Detecting Spam and Spam Responses. FloCon 2009 Timothy J. Shimeall, Ph.D. CERT/Network Situational Awareness Group. Overview. Why worry about email Spam Spam Responses. Why worry about E-mail?. Email: at the intersection of ‘business essential’ and ‘favored point of attack’ - PowerPoint PPT Presentation

Citation preview

Page 1: Detecting Spam and Spam Responses

© 2008 Carnegie Mellon University

Detecting Spam and Spam Responses

FloCon 2009Timothy J. Shimeall, Ph.D.

CERT/Network Situational Awareness Group

Page 2: Detecting Spam and Spam Responses

2

Overview

Why worry about email

Spam

Spam Responses

Page 3: Detecting Spam and Spam Responses

3

Why worry about E-mail?

Email: at the intersection of ‘business essential’ and ‘favored point of attack’

Well established usage

Well established weaknesses

Track record of exploitation

Page 4: Detecting Spam and Spam Responses

4

Spam and Spam Sources

Spam is unwanted, fraudulent, or malicious electronic mail

Spam source is a host that sends spam

At the protocol level, spam is correctly formed and sent

• Proper port• Proper messages• Complete TCP session

Page 5: Detecting Spam and Spam Responses

5

Spam Blockers Are Not Enough

Want analytical tool

Blocking message by message is costly and unreliable• Treats symptoms, not problem• Increasing fraction of traffic• Difficult to automatically recognize• Content recognition frequently evaded

Blocking fixed list of addresses is uncertain• Frequently evaded• Uncertain add/drop conditions• Uncertain period of coverage• Arms race

Page 6: Detecting Spam and Spam Responses

6

So How are Spam Sources Different?

Rate of sending email (well known)• Spam is effective only when acted on

(open attachment or link, reply, follow advice, etc.)• Only some users will act on spam• The more spam sent, the more will be effective

Locality of sending email• Diversify the target• Avoid collateral protection at target• Normal email tends to be localized (most of your desired email

comes from people who have sent desired email before)

Ratio of email to non-email flows• Non-email flows are overhead or waste for a spam source• Don’t care about target information – just trying to pump out spam

Page 7: Detecting Spam and Spam Responses

7

Rate of Sending Email

Max SpamHost fan per IP

0

50

100

150

200

250

300

0:0

0

2:0

0

4:0

0

6:0

0

8:0

0

10

:00

12

:00

14

:00

16

:00

18

:00

20

:00

22

:00

Time

Ad

dre

ss

Co

un

t

max(out)

max(in)

max(dark)

Max NonSpam Host Fan per IP

0

2

4

6

8

10

12

14

0:00

2:20

4:40

7:00

9:20

11:4

0

14:0

0

16:2

0

18:4

0

21:0

0

23:2

0

Time

Ad

dre

ss C

ou

nt

max(out)

max(in)

max(dark)

• Spam hosts send at a maximum rate approximately 20 times the maximum rate of non-spam hosts

• Spam hosts send at a mean rate approximately double the maximum rate of non-spam hosts

Page 8: Detecting Spam and Spam Responses

8

Locality of Addresses Sending EmailAddress Arrival in nospam hosts

0

2

4

6

8

10

12

14

16

time

log2

(Add

ress

cou

nt)

log-new-s

log-new-d

Address Arrival in Spam Hosts

0

1

2

3

4

5

6

7

8

time

log2

(Add

ress

Cou

nt)

log-new-s

log-new-d

• Much greater address dynamics for spam hosts than for non-spam hosts

Page 9: Detecting Spam and Spam Responses

9

Ratio of Mail to Non-Mail

Mean Spam Host Flow Category per IP

0

5

10

15

20

25

30

0:00

2:30

5:00

7:30

10:0

0

12:3

0

15:0

0

17:3

0

20:0

0

22:3

0

Time

Flo

w C

ou

nt

mean(smtp)

mean(total)

Mean NonSpam Host Flow Category per IP

0

1

2

3

4

5

6

0:00

2:20

4:40

7:00

9:20

11:4

0

14:0

0

16:2

0

18:4

0

21:0

0

23:2

0

Time

Flo

w C

ou

nt

mean(smtp)

mean(total)

• Much less non-email traffic in spam hosts than in non-spam hosts, particularly during highest peaks

Page 10: Detecting Spam and Spam Responses

10

Outline of Spam Source Detector

Locality exploited to construct white list (Frequent or desired email contacts)

Gather list of sources of completed email contacts in 5 minute period, ignoring white list

• Count number of email contacts per source

• Count number of non-email contacts per source

Drop all sources with less than 15 contacts, and all that had at least 10% non-email contacts

• 15 and 10% derived empirically to maximize true positives and minimize false positives

Page 11: Detecting Spam and Spam Responses

11

Implementation of Spam Detector

1. Use rwfilter to pull one hour of border-crossing flows, excluding white list

2. Use rwfilter to split into 5-minute flows, then pipe to rwfilter to split into email and non-email flows – each going to a bag

3. Use rwbagtool to limit email bag to at least 15 flows

4. Use rwbagcat on each bag and a python script to drop sources with at least 10% non-email traffic

5. Python script produces list of IP addresses

Runs fairly quickly – but not near-real-time speeds

Page 12: Detecting Spam and Spam Responses

12

Validation

Spam sources detected by commercial source

Spam sources detected by email to CERT

Empirical study • At least as good as commonly-applied spam lists

• Much better than random identification

• Varying the constants has some effect on true positive/false positive rate (although white list helps to limit false positives)

Page 13: Detecting Spam and Spam Responses

13

Spam Patterns

Useful for additional confidence

Sources tend to contact in bursts• Send• Wait (hours to days)• Send

Sources tend to be in odd locations• Not mission-relevant• DHCP pools• Multiple sources for similar flow characteristics

Page 14: Detecting Spam and Spam Responses

14

Using the spam detection script

spamdetector.csh start-date end-date [ip.set]

ip.set is for area-of-interest specification

Start-date and end-date are in rwfilter date format

Creates hourly spam source files in current directory• spam-yyyy-mm-dd-h.set• spam-yyyy-mm-dd-h.txt

Page 15: Detecting Spam and Spam Responses

15

Responses to Spam

Most spam doesn’t have valid return addresses• Click in response

• Stock pump-n-dump scams

• Malicious attachments

… But a small fraction does• Advance-fee frauds

• Dating frauds

• Some identity theft schemes

• Marketing email

Flow-based spam detection lets us find responses

Page 16: Detecting Spam and Spam Responses

16

Process of Detecting Spam Responses

Border data (assume spammers outside organization)

Want to avoid programmed responses• Response Flows >1hr after spam

Want to only consider complete email connections with content

• Intial: SYN (not SYN-ACK)• All: SYN-ACK-FIN (not RST)• 4+ Packets• 200+ Bytes/Packet average

Want to consider responses only from addresses close to spam destination

• The Tricky Bit!• A little scripting and PySiLK

S1

S2

Sn

V11

V12

V13

V1k

R1

R2…

Page 17: Detecting Spam and Spam Responses

17

The Matching Scriptimport sysfrom silk import *blocksize=16try: blockfile=file('blocks.csv','r')except: print 'cannot open blocks.csv' sys.exit(1)blockdict=dict()for line in blockfile: fields=line[:-1].strip().split(',') if len(fields)<2: continue try: idx = IPAddr(fields[0].strip()) if idx in blockdict: blockdict[idx].append(IPWildcard(fields[1].strip()+'/'+str(blocksize))) else:

blockdict[idx]=list([IPWildcard(fields[1].strip()+'/'+str(blocksize))]) except: continueblockfile.close()def rwfilter(rec): global blockdict if (rec.dip in blockdict): for pattern in blockdict[rec.dip]: if rec.sip in pattern: return True return False

Load a dictionary of dip & CIDR/16 pairs to match

Filter based on dictionary

Page 18: Detecting Spam and Spam Responses

18

Running the response-detection script

Need to run spam detector first in same directory./findresp.py start-date end-date [ipset] > transcript.txt

Start-date and end-date are silk dates

Ipset is for AOR

Produces listing of commands run to generate results

Result files:

resp-YYYY-M-D.raw

spam-YYYY-M-D.raw

Can then use silk tools to manipulate flows

Page 19: Detecting Spam and Spam Responses

19

Some experience

Large network, 2 days of flows

Spam detected from 643,671 IP addresses

(33M spam flows) requiring approximately 1.5 hours of clock time

Response detected to 1,782 IP addresses (0.28%)

(73,316 response flows) requiring approximately 4 hours of clock time

Page 20: Detecting Spam and Spam Responses

20

Example of Spam Responses

Page 21: Detecting Spam and Spam Responses

21

Summary

Email is a very valuable communication tool• Both for us and for the enemy

Flow-based analysis can give us insight on a variety of email-based behaviors, both benign and malicious