Network-Level Spam and Scam Defenses Nick Feamster Georgia Tech with Anirudh Ramachandran, Shuang Hao, Maria Konte Alex Gray, Jaeyeon Jung, Santosh Vempala


Page 1: Network-Level Spam and Scam Defenses

Network-Level Spam and Scam Defenses

Nick Feamster, Georgia Tech

with Anirudh Ramachandran, Shuang Hao, Maria Konte, Alex Gray, Jaeyeon Jung, Santosh Vempala

Page 2: Network-Level Spam and Scam Defenses

2

Spam: More than Just a Nuisance

• 95% of all email traffic
  – Image and PDF spam (PDF spam ~12%)

• As of August 2007, one in every 87 emails constituted a phishing attack

• Targeted attacks on the rise
  – 20k-30k unique phishing attacks per month

Source: CNET (January 2008), APWG

Page 3: Network-Level Spam and Scam Defenses

3

Approach: Filter

• Prevent unwanted traffic from reaching a user’s inbox by distinguishing spam from ham

• Question: What features best differentiate spam from legitimate mail?
  – Content-based filtering: What is in the mail?
  – IP address of sender: Who is the sender?
  – Behavioral features: How is the mail sent?

Page 4: Network-Level Spam and Scam Defenses

Content Filters: Chasing a Moving Target

[Figure: examples of spam delivered as PDFs, Excel sheets, images ... and even mp3s!]

Page 5: Network-Level Spam and Scam Defenses

5

Problems with Content Filtering

• Customized emails are easy to generate: Content-based filters need fuzzy hashes over content, etc.

• Low cost of evasion: Spammers can easily adjust and change features of an email’s content

• High cost to filter maintainers: Filters must be continually updated as content-changing techniques become more sophisticated

Page 6: Network-Level Spam and Scam Defenses

6

Another Approach: IP Addresses

• Problem: IP addresses are ephemeral

• Every day, 10% of senders are from previously unseen IP addresses

• Possible causes
  – Dynamic addressing
  – New infections
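
The "previously unseen" fraction can be measured from a per-day log of sender IPs. The following is a minimal sketch, assuming the log is a mapping from day to a list of sender IP strings; the function name, the data layout, and the sample addresses are illustrative and do not come from the study.

```python
def unseen_fraction_by_day(senders_by_day):
    """Return {day: fraction of that day's distinct senders never seen on an earlier day}.

    senders_by_day: dict mapping a sortable day key to an iterable of IP strings.
    """
    seen = set()
    result = {}
    for day in sorted(senders_by_day):
        todays = set(senders_by_day[day])
        if todays:
            result[day] = len(todays - seen) / len(todays)
        seen |= todays
    return result


# Made-up log for illustration only.
log = {
    "day-1": ["192.0.2.1", "192.0.2.2"],
    "day-2": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
}
fractions = unseen_fraction_by_day(log)
```

Every sender is new on the first day by construction, so the early part of any such series should be discarded before reading off a steady-state rate.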

Page 7: Network-Level Spam and Scam Defenses

7

Our Idea: Network-Based Filtering

• Filter email based on how it is sent, in addition to simply what is sent.

• Network-level properties are less malleable
  – Network/geographic location of sender and receiver
  – Set of target recipients
  – Hosting or upstream ISP (AS number)
  – Membership in a botnet (spammer, hosting infrastructure)

Page 8: Network-Level Spam and Scam Defenses

8

Why Network-Level Features?

• Lightweight: Don’t require inspecting details of packet streams
  – Can be done at high speeds
  – Can be done in the middle of the network

• Robust: Perhaps more difficult to change some network-level features than message contents

Page 9: Network-Level Spam and Scam Defenses

9

Challenges

• Understanding network-level behavior
  – What network-level behaviors do spammers have?
  – How well do existing techniques work?

• Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Two algorithms: SNARE and SpamTracker

• Building the system
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

• Applications to phishing and scams

Page 10: Network-Level Spam and Scam Defenses

10

Data: Spam and BGP

• Spam Traps: Domains that receive only spam
• BGP Monitors: Watch network-level reachability

[Figure: data collection setup; labels: Domain 1, Domain 2]

17-Month Study: August 2004 to December 2005

Page 11: Network-Level Spam and Scam Defenses

11

Finding: BGP “Spectrum Agility”

• Hijack IP address space using BGP
• Send spam
• Withdraw IP address

Time scale: ~10 minutes

A small club of persistent players appears to be using this technique.

Common short-lived prefixes and ASes

Prefix        AS
61.0.0.0/8    4678
66.0.0.0/8    21562
82.0.0.0/8    8717

Somewhere between 1-10% of all spam (some clearly intentional, others might be flapping)
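
The announce, spam, withdraw pattern suggests a simple check over a BGP update feed: flag prefixes whose announcement is withdrawn within a short window. The sketch below assumes a simplified event tuple and a 600-second threshold chosen to match the ~10 minute figure on this slide; the event format, function name, and sample events are hypothetical and are not the monitoring setup used in the study.

```python
def short_lived_announcements(events, max_lifetime_s=600):
    """Find prefixes announced and then withdrawn within max_lifetime_s seconds.

    events: iterable of (timestamp_s, kind, prefix, origin_as), where kind is
    "A" (announce) or "W" (withdraw); origin_as is ignored for withdrawals.
    Returns a list of (prefix, origin_as, lifetime_s).
    """
    active = {}  # prefix -> (announce_time, origin_as)
    short = []
    for ts, kind, prefix, origin_as in sorted(events, key=lambda e: e[0]):
        if kind == "A":
            # Keep the earliest announcement if the prefix is re-announced.
            active.setdefault(prefix, (ts, origin_as))
        elif kind == "W" and prefix in active:
            start, asn = active.pop(prefix)
            if ts - start <= max_lifetime_s:
                short.append((prefix, asn, ts - start))
    return short


# Made-up event stream for illustration only.
events = [
    (0, "A", "61.0.0.0/8", 4678),
    (100, "A", "192.0.2.0/24", 64500),
    (540, "W", "61.0.0.0/8", None),
    (90000, "W", "192.0.2.0/24", None),
]
flagged = short_lived_announcements(events)
```

A short lifetime alone does not prove intent, since ordinary route flapping produces the same signature; correlating flagged prefixes with spam arrival times would be the natural next step.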

Page 12: Network-Level Spam and Scam Defenses

12

Spectrum Agility: Big Prefixes?

• Flexibility: Client IPs can be scattered throughout dark space within a large /8
  – Same sender usually returns with different IP addresses

• Visibility: Route typically won’t be filtered (nice and short)

Page 13: Network-Level Spam and Scam Defenses

13

How Well do IP Blacklists Work?

• Completeness: The fraction of spamming IP addresses that are listed in the blacklist

• Responsiveness: The time for the blacklist to list the IP address after the first occurrence of spam
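
Both metrics can be computed directly from two timestamp maps. This is a minimal sketch, assuming each map goes from IP address to the time (in seconds) of the first spam seen and of the first blacklist listing; the function names and sample values are illustrative.

```python
def completeness(first_spam_time, first_listed_time):
    """Fraction of spamming IPs that appear in the blacklist at all."""
    spam_ips = set(first_spam_time)
    if not spam_ips:
        return 0.0
    return len(spam_ips & set(first_listed_time)) / len(spam_ips)


def responsiveness(first_spam_time, first_listed_time):
    """Delay from first spam to listing, per IP that was eventually listed.

    An IP already listed before its first spam gets a delay of 0.
    """
    return {
        ip: max(0, first_listed_time[ip] - t)
        for ip, t in first_spam_time.items()
        if ip in first_listed_time
    }


# Made-up timestamps for illustration only.
first_spam = {"192.0.2.1": 100, "192.0.2.2": 200, "192.0.2.3": 300}
first_listed = {"192.0.2.1": 160, "192.0.2.2": 150}
frac_listed = completeness(first_spam, first_listed)
delays = responsiveness(first_spam, first_listed)
```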

Page 14: Network-Level Spam and Scam Defenses

14

Completeness and Responsiveness

• 10-35% of spam is unlisted at the time of receipt
• 8.5-20% of these IP addresses remain unlisted even after one month

Data: Trap data from March 2007, Spamhaus from March and April 2007

Page 15: Network-Level Spam and Scam Defenses

15

Problems with IP Blacklists

• IP addresses of senders have considerable churn

• Based on ephemeral identifier (IP address)
  – More than 10% of all spam comes from IP addresses not seen within the past two months
    • Dynamic renumbering of IP addresses
    • Stealing of IP addresses and IP address space
    • Compromised machines

• Often require a human to notice/validate the behavior
  – Spamming is compartmentalized by domain and not analyzed across domains

Page 16: Network-Level Spam and Scam Defenses

16

Outline

• Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?

• Classifiers using network-level features
  – Key challenge: Which features to use?
  – Two algorithms: SNARE and SpamTracker

• System: SpamSpotter
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

• Application to phishing and scams

Page 17: Network-Level Spam and Scam Defenses

17

Finding the Right Features

• Goal: Sender reputation from a single packet?
  – Low overhead
  – Fast classification
  – In-network
  – Perhaps more evasion resistant

• Key challenge
  – What features satisfy these properties and can distinguish spammers from legitimate senders?

Page 18: Network-Level Spam and Scam Defenses

18

Set of Network-Level Features

• Single-Packet
  – Geodesic distance
  – Distance to k nearest senders
  – Time of day
  – AS of sender’s IP
  – Status of email service ports

• Single-Message
  – Number of recipients
  – Length of message

• Aggregate (Multiple Message/Recipient)

Page 19: Network-Level Spam and Scam Defenses

19

Sender-Receiver Geodesic Distance

90% of legitimate messages travel 2,200 miles or less
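
Geodesic distance is the great-circle distance between sender and receiver. The sketch below uses the standard haversine formula and assumes latitude and longitude for both endpoints are already available (for example from an IP geolocation database, which is not shown); the function name and the Earth radius constant are choices made here for illustration.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def geodesic_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# A quarter of the way around the equator.
quarter = geodesic_miles(0.0, 0.0, 0.0, 90.0)
```

A classifier could use the raw distance as a feature or compare it against a cutoff such as the 2,200-mile figure above.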

Page 20: Network-Level Spam and Scam Defenses

20

Density of Senders in IP Space

For spammers, k nearest senders are much closer in IP space
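
One way to express sender density is the average numeric distance from a sender's IP address to its k nearest neighbors among other recently seen senders. This is a minimal sketch under that assumption; the function name, the default k, and the treatment of IPv4 addresses as 32-bit integers are illustrative choices rather than the exact feature definition used by SNARE.

```python
import ipaddress


def avg_distance_to_k_nearest(sender_ip, other_ips, k=20):
    """Average numeric distance in IPv4 space from sender_ip to its k nearest other senders."""
    target = int(ipaddress.IPv4Address(sender_ip))
    distances = sorted(
        abs(int(ipaddress.IPv4Address(ip)) - target)
        for ip in other_ips
        if ip != sender_ip
    )
    nearest = distances[:k]
    # No neighbors at all: treat the sender as infinitely isolated.
    return sum(nearest) / len(nearest) if nearest else float("inf")


# Made-up addresses from documentation ranges.
density = avg_distance_to_k_nearest(
    "192.0.2.10", ["192.0.2.11", "192.0.2.14", "198.51.100.1"], k=2
)
```

Small values mean the sender sits in a crowded neighborhood of other senders, which is the pattern the slide associates with spammers.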

Page 21: Network-Level Spam and Scam Defenses

21

Local Time of Day at Sender

Spammers “peak” at different local times of day
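
The time-of-day feature depends on the sender's local clock, not the receiver's. A minimal sketch, assuming the receive time is a timezone-aware UTC datetime and that a UTC offset for the sender's location is known from some external source (both the function name and the offset lookup are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def sender_local_hour(received_utc, utc_offset_hours):
    """Hour of day (0-23) at the sender, given an aware UTC datetime and the sender's UTC offset."""
    sender_tz = timezone(timedelta(hours=utc_offset_hours))
    return received_utc.astimezone(sender_tz).hour


received = datetime(2007, 3, 1, 23, 30, tzinfo=timezone.utc)
hour_east = sender_local_hour(received, 8)    # UTC+8
hour_west = sender_local_hour(received, -5)   # UTC-5
```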

Page 22: Network-Level Spam and Scam Defenses

22

Combining Features: RuleFit

• Put features into the RuleFit classifier
• 10-fold cross validation on one day of query logs from a large spam filtering appliance provider

• Comparable performance to SpamHaus
  – Incorporating into the system can further reduce FPs

• Using only network-level features
• Completely automated
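
For readers unfamiliar with the evaluation procedure, 10-fold cross validation splits the data into ten parts and holds each part out once for testing while training on the other nine. The sketch below only generates the index splits; it does not implement RuleFit, and the function name is illustrative.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
```

In practice the samples would be shuffled before splitting so that each fold contains a mix of spam and legitimate senders.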

Page 23: Network-Level Spam and Scam Defenses

23

Benefits of Whitelisting

Whitelisting top 50 ASes: False positives reduced to 0.14%

Page 24: Network-Level Spam and Scam Defenses

24

Outline

• Understanding the network-level behavior
  – What behaviors do spammers have?
  – How well do existing techniques work?

• Building classifiers using network-level features
  – Key challenge: Which features to use?
  – Algorithms: SpamTracker and SNARE

• System (SpamSpotter)
  – Dynamism: Behavior itself can change
  – Scale: Lots of email messages (and spam!) out there

Page 25: Network-Level Spam and Scam Defenses

25

Deployment: Real-Time Blacklist

• As mail arrives, lookups received at BL

• Queries provide proxy for sending behavior

• Train based on received data

• Return score

Approach

Page 26: Network-Level Spam and Scam Defenses

26

Design Choice: Augment DNSBL

• Expressive queries
  – SpamHaus: $ dig 55.102.90.62.zen.spamhaus.org
    • Ans: 127.0.0.3 (=> listed in exploits block list)
  – SpamSpotter: $ dig receiver_ip.receiver_domain.sender_ip.rbl.gtnoise.net
    • e.g., dig 120.1.2.3.gmail.com.-.1.1.207.130.rbl.gtnoise.net
    • Ans: 127.1.3.97 (SpamSpotter score = -3.97)

• Also a source of data
  – Unsupervised algorithms work with unlabeled data
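
The query names above follow the usual DNSBL convention of reversing the IPv4 octets and appending the list's zone. The sketch below builds such a name and decodes a SpamSpotter-style answer. The decoding rule is an assumption inferred from the single example on this slide (second octet 1 read as a negative sign, third and fourth octets as the integer part and hundredths); it is not a documented encoding, and both function names are hypothetical.

```python
import ipaddress


def dnsbl_query_name(ip, zone):
    """Reverse the IPv4 octets and append the DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def decode_score(answer):
    """ASSUMED encoding: 127.<sign>.<whole>.<hundredths>, with sign == 1 meaning negative."""
    _, sign, whole, hundredths = (int(part) for part in answer.split("."))
    score = whole + hundredths / 100
    return -score if sign == 1 else score


name = dnsbl_query_name("62.90.102.55", "zen.spamhaus.org")
score = decode_score("127.1.3.97")
```

Carrying the score in the answer address means existing mail servers can consume it through the same lookup path they already use for conventional blacklists.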

Page 27: Network-Level Spam and Scam Defenses

27

Challenges

• Scalability: How to collect and aggregate data, and form the signatures without imposing too much overhead?

• Dynamism: When to retrain the classifier, given that sender behavior changes?

• Reliability: How should the system be replicated to better defend against attack or failure?

• Evasion resistance: Can the system still detect spammers when they are actively trying to evade?

Page 28: Network-Level Spam and Scam Defenses

28

Latency

Performance overhead is negligible.

Page 29: Network-Level Spam and Scam Defenses

29

Sampling

Relatively small samples can achieve low false positive rates

Page 30: Network-Level Spam and Scam Defenses

30

Possible Improvements

• Accuracy
  – Synthesizing multiple classifiers
  – Incorporating user feedback
  – Learning algorithms with bounded false positives

• Performance
  – Caching/Sharing
  – Streaming

• Security
  – Learning in adversarial environments

Page 31: Network-Level Spam and Scam Defenses

31

Spam Filtering: Summary

• Spam increasing, spammers becoming agile
  – Content filters are falling behind
  – IP-based blacklists are evadable

• Up to 30% of spam not listed in common blacklists at receipt. ~20% remains unlisted after a month

• Complementary approach: behavioral blacklisting based on network-level features
  – Key idea: Blacklist based on how messages are sent
  – SNARE: Automated sender reputation
    • ~90% accuracy of existing blacklists with lightweight features
  – SpamSpotter: Putting it together in an RBL system
  – SpamTracker: Spectral clustering
    • Catches significant amounts of spam faster than existing blacklists

Page 32: Network-Level Spam and Scam Defenses

32

Phishing and Scams

• Scammers host Web sites on dynamic scam hosting infrastructure
  – Use DNS to redirect users to different sites when the location of the sites moves

• State of the art: Blacklist URL

• Our approach: Blacklist based on network-level fingerprints

Konte et al., “Dynamics of Online Scam Hosting Infrastructure”, PAM 2009

Page 33: Network-Level Spam and Scam Defenses

33

Online Scams

• Often advertised in spam messages
• URLs point to various point-of-sale sites
• These scams continue to be a menace
  – As of August 2007, one in every 87 emails constituted a phishing attack

• Scams often hosted on bullet-proof domains

• Problem: Study the dynamics of online scams, as seen at a large spam sinkhole

Page 34: Network-Level Spam and Scam Defenses

34

Online Scam Hosting is Dynamic

• A URL received in an email message may point to different sites over time

• Maintains agility as sites are shut down, blacklisted, etc.

• One mechanism for hosting sites: fast flux

Page 35: Network-Level Spam and Scam Defenses

35

Mechanism for Dynamics: “Fast Flux”

Source: HoneyNet Project

Page 36: Network-Level Spam and Scam Defenses

36

Summary of Findings

• What are the rates and extents of change?
  – Different from legitimate load balancing
  – Different across different scam campaigns

• How are dynamics implemented?
  – Many scam campaigns change DNS mappings at all three locations in the DNS hierarchy
    • A, NS, IP address of NS record

• Conclusion: Might be able to detect based on monitoring the dynamic behavior of URLs

Page 37: Network-Level Spam and Scam Defenses

37

Data Collection Method

• Three months of spamtrap data
  – 384 scam hosting domains
  – 21 unique scam campaigns

• Baseline comparison: Alexa “top 500” Web sites

Page 38: Network-Level Spam and Scam Defenses

38

Top 3 Spam Campaigns

• Some campaigns hosted by thousands of IPs
• Most scam domains exhibit some type of flux
• Sharing of IP addresses across different roles (authoritative NS and scam hosting)

Page 39: Network-Level Spam and Scam Defenses

39

Rates of Change

• How (and how quickly) do DNS-record mappings change?

• Rates of change are much faster than for legitimate load-balanced sites.
  – Scam domains change on shorter intervals than their TTL values.

• Domains for different scam campaigns exhibit different rates of change.

Page 40: Network-Level Spam and Scam Defenses

40

Rates of Change

• Domains that exhibit fast flux change more rapidly than legitimate domains

• Rates of change are inconsistent with actual TTL values

Rates of change are much faster than for legitimate load-balanced sites.

Page 41: Network-Level Spam and Scam Defenses

41

Time Between Record Changes

Fast-flux domains tend to change much more frequently than legitimately hosted sites
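
The time between record changes can be estimated by polling a domain repeatedly and recording when the returned record set differs from the previous poll. A minimal sketch, assuming observations are (timestamp in seconds, set of IPs) pairs already sorted by time; the function name and sample data are illustrative and the polling itself is not shown.

```python
def change_intervals(observations):
    """Seconds between consecutive changes in the record set observed for one domain.

    observations: list of (timestamp_s, iterable_of_ips), sorted by timestamp.
    """
    intervals = []
    last_change = None
    previous = None
    for ts, records in observations:
        records = frozenset(records)
        if previous is not None and records != previous:
            if last_change is not None:
                intervals.append(ts - last_change)
            last_change = ts
        previous = records
    return intervals


# Made-up polling results for illustration only.
polls = [
    (0, {"192.0.2.1"}),
    (300, {"192.0.2.1"}),
    (600, {"192.0.2.2"}),
    (900, {"192.0.2.2", "192.0.2.3"}),
    (1800, {"192.0.2.2", "192.0.2.3"}),
    (2100, {"192.0.2.1"}),
]
gaps = change_intervals(polls)
```

The resolution of the result is bounded by the polling period, so changes faster than the polling interval will be undercounted.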

Page 42: Network-Level Spam and Scam Defenses

42

Rates of Change by Campaign

Domains for different scam campaigns exhibit different rates of change.

Page 43: Network-Level Spam and Scam Defenses

43

Rates of Accumulation

• How quickly do scams accumulate new IP addresses?

• Rates of accumulation differ across campaigns
• Some scams only begin accumulating IP addresses after some time
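
An accumulation curve is the running count of distinct IP addresses seen for a domain or campaign. A minimal sketch, assuming the input is a collection of (timestamp, ip) observations; the function name and sample data are illustrative.

```python
def cumulative_distinct_ips(observations):
    """Return [(timestamp, distinct_ip_count)] with one point each time a new IP first appears."""
    seen = set()
    curve = []
    for ts, ip in sorted(observations):
        if ip not in seen:
            seen.add(ip)
            curve.append((ts, len(seen)))
    return curve


# Made-up observations for illustration only.
curve = cumulative_distinct_ips(
    [(1, "192.0.2.1"), (2, "192.0.2.1"), (3, "192.0.2.2"), (5, "192.0.2.3"), (5, "192.0.2.1")]
)
```

A curve that stays flat and then climbs would correspond to a scam that only begins accumulating addresses after some time.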

Page 44: Network-Level Spam and Scam Defenses

44

Rates of Accumulation

Page 45: Network-Level Spam and Scam Defenses

45

Location

• Where in IP address space do hosts for scam sites operate?

• Scam networks use a different portion of the IP address space than legitimate sites
  – 30/8 – 60/8: lots of legitimate sites, no scam sites

• Sites that host scam domains (both sites and authoritative DNS) are more widely distributed than those for legitimate sites

Page 46: Network-Level Spam and Scam Defenses

46

Location: Many Distinct Subnets

Scam sites appear in many more distinct networks than legitimate load-balanced sites.
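
Counting distinct networks for a set of hosting IPs is straightforward with Python's standard `ipaddress` module. The sketch assumes /24 as the subnet granularity, which is a choice made here for illustration; the slides do not state which prefix length was used.

```python
import ipaddress


def distinct_subnets(ips, prefix_len=24):
    """Set of distinct IPv4 networks of the given prefix length covering the input addresses."""
    return {ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False) for ip in ips}


# Made-up addresses from documentation ranges.
subnets = distinct_subnets(["192.0.2.1", "192.0.2.200", "198.51.100.7", "203.0.113.9"])
```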

Page 47: Network-Level Spam and Scam Defenses

47

Conclusion

• Scam campaigns rely on a dynamic hosting infrastructure
• Studying the dynamics of that infrastructure may help us develop better detection methods

• Dynamics
  – Rates of change differ from legitimate sites, and differ across campaigns
  – Dynamics implemented at all levels of DNS hierarchy

• Location
  – Scam sites distributed across distinct subnets

Data: http://www.gtnoise.net/scam/fast-flux.html
TR: http://www.cc.gatech.edu/research/reports/GT-CS-08-07.pdf

Page 48: Network-Level Spam and Scam Defenses

48

References

• Anirudh Ramachandran and Nick Feamster, “Understanding the Network-Level Behavior of Spammers”, ACM SIGCOMM, August 2006

• Anirudh Ramachandran, Nick Feamster, and Santosh Vempala, “Filtering Spam with Behavioral Blacklisting”, ACM CCS, November 2007

• Shuang Hao, Nick Feamster, Alex Gray and Sven Krasser, “SNARE: Spatio-temporal Network-level Automatic Reputation Engine”, USENIX Security, August 2009

• Maria Konte, Nick Feamster, Jaeyeon Jung, “Dynamics of Online Scam Hosting Infrastructure”, Passive and Active Measurement, April 2009