
Measurement of paedophile activity in eDonkey


Measurement of paedophile activity in eDonkey

Guillaume Valadon - http://valadon.complexnetworks.fr

http://antipaedo.lip6.fr

LIP6 (CNRS - UPMC)

Complex Networks team

http://complexnetworks.fr

The team

• http://complexnetworks.fr : plots & videos

– 4 permanent members: Jean-Loup Guillaume, Matthieu Latapy, Bénédicte Le Grand, Clémence Magnien

– 2 postdocs, 9 Ph.D. students

• Focus & interests:

– Internet topology, P2P networks, social networks

– measurements

– analysis

What is peer-to-peer?

• exchanges do not rely on a server

– content is inside peers or created by peers

• peers are equal

– clients and servers at the same time

• peer removal not problematic

• Usages:

– file sharing: eDonkey, BitTorrent

– telephony: Skype

– video streaming: Joost, PPLive

What is eDonkey?

• servers are catalogs of files

• peers:

– inform the servers

– search files on servers

– download files from peers

[Diagram: three eDonkey servers, Peer 1 and Peer 2, with the messages “I have file A”, “I am looking for file A” and “I want to download file A”]

Context

• study exchanges in eDonkey

– files diffusion

– communities of interests

– popularity

• some motivations

– understand users’ behaviour

– blind content detection

– detect paedophile activities

Outline

1. eDonkey measurements

   1. server side

   2. honeypot

   3. client side

2. results & statistics

3. community detection

eDonkey exchanges: what can be observed?

1. inter-server communications

statistical data about servers usages & peers

2. peer-server communications

index of files, file search, source search

3. inter-peer communications

file downloads, retrieve lists of files

Looking for a file

[Diagram: four-step exchange between a peer (client) and a server]

1. Keywords search (peer to server): “music mp3”

2. Files list (server to peer): name: ACDC.mp3, ID: 101050; name: rock.mp3, ID: B16V100

3. Sources search (peer to server): ID: B16V100

4. Sources list (server to peer): 192.168.0.2, 1.2.3.4
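The four steps of this exchange (keywords search, files list, sources search, sources list) can be sketched with a toy in-memory catalog. This is only an illustration: the class `ToyServer` and its method names are hypothetical, the matching rule (substring match on any query word) is an assumption, and nothing here reproduces the real eDonkey wire protocol.

```python
from collections import defaultdict


class ToyServer:
    """In-memory stand-in for a server catalog (hypothetical, illustrative only)."""

    def __init__(self):
        self.names = {}                   # file ID -> one filename (simplification)
        self.sources = defaultdict(set)   # file ID -> IPs of peers announcing it

    def announce(self, ip, file_id, name):
        # A peer informs the server that it holds a file.
        self.names[file_id] = name
        self.sources[file_id].add(ip)

    def search_keywords(self, query):
        # Steps 1-2: keywords in, list of (ID, name) out.
        # Assumed rule: a file matches if any query word occurs in its name.
        words = query.lower().split()
        return [(fid, name) for fid, name in self.names.items()
                if any(w in name.lower() for w in words)]

    def search_sources(self, file_id):
        # Steps 3-4: file ID in, list of peer addresses out.
        return sorted(self.sources.get(file_id, ()))


server = ToyServer()
server.announce("192.168.0.2", "101050", "ACDC.mp3")
server.announce("192.168.0.2", "B16V100", "rock.mp3")
server.announce("1.2.3.4", "B16V100", "rock.mp3")

files = server.search_keywords("music mp3")
peers = server.search_sources("B16V100")
print(files)
print(peers)
```

The download itself then happens between peers, which is why the server only ever sees the index, the searches and the source requests.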

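Capturing traffic on a real server

[Diagram: capture pipeline. Labels, in the order they appear on the slide: capture client, eDonkey server, PCAP dump, PCAP flow, PCAP decoding, eDonkey traffic, UDP traffic, inspection, anonymisation and formatting, XML encoding, eDonkey exchanges]

Example of an encoded message: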

<opcode dir="received" TS="2786402.373146" IP="0045125351" type="high" port="02029">
  <OP_GLOBSEARCHREQ>
    <tags count="1"><anon-string>3108886</anon-string></tags>
  </OP_GLOBSEARCHREQ>
</opcode>
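As a minimal sketch of how such a record could be read back, the snippet below parses the message shown on the slide with Python's standard `xml.etree.ElementTree`. The function name `parse_record` and the choice of output fields are assumptions made for illustration; the full schema of the published data set is not given on the slide.

```python
import xml.etree.ElementTree as ET

# The record from the slide, copied verbatim.
RECORD = """<opcode dir="received" TS="2786402.373146" IP="0045125351"
type="high" port="02029"><OP_GLOBSEARCHREQ>
<tags count="1"><anon-string>3108886</anon-string></tags>
</OP_GLOBSEARCHREQ></opcode>"""


def parse_record(xml_text):
    """Turn one <opcode> element into a plain dictionary (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    message = root[0]  # first child element names the eDonkey message type
    return {
        "direction": root.get("dir"),
        "timestamp": float(root.get("TS")),
        "ip": root.get("IP"),  # anonymised identifier, kept as a string
        "port": int(root.get("port")),
        "message": message.tag,
        "strings": [e.text for e in message.iter("anon-string")],
    }


record = parse_record(RECORD)
print(record)
```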

(annotation on <anon-string>: words that appear less than 100 times)
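One way to picture the anonymisation of rare words is sketched below. The slide only states the threshold (words that appear less than 100 times) and shows an integer inside `<anon-string>`; how identifiers are actually assigned is not described, so the numbering scheme, the `<n>` output notation and the function name `anonymise_words` are assumptions.

```python
from collections import Counter


def anonymise_words(strings, threshold=100):
    """Replace words seen fewer than `threshold` times by an opaque integer token."""
    counts = Counter(word for s in strings for word in s.split())
    ids = {}      # rare word -> integer identifier (assumed scheme: order of appearance)
    result = []
    for s in strings:
        tokens = []
        for word in s.split():
            if counts[word] < threshold:
                if word not in ids:
                    ids[word] = len(ids) + 1
                tokens.append(f"<{ids[word]}>")
            else:
                tokens.append(word)
        result.append(" ".join(tokens))
    return result


# Tiny example with a threshold of 2 instead of 100.
sample = ["mp3 rock", "mp3 jazz", "mp3 rock"]
anonymised = anonymise_words(sample, threshold=2)
print(anonymised)
```

Frequent words stay readable, which keeps keyword statistics possible, while rare strings (more likely to identify a person) are hidden.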

Resulting data set in numbers [HotP2P’09]

• 10 weeks of measurements

• ~500 GB of compressed XML

• ~10 billion messages

• ~90 million peers

• ~280 million distinct files

• anonymized data available online at http://antipaedo.lip6.fr

Honeypot based measurements [HotP2P’09]

• eDonkey honeypot:

– customized eDonkey client

– announce files to a server (filename, ID, size)

– log queries made by regular peers

• Manager:

– control distributed honeypots

– send commands to honeypots: server to connect to, files to exchange, ...

Downloading a file

[Diagram: honeypot behaviours when a peer downloads a file: send nothing; send random content; retrieve the list of files owned by the peer]

Methodology

• 24 PlanetLab nodes, running distributed honeypots:

– 12 sending no content

– 12 sending random content

• 1 greedy honeypot:

– learn files during the first day

– afterwards, announce these files

                    distributed    greedy
Honeypots           24             1
Duration in days    32             15
Shared files        4              3 175
Distinct peers      110 049        871 445
Distinct files      28 007         267 047

Parameters: distributed or greedy

[Two plots, one per measurement (Distributed, Greedy): total number of peers and number of new peers as a function of time in days]

• long measurements are relevant

• effects of blacklisting and file popularity

Parameters: no-content & random-content

[Two plots comparing random content and no content as a function of time in days. HELLO messages: number of peers. REQUEST-part messages: number of REQUEST-PART queries]

• advantage of sending random content

• global and local blacklisting

Parameters: number of honeypots

[Plot: number of peers as a function of the number of honeypots; curves: lower bound, average and upper bound of 100 samples]

• important benefit in using several honeypots
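The lower bound, average and upper bound over 100 samples could be obtained along the following lines. This is a sketch under assumptions: the slide does not say how the samples were drawn, so the code assumes that each sample is a random subset of n honeypots and that the quantity measured is the number of distinct peers seen by that subset. The function name `peers_seen` and the toy logs are hypothetical.

```python
import random


def peers_seen(honeypot_peers, n, samples=100, seed=0):
    """Min, mean and max number of distinct peers over random subsets of n honeypots."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(samples):
        chosen = rng.sample(honeypot_peers, n)       # n honeypots, without replacement
        sizes.append(len(set().union(*chosen)))      # distinct peers seen by the subset
    return min(sizes), sum(sizes) / len(sizes), max(sizes)


# Toy logs: one set of peer identifiers per honeypot.
logs = [{"p1", "p2"}, {"p2", "p3"}, {"p3", "p4"}]
for n in range(1, len(logs) + 1):
    print(n, peers_seen(logs, n))
```

Because honeypots partly see the same peers, the curve grows more slowly than n times the single-honeypot count, yet it keeps growing, which is the benefit the slide points out.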

Peer based measurements

[Diagram: a peer sends the query “Rock” to two servers]

Server 1 answers: name: ACDC.mp3, ID: 101050; name: rock.mp3, ID: B16V100

Server 2 answers: name: ACDC.mp3, ID: 101050; name: muse.mp3, ID: G28V07

Querying multiple servers helps to discover more files. More IP addresses can be seen in the same way.
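A small sketch of why querying several servers helps: merging the two answers from the slide by file ID yields three distinct files, more than either server returns alone. The function name `merge_answers` and the tuple layout are assumptions for illustration.

```python
def merge_answers(*answers):
    """Merge (file ID, filename) lists from several servers, keyed by file ID."""
    files = {}
    for answer in answers:
        for file_id, name in answer:
            files.setdefault(file_id, set()).add(name)
    return files


# Answers to the query "Rock", as shown on the slide.
server1 = [("101050", "ACDC.mp3"), ("B16V100", "rock.mp3")]
server2 = [("101050", "ACDC.mp3"), ("G28V07", "muse.mp3")]

merged = merge_answers(server1, server2)
print(len(merged), sorted(merged))
```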

Methodology & data

• a modified client connects to multiple servers

– queries servers with 15 keywords (8 are paedophile)

– retrieves all filenames and IP addresses

– restarts every 12 hours

• Resulting data set

– 140 days of measurements (October 2008 to February 2009)

– ~3 million peers

– ~3 million distinct files

– ~1.5 million different filenames

Outline

1. eDonkey measurements

   1. server side

   2. honeypot

   3. client side

2. results & statistics

3. community detection

Goal & limitations

• rigorous evaluation of several elements

– peers, queries, filenames, ...

• difficulties

– IP equals user?

– one file, several names

– paedophile query? paedophile user/IP?

– no access to files’ content

Basic analysis: file sizes

[Plot: number of files versus size (in MB); annotated sizes: small files, 175 MB, 230 MB, 350 MB, 700 MB, 1 GB, 1.4 GB]

• obtained from the server answers

• CD-ROM size and fractions (1/2, 1/3, and 1/4)

• related to classical sizes of storage support
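The link between the annotated sizes and CD-ROM fractions can be checked with a few lines of arithmetic. The 700 MB capacity is an assumption, taken from the peak of the same value on the plot.

```python
CD_MB = 700  # assumed CD-ROM capacity, matching the 700 MB peak on the plot

# Whole CD, then 1/2, 1/3 and 1/4 of it, rounded to the nearest MB.
fractions = {k: round(CD_MB / k) for k in (1, 2, 3, 4)}
print(fractions)

# Two full CDs, in GB.
double_cd_gb = 2 * CD_MB / 1000
print(double_cd_gb)
```

The values 350, 233 and 175 MB line up with the 350 MB, 230 MB and 175 MB annotations, and two CDs give the 1.4 GB one.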

Basic analysis: time between queries

[Plot: number of peers as a function of the time (in seconds) between queries]

• regularities of queries

Basic analysis: top 10 words

Filenames

Rank   Keyword   # occurrences
1      mp3       12 121 052
2      avi       2 860 225
3      the       2 657 349
4      rar       1 610 669
5      de        1 607 634
6      jpg       1 296 610
7      la        1 236 001
8      of        1 082 521
9      a         1 039 469
10     mpg       993 077

Queries

Rank   Keyword   # occurrences
1      the       4 147 197
2      de        3 382 473
3      la        2 337 404
4      a         1 761 179
5      of        1 751 848
6      2         1 398 154
7      i         1 153 601
8      ita       1 101 964
9      2006      1 075 982
10     el        1 025 315

Basic analysis: names per file

• different filenames but same content

- translation, commas/spaces, fake files, ...

• up to 82 different filenames for one file

• ~16 million files with 1 filename

• ~3 million files with 2 or more filenames

• problematic for content detection and fake detection systems

name:kung-fu-panda.avi ID:1234

name:kungfupanda-FR.avi ID:1234

name:paedophile-keyword.jpg ID:1234
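Counting names per file amounts to grouping filenames by file ID. The sketch below uses the three placeholder names from the slide plus one invented record (ID `5678`, labelled as such in the code); the function name `names_per_file` is hypothetical.

```python
from collections import defaultdict


def names_per_file(records):
    """Group the distinct filenames observed for each file ID."""
    names = defaultdict(set)
    for file_id, filename in records:
        names[file_id].add(filename)
    return names


records = [
    ("1234", "kung-fu-panda.avi"),
    ("1234", "kungfupanda-FR.avi"),
    ("1234", "paedophile-keyword.jpg"),   # placeholder name, as on the slide
    ("5678", "rock.mp3"),                 # invented record, for contrast
]

names = names_per_file(records)
several = [fid for fid, group in names.items() if len(group) >= 2]
print({fid: len(group) for fid, group in names.items()}, several)
```

Since the ID depends on the content and not on the name, a system that rates files by filename alone sees three unrelated items here where there is only one.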

Paedophile content: age detection

• strings containing ages; example: XYyo

• queries target younger ages than filenames

• more demand than supply

[Plot (server measurement): percentage of occurrences per age, for queries and for filenames]

Paedophile content: age detection

• similarity in server and client measures

• spikes at 9, 12 and 18 years old

Laboratoire d’Informatique de Paris 6

University Pierre & Marie Curie

104, avenue du President Kennedy 75016 Paris

Measurement of paedophile activity in eDonkey using a client sending queries

Firas Bessadok, Karim Bessaoud, Clemence Magnien and Matthieu Latapy

{firstname.lastname}@lip6.fr

http://www.complexnetworks.fr

• objective: quantify paedophile activity in the eDonkey network

• method: a client periodically sends queries to eDonkey servers and records the results (the laptop with red flag in the figure)

• period: every 12 hours

• in each session, the client sends 15 different queries (8 well-known paedophile keywords and 7 non-paedophile ones)

Files collected

• number of files: 2 784 583

Paedophile files collected

• number of paedophile files: 701 857

Ages in filenames

• 77 030 filenames containing ages

• all ages below 21 are represented

• important focus between 8 and 15 years old

• two peaks at 9 and 12 years old

Ongoing and future work

• multiple distributed clients instead of one

• estimation of the real value of several parameters:

– number of paedophile files

– number of copies of a particular paedophile file

• use of multiple-recapture model

Advances in the Analysis of Online Paedophile Activity – June 2009, Paris, France

[Plot (client measurement): percentage of occurrences in filenames per age]

Paedophile keywords in queries

• 94% of queries contain only one keyword

• keyword 21 is used in 60% of the queries

– is it still a paedophile keyword?

Figure 2: Distribution of the number of paedophile keywords in queries, i.e. for each number of paedophile keywords from our set contained in queries, the number of queries (vertical axis), with logarithmic scale.

Number of keywords in queries (Fig. 2)

The distribution shows that, among queries containing at least one paedophile keyword, most contain a single keyword. These results raise the question of the correct definition of a paedophile query. Our first model (PQ) might be improved to eliminate queries with more than x paedophile keywords.

The table below shows the number of paedophile queries for two definitions of a paedophile query. Row A gives the number of queries containing exactly x paedophile keywords; row B gives the number of queries containing at least x paedophile keywords.

x    1         2       3       4     5     6     7    8    9
A    110614    3166    1105    380   78    100   26   13   1
B    115483    4869    1703    598   218   140   40   14   1
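The two rows are tied together: "at least x keywords" is the sum of "exactly y keywords" over all y from x to 9, so row B is the reverse cumulative sum of row A. The short check below recomputes row B from row A with the numbers of the table; the variable names are only illustrative.

```python
from itertools import accumulate

# Row A: number of queries containing exactly x paedophile keywords, x = 1..9.
exactly = [110614, 3166, 1105, 380, 78, 100, 26, 13, 1]

# Row B: number of queries containing at least x keywords.
# Accumulate from the right, then restore the original order.
at_least = list(accumulate(reversed(exactly)))[::-1]
print(at_least)
```

The result matches row B of the table, which confirms how the flattened table should be read.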

Paedophile IPs / normal IPs ratio (Fig. 3)

If we now consider NAT or dynamic IP allocation, an IP may remain paedophile until the end of the ten weeks, even if it is not the same user anymore. This kind of contamination may eventually lead to considering every IP as paedophile. The growing ratio (see Fig. 3) shows that, even by the end of the experiment, we still discover many new paedophile IPs in queries.
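The contamination effect can be made concrete with a sketch of the "once flagged, always flagged" rule. Several points are assumptions: a query is flagged if it contains at least one word of the keyword set, a normal IP is one seen but never flagged so far, and the ratio is computed after every query. The function name, the record layout and the placeholder keyword `kw1` are hypothetical.

```python
def paedophile_ip_ratio(queries, keywords):
    """Ratio of flagged IPs over normal IPs after each query, in time order."""
    seen, flagged = set(), set()
    ratios = []
    for timestamp, ip, text in sorted(queries):
        seen.add(ip)
        if any(word in keywords for word in text.lower().split()):
            flagged.add(ip)          # never removed: the IP stays flagged
        normal = len(seen) - len(flagged)
        ratio = len(flagged) / normal if normal else float("inf")
        ratios.append((timestamp, ratio))
    return ratios


# Toy trace: (timestamp, IP, query text); "kw1" stands for a keyword of the set.
trace = [
    (0, "a", "music mp3"),
    (1, "b", "kw1 video"),
    (2, "b", "music"),       # same IP, possibly another user behind a NAT
    (3, "c", "film"),
]
ratios = paedophile_ip_ratio(trace, {"kw1"})
print(ratios)
```

At time 2 the IP "b" sends an ordinary query but remains counted as flagged, which is exactly the bias discussed in the paragraph above.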

Total number of queries by keyword (Fig. 4)

One may notice several groups of keywords, sorted by frequency of appearance. But this plot also shows that a specific keyword (the rightmost one) is used in slightly less than 60% of all the paedophile queries. The gap between this paedophile word and the others suggests that it may be a not-so-specific keyword. Further investigation, such as combination of this keyword with others, must be carried out to decide whether all queries containing this keyword should be considered paedophile.

Figure 3: Time-evolution of the ratio of paedophile IPs over normal IPs. [Plot: paedophile IPs / normal IPs ratio as a function of elapsed time (seconds)]

Figure 4: Number of queries by paedophile keyword. [Plot: percentage of the paedophile queries as a function of the number of the keyword in the set (1 to 21)]

4. FUTURE WORK

Most of the future work will consist in studying combinations of keywords, so as to corroborate our hypothesis on the model (specific paedophile keywords). Further work on ages will also be carried out. We will progressively refine our definition of paedophile queries and IPs, before establishing accurate and reliable results.

5. REFERENCES

[1] F. Aidouni, M. Latapy, and C. Magnien, “Ten weeks in the life of an eDonkey server,” Proceedings of HotP2P’09, 2009.

[2] M. Latapy, C. Magnien, and G. Valadon, “First report on database specification and access including content rating and fake detection system,” http://antipedo.lip6.fr, 2008.


Blind content detection

http://crs.complexnetworks.fr/?id=HASH

porn1; porn2; pedo1; pedo2; fake1; fake2; fake3; fake4

Perspectives and ongoing work

• definition of paedophile IPs and queries

– one keyword? several?

– once counted as paedophile, always paedophile? (NAT)

• what is the number of paedophile files and users?

– “9000 paedophiles on the Internet” http://tinyurl.com/c78vlu

“including 1000 in Germany”

Outline

1. eDonkey measurements

   1. server side

   2. honeypot

   3. client side

2. results & statistics

3. community detection

Goals

• analysis based on the structure

• represent data as a graph (nodes & edges)

– words in queries? filenames?

– file-ID? client-ID?

• Motivations

– understand the structure

– detect communities of interest

– graph visualization

– improve our knowledge of paedophile keywords

What is a community?

• A community is a set of nodes

– nodes share something,

– high level of connection,

– more links inside than outside.
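The criterion "more links inside than outside" can be checked directly by counting, for a candidate set of nodes, the links with both ends inside the set and the links that cross its boundary. This is a simple sketch of that single criterion, not of any particular detection algorithm; the function name `link_counts` and the toy graph are hypothetical.

```python
def link_counts(edges, community):
    """Return (links inside the community, links crossing its boundary)."""
    inside = sum(1 for u, v in edges if u in community and v in community)
    crossing = sum(1 for u, v in edges if (u in community) != (v in community))
    return inside, crossing


# Toy graph: a triangle a-b-c attached to a short path c-d-e.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]

inside, crossing = link_counts(edges, {"a", "b", "c"})
print(inside, crossing)
```

Here the triangle has three internal links and a single link to the rest of the graph, so it satisfies the criterion.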

Community detection: hierarchical clustering

Slide courtesy of Pascal Pons

Community detection: divisive approach

Adapted from slide courtesy of Pascal Pons

Words in filenames: how to build a graph?

les betteraves tueur de coiffeur mp3

Dionysos coiffeur d oiseaux mp3

Resulting graph

- 1 693 791 nodes

- 73 422 135 links
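One plausible construction, given here only as an assumption since the slide shows the two example filenames but not the rule itself: each distinct word is a node, and two words are linked when they appear together in at least one filename. The function name `word_graph` is hypothetical.

```python
from itertools import combinations


def word_graph(filenames):
    """Build an undirected word co-occurrence graph from a list of filenames."""
    nodes, links = set(), set()
    for name in filenames:
        words = set(name.lower().split())
        nodes |= words
        # One link per unordered pair of words sharing this filename.
        links |= {tuple(sorted(pair)) for pair in combinations(words, 2)}
    return nodes, links


# The two filenames from the slide.
filenames = [
    "les betteraves tueur de coiffeur mp3",
    "Dionysos coiffeur d oiseaux mp3",
]
nodes, links = word_graph(filenames)
print(len(nodes), len(links))
```

The two filenames share the words "coiffeur" and "mp3", so the graph has 9 nodes and 24 links instead of 11 and 25, and the shared words connect the two groups. Under this rule a filename of k words adds up to k(k-1)/2 links, which is consistent with a graph that has far more links than nodes.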

[Figure: circle of friends on boards.ie, © boards.ie]

Louvain Method

http://findcommunities.googlepages.com

Words in filenames

After 5 iterations

• community of 189 words

Community containing 12yo

               Queries               Filenames
graph          3 380 213             1 693 791
iteration 1    2 469 134/107 688     828 913/97 111
iteration 2    785 606/23 680        314 665/30 367
iteration 3    232 620/3 954         13 540/1259
iteration 4    51 026/836            1310/94
iteration 5    1614/70               189/11

Legend: black = words/nodes in the community; red = well-known (> 100 times in the data set)

Tracking well-known keywords

• keyword 1:

– iteration 4: 1614/70 (6 keywords + 14 ages)

– iteration 5: 411/9 (5 keywords)

  • 2 previously unknown keywords

• keyword 2:

– iteration 4: 124/13 (1 keyword)

  • 1 keyword looks like a well-known one

– iteration 5: 20/3 (1 keyword)

  • same look-alike keyword

Conclusion

• Take away messages

1. several techniques to measure eDonkey

- server, client, honeypot

2. anonymized data set publicly available

3. paedophile content can be identified

• Perspectives

– define a paedophile IP and query

– evaluate the correct number of paedophile files & users

– evolution over time

– measure other P2P networks

– impact of new (French) laws: hadopi, lopsi

Questions?

Community detection: Louvain method