Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
!"#$%&$"'"#() *$"#+,"-()./')$-0
10(01
2+%3,* ) 4""$56,)7$+/ 8%$*+))$#
9(! 1,:%;:,+('&,+%($"/'/. :"(3<'"#+$%(+=$"#+#'+%$(>
!"##$%&'
?(! @:$))$('AA%,3<$(*.#<,=,),/+;:$(>
B(! 1%.-$"#'#+,"(=:(",:&$':(C),3(*'%;:$
D(! 4"","3$($#(=.A),+$*$"#(=:(",:&$':(C),3(*'%;:$
Measurement of paedophile activityin eDonkey
Guillaume Valadon - http://valadon.complexnetworks.fr
http://antipaedo.lip6.fr
LIP6 (CNRS - UPMC)
Complex Networks teamhttp://complexnetworks.fr
The team!http://complexnetworks.fr : plots & videos
– 4 permanent members : Jean-Loup Guillaume, Matthieu Latapy, Bénédicte Le Grand, Clémence Magnien
– 2 postdocs, 9 Ph.D. students
! Focus & interests:
– Internet topology, P2P networks, social networks
– measurements
– analysis2
What is peer-to-peer ?!exchanges do not rely on a server
– content is inside peers or created by peers
!peers are equal
– clients and servers at the same time
!peer removal not problematic
!Usages:
– file sharing: eDonkey, bittorrent
– telephony: skype
– video streaming: joost, PPlive
3
!servers are catalogs of files
!peers:
! inform the servers
!search files on servers
!download files from peers
What is eDonkey ?
4
eDonkeyServer
eDonkeyServer
eDonkeyServer
Peer 1
Peer 2
“I have file A”
“I am looking for file A”
“I want to download file A”
Context
!study exchanges in eDonkey
– files diffusion
– communities of interests
– popularity
!some motivations
– understand users’ behaviour
– blind content detection
– detect paedophile activities
5
Outline
6
1. eDonkey measurements
1.server side
2.honeypot
3.client side
2. results & statistics
3. community detection
eDonkey exchanges:what can be observed ?
1. inter-server communications
statistical data about servers usages & peers
2. peer-server communications
index of files, file search, source search
3. inter-peer communications
file downloads, retrieve lists of files
7
Looking for a file
8
Clie
nt
Serv
eur
Sources search
Keywords search
Sources list
Files list
Pe
er
Se
rve
r
1. “music mp3”
2. name: ACDC.mp3
ID:101050
name:rock.mp3
ID:B16V100
3. ID:B16V100
4. 192.168.0.2
1.2.3.4
1
2
3
4
Capturing traffic on a real server
9
Capture clienteDonkey server
PCAP dump
PCAP flow
PCAP decoding
eDonkey traffic
UDP traffic
inspection
Anonymisation andformatting
XML encoding
eDonkey exchanges
<opcode dir="received" TS="2786402.373146" IP="0045125351"
type="high" port="02029"><OP_GLOBSEARCHREQ>
<tags count="1"><anon-string>3108886</anon-string></tags>
</OP_GLOBSEARCHREQ></opcode>
words that appearless than 100 times
Resulting data set in numbers [HotP2P’09]
!10 weeks measurements
!~500 GB of compressed XML
!~ 10 billions messages
!~ 90 millions peers
!~ 280 millions of distinct files
! anonymized data available online at http://antipaedo.lip6.fr
10
Honeypot based measurements[HotP2P’09]
11
!eDonkey honeypot:
– customized eDonkey client
– announce files to a server (filename, ID, size)
– log queries made by regular peers
!Manager:
– control distributed honeypots
– send commands to honeypots: server to connect to, files to exchange, ...
Downloading a file
12
Send nothing
Send random content
Retrieve list of files
owned by the peer
Methodology
13
!24 PlanetLab nodes, running distributed honeypots:
– 12 sending no content
– 12 sending random content
! 1 greedy honeypot:
– learn files during the first day
– afterwards, announce these files
distributed greedy
Honeypots 24 1
Duration in days 32 15
Shared files 4 3 175
Distinct peers 110 049 871 445
Distinct files 28 007 267 047
Parameters : distributed or greedy
14
0
20000
40000
60000
80000
100000
120000
0 5 10 15 20 25 30 35 0
1000
2000
3000
4000
5000
6000
Tota
l num
ber
of peers
Num
ber
of new
peers
Days
total peernew peers
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
0 2 4 6 8 10 12 14 16 18 0
10000
20000
30000
40000
50000
60000
70000
80000
To
tal n
um
be
r o
f p
ee
rs
Nu
mb
er
of
ne
w p
ee
rs
Days
total peernew peers
! long measurements are relevant
!effects of blacklisting and file popularity
Distributed Greedy
Parameters : no-content & random-content
15
0
10000
20000
30000
40000
50000
60000
70000
80000
90000
0 5 10 15 20 25 30 35
Nu
mb
er
of
pe
ers
Days
random contentno content
0
200000
400000
600000
800000
1e+06
1.2e+06
1.4e+06
1.6e+06
1.8e+06
2e+06
0 5 10 15 20 25 30 35
Nu
mb
er
of
RE
QU
ES
T-P
AR
T q
ue
rie
s
Days
random contentno content
HELLO messages REQUEST-part messages
!advantage of sending random content
!global and local blacklisting
Parameters: number of honeypots
16
0
20000
40000
60000
80000
100000
120000
0 5 10 15 20 25
Num
ber
of peers
Number of honeypots
lower bound of 100 samplesavreage of 100 samples
upper bound of 100 samples
! important benefit in using several honeypots
Peer based measurements
17
Peer
Server 1“Rock”
name:ACDC.mp3 ID:101050name:rock.mp3 ID:B16V100
Querying multiple servers helps to discover more filesMore IP addresses can be seen in the same way
Server 2
“Rock”
name:ACDC.mp3 ID:101050name:muse.mp3 ID:G28V07
Methodology & data
!a modified client connects to multiple servers
– queries servers with 15 keywords (8 are paedophile)
– retrieves all filenames and IP addresses
– restarts every 12 hours
!Resulting data set
– 140 days measurements (October 2008 to February 2009)
– ~ 3 millions peers
– ~ 3 millions of distinct files
– ~1.5 millions different filenames
18
Outline
19
1. eDonkey measurements
1.server side
2.honeypot
3.client side
2. results & statistics
3. community detection
Goal & limitations
!rigorous evaluation of several elements
– peers, queries, filenames, ...
!difficulties
– IP equals user ?
– one file, several names
– paedophile query ? paedophile user/IP ?
– no access to files’ content
20
Basic analysis: file sizes
21
175 MB
230 MB
350 MB
700 MB
1 GB
1.4 GB
small files
0
5e+08
1e+09
1.5e+09
2e+09
2.5e+09
1 10 100 1000 10000
!obtained from the server answers
!CD-ROM size and fractions (1/2, 1/3, and 1/4)
! related to classical sizes of storage support
Size (in MB)
# o
f file
s
Basic analysis: time between queries
22
Time (in seconds)
# o
f pee
rs
!regularities of queries
Basic analysis: top 10 words
23
Rank Keyword # occurrences
1 mp3 12 121 0522 avi 2 860 2253 the 2 657 3494 rar 1 610 669
5 de 1 607 6346 jpg 1 296 6107 la 1 236 0018 of 1 082 5219 a 1 039 469
10 mpg 993 077
Rank Keyword # occurrences
1 the 4 147 1972 de 3 382 4733 la 2 337 4044 a 1 761 1795 of 1 751 8486 2 1 398 1547 i 1 153 6018 ita 1 101 9649 2006 1 075 982
10 el 1 025 315
filenames queries
24
Basic analysis: names per file!different filenames but same content
- translation, commas/spaces, fake files, ...
!up to 82 different filenames for one file
!~ 16 millions files with 1 filename
!~ 3 millions files with 2 or more filenames
! problematic for content detection and fake detection systems
name:kung-fu-panda.avi ID:1234
name:kungfupanda-FR.avi ID:1234
name:paedophile-keyword.jpg ID:1234
25
Paedophile content: age detection
!strings containing ages; example: XYyo
!queries targets younger ages than filenames
!more demand than supply
0
2
4
6
8
10
12
14
2 4 6 8 10 12 14 16 18 20
queriesfilenames
Server
Age
% o
f occ
urr
ence
26
Paedophile content: age detection
!similarity in server and client measures
!spikes at 9,12 and 18 years old
Laboratoire d’Informatique de Paris 6
University Pierre & Marie Curie
104, avenue du President Kennedy 75016 Paris
Measurement of paedophile
activity in eDonkeyusing a client sending queriesFiras Bessadok, Karim Bessaoud, Clemence Magnien and Matthieu Latapy
{prenom.nom}@lip6.frhttp://www.complexnetworks.fr
•objective: quantify paedophile activity in the eDonkey network
•method: a client periodically sends queries to eDonkey servers andrecords the results (the laptop with red flag in the figure)
•period: every 12 hours
• in each session, the client sends 15 different queries (8 well-knownpaedophile keywords and 7 non-paedophile ones)
Files collected
•number of files: 2 784 583
Paedophile files collected
•number of paedophiles files: 701 857
Ages in filenames
•77 030 filenames containing ages
• all ages below 21 are represented
• important focus between 8 and 15 years old
• two peaks at 9 and 12 years old
Ongoing and future work
•multiple distributed clients instead of one
• estimation of the real value of several parameters:
–number of paedophile files
–number of copies of a particular paedophile file
•use of multiple-recapture model
Advances in the Analysis of Online Paedophile Activity – June 2009, Paris, France
Client
Age% o
f occ
urr
ence
in fi
lenam
es
Paedophile keywords in queries
!94% of queries contain only one keyword
!keyword 21 is used in 60% of the queries
– is it still a paedophile keyword ?27
1
10
100
1000
10000
100000
1 2 3 4 5 6 7 8 9
Num
ber
of queries
Number of paedophile keywords
Figure 2: Distribution of the number of pae-dophile keywords in queries, i.e. for each number ofpaedophile keywords from our set contained in queries,the number of queries (vertical axis), with logarithmicscale.
Number of keywords in queries (Fig. 2)
The distribution shows that, among queries containingat least one paedophile keyword, most of them containsa single keyword. These results raise the question ofthe correct definition for a paedophile query. Our firstmodel (PQ) might be improved to eliminate querieswith more than x paedophile keywords.
The table below shows the number of paedophile que-ries, for two definitions of a paedophile query. In row A,the number of queries containing exactly x paedophilekeywords. In row B, the number of queries containingat least x paedophile keywords.
1 2 3 4 5 6 7 8 9A 110614 3166 1105 380 78 100 26 13 1B 115483 4869 1703 598 218 140 40 14 1
Paedophile IPs / normal IPs ratio (Fig. 3)
If we now consider NAT or dynamic IP allocation, an IPmay remain paedophile until the end of the ten weeks,even if it is not the same user anymore. This kind ofcontamination may lead to eventually consider everyIP as paedophile. The growing ratio (see Fig. 3) showsthat, even by the end of the experiment, we still discovermany new paedophile IPs in queries.
Total number of queries by keyword (Fig. 4)
One may notice several groups of keywords, sorted byfrequency of appearance. But this plot also shows that aspecific keyword (the right most one) is used in slightlyless than 60% of all the paedophile queries. The gapbetween this paedophile word and others suggests thatit may be a not-so-specific keyword. Further investiga-tion, such as combination of this keyword with others,
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
0.0014
0.0016
0.0018
0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06
Paedophile
IP
s / n
orm
al IP
s r
atio
Elapsed time (seconds)
Figure 3: Time-evolution of the ratio of pae-dophile IPs over normal IPs.
0.1
1
10
100
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Perc
enta
ge o
f th
e p
edophile
queries
Number of the keyword in the set
Figure 4: Number of queries by paedophile key-word.
must be carried out to decide whether all queries con-taining this keyword should be considered paedophile.
4. FUTUREWORK
Most of the future work will consist in studying com-bination of keywords, so as to corroborate our hypo-thesis on the model (specific paedophile keywords). Adeeper work on ages will also be carried out. We willprogressively refine our definition of paedophile queriesand IP, before establishing accurate and reliable results.
5. REFERENCES
[1] F. Aidouni, M. Latapy, and C. Magnien, “Tenweeks in the life of an edonkey server,” Proceedingsof HotP2P’09, 2009.
[2] M. Latapy, C. Magnien, and G. Valadon,“First report on database specification and accessincluding content rating and fake detectionsystem,” http://antipedo.lip6.fr, 2008.
2
Number of paedophile keywords
Num
ber
of quer
ies
1
10
100
1000
10000
100000
1 2 3 4 5 6 7 8 9
Num
ber
of queries
Number of paedophile keywords
Figure 2: Distribution of the number of pae-dophile keywords in queries, i.e. for each number ofpaedophile keywords from our set contained in queries,the number of queries (vertical axis), with logarithmicscale.
Number of keywords in queries (Fig. 2)
The distribution shows that, among queries containingat least one paedophile keyword, most of them containsa single keyword. These results raise the question ofthe correct definition for a paedophile query. Our firstmodel (PQ) might be improved to eliminate querieswith more than x paedophile keywords.
The table below shows the number of paedophile que-ries, for two definitions of a paedophile query. In row A,the number of queries containing exactly x paedophilekeywords. In row B, the number of queries containingat least x paedophile keywords.
1 2 3 4 5 6 7 8 9A 110614 3166 1105 380 78 100 26 13 1B 115483 4869 1703 598 218 140 40 14 1
Paedophile IPs / normal IPs ratio (Fig. 3)
If we now consider NAT or dynamic IP allocation, an IPmay remain paedophile until the end of the ten weeks,even if it is not the same user anymore. This kind ofcontamination may lead to eventually consider everyIP as paedophile. The growing ratio (see Fig. 3) showsthat, even by the end of the experiment, we still discovermany new paedophile IPs in queries.
Total number of queries by keyword (Fig. 4)
One may notice several groups of keywords, sorted byfrequency of appearance. But this plot also shows that aspecific keyword (the right most one) is used in slightlyless than 60% of all the paedophile queries. The gapbetween this paedophile word and others suggests thatit may be a not-so-specific keyword. Further investiga-tion, such as combination of this keyword with others,
0.0002
0.0004
0.0006
0.0008
0.001
0.0012
0.0014
0.0016
0.0018
0 1e+06 2e+06 3e+06 4e+06 5e+06 6e+06 7e+06
Paedophile
IP
s / n
orm
al IP
s r
atio
Elapsed time (seconds)
Figure 3: Time-evolution of the ratio of pae-dophile IPs over normal IPs.
0.1
1
10
100
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Perc
enta
ge o
f th
e p
edophile
queries
Number of the keyword in the set
Figure 4: Number of queries by paedophile key-word.
must be carried out to decide whether all queries con-taining this keyword should be considered paedophile.
4. FUTUREWORK
Most of the future work will consist in studying com-bination of keywords, so as to corroborate our hypo-thesis on the model (specific paedophile keywords). Adeeper work on ages will also be carried out. We willprogressively refine our definition of paedophile queriesand IP, before establishing accurate and reliable results.
5. REFERENCES
[1] F. Aidouni, M. Latapy, and C. Magnien, “Tenweeks in the life of an edonkey server,” Proceedingsof HotP2P’09, 2009.
[2] M. Latapy, C. Magnien, and G. Valadon,“First report on database specification and accessincluding content rating and fake detectionsystem,” http://antipedo.lip6.fr, 2008.
2
ID of the keyword%
of pae
dophile
quer
ies
Blind content detectionhttp://crs.complexnetworks.fr/?id=HASH
28
porn1; porn2; pedo1; pedo2; fake1; fake2; fake3; fake4
Perspectives and ongoing work
!definition of paedophile IPs and queries
– one keyword ? several ?
– once count as paedophile always paedophile ? (NAT)
!what is the number of file and paedophile users ?
– “9000 paedophiles on the Internet” http://tinyurl.com/c78vlu
“including 1000 in germany”
29
Outline
30
1. eDonkey measurements
1.server side
2.honeypot
3.client side
2. results & statistics
3. community detection
Goals
!analysis based on the structure
!represent data as a graph (nodes & edges)
– words in queries ? filenames ?
– file-ID ? client-ID ?
!Motivations
– understand the structure
– detect communities of interest
– graph visualization
– improve our knowledge of paedophile keywords
31
What is a community?
• A community is a set of nodes
– nodes share something,
– high level of connection,
– more links inside than outside.
Community detection:hierarchical clustering
Slid
e c
ourt
esy o
f P
ascal P
ons
Community detection:divisive approach
Adapte
d fro
m s
lide c
ourt
esy o
f P
ascal P
ons
Words in filenames:how to build a graph ?
35
les betteraves tueur de coiffeur mp3
Dionysos coiffeur d oiseaux mp3
Resulting graph- 1 693 791 nodes- 73 422 135 links
09/03/2007 Community detection: an overviewCircle of friends on boards.ie© boards.ie
37
Louvain Methodhttp://findcommunities.googlepages.com
Words in filenames
After 5 iterations
! community of 189 words
Community containing 12yo
38
Queries Filenames
graph 3 380 213 1 693 791
iteration 1 2 469 134/107 688 828 913/97 111
iteration 2 785 606/23 680 314 665/30 367
iteration 3 232 620/3 954 13 540/1259
iteration 4 51 026/836 1310/94
iteration 5 1614/70 189/11
BLACK words/nodes in the communityRED well-known (> 100 times in the data set)
Tracking well-known keywords
!keyword 1:
– iteration 4: 1614/70 (6 keywords + 14 ages)
– iteration 5: 411/9 (5 keywords)
• 2 previously unknown keywords
!keyword 2:
– iteration 4: 124/13 (1 keyword)
• 1keyword looks like a well-known one
– iteration 5: 20/3 (1 keyword)
• same look-a-like keyword39
Conclusion!Take away messages
1. several techniques to measure eDonkey
- server, client, honeypot
2. anonymize data set publicly available
3. paedophile content can be identified
!Perspectives
– define a paedophile IP and query
– evaluate the correct number of paedophile files & users
– evolution over time
– measure other P2P networks
– impact of new (french) laws: hadopi, lopsi
40