Parallel SPAM Clustering with Hadoop

1. Parallel Spam Clustering with Apache Hadoop
  • Thibault Debatty

2. Spam
  • 70% of total email volume
  • Estimated cost: $20.5 billion/year
  • To fight better, need better strategic knowledge
  • Examples:
      Guaranteed Results
      Make YourPenis 3-inches longer & thicker, girl will love you 1k

3. Spam
  • Same examples, annotated: Close IP, Same domain

4. Problem statement
  • Cluster spams in parallel:
      To get useful insights
      Fast!
  • Dataset: 1 million spams (231 MB)

5. Problem statement
  • Subject: Your Special Order #253650
  • Charset: windows-1250
  • Geo: GB
  • Day: 2010-10-01
  • Host: virginmedia.com
  • ip: 82.4.229.158
  • Lang: english
  • Size: 1482
  • [email protected]
  • [email protected]

6. What's next...
  1. MapReduce and Apache Hadoop
  2. Parallel K-means
  3. Implementation
  4. Benchmarks and speedup analysis
  5. Clusters visualization

7. 1. MapReduce
  • Model for processing large data sets
  • Master node splits and distributes dataset
  • 2 steps:
      1. Map: worker nodes process data, and pass partial results to master
      2. Reduce: master combines partial results
  • Also name of Google's implementation

8. 1. Apache Hadoop
  • Free implementation of MapReduce
  • Written in Java
  • Process large amounts of data (PB)
  • Used by:
      Yahoo: more than 10,000 cores
      Facebook: 30 PB of data
  • Distributed filesystem (HDFS) + data locality

9. 1. Apache Hadoop: Job Tracker
  • Master
  • Divides input data into splits
  • Schedules map tasks (with data locality)
  • Schedules reduce tasks on nodes
  • Checks tasks health

10. 1. Apache Hadoop

11. 2. KMeans
  • Select initial centers
  • Until stop criterion is reached:
      Assign each point to closest center
      Compute new center
  • Advantages:
      Suited to large datasets
      Can be implemented in parallel
  • Computation: O(nki)

12. 2. Parallel KMeans
  • "Parallel K-Means Clustering Based on MapReduce", Weizhong Zhao, Huifang Ma and Qing He
  • Map (point):
      Compute distance to each center
      Output
  • Reduce (list of points):
      Compute center
      Output
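
The map and reduce steps of parallel k-means can be sketched in plain Java, without any Hadoop dependency. This is only an illustration under stated assumptions: 2D points, squared Euclidean distance, and the mean as the new center. The class and method names (KMeansIterationSketch, map, reduce, iterate) are hypothetical and not taken from the talk's code, and the grouping done in iterate stands in for the shuffle that Hadoop performs between map and reduce.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free sketch of one k-means iteration on 2D points.
class KMeansIterationSketch {

    // Map step: return the index of the center closest to the point.
    static int map(double[] point, List<double[]> centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            double dx = point[0] - centers.get(i)[0];
            double dy = point[1] - centers.get(i)[1];
            double dist = dx * dx + dy * dy; // squared Euclidean distance (assumption)
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // Reduce step: the new center is the mean of the points assigned to it (assumption).
    static double[] reduce(List<double[]> points) {
        double sumX = 0;
        double sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] {sumX / points.size(), sumY / points.size()};
    }

    // One iteration: group points by closest center, then reduce each group.
    static Map<Integer, double[]> iterate(List<double[]> points, List<double[]> centers) {
        Map<Integer, List<double[]>> groups = new HashMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(map(p, centers), k -> new ArrayList<>()).add(p);
        }
        Map<Integer, double[]> newCenters = new HashMap<>();
        for (Map.Entry<Integer, List<double[]>> e : groups.entrySet()) {
            newCenters.put(e.getKey(), reduce(e.getValue()));
        }
        return newCenters;
    }
}
```

A center that receives no point is absent from the returned map; a real job would need to decide whether to keep the old center in that case.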

13. 3. Implementation: KMeans
  • Abstract KMeans
  • Abstract KMeansMapper
  • Abstract KMeansReducer
  • Interface IPoint
  • Interface ICenter
  • 2 concrete implementations:
      Spam
      Simple 2D points

14. 3. Implementation: Abstract KMeans

    // Write to "/it_0/part00000"
    this.writeInitialCentroids();
    for (...) {
        conf.setMapperClass(this.mapper);
        conf.setReducerClass(this.reducer);
        conf.setInt("iteration", iteration);
        SetOutputPath(... "/it_" + (iteration + 1));
        ...
    }

15. 3. Implementation: Abstract KMeansMapper

    public void configure(JobConf job) {
        // reads from
        // "/it_" + job.get("iteration") + "/partxxxxx"
        this.fetchCenters(job);
    }

    public void map(key, value, ...) {
        IPoint point = this.createPointInstance();
        point.parse(value);
        ...
    }

    public abstract IPoint createPointInstance();
    public abstract ICenter createCenterInstance();

16. 3. Implementation: Abstract KMeansReducer

    public void reduce(key, values, ...) {
        new_center = this.createCenterInstance();
        new_center.setOldCenter(old_center);
        while (values.hasNext()) {
            // point is built from values.next()
            new_center.addPoint(point);
        }
        new_center.compute();
        output.collect(new_center);
    }

    public abstract IPoint createPointInstance();
    public abstract ICenter createCenterInstance();

17. 3. Implementation: Spam Clustering
  • Distance between spams: weighted average of feature distances
  • Text features: Jaro distance

18. 3. Implementation: Spam Clustering
  • Jaro similarity = 1/3 × (m/|s1| + m/|s2| + (m - t)/m)
  • Where:
      |s1|, |s2| = lengths of the two strings
      m = number of matching characters
      t = number of matching characters not located at the same position / 2
  • Matching = not farther than floor(max(|s1|, |s2|) / 2) - 1 positions apart
  • => Takes misspelling into account
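
The Jaro similarity described above can be sketched as follows. This is a minimal sketch of the standard definition, (m/|s1| + m/|s2| + (m - t)/m) / 3 with a matching window of floor(max(|s1|, |s2|) / 2) - 1, which is assumed to be the one used in the talk. The class name JaroSketch and the choice distance = 1 - similarity are assumptions for illustration, not the author's code.

```java
// Hypothetical helper: standard Jaro similarity between two strings.
class JaroSketch {

    static double similarity(String s1, String s2) {
        if (s1.isEmpty() && s2.isEmpty()) return 1.0;
        if (s1.isEmpty() || s2.isEmpty()) return 0.0;

        // Two equal characters match only if they are no farther apart than this window.
        int window = Math.max(0, Math.max(s1.length(), s2.length()) / 2 - 1);
        boolean[] matched1 = new boolean[s1.length()];
        boolean[] matched2 = new boolean[s2.length()];

        // Count matching characters (m), using each character of s2 at most once.
        int m = 0;
        for (int i = 0; i < s1.length(); i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(s2.length() - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (!matched2[j] && s1.charAt(i) == s2.charAt(j)) {
                    matched1[i] = true;
                    matched2[j] = true;
                    m++;
                    break;
                }
            }
        }
        if (m == 0) return 0.0;

        // Walk both strings over matched characters only; t is half the out-of-order count.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (!matched1[i]) continue;
            while (!matched2[j]) j++;
            if (s1.charAt(i) != s2.charAt(j)) outOfOrder++;
            j++;
        }
        double t = outOfOrder / 2.0;

        return (m / (double) s1.length() + m / (double) s2.length() + (m - t) / m) / 3.0;
    }

    // Assumption: the distance is taken as 1 - similarity.
    static double distance(String s1, String s2) {
        return 1.0 - similarity(s1, s2);
    }
}
```

For the classic pair "MARTHA" and "MARHTA", all six characters match and two of them are swapped (t = 1), so the similarity is 17/18, about 0.944. A single swapped letter barely lowers the score, which is what makes the measure tolerant of misspellings.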

19. 3. Implementation: Spam Clustering
  • Distance between spams: weighted average of feature distances
  • Text features: Jaro distance
  • IP: number of different bits / 32
  • Size: max 10% difference
  • Day: arctangent-shaped function

20. 3. Implementation: Spam Clustering

21. 3. Implementation: Spam Clustering
  • Center of cluster:
      Text features: Longest Common Subsequence
      Charset, Geo (country code), Lang, Day: most often occurring value
      Size: average value

22. 4. Benchmarks
  • Small cluster:
      3 nodes
      Single core
      2 GB RAM
      Gigabit Ethernet network
      Data replication: 3

23. 4. Benchmarks
  • n = 1M spams
  • k = 30
  • i = 10
  • => 1131 sec

24. 4. Benchmarks: scalability
  • [Chart: execution time (sec), axis from 0 to 3500, for 1 node, 2 nodes and 3 nodes]

25. 4. Benchmarks: scalability

26. 4. Benchmarks: Hadoop Overhead
  • Sequential: 2424 sec
  • 3 servers (theoretic): 808 sec
  • 3 servers (real): 1131 sec
  • Overhead: 323 sec (40%)

27. 4. Benchmarks: Hadoop Overhead
  • Same figures, plus: MPI Jumpshot

28. 4. Benchmarks: Hadoop Overhead
  • Sequential: 2424 sec
  • 3 servers (theoretic): 808 sec
  • 3 servers (real): 1131 sec
  • Overhead: 323 sec (40%)
      No data (setup): 76 sec (9.5%)
      Trivial distance (setup + sort): 242 sec
      Sort: 166 sec (20.5%)
      Remaining: 81 sec (10%)
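
The overhead figures follow from simple arithmetic, written out here as a worked check. The percentages appear to be taken relative to the theoretical 808 sec. The speedup on the last line does not appear on the slides; it is derived from the two measured times.

```latex
\begin{align*}
T_{\text{theoretic}} &= \frac{T_{\text{sequential}}}{3} = \frac{2424}{3} = 808 \text{ sec} \\
\text{Overhead} &= T_{\text{real}} - T_{\text{theoretic}} = 1131 - 808 = 323 \text{ sec},
  \qquad \frac{323}{808} \approx 40\% \\
\text{Sort} &= 242 - 76 = 166 \text{ sec},
  \qquad \frac{166}{808} \approx 20.5\% \\
\text{Remaining} &= 323 - 242 = 81 \text{ sec},
  \qquad \frac{81}{808} \approx 10\% \\
\text{Speedup} &= \frac{T_{\text{sequential}}}{T_{\text{real}}} = \frac{2424}{1131} \approx 2.14
\end{align*}
```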

29. 4. Benchmarks: Weka and Mahout
  • 10 million 2D points
  • Weka (sequential): 5355 sec
  • Hadoop: 1841 sec (2.9x faster)
  • Mahout: more than 4 h?

30. 4. Benchmarks
  • Bigger cluster:
      27 nodes
      2 x 4 cores
      16 GB
  • Deployment:
      Shared home dir (NFS)
      Custom setup script
      Executed on all nodes through SSH

31. 4. Benchmarks: Cluster
  • 1M spams

                Small cluster    Bigger cluster
      Cores     3                216
      k         30               4000
      Time      1131 sec         2484 sec

32. 4. Benchmarks: Comparison

                Small cluster    Bigger cluster    Ratio
      Cores     3                216               x 72
      k         30               4000              x 133
      Time      1131 sec         2484 sec

  • Expected: 2089 sec
  • Difference: 19%

33. 4. Benchmarks: Profiling and optimization
  • With String dates: 1131 sec
  • With timestamps: 770 sec (-32%)

34. 5. Results
  • "Your receipt #"
      From: ""
      To: "@domain4.com"
  • "LinkedIn Messages, /0/2010"
      From: "[email protected]"
      To: "@domain0140.com"
  • ""
      From: "[email protected]"
      To: "@domain4.c"

35. 5. Results: Visualization
  • "eil rder #"
      From: "[email protected]"

36. Conclusion
  • Hadoop allows faster clustering
  • But: limitations
      Lacks a graphical performance analysis tool (such as MPI Jumpshot)
      Programmer needs to understand the inner workings!
  • Lot of room for improvement:
      Memcached to store intermediate centers?
      MPI to intercept method calls between JVMs?
      Selection of initial centers (canopy?), stop criterion?
      Distance computation (WOWA)
      Clustering algorithm (online clustering)
      Influence of data locality and data size?

37. Questions?