Paahuni Khandelwal Email: [email protected]
1st November, 2019
[Recitation 9]
CS 435 - Introduction to Big Data
Submission
• Submission deadline for PA3 - 14th November (by 5 pm)
• Last day for PA2 demos - Monday, 4th November
• Programming languages allowed: Scala, Java
Introduction to Apache Spark
• Download the latest Apache Spark 2.4.4 binary (pre-built for Hadoop 2.7 and later): [Link]
• Go through its official documentation: [Link]
• Create a Maven project and set up Spark in your IDE to run the WordCount program
• Try to set up a Spark cluster on top of your Hadoop cluster. [Link]
Programming Assignment 3
• Estimate PageRank values under ideal conditions
• Estimate PageRank values while considering dead-end articles (Taxation)
• Create a Wikipedia Bomb
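The taxation variant can be sketched in plain Java before distributing it with Spark. This is a minimal, illustrative implementation (not the assignment's required Spark version): each iteration gives every page a baseline share (1 - beta) / N and distributes the remaining beta fraction along out-links, with a dead-end page spreading its rank uniformly. The graph, beta value, and class name are assumptions for the example.

```java
import java.util.*;

public class PageRankSketch {
    // outLinks[p] lists the pages that page p links to; an empty array is a dead end.
    public static double[] pageRank(int[][] outLinks, int iterations, double beta) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - beta) / n);           // taxation term
            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    // dead end: redistribute its rank uniformly over all pages
                    for (int q = 0; q < n; q++) next[q] += beta * rank[p] / n;
                } else {
                    for (int q : outLinks[p]) next[q] += beta * rank[p] / outLinks[p].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        int[][] graph = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(graph, 25, 0.85)));
    }
}
```

Because the taxation and dead-end terms redistribute all rank mass, the ranks keep summing to 1 across iterations, which is a useful sanity check for your Spark implementation too.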
DEMO
• Apache Spark installation
• Using spark-shell for debugging
• Compiling a Spark program and running it on a cluster
• Running a Spark program locally using an IDE
Lambda Function Word Count
• In Java:
wordCount = inputFile.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
• In Scala:
val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
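The same flatMap → pair → reduce-by-key pipeline can be mirrored locally with plain Java streams, which is handy for checking your logic before running on the cluster. This is only an illustrative analogue of the Spark chain above, not a replacement for it:

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))  // flatMap: lines -> words
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.counting()));                  // reduceByKey analogue
    }

    public static void main(String[] args) {
        // Counts: big=2, data=1, spark=1 (map iteration order is unspecified)
        System.out.println(wordCount(List.of("big data", "big spark")));
    }
}
```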
Datasets for PA3
We have two datasets: the Links dataset and the Titles dataset.
Links-Simple-Sorted: • Each line represents the outgoing links from a page • The format of the lines is:
from1 : to11 to12 to13
from2 : to21 to22 to23
...
Titles-Sorted:
• Each line represents the title of a Wikipedia article
• To find the page title that corresponds to integer n, look up the n-th line in the Titles-Sorted dataset.

Code snippet (Scala):
val titles = sc.textFile(PATH_TO_TITLES).zipWithIndex().mapValues(x => x + 1).map(_.swap)

Code snippet (Java):
JavaPairRDD<Long, String> titles = spark.read().textFile(PATH_TO_TITLES).javaRDD()
    .zipWithIndex().mapToPair(x -> new Tuple2<>(x._2() + 1, x._1()));
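The key detail in both snippets is that zipWithIndex() is 0-based while the dataset's page IDs are 1-based line numbers, hence the +1. A local sketch of the same ID-to-title mapping (the titles used here are illustrative, not necessarily the dataset's first lines):

```java
import java.util.*;

public class TitleIndex {
    // Map each title to its 1-based line number, matching zipWithIndex() + 1 + swap.
    public static Map<Long, String> indexTitles(List<String> titles) {
        Map<Long, String> byId = new HashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            byId.put((long) (i + 1), titles.get(i));  // zipWithIndex is 0-based, hence +1
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<Long, String> byId = indexTitles(List.of("Anarchism", "Autism", "Albedo"));
        System.out.println(byId.get(2L));  // prints "Autism"
    }
}
```

Forgetting the +1 shifts every lookup by one title, which is a common source of wrong answers in this assignment.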
Top 10 pages
• After the PageRank for the web graph converges (or after 25 iterations), show the 10 pages with the highest PageRank values. Code snippet:

spark.sql("SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10");

• Then join the updated ranks (top 10) with the Titles dataset to get the corresponding titles. Code snippet:

spark.sql("SELECT ID, TITLE FROM TITLE_TABLE WHERE ID IN (SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10)");
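If you prefer the RDD/collection style over Spark SQL, the same "sort by rank, take 10, join with titles" step looks like this locally. This is an illustrative sketch with made-up data, not the required Spark solution:

```java
import java.util.*;
import java.util.stream.*;

public class TopPages {
    // Sort by rank descending, take the top k, and look up each ID's title.
    public static List<String> topTitles(Map<Long, Double> ranks,
                                         Map<Long, String> titles, int k) {
        return ranks.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(k)
                .map(e -> titles.get(e.getKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, Double> ranks = Map.of(1L, 0.2, 2L, 0.5, 3L, 0.3);
        Map<Long, String> titles = Map.of(1L, "A", 2L, "B", 3L, "C");
        System.out.println(topTitles(ranks, titles, 2));  // prints [B, C]
    }
}
```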
Wiki Bomb
1. Get a sub-graph containing the word "surfing" (first filter the links dataset)
• Run PageRank on the sub-graph
• At the end, the top-10 output should have "rocky_mountain_national_park" as the first page, irrespective of case
2. Filter out the titles that contain the query word (in this case, "surfing")
• Get the corresponding links for the above titles (this is our new sub-graph)
• If you run PageRank on this sub-graph, the top ten pages will most likely contain pages with the word "surfing" in them
• But for the Wikipedia bomb, we want Rocky Mountain National Park at the top of that top-10 list
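The sub-graph construction in step 2 can be sketched locally: keep the IDs whose title contains the query (case-insensitive), then restrict every adjacency list to that ID set. The titles and links below are illustrative, and this sketch covers only the filtering, not the bomb itself:

```java
import java.util.*;
import java.util.stream.*;

public class SubGraph {
    // Keep only page IDs whose title contains the query (case-insensitive),
    // then restrict every adjacency list to the surviving IDs.
    public static Map<Long, List<Long>> subGraph(Map<Long, String> titles,
                                                 Map<Long, List<Long>> links,
                                                 String query) {
        String q = query.toLowerCase();
        Set<Long> keep = titles.entrySet().stream()
                .filter(e -> e.getValue().toLowerCase().contains(q))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        Map<Long, List<Long>> sub = new HashMap<>();
        for (Long id : keep) {
            List<Long> out = links.getOrDefault(id, List.of()).stream()
                    .filter(keep::contains)          // drop links leaving the sub-graph
                    .collect(Collectors.toList());
            sub.put(id, out);
        }
        return sub;
    }

    public static void main(String[] args) {
        Map<Long, String> titles = Map.of(
                1L, "Surfing", 2L, "Surfing_in_Hawaii", 3L, "Rocky_Mountain_National_Park");
        Map<Long, List<Long>> links = Map.of(1L, List.of(2L, 3L), 2L, List.of(1L));
        System.out.println(subGraph(titles, links, "surfing"));  // keeps only IDs 1 and 2
    }
}
```

On the real datasets you would do both steps with Spark transformations (filter and join) rather than in-memory maps.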
Thank you