Paahuni Khandelwal Email: [email protected]
1st November, 2019
[Recitation 9]
CS 435 - Introduction to Big Data
Submission
• Submission deadline for PA3 - 14th November (by 5 pm)
• Last day for PA2 demos - Monday, 4th November
• Programming languages allowed: Scala, Java
Introduction to Apache Spark
• Download the latest Apache Spark 2.4.4 binary (pre-built for Hadoop 2.7 and later): [Link]
• Go through its official documentation: [Link]
• Create a Maven project and set up Spark in your IDE to run the WordCount program
• Try to set up a Spark cluster on top of your Hadoop cluster. [Link]
Programming Assignment 3
• Estimate PageRank values under ideal conditions
• Estimate PageRank values while considering dead-end articles (Taxation)
• Create a Wikipedia Bomb
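The taxation variant can be sketched in plain Java before distributing it with Spark. This is a minimal, illustrative implementation (not the assignment's required Spark version): each iteration gives every page a baseline share (1 - beta) / N and distributes the remaining beta fraction along out-links, with a dead-end page spreading its rank uniformly. The graph, beta value, and class name are assumptions for the example.

```java
import java.util.*;

public class PageRankSketch {
    // outLinks[p] lists the pages that page p links to; an empty array is a dead end.
    public static double[] pageRank(int[][] outLinks, int iterations, double beta) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - beta) / n);           // taxation term
            for (int p = 0; p < n; p++) {
                if (outLinks[p].length == 0) {
                    // dead end: redistribute its rank uniformly over all pages
                    for (int q = 0; q < n; q++) next[q] += beta * rank[p] / n;
                } else {
                    for (int q : outLinks[p]) next[q] += beta * rank[p] / outLinks[p].length;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Tiny example graph: 0 -> {1, 2}, 1 -> {2}, 2 -> {0}
        int[][] graph = { {1, 2}, {2}, {0} };
        System.out.println(Arrays.toString(pageRank(graph, 25, 0.85)));
    }
}
```

Because the taxation and dead-end terms redistribute all rank mass, the ranks keep summing to 1 across iterations, which is a useful sanity check for your Spark implementation too.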
DEMO
• Apache Spark installation
• Using spark-shell for debugging
• Compiling a Spark program and running it on a cluster
• Running a Spark program locally using an IDE
Lambda Function Word Count
• In Java:
wordCount = inputFile.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
• In Scala:
val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
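The same flatMap → pair → reduce-by-key pipeline can be mirrored locally with plain Java streams, which is handy for checking your logic before running on the cluster. This is only an illustrative analogue of the Spark chain above, not a replacement for it:

```java
import java.util.*;
import java.util.stream.*;

public class LocalWordCount {
    public static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))  // flatMap: lines -> words
                .collect(Collectors.groupingBy(w -> w,
                        Collectors.counting()));                  // reduceByKey analogue
    }

    public static void main(String[] args) {
        // Counts: big=2, data=1, spark=1 (map iteration order is unspecified)
        System.out.println(wordCount(List.of("big data", "big spark")));
    }
}
```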
Datasets for PA3
We have two datasets: the Links dataset and the Titles dataset.
Links-Simple-Sorted: • Each line represents the outgoing links from a page • The format of the lines is:
from1 : to11 to12 to13
from2 : to21 to22 to23
...
Titles-Sorted:
• Each line represents the title of a Wikipedia article
• To find the page title that corresponds to integer n, look up the n-th line in the Titles-Sorted dataset.

Code snippet (Scala):
val titles = sc.textFile(PATH_TO_TITLES).zipWithIndex().mapValues(x => x + 1).map(_.swap)

Code snippet (Java):
JavaPairRDD<Long, String> titles = spark.read().textFile(PATH_TO_TITLES).javaRDD()
    .zipWithIndex().mapToPair(x -> new Tuple2<>(x._2() + 1, x._1()));
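The key detail in both snippets is that zipWithIndex() is 0-based while the dataset's page IDs are 1-based line numbers, hence the +1. A local sketch of the same ID-to-title mapping (the titles used here are illustrative, not necessarily the dataset's first lines):

```java
import java.util.*;

public class TitleIndex {
    // Map each title to its 1-based line number, matching zipWithIndex() + 1 + swap.
    public static Map<Long, String> indexTitles(List<String> titles) {
        Map<Long, String> byId = new HashMap<>();
        for (int i = 0; i < titles.size(); i++) {
            byId.put((long) (i + 1), titles.get(i));  // zipWithIndex is 0-based, hence +1
        }
        return byId;
    }

    public static void main(String[] args) {
        Map<Long, String> byId = indexTitles(List.of("Anarchism", "Autism", "Albedo"));
        System.out.println(byId.get(2L));  // prints "Autism"
    }
}
```

Forgetting the +1 shifts every lookup by one title, which is a common source of wrong answers in this assignment.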
Top 10 pages
• After the PageRank for the web graph converges (or after 25 iterations), show the 10 pages with the highest PageRank values. Code snippet:

spark.sql("SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10");

• Then join the updated ranks (top 10) with the Titles dataset to get the corresponding titles. Code snippet:

spark.sql("SELECT ID, TITLE FROM TITLE_TABLE WHERE ID IN (SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10)");
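If you prefer the RDD/collection style over Spark SQL, the same "sort by rank, take 10, join with titles" step looks like this locally. This is an illustrative sketch with made-up data, not the required Spark solution:

```java
import java.util.*;
import java.util.stream.*;

public class TopPages {
    // Sort by rank descending, take the top k, and look up each ID's title.
    public static List<String> topTitles(Map<Long, Double> ranks,
                                         Map<Long, String> titles, int k) {
        return ranks.entrySet().stream()
                .sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
                .limit(k)
                .map(e -> titles.get(e.getKey()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<Long, Double> ranks = Map.of(1L, 0.2, 2L, 0.5, 3L, 0.3);
        Map<Long, String> titles = Map.of(1L, "A", 2L, "B", 3L, "C");
        System.out.println(topTitles(ranks, titles, 2));  // prints [B, C]
    }
}
```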
Wiki Bomb
1. Get a sub-graph containing the word "surfing" (first filter the links dataset)
• Run PageRank on the sub-graph
• At the end, the top-10 output should have "rocky_mountain_national_park" as the first page, irrespective of case
2. Filter out the titles that contain the query word (in this case, "surfing")
• Get the corresponding links for the above titles (this is our new sub-graph)
• If you run PageRank on this sub-graph, the top ten pages will most likely contain pages with the word "surfing" in them
• But for the Wikipedia bomb, we want Rocky Mountain National Park at the top of that top-10 list
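The sub-graph construction in step 2 can be sketched locally: keep the IDs whose title contains the query (case-insensitive), then restrict every adjacency list to that ID set. The titles and links below are illustrative, and this sketch covers only the filtering, not the bomb itself:

```java
import java.util.*;
import java.util.stream.*;

public class SubGraph {
    // Keep only page IDs whose title contains the query (case-insensitive),
    // then restrict every adjacency list to the surviving IDs.
    public static Map<Long, List<Long>> subGraph(Map<Long, String> titles,
                                                 Map<Long, List<Long>> links,
                                                 String query) {
        String q = query.toLowerCase();
        Set<Long> keep = titles.entrySet().stream()
                .filter(e -> e.getValue().toLowerCase().contains(q))
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
        Map<Long, List<Long>> sub = new HashMap<>();
        for (Long id : keep) {
            List<Long> out = links.getOrDefault(id, List.of()).stream()
                    .filter(keep::contains)          // drop links leaving the sub-graph
                    .collect(Collectors.toList());
            sub.put(id, out);
        }
        return sub;
    }

    public static void main(String[] args) {
        Map<Long, String> titles = Map.of(
                1L, "Surfing", 2L, "Surfing_in_Hawaii", 3L, "Rocky_Mountain_National_Park");
        Map<Long, List<Long>> links = Map.of(1L, List.of(2L, 3L), 2L, List.of(1L));
        System.out.println(subGraph(titles, links, "surfing"));  // keeps only IDs 1 and 2
    }
}
```

On the real datasets you would do both steps with Spark transformations (filter and join) rather than in-memory maps.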
Thank you