
Page 1:

Paahuni Khandelwal Email: [email protected]

1st November, 2019

[Recitation 9]

CS 435 - Introduction to Big Data

Page 2:

Submission


o Submission Deadline for PA3 - 14th November (by 5 pm)
o Last day for PA2 Demos - Monday, 4th November
o Programming Languages allowed: Scala, Java

Page 3:

Introduction to Apache Spark


• Download the latest Apache Spark 2.4.4 binary (pre-built for Hadoop 2.7 and later): [Link]

• Go through its official documentation: [Link]

• Create a Maven project and set up Spark in your IDE to run a WordCount program (a minimal setup sketch follows this list)

• Try to set up a Spark cluster on top of your Hadoop cluster. [Link]
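As a starting point, here is a minimal sketch (in Scala) of a local Spark entry point that can run a job such as WordCount from the IDE; the object name, app name, and local[*] master are illustrative choices, not part of the assignment spec:

import org.apache.spark.sql.SparkSession

// Illustrative entry point; the object and app names are placeholders.
object SparkLocalSetup {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside the IDE using all local cores; on the
    // cluster the master is supplied by spark-submit instead of hard-coding it.
    val spark = SparkSession.builder()
      .appName("CS435-WordCount")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext   // entry point for the RDD API

    // ... job logic (e.g. the WordCount shown on a later slide) goes here ...

    spark.stop()
  }
}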

Page 4:

Programming Assignment 3

• Estimate PageRank values under ideal conditions

• Estimate PageRank values while considering dead-end articles (Taxation); a sketch of the update appears below

• Create a Wikipedia Bomb
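For orientation, a minimal sketch (Scala) of one taxation-style rank update; the ideal-conditions version corresponds to setting beta = 1. The names links, ranks, beta, and numPages are assumptions for illustration, not a required PA3 interface:

// Assumed inputs (illustrative):
//   links: RDD[(Long, Array[Long])] -- (pageId, outgoing links)
//   ranks: RDD[(Long, Double)]      -- current PageRank estimates
//   beta: Double (e.g. 0.85), numPages: Long

// Each page distributes rank / outDegree to every page it links to.
val contributions = links.join(ranks).flatMap {
  case (_, (outLinks, rank)) =>
    outLinks.map(dest => (dest, rank / outLinks.length))
}

// Taxation-style update: a (1 - beta) / N baseline plus beta times the received mass.
// Pages that receive no contributions drop out of reduceByKey and would need to be
// re-added (e.g. via a join back against links) before the next iteration.
val newRanks = contributions
  .reduceByKey(_ + _)
  .mapValues(sum => (1 - beta) / numPages + beta * sum)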


Page 5:

DEMO

• Apache Spark installation

• Using the spark-shell for debugging

• Compiling a Spark program and running it on the cluster

• Running a Spark program locally using an IDE


Page 6:

Lambda Function Word Count

• In Java

wordCount = inputFile.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);

• In Scala

val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
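A hedged usage sketch for the Scala version inside the spark-shell (where sc already exists); the HDFS paths are placeholders:

val textFile = sc.textFile("hdfs:///path/to/input.txt")        // placeholder path
val wordCount = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
wordCount.saveAsTextFile("hdfs:///path/to/wordcount-output")   // placeholder path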


Page 7:

Datasets for PA3

We have two datasets: the Links dataset and the Titles dataset.

Links-Simple-Sorted:
• Each line represents the outgoing links from a page
• The format of the lines is:

from1 : to11 to12 to13

from2 : to21 to22 to23

...

Titles-Sorted:
• Each line represents the title of a Wikipedia article
• To find the page title that corresponds to integer n, just look up the n-th line in the Titles-Sorted dataset.

Code snippet (Scala):

val titles = sc.textFile(PATH_TO_TITLES).zipWithIndex().mapValues(x => x + 1).map(_.swap)

Code snippet (Java):

JavaPairRDD<Long, String> titles = spark.read().textFile(PATH_TO_TITLES).javaRDD()
    .zipWithIndex().mapToPair(x -> new Tuple2<>(x._2() + 1, x._1()));
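For the Links dataset, a minimal parsing sketch (Scala) following the "from : to1 to2 ..." layout shown above; PATH_TO_LINKS is a placeholder analogous to PATH_TO_TITLES:

// Split each "from : to1 to2 to3" line into a source id and its outgoing links.
val links = sc.textFile(PATH_TO_LINKS).map { line =>
  val parts = line.split(":")
  val from = parts(0).trim.toLong
  val to =
    if (parts.length > 1) parts(1).trim.split("\\s+").filter(_.nonEmpty).map(_.toLong)
    else Array.empty[Long]   // no outgoing links listed
  (from, to)
}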


Page 8:

Top 10 pages

• After the PageRank values for the web graph converge (or after 25 iterations), show the top 10 pages with the highest PageRank. Code snippet:

spark.sql("SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10");

• Then join the updated top 10 ranks with the Titles dataset to get the corresponding titles. Code snippet:

spark.sql("SELECT ID, TITLE FROM TITLE_TABLE WHERE ID IN (SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10)");
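These queries assume the ranks and titles have already been registered as temporary views; one possible way to do that in Scala (column and view names follow the queries above; ranks and titles are the RDDs built earlier):

import spark.implicits._   // enables .toDF on RDDs of tuples

// ranks: RDD[(Long, Double)], titles: RDD[(Long, String)]
ranks.toDF("ID", "RANK").createOrReplaceTempView("RANK_TABLE")
titles.toDF("ID", "TITLE").createOrReplaceTempView("TITLE_TABLE")

val top10 = spark.sql(
  "SELECT ID, TITLE FROM TITLE_TABLE WHERE ID IN " +
  "(SELECT ID FROM RANK_TABLE ORDER BY RANK DESC LIMIT 10)")
top10.show()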


Page 9:

Wiki Bomb

1. Get a sub-graph containing the word "surfing" (first filter the Links dataset).

• Run PageRank on the sub-graph.

• At the end, the top-10 output should have "rocky_mountain_national_park" as the first page, irrespective of case.

2. Filter the Titles that contain the query word (in this case, "surfing").

• Get the corresponding Links for those Titles (this is our new sub-graph); a filtering sketch appears below.

• If you run PageRank on this sub-graph, the top ten pages will most likely contain pages with the word "surfing" in their titles.

• But for the Wikipedia bomb, we want "Rocky Mountain National Park" at the top of that top-10 list.
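A hedged sketch of the filtering in step 2, assuming titles: RDD[(Long, String)] and links: RDD[(Long, Array[Long])] as built earlier; the query word and variable names are illustrative:

// Keep only the titles containing the query word, case-insensitively.
val query = "surfing"
val matchingIds = titles
  .filter { case (_, title) => title.toLowerCase.contains(query) }
  .keys

// One simple approach: collect the matching ids to the driver, broadcast them,
// and restrict both the sources and their outgoing links to the sub-graph.
val idSet = sc.broadcast(matchingIds.collect().toSet)
val subGraph = links
  .filter { case (from, _) => idSet.value.contains(from) }
  .mapValues(_.filter(idSet.value.contains))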


Page 10:

Thank you