CS 435 - Introduction to Big Datacs435/PA/PA1/Fall2019/...resource manager (yarn) • Questions on your Non-MapReduce word count program • Viva will include questions which shows

Paahuni KhandelwalEmail: [email protected]

6th September, 2019

[Recitation 2]

CS 435 - Introduction to Big Data

mailto:[email protected]

Agenda

• Wrapping up PA0

• Introduction to Programming Assignment 1

2CS 435: Introduction to Big Data [Fall 2019]

PA0 (Recap)

• Programming Assignment 1 comprises of PA0+ PA1 (3+7 = 10 points)

• Submission through canvas + Demo (includes viva)

• PA0 Demo – Monday, Tuesday and Wednesday

• If no time slot works out, then mail us at [email protected]


Demo

• Test file uploaded on Monday morning

• Running hdfs, running resource manager

• Performing word count on given test file using

resource manager (yarn)

• Questions on your Non-MapReduce word count

program

• Viva will include questions which shows your

understanding on MapReduce and wordcount

program


Common Issues faced during Setup

• Using JDK 64 – bit compiled version for $JAVA_HOME

• Yarn Resource manager not starting

• Port conflicts

• Looking at logs – present in $HADOOP_LOG_DIR and $YARN_LOG_DIR

• Dfs getting corrupted – remove tmp files

• Memory issues – Adding properties to mapred-site.xml and yarn-site.xml


Word Count Program

6

Programming Assignment 1

• This will be worth 7 points (7% of your total grades)

• Submission due on 30Th September by 5 pm

• Uploaded on CS435 assignment page

7

Dataset

• Summary of appr. 1.5 million Wikipedia articles

• Total size of dataset: ~1 GB

• How will your input file look like?

Title_of_Article-1<====>DocumentID<====>Summary_of_Article-1

NEWLINE

NEWLINE

Title_of_Article-2<====>DocumentID<====>Summary_of_Article-2

NEWLINE

NEWLINE

8

PA1 : N-gram

• Contiguous sequence of N items from a given sample of text or speech

Text: type of language model for predicting the next item in a sequence

Example:

– Unigram ( 1- gram): type, of ,language, model …………,sequence

– Biagram (2-grams): (__, type), (type, of), (of, language), (language, model), (model, for) ……….. (in,

sequence), (sequence, ___)

– Trigram (3-grams): (__, type, of), (type, of, language), …………….

9

PA1 requires three Profiles

PROFILE 1

o First 500 unigrams in the dataset

o Must be sorted in alphabetical order (ascending)

o Eliminate duplicates

o For designing you can use combiner to eliminate local duplicates

10


PROFILE 2

o A list of top 500 unigrams and their frequencies within each article

o Similar to word-count program

o Each article has unique integral Document ID

o Output should be grouped by Document ID

o Output files will correspond to the number of reducer used

o Requires two jobs to be executed in a sequence

11

PROFILE 3

o A list of top 500 unigrams and their frequencies in the corpus

o List should be sorted from most frequent unigrams to least frequent ones.

o Solution to generate profile 3 is a combination of profile 1 and profile 2

o Here, we list unigram with its total occurrence in the complete dataset

(without considering per article construct as in profile 2)

12


Pre-processing of Dataset

o Consider only alphabetic and numeric text.

o Convert upper cases to lower cases.

o Example (Original → processed → toLowerCase)

D.I.A. → DIA → dia

South. → South → south

(U-S-A) → USA → usa

D.C., → DC → dc

9.8 → 98 → 98

13

Few tips

o Make a design your system first

o Make sure you understand Word-Count program

o Start writing program by making modification in given Word-Count program

o Test your program in standalone/local mode with given test file(~95 Kb)

o Verify your output in even smaller text file, if required, before moving on to larger files

o Parsing (preprocessing) the dataset will be an important part and will require

considerable effort. Spend some time to observe the dataset provided.

14

Thank you

15

Questions?

Documents

CS 435 - Introduction to Big Datacs435/PA/PA1/Fall2019/...resource manager (yarn) • Questions on your Non-MapReduce word count program • Viva will include questions which shows