Paahuni Khandelwal, Email: [email protected]
6th September, 2019
[Recitation 2]
CS 435 - Introduction to Big Data
Agenda
• Wrapping up PA0
• Introduction to Programming Assignment 1
CS 435: Introduction to Big Data [Fall 2019]
PA0 (Recap)
• Programming Assignment 1 comprises PA0 + PA1 (3 + 7 = 10 points)
• Submission through Canvas + demo (includes viva)
• PA0 Demo – Monday, Tuesday and Wednesday
• If no time slot works for you, email us at [email protected]
Demo
• Test file uploaded on Monday morning
• Running HDFS and the YARN resource manager
• Performing word count on the given test file using the resource manager (YARN)
• Questions on your non-MapReduce word count program
• The viva will include questions that test your understanding of MapReduce and the word count program
Common Issues faced during Setup
• Using a 64-bit JDK build for $JAVA_HOME
• Yarn Resource manager not starting
• Port conflicts
• Looking at logs – present in $HADOOP_LOG_DIR and $YARN_LOG_DIR
• DFS getting corrupted – remove the tmp files
• Memory issues – add properties to mapred-site.xml and yarn-site.xml
Word Count Program
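As a reminder of the idea behind word count, here is a minimal sketch in plain Java that imitates the map, shuffle, and reduce steps in memory. It does not use the Hadoop API, and the class and method names (`WordCountSketch`, `map`, `reduce`, `wordCount`) are made up for illustration; the Word-Count program given for the course may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of word count; not Hadoop code.
class WordCountSketch {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(Map.entry(token, 1));
            }
        }
        return pairs;
    }

    // "Shuffle" + "reduce" steps: group the pairs by word and sum their counts.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Run the map step on every line, then reduce all emitted pairs together.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // Prints {big=3, data=2}
        System.out.println(wordCount(List.of("big data big", "data big")));
    }
}
```

In a real job the framework performs the grouping between the two steps; here a single `TreeMap` stands in for it.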
Programming Assignment 1
• Worth 7 points (7% of your total grade)
• Submission due on 30th September by 5 pm
• Posted on the CS 435 assignment page
Dataset
• Summaries of approx. 1.5 million Wikipedia articles
• Total size of dataset: ~1 GB
• What will your input file look like?
Title_of_Article-1<====>DocumentID<====>Summary_of_Article-1
NEWLINE
NEWLINE
Title_of_Article-2<====>DocumentID<====>Summary_of_Article-2
NEWLINE
NEWLINE
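One way to split a record in this format is sketched below in plain Java. It assumes each article sits on a single line with exactly three fields separated by `<====>`, and that blank lines between records can be skipped. The class name `ArticleRecordSketch`, the method `parse`, and the Document ID `42` are illustrative only; check the real dataset for edge cases before relying on this.

```java
import java.util.regex.Pattern;

// Hypothetical helper for the Title<====>DocumentID<====>Summary layout.
class ArticleRecordSketch {
    static final String DELIMITER = "<====>";

    // Returns {title, documentId, summary}, or null for blank or malformed lines.
    static String[] parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return null; // the separator lines between records carry no data
        }
        // Pattern.quote treats the delimiter literally; limit 3 keeps any
        // further "<====>" occurrences inside the summary field.
        String[] fields = line.split(Pattern.quote(DELIMITER), 3);
        if (fields.length != 3) {
            return null;
        }
        return new String[] { fields[0].trim(), fields[1].trim(), fields[2].trim() };
    }

    public static void main(String[] args) {
        String[] record = parse("Title_of_Article-1<====>42<====>Summary of article 1");
        System.out.println(record[1] + " -> " + record[0]);
    }
}
```

Returning `null` for lines that do not match keeps the caller's loop simple: skip the line and move on.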
PA1 : N-gram
• A contiguous sequence of N items from a given sample of text or speech
Text: type of language model for predicting the next item in a sequence
Example:
– Unigrams (1-grams): type, of, language, model, …, sequence
– Bigrams (2-grams): (__, type), (type, of), (of, language), (language, model), (model, for), …, (in, sequence), (sequence, __)
– Trigrams (3-grams): (__, type, of), (type, of, language), …
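The example above can be reproduced with a small sliding-window sketch in plain Java. Following the slide, it assumes one `__` placeholder is added at each end of the token list for N > 1 and none for unigrams; that padding rule is read off the example, not from an assignment specification. `NGramSketch` and `ngrams` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical n-gram generator using a sliding window over the tokens.
class NGramSketch {
    static final String PAD = "__"; // boundary placeholder, as in the example

    static List<List<String>> ngrams(List<String> tokens, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> padded = new ArrayList<>(tokens);
        if (n > 1) {
            // Assumption: a single pad at each end, matching the slide's example.
            padded.add(0, PAD);
            padded.add(PAD);
        }
        List<List<String>> result = new ArrayList<>();
        // Slide a window of width n across the (padded) token list.
        for (int i = 0; i + n <= padded.size(); i++) {
            result.add(new ArrayList<>(padded.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [[__, type], [type, of], [of, language], [language, __]]
        System.out.println(ngrams(List.of("type", "of", "language"), 2));
    }
}
```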
PA1 requires three Profiles
PROFILE 1
o First 500 unigrams in the dataset
o Must be sorted in alphabetical order (ascending)
o Eliminate duplicates
o In your design, you can use a combiner to eliminate local duplicates
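The expected result of Profile 1 can be sketched in plain Java as follows. This reads "first 500" as the first 500 entries of the alphabetically sorted, de-duplicated unigram list, which is an interpretation of the slide. In MapReduce the sorting and grouping would come from the shuffle and a combiner could drop local duplicates early; here a `TreeSet` does both in memory. `Profile1Sketch` and `firstUnigrams` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory version of Profile 1; not Hadoop code.
class Profile1Sketch {

    static List<String> firstUnigrams(List<String> unigrams, int limit) {
        // A TreeSet keeps its elements sorted ascending and drops duplicates.
        TreeSet<String> unique = new TreeSet<>(unigrams);
        List<String> result = new ArrayList<>();
        for (String unigram : unique) {
            if (result.size() == limit) {
                break; // keep only the first `limit` entries
            }
            result.add(unigram);
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [language, model, of]
        System.out.println(firstUnigrams(List.of("model", "type", "of", "type", "language"), 3));
    }
}
```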
PROFILE 2
o A list of the top 500 unigrams and their frequencies within each article
o Similar to the word-count program
o Each article has a unique integer Document ID
o Output should be grouped by Document ID
o The number of output files will correspond to the number of reducers used
o Requires two jobs to be executed in sequence
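A plain-Java sketch of the Profile 2 logic is shown below. It splits the work into two steps to mirror the two jobs: counting per document first, then ranking within each document. That split is one plausible reading of the slide, and the tab-separated output line, the alphabetical tie-break, and the names `Profile2Sketch`, `countPerDocument`, and `topPerDocument` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory version of Profile 2; not Hadoop code.
class Profile2Sketch {

    // Step 1 (analogous to a first job): count each unigram within each document.
    static Map<Integer, Map<String, Integer>> countPerDocument(Map<Integer, List<String>> docs) {
        Map<Integer, Map<String, Integer>> counts = new TreeMap<>(); // ordered by Document ID
        for (Map.Entry<Integer, List<String>> doc : docs.entrySet()) {
            Map<String, Integer> perDoc = new HashMap<>();
            for (String unigram : doc.getValue()) {
                perDoc.merge(unigram, 1, Integer::sum);
            }
            counts.put(doc.getKey(), perDoc);
        }
        return counts;
    }

    // Step 2 (analogous to a second job): rank each document's unigrams, keep the top `limit`.
    static List<String> topPerDocument(Map<Integer, Map<String, Integer>> counts, int limit) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, Map<String, Integer>> doc : counts.entrySet()) {
            List<Map.Entry<String, Integer>> ranked = new ArrayList<>(doc.getValue().entrySet());
            // Highest count first; ties broken alphabetically (assumed tie rule).
            ranked.sort((a, b) -> {
                int byCount = Integer.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            for (Map.Entry<String, Integer> e : ranked.subList(0, Math.min(limit, ranked.size()))) {
                // Assumed output layout: documentId <TAB> unigram <TAB> frequency
                lines.add(doc.getKey() + "\t" + e.getKey() + "\t" + e.getValue());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> docs = Map.of(
                2, List.of("b", "a", "b"),
                1, List.of("x", "y", "x"));
        topPerDocument(countPerDocument(docs), 1).forEach(System.out::println);
    }
}
```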
PROFILE 3
o A list of the top 500 unigrams and their frequencies in the corpus
o The list should be sorted from the most frequent unigram to the least frequent.
o The solution for Profile 3 combines the approaches of Profile 1 and Profile 2
o Here, we list each unigram with its total number of occurrences in the complete dataset
(without considering the per-article structure used in Profile 2)
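For comparison with Profile 2, here is a plain-Java sketch of the Profile 3 result: counts are summed over the whole corpus, with no Document ID involved, and then sorted by frequency in descending order. The alphabetical tie-break, the tab-separated output line, and the names `Profile3Sketch` and `topCorpusUnigrams` are assumptions, not part of the assignment text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory version of Profile 3; not Hadoop code.
class Profile3Sketch {

    static List<String> topCorpusUnigrams(List<List<String>> articles, int limit) {
        // Sum occurrences across all articles, ignoring article boundaries.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> article : articles) {
            for (String unigram : article) {
                counts.merge(unigram, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties broken alphabetically (assumed tie rule).
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : ranked.subList(0, Math.min(limit, ranked.size()))) {
            lines.add(e.getKey() + "\t" + e.getValue()); // unigram <TAB> frequency
        }
        return lines;
    }

    public static void main(String[] args) {
        List<List<String>> articles = List.of(
                List.of("a", "b", "a"),
                List.of("b", "a", "c"));
        topCorpusUnigrams(articles, 2).forEach(System.out::println);
    }
}
```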
Pre-processing of Dataset
o Consider only alphabetic and numeric characters.
o Convert uppercase letters to lowercase.
o Examples (original → processed → lowercase)
D.I.A. → DIA → dia
South. → South → south
(U-S-A) → USA → usa
D.C., → DC → dc
9.8 → 98 → 98
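The examples above are consistent with a two-step normalization: delete every character that is not a letter or digit, then lowercase the result. A minimal plain-Java sketch follows. It assumes only ASCII letters and digits count as alphabetic and numeric, which the slide does not state, and `PreprocessSketch` and `normalize` are hypothetical names.

```java
import java.util.Locale;

// Hypothetical token normalizer matching the examples on this slide.
class PreprocessSketch {

    static String normalize(String token) {
        // Assumption: "alphabetic and numeric" means ASCII A-Z, a-z, 0-9 only.
        String stripped = token.replaceAll("[^A-Za-z0-9]", "");
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        for (String token : new String[] { "D.I.A.", "South.", "(U-S-A)", "D.C.,", "9.8" }) {
            System.out.println(token + " -> " + normalize(token));
        }
    }
}
```

A token made up entirely of punctuation normalizes to the empty string, so callers will probably want to skip empty results before counting.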
A few tips
o Design your system first
o Make sure you understand the Word-Count program
o Start writing your program by modifying the given Word-Count program
o Test your program in standalone/local mode with the given test file (~95 KB)
o Verify your output on an even smaller text file, if required, before moving on to larger files
o Parsing (preprocessing) the dataset will be an important part and will require
considerable effort. Spend some time studying the dataset provided.
Thank you
Questions?