russell-jones
PageRank + Inverted Index
A Search Engine
“obama”
PageRank Model: Final Version
• The Web: a directed graph
– Vertices (pages)
– Edges (links)
[Figure: example directed graph over pages a to f]
Input Structure
• 31.5 million edges
• 960,109 nodes
• Line format: document-with-link document-linked
Step 0. Start Downloading Datasets
• http://aidanhogan.com/teaching/cc5212-1/mdp-lab9-data/
– page_links_es_f.tsv.gz
– wiki_abstracts_es.tsv.gz
– http://aidanhogan.com/teaching/cc5212-1/mdp-lab9.zip
Step 1. Dictionary Encode Links
• Strings are difficult to fit in memory
• Encode strings as OIDs (object ids = integers)
• Input line: http://es.wikipedia.org/wiki/Ciencia_ficción http://es.wikipedia.org/wiki/Robot
• Output line: 12039 52673
• Dictionary:
– 12039 http://es.wikipedia.org/wiki/Ciencia_ficción
– …
– 52673 http://es.wikipedia.org/wiki/Robot
– …
• OIDCompress: -i [folder]/page_links_es_f.tsv.gz -igz -o [folder]/page_links_es_f.oid.gz -ogz -d [folder]/page_links_es_f.dict.gz -dgz
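The encoding idea in Step 1 can be sketched in a few lines of Java. This is a minimal illustration, not the lab's OIDCompress: the class name DictionaryEncodeSketch, its methods, and the choice to number OIDs from 0 in order of first appearance are all assumptions made for the example, so the integers will differ from the ones shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of dictionary encoding; not the lab's OIDCompress class.
public class DictionaryEncodeSketch {
    // Maps each distinct string to an integer OID, in order of first appearance.
    private final Map<String, Integer> dict = new LinkedHashMap<>();

    // Returns the OID for a string, assigning the next free integer if it is new.
    public int encode(String s) {
        Integer oid = dict.get(s);
        if (oid == null) {
            oid = dict.size(); // assumption: OIDs start at 0 and grow by 1
            dict.put(s, oid);
        }
        return oid;
    }

    // Encodes one "source<TAB>target" line as "sourceOid<TAB>targetOid".
    public String encodeLine(String line) {
        String[] cols = line.split("\t");
        return encode(cols[0]) + "\t" + encode(cols[1]);
    }

    // Exposes the string-to-OID dictionary so it can be written out for decoding.
    public Map<String, Integer> dictionary() {
        return dict;
    }

    public static void main(String[] args) {
        DictionaryEncodeSketch enc = new DictionaryEncodeSketch();
        String a = "http://es.wikipedia.org/wiki/Ciencia_ficción";
        String b = "http://es.wikipedia.org/wiki/Robot";
        System.out.println(enc.encodeLine(a + "\t" + b));
        System.out.println(enc.dictionary());
    }
}
```

Each URL is stored once in the dictionary, and the 31.5 million edges then need only two integers each, which is what makes the graph fit in memory.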
Step 2. Copy PageRank Code
• Copy PageRankGraph.java from mdp-lab8 to mdp-lab9 (same package)
– Use your code to be marked on it!
– Marked from 20 for this lab
• If you weren’t here last week, copy PageRankGraph.java from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
– Marked from 10 for this lab
Step 3. Rank and sort full data
• Run ranking (PageRankGraph.java)
– 50 iterations: ITERS = 50
– Arguments: -i [folder]/page_links_es_f.oid.gz -igz -o [folder]/page_ranks_es_f.oid.tsv.gz -ogz
• Sort ranks by rank score (SortByRank.java)
– Arguments: -i [folder]/page_ranks_es_f.oid.tsv.gz -igz -o [folder]/page_ranks_es_f_s.oid.tsv.gz -ogz
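For reference, the ranking step can be sketched as a plain power iteration over OID-encoded edges. This is only an illustrative sketch, not the PageRankGraph.java from the lab: the class name PageRankSketch, the damping factor of 0.85, and the even redistribution of rank from pages without out-links are assumptions that the slides do not confirm.

```java
import java.util.Arrays;

// Hypothetical sketch of iterative PageRank; not the lab's PageRankGraph.java.
public class PageRankSketch {
    // edges[k] = {source, target}; nodes are OIDs 0..n-1; d is the damping factor.
    public static double[] rank(int n, int[][] edges, int iters, double d) {
        int[] outDegree = new int[n];
        for (int[] e : edges) {
            outDegree[e[0]]++;
        }

        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution

        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            // Assumption: rank held by pages with no out-links is shared by all pages.
            double dangling = 0.0;
            for (int v = 0; v < n; v++) {
                if (outDegree[v] == 0) {
                    dangling += rank[v];
                }
            }
            // Each page passes its rank in equal parts along its out-links.
            for (int[] e : edges) {
                next[e[1]] += rank[e[0]] / outDegree[e[0]];
            }
            // Mix the link-following share with the random-jump share.
            for (int v = 0; v < n; v++) {
                next[v] = (1 - d) / n + d * (next[v] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        int[][] edges = { {0, 1}, {1, 2}, {2, 0}, {3, 0} };
        System.out.println(Arrays.toString(rank(4, edges, 50, 0.85)));
    }
}
```

In the toy graph, page 3 has no incoming links and so ends up with only the random-jump share, while page 0 collects rank from two pages. The scores always sum to 1, which is a handy sanity check when running on the full data.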
Step 4. Make Predictions & Bets
Which will be the highest ranked articles in Wikipedia according to PageRank?
Step 5. Decode the ranks
• Decode the file (OIDDecompress.java)
– Arguments: -d [folder]/page_links_es_f.dict.gz -dgz -i [folder]/page_ranks_es_f_s.oid.tsv.gz -igz -n 0 -o [folder]/page_ranks_es_f_s.tsv
• Open the output in a text editor and have a look
Step 6. Copy Inverted Index Code
• Copy IndexTitleAndAbstract.java and SearchIndex.java from mdp-lab7 into mdp-lab9 (if you were here)
• Otherwise grab them from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
Step 7. Rebuild Inverted Index
• IndexTitleAndAbstract.java
– Arguments: -i [folder]/wiki_abstracts_es.tsv.gz -igz -o [folder]/es_wiki_index/
• Try searches using SearchIndex.java
– Copy the top 10 results for 5 searches, including ‘obama’ and ‘universidad’, into a text file somewhere
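As a reminder of what an inverted index does, here is a minimal in-memory sketch in plain Java. It is not the lab's IndexTitleAndAbstract.java or SearchIndex.java, and it does not use any indexing library: the class InvertedIndexSketch, its tokenisation rule, and the sample documents are assumptions made up for illustration, and results are returned in insertion order with no relevance scoring.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index; not the lab's indexing code.
public class InvertedIndexSketch {
    // term -> sorted ids of the documents that contain it
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private final List<String> titles = new ArrayList<>();

    // Indexes the words of the title and abstract; returns the new document id.
    public int add(String title, String abstractText) {
        int docId = titles.size();
        titles.add(title);
        String text = (title + " " + abstractText).toLowerCase(Locale.ROOT);
        // Assumption: a term is any run of letters or digits.
        for (String term : text.split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the titles of all documents containing the keyword.
    public List<String> search(String keyword) {
        List<String> hits = new ArrayList<>();
        TreeSet<Integer> ids = postings.get(keyword.toLowerCase(Locale.ROOT));
        if (ids != null) {
            for (int docId : ids) {
                hits.add(titles.get(docId));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add("Barack Obama", "Presidente de los Estados Unidos");
        index.add("Universidad de Chile", "Universidad pública");
        System.out.println(index.search("obama"));
    }
}
```

A lookup touches only the posting list of the query term, so search time does not depend on how many abstracts were indexed.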
Step 8. Add in the boost values
• Open BoostRanks.java
• Follow the board to code
• Run: -o [folder]/es_wiki_index/ -i [folder]/page_ranks_es_f_s.tsv
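The effect of a boost on result order can be sketched independently of the index itself. The code below is not BoostRanks.java and does not reproduce what was written on the board: it assumes, purely for illustration, that each result's text score is multiplied by a per-document boost derived from its PageRank, and that documents without a known rank get a neutral boost of 1.0. The class BoostSketch and the method rerank are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of boosted re-ranking; not the lab's BoostRanks.java.
public class BoostSketch {
    // Orders titles by textScore * boost, highest first.
    // Assumption: titles with no boost entry keep their text score (boost 1.0).
    public static List<String> rerank(Map<String, Double> textScores,
                                      Map<String, Double> boosts) {
        List<String> titles = new ArrayList<>(textScores.keySet());
        titles.sort(Comparator.comparingDouble(
                (String t) -> textScores.get(t) * boosts.getOrDefault(t, 1.0))
                .reversed());
        return titles;
    }

    public static void main(String[] args) {
        Map<String, Double> textScores = new HashMap<>();
        textScores.put("A", 2.0);
        textScores.put("B", 1.5);
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("B", 3.0); // B is a highly ranked page in this made-up example
        System.out.println(rerank(textScores, boosts));
    }
}
```

Here B overtakes A because its boosted score is 1.5 × 3.0 = 4.5 against 2.0, which is the kind of reordering to look for when comparing the boosted and unboosted query results.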
Step 9. Profit
• Re-run the same five queries as before over the boosted index and see if the results improve
• http://www.lucenetutorial.com/lucene-query-syntax.html
Course Marking
• 45% for Weekly Labs (~3% a lab!)
• 35% for Final Exam
• 20% for Small Class Project
Class Project
• Done in pairs (except Alejandro/Mauricio :P)
• Goal: Use what you’ve learned to do something cool (basically)
• Expected difficulty: More than a lab’s worth
– But from scratch / without my help!
• Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
– Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
• Process:
– Pair up (default random) by Wednesday
– Decide on a topic (by June 9th) or let me assign one
– If you need data or get stuck, I will (try to) help out
• Deliverables: 10 minute presentation (June 23rd) & 4-page report
– 2 weeks!
Groups
Pairings:
• Catalina Espinoza and Felipe Quintanilla
• Eduardo Acha and Jaime Salas
• Francisca Concha and Nicolás Miranda
Lone agents:
• Alejandro Infante
• Mauricio Quezada
Topics
Let’s talk topics
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
– Mauricio Quezada
• What’s the idea?
• What will be the result of your project?
• How much data will you process/where will you source it?
• Which techniques from the class will you use?
• How cool is it?