PageRank + Inverted Index

PageRank + Inverted Index. A Search Engine “obama”

A Search Engine

“obama”

PageRank Model: Final Version

• The Web: a directed graph
– Vertices (pages)
– Edges (links)

[Figure: example directed graph over six pages labelled a to f]

Input Structure

• 31.5 million edges
• 960,109 nodes
• Columns: document-with-link, document-linked

Step 0. Start Downloading Datasets

• http://aidanhogan.com/teaching/cc5212-1/mdp-lab9-data/
– page_links_es_f.tsv.gz
– wiki_abstracts_es.tsv.gz
• http://aidanhogan.com/teaching/cc5212-1/mdp-lab9.zip

Step 1. Dictionary Encode Links

• Strings are difficult to fit in memory
• Encode strings as OIDs (object ids = integers)
• Input line: http://es.wikipedia.org/wiki/Ciencia_ficción http://es.wikipedia.org/wiki/Robot
• Output line: 12039 52673
• Dictionary:
12039 http://es.wikipedia.org/wiki/Ciencia_ficción
…
52673 http://es.wikipedia.org/wiki/Robot
…
• Run OIDCompress with arguments: -i [folder]/page_links_es_f.tsv.gz -igz -o [folder]/page_links_es_f.oid.gz -ogz -d [folder]/page_links_es_f.dict.gz -dgz
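
The dictionary-encoding idea in this step can be illustrated with a minimal Python sketch. It is not the lab's OIDCompress code: the function name `encode_links`, the tab-separated input, and the policy of handing out OIDs in first-seen order starting at 0 are all assumptions made for illustration (the real tool clearly assigns ids differently, e.g. 12039).

```python
def encode_links(lines):
    """Replace each URL in tab-separated link lines with an integer OID.

    Assumptions (not from the lab code): two tab-separated columns per
    line, and OIDs assigned in first-seen order starting at 0.
    """
    oid_of = {}    # URL string -> integer OID (the dictionary)
    encoded = []   # one (source OID, target OID) pair per input line
    for line in lines:
        src, dst = line.rstrip("\n").split("\t")
        pair = []
        for url in (src, dst):
            if url not in oid_of:
                oid_of[url] = len(oid_of)  # next unused integer
            pair.append(oid_of[url])
        encoded.append(tuple(pair))
    return encoded, oid_of
```

Each distinct URL string is stored once in the dictionary, and the 31.5 million edges then only need two integers each.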

Step 2. Copy PageRank Code

• Copy PageRankGraph.java from mdp-lab8 to mdp-lab9 (same package)
– Use your own code so that you are marked on it!
– Marked out of 20 for this lab
• If you weren’t here last week, copy PageRankGraph.java from http://aidanhogan.com/cc5212-1/mdp-lab9-data/
– Marked out of 10 for this lab

Step 3. Rank and sort full data

• Run ranking (PageRankGraph.java)
– 50 iterations: ITERS = 50
– Arguments: -i [folder]/page_links_es_f.oid.gz -igz -o [folder]/page_ranks_es_f.oid.tsv.gz -ogz
• Sort ranks by rank score (SortByRank.java)
– Arguments: -i [folder]/page_ranks_es_f.oid.tsv.gz -igz -o [folder]/page_ranks_es_f_s.oid.tsv.gz -ogz
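
As a rough illustration of what the ranking and sorting steps compute, here is a small Python sketch of iterative PageRank over OID edge pairs. It is not PageRankGraph.java or SortByRank.java: the damping factor of 0.85, the even redistribution of rank from pages without out-links, and the function names are assumptions, since the slides do not spell out the "final version" of the model.

```python
def pagerank(edges, iters=50, d=0.85):
    """Iterative PageRank over (source, target) OID pairs.

    Assumptions: damping factor d = 0.85, and the rank of pages with
    no out-links is spread evenly over all pages.
    """
    nodes = sorted({n for edge in edges for n in edge})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from a uniform distribution
    for _ in range(iters):
        dangling = sum(rank[v] for v in nodes if not out[v])
        base = (1.0 - d) / n + d * dangling / n
        nxt = {v: base for v in nodes}
        for v in nodes:
            if out[v]:
                share = d * rank[v] / len(out[v])  # split rank over out-links
                for dst in out[v]:
                    nxt[dst] += share
        rank = nxt
    return rank


def sort_by_rank(rank):
    """Return (OID, score) pairs, highest score first."""
    return sorted(rank.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `sort_by_rank(pagerank([(0, 1), (1, 2), (2, 0), (3, 0)]))` puts page 0 first, since it is the only page with two in-links, and page 3 last, since nothing links to it.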

Step 4. Make Predictions & Bets

Which will be the highest ranked articles in Wikipedia according to PageRank?

Step 5. Decode the ranks

• Decode the file (OIDDecompress.java)
– Arguments: -d [folder]/page_links_es_f.dict.gz -dgz -i [folder]/page_ranks_es_f_s.oid.tsv.gz -igz -n 0 -o [folder]/page_ranks_es_f_s.tsv

• Open the output in a text editor and have a look
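
Decoding is the reverse lookup: each OID in the sorted rank file is replaced by its URL from the dictionary. The sketch below is only an illustration, not OIDDecompress.java. It assumes tab-separated "OID, URL" dictionary lines and "OID, score" rank lines, and it guesses that `-n 0` selects the column to decode; the function name and the `column` parameter are hypothetical.

```python
def decode_ranks(rank_lines, dict_lines, column=0):
    """Replace the OID in one column of each rank line with its URL.

    Assumptions: dictionary lines are "OID<TAB>URL", rank lines are
    tab-separated, and `column` plays the role guessed for `-n 0`.
    """
    url_of = {}
    for line in dict_lines:
        oid, url = line.rstrip("\n").split("\t", 1)
        url_of[oid] = url
    decoded = []
    for line in rank_lines:
        fields = line.rstrip("\n").split("\t")
        fields[column] = url_of[fields[column]]  # OID -> original string
        decoded.append("\t".join(fields))
    return decoded
```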

Step 6. Copy Inverted Index Code

• Copy IndexTitleAndAbstract.java and SearchIndex.java from mdp-lab7 into mdp-lab9 (if you were here)

• Otherwise grab them from http://aidanhogan.com/cc5212-1/mdp-lab9-data/

Step 7. Rebuild Inverted Index

• Run IndexTitleAndAbstract.java
– Arguments: -i [folder]/wiki_abstracts_es.tsv.gz -igz -o [folder]/es_wiki_index/
• Try searches using SearchIndex.java
– Copy the top 10 results for 5 searches, including ‘obama’ and ‘universidad’, into a text file somewhere
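
To make the data structure concrete, here is a toy inverted index in Python. It is a sketch of the general idea only: the lab's IndexTitleAndAbstract.java and SearchIndex.java are not shown in these slides and presumably build a Lucene index with a more refined scoring function. The function names, the raw term-frequency score, and the example documents used in the note below are all made up for illustration.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter or digit."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(docs):
    """docs: title -> abstract. Returns term -> {title: term frequency}."""
    index = defaultdict(dict)
    for title, abstract in docs.items():
        for term in tokenize(title + " " + abstract):
            index[term][title] = index[term].get(title, 0) + 1
    return index


def search(index, query, k=10):
    """Score each document by the summed frequency of the query terms."""
    scores = defaultdict(int)
    for term in tokenize(query):
        for title, tf in index.get(term, {}).items():
            scores[title] += tf
    # Highest score first; ties broken alphabetically by title.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

With two toy documents, `{"Barack Obama": "Barack Obama es un político", "Universidad de Chile": "La Universidad de Chile es una universidad pública"}`, a search for "obama" returns only the first title and a search for "universidad" only the second.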

Step 8. Add in the boost values

• Open BoostRanks.java
• Follow the board to code
• Run: -o [folder]/es_wiki_index/ -i [folder]/page_ranks_es_f_s.tsv
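
The code for this step is given on the board, not in the slides, so the following Python sketch shows just one plausible way to combine a text-relevance score with a PageRank boost. The formula `score * (1 + rank / max_rank)`, the function name, and the parameters are assumptions for illustration and may differ from what BoostRanks.java does.

```python
def boosted_results(text_scores, page_rank, k=10):
    """Re-rank documents by multiplying each text score by a rank boost.

    Assumptions: boost = 1 + rank / max_rank, documents without a
    PageRank score get boost 1, and page_rank is non-empty.
    """
    top = max(page_rank.values())
    results = []
    for doc, score in text_scores.items():
        boost = 1.0 + page_rank.get(doc, 0.0) / top
        results.append((doc, score * boost))
    # Highest boosted score first; ties broken by document name.
    results.sort(key=lambda kv: (-kv[1], kv[0]))
    return results[:k]
```

Two documents that tie on text score are then separated by their PageRank, which is the effect the re-run queries in the next step are meant to reveal.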

Step 9. Profit

• Re-run the same five queries as before over the boosted index and see if the results improve
• http://www.lucenetutorial.com/lucene-query-syntax.html

Course Marking

• 45% for Weekly Labs (~3% a lab!)
• 35% for Final Exam
• 20% for Small Class Project

Class Project

• Done in pairs (Except Alejandro/Mauricio :P)
• Goal: Use what you’ve learned to do something cool (basically)
• Expected difficulty: More than a lab’s worth
– But from scratch / without my help!
• Marked on: Difficulty, appropriateness, scale, good use of techniques, presentation, coolness
– Ambition is appreciated, even if you don’t succeed: feel free to bite off more than you can chew!
• Process:
– Pair up (default random) by Wednesday
– Decide on a topic (by June 9th) or let me assign one
– If you need data or get stuck, I will (try to) help out
• Deliverables: 10 minute presentation (June 23rd) & 4-page report
– 2 weeks!

Groups

Pairings:
• Catalina Espinoza and Felipe Quintanilla
• Eduardo Acha and Jaime Salas
• Francisca Concha and Nicolás Miranda

Lone agents:
• Alejandro Infante
• Mauricio Quezada

Topics

Let’s talk topics
– Catalina Espinoza and Felipe Quintanilla
– Eduardo Acha and Jaime Salas
– Francisca Concha and Nicolás Miranda
– Mauricio Quezada

• What’s the idea?
• What will be the result of your project?
• How much data will you process/where will you source it?
• Which techniques from the class will you use?
• How cool is it?