ProjFocusedCrawler CS5604 Information Storage and Retrieval, Fall 2012 Virginia Tech December 4,...

ProjFocusedCrawler
CS5604 Information Storage and Retrieval, Fall 2012

Virginia Tech
December 4, 2012

Mohamed M. G. Farag
Mohammed Saquib Khan
Prasad Krishnamurthi Ganesh
Gaurav Mishra

Outline

• Project description
• Deliverables
• Architecture
• Roles
• Progress
• Milestones
• Problems and challenges

Project Description (from course Scholar webpage)

• CTRnet uses Heritrix
– Quality of the seeds, and configuration details

• Focused crawlers
– use topic information (which links to follow)
– reduce noise, and increase precision
– may reduce recall

• Goal
– improve upon existing solutions
– demonstrate effectiveness w.r.t. CTRnet efforts

Deliverables (from course Scholar webpage)

• Design and implementation of a prototype focused crawler
• New collection for CTRnet built using crawler
• Software, data, report, and future plans
• Publication about this research, and content for NSF proposal

Architecture

Roles

• Mohamed M. G. Farag:
– Building classifier
– Preparing training data (lead)
– Integrating different components

• Mohammed Saquib Khan:
– Building classifier (lead)
– Preparing training data
– Comparing different classifiers

Roles (cont’d)

• Prasad Krishnamurthi Ganesh:
– Preprocessing (lead)
– Preparing training data
– Evaluation

• Gaurav Mishra:
– Preprocessing
– Preparing training data
– Building documents' TF-IDF vectors (lead)

Preparing training data

• WARC files extraction
• Sikkim earthquake collection
– 19 seed URLs
– ~2000 HTML files out of 9000 files

• Filtering
– Keywords, selected manually
– A document is relevant if it contains k or more of the keywords
– k = 1: ~50% relevance (high recall, low precision)
– We used k = 5
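The keyword filter above can be sketched as follows. This is a minimal illustration, not the project's code: the keyword list is hypothetical, and counting distinct keywords (rather than total occurrences) is an assumption, since the slides do not say which was used.

```python
import re

# Hypothetical keyword list; the slides only say keywords were selected manually.
KEYWORDS = ["sikkim", "earthquake", "quake", "tremor",
            "magnitude", "epicenter", "rescue", "victims"]


def count_keyword_hits(text, keywords):
    """Count how many distinct keywords occur in the text (case-insensitive)."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1 for kw in keywords if kw.lower() in tokens)


def is_relevant(text, keywords, k=5):
    """Label a document relevant if it contains k or more of the keywords."""
    return count_keyword_hits(text, keywords) >= k


doc = "Sikkim earthquake: rescue teams reach victims near the epicenter"
print(is_relevant(doc, KEYWORDS, k=5))
```

A lower k admits more documents (higher recall, lower precision); a higher k trades recall for cleaner training labels.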

Problems faced

• WARC extraction
– Files saved on disk, original URL, Wayback Machine URL
– Encoding problems in the files saved on disk
– Original URLs not working (page not found)
– Wayback Machine URLs: webpages have extra content that needs to be parsed and removed

Modular design

Input representation

• Seed pages
– Vector space model
– Normalized TF weighting scheme
  • Problem of using IDF of relevant documents only

• Training and testing data set
– Features, vector space model
– TF-IDF weighting scheme
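The two weighting schemes can be sketched as below. This is an illustration under stated assumptions, not the project's code: normalizing TF by the count of the most frequent term and using log(N/df) for IDF are assumed variants, since the slides do not specify them. The example also shows one plausible reading of the IDF problem: when IDF is computed over relevant documents only, a topic term that occurs in every document gets a weight of zero.

```python
import math
from collections import Counter


def normalized_tf(tokens):
    """Term frequency divided by the count of the most frequent term (assumed variant)."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


def idf(corpus):
    """IDF = log(N / df) over a corpus given as a list of token lists (assumed variant)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens, idf_weights):
    """TF-IDF vector as a sparse dict; terms unseen in the corpus get weight 0."""
    return {term: tf * idf_weights.get(term, 0.0)
            for term, tf in normalized_tf(tokens).items()}


# A corpus of relevant documents only: the topic term appears in every one.
relevant_docs = [["earthquake", "sikkim"], ["earthquake", "rescue"]]
weights = idf(relevant_docs)
print(weights["earthquake"])  # the topic term is weighted 0.0
```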

Crawling

• Get a URL from priority queue
• Check if URL is:
– Visited (i.e., its webpage is already downloaded)
– In queue (i.e., no need to put it in the queue again)
• Download
• Extract text and URLs
• Estimate relevance
• Put URLs in priority queue
• Stop if queue is empty or page limit is reached
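The loop above can be sketched with a heap-based priority queue. This is a minimal sketch, not the project's crawler: `fetch` and `score` are hypothetical callables supplied by the caller, and giving each extracted URL the relevance of the page it was found on is an assumption.

```python
import heapq


def crawl(seeds, fetch, score, page_limit=100):
    """Best-first crawl.

    fetch(url) -> (text, links) and score(text) -> float are hypothetical
    callables injected by the caller. Returns (url, relevance) pairs in
    download order.
    """
    seeds = list(dict.fromkeys(seeds))          # drop duplicate seeds, keep order
    queue = [(-1.0, url) for url in seeds]      # negate: heapq pops the smallest
    heapq.heapify(queue)
    in_queue = set(seeds)
    visited = set()
    downloaded = []

    while queue and len(downloaded) < page_limit:
        _, url = heapq.heappop(queue)
        in_queue.discard(url)
        visited.add(url)

        text, links = fetch(url)                # download, extract text and URLs
        relevance = score(text)                 # estimate relevance
        downloaded.append((url, relevance))

        for link in links:
            if link in visited or link in in_queue:
                continue                        # already downloaded or already queued
            in_queue.add(link)
            heapq.heappush(queue, (-relevance, link))

    return downloaded


# Toy in-memory "web" so the sketch runs without network access.
web = {
    "a": ("earthquake report", ["b", "c"]),
    "b": ("sports news", ["a"]),
    "c": ("earthquake rescue", ["d", "b"]),
    "d": ("earthquake relief", []),
}
result = crawl(["a"], lambda url: web[url],
               lambda text: 1.0 if "earthquake" in text else 0.0)
print([url for url, _ in result])
```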

Text extraction problems

• Extract the visible text of a webpage
• Heuristic approaches
• Script and comment tags remain after extraction (need explicit manipulation)
• Invalid HTML tags (missing closing brackets)
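One way to handle the script and comment residue is sketched below using Python's standard `html.parser`. This is a swapped-in illustration, not the extraction method used in the project; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # HTML comments go to handle_comment, so they never reach this method.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


page = ("<html><head><style>p{color:red}</style>"
        "<script>var x = 1;</script></head>"
        "<body><!-- hidden --><p>Earthquake in <b>Sikkim</b></p></body></html>")
print(extract_visible_text(page))
```

The standard-library parser is lenient about unclosed elements, but badly malformed markup may still need cleanup before parsing.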

How to Crawl

• Three ways:
– URLs only: anchor text and address (doesn’t describe the topic of target webpage)
– Webpage text only: time and space, all URLs get same score, many non-relevant URLs
– Hybrid: if webpage is relevant, use relevant URLs
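The hybrid strategy can be sketched as a two-stage filter. This is an assumption-laden illustration: the slides do not say how link relevance was computed, so link scoring here simply counts topic keywords in the anchor text and URL, and the function names, keyword set, and thresholds are hypothetical.

```python
import re


def score_link(anchor_text, url, keywords):
    """Count topic keywords found in the anchor text or the URL (assumed heuristic)."""
    tokens = set(re.findall(r"[a-z0-9]+", (anchor_text + " " + url).lower()))
    return sum(1 for kw in keywords if kw in tokens)


def select_links(page_score, links, keywords, page_threshold=0.1, min_hits=1):
    """Hybrid selection: follow links only from relevant pages, and only relevant links.

    links is a list of (anchor_text, url) pairs.
    """
    if page_score < page_threshold:
        return []                               # page not relevant: follow nothing
    return [url for anchor, url in links
            if score_link(anchor, url, keywords) >= min_hits]


topic = {"egypt", "protests", "revolution"}     # hypothetical topic keywords
links = [
    ("Egypt protests live", "http://example.com/egypt-protests"),
    ("Football scores", "http://example.com/sport"),
]
print(select_links(0.5, links, topic))
```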

Relevance estimation

• TF-IDF/cosine similarity score
– Assumed relevant if score is more than a threshold

• Classifier
– Naïve Bayes
– Support Vector Machine (SVM)
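The similarity-based estimate can be sketched over sparse vectors stored as dicts. This is a minimal sketch: the function names are hypothetical, and the default threshold of 0.1 is taken from the Egyptian revolution run reported in the results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0                              # empty vector: treat as no similarity
    return dot / (norm_a * norm_b)


def is_relevant_by_similarity(page_vector, topic_vector, threshold=0.1):
    """A page is assumed relevant if its similarity to the topic exceeds the threshold."""
    return cosine_similarity(page_vector, topic_vector) > threshold


page_vec = {"egypt": 1.0, "protest": 1.0}
topic_vec = {"egypt": 1.0}
print(is_relevant_by_similarity(page_vec, topic_vec))
```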

Evaluation

• Precision
– Fraction of downloaded webpages that are relevant

• Classifier evaluation
– Performance on test data
– Cross validation, best parameter values

• Performance
– Ordering of URLs in priority queue
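As a worked example of the precision measure, the sketch below divides the number of relevant downloads by the total number of downloads, using the counts reported for the Egyptian revolution run (633 relevant out of 1208). The function name is hypothetical.

```python
def precision(relevant_downloaded, total_downloaded):
    """Fraction of downloaded pages that are relevant; 0.0 if nothing was downloaded."""
    if total_downloaded == 0:
        return 0.0
    return relevant_downloaded / total_downloaded


print(round(precision(633, 1208), 2))
```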

Results

• Egyptian revolution, threshold = 0.1
• Precision = 633/1208 = 0.52

URL  Score
http://botw.org/top/Regional/Africa/Egypt/Society_and_Culture/Politics/Protests_2011/  1
http://live.reuters.com/Event/Unrest_in_Egypt?Page=0  1
http://www.aljazeera.com/indepth/spotlight/anger-in-egypt/  1
http://www.guardian.co.uk/world/series/egypt-protests  1
http://www.huffingtonpost.com/2012/06/24/egypt-uprising-election-timeline_n_1622773.html  1
http://www.washingtonpost.com/wp-srv/world/special/egypt-transition-timeline/index.html  1
http://www.guardian.co.uk/news/blog/2011/feb/09/egypt-protests-live-updates-9-february  0.52428340
http://www.guardian.co.uk/world/blog/2011/feb/05/egypt-protests  0.50552904
http://www.guardian.co.uk/world/blog/2011/feb/11/egypt-protests-mubarak  0.50212776
http://www.guardian.co.uk/news/blog/2011/feb/08/egypt-protests-live-updates  0.47775149

• Without feature selection, k = 5 for filtering
– Accuracy = 0.845132743363

• With feature selection (Chi-square) and k = 5 for filtering
– Accuracy = 0.898230088496
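Chi-square feature selection ranks each term by how strongly its presence depends on the class label. The slides do not say how the statistic was computed in the project; the sketch below uses the standard 2x2 contingency-table formula, and the function and argument names are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one term from a 2x2 contingency table.

    n11: relevant documents containing the term
    n10: non-relevant documents containing the term
    n01: relevant documents without the term
    n00: non-relevant documents without the term
    """
    n = n11 + n10 + n01 + n00
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denominator == 0:
        return 0.0                              # a row or column is empty: no evidence
    return n * (n11 * n00 - n10 * n01) ** 2 / denominator


# A term that appears in every relevant document and in no non-relevant one.
print(chi_square(10, 0, 0, 10))
```

Terms with the highest scores would be kept as features; a term that is equally common in both classes scores 0.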

Naïve Bayes

• Same result with and without feature selection, k = 5 for filtering
– Accuracy = 0.845132743363

Future work

• Domain ontology
– Ontology-based relevance estimation
– Concepts comparison

• Tunneling
– Non-relevant webpages can lead to relevant ones

Milestones

Milestone                          Date
Preparing training data            Oct. 15
Building general crawler           Oct. 30
Building classifiers               Nov. 15
Integrating components             Nov. 30
Testing and evaluation             Dec. 5
Completing report and prototype    Dec. 11

Thank you!

Questions?
