Search Bootstrapping: How / Where to get started

Crawling

• Start with Nutch
  – http://nutch.apache.org/

• Index directly to SOLR
  – http://www.lucidimagination.com/blog/2010/09/10/refresh-using-nutch-with-solr/

• Create a seed list from DMOZ rdf
  – http://www.dmoz.org/rdf.html
  – http://wiki.apache.org/nutch/NutchTutorial
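
The seed-list step can be sketched in a few lines of Python. This is only an illustration, not the tooling described in the Nutch tutorial: it assumes the DMOZ content dump marks each page with an `<ExternalPage about="...">` element, and that the crawler reads a plain text file with one URL per line. The function names, the `keep_one_in` sampling parameter, and the output path are hypothetical.

```python
import random
import re
from typing import Iterable, Iterator, List

# Assumption: each page in the DMOZ content dump appears as
# <ExternalPage about="http://...">, one opening tag per line.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]+)"')


def iter_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield every URL found in an ExternalPage opening tag."""
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            yield match.group(1)


def build_seed_list(lines: Iterable[str], keep_one_in: int = 1, seed: int = 0) -> List[str]:
    """Return unique URLs, keeping roughly one in `keep_one_in` of them.

    keep_one_in=1 keeps everything; larger values shrink the crawl.
    """
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    seen = set()
    seeds = []
    for url in iter_urls(lines):
        if url in seen:
            continue
        seen.add(url)
        if rng.randrange(keep_one_in) == 0:
            seeds.append(url)
    return seeds


def write_seed_file(seeds: Iterable[str], path: str) -> None:
    """Write one URL per line (hypothetical output location)."""
    with open(path, "w", encoding="utf-8") as out:
        for url in seeds:
            out.write(url + "\n")
```

Streaming the dump line by line keeps memory use low for a large file; for example, `build_seed_list(open("content.rdf.u8", encoding="utf-8"), keep_one_in=5000)` would sample a small starting set, assuming the dump has been downloaded under that name.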


Understanding Content

• Entity Extraction
  – LingPipe http://alias-i.com/lingpipe/
  – OpenNLP http://incubator.apache.org/opennlp/

• Entity Identification / Taxonomies
  – Freebase http://www.freebase.com/


Some Additional Links

• Basic Web Page Parser
  – https://github.com/pjaol/Webcrawler

• Example of OpenNLP usage
  – https://github.com/pjaol/entity_extractor
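
To give a feel for what a basic web page parser does, here is a minimal sketch using only the Python standard library. It is not the code in the linked repository; the class and function names are hypothetical, and it assumes the page is already fetched as a string. It collects visible text and resolves outgoing links against the page URL.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect visible text and absolute link URLs from one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []
        self.text_parts: List[str] = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.text_parts.append(text)


def parse_page(html: str, base_url: str) -> Tuple[str, List[str]]:
    """Return (visible text, list of absolute links) for an HTML string."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.links
```

The extracted text is what would be handed to indexing or entity extraction, while the links feed the next round of crawling.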