8/8/2019 Crawling, Indexing, and Searching Software Project Data with Droids, Tika, Solr & friends
http://slidepdf.com/reader/full/crawling-indexing-and-searching-software-project-data-with-droids-tika 1/30
Copyright 2010 Sematext Int'l. All rights reserved.
ProjectHub
Crawling, Indexing, and Searching Software Project Data
with Droids, Tika, Solr & friends
Otis Gospodnetić otis@sematext.com @otisg
Sematext Int'l www.sematext.com @sematext
What I Will Cover
Who I am
What, Why, Where
Architecture
Info Gathering & Indexing
Search & Extra Search Dog Food
Performance & Analytics
Ops & Stats
About Otis Gospodnetić
Lucene/Solr/Nutch/Mahout/... committer
Lucene in Action 1 & 2 co-author
Lucene Consulting since 2005
Sematext International since 2007
About Sematext
Search (Lucene, Solr, Elastic Search...)
Web Crawling (Nutch)
Machine Learning (Mahout)
Big Data (Hadoop, HBase, Voldemort...)
What
Search everything about a Software Project
Lucene & Hadoop
- All sub-projects
- All content
Mailing list archives
JIRA issues
Web site & Wiki pages
Source code (local syntax highlighting), trunk
Javadoc, trunk
Why
We need it
Other Hadoop, Lucene, Solr... users need it
Our own playground
Live product demos
Yummy dog food
Where
search-lucene.com
search-hadoop.com
Other suggestions / needs?
In your Enterprise?
Architecture
Tool Matrix
Data Source   Fetch                                                               Parse
JIRA          URLConnection (feed)                                                Digester (feed), DOM (item)
ML            FileInputStream (fs), URLConnection (feed), Droids (works, unused)  Digester (feed), MIME4J (mbox)
Web site      Droids                                                              Tika via Droids
Wiki          Droids                                                              Tika via Droids
Source code   svn co                                                              QDox
Javadoc       svn co                                                              QDox
Information Gathering
Multiple independent JVM processes (cron)
Different polling frequencies
Different data sources / formats:
- RSS (JIRA, Mailing Lists)
- Mbox (Mailing Lists)
- HTTP/HTML (Web site, Wiki)
- Subversion (source code, Javadoc)
Nutch is a beast. Droids is light & simple.
ML thread detection is tricky
Finding deleted docs (Wiki, Web, Javadoc...)
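One plausible way to find deleted documents, in the spirit of the "track seen/unseen" note on the Indexing slide, is mark-and-sweep: stamp every document touched by a crawl run, then remove whatever that run never saw. The sketch below is only an illustration under that assumption. A plain dict stands in for the Solr index, and the names `crawl_run`, `sweep_unseen`, and `last_seen_run` are hypothetical, not taken from the talk.

```python
def crawl_run(index, fetched_docs, run_id):
    """Upsert every fetched document and stamp it with the current run id."""
    for doc_id, content in fetched_docs.items():
        index[doc_id] = {"content": content, "last_seen_run": run_id}


def sweep_unseen(index, run_id):
    """Delete documents the given run did not touch; return their ids, sorted."""
    stale = sorted(
        doc_id for doc_id, doc in index.items() if doc["last_seen_run"] != run_id
    )
    for doc_id in stale:
        del index[doc_id]
    return stale


# Hypothetical usage: wiki/B disappears from the site between run 1 and run 2.
index = {}
crawl_run(index, {"wiki/A": "page A", "wiki/B": "page B"}, run_id=1)
crawl_run(index, {"wiki/A": "page A, edited"}, run_id=2)
removed = sweep_unseen(index, run_id=2)
```

A sweep like this is only safe after a run that completed; sweeping after a partial crawl would delete documents that still exist.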
Thread Detection
Email clients are kaput
SMTP headers are unreliable
Heuristics are needed
- Try headers
- Fall back to subjects (get subject skeleton, calculate hash)
- Factor in time (4 weeks)
- Use index for thread info retrieval
Q: Are there any libraries for this?
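The heuristics above can be sketched roughly as follows. This is a minimal illustration, not the ProjectHub code: the prefix rules in `LEADING_NOISE`, the message dict layout, and the two lookup dicts (standing in for "use index for thread info retrieval") are all assumptions made for the example.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Assumed noise at the start of a subject: reply/forward markers and [list] tags.
LEADING_NOISE = re.compile(r"^\s*(?:(?:re|fwd?|aw)\s*:|\[[^\]]*\])\s*")
THREAD_WINDOW = timedelta(weeks=4)


def subject_skeleton(subject):
    """Lower-case the subject and strip leading Re:/Fwd:/[tag] noise repeatedly."""
    s = subject.strip().lower()
    while True:
        stripped = LEADING_NOISE.sub("", s, count=1)
        if stripped == s:
            break
        s = stripped
    return re.sub(r"\s+", " ", s).strip()


def subject_hash(subject):
    return hashlib.sha1(subject_skeleton(subject).encode("utf-8")).hexdigest()


def assign_thread(msg, by_message_id, by_subject_hash):
    """Return a thread id for msg: try headers first, then subject plus time."""
    key = subject_hash(msg["subject"])
    parent = msg.get("in_reply_to")
    if parent in by_message_id:
        # 1. Header says which message this replies to, and we know that message.
        thread_id = by_message_id[parent]
    else:
        # 2. Fall back to the subject skeleton, but only within the time window.
        previous = by_subject_hash.get(key)
        if previous and abs(msg["date"] - previous[1]) <= THREAD_WINDOW:
            thread_id = previous[0]
        else:
            thread_id = msg["message_id"]  # start a new thread
    by_message_id[msg["message_id"]] = thread_id
    by_subject_hash[key] = (thread_id, msg["date"])
    return thread_id
```

Stripping bracketed tags is a trade-off: it merges "[lucene-dev] Foo" with "Foo", but it would also merge "[VOTE] Release X" with a discussion titled "Release X".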
Indexing
Use StreamingUpdateSolrServer
AutoCommit use-case
Solr index abuse: track seen/unseen
&qsrc=indexer
&warmUp=true
Separate processes: easier reindexing (esp. with frequent project infra changes)
Treating quoted portions of ML messages
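The slide does not say how quoted text is treated. As one possible approach, the sketch below separates the new text of a message from the quoted text, so the two could be indexed in different fields or weighted differently. It assumes the common ">" quoting convention and an "On ... wrote:" attribution line; `split_quoted` and `ATTRIBUTION` are hypothetical names.

```python
import re

# Assumed attribution line format, e.g. "On Mon, May 3, 2010, Jane wrote:".
ATTRIBUTION = re.compile(r"^On .+ wrote:\s*$")


def split_quoted(body):
    """Split a plain-text message body into (new_text, quoted_text)."""
    new_lines, quoted_lines = [], []
    for line in body.splitlines():
        if line.lstrip().startswith(">"):
            # Drop every leading '>' and space, which also handles nested quotes.
            quoted_lines.append(line.lstrip().lstrip("> ").rstrip())
        elif ATTRIBUTION.match(line.strip()):
            continue  # the attribution line belongs to neither part
        else:
            new_lines.append(line)
    return "\n".join(new_lines).strip(), "\n".join(quoted_lines).strip()


body = "I agree.\n\nOn Mon, Otis wrote:\n> Use Droids.\n>> Nutch is a beast.\n"
new_text, quoted_text = split_quoted(body)
```

Real mailing-list traffic also contains top-posting, HTML mail, and signatures, which this sketch ignores.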
Search
Facets (multi-select)
- Project
- Data source/type
- Author (based on names only)
Boosting more recent documents vs. pure relevance vs. newest/oldest first
Give the equivalent of 0.5 year to docs w/ empty updateDate field (e.g. javadocs):
recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)^4
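To see what this function query does, it helps to evaluate it by hand. The sketch below mirrors Solr's documented semantics: `ms(a,b)` is the difference in milliseconds, `map(x,min,max,target)` replaces values within [min, max] with target, and `recip(x,m,a,b)` computes a/(m*x+b). With about 3.16e10 ms in a year, ages between 20 and 100 years are mapped to 0.5 year; presumably an empty updateDate evaluates to the epoch, which puts it in that range. The trailing `^4` is read here as a query boost weight rather than exponentiation and is left out. The Python function names are illustrative only.

```python
MS_PER_YEAR = 3.16e10  # approximate milliseconds per year, as the constants imply


def solr_map(x, lo, hi, target):
    """Mirror of Solr's map(x,min,max,target): values in [lo, hi] become target."""
    return target if lo <= x <= hi else x


def solr_recip(x, m, a, b):
    """Mirror of Solr's recip(x,m,a,b) = a / (m*x + b)."""
    return a / (m * x + b)


def recency(age_ms):
    """Value of recip(map(ms(NOW,updateDate),6.32e11,3.16e12,1.58e10),3.16e-11,4,1)."""
    adjusted = solr_map(age_ms, 6.32e11, 3.16e12, 1.58e10)
    return solr_recip(adjusted, 3.16e-11, 4, 1)


brand_new = recency(0)                    # 4.0
one_year_old = recency(MS_PER_YEAR)       # about 2.0
no_update_date = recency(40 * MS_PER_YEAR)  # treated like a 0.5-year-old document
```

The value starts at 4 for a brand-new document, halves at roughly one year, and keeps decaying slowly, so older documents are demoted without being excluded.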
Search cont'd
Query Spellchecker
Sematext components:
- ReSearcher & Relaxer
- AutoComplete
- Key Phrase Extractor (2 approaches)
Threaded vs. flat view
In-document search term highlighting
Short URLs
Search cont'd
Dog food #1: Auto-Complete
Source: nightly refreshed subjects and titles
Approach: go directly to selection
sematext.com/products/autocomplete/
Dog food #2: ReSearcher & Relaxer
Avoid "sorry, no/poor matches"
Multiple algos trigger re-searching
Different forms of relaxing
sematext.com/products/dym-researcher/
Dog food #3: Key Phrases
Help narrow search results, like facets
2 types:
- Stored in index vs. calculated from top N hits
sematext.com/products/key-phrase-extractor/
Basic Search Analytics
Top queries, top terms...
Daily, weekly, monthly
MRR
http://en.wikipedia.org/wiki/Mean_reciprocal_rank
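Mean reciprocal rank averages 1/rank of the first relevant result over a set of queries, so 1.0 means the right answer is always on top. The sketch below assumes, for illustration only, that "relevant" is approximated by the first result a user clicked and that queries with no click contribute 0. The slides do not say how the rank is obtained, and `mean_reciprocal_rank` is a hypothetical helper.

```python
def mean_reciprocal_rank(first_click_ranks):
    """Average of 1/rank over queries; None (no click) counts as 0."""
    if not first_click_ranks:
        return 0.0
    total = sum(1.0 / rank if rank else 0.0 for rank in first_click_ranks)
    return total / len(first_click_ranks)


# Four queries: clicked result 1, result 3, nothing, result 2.
mrr = mean_reciprocal_rank([1, 3, None, 2])  # (1 + 1/3 + 0 + 1/2) / 4
```

Tracked daily, weekly, and monthly like the other reports, a falling MRR would suggest that relevance tuning has pushed the results users want further down the page.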
Very Basic Search Analytics
Real Search Analytics
Performance & Monitoring: RPM
Availability: Site24x7.com
Operations
Small EC2 instance: 1.7 GB RAM
EBS for data - got burnt once
Local disk for index
Solr 1.4.1 multi-core
Performance monitoring via RPM
Availability & performance via site24x7.com
Statistics
search-hadoop.com:
- 110K+ documents
- ~700 MB optimized
search-lucene.com:
- 170K+ documents
- ~900 MB optimized
Future
Field collapsing (threads)
Bot detection (load) DONE
Solr duplicate detection (release notes)
Relevance tuning (MRR)
Open sourcing?
World-wide!
Search & Data Analytics
Machine Learning & NLP
Big Data
jobs@sematext.com
WE ARE HIRING
Questions
Contact
sematext.com
blog.sematext.com
@sematext
@otisg
otis@sematext.com