See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Search Analytics: Business Value & NoSQL Backend
Otis Gospodnetić – Sematext International
@otisg ◦ @sematext ◦ sematext.com
sematext.com/search-analytics
Copyright 2011 Sematext Int'l. All rights reserved.
About Otis Gospodnetić
• ASF Member: Lucene, Solr, Nutch, Mahout
• Author: Lucene in Action 1 & 2
• Entrepreneur: Sematext, Simpy
Sematext Metrics
• 100% organic: no GMO, no VC
• 4 years old
• < 10 people
• 7 countries
• 3 timezones
• 2 continents
• > 100 customers
About Sematext
Products & Services
Consulting, Development, Tech Support:
• Search (Lucene, Solr, ElasticSearch...)
• Big Data (Hadoop, HBase, Voldemort...)
• Web Crawling (Nutch, Droids)
• Machine Learning (Mahout)
Agenda
• What is Search Analytics and why it matters
• Example reports and their value
• What we built, why, and how
Communication
• twitter.com/sematext
• twitter.com/otisg
• Hash tags: #stsa or #stanalytics
• http://sematext.com/search-analytics/index.html
• Raise your hand!
The Compass
Search logs are your Map
Search Analytics is your Compass
High Level Why
search users
search providers
search experience
High Level Why
search providers
search experience
This search sucks! It takes 17 tries to find anything here!
F!?@#$%^&?!?
search users
Cool, the latest search tweaks made our site really sticky!
Awesome!
Don't Be Like This Dude
Got Clue?
Search Analytics
Performance Monitoring
Quality Assurance
Tuning UI
More Concrete Why
• Measure and monitor everything. Introspection.
• Supports (re)design, navigation choices
• Helps with content acquisition & enhancement
• Improves search experience
• Mula
The Moment of Truth
Question for the audience #1
What do you use for Search Analytics?
a) Home grown stuff
b) Google Analytics
c) Omniture
d) Webtrends
e) Other
f) Nothing
Search Analytics Outline
• Collect: queries & clicks & interactions & ...
• Analyze: actions / transactions / conversions
• Output: reports – over time
• Output++: feedback loop
• The means, not the goal
• Ongoing, not one-off
remember this
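The collect → analyze → output steps above can be sketched in a few lines. This is a minimal illustration, not the pipeline described in this deck; the event fields (`type`, `q`, `hits`) are assumptions made up for the example.

```python
from collections import Counter

def top_queries(events, n=10):
    """Analyze: most frequent query strings as (query, count) pairs."""
    counts = Counter(e["q"] for e in events if e["type"] == "query")
    return counts.most_common(n)

def zero_hit_queries(events, n=10):
    """Analyze: most frequent queries that returned no results."""
    counts = Counter(
        e["q"] for e in events if e["type"] == "query" and e["hits"] == 0
    )
    return counts.most_common(n)

if __name__ == "__main__":
    # Collect: hypothetical events as a receiver might log them.
    events = [
        {"type": "query", "q": "hbase scan", "hits": 12},
        {"type": "click", "q": "hbase scan", "rank": 2},
        {"type": "query", "q": "flume sink", "hits": 0},
        {"type": "query", "q": "hbase scan", "hits": 12},
    ]
    # Output: a tiny "report".
    print("top queries:", top_queries(events))
    print("zero-hit queries:", zero_hit_queries(events))
```

Running such a report on a schedule, rather than once, is what turns it into the ongoing feedback loop the slide asks for.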
Search vs. Web Analytics
• User intent and information needs (search) vs. inferring them (web)
• Hand in hand
• Ideally you can relate data from both or even unify it
Example Core Reports
• Rate & Volume, Latency (mean, avg, 90%)
• Click Through Rate, Mean Reciprocal Rank
• Top Queries by count, clicks, 0 hits...
• Query Trending
• Top Seen Docs, Top Clicked Docs (msft)
• Page & Click Depth
• Facet & Sort Usage
• ...
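Three of the metrics above are easy to pin down in code. The sketch below uses common textbook conventions, which may differ from how any particular product computes them: Click Through Rate is the share of queries with at least one click, Mean Reciprocal Rank averages 1/rank of the best-ranked click (counting 0 for queries with no click), and the latency percentile uses the nearest-rank method. The `clicked_ranks` field is an assumed record layout.

```python
import math

def click_through_rate(searches):
    """Share of searches that received at least one click."""
    if not searches:
        return 0.0
    clicked = sum(1 for s in searches if s["clicked_ranks"])
    return clicked / len(searches)

def mean_reciprocal_rank(searches):
    """Mean of 1/rank of the best-ranked click; searches without clicks count as 0."""
    if not searches:
        return 0.0
    total = sum(1.0 / min(s["clicked_ranks"]) for s in searches if s["clicked_ranks"])
    return total / len(searches)

def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=90 for 90th-percentile latency."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    k = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[k - 1]

if __name__ == "__main__":
    # Hypothetical searches: 1-based ranks of the results each user clicked.
    searches = [
        {"clicked_ranks": [1]},
        {"clicked_ranks": []},
        {"clicked_ranks": [2, 5]},
        {"clicked_ranks": [4]},
    ]
    print(click_through_rate(searches), mean_reciprocal_rank(searches))
    print(percentile([120, 80, 95, 400, 110], 90))
```

A falling MRR with a steady CTR suggests users still find something to click, but further down the result list.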
More Reports in More Detail
See Search Analytics What? Why? How?
http://blog.sematext.com/tag/analytics/
Part Dos
Switching gears...
Juno digs NoSQL
What We've Built
• Search Analytics SaaS
• Numerous reports (e.g. query volume, rate, latency, term frequencies / comparisons, hit buckets, search origins, etc.)
• Trending over time
• Comparisons of time periods
• Top N reports
• Filter, slice and dice
Who Needs a Compass?
We need it: search-hadoop.com & search-lucene.com
Our customers need it!
You?
Sematext Search Analytics
Big Dreams
• SaaS
• Multitenant
• Large Scale – Massive Data
• Cloud
Storage Choices
• RDBMS: MySQL, PostgreSQL
• HDFS
• Hive
• HBase
• Cassandra
SaaS vs. In-House
Question for the audience #2
SaaS vs in-house Search Analytics?
a) SaaS
b) in-house
Sematext Search Analytics
Data Flow
See Search Analytics with Flume and HBase:
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
Data Collection
See Search Analytics with Flume and HBase:
http://blog.sematext.com/2010/10/16/search-analytics-hadoop-world-flume-hbase/
Core Tech
• JavaScript Beacons
• Metric Capture Web App aka Receiver
• Flume Agents, Collectors, Sinks
• HBase
• MapReduce Aggregations
• Search Analytics Reporting Web App
What is Flume
• Distributed data/log collection service
• Scalable, configurable, extensible
• Centrally manageable, open source
• Agents get data from app, Collectors save it
• Abstractions: Source → Decorator(s) → Sink
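The Source → Decorator(s) → Sink abstraction can be pictured as a chain of small stages. The sketch below is plain Python and not Flume's API or configuration syntax; every name in it (`source`, `add_host`, `drop_empty`, `ListSink`) is hypothetical and exists only to show how events flow through the chain.

```python
def source(lines):
    """Source: produces raw events (here from an in-memory list)."""
    for line in lines:
        yield {"body": line}

def add_host(events, host):
    """Decorator: annotates each event as it passes through."""
    for event in events:
        yield {**event, "host": host}

def drop_empty(events):
    """Decorator: filters out events with a blank body."""
    for event in events:
        if event["body"].strip():
            yield event

class ListSink:
    """Sink: final destination; stores events in a list for this sketch."""
    def __init__(self):
        self.stored = []

    def write(self, events):
        for event in events:
            self.stored.append(event)

if __name__ == "__main__":
    sink = ListSink()
    # Compose the chain: source -> decorators -> sink.
    sink.write(add_host(drop_empty(source(["q=hbase", "", "q=flume"])), "web-1"))
    print(sink.stored)
```

Because each stage only consumes and produces events, decorators can be added, removed or reordered without touching the source or the sink.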
What is HBase
• Scalable, reliable, distributed, column-oriented DB
• On top of HDFS
• MapReducable
Data Flow, Detailed
Why Flume
• Reliable delivery, e.g. queue msgs locally if destination unreachable
• Easy, centralized management via Web UI or console
• Good community, good progress, now @ASF
• But: more complex, more moving parts
• On Flume: slideshare.net/cloudera/inside-flume
• Alternatives: Kafka, Scribe...
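"Queue msgs locally if destination unreachable" is the core of reliable delivery. A minimal sketch of that idea follows. It is not how Flume implements it: the `BufferingSender` class and the `deliver` callable are assumptions for illustration, and the buffer lives in memory, whereas a real agent would persist it to disk to survive restarts.

```python
from collections import deque

class BufferingSender:
    """Keeps messages in a local FIFO queue until the destination accepts them."""

    def __init__(self, deliver):
        # deliver: callable that raises ConnectionError when the destination is down
        self._deliver = deliver
        self._pending = deque()

    def send(self, msg):
        """Queue the message, then try to drain the queue in order."""
        self._pending.append(msg)
        return self.flush()

    def flush(self):
        """Deliver queued messages oldest-first; stop at the first failure."""
        while self._pending:
            try:
                self._deliver(self._pending[0])
            except ConnectionError:
                return False  # keep the message and everything behind it
            self._pending.popleft()  # remove only after successful delivery
        return True

    @property
    def pending(self):
        return len(self._pending)
```

A message is removed from the queue only after delivery succeeds, so an outage delays messages but does not drop or reorder them.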
Why HBase
• Scalable raw & aggregate data storage
• MapReduce data input
• Fast scans for time ranges, fast key lookups
• Easy storage and compute power expansion
• Good looking roadmap, community, progress
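Fast time-range scans and key lookups follow from HBase keeping rows sorted by row key. The sketch below imitates that with a sorted in-memory list so the effect of key design is visible. The key layout (`site|minute|query`) and the `SortedStore` class are assumptions for illustration, not the schema used in the system described here, and no HBase client API is involved.

```python
import bisect

def row_key(site_id, epoch_minute, query):
    """Composite key; zero-padding makes lexicographic order match time order."""
    return f"{site_id}|{epoch_minute:010d}|{query}"

class SortedStore:
    """Toy stand-in for a table whose rows are kept sorted by key."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Point lookup by exact key."""
        return self._values.get(key)

    def scan(self, start_key, stop_key):
        """Range scan over [start_key, stop_key), touching only that key range."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(k, self._values[k]) for k in self._keys[lo:hi]]

def scan_time_range(store, site_id, start_minute, stop_minute):
    """All rows for one site with start_minute <= minute < stop_minute."""
    return store.scan(
        f"{site_id}|{start_minute:010d}|",
        f"{site_id}|{stop_minute:010d}|",
    )
```

Putting the tenant first and the time second keeps each customer's data contiguous, so a report for one site and one time window reads a single key range.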
Open Sourcing
2 open-source projects:
• github.com/sematext/HBaseWD
• github.com/sematext/HBaseHUT
See sematext.com/open-source/index.html
Patches for Flume and HBase: blog.sematext.com/tag/flume/
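Sequential row keys (timestamps, counters) send all writes to one region at a time. The general remedy, spreading such keys across buckets with a key prefix and merging per-bucket scans on read, can be sketched as follows. This only illustrates the salting idea in Python under assumed names and an assumed bucket count; it is not the API of HBaseWD, which is a Java library (see the project page for what it actually provides).

```python
import heapq
import zlib

NUM_BUCKETS = 4  # assumed bucket count for this sketch

def distributed_key(original):
    """Prefix the key with a bucket derived from a hash of the key itself."""
    bucket = zlib.crc32(original.encode("utf-8")) % NUM_BUCKETS
    return f"{bucket:02d}:{original}"

def original_key(distributed):
    """Strip the bucket prefix again."""
    return distributed.split(":", 1)[1]

def scan_all_buckets(sorted_keys, start, stop):
    """Range scan over original keys in [start, stop).

    Runs one scan per bucket and merges the results back into original-key
    order. sorted_keys must be the stored (prefixed) keys in sorted order.
    """
    per_bucket = []
    for bucket in range(NUM_BUCKETS):
        prefix = f"{bucket:02d}:"
        lo, hi = prefix + start, prefix + stop
        per_bucket.append([original_key(k) for k in sorted_keys if lo <= k < hi])
    return list(heapq.merge(*per_bucket))
```

The trade-off is visible in `scan_all_buckets`: writes spread out, but every range read becomes one scan per bucket plus a merge.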
Challenges
• Data size. Solutions: compression (4-5x smaller with lzo), data pruning (variable levels)
• Query string distribution: very long-tail
• Lots of data to process, update, aggregate
• Young tools: Flume, HBase
• Poor IO on EC2
• Hadoop distributions
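The long-tail shape of query strings is easy to measure, and the numbers help decide how aggressively to prune. A minimal sketch, assuming per-query counts are already available as a `Counter`; both function names are made up for this example.

```python
from collections import Counter

def head_coverage(query_counts, top_n):
    """Share of total query volume covered by the top_n most frequent queries."""
    total = sum(query_counts.values())
    if total == 0:
        return 0.0
    head = sum(count for _, count in query_counts.most_common(top_n))
    return head / total

def singleton_share(query_counts):
    """Share of distinct query strings that were seen exactly once."""
    if not query_counts:
        return 0.0
    singletons = sum(1 for count in query_counts.values() if count == 1)
    return singletons / len(query_counts)

if __name__ == "__main__":
    # Hypothetical counts: a few popular queries and many one-off queries.
    counts = Counter({"hbase": 6, "solr": 2, "hbase wd prefix": 1, "flume ec2 io": 1})
    print(head_coverage(counts, 1), singleton_share(counts))
```

A high singleton share means most stored rows describe queries nobody will repeat, which is where variable-level pruning pays off.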
Output++
• AutoComplete - $MM improvement
• Better DYM Spellchecker
• Related Searches
• Recommendations
• Relevance Feedback
• ...
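AutoComplete is the most direct example of feeding analytics back into search: past queries become suggestions. The sketch below ranks logged queries by frequency for a given prefix. It is a toy under stated assumptions (in-memory, linear scan, the hypothetical `build_suggester` helper), not the product's implementation; a production suggester would use a prefix index and filter out zero-hit queries.

```python
from collections import Counter

def build_suggester(logged_queries):
    """Return a suggest(prefix, limit) function backed by query frequencies."""
    counts = Counter(q.strip().lower() for q in logged_queries if q.strip())

    def suggest(prefix, limit=5):
        prefix = prefix.strip().lower()
        matches = [(q, c) for q, c in counts.items() if q.startswith(prefix)]
        # Most frequent first; ties broken alphabetically for stable output.
        matches.sort(key=lambda qc: (-qc[1], qc[0]))
        return [q for q, _ in matches[:limit]]

    return suggest

if __name__ == "__main__":
    suggest = build_suggester(
        ["hbase scan", "hbase scan", "hbase filter", "Hadoop", "flume"]
    )
    print(suggest("hb"))
```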
Closing the Loop
search users
search providers
search experience
Resource
http://rosenfeldmedia.com/books/searchanalytics/
Search Analytics for Your Site, Louis Rosenfeld
We're Hiring
Dig Search?
Dig Analytics?
Dig Big Data?
Dig Performance?
Dig working with and in open-source?
We're hiring world-wide!
http://sematext.com/about/jobs.html
Contact
• sematext.com
• blog.sematext.com
• @sematext
• @otisg
• Want SA? Grab me or go to: sematext.com/search-analytics
• Hash tags: #stsa or #stanalytics