These are the slides used in our 3-hour tutorial at VLDB 2014. Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014) Abstract: Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high-quality enterprise search engine, in the context of the rise of big data.
1. Enterprise Search in the Big Data Era Yunyao Li Ziyang Liu
Huaiyu Zhu IBM Research - Almaden NEC Labs IBM Research -
Almaden
2. 1 Enterprise Search: providing intuitive access to an organization's various digital content.
IDC report [IDC 05]: $5k/person/year of salary wasted due to poor search; 9-10 hr/person/week spent doing search; search unsuccessful 1/3-1/2 of the time.
Butler Group [Edwards 06]: 10% of salary cost wasted through ineffective search.
Accenture survey [Accenture 07]: middle managers spend 2 hr/day searching; >50% of what they find has no value.
References: Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf
[IDC 05] The enterprise workplace: How it will change the way we work. IDC Report 32919
[Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf
[Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
3. 2 Magic Search from the User's Point of View [mock-up: search box and numbered result list] INTRODUCTION SEARCH
4. 3 What Happens Behind the Scenes. Backend: collect data, analyze data, index data. Frontend: serve user queries, return results. [diagram: data source → index] INTRODUCTION SEARCH
5. 4 How Does a Query Match a Document? [diagram: documents → analyze document → build index → index; query → analyze query → search index → present results] INTRODUCTION SEARCH
6. 5 Search Is More Than Keyword Match. Specific features in documents are important: title, URL, person name, product, actions, ... Features combine to form higher-level concepts. In document: home page + person → personal homepage. Cross document: URL link analysis, ... The string representation in a document may not match that in the user query. Person name: Bill Clinton vs. William Jefferson Clinton. User queries may be ambiguous: multiple interpretations. Presenting the results to the user: ranking, grouping, interactive refinement. INTRODUCTION SEARCH
7. 6 Internet vs Enterprise: Web data [Fagin WWW2003]
Creation of content. Internet: democratic; appealing to the reader; links imply approval. Enterprise: bureaucratic; conforming to mandate; links reflect internal structure.
Relevant query results. Internet: large number; overlapping information; a reasonable subset suffices; ranking is more universal. Enterprise: small number; specific function; specific pages required; ranking is relative to the query.
Spamming. Internet: spam-infested; ranking can only be based on external authority. Enterprise: mostly spam-free; ranking based on content or metadata is reliable.
Search engine friendliness. Internet: web pages designed to be search results. Enterprise: documents not designed to be search results; special treatment needed.
INTRODUCTION ENTERPRISE VS INTERNET
8. 7 Internet vs Enterprise: Big Data
Content being searched. Internet: sources: web crawl; formats: html, xml, pdf, ... Enterprise: variety of sources; variety of formats: email, database, application-specific access and formats.
Search queries / expected results. Internet: target: web pages, office documents; expect a list of documents; expect little personalization; return results directly. Enterprise: target: rows, figures, experts, ...; expect customized results; personalization required: geography, access, ...; customize results.
Related information. Internet: links imply approval; small amount of domain-specific knowledge; generic analysis. Enterprise: links reflect organization structure; large amount of dynamic domain-specific knowledge; highly specialized analysis.
Skill set of search admins. Internet: large number of admins; search experts; facilitate update of search algorithms. Enterprise: small number of admins; domain experts; facilitate use of domain knowledge.
INTRODUCTION ENTERPRISE VS INTERNET
9. 8 Search Engine Components Backend Collect data Analyze data
Store and index data Admin System performance Search quality
control/improvement Frontend Interpret user query Search index
Present results Interact with user index Data source INTRODUCTION
TUTORIAL OVERVIEW COMPONENTS
10. 9 Search Engine Architecture. Backend: collect data, analyze data, store and index data. Admin: system performance, search quality control. Frontend: interpret user query, search index, present results, interact with user. [diagram: data source → index]
11. 10 Main Backend Functions. Ingestion (collect): collect all the data to be searched; transform and store as documents. Analysis (understand): information extraction; analyze and transform data; local analysis (in-document analysis) and global analysis (cross-document analysis). Indexing (prepare for search): generate terms suitable for matching queries; index search terms.
12. 11 Backend Section Outline Overview Data Ingestion Local
analysis Global analysis Indexing
13. 12 Typical analytics pipeline: S1={f11, f12, ...} S2={f21, f22, ...} S3={f31, f32, ...} G1={g1, ...} G2={g2, g3, ...} (DI → LA → GA → Idx)
Data ingestion: collect data; transform to uniform document format; store in document store.
Local analysis: information extraction from each document.
Global analysis: cross-document analysis; rank, group, merge, and filter documents.
Indexing: generate search terms; index documents by search terms.
BACKEND OVERVIEW
14. 13 Digression: Classical IR (DI → LA → GA → Idx)
Data ingestion: given set of files.
Local analysis: tokenize; stop wording; stemming; form n-grams.
Global analysis: calculate statistics of terms in documents.
Indexing: generate search terms; index by terms with statistics.
BACKEND OVERVIEW
15. 14 Digression: Classical Web search (DI → LA → GA → Idx)
Data ingestion: crawl web pages.
Local analysis: extract out-links.
Global analysis: compute the principal eigenvector of the link matrix (PageRank).
Indexing: generate search terms; index documents by search terms, with PageRank.
BACKEND OVERVIEW
16. 15 Demands of Enterprise Search (DI → LA → GA → Idx)
Data ingestion: handle a variety of sources; handle a variety of formats; deal with access policy; deal with update policy.
Local analysis: incorporate domain knowledge; extract a rich set of semantics; categorize documents.
Global analysis: cross-document analysis; rank, group, merge, and filter documents.
Indexing: generate search terms; index documents by search terms.
BACKEND OVERVIEW
17. 16 Desiderata of Backend. Efficient incremental updates: fast turnaround time for updates. System performance and reliability: scaling with data size and resources available; fault tolerance. Ease of administration and quality improvement: allow search admins to customize domain-specific configurations. BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
18. 17 Backend Section Outline Overview Data Ingestion Local
analysis Global analysis Indexing
19. 18 Data Ingestion. [diagram: Web, DB, App, email + attachments, pdf files → crawl/push → convert to document → convert to text → document store] Variety of sources; support update & retention policy. BACKEND DATA INGESTION
20. 19 Document-centric View. Data as a collection of documents; document as the unit of storage and search result. Three major components: a document identifier unique in the whole system; metadata fields: url, date, language, ...; content field: text to be searched. Representation of data of different structures: web pages: each page is a document; relational data: each row is a document; hierarchical data: each node is a document. BACKEND DATA INGESTION
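The document-centric view above can be sketched as a small data type; the field names (docid, metadata, content) and the employee-table row are illustrative, not from a specific system:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    docid: str                                    # unique identifier across the whole system
    metadata: dict = field(default_factory=dict)  # url, date, language, ...
    content: str = ""                             # the text to be searched

# Representing a relational row as a document (hypothetical employee table):
row = {"id": 42, "name": "G J Chaitin", "dept": "Research"}
doc = Document(docid=f"db.employees.{row['id']}",
               metadata={"source": "db", "table": "employees"},
               content=" ".join(str(v) for v in row.values()))
```

The same wrapper would carry a web page (one page per document) or a hierarchical node, differing only in how docid and content are derived.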
21. 20 Push vs Pull. Definition. Pull: the search engine initiates the transfer of data (web crawler). Push: the content owner initiates the transfer of data (apps with push notification). Advantages. Pull: operated by the search engine; uses standard crawlers. Push: can handle special access methods; easy to adjust refresh rate; easy to handle special formats. Disadvantages. Pull: difficult to access special data sources; difficult to adjust domain-specific treatment. Push: needs synchronization with the content owner. Applicability. Pull: prevalent for Internet; also useful for enterprise. Push: rare for Internet; very important for enterprise. BACKEND DATA INGESTION
22. 21 Transform the Data. Format conversion: convert content to text (pdf, doc, ...); keep as much structure as possible. Metadata conversion: obtain and transform metadata (HTTP headers, DB table metadata, ...). Merge/split documents: one-to-many: zip file, email thread, attachments; many-to-one: social tags merged into the original doc. BACKEND DATA INGESTION
23. 22 Storage options.
SQL database. Pro: traditional RDBMS strengths; supports insert, update, delete, fielded query. Con: too much system overhead.
Indexing engine (Lucene). Pro: closer to the document-centric view; supports insert, delete, fielded query. Con: no direct in-document update; needs special treatment for distributed processing.
NoSQL databases. Pro: lightweight; sufficient for simple use. Con: may lack features in the future; transactions?
Issues to consider: in-document update; access/retention policy; parallel processing. BACKEND DATA INGESTION
24. 23 Backend Section Outline Overview Data Ingestion Local
analysis Global analysis Indexing
25. 24 Local Analysis. Annotating pages: extract structured elements (title, header, ...); extract features for people, projects, communities, ...; extract features for cross-document analysis. Categorizing pages: label by standard categories (language, geography, date, ...); label pages by custom categories; IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP law, ... Local analysis is essentially information extraction. BACKEND LOCAL ANALYSIS
26. 25 Rule-based vs. Learning-based IE.
Rule-based IE. Pro: declarative; easy to comprehend; easy to maintain; easy to incorporate domain knowledge; easy to debug. Con: heuristic; requires tedious manual labor.
ML-based IE. Pro: trainable; adaptable; reduces manual effort. Con: requires labeled data; requires retraining for domain adaptation; requires ML expertise to use or maintain; opaque (not transparent).
BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
27. 26 Landscape of Entity Extraction Implementations. NLP papers (2003-2012): 75% machine-learning-based, 21% hybrid, 3.5% rule-based. Commercial vendors (2013): all vendors: 45% rule-based, 22% hybrid, 33% machine-learning-based; large vendors: 67% rule-based, 17% hybrid, 17% machine-learning-based. Example industrial systems: GATE Information Extraction, IBM InfoSphere BigInsights, Microsoft FAST, SAP HANA, SAS Text Analytics, HP Autonomy, Attensity, Clarabridge. Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013. BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
28. 27 Local analysis for different features [Zhu et al., WWW07]. Example pipeline over an intranet page: NavPanel extraction (self-link identification) → NavPanels; title extraction (matching title patterns) → titles; dictionary match against a person-name dictionary (= employee directory) → person name in title; URL extraction (matching URL patterns) → URLs, URL name. Examples: "IBM Global Services Security Home" → "IBM Global Services Security"; "G J Chaitin Home Page" → "G J Chaitin"; 1. http://w3-03.ibm.com/marketing/ → marketing; 2. http://w3-03.ibm.com/isc/index.html → isc; 3. http://chis.at.ibm.com/ → chis. BACKEND LOCAL ANALYSIS EXAMPLES
29. 28 Consolidation. Example: document language consolidation from multiple signals: HTTP header (Accept-Language: en-us,en;q=0.5); meta tags; document text encoding; URL (http://enterprise.com/hr/benefits/us/ca/). BACKEND LOCAL ANALYSIS TRANSFORMATIONS
30. 29 Backend Section Outline Overview Data Ingestion Local
analysis Global analysis Indexing
31. 30 Global Analysis. Deduplication: save resources, reduce result clutter. Identify the root of a URL hierarchy: used for result grouping and ranking. Anchor text analysis: assign external labels to documents. Social tagging analysis: assign tags and their weights to documents. Identify different versions of the same document: due to variations in date, language, ... Enterprise-specific global analysis: when certain documents co-exist, take a specified action. BACKEND GLOBAL ANALYSIS
32. 31 Shingle-based deduplication (Leskovec, http://www.mmds.org/). S1={s1, s2, ...} S2={s1, s3, ...} S3={s2, s3, ...} → {h1(S1), h2(S1), ...} {h1(S2), h2(S2), ...} {h1(S3), h2(S3), ...}
Shingles: character or token n-grams; possibly stemmed; possibly treated specially with respect to stop words.
Minhash: maps sets to integers, based on a permutation of the universal set. Jaccard similarity: |A ∩ B| / |A ∪ B|. Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets.
More diverse set of documents; more precise. BACKEND GLOBAL ANALYSIS DEDUPLICATION
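The shingle/minhash scheme above can be sketched in a few lines. This is a toy: the documents are invented, and the hash functions are simulated with explicit random permutations of a small shingle universe (production systems use universal hashing instead):

```python
import random

def shingles(text, n=3):
    """Character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def signature(s, perms):
    """Minhash signature: for each permutation, the smallest rank of any member."""
    return [min(p[x] for x in s) for p in perms]

docs = ["enterprise search engine", "enterprise search engines", "payroll database"]
universe = sorted(set().union(*(shingles(d) for d in docs)))

random.seed(0)
perms = []
for _ in range(64):                     # 64 simulated hash functions
    order = universe[:]
    random.shuffle(order)
    perms.append({x: i for i, x in enumerate(order)})

sigs = [signature(shingles(d), perms) for d in docs]
# Fraction of agreeing minhash values estimates the Jaccard similarity (the theorem above).
est = sum(a == b for a, b in zip(sigs[0], sigs[1])) / len(perms)
```

With 64 permutations, `est` lands close to the true Jaccard similarity of the two near-duplicate documents, which is what makes signature comparison a cheap stand-in for comparing full shingle sets.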
33. 32 Metadata-based deduplication (IBM Gumshoe search engine). S1=[h11, h12, ...] S2=[h21, h22, ...] S3=[h31, h32, ...] → G1={S1, ...} G2={S2, S3, ...}
Significant metadata: document title; section headers; signatures from URL. Ensure that all similar candidates have the same signature.
Group by signature, then perform detailed analysis. In-group similarity analysis: analyze documents within candidate groups.
More customizable for intranet; less cost. BACKEND GLOBAL ANALYSIS DEDUPLICATION
34. 33 URL Root Analysis [Zhu et al., WWW07]. Example URL forest: host1/b/a, host1/b/a/~user1/, host1/b/a/~user1/pub, host1/b/a/x_index.htm, host1/b/c, host1/b/c/d, host1/b/c/home.html, host1/b/c/d/e/index.html, host1/b/c/d/e/index.html?a=us, host1/b/c/d/e/index.html?a=uk. Given a set of documents all with the same value V of feature X (e.g., at one time all webpages from the IBM Tucson site had the same title), find the roots of the URL forest. These will be the preferred results for the query X=V; e.g., when searching for the Tucson home page, only the IBM Tucson homepage will match. BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
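A minimal sketch of the root-finding step: among pages sharing a feature value, keep only the URLs with no ancestor URL in the same set. The path-prefix test below is a simplification (it ignores query strings and filename conventions such as index.html):

```python
def url_roots(urls):
    """Return the URLs that have no ancestor (path prefix) in the same set."""
    normalized = {u.rstrip("/") for u in urls}

    def has_ancestor(u):
        parts = u.split("/")
        # Check every proper path prefix: host1/b/c/d -> host1, host1/b, host1/b/c
        return any("/".join(parts[:i]) in normalized for i in range(1, len(parts)))

    return {u for u in normalized if not has_ancestor(u)}

pages = ["host1/b/a", "host1/b/a/~user1/", "host1/b/a/~user1/pub",
         "host1/b/c", "host1/b/c/d", "host1/b/c/home.html",
         "host1/b/c/d/e/index.html"]
```

On the slide's example forest this keeps exactly the two subtree roots, host1/b/a and host1/b/c, which are then preferred as results for the query X=V.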
35. 34 Label Assignment [Zhu et al., WWW07]. Anchor text global analysis: documents A1 and A2 link to document B with anchor text "X home"; assign label X and/or Y to B based on frequency. Social tagging global analysis: bookmarks C1 ("X home"), C2 ("X"), and C3 ("Y home") point to a document; assign labels "X home", "X", and "Y home" based on frequency. BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT
36. 35 Entity Integration using HIL [Hernández et al, EDBT13]. Entity population rules: create entities (from raw records, other entities, and links); clean, normalize, aggregate, fuse. Entity resolution rules: create links between raw records or entities. HIL defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic. Pipeline: various data sources → information extraction (declarative IE, IBM SystemT [Chiticariu et al, ACL 2010]) → raw records → entity resolution → fuse/aggregate → unified entities. Optimizing compiler to Big Data runtime (Jaql and Hadoop). BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
37. 36 Backend Section Outline Overview Data Ingestion Local
analysis Global analysis Indexing
38. 37 Indexing. Generate and index search terms, to be matched by terms generated at runtime from user queries. Challenges: extracted terms do not match user query terms (morphological changes, synonyms, ...); the importance of a term depends on the query; need for bucketing of indexes; support for incremental indexing. BACKEND INDEXING
39. 38 Term normalization. Example: date/time normalization. Given any of these: Wed Aug 27 10:06:11 PDT 2014; 27 Aug 2014, 10:06:11; 2014-08-27T10:06:11-07:00; 27 Aug 2014; 1409133971 — normalize to 2014-08-27T10:06:11-07:00. Other examples: person names, product names, ... BACKEND INDEXING TERM NORMALIZATION
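A sketch of such a normalizer using only the standard library. The format list and the tiny timezone-abbreviation table are assumptions for the example (real systems need far larger tables), and the target zone is fixed at UTC-7 to match the slide's output:

```python
import re
from datetime import datetime, timezone, timedelta

PDT = timezone(timedelta(hours=-7))                  # assumed target zone
TZ_OFFSETS = {"PDT": -7, "PST": -8, "UTC": 0}        # tiny illustrative table
FORMATS = ("%a %b %d %H:%M:%S %Y", "%d %b %Y, %H:%M:%S",
           "%Y-%m-%dT%H:%M:%S%z", "%d %b %Y")

def normalize(ts: str) -> str:
    ts = ts.strip()
    if ts.isdigit():                                 # Unix epoch seconds
        return datetime.fromtimestamp(int(ts), tz=timezone.utc).astimezone(PDT).isoformat()
    m = re.search(r"\b(PDT|PST|UTC)\b", ts)          # strptime cannot parse these names
    offset = TZ_OFFSETS[m.group(1)] if m else None
    if m:                                            # drop the abbreviation before parsing
        ts = (ts[:m.start()] + ts[m.end():]).replace("  ", " ").strip()
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(ts, fmt)
        except ValueError:
            continue
        if dt.tzinfo is None:                        # attach the stated or assumed offset
            tz = timezone(timedelta(hours=offset)) if offset is not None else PDT
            dt = dt.replace(tzinfo=tz)
        return dt.astimezone(PDT).isoformat()
    raise ValueError(f"unrecognized timestamp: {ts!r}")
```

Whatever surface form is stored, the index then holds a single canonical term, so a query normalized the same way will match.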
40. 39 Why Generate Variant Terms? The extracted feature string may differ from the query string. People names: document: John Doe; search: Doe, John; search: J Doe. Acronym expansions: gts → Global Technology Services. N-gram variant generation: title: "reimbursement of travel expenses" → terms: reimbursement, travel expenses, reimbursement travel, reimbursement of travel, reimbursement expenses. Normalization alone is not a sufficient solution. People names: document: John Doe → J. Doe; search: Jean Doe → J. Doe; these are not supposed to match. Solution: generate variant terms with different levels of approximation. BACKEND INDEXING VARIANT TERM GENERATION
41. 40 Configurable Term Generation. Configuration knobs determine the set of outputs. Given "Mr. John (Jack) M. Doe Jr.":
Configuration 1: Initial=both, Dot=with, NickName=both, MiddleName=both, NameSuffix=without, Title=without, Comma=both → John M. Doe; Doe, John M.; John Doe; Doe, John; J. M. Doe; Doe, J. M.; J. Doe; Doe, J.; Jack M. Doe; Doe, Jack M.; Jack Doe; Doe, Jack.
Configuration 2 (normalization): Initial=without, Dot=without, NickName=without, MiddleName=without, NameSuffix=without, Title=without, Comma=without → John Doe.
BACKEND INDEXING VARIANT TERM GENERATION
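A toy generator in the spirit of these knobs; the flag names mirror the slide's configuration (Initial, Dot, NickName, MiddleName, Comma), but the code itself is an illustrative sketch, not the system's implementation:

```python
from itertools import product

def _given_forms(name, initial_mode, dot):
    """Full form and/or initial of a given name, per the Initial/Dot knobs."""
    init = name[0] + ("." if dot else "")
    return {"with": [init], "without": [name], "both": [name, init]}[initial_mode]

def variants(first, middle, last, nick, initial="both", dot=True,
             nickname="both", middlename="both", comma="both"):
    givens = list(dict.fromkeys(                      # dedupe shared initials
        _given_forms(first, initial, dot)
        + (_given_forms(nick, initial, dot) if nickname == "both" and nick else [])))
    mids = ({"with": [middle], "without": [None], "both": [middle, None]}[middlename]
            if middle else [None])
    out = set()
    for g, m in product(givens, mids):
        head = f"{g} {m}" if m else g
        if comma in ("without", "both"):
            out.add(f"{head} {last}")                 # "John M. Doe"
        if comma in ("with", "both"):
            out.add(f"{last}, {head}")                # "Doe, John M."
    return out

loose = variants("John", "M.", "Doe", "Jack")         # Configuration 1 (12 variants)
strict = variants("John", "M.", "Doe", "Jack", initial="without", dot=False,
                  nickname="without", middlename="without", comma="without")
```

With the loose knobs this reproduces the 12 variants listed for Configuration 1; with the strict knobs it collapses to the single normalized form "John Doe".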
42. 41 Enterprise Search Backend (recap). Data ingestion: access various sources; document transform; format transform. Local analysis: information extraction; configurable. Global analysis: deduplication; URL root analysis; label assignment. Indexing: generate search terms, ... (DI → LA → GA → Idx) BACKEND RECAP
43. 42 Search Engine Architecture. Backend: collect data, analyze data, store and index data. Admin: system performance, search quality control. Frontend: interpret user query, search index, present results, interact with user. [diagram: data source → index]
44. Serving User Queries at Front End (52) 1. Ambiguity (29) 2.
Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy
(8)
45. 44 1. Ambiguity. Optimal keywords may not be used. Misspelled: "datbase". Under-specified: polysemy ("java"); too general ("database papers"). Over-specified: synonyms, acronyms, abbreviations & alternative names ("green card" vs "permanent residency"); too specific ("MS Office 2007 for Mac x64 edition"). Non-quantitative: "small laptop". Solutions: query cleaning; query autocompletion; query refinement; query rewriting; query forms.
46. 45 Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
47. 46 Graph-based Spelling Correction [Bao et al., ACL 11]. Repartition the query; each partition (token) should be plausible: confidence(correcting it) > threshold. Confidence: a linear combination of multiple scores, with parameters learned by an SVM. Domain knowledge is often used in calculating confidence. For each partition, generate candidate corrections with high scores. Example: "enterpricsea rch" → "enterpricse arch", "enterpric search", "enter pric search", etc.; candidates for "pric": price (0.8), prim (0.6), etc. QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
48. 47 Graph-based Spelling Correction [Bao et al., ACL 11]. Build a graph that connects candidate corrections (e.g., enterprise, enter, price, prim, arc, sea, rich, search for "enterpricsea rch"); each full path is a candidate query (e.g., "enterprise search", "enter price sea rich"). Find the k top-weighted full paths. Weights: 1. correction score (node weight); 2. merge penalty (node weight); 3. split penalty (edge weight). QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
49. 48 Graph-based Spelling Correction [Bao et al., ACL 11]. Path weight alone doesn't consider term correlations, so calculate a score for each path that includes term correlations (e.g., correlation("enterprise search") > correlation("enterprise arc")); this ensures the cleaned query has good-quality results. Correlations are computed based on the number of co-occurrences. Finally, return the paths with the highest scores. QUERY CLEANING UNSTRUCTURED DATA FRONTEND AMBIGUITY
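A toy reconstruction of this path-scoring step over the running "enterpricsearch" example. The graph, correction scores, and correlation scores below are all invented for illustration; the paper's actual weights are learned:

```python
GRAPH = {  # start position -> [(candidate token, end position, correction score)]
    0: [("enterprise", 9, 0.9), ("enter", 5, 0.3)],
    5: [("price", 9, 0.8), ("prim", 9, 0.6)],
    9: [("search", 15, 1.0), ("sea", 12, 0.4)],
    12: [("rich", 15, 0.5)],
}
CORRELATION = {("enterprise", "search"): 0.9}   # from co-occurrence counts; 0.05 default

def candidate_queries(start=0, end=15):
    """Enumerate all full paths through the graph with their summed node scores."""
    if start == end:
        return [([], 0.0)]
    paths = []
    for token, nxt, score in GRAPH.get(start, []):
        for rest, s in candidate_queries(nxt, end):
            paths.append(([token] + rest, score + s))
    return paths

def total_score(tokens, node_score):
    """Node weights plus bigram correlation of adjacent terms."""
    corr = sum(CORRELATION.get(pair, 0.05) for pair in zip(tokens, tokens[1:]))
    return node_score + corr

ranked = sorted(((total_score(t, s), t) for t, s in candidate_queries()), reverse=True)
```

Even though "enter price search" has a competitive node-weight sum, the strong correlation between "enterprise" and "search" pushes that path to the top, which is exactly the effect the slide describes.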
50. 49 XClean [Lu et al., ICDE 11]: based on the noisy channel model, which finds the intended word given the user's input word. Results on XML are subtrees rooted at entity nodes. A result quality score is calculated for each entity node in T, and then aggregated. E.g., if Johnny and Mike work in the same department, then "Johnn, Mike" → "Johnny, Mike" rather than "John, Mike". Processes each word individually, i.e., no merge or split. Query cleaning on relational data: Pu VLDB 08. QUERY CLEANING STRUCTURED DATA FRONTEND AMBIGUITY
51. 50 Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
52. 51 Query Autocompletion. Problem space dimensions: showing keywords vs. showing results; single keyword vs. multiple keywords; exact matching vs. fuzzy matching. QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
53. 52 Error-Tolerating Autocompletion [Chaudhuri et al., SIGMOD 09]. Problem space dimensions: showing keywords vs. showing results; single keyword vs. multiple keywords; exact matching vs. fuzzy matching. Example: "desr" → desert, dessert, deserve. QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
54. 53 Error-Tolerating Autocompletion [Chaudhuri et al., SIGMOD 09]. [trie diagram over the data "search", "sand", "text"; as the user types s, se, sen, the set of trie nodes within max. edit distance 1 is maintained] Showing results instead of keywords can be achieved by associating inverted lists with trie nodes. QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
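The trie-walk idea can be sketched with a standard edit-distance DP row carried down the trie: complete every word whose stored prefix is within the edit-distance budget of the typed prefix. This is a simplified sketch of the technique, not the paper's optimized active-node algorithm:

```python
def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = word                       # sentinel marking a complete word

def _collect(node):
    """All words stored in this subtree."""
    words = []
    for key, child in node.items():
        if key == "$":
            words.append(child)
        else:
            words.extend(_collect(child))
    return words

def fuzzy_complete(root, prefix, max_dist=1):
    results = set()

    def walk(node, row):                   # row[i] = dist(trie prefix, prefix[:i])
        if row[-1] <= max_dist:            # this trie prefix already matches the input
            results.update(_collect(node))
            return
        if min(row) > max_dist:            # prune: no extension can recover
            return
        for ch, child in node.items():
            if ch == "$":
                continue
            new = [row[0] + 1]
            for i in range(1, len(prefix) + 1):
                cost = 0 if prefix[i - 1] == ch else 1
                new.append(min(new[i - 1] + 1, row[i] + 1, row[i - 1] + cost))
            walk(child, new)

    walk(root, list(range(len(prefix) + 1)))
    return sorted(results)

trie = {}
for w in ["desert", "dessert", "deserve"]:
    insert(trie, w)
```

On the slide's example, typing the misspelled prefix "desr" still surfaces all three completions, because "des" is within edit distance 1 of the input.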
55. 54 Tastier [Li et al., VLDBJ 11]. Problem space dimensions: showing keywords vs. showing results; single keyword vs. multiple keywords; exact matching vs. fuzzy matching. Example: "have a nni" → show results for "have a nice day". QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
56. 55 Tastier [Li et al., VLDBJ 11]. Trie-based (similar to the previous paper); trie leaf nodes are associated with inverted lists. To handle multiple keywords, each record/document is associated with a sorted list of the words in it (a forward list), so that a binary search can determine whether a string appears in a record/document as a prefix. Why not a hash? Because we need to match prefixes, not whole words. Inverted-list intersections are computed incrementally using a cache for improved efficiency. Example forward list for "have a nice day": a, day, have, nice. QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
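The forward-list prefix test is a one-liner over a sorted word list; a minimal sketch using the standard library:

```python
import bisect

def has_prefix_word(forward_list, prefix):
    """Binary-search a sorted word list for any word starting with `prefix`."""
    i = bisect.bisect_left(forward_list, prefix)
    return i < len(forward_list) and forward_list[i].startswith(prefix)

forward = sorted("have a nice day".split())   # ['a', 'day', 'have', 'nice']
```

This is why a hash set would not work here: `bisect_left` lands on the smallest word that is lexicographically >= the prefix, which is exactly where a prefix match must sit, while a hash can only answer whole-word membership.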
57. 56 Phrase Prediction [Nandi et al., VLDB 07]. Problem space dimensions: showing keywords vs. showing results; single keyword vs. multiple keywords; exact matching vs. fuzzy matching. Example: "a nice" → "have a nice day". QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
58. 57 Phrase Prediction [Nandi et al., VLDB 07]. Suggest phrases given the user's input phrase. Need to find a good length for a suggested phrase: too short, and its utility is small; too long, and it has a low chance of being accepted. (Modified) suffix-tree-based: each node is a word, rather than a letter. Why not use a trie: phrases have no definitive starting point; a phrase may start in the middle of a sentence (i.e., start at a suffix of the sentence), hence a suffix tree. Significant phrases. Examples: "laptop", "have a nice day". QUERY AUTOCOMPLETION FRONTEND AMBIGUITY
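The suffix idea can be illustrated with a flat word-level phrase index standing in for the paper's suffix tree; the corpus is invented and the "significance" criterion is reduced to raw frequency for the sketch:

```python
from collections import Counter

corpus = ["have a nice day", "have a nice trip", "a nice day indeed"]

phrases = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words)):              # phrases may start at any suffix
        for j in range(i + 1, len(words) + 1):
            phrases[tuple(words[i:j])] += 1

def suggest(typed, extra=1):
    """Most frequent phrase extending the typed words by `extra` words."""
    t = tuple(typed.split())
    cands = [(count, p) for p, count in phrases.items()
             if p[:len(t)] == t and len(p) == len(t) + extra]
    return " ".join(max(cands)[1]) if cands else None
```

Because every sentence suffix contributes phrases, the input "a nice" is matched even though it starts mid-sentence; a plain sentence-prefix trie would miss it.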
59. 58 Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
60. 59 Query Refinement. Motivation: some under-specified queries on a large data corpus have too many results, and ranking cannot always be perfect. Approaches: identifying important terms in results (structured/unstructured); clustering results (structured/unstructured); faceted search (structured). FRONTEND AMBIGUITY QUERY REFINEMENT
61. 60 Using Clustered Results [Liu et al., PVLDB 11]. Example: for the query "Java", all suggested queries are about the programming language; it is desirable to refine an ambiguous query by its distinct meanings. FRONTEND AMBIGUITY QUERY REFINEMENT
62. 61 Using Clustered Results [Liu et al., PVLDB 11]. Input: clustered results (the clustering method is irrelevant); e.g., the results of "Java" may have 3 clusters corresponding to the Java language, Java island, and Java tea. Output: one refined query for each cluster. Each refined query: maximally retrieves the results in its cluster (recall); minimally retrieves the results not in its cluster (precision). FRONTEND AMBIGUITY QUERY REFINEMENT
63. 62 Using Important Terms in Results [Tao et al., EDBT 09]. For relational data only. Given a keyword query, it outputs the top-k most frequent non-keyword terms in the results, without generating the results. Avoiding result generation is possible since the terms are ranked only by frequency: a tradeoff of quality and efficiency. Related: Data Clouds (for structured data), Koutrika EDBT 09 (more sophisticated term ranking, but needs to generate query results first). FRONTEND AMBIGUITY QUERY REFINEMENT
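Frequency-only term ranking is simple enough to sketch directly; the result strings below are invented, and real systems would compute the counts inside the database rather than over materialized result text:

```python
from collections import Counter

def refinement_terms(results, query_terms, k=3):
    """Top-k most frequent non-keyword terms across the result texts."""
    counts = Counter()
    for r in results:
        counts.update(t for t in r.lower().split() if t not in query_terms)
    return [term for term, _ in counts.most_common(k)]

results = ["java language tutorial", "java island travel", "java language guide"]
```

Here `refinement_terms(results, {"java"})` surfaces "language" first, suggesting "java language" as a narrower query, which is the quality/efficiency tradeoff the slide describes: cheap to compute, but blind to any signal beyond frequency.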
64. 63 Faceted Search. Example facets: location: Sunnyvale, CA; location: Phoenix, AZ; location: Amherst, MA; department: data management; department: machine learning. Challenges: 1. How to select facets and facet conditions at each level, to minimize the user's expected navigation cost? 2. How to rank facets and facet conditions? Chakrabarti SIGMOD 04; Kashyap CIKM 10. FRONTEND AMBIGUITY QUERY REFINEMENT
65. 64 Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
66. 65 Query Rewriting. Motivation: synonyms and alternative names ("green card" vs "permanent residency"); too specific ("MS Office 2007 for Mac x64 edition"); non-quantitative ("small laptop"). Approaches: using query/click logs; finding rewriting rules from missing results (e.g., replace "green card" with "permanent residency"); using differential queries. FRONTEND AMBIGUITY QUERY REWRITING
67. 66 Using Query and Click Logs [Cheng et al., ICDE 10]. The availability of query and click logs can be used to assess ground truth. Idea: find and return historical queries whose ground truth (via the click log) significantly overlaps with the top-k results of Q; these may be synonyms, hypernyms, or hyponyms of Q. Examples: "query" and "search" are synonyms; "database" is a hypernym of "MySQL"; "MySQL" is a hyponym of "database". FRONTEND AMBIGUITY QUERY REWRITING
68. 67 Automatic Suggestion of Rewriting Rules from Missing Results [Bao et al., SIGIR 12]. Challenges for automatically generating rewriting rules: rules should be semantically natural; a new rule designed for one query may eliminate good results of another query. Example: for the query "green card", result d is missing / should be ranked higher, and d contains the phrase "permanent residency" → rewriting rule: green card → permanent residency. FRONTEND AMBIGUITY QUERY REWRITING
69. 68 Automatic Suggestion of Rewriting Rules from Missing Results [Bao et al., SIGIR 12]. Input: query q, missed desirable results d. Output: a selected set of rules. Generate candidate rules L → R, where L is an n-gram in q and R is an n-gram in high-quality fields of d. Identify semantically natural rules by machine learning (e.g., "green card → permanent residency" rather than "green card → federal government"). Greedily select a subset of rules that maximizes the overall query quality. FRONTEND AMBIGUITY QUERY REWRITING
70. 69 Keyword++ (Entity Databases) [Xin et al., PVLDB 10]. Example query: "small IBM laptop" over a product table (ID, Product Name, Brand Name, Screen Size, Description); e.g., 1: ThinkPad E545, Lenovo, 15, "The IBM laptop...small business"; 2: ThinkPad X240, Lenovo, 12, "This notebook...". Idea: to understand a term, compare two queries that differ on this term, and analyze the differences of the attribute-value distributions in the results; e.g., to understand the term "IBM", compare the results of "IBM laptop" vs. "laptop". FRONTEND AMBIGUITY QUERY REWRITING
71. 70 Keyword++ (Entity Databases) [Xin et al., PVLDB 10]. Suppose "IBM laptop" has 50 results, 30 having brand: Lenovo, while "laptop" has 500 results, only 50 having brand: Lenovo. The difference on brand: Lenovo is significant, reflecting the meaning of "IBM": IBM → brand: Lenovo. Likewise: small → order by size ASC. Offline: compute the best mapping for all terms in the query log. Online: compute the best segmentation of the query (DP). FRONTEND AMBIGUITY QUERY REWRITING
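The differential-query comparison can be sketched as a shift in attribute-value shares between the two result sets; the rows below are invented to mimic the slide's 60%-vs-10% Lenovo proportions:

```python
def value_shift(with_term, without_term, attr):
    """Attribute value whose share grows most when the extra term is added."""
    def share(rows, val):
        return sum(r[attr] == val for r in rows) / len(rows)

    values = {r[attr] for r in with_term}
    return max(values, key=lambda v: share(with_term, v) - share(without_term, v))

# Hypothetical result sets for "IBM laptop" (5 rows) vs. "laptop" (10 rows):
ibm_laptop = [{"brand": "Lenovo"}] * 3 + [{"brand": "HP"}, {"brand": "Dell"}]
laptop = [{"brand": "Lenovo"}] + [{"brand": "HP"}] * 5 + [{"brand": "Dell"}] * 4
```

The Lenovo share jumps from 10% to 60% when "IBM" is added, so the offline phase would record the mapping IBM → brand: Lenovo for later use in query segmentation.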
72. 71 Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
73. 72 Offline: how many query forms, and which query forms, should be generated? Too many: hard to find the relevant forms. Too few: limits query expressiveness. Online: how to identify query forms relevant to users' search needs? Query Forms Enabling users to issue precise structured queries without mastering structured query languages. advantage challenges Baid SIGMOD 09 Jayapandian PVLDB 08 Ramesh PVLDB 11 Tang TKDE 13 FRONTEND AMBIGUITY QUERY FORMS
74. Serving User Queries at Front End (52) 1. Ambiguity (29) 2.
Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy
(8)
75. 74 2. Ranking Ranking Method Categories Unstructured Data: represents queries and documents using vectors; each component is a term, and the value is its weight; ranking score = similarity(query vector, result vector). Structured Data: a document is a node, and a result is a subgraph/subtree. vector space model proximity based ranking authority based ranking FRONTEND RANKING
76. 75 2. Ranking Ranking Method Categories Unstructured Data
proximity of keyword matches in a document can boost its ranking.
Structured Data weighted tree/graph size, total distance from root
to each leaf, semantic distance, etc. vector space model authority
based ranking proximity based ranking FRONTEND RANKING
77. 76 2. Ranking Ranking Method Categories vector space model
Unstructured Data nodes linked by many other important nodes are
important. Structured Data authority may flow in both directions of
an edge different types of edges in the data (e.g., entity-entity
edge, entity-attribute edge) may be treated differently. proximity
based ranking authority based ranking FRONTEND RANKING
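The vector space model described above can be sketched concretely. This is a minimal illustration using raw term frequencies as weights; a real engine would use tf-idf or similar weighting.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector; in practice weights would be tf-idf."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """ranking score = similarity(query vector, result vector)."""
    q = tf_vector(query)
    scored = [(cosine(q, tf_vector(d)), d) for d in docs]
    return [d for score, d in sorted(scored, key=lambda x: -x[0])]
```

Proximity-based and authority-based signals (the other two categories) would be folded in as additional score components on top of this similarity.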
78. Serving User Queries at Front End (52) 1. Ambiguity (29) 2.
Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy
(8)
79. 78 3. Representation Enterprise corpus can be much more
heterogeneous than a collection of documents or web pages.
Different searches may have different types: retrieving a document,
a figure, a tuple, a subgraph, analytical keyword queries, etc.
Result diversification Result summarization Result differentiation
solutions FRONTEND REPRESENTATION
80. 79 Result Diversification Result diversification is essentially the same problem as query refinement. e.g., Java -> Java language, Java tea, Java island. Same techniques apply. FRONTEND REPRESENTATION DIVERSIFICATION
81. 80 Result Summarization Unstructured data: lots of work on
text summarization in machine learning, natural language processing
and IR communities. Structured data: Size-l object summary
(Relational) Result snippet (XML) Das, CMU 07 (unpublished)
Nenkova, Mining Text Data 12 surveys FRONTEND REPRESENTATION
SUMMARIZATION
82. 81 Size-l Object Summary (fakas pvldb 11) [Figure: for entity Mike, an unstructured first-window summary vs. a structured summary over related papers, patents, and conferences] FRONTEND REPRESENTATION SUMMARIZATION
83. 82 Size-l Object Summary (fakas pvldb 11) Each tuple has: a
static importance score. similar idea as PageRank a run-time
relevance score. distance to result root connectivity properties to
result root Objective: find a connected snippet of the result,
which consists of l tuples and has the maximum score. Dynamic
programming based solution. Result snippet for XML: Liu TODS 10
related FRONTEND REPRESENTATION SUMMARIZATION
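The dynamic-programming step above (find a connected snippet of l tuples with maximum score) can be sketched as a standard tree knapsack. This is a simplified illustration assuming the result is a tree of tuples with precomputed per-tuple scores; the paper's static/runtime score combination is abstracted into a single `scores` map.

```python
def best_summary_score(tree, scores, root, l):
    """Max total score of a connected snippet containing the root
    with at most l tuples. tree maps node -> list of children."""
    def solve(v):
        # best[k] = max score using k nodes of v's subtree, v included
        best = {1: scores[v]}
        for child in tree.get(v, []):
            child_best = solve(child)
            merged = dict(best)  # option: skip this child entirely
            for k, s in best.items():
                for ck, cs in child_best.items():
                    if k + ck <= l and merged.get(k + ck, float("-inf")) < s + cs:
                        merged[k + ck] = s + cs
            best = merged
        return best

    return max(solve(root).values())
```

Connectivity is enforced because a child's subtree can only contribute together with the child itself, and every table is rooted at its node.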
84. 83 Result Differentiation Example (query: NEC Labs Open House): Result 1 (event year: 2000; paper titles: OLAP, data mining) vs. Result 2 (event year: 2012; paper titles: cloud, scalability, search); each result is a large table with many people / papers / posters. Result differentiation vs. comparing different credit cards on a bank website: the latter only works with pre-defined features. FRONTEND REPRESENTATION DIFFERENTIATION
85. 84 4. Expert Search goal: find an expert within an enterprise to solve a particular problem. Ways for judging an expert: documents in which a candidate and a topic co-occur; topics near a candidate in a document; problem solving / ticket routing history; users' knowledge on a topic (an expert should be more knowledgeable); social relationship between expert and user (problem solving is usually more effective if the expert has a close social relationship with the user); external corpus (many employees publish externally, e.g., papers, blogs). FRONTEND EXPERT SEARCH
86. 85 Classical Methods candidate model: builds a feature vector for each expert using various evidence; ranks experts based on the query, using traditional retrieval models. document model: first finds documents related to the query, then locates experts in the documents; mimics the process a human takes. Balog CIKM 08 survey FRONTEND EXPERT SEARCH
87. 86 User-Oriented Model (smirnova ecir 11) Users prefer experts who: are more knowledgeable than themselves (knowledge gain: p(e|q) - p(u|q)); have a close social relationship with themselves (time-to-contact: shortest path in the social graph, e.g., from an employee to a department head). e = expert, u = user FRONTEND EXPERT SEARCH
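The two preferences above can be combined in a simple score. This is an illustrative sketch, not the paper's exact model: `knowledge` is a hypothetical map of person -> {topic: p(person|topic)}, and `alpha` is an assumed mixing weight between knowledge gain and social closeness.

```python
from collections import deque

def time_to_contact(graph, src, dst):
    """Shortest path length (BFS hops) in the social graph."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nb in graph.get(node, []):
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return float("inf")

def expert_score(knowledge, graph, topic, expert, user, alpha=0.5):
    """Combine knowledge gain p(e|q) - p(u|q) with social closeness
    (inverse time-to-contact)."""
    gain = knowledge[expert].get(topic, 0.0) - knowledge[user].get(topic, 0.0)
    dist = time_to_contact(graph, user, expert)
    closeness = 1.0 / dist if dist not in (0, float("inf")) else 0.0
    return alpha * gain + (1 - alpha) * closeness
```

An expert who knows the topic much better than the user and sits one hop away in the social graph scores higher than a distant or less knowledgeable one.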
88. 87 Using Web Search Engine (santos inf. process. manage. 11) Pipeline: search the intranet corpus for candidates (query q -> result from intranet); formulate a web query for each candidate (web query -> result from internet); combine the candidates. Web query: candidate's full name (Jeff Smisek), organization's name (IBM), terms in q (data integration), excluding results from the organization: -site:ibm.com FRONTEND EXPERT SEARCH
89. 88 Ticket Routing (shao kdd 08) new ticket: DB2 login
failure transferred to group A transferred to group B transferred
to group C resolved How to find the best group and reduce problem
solving time? Markov chain model Using only previous routing
history (not ticket content) FRONTEND EXPERT SEARCH
90. 89 Ticket Routing (shao kdd 08) Pr(g|S): probability of routing a ticket to group g given previous groups S. Pr(g|S) includes the probability that: g can solve the ticket; g can correctly re-route the ticket. Train the Markov chain model from ticket routing history. FRONTEND EXPERT SEARCH
91. Serving User Queries at Front End (52) 1. Ambiguity (29) 2.
Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy
(8)
92. 91 5. Privacy user privacy: It is sometimes desirable that the search engine doesn't know which documents a user wants to retrieve. For users: privacy. For enterprises: avoiding liability. data privacy: While a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus. Enterprises may not want to disclose such information to all users.
93. 92 User Privacy Private Information Retrieval (PIR): old topic, tons of theoretical papers. Modifying the search engine: e.g., forcing it to forget user activities; embellishing queries with decoy terms (Pang PVLDB 10). Light-weight solutions with no change to the search engine: using ghost queries to obfuscate user intention (Pang ICDE 12).
94. 93 Private Information Retrieval (PIR) Idea: retrieve more documents than needed. Naïve: retrieve the entire corpus. How to minimize the number of retrieved & unneeded documents? Tons of theoretical papers on different variations of the problem, e.g., different computation power of the search engine; different numbers of non-communicating corpus replicas. Gasarch EATCS Bulletin 2004 survey
95. 94 Ghost Queries (pang icde 12) Pipeline: user query -> generate ghost queries -> submit to search engine -> discard ghost query results -> results. Challenges: generate ghost queries on topics different from the user's topics of interest, and make it difficult for the search engine to infer the user's topics; ghost queries need to be meaningful/realistic, so that they cannot be easily identified.
96. 95 Ghost Queries (pang icde 12) (ε1, ε2) privacy model: given a user query, if the probability of a topic increases by more than ε1, it should be reduced to below ε2 by the ghost queries. Topics are predefined. A ghost query must be coherent: all words in the ghost query should describe common or related topics. Randomized algorithm based solution.
97. 96 Data Privacy solutions: inserting dummy tuples OR randomly generating attribute values (only applicable to structured data); disallowing certain queries OR returning snippets (search quality loss); altering a small number of results: adding dummy results, modifying results, hiding some results (Zhang SIGMOD 12). FRONTEND PRIVACY
98. 97 Aggregate Suppression (zhang sigmod 12) Example: consider corpus A and B. A: n documents. B: 2n documents. Goal: suppress COUNT(*), i.e., an adversary cannot tell which corpus is larger. Naïve approach 1: deterministically remove n documents from B. Achieves the goal, but with search utility loss: those n documents can never be retrieved. Naïve approach 2: randomly drop half of the results at run time. No search utility loss, but fails to achieve the goal: a clever adversary can still get the information. FRONTEND PRIVACY
99. 98 Aggregate Suppression (zhang sigmod 12) Algorithm ideas: carefully adjust query degree (number of documents matched by a query) and document degree (number of queries matching a document) by hiding documents at run time; decline a query if its result can be covered by a small number of previous queries, and return the previous query results instead. FRONTEND PRIVACY
100. 99 Backend Collect data Analyze data Store and index data Admin System performance Search quality control/improvement Frontend Interpret user query Search index Present results Interact with user index Data source Tutorial Outline
101. 100 Enterprise Search Administrators Main responsibilities: care and feeding of an enterprise search solution; monitor intranet help inboxes and respond to requests; assist in troubleshooting intranet issues for content contributors. Core skills required: understand general corporate business processes; experience in coordinating activities and managing relationships with employees, content administrators, stakeholders, IT teams and external agencies. Key Observation: search administrators ≠ IR experts. Search Admin Admin Overview
102. 101 What Does a Search Administrator Need? Enterprise users: "Bad results for query", "I'm missing the golden URL", "Result 22 should be ranked much higher!" Query logs: query global campus seems unsatisfying. Understand overall search quality: overall trend; YOY change; by segmentation. Understand individual search results: why a certain result is or isn't brought back; its ranking. Maintain search quality: underlying data evolves; terminology changes; policy/business process changes; organization changes; hot topics. Search Admin Admin Overview
105. 104 What Does a Search Administrator Need? Enterprise users: "Bad results for query", "I'm missing the golden URL", "Result 22 should be ranked much higher!" Query logs: query global campus seems unsatisfying. Understand overall search quality: overall trend; YOY change; by segmentation. Understand individual search results: why a certain result is or isn't brought back; its ranking. Maintain search quality: underlying data evolves; terminology changes; policy/business process changes; organization changes; hot topics. Search Admin Admin Examples
114. Enterprise Search in the Big Data Era Case Study: IBM
Intranet Search
115. 114 Experience at IBM Internal Search IBM deployed a commercially available search engine implementing standard IR techniques. Search quality went down over time to the point that search results were unacceptable! Success (≥ 1 relevant result): 14% on top-1, 23% on top-5, 34% on top-50! [Zhu et al., WWW07] So, they implemented various solutions. To the administrators managing the engine, the exposed control knobs were insufficient. Case Study Background
116. 115 Attempts to Improve Search Enhanced link analysis by incorporating links to/from the external WWW. Creative hacks: added fake terms to documents & queries; # terms per document determined by popularity: how much TF increase is required for the needed rank boost? Hard-coded custom results for the top 1200+ queries. Didn't help! Quality went down! Maintenance nightmare: the heuristic needs to be updated upon each nontrivial change in term stats./ranking parameters. Even bigger nightmare: how to deal with continuously changing terminology? Case Study Background
117. 116 Goals of Gumshoe Continually changing terminology: product names change, e.g., Network Station Manager -> Thin Client Manager. Domain-specific meaning: Paula Summa search should bring Paula Summa from employee directories; popcorn search -> conference call! Domain-specific repetitions: per diem search returns Result 1: IBM Travel: Per Diem; Result 2: IBM Travel: Per Diem Rates; Result 3: IBM Travel: National perdiems; ... Result 25: IBM Travel: Per Diem Policy. Gumshoe: Generic search solution, customizable & maintainable in many domains. Simple customization with reasonable effort. Ongoing search-quality management. Philosophy: programmable search. Case Study Background
118. 117 Programmable Search: Main Idea Goals: Transparency: know precisely why every result item is being brought back; understand how changes in content/intents affect search. Maintainability and Debuggability: ranking logic is guided by explicit rules; properly react to changes in content/intents. Building blocks: deep analytics on documents; domain-specific analysis of queries; transparent, customizable, rule-driven ranking. runtime rules backend analytics interpretations Case Study Background
119. 118 Distributed Analytics Platform (IBM InfoSphere BigInsights) Crawling, information extraction, token generation (TG), indexing Search runtime Index Index and rule update services backend analytics runtime rules interpretations backend frontend Implementation Architecture Case Study Background
120. 119 Backend Analytics: 3 Parts Local Analysis (per-page
analysis) Global Analysis (cross-page analysis) Token Generation
(TG) index Case Study Background
121. 120 Local Analysis Categorizing pages Label pages by
custom categories IBM examples: HR, person, IT help, ISSI, sales
information, marketing, corporate standards, legal & IP-law,
Geo classification Associate documents with the relevant countries
& regions Annotating pages Identify HomePage annotation for
people, projects, communities, Simply knowing where a page is
physically hosted is not enough (example: Czech Republic hosts all
pages for IBM in Europe) Case Study Backend Local Analysis
122. 121 Declarative approach Define an operator for each basic operation: input: a tuple of annotations; output: tuples of annotations. Compose operators to build complex extractors (algebraic expression). One document at a time -> trivial parallelism. Benefits of declarative approach: Expressivity: richer, cleaner rule semantics. Performance: better performance through optimization. Declarative IE System Case Study Backend Local Analysis
123. 122 SystemT Overview InfoSphere Streams InfoSphere BigInsights IBM Engines UIMA Cost-based optimization SystemT Runtime: highly embeddable runtime; Input Documents -> Extracted Objects. AQL Extractors Embedded machine learning model AQL Rules: create view SentimentForCompany as select T.entity, T.polarity from classifyPolarity(SentimentFeatures) T; create view Company as select ... from ... where ...; create view SentimentFeatures as select ... from ...; Case Study Backend Local Analysis
124. 123 Homepage Identification Example intranet page: G J Chaitin Home Page. Title Extraction: titles -> matching title patterns; dictionary match: Home Page for G J Chaitin. URL Extraction: URLs (http://w3.ibm.com/hr/idp/, http://w3-03.ibm.com/isc/index.html, http://chis.at.ibm.com/) -> matching URL patterns; homepage for: idp, isc, chis. Employee directory; many more. [Zhu et al., WWW07] Case Study Backend Local Analysis
125. 124 IBM Confidential Role of Global Analysis Among the 38 pages with the exact same title, which is the best for Paula Summa? Case Study Backend Global Analysis
126. 125 Person Title Token Generation (TG) Annotated values
Index content Ching-Tien T. (Howard) Ho Ho Ching-Tien Tien Ho Ho,
Tien Howard Ho Ching-Tien H. ... Global Technology Services TG
Howard Ho Ching Tien ... gts Global Technology Services Global
Technology Technology Services Global Technology ...
GlobalTechnologyServices nGramTG spaceTG Case Study Backend Token
Generation
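The token generators on this slide (acronymTG, spaceTG, nGramTG) can be sketched directly. This is an illustrative simplification assuming whitespace-separated annotated values like "Global Technology Services"; personNameTG (name-variant generation) is omitted.

```python
def acronym_tg(phrase):
    """acronymTG: 'Global Technology Services' -> 'gts'."""
    return "".join(word[0].lower() for word in phrase.split())

def space_tg(phrase):
    """spaceTG: variant with all spaces removed,
    e.g., 'GlobalTechnologyServices'."""
    return phrase.replace(" ", "")

def ngram_tg(phrase, max_n=2):
    """nGramTG: contiguous word n-grams up to length max_n,
    e.g., 'Global Technology', 'Technology Services'."""
    words = phrase.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]
```

All generated tokens are indexed alongside the original annotated value, so queries like "gts" or "Technology Services" still hit the Global Technology Services page.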
127. 126 3 Phases of Runtime Flow Search Query Phase 1: Query
Semantics Rewrite rules Query interpretation Phase 2: Relevance
Ranking By relevance buckets + conventional IR Phase 3: Result
Construction Grouping rules Re-ranking rules Case Study
Frontend
128. 127 Runtime Flow in More Detail: search query -> rewrite rules -> queries -> query interpretation -> interpretations -> partially ordered interpretations -> interpretations execution -> partially ordered results -> result aggregation -> ordered results -> grouping rules -> ordered & grouped results -> re-ranking rules -> final results. Phase 1: Query Semantics; Phase 2: Relevance Ranking; Phase 3: Result Construction. Case Study Frontend
129. 128 Runtime Rules: Pattern-Action Language (Fagin 2012) A pattern expression is matched against the keyword query; when the query matches a pattern, perform the action. Similar to the query-template rules of Agarwal et al. [WWW 2010]. Examples:
Query Pattern | Queries Matching | Possible Action
EQUALS [r=ibm|information|info] [d=COUNTRY] | ibm germany; info india | Rewrite into [country] hr (e.g., germany hr)
ENDS_WITH installation | acrobat installation; db2 on aix installation | Replace installation with ISSI (e.g., acrobat ISSI)
CONTAINS directions to [d=SITE] | driving directions to almaden; directions to watson from jfk | Pages of siteserv category should be ranked higher
STARTS_WITH [d=PERSON] | john kelly biography; steve mills announcement | Group together pages that represent blog entries
Query Semantics Case Study Frontend
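A tiny matcher for these pattern operators can be sketched as follows. This is an illustrative simplification of the pattern-action language: dictionary slots like [d=COUNTRY] are omitted, and actions are plain functions rewriting the query string.

```python
def matches(op, pattern, query):
    """Match a word-level pattern against the keyword query."""
    q, p = query.lower().split(), pattern.lower().split()
    if op == "EQUALS":
        return q == p
    if op == "STARTS_WITH":
        return q[:len(p)] == p
    if op == "ENDS_WITH":
        return len(q) >= len(p) and q[-len(p):] == p
    if op == "CONTAINS":
        return any(q[i:i + len(p)] == p
                   for i in range(len(q) - len(p) + 1))
    raise ValueError("unknown operator: " + op)

def apply_rules(rules, query):
    """rules: list of (op, pattern, action); each action maps the
    matched query string to a rewritten query string."""
    for op, pattern, action in rules:
        if matches(op, pattern, query):
            query = action(query)
    return query
```

For example, the ENDS_WITH installation rule rewrites "acrobat installation" into "acrobat ISSI" while leaving non-matching queries untouched.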
130. 129 What's Best for Benefits? Query Semantics Case Study Frontend
131. 130 What's Best for Benefits? The most important IBM page for benefits changes over time: currently it is netbenefits. Query Semantics Case Study Frontend
132. 131 Rewrite Rules benefits -> netbenefits Query Semantics Case Study Frontend
134. 133 Complex Rules java jim: people with first name Jim. How can we avoid pages from the people category? Query Semantics Case Study Frontend
135. 134 Complex Rules java jim and not in person category Query Semantics Case Study Frontend
136. 135 Complex Rules java jim and not in person category: the query flows through the runtime (rewrite rules -> queries -> interpretations -> partially ordered interpretations -> interpretations execution -> partially ordered results -> result aggregation -> ordered results -> grouping rules -> ordered & grouped results -> re-ranking rules -> final results; java search). Query Semantics Case Study Frontend
137. 136 Interpretations Scenario: an IBM employee wants to download Lotus Symphony 1.3. Runtime interpretation: download symphony 1.3 -> category=issi software=symphony 1.3 Query Semantics Case Study Frontend
139. 138 3 Phases of Runtime Flow Search Query Phase 1: Query
Semantics Rewrite rules Query interpretation Phase 2: Relevance
Ranking By relevance buckets + conventional IR Phase 3: Result
Construction Grouping rules Re-ranking rules Relevance Ranking Case Study Frontend
140. 139 Person Title Recall: Token Generation (TG) Annotated
values Index content Ching-Tien T. (Howard) Ho Global Technology
Services TG Howard Ho Ching Tien ... gts Global Technology Services
Global Technology Technology Services Global Technology ...
GlobalTechnologyServices nGramTG spaceTG Ho Ching-Tien Tien Ho Ho,
Tien Howard Ho Ching-Tien H. ... Person + personNameTG Person + nGramTG Title + acronymTG Title + spaceTG Title + nGramTG Relevance Ranking Case Study Frontend
141. 140 Annotation + TG -> Relevance Bucket Howard Ho Ching Tien ... GlobalTechnologyServices Person + personNameTG Person + nGramTG Title + acronymTG Title + spaceTG Title + nGramTG query search Relevance buckets: buckets are ranked based on annotation type and on TG quality. A page can belong to multiple buckets. Within each bucket, ranking is by conventional IR. Relevance Ranking Case Study Frontend
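The bucket-then-IR ordering described above reduces to a two-key sort. This is a minimal sketch: each result is assumed to carry a bucket rank (lower = stronger annotation/TG evidence) and a conventional IR score; real pages in multiple buckets would be assigned their best bucket first.

```python
def bucket_rank(results):
    """results: list of (bucket, ir_score, doc). Buckets are ranked
    first; within a bucket, rank by conventional IR score (desc)."""
    ordered = sorted(results, key=lambda r: (r[0], -r[1]))
    return [doc for _bucket, _score, doc in ordered]
```

A page in a stronger bucket outranks any page in a weaker bucket, even when the weaker-bucket page has a higher raw IR score; the IR score only breaks ties within a bucket.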
142. 141 Ranking by Relevance Buckets (example query: employment verification) rewrite rules -> queries -> interpretations -> partially ordered interpretations -> interpretations execution -> partially ordered results -> result aggregation -> ordered results -> grouping rules -> ordered & grouped results -> re-ranking rules -> final results Relevance Ranking Case Study Frontend
143. 142 3 Phases of Runtime Flow Search Query Phase 1: Query
Semantics Rewrite rules Query interpretation Phase 2: Relevance
Ranking By relevance buckets + conventional IR Phase 3: Result
Construction Grouping rules Re-ranking rules Result Construction Case Study Frontend
144. 143 Grouping Rules Grouping rules define how search results should be grouped together. Search administrators can improve the diversity of search results (on the 1st page) based on their familiarity with the data sources. Group pages of the same category, by query pattern: per diem -> travel, you-and-ibm; ANY -> ISSI, IT Help Central, Forum, Bluepedia, Media Library, ... Result Construction Case Study Frontend
145. 144 Flooding with Similar Pages: need first-page diversity. Result Construction Case Study Frontend
146. 145 Grouping Rule to the Rescue per diem -> travel, you-and-ibm Result Construction Case Study Frontend
147. 146 Grouping Rule to the Rescue per diem -> travel, you-and-ibm: the per diem search flows through rewrite rules -> queries -> interpretations -> interpretations execution -> partially ordered results -> result aggregation -> ordered results -> grouping rules -> ordered & grouped results -> re-ranking rules -> final results. Result Construction Case Study Frontend
148. 147 Re-ranking Rules Re-ranking rules adjust the ranking of search results based on categories. Example: the search administrator specifies the important sources of hot/current topics. Hot topics (smarter planet, cloud computing, centennial, ...): rank these categories higher: Bluepedia, News, About-IBM. Result Construction Case Study Frontend
149. 148 Re-ranking Rule for Hot Topics Bluepedia, Technical News, Homepages of About IBM. Hot topics (smarter planet, cloud computing, centennial, ...): rank these categories higher: Bluepedia, News, About-IBM. Result Construction Case Study Frontend
150. 149 Re-ranking Rules for Person Queries [d=PERSON] -> executive_corner, media_library, organization_chart, files Result Construction Case Study Frontend
152. 151 3 Phases of Runtime Flow Search Query Phase 1: Query
Semantics Rewrite rules Query interpretation Phase 2: Relevance
Ranking By relevance buckets + conventional IR Phase 3: Result
Construction Grouping rules Re-ranking rules Case Study
Frontend
153. 152 What Administrators Need Search administrators have major problems with an opaque search engine. Programmable search provides: customization to the specific domain; ongoing search-quality management; allows building a search-quality toolkit. Recap: Case Study Admin
154. 153 Gumshoe Search Quality Toolkit! Case Study Admin
155. Demo
156. 155 Demo Case Study Admin
157. 156 The Proof of the Pudding Is in the Eating Immediate positive impact within the first 3 months: improved natural clickthrough rate by 100%+; top 5 results selected about 90% of the time. Sustained search quality improvements 4 years since going live: stable natural search clickthrough rate. Gumshoe (Aug. 2011 to Oct. 2011) vs. Old Intranet Search (Aug. 2010 to Aug. 2011): natural clickthrough rate. Case Study Results
158. 157 Summary Programmable search: simple & flexible customization; search quality management. Backend Analytics: local analysis (per-page analysis); global analysis (cross-page analysis); token generation (TG) [Fagin et al., PODS10, PODS11]. Tooling: search provenance; rule suggestion; utilization of relevance buckets [Li et al., SIGIR06; Zhu et al., WWW07]. Phase 1: Query Semantics (rewrite rules, query interpretation); Phase 2: Relevance Ranking (by relevance buckets + conventional IR); Phase 3: Result Construction (grouping rules, re-ranking rules) [Bao et al., ACL 2010, SIGIR 2012, CIKM 2012]. Case Study Summary
159. Enterprise Search in the Big Data Era Future
Directions
160. 159 Search Engine Components Backend Collect data Analyze
data Store and index data Admin System performance Search quality
control/improvement Frontend Interpret user query Search index
Present results Interact with user index Data source
161. 160 Future Directions Data Heterogeneity A rich variety of data types needs to be searched in enterprises: docs, databases, images, videos, social graphs, etc. observations How to automatically identify relevant data types, and search and rank across different data types? e.g., for image search, should image recognition techniques be incorporated into enterprise search engines? If so, how? questions
162. 161 Future Directions Data Freshness New data is continuously collected and published in enterprises, often at a very fast rate. Web search engines are not required to index new websites quickly, but in enterprises, new content may need to be searchable as soon as possible. observations How to build efficient real-time indexes to ensure data freshness in enterprise search? questions
163. 162 Future Directions Search Context Enterprise search users have richer profiles than web users: activities, bio, position, projects, experiences, etc. observations How to utilize users' contexts to provide customized results? Is it possible to predict the information a user may want, and push it to the user? questions
164. 163 Future Directions User Preference Different users in an enterprise have different expertise, and may prefer different ways to express queries. e.g., some users prefer pure keyword search, while others may want lightly-structured queries. observations How to effectively satisfy different users' needs for expressing queries? questions
165. 164 Future Directions Question Answering The purpose of many enterprise searches is to find answers to questions. e.g., what is the previous name of a product, and when did we change to the current name? observations Is it possible to effectively use natural language processing techniques and domain knowledge to automatically answer natural language questions? questions
166. 165 Future Directions Transactional Search Over 1/3 of enterprise search queries are transactional. It would be desirable if enterprise search engines could recommend business processes to accomplish a certain task given a transactional search. E.g., given a customer's lengthy complaint letter, how to find the departments relevant to the complaints? observations How to better support transactional search? How to initiate a business process based on the results of a search? questions
167. 166 Future Directions Big Data Analytics Rich information and knowledge lie in big data. Many employees (not just data analysts) may benefit from the ability to perform analytics on the company's big data. observations How to build a low-cost, interactive platform that allows a large number of employees to issue analytical queries? How to give employees the capabilities to analyze big data, if they have little knowledge of SQL or MapReduce programming? questions
168. 167 Future Directions Tooling for Search Quality Maintenance Most enterprise search engines have to be manually evaluated and tuned by a search administrator with domain knowledge, in an ad hoc fashion. observations Can we automate this process, or at least minimize manual involvement? Can we fully utilize explicit user feedback? Explicit user feedback is easier to obtain in enterprise search, and there is less spam. questions
169. Thanks. Acknowledgement: IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin IBM CIO's Office: Nicole Dri, Brian C. Meyer LogicBlox: Benny Kimelfeld* TripAdvisor: Adriano Crestani Campos* Facebook: Zhuowei Bao* NJIT: Yi Chen UNSW: Wei Wang * work done while at IBM