
Enterprise Search in the Big Data Era: Recent Developments and Open Challenges


These are the slides used in our 3-hour tutorial at VLDB 2014. Yunyao Li, Ziyang Liu, Huaiyu Zhu: Enterprise Search in the Big Data Era: Recent Developments and Open Challenges. PVLDB 7(13): 1717-1718 (2014). Abstract: Enterprise search allows users in an enterprise to retrieve desired information through a simple search interface. It is widely viewed as an important productivity tool within an enterprise. While Internet search engines have been highly successful, enterprise search remains notoriously challenging due to a variety of unique challenges, and is being made more so by the increasing heterogeneity and volume of enterprise data. On the other hand, enterprise search also presents opportunities to succeed in ways beyond current Internet search capabilities. This tutorial presents an organized overview of these challenges and opportunities, and reviews the state-of-the-art techniques for building a reliable and high quality enterprise search engine, in the context of the rise of big data.


  • 1. Enterprise Search in the Big Data Era. Yunyao Li (IBM Research - Almaden), Ziyang Liu (NEC Labs), Huaiyu Zhu (IBM Research - Almaden)
  • 2. Enterprise Search: providing intuitive access to an organization's various digital content. IDC report [IDC 05]: $5k/person/year of salary wasted due to poor search; 9-10 hr/person/week spent searching, unsuccessful 1/3-1/2 of the time. Butler Group [Edwards 06]: 10% of salary cost wasted through ineffective search. Accenture survey [Accenture 07]: middle managers spend 2 hr/day searching, and >50% of what they find has no value. See also Hawking, Enterprise Search, http://david-hawking.net/pubs/ModernIR2_Hawking_chapter.pdf. [IDC 05] The enterprise workplace: how it will change the way we work. IDC Report 32919. [Edwards 06] www.butlergroup.com/pdf/PressReleases/ESRReportPressRelease.pdf. [Accenture 07] http://newsroom.accenture.com/article_display.cfm?article_id=4484
  • 3. Magic Search from the User's Point of View: the user types a query into a search box and gets back a ranked list of results. INTRODUCTION SEARCH
  • 4. What Happens Behind the Scenes. Backend: collect data, analyze data, index data. Frontend: serve user queries, return results. INTRODUCTION SEARCH
  • 5. How Does a Query Match a Document? Build side: analyze documents, build the index. Serve side: analyze the query, search the index, present results. INTRODUCTION SEARCH
  • 6. Search Is More Than Keyword Match. Specific features in documents are important: title, url, person name, product, actions, ... Features combine to form higher-level concepts: in-document (home page + person -> personal homepage); cross-document (URL link analysis, ...). The string representation in a document may not match that in the user query (person name: Bill Clinton vs. William Jefferson Clinton). User queries may be ambiguous (multiple interpretations). Presenting the results to the user: ranking, grouping, interactive refinement. INTRODUCTION SEARCH
  • 7. Internet vs Enterprise: Web data [Fagin WWW2003]. Creation of content: Internet: democratic, appealing to the reader, links imply approval; Enterprise: bureaucratic, conforming to mandate, links reflect internal structure. Relevant query results: Internet: large number, overlapping information, a reasonable subset suffices, ranking is more universal; Enterprise: small number, specific function, specific pages required, ranking is relative to the query. Spamming: Internet: spam-infested, ranking can only be based on external authority; Enterprise: mostly spam-free, ranking based on content or metadata is reliable. Search engine friendliness: Internet: web pages designed to be search results; Enterprise: documents not designed to be search results, so special treatment is needed. INTRODUCTION ENTERPRISE VS INTERNET
  • 8. Internet vs Enterprise: Big Data. Content being searched: Internet: sources are web crawls; formats are html, xml, pdf, ...; Enterprise: variety of sources, variety of formats (email, database, application-specific access and formats). Search queries / expected results: Internet: target is web pages and office documents; users expect a list of documents, little personalization, results returned directly; Enterprise: target includes rows, figures, experts, ...; users expect customized results; personalization is required (geography, access, ...). Related information: Internet: links imply approval; a small number of domain-specific knowledge sources; generic analysis; Enterprise: links reflect organization structure; a large number of dynamic domain-specific knowledge sources; highly specialized analysis. Skill set of search admins: Internet: a large number of admins who are search experts, so facilitate update of search algorithms; Enterprise: a small number of admins who are domain experts, so facilitate use of domain knowledge. INTRODUCTION ENTERPRISE VS INTERNET
  • 9. Search Engine Components. Backend: collect data, analyze data, store and index data. Admin: system performance, search quality control/improvement. Frontend: interpret user query, search the index, present results, interact with the user. INTRODUCTION TUTORIAL OVERVIEW COMPONENTS
  • 10. Search Engine Architecture: data sources feed the backend (collect data, analyze data, store and index data), which builds the index; the frontend (interpret user query, search index, present results, interact with user) serves queries over the index; the admin oversees system performance and search quality control.
  • 11. Main Backend Functions. Document Ingestion (Collect): collect all the data to be searched; transform and store as documents. Analysis (Understand): information extraction; analyze and transform data; Local Analysis (in-document analysis) and Global Analysis (cross-document analysis). Indexing (Prepare for search): generate terms suitable for matching queries; index the search terms.
  • 12. Backend Section Outline: Overview, Data Ingestion, Local Analysis, Global Analysis, Indexing
  • 13. Typical analytics pipeline: DI -> LA -> GA -> Idx. Data ingestion: collect data, transform to a uniform document format, store in the document store. Local analysis: information extraction from each document (S1={f11, f12, ...}, S2={f21, f22, ...}, ...). Global analysis: cross-document analysis; rank, group, merge, and filter documents (G1={g1, ...}, G2={g2, g3, ...}). Indexing: generate search terms, index documents by search terms. BACKEND OVERVIEW
  • 14. Digression: Classical IR. Data ingestion: a given set of files. Local analysis: tokenize, remove stop words, stem, form n-grams (a toy sketch follows below). Global analysis: calculate statistics of terms in documents. Indexing: generate search terms, index by terms with statistics. BACKEND OVERVIEW
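A minimal sketch of that classical local-analysis step in Python, with a stand-in stop-word list and a toy suffix-stripping stemmer (a real system would use a library analyzer, e.g., Porter stemming in Lucene or NLTK):

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # tiny illustrative list

def stem(token: str) -> str:
    """Toy suffix-stripping stemmer; real systems use e.g. the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def local_analysis(text: str, n: int = 2):
    """Tokenize, drop stop words, stem, and emit unigrams plus token n-grams."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    tokens = [stem(t) for t in tokens if t not in STOP_WORDS]
    terms = list(tokens)
    terms += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms

print(local_analysis("Reimbursement of travel expenses"))
# ['reimbursement', 'travel', 'expens', 'reimbursement travel', 'travel expens']
```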
  • 15. Digression: Classical Web search. Data ingestion: crawl web pages. Local analysis: extract out-links. Global analysis: compute the principal eigenvector of the link matrix (PageRank). Indexing: generate search terms; index documents by search terms, together with PageRank. BACKEND OVERVIEW
  • 16. Demands of Enterprise Search. Data ingestion: handle a variety of sources and formats; deal with access policies; deal with update policies. Local analysis: incorporate domain knowledge; extract a rich set of semantics; categorize documents. Global analysis: cross-document analysis; rank, group, merge, and filter documents. Indexing: generate search terms, index documents by search terms. BACKEND OVERVIEW
  • 17. Desiderata of the backend: efficient incremental updates (fast turnaround time for updates); system performance and reliability (scaling with data size and available resources, fault tolerance); ease of administration and quality improvement (allow the search admin to customize domain-specific configurations). BACKEND OVERVIEW CHALLENGES / OPPORTUNITIES
  • 18. Backend Section Outline: Overview, Data Ingestion, Local Analysis, Global Analysis, Indexing
  • 19. Data Ingestion: crawl or receive pushes from a variety of sources (Web, DB, apps, email with attachments, PDF files); convert each item to a document and its content to text (e.g., an email and its PDF attachment become separate documents with their own docids); store in the document store. Must support update & retention policies. BACKEND DATA INGESTION
  • 20. Document-centric View. Data is a collection of documents; the document is the unit of storage and of search results. Three major components: a document identifier unique in the whole system; metadata fields (url, date, language, ...); a content field (the text to be searched). Data of different structures is represented uniformly (see the sketch below): web pages (each page is a document), relational data (each row is a document), hierarchical data (each node is a document). BACKEND DATA INGESTION
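A minimal sketch of this document-centric model in Python; the field names and example ids are illustrative, not any engine's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Unit of storage and search: unique id + metadata + searchable content."""
    doc_id: str                                   # unique in the whole system
    metadata: dict = field(default_factory=dict)  # url, date, language, ...
    content: str = ""                             # text to be searched

# A web page, a relational row, and a tree node all map onto the same shape:
page = Document("web:0001", {"url": "http://w3.example.com/hr"}, "HR portal ...")
row = Document("db:emp:42", {"table": "employees"}, "John Doe Almaden ...")
node = Document("xml:cat:7", {"path": "/catalog/item[7]"}, "ThinkPad X240 ...")
```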
  • 21. Push vs Pull. Definition: pull: the search engine initiates the transfer of data (web crawler); push: the content owner initiates the transfer (apps with push notification). Advantages: pull: operated by the search engine, uses standard crawlers; push: can handle special access methods, easy to adjust refresh rate, easy to handle special formats. Disadvantages: pull: difficult to access special data sources, difficult to adjust domain-specific treatment; push: needs synchronization with the content owner. Applicability: pull: prevalent for the Internet, also useful for the enterprise; push: rare for the Internet, very important for the enterprise. BACKEND DATA INGESTION
  • 22. Transform the Data. Format conversion: convert content to text (pdf, doc, ...), keeping as much structure as possible. Metadata conversion: obtain and transform metadata (HTTP headers, DB table metadata, ...). Merge/split documents: one-to-many (zip file, email thread, attachments); many-to-one (social tags merged into the original doc). BACKEND DATA INGESTION
  • 23. Storage options. SQL database: pro: traditional RDBMS strengths; supports insert, update, delete, fielded query, ...; con: too much system overhead. Indexing engine (Lucene): pro: closer to the document-centric view; supports insert, delete, fielded query; con: no direct in-document update; needs special treatment for distributed processing. NoSQL databases: pro: lightweight; sufficient for simple use; con: may lack features needed in the future (transactions?). Issues to consider: in-document update, access/retention policy, parallel processing. BACKEND DATA INGESTION
  • 24. Backend Section Outline: Overview, Data Ingestion, Local Analysis, Global Analysis, Indexing
  • 25. Local Analysis. Annotating pages: extract structured elements (title, header, ...); extract features for people, projects, communities, ...; extract features for cross-document analysis. Categorizing pages: label by standard categories (language, geography, date, ...); label pages by custom categories (IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP-law, ...). Local analysis is essentially information extraction. BACKEND LOCAL ANALYSIS
  • 26. Rule-based vs. Learning-based IE. Rule-based IE: pro: declarative, easy to comprehend, easy to maintain, easy to incorporate domain knowledge, easy to debug; con: heuristic, requires tedious manual labor. ML-based IE: pro: trainable, adaptable, reduces manual effort; con: requires labeled data, requires retraining for domain adaptation, requires ML expertise to use or maintain, opaque (not transparent). BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
  • 27. Landscape of Entity Extraction Implementations. NLP papers (2003-2012): 3.5% rule-based, 21% hybrid, 75% machine-learning based. Commercial vendors (2013): all vendors: 45% rule-based, 22% hybrid, 33% ML-based; large vendors: 67% rule-based, 17% hybrid, 17% ML-based. Example industrial systems: GATE, IBM InfoSphere BigInsights, Microsoft FAST, SAP HANA, SAS Text Analytics, HP Autonomy, Attensity, Clarabridge. Source: [CLR2013] Rule-based Information Extraction is Dead! Long Live Rule-based Information Extraction Systems!, EMNLP 2013. BACKEND LOCAL ANALYSIS INFORMATION EXTRACTION
  • 28. Local analysis for different features on an intranet page [Zhu et al., WWW07]. NavPanel extraction via self-link identification -> NavPanels. Title extraction by matching title patterns -> titles (e.g., 'IBM Global Services Security Home' -> 'IBM Global Services Security'). Dictionary match against a person-name dictionary (= the employee directory) -> person names in titles (e.g., 'G J Chaitin Home Page' -> 'G J Chaitin'). URL extraction by matching URL patterns -> URL names (e.g., http://w3-03.ibm.com/marketing/ -> 'marketing', http://w3-03.ibm.com/isc/index.html -> 'isc', http://chis.at.ibm.com/ -> 'chis'). BACKEND LOCAL ANALYSIS EXAMPLES
  • 29. Consolidation. Example: document language consolidation, combining evidence from the HTTP header (Accept-Language: en-us,en;q=0.5), meta tags, the document text encoding, and the URL (http://enterprise.com/hr/benefits/us/ca/). BACKEND LOCAL ANALYSIS TRANSFORMATIONS
  • 30. Backend Section Outline: Overview, Data Ingestion, Local Analysis, Global Analysis, Indexing
  • 31. Global Analysis. Deduplication: saves resources, reduces result clutter. Identifying the root of a URL hierarchy: used for result grouping and ranking. Anchor text analysis: assign external labels to documents. Social tagging analysis: assign tags and their weights to documents. Identifying different versions of the same document (due to variations in date, language, ...). Enterprise-specific global analysis: rules of the form 'when certain documents co-exist, do this'. BACKEND GLOBAL ANALYSIS
  • 32. Shingle-based deduplication (Leskovec, http://www.mmds.org/). Shingles: character or token n-grams, possibly stemmed, possibly handled specially around stop words; each document becomes a shingle set S_i = {s1, s2, ...}, which is then compressed into a minhash signature {h1(S_i), h2(S_i), ...}. Minhash maps sets to integers, based on permutations of the universal set. Jaccard similarity: |A ∩ B| / |A ∪ B|. Theorem: the probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets. Works for a more diverse set of documents; more precise. A toy sketch follows below. BACKEND GLOBAL ANALYSIS DEDUPLICATION
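A toy end-to-end sketch of shingling plus minhash signatures in Python; seeded linear hash functions stand in for random permutations, and a production system would add locality-sensitive banding to avoid all-pairs comparison:

```python
import random
import re

def shingles(text: str, n: int = 3) -> set:
    """Token n-gram shingles of a document."""
    toks = re.findall(r"\w+", text.lower())
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

PRIME = 2**61 - 1

def minhash_signature(shingle_set: set, num_hashes: int = 128) -> list:
    """One min-hash per seeded hash function; two signatures agree at a
    position with probability (approximately) the sets' Jaccard similarity."""
    sig = []
    for seed in range(num_hashes):
        rng = random.Random(seed)
        a, b = rng.randrange(1, PRIME), rng.randrange(0, PRIME)
        sig.append(min((a * hash(s) + b) % PRIME for s in shingle_set))
    return sig

def estimated_jaccard(sig1: list, sig2: list) -> float:
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)

d1 = "the quick brown fox jumps over the lazy dog near the barn"
d2 = "the quick brown fox leaps over the lazy dog near the barn"
s1, s2 = minhash_signature(shingles(d1)), minhash_signature(shingles(d2))
print(round(estimated_jaccard(s1, s2), 2))  # close to the true Jaccard similarity
```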
  • 33. Metadata-based deduplication (IBM Gumshoe search engine). Compute signatures S_i = [h_i1, h_i2, ...] from significant metadata: document title, section headers, signatures derived from the URL; ensure that all similar candidates get the same signature. Group documents by signature (G1 = {S1, ...}, G2 = {S2, S3, ...}), then perform detailed in-group similarity analysis within each candidate group. More customizable for an intranet; lower cost. BACKEND GLOBAL ANALYSIS DEDUPLICATION
  • 34. URL Root Analysis (Zhu et al., WWW07). Given a set of documents that all share the same value V of a feature X (e.g., at one time all webpages from the IBM Tucson site had the same title), find the roots of the URL forest: among host1/b/a/~user1/pub, host1/b/a, host1/b/a/~user1/, host1/b/c, host1/b/a/x_index.htm, host1/b/c/d, host1/b/c/home.html, host1/b/c/d/e/index.html?a=us, host1/b/c/d/e/index.html?a=uk, host1/b/c/d/e/index.html, the roots are host1/b/a and host1/b/c. These roots become the preferred results for the query X=V: when searching for the Tucson home page, only the IBM Tucson homepage will match. A sketch follows below. BACKEND GLOBAL ANALYSIS ROOT ANALYSIS
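A simplified sketch of the root-finding step (not the paper's actual algorithm): a URL is a root if no other URL in the set is an ancestor of it along path boundaries.

```python
def url_roots(urls):
    """Return the URLs in the set that are not under any other URL in the set."""
    norm = {u.rstrip("/") for u in urls}
    roots = []
    for u in sorted(norm):
        # u is a root unless some other v in the set is an ancestor directory of u
        if not any(u != v and u.startswith(v + "/") for v in norm):
            roots.append(u)
    return roots

pages = [
    "host1/b/a", "host1/b/a/~user1/", "host1/b/a/~user1/pub",
    "host1/b/c", "host1/b/c/d", "host1/b/c/home.html",
]
print(url_roots(pages))  # ['host1/b/a', 'host1/b/c']
```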
  • 35. Label Assignment (Zhu et al., WWW07). Anchor text global analysis: if documents A1 and A2 link to document B with anchor text 'X home', assign the label 'X' and/or 'X home' to B based on frequency. Social tagging global analysis: if bookmarks point to B with the tags 'X home', 'X', and 'Y home', assign the labels 'X home', 'X', and 'Y home' based on frequency. BACKEND GLOBAL ANALYSIS LABEL ASSIGNMENT
  • 36. Entity Integration using HIL [Hernández et al., EDBT13]. HIL defines entity types (the logical data model of the integration flow) and (SQL-like) rules to specify the integration logic. Entity population rules create entities from raw records, other entities, and links, and clean, normalize, aggregate, and fuse them. Entity resolution rules create links between raw records or entities. Pipeline: various data sources -> information extraction from unstructured data with declarative IE (IBM SystemT) [Chiticariu et al., ACL 2010] -> raw records -> entity resolution -> fuse/aggregate -> unified entities. An optimizing compiler targets a Big Data runtime (Jaql and Hadoop). BACKEND GLOBAL ANALYSIS ENTITY INTEGRATION
  • 37. Backend Section Outline: Overview, Data Ingestion, Local Analysis, Global Analysis, Indexing
  • 38. Indexing: generate and index search terms, to be matched against terms generated at runtime from user queries. Challenges: extracted terms do not match user query terms (morphological changes, synonyms, ...); the importance of a term depends on the query; the need for bucketing of indexes; support for incremental indexing. BACKEND INDEXING
  • 39. Term normalization. Example: date-time normalization. Given any of 'Wed Aug 27 10:06:11 PDT 2014', '27 Aug 2014, 10:06:11', '2014-08-27T10:06:11-07:00', '27 Aug 2014', or an epoch timestamp such as 1409133971, normalize to one canonical form, e.g., ISO 8601 with timezone (a sketch follows below). Other examples: person names, product names, ... BACKEND INDEXING TERM NORMALIZATION
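A minimal sketch of date-time normalization using only the Python standard library; the format list and the default timezone are illustrative choices, and %Z timezone-name parsing is platform-dependent:

```python
from datetime import datetime, timezone, timedelta

FORMATS = [
    "%a %b %d %H:%M:%S %Z %Y",   # Wed Aug 27 10:06:11 PDT 2014 (%Z support varies)
    "%d %b %Y, %H:%M:%S",        # 27 Aug 2014, 10:06:11
    "%Y-%m-%dT%H:%M:%S%z",       # 2014-08-27T10:06:11-07:00
    "%d %b %Y",                  # 27 Aug 2014
]

def normalize_date(raw: str, default_tz=timezone(timedelta(hours=-7))):
    """Map any recognized date spelling to one canonical ISO 8601 term."""
    raw = raw.strip()
    if raw.isdigit():  # epoch seconds, e.g. 1409133971
        return datetime.fromtimestamp(int(raw), tz=default_tz).isoformat()
    for fmt in FORMATS:
        try:
            dt = datetime.strptime(raw, fmt)
            if dt.tzinfo is None:          # assume a default zone if none given
                dt = dt.replace(tzinfo=default_tz)
            return dt.isoformat()
        except ValueError:
            continue
    return None  # unparsed: fall back to indexing the raw string

print(normalize_date("27 Aug 2014, 10:06:11"))  # 2014-08-27T10:06:11-07:00
```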
  • 40. Why Generate Variant Terms? Extracted feature strings differ from query strings. People names: the document has 'John Doe'; searches use 'Doe, John' or 'J Doe'. Acronym expansions: gts <-> Global Technology Services. N-gram variant generation: the title 'reimbursement of travel expenses' yields the terms 'reimbursement', 'travel expenses', 'reimbursement travel', 'reimbursement of travel', 'reimbursement expenses'. Normalization alone is not a sufficient solution: normalizing the document's 'John Doe' to 'J. Doe' and the query 'Jean Doe' to 'J. Doe' would make them match, but they are not supposed to match. Solution: generate variant terms at different levels of approximation. BACKEND INDEXING VARIANT TERM GENERATION
  • 41. Configurable Term Generation: configuration knobs determine the set of outputs (a sketch follows below). Given 'Mr. John (Jack) M. Doe Jr.': Configuration 1 (Initial=both, Dot=with, NickName=both, MiddleName=both, NameSuffix=without, Title=without, Comma=both) yields: John M. Doe; Doe, John M.; John Doe; Doe, John; J. M. Doe; Doe, J. M.; J. Doe; Doe, J.; Jack M. Doe; Doe, Jack M.; Jack Doe; Doe, Jack. Configuration 2 (normalization: Initial=without, Dot=without, NickName=without, MiddleName=without, NameSuffix=without, Title=without, Comma=without) yields: John Doe. BACKEND INDEXING VARIANT TERM GENERATION
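A toy sketch of knob-driven name-variant generation; the knobs mirror a subset of the configuration options named on the slide, and the function and parameter names are invented for illustration:

```python
from itertools import product

def name_variants(first, last, middle=None, nickname=None,
                  initials=True, use_nickname=True, use_middle=True,
                  comma_form=True):
    """Expand one extracted person name into indexable variant terms."""
    firsts = {first}
    if use_nickname and nickname:
        firsts.add(nickname)
    if initials:
        firsts |= {f[0] + "." for f in list(firsts)}   # John -> J.
    middles = {None}
    if use_middle and middle:
        middles.add(middle)
    variants = set()
    for f, m in product(firsts, middles):
        given = f if m is None else f + " " + m
        variants.add(given + " " + last)               # e.g. 'John M. Doe'
        if comma_form:
            variants.add(last + ", " + given)          # e.g. 'Doe, John M.'
    return sorted(variants)

# Configuration 1 from the slide: 12 variants of 'Mr. John (Jack) M. Doe Jr.'
print(name_variants("John", "Doe", middle="M.", nickname="Jack"))
# Configuration 2 (normalization): a single canonical form
print(name_variants("John", "Doe", middle="M.", nickname="Jack",
                    initials=False, use_nickname=False, use_middle=False,
                    comma_form=False))                 # ['John Doe']
```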
  • 42. Enterprise Search Backend (recap): DI -> LA -> GA -> Idx. Data ingestion: access various sources; document transform; format transform. Local analysis: information extraction, configurable. Global analysis: deduplication, URL root analysis, label assignment. Indexing: generate search terms, ... BACKEND RECAP
  • 43. Search Engine Architecture: data sources feed the backend (collect data, analyze data, store and index data), which builds the index; the frontend (interpret user query, search index, present results, interact with user) serves queries over the index; the admin oversees system performance and search quality control.
  • 44. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 45. 1. Ambiguity: the optimal keywords may not be used. Misspelled ('datbase'): addressed by query cleaning and query autocompletion. Under-specified: polysemy ('java'), too general ('database papers'): addressed by query refinement. Over-specified: synonyms, acronyms, abbreviations & alternative names ('green card' vs. 'permanent residency'), too specific ('MS Office 2007 for Mac x64 edition'): addressed by query rewriting. Non-quantitative ('small laptop'): addressed by query rewriting.
  • 46. Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
  • 47. Graph-based Spelling Correction (Bao ACL 11). Repartition the query (e.g., 'enterpricsearch'): each partition (token) should be plausible, i.e., confidence(correcting it) > threshold, where confidence is a linear combination of multiple scores with parameters learned by an SVM; domain knowledge is often used in calculating confidence. For each partition, generate candidate corrections with high scores: 'enterpricse arch', 'enterpric search', 'enter pric search', etc.; for 'pric': price (0.8), prim (0.6), etc. FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
  • 48. Graph-based Spelling Correction (Bao ACL 11). Build a graph that connects candidate corrections (e.g., enterprise, enter, price, prim, arc, sea, rich, search); each full path is a candidate query (e.g., 'enterprise search', 'enter price sea rich'). Find the k top-weighted full paths, where the weights are: 1. correction score (node weight); 2. merge penalty (node weight); 3. split penalty (edge weight). FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
  • 49. Graph-based Spelling Correction (Bao ACL 11). The path weight alone doesn't consider term correlations, so a score that includes term correlations is calculated for each path; this ensures the cleaned query has good-quality results, e.g., correlation('enterprise search') > correlation('enterprise arc'). Correlations are computed from the number of co-occurrences. Finally, return the paths with high scores. FRONTEND AMBIGUITY QUERY CLEANING UNSTRUCTURED DATA
  • 50. XClean (Lu ICDE 11): based on the noisy channel model, which finds the intended word given the user's input word. Results on XML are subtrees rooted at entity nodes; a result quality score is calculated for each entity node and then aggregated: e.g., if Johnny and Mike work in the same department, then 'Johnn, Mike' is corrected to 'Johnny, Mike' rather than 'John, Mike'. Processes each word individually, i.e., no merge or split. Query cleaning on relational data: Pu VLDB 08. FRONTEND AMBIGUITY QUERY CLEANING STRUCTURED DATA
  • 51. Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
  • 52. Query Autocompletion: Problem Space Dimensions: showing keywords vs. showing results; single keyword vs. multiple keywords; exact matching vs. fuzzy matching. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
  • 53. Error-Tolerating Autocompletion (Chaudhuri SIGMOD 09): fuzzy matching of a single keyword, showing keywords; e.g., 'desr' -> desert, dessert, deserve. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
  • 54. Error-Tolerating Autocompletion (Chaudhuri SIGMOD 09): the data contains 'search', 'sand', and 'text'; max edit distance = 1. A trie over the data is traversed as the user types (no input, 's', 'se', 'sen', ...), maintaining the set of trie nodes still reachable within the edit-distance budget. Showing results instead of keywords can be achieved by associating inverted lists with trie nodes. A brute-force sketch of the matching semantics follows below. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
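A brute-force sketch of the matching semantics only (the paper maintains this incrementally over a trie instead of rescanning the dictionary on every keystroke): suggest every word that has some prefix within edit distance k of what the user has typed.

```python
def prefix_edit_distance(query: str, word: str) -> int:
    """Minimum edit distance between query and any prefix of word."""
    prev = list(range(len(word) + 1))        # distance from "" to each prefix
    for i, qc in enumerate(query, 1):
        cur = [i]
        for j, wc in enumerate(word, 1):
            cur.append(min(prev[j] + 1,                  # delete from query
                           cur[j - 1] + 1,               # insert into query
                           prev[j - 1] + (qc != wc)))    # substitute
        prev = cur
    return min(prev)                          # best over all prefixes of word

def complete(query: str, dictionary, k: int = 1):
    return [w for w in dictionary if prefix_edit_distance(query, w) <= k]

print(complete("desr", ["desert", "dessert", "deserve", "sand", "text"]))
# ['desert', 'dessert', 'deserve']
```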
  • 55. Tastier (Li VLDBJ 11): multiple keywords, showing results; e.g., 'have a nni' -> show results for 'have a nice day'. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
  • 56. Tastier (Li VLDBJ 11): trie-based (similar to the previous paper); trie leaf nodes are associated with inverted lists. To handle multiple keywords, each record/document is associated with a sorted list of the words in it (a forward list), e.g., 'have a nice day' -> [a, day, have, nice], so that a binary search can determine whether a string appears in the record/document as a prefix (see the sketch below). Why not a hash? Because we need to match prefixes, not whole words. Inverted-list intersections are computed incrementally, using a cache for improved efficiency. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
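A small sketch of the forward-list trick: because the list is sorted, one binary search finds the first word lexicographically >= the prefix, and that word starts with the prefix iff the record matches (a hash could only answer whole-word membership):

```python
from bisect import bisect_left

def record_matches_prefix(forward_list, prefix: str) -> bool:
    """forward_list: the record's sorted word list."""
    i = bisect_left(forward_list, prefix)    # first word >= prefix
    return i < len(forward_list) and forward_list[i].startswith(prefix)

doc = sorted("have a nice day".split())      # ['a', 'day', 'have', 'nice']
print(record_matches_prefix(doc, "ni"))      # True: 'nice' starts with 'ni'
print(record_matches_prefix(doc, "no"))      # False
```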
  • 57. Phrase Prediction (Nandi VLDB 07): multiple keywords, showing keywords; e.g., 'a nice' -> 'have a nice day'. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
  • 58. Phrase Prediction (Nandi VLDB 07): suggest phrases given the user's input phrase. Need to find a good length for a suggested phrase: too short, and its utility is small; too long, and it has a low chance of being accepted ('significant phrases'). Based on a (modified) suffix tree in which each node is a word rather than a letter. Why not a trie: phrases have no definitive starting point; a phrase may start in the middle of a sentence (i.e., at a suffix of the sentence), hence a suffix tree. FRONTEND AMBIGUITY QUERY AUTOCOMPLETION
  • 59. Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
  • 60. Query Refinement. Motivation: some under-specified queries on a large data corpus have too many results, and ranking cannot always be perfect. Approaches: identifying important terms in results (structured/unstructured); clustering results (structured/unstructured); faceted search (structured). FRONTEND AMBIGUITY QUERY REFINEMENT
  • 61. Using Clustered Results (Liu PVLDB 11): for the query 'Java', all suggested queries are about the programming language; it is desirable to refine an ambiguous query by its distinct meanings. FRONTEND AMBIGUITY QUERY REFINEMENT
  • 62. Using Clustered Results (Liu PVLDB 11). Input: clustered results (the clustering method is irrelevant); e.g., the results of 'Java' may form 3 clusters corresponding to the Java language, Java island, and Java tea. Output: one refined query for each cluster. Each refined query maximally retrieves the results in its cluster (recall) and minimally retrieves the results not in its cluster (precision). FRONTEND AMBIGUITY QUERY REFINEMENT
  • 63. Using Important Terms in Results (Tao EDBT 09): for relational data only. Given a keyword query, output the top-k most frequent non-keyword terms in the results, without generating the results. Avoiding result generation is possible because the terms are ranked only by frequency: a tradeoff of quality against efficiency. Related: Data Clouds (for structured data), Koutrika EDBT 09 (more sophisticated term ranking, but needs to generate the query results first). FRONTEND AMBIGUITY QUERY REFINEMENT
  • 64. Faceted Search, e.g., all -> location: Sunnyvale, CA / Phoenix, AZ / Amherst, MA; department: data management / machine learning. Challenges: 1. How to select facets and facet conditions at each level, to minimize the user's expected navigation cost? 2. How to rank facets and facet conditions? (Chakrabarti SIGMOD 04; Kashyap CIKM 10.) FRONTEND AMBIGUITY QUERY REFINEMENT
  • 65. Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
  • 66. Query Rewriting. Motivation: synonyms and alternative names ('green card' vs. 'permanent residency'); too-specific queries ('MS Office 2007 for Mac x64 edition'); non-quantitative queries ('small laptop'). Approaches: using query/click logs; finding rewriting rules from missing results (e.g., replace 'green card' with 'permanent residency'); using differential queries. FRONTEND AMBIGUITY QUERY REWRITING
  • 67. Using Query and Click Logs (Cheng ICDE 10). The availability of query and click logs can be used to assess ground truth. Idea: find and return historical queries whose ground truth (via the click log) significantly overlaps with the top-k results of Q; such queries act as synonyms, hypernyms, or hyponyms of Q. Examples: for Q = 'query', 'search' is a synonym; for Q = 'MySQL', 'database' is a hypernym; for Q = 'database', 'MySQL' is a hyponym. FRONTEND AMBIGUITY QUERY REWRITING
  • 68. Automatic Suggestion of Rewriting Rules from Missing Results (Bao SIGIR 12). Example: for the query 'green card', result d is missing or should be ranked higher, and d contains the phrase 'permanent residency'; suggested rewriting rule: 'green card' -> 'permanent residency'. Challenges for automatically generating rewriting rules: rules should be semantically natural, and a new rule designed for one query may eliminate good results of another query. FRONTEND AMBIGUITY QUERY REWRITING
  • 69. Automatic Suggestion of Rewriting Rules from Missing Results (Bao SIGIR 12). Input: query q and missed desirable results d. Output: a selected set of rules. Generate candidate rules L -> R, where L is an n-gram in q and R is an n-gram in high-quality fields of d (e.g., 'green card' -> 'permanent residency' is natural; 'green card' -> 'federal government' is not). Identify semantically natural rules by machine learning, then greedily select a subset of rules that maximizes the overall query quality. FRONTEND AMBIGUITY QUERY REWRITING
  • 70. Keyword++ (Entity Databases) (Xin PVLDB 10). Idea: to understand a term in a query such as 'small IBM laptop', compare two queries that differ only on that term and analyze the differences in the attribute-value distributions of their results; e.g., to understand the term 'IBM', compare the results of 'IBM laptop' vs. 'laptop' (example rows: ThinkPad E545, Lenovo, 15", 'The IBM laptop... small business'; ThinkPad X240, Lenovo, 12", 'This notebook...'). FRONTEND AMBIGUITY QUERY REWRITING
  • 71. Keyword++ (Xin PVLDB 10), continued. Suppose 'IBM laptop' has 50 results, 30 with brand: Lenovo, while 'laptop' has 500 results, only 50 with brand: Lenovo. The difference on brand: Lenovo is significant, reflecting the meaning of 'IBM' -> brand: Lenovo; likewise, 'small' -> order by size ASC. Offline: compute the best mapping for all terms in the query log. Online: compute the best segmentation of the query (dynamic programming). A toy sketch of the distribution comparison follows below. FRONTEND AMBIGUITY QUERY REWRITING
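A toy sketch of the differential-query comparison using the slide's numbers; the lift measure below is a plain probability difference, simpler than the paper's actual scoring:

```python
from collections import Counter

def best_attribute_value(results_with, results_without, attr):
    """Attribute value whose share rises most when the extra term is added."""
    dist_with = Counter(r[attr] for r in results_with)
    dist_base = Counter(r[attr] for r in results_without)
    lift = {v: dist_with[v] / len(results_with)
               - dist_base.get(v, 0) / len(results_without)
            for v in dist_with}
    return max(lift, key=lift.get)

# 'IBM laptop': 50 results, 30 of them brand Lenovo; 'laptop': 500 results, 50 Lenovo
ibm_laptop = [{"brand": "Lenovo"}] * 30 + [{"brand": "Other"}] * 20
laptop = [{"brand": "Lenovo"}] * 50 + [{"brand": "Other"}] * 450
print(best_attribute_value(ibm_laptop, laptop, "brand"))  # Lenovo
```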
  • 72. Summary of Solutions. Query cleaning: correct various types of spelling errors. Query autocompletion: prevent spelling errors. Query refinement: make queries more specific, returning fewer results. Query rewriting: make queries more general / on-topic, returning more relevant results. Query forms: enable users to specify precise queries. FRONTEND AMBIGUITY
  • 73. Query Forms: enabling users to issue precise structured queries without mastering structured query languages. Challenges: offline, how many query forms, and which ones, should be generated? Too many: hard to find the relevant forms; too few: limited query expressiveness. Online: how to identify the query forms relevant to the user's search needs? (Baid SIGMOD 09; Jayapandian PVLDB 08; Ramesh PVLDB 11; Tang TKDE 13.) FRONTEND AMBIGUITY QUERY FORMS
  • 74. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 75. 2. Ranking: method categories are the vector space model, proximity-based ranking, and authority-based ranking. Vector space model: unstructured data: represent queries and documents as vectors, where each component is a term and its value is the term's weight; ranking score = similarity(query vector, result vector), as in the sketch below. Structured data: a document is a node or a result (subgraph/subtree). FRONTEND RANKING
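A compact sketch of vector-space ranking with TF-IDF weights and cosine similarity; the two-document corpus, the weighting variant, and the smoothing are illustrative choices:

```python
import math
from collections import Counter

docs = {
    "d1": "enterprise search engine for enterprise data".split(),
    "d2": "internet search engine".split(),
}
N = len(docs)
df = Counter(t for terms in docs.values() for t in set(terms))  # document freq.

def tfidf(terms):
    """Weight rises with in-document frequency, falls with document frequency."""
    tf = Counter(t for t in terms if t in df)
    return {t: (1 + math.log(c)) * math.log(1 + N / df[t]) for t, c in tf.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = tfidf("enterprise search".split())
for doc_id, terms in docs.items():
    print(doc_id, round(cosine(q, tfidf(terms)), 3))  # d1 scores above d2
```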
  • 76. Proximity-based ranking. Unstructured data: proximity of the keyword matches in a document can boost its ranking. Structured data: weighted tree/graph size, total distance from the root to each leaf, semantic distance, etc. FRONTEND RANKING
  • 77. Authority-based ranking. Unstructured data: nodes linked to by many other important nodes are important. Structured data: authority may flow in both directions of an edge, and different types of edges in the data (e.g., entity-entity edges, entity-attribute edges) may be treated differently. FRONTEND RANKING
  • 78. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 79. 3. Representation. An enterprise corpus can be much more heterogeneous than a collection of documents or web pages, and different searches may have different types: retrieving a document, a figure, a tuple, a subgraph, analytical keyword queries, etc. Solutions: result diversification, result summarization, result differentiation. FRONTEND REPRESENTATION
  • 80. Result Diversification: essentially the same problem as query refinement (e.g., Java -> Java language, Java tea, Java island); the same techniques apply. FRONTEND REPRESENTATION DIVERSIFICATION
  • 81. Result Summarization. Unstructured data: lots of work on text summarization in the machine learning, natural language processing, and IR communities (surveys: Das, CMU 07 (unpublished); Nenkova, Mining Text Data 12). Structured data: size-l object summaries (relational); result snippets (XML). FRONTEND REPRESENTATION SUMMARIZATION
  • 82. Size-l Object Summary (Fakas PVLDB 11): for the query 'Mike', an unstructured result shows the first text window around 'Mike', whereas a structured result summarizes the tuples connected to Mike (papers, patents, conferences, related people such as John). FRONTEND REPRESENTATION SUMMARIZATION
  • 83. Size-l Object Summary (Fakas PVLDB 11). Each tuple has a static importance score (a similar idea to PageRank) and a run-time relevance score (distance to the result root, connectivity properties to the result root). Objective: find a connected snippet of the result that consists of l tuples and has the maximum score; solved by dynamic programming. Related: result snippets for XML, Liu TODS 10. FRONTEND REPRESENTATION SUMMARIZATION
  • 84. Result Differentiation. Example: for the query 'NEC Labs Open House', result 1 and result 2 are each a large table with many people / papers / posters; differentiating features include the event year (2000 vs. 2012) and the paper titles (OLAP, data mining vs. cloud, scalability, search). Result differentiation vs. comparing credit cards on a bank website: the latter works only with pre-defined features. FRONTEND REPRESENTATION DIFFERENTIATION
  • 85. 4. Expert Search. Goal: find an expert within the enterprise to solve a particular problem. Ways of judging an expert: documents in which a candidate and a topic co-occur; topics near a candidate in a document; problem solving / ticket routing history; the user's own knowledge of the topic (an expert should be more knowledgeable than the user); the social relationship between expert and user (problem solving is usually more effective if the expert has a close social relationship with the user); external corpora (many employees publish externally, e.g., papers and blogs). FRONTEND EXPERT SEARCH
  • 86. Classical Methods (survey: Balog CIKM 08). Candidate model: build a feature vector for each expert using various evidence; rank experts against the query using traditional retrieval models. Document model: first find documents related to the query, then locate experts in those documents; mimics the process a human takes. FRONTEND EXPERT SEARCH
  • 87. User-Oriented Model (Smirnova ECIR 11). Users prefer experts who are more knowledgeable than themselves (knowledge gain: p(e|q) - p(u|q), where e is the expert and u is the user) and who have a close social relationship with themselves (time-to-contact: shortest path, e.g., through the department head in the organizational graph). FRONTEND EXPERT SEARCH
  • 88. Using a Web Search Engine (Santos Inf. Process. Manage. 11). Given query q, search the intranet corpus; in parallel, formulate a web query from the candidate's full name (e.g., Jeff Smisek), the organization's name (e.g., IBM), and the terms in q (e.g., data integration), excluding results from the organization itself (-site:ibm.com); then combine the candidates from both sources. FRONTEND EXPERT SEARCH
  • 89. Ticket Routing (Shao KDD 08). Example: a new ticket 'DB2 login failure' is transferred from group A to group B to group C, where it is resolved. How to find the best group and reduce problem-solving time? A Markov chain model using only the previous routing history (not the ticket content). FRONTEND EXPERT SEARCH
  • 90. Ticket Routing (Shao KDD 08). Pr(g|S): the probability of routing a ticket to group g given the previous groups S; it includes the probability that g can solve the ticket and the probability that g can correctly re-route it. Train the Markov chain model from the ticket routing history; a first-order sketch follows below. FRONTEND EXPERT SEARCH
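A first-order simplification of the idea (the paper conditions on the set of previously visited groups and separates solve and re-route probabilities; here the next group depends only on the current one):

```python
from collections import defaultdict

def train_transitions(routing_history):
    """Estimate group-to-group transfer probabilities from resolved tickets."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in routing_history:          # e.g. ['A', 'B', 'C']
        for src, dst in zip(sequence, sequence[1:]):
            counts[src][dst] += 1
    return {src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
            for src, dsts in counts.items()}

history = [["A", "B", "C"], ["A", "C"], ["B", "C"], ["A", "B", "C"]]
P = train_transitions(history)
current = "A"
print(max(P[current], key=P[current].get))    # most promising next group: B
```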
  • 91. Serving User Queries at Front End (52) 1. Ambiguity (29) 2. Ranking (3) 3. Representation (6) 4. Expert Search (6) 5. Privacy (8)
  • 92. 5. Privacy. User privacy: it is sometimes desirable that the search engine doesn't know which documents a user wants to retrieve; for users, this is privacy; for enterprises, it avoids liability. Data privacy: while a search engine answers individual keyword searches, there are methods that perform multiple searches and, from the answers, piece together aggregate information about the underlying corpus; enterprises may not want to disclose such information to all users.
  • 93. User Privacy. Private Information Retrieval (PIR): an old topic with tons of theoretical papers. Modifying the search engine: e.g., forcing it to forget user activities. Light-weight solutions requiring no change to the search engine: embellishing queries with decoy terms (Pang PVLDB 10); using ghost queries to obfuscate user intention (Pang ICDE 12).
  • 94. Private Information Retrieval (PIR). Idea: retrieve more documents than needed (naive: retrieve the entire corpus). How to minimize the number of retrieved-but-unneeded documents? Tons of theoretical papers on variations of the problem, e.g., different computational power of the search engine, different numbers of non-communicating corpus replicas. Survey: Gasarch EATCS Bulletin 2004.
  • 95. Ghost Queries (Pang ICDE 12). Flow: generate ghost queries from the user query, submit all of them to the search engine, and discard the ghost-query results. Challenges: generate ghost queries on topics different from the user's topics of interest, making it difficult for the search engine to infer those topics; ghost queries need to be meaningful/realistic so that they cannot be easily identified.
  • 96. Ghost Queries (Pang ICDE 12). (e1, e2) privacy model: given a user query, if the probability of a (predefined) topic increases by more than e1, the ghost queries should reduce it to below e2. A ghost query must be coherent: all words in the ghost query should describe common or related topics. A randomized-algorithm-based solution.
  • 97. Data Privacy (enterprises may not want aggregate information about the corpus, assembled from many searches, disclosed to all users). Solutions: inserting dummy tuples or randomly generating attribute values (only applicable to structured data); disallowing certain queries or returning only snippets (search quality loss); altering a small number of results: adding dummy results, modifying results, hiding some results (Zhang SIGMOD 12). FRONTEND PRIVACY
  • 98. Aggregate Suppression (Zhang SIGMOD 12). Example: corpus A has n documents, corpus B has 2n. Goal: suppress COUNT(*), i.e., an adversary cannot tell which corpus is larger. Naive approach 1: deterministically remove n documents from B; achieves the goal, but with search utility loss (those n documents can never be retrieved). Naive approach 2: randomly drop half of the results at run time; no search utility loss, but fails to achieve the goal (a clever adversary can still get the information). FRONTEND PRIVACY
  • 99. Aggregate Suppression (Zhang SIGMOD 12). Algorithm ideas: carefully adjust the query degree (number of documents matched by a query) and the document degree (number of queries matching a document) by hiding documents at run time; decline a query if its result can be covered by a small number of previous queries, returning the previous query results instead. FRONTEND PRIVACY
  • 100. Tutorial Outline (recap): data sources feed the backend (collect data, analyze data, store and index data), which builds the index; the frontend (interpret user query, search index, present results, interact with user) serves queries over it; the admin oversees system performance and search quality control/improvement.
  • 101. Enterprise Search Administrators. Main responsibilities: care and feeding of an enterprise search solution; monitoring intranet help inboxes and responding to requests; assisting in troubleshooting intranet issues for content contributors. Core skills required: understanding of general corporate business processes; experience in coordinating activities and managing relationships with employees, content administrators, stakeholders, IT teams, and external agencies. Key observation: search administrators are not IR experts. ADMIN OVERVIEW
  • 102. What Does a Search Administrator Need? Enterprise users report things like 'Bad results for query X', 'I'm missing the golden URL', and 'Result 22 should be ranked much higher!', while query logs show, e.g., that the query 'global campus' seems unsatisfying. The administrator needs to: understand overall search quality (overall trend, YOY change, by segmentation); understand individual search results (why a certain result is or isn't brought back, and its ranking); and maintain search quality as the underlying data evolves (terminology changes, policy/business process changes, organization changes, hot topics). ADMIN OVERVIEW
  • 103. Understand Search Quality (screenshot: Google Search analytics). ADMIN EXAMPLES
  • 104. Understand Search Quality (screenshot: Google Search analytics). ADMIN EXAMPLES
  • 105. What Does a Search Administrator Need? (recap): understand overall search quality; understand individual search results; maintain search quality as data, terminology, policies, organizations, and hot topics evolve. ADMIN EXAMPLES
  • 106. Gumshoe Search Quality Toolkit (Bao CIKM 12). ADMIN EXAMPLES
  • 107. Gumshoe Search Quality Toolkit (Bao CIKM 12): understand an individual query. ADMIN EXAMPLES
  • 108. Gumshoe Search Quality Toolkit (Bao CIKM 12): examine search results. ADMIN EXAMPLES
  • 109. Gumshoe Search Quality Toolkit (Bao CIKM 12): understand why a result is returned. ADMIN EXAMPLES
  • 110. Gumshoe Search Quality Toolkit (Bao CIKM 12): understand the ranking of the result. ADMIN EXAMPLES
  • 111. Gumshoe Search Quality Toolkit (Bao CIKM 12): investigate a desired result. ADMIN EXAMPLES
  • 112. Gumshoe Search Quality Toolkit (Bao CIKM 12): suggest rewrite rules. ADMIN EXAMPLES
  • 113. Gumshoe Search Quality Toolkit (Bao CIKM 12): edit runtime rules. ADMIN EXAMPLES
  • 114. Enterprise Search in the Big Data Era Case Study: IBM Intranet Search
  • 115. Experience with IBM Internal Search. IBM deployed a commercially available search engine implementing standard IR techniques. Search quality went down over time to the point that search results were unacceptable: success (>= 1 relevant result) was 14% in the top-1, 23% in the top-5, and 34% in the top-50! [Zhu et al., WWW07]. So various solutions were implemented, but to the administrators managing the engine, the exposed control knobs were insufficient. CASE STUDY BACKGROUND
  • 116. Attempts to Improve Search. Enhanced link analysis incorporating links to/from the external WWW: didn't help. Creative hacks: fake terms added to documents & queries, with the number of terms per document determined by popularity (how much TF increase is required for the needed rank boost?): quality went down! Hard-coded custom results for the top 1200+ queries: a maintenance nightmare (the heuristic needs updating upon each nontrivial change in term statistics or ranking parameters), and an even bigger nightmare: how to deal with continuously changing terminology? CASE STUDY BACKGROUND
  • 117. Goals of Gumshoe. Continually changing terminology with domain-specific meaning: product names change ('Network Station Manager' -> 'Thin Client Manager'); the query 'Paula Summa' should bring up Paula Summa from the employee directory; 'popcorn' means a conference call! Domain-specific repetitions: the query 'per diem' returns Result 1: IBM Travel: Per Diem; Result 2: IBM Travel: Per Diem Rates; Result 3: IBM Travel: National perdiems; ... Result 25: IBM Travel: Per Diem Policy. Gumshoe: a generic search solution, customizable & maintainable in many domains; simple customization with reasonable effort; ongoing search-quality management. Philosophy: programmable search. CASE STUDY BACKGROUND
  • 118. Programmable Search: Main Idea. Goals: transparency (know precisely why every result item is being brought back; understand how changes in content/intents affect search) and maintainability and debuggability (ranking logic is guided by explicit rules; properly react to changes in content/intents). Building blocks: deep analytics on documents (backend analytics), domain-specific analysis of queries (interpretations), and transparent, customizable rule-driven ranking (runtime rules). CASE STUDY BACKGROUND
  • 119. Implementation Architecture. Backend: a distributed analytics platform (IBM InfoSphere BigInsights) performs crawling, information extraction, token generation (TG), and indexing. Frontend: a search runtime over the index, plus index and rule update services. CASE STUDY BACKGROUND
  • 120. Backend Analytics: 3 Parts: local analysis (per-page analysis), global analysis (cross-page analysis), and token generation (TG), feeding the index. CASE STUDY BACKGROUND
  • 121. Local Analysis. Categorizing pages: label pages by custom categories (IBM examples: HR, person, IT help, ISSI, sales information, marketing, corporate standards, legal & IP-law, ...); geo classification: associate documents with the relevant countries & regions, since simply knowing where a page is physically hosted is not enough (example: the Czech Republic hosts all pages for IBM in Europe). Annotating pages: identify HomePage annotations for people, projects, communities, ... CASE STUDY BACKEND LOCAL ANALYSIS
  • 122. Declarative IE System. Declarative approach: define an operator for each basic operation (input: a tuple of annotations; output: tuples of annotations); compose operators into complex extractors via algebraic expressions; one document at a time, so parallelism is trivial. Benefits of the declarative approach: expressivity (richer, cleaner rule semantics) and performance (better performance through optimization). CASE STUDY BACKEND LOCAL ANALYSIS
  • 123. SystemT Overview. Extractors are written as AQL rules, e.g.: create view Company as select ... from ... where ...; create view SentimentFeatures as select ... from ...; create view SentimentForCompany as select T.entity, T.polarity from classifyPolarity(SentimentFeatures) T; (with embedded machine learning models such as classifyPolarity). A cost-based optimizer compiles AQL to the highly embeddable SystemT runtime, which maps input documents to extracted objects and is embedded in IBM engines (InfoSphere BigInsights, InfoSphere Streams, UIMA). CASE STUDY BACKEND LOCAL ANALYSIS
  • 124. Homepage Identification on an intranet page [Zhu et al., WWW07]: title extraction by matching title patterns -> titles; dictionary match against the employee directory -> 'Home Page for G J Chaitin' from the title 'G J Chaitin Home Page'; URL extraction by matching URL patterns -> homepages for 'idp', 'isc', 'chis' from http://w3.ibm.com/hr/idp/, http://w3-03.ibm.com/isc/index.html, http://chis.at.ibm.com/; and many more. CASE STUDY BACKEND LOCAL ANALYSIS
  • 125. Role of Global Analysis: among the 38 pages with the exact same title, which is the best for 'Paula Summa'? CASE STUDY BACKEND GLOBAL ANALYSIS
  • 126. Token Generation (TG): annotated values -> index content. Person 'Ching-Tien T. (Howard) Ho': personNameTG -> Ho Ching-Tien, Tien Ho, Ho, Tien, Howard Ho, Ching-Tien H., ...; nGramTG -> Howard, Ho, Ching, Tien, ... Title 'Global Technology Services': acronymTG -> gts; nGramTG -> Global Technology Services, Global Technology, Technology Services, ...; spaceTG -> GlobalTechnologyServices. CASE STUDY BACKEND TOKEN GENERATION
  • 127. 3 Phases of Runtime Flow. Phase 1: Query Semantics (rewrite rules, query interpretation). Phase 2: Relevance Ranking (by relevance buckets + conventional IR). Phase 3: Result Construction (grouping rules, re-ranking rules). CASE STUDY FRONTEND
  • 128. Runtime Flow in More Detail: query -> (rewrite rules) queries -> (query interpretation) interpretations -> partially ordered interpretations -> interpretation execution -> partially ordered results -> result aggregation -> ordered results -> (grouping rules) ordered & grouped results -> (re-ranking rules) final results. CASE STUDY FRONTEND
  • 129. Runtime Rules: Pattern-Action Language (Fagin 2012). A rule is a query pattern (a pattern expression matched against the keyword query) plus an action to perform on a match. Examples: EQUALS [r=ibm|information|info] [d=COUNTRY] matches 'ibm germany', 'info india'; action: rewrite into '[country] hr' (e.g., 'germany hr'). ENDS_WITH 'installation' matches 'acrobat installation', 'db2 on aix installation'; action: replace 'installation' with 'ISSI' (e.g., 'acrobat ISSI'). CONTAINS 'directions to [d=SITE]' matches 'driving directions to almaden', 'directions to watson from jfk'; action: rank pages of the siteserv category higher. STARTS_WITH [d=PERSON] matches 'john kelly biography', 'steve mills announcement'; action: group together pages that represent blog entries. Similar to the query-template rules of Agarwal et al. [WWW 2010]. A toy matcher follows below. CASE STUDY FRONTEND QUERY SEMANTICS
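A toy matcher for patterns in this spirit, loosely modeled on the slide's examples; the dictionaries and the expansion of [r=...]/[d=...] slots into regular expressions are illustrative, not the actual Fagin 2012 language:

```python
import re

DICTS = {"COUNTRY": {"germany", "india"}, "SITE": {"almaden", "watson"}}

def compile_pattern(pattern: str):
    """Expand [r=a|b] literal alternations and [d=NAME] dictionary slots
    into one regular expression over whitespace-separated query tokens."""
    parts = []
    for tok in pattern.split():
        m = re.fullmatch(r"\[([rd])=(.+)\]", tok)
        if not m:
            parts.append(re.escape(tok))
        elif m.group(1) == "r":                       # [r=ibm|info] alternation
            parts.append("(?:%s)" % m.group(2))
        else:                                         # [d=COUNTRY] dictionary
            parts.append("(?:%s)" % "|".join(sorted(DICTS[m.group(2)])))
    return re.compile(r"\s+".join(parts))

rule = compile_pattern("[r=ibm|information|info] [d=COUNTRY]")
for q in ("ibm germany", "info india", "ibm benefits"):
    print(q, "->", bool(rule.fullmatch(q)))   # EQUALS = fullmatch; CONTAINS
                                              # would use rule.search instead
```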
  • 130. What's Best for 'benefits'? CASE STUDY FRONTEND QUERY SEMANTICS
  • 131. What's Best for 'benefits'? The most important IBM page for benefits changes over time: currently it is netbenefits. CASE STUDY FRONTEND QUERY SEMANTICS
  • 132. Rewrite Rules: benefits -> netbenefits. CASE STUDY FRONTEND QUERY SEMANTICS
  • 133. Rewrite Rules: the query 'benefits' is rewritten (benefits -> netbenefits), producing the queries {benefits, netbenefits}, which then flow through the standard pipeline: interpretations -> partially ordered interpretations -> interpretation execution -> partially ordered results -> result aggregation -> ordered results -> grouping rules -> ordered & grouped results -> re-ranking rules -> final results. CASE STUDY FRONTEND QUERY SEMANTICS
  • 134. Complex Rules: the query 'java' also brings back people with the first name Jim ('java jim'); how can we avoid pages from the person category? CASE STUDY FRONTEND QUERY SEMANTICS
  • 135. Complex Rules: 'java' -> 'jim' and not in the person category. CASE STUDY FRONTEND QUERY SEMANTICS
  • 136. Complex Rules: the rule ('java' -> 'jim' and not in the person category) applied within the runtime pipeline for the query 'java'. CASE STUDY FRONTEND QUERY SEMANTICS
  • 137. Interpretations. Scenario: an IBM employee wants to download Lotus Symphony 1.3. Runtime interpretation: 'download symphony 1.3' -> category=issi, software='symphony 1.3'. CASE STUDY FRONTEND QUERY SEMANTICS
  • 138. Complex Rules: the rule ('java' -> 'jim' and not in the person category) applied within the runtime pipeline for the query 'java'. CASE STUDY FRONTEND QUERY SEMANTICS
  • 139. 3 Phases of Runtime Flow (recap): Phase 1: Query Semantics (rewrite rules, query interpretation); Phase 2: Relevance Ranking (by relevance buckets + conventional IR); Phase 3: Result Construction (grouping rules, re-ranking rules). CASE STUDY FRONTEND RELEVANCE RANKING
  • 140. Recall: Token Generation (TG). Annotated values -> index content, with each (annotation, TG) pair labeled: Person + personNameTG -> Ho Ching-Tien, Tien Ho, Ho, Tien, Howard Ho, Ching-Tien H., ...; Person + nGramTG -> Howard, Ho, Ching, Tien, ...; Title + acronymTG -> gts; Title + nGramTG -> Global Technology Services, Global Technology, Technology Services, ...; Title + spaceTG -> GlobalTechnologyServices. CASE STUDY FRONTEND RELEVANCE RANKING
  • 141. Annotation + TG -> Relevance Bucket. Index terms are grouped into relevance buckets such as Person + personNameTG, Person + nGramTG, Title + acronymTG, Title + spaceTG, Title + nGramTG. Buckets are ranked based on annotation type and on TG quality; a page can belong to multiple buckets; within each bucket, ranking is by conventional IR. CASE STUDY FRONTEND RELEVANCE RANKING
  • 142. Ranking by Relevance Buckets: example query 'employment verification' flowing through the runtime pipeline (rewrite rules -> queries -> interpretations -> interpretation execution -> result aggregation -> ordered results -> grouping and re-ranking rules -> final results). CASE STUDY FRONTEND RELEVANCE RANKING
  • 143. 3 Phases of Runtime Flow (recap): Phase 1: Query Semantics; Phase 2: Relevance Ranking; Phase 3: Result Construction (grouping rules, re-ranking rules). CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 144. Grouping Rules define how search results should be grouped together; search administrators can improve the diversity of search results on the first page, based on their familiarity with the data sources. Examples (query pattern -> group pages of the same category): 'per diem' -> travel, you-and-ibm; ANY -> ISSI, IT Help Central, Forum, Bluepedia, Media Library, ... CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 145. Flooding with Similar Pages: the first page is flooded with near-identical results, so first-page diversity is needed. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 146. Grouping Rule to the Rescue: 'per diem' -> travel, you-and-ibm. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 147. Grouping Rule to the Rescue: the query 'per diem' flows through the runtime pipeline, with the grouping rule ('per diem' -> travel, you-and-ibm) applied at the grouping step. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 148. Re-ranking Rules adjust the ranking of search results based on categories. Example: the search administrator specifies the important sources of hot/current topics: for hot topics (smarter planet, cloud computing, centennial, ...), rank the categories Bluepedia, News, About-IBM higher. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 149. Re-ranking Rule for Hot Topics: for hot topics (smarter planet, cloud computing, centennial, ...), rank Bluepedia, technical news, and homepages of About IBM higher. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 150. Re-ranking Rules for Person Queries: [d=PERSON] -> executive_corner, media_library, organization_chart, files. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 151. Grouping Rule to the Rescue (recap): the query 'per diem' with the grouping rule ('per diem' -> travel, you-and-ibm) applied in the runtime pipeline. CASE STUDY FRONTEND RESULT CONSTRUCTION
  • 152. 3 Phases of Runtime Flow (recap): Phase 1: Query Semantics (rewrite rules, query interpretation); Phase 2: Relevance Ranking (by relevance buckets + conventional IR); Phase 3: Result Construction (grouping rules, re-ranking rules). CASE STUDY FRONTEND
  • 153. What Administrators Need: search administrators have major problems with an opaque search engine. Programmable search provides customization to the specific domain and ongoing search-quality management, and it allows the building of a search quality toolkit. RECAP CASE STUDY ADMIN
  • 154. Gumshoe Search Quality Toolkit! CASE STUDY ADMIN
  • 155. Demo
  • 156. Demo. CASE STUDY ADMIN
  • 157. The Proof of the Pudding Is in the Eating. Immediate positive impact within the first 3 months: natural clickthrough rate improved by 100%+; the top 5 results are selected about 90% of the time. Sustained search quality improvements in the 4 years since going live: a stable natural search clickthrough rate (Gumshoe, Aug. 2011 - Oct. 2011, vs. the old intranet search, Aug. 2010 - Aug. 2011). CASE STUDY RESULTS
  • 158. Summary. Programmable search: simple & flexible customization; search quality management. Backend analytics: local analysis (per-page analysis), global analysis (cross-page analysis), token generation (TG) [Fagin et al., PODS10, PODS11]. Tooling: search provenance, rule suggestion, utilization of relevance buckets [Li et al., SIGIR06, Zhu et al., WWW07]. Runtime: Phase 1: Query Semantics (rewrite rules, query interpretation); Phase 2: Relevance Ranking (by relevance buckets + conventional IR); Phase 3: Result Construction (grouping rules, re-ranking rules) [Bao et al., ACL2010, SIGIR2012, CIKM2012]. CASE STUDY SUMMARY
  • 159. Enterprise Search in the Big Data Era Future Directions
  • 160. Search Engine Components (recap): the backend (collect data, analyze data, store and index data) builds the index from the data sources; the admin handles system performance and search quality control/improvement; the frontend interprets the user query, searches the index, presents results, and interacts with the user.
  • 161. Future Directions: Data Heterogeneity. Observations: a rich variety of data types needs to be searched in enterprises: docs, databases, images, videos, social graphs, etc. Questions: how to automatically identify relevant data types, and how to search and rank across different data types? E.g., for image search, should image recognition techniques be incorporated into enterprise search engines? If so, how?
  • 162. Future Directions: Data Freshness. Observations: new data is continuously collected and published in enterprises, potentially at a very fast rate; web search engines are not required to index new websites quickly, but in enterprises, new content may need to be searchable asap. Questions: how to build efficient real-time indexes that ensure data freshness in enterprise search?
  • 163. Future Directions: Search Context. Observations: enterprise search users have richer profiles than web users: activities, bio, position, projects, experiences, etc. Questions: how to utilize users' contexts to provide customized results? Is it possible to predict the information a user may want, and push it to the user?
  • 164. Future Directions: User Preference. Observations: different users in an enterprise have different expertise, and may prefer different ways to express queries; e.g., some users prefer pure keyword search, while others may want lightly-structured queries. Questions: how to effectively satisfy the different query-expression needs of different users?
  • 165. Future Directions: Question Answering. Observations: the purpose of many enterprise searches is to find answers to questions, e.g., what was the previous name of a product, and when did we change to the current name? Questions: is it possible to effectively use natural language processing techniques and domain knowledge to automatically answer natural language questions?
  • 166. Future Directions: Transactional Search. Observations: over 1/3 of enterprise search queries are transactional; it would be desirable for enterprise search engines to recommend business processes for accomplishing a given task, e.g., given a customer's lengthy complaint letter, find the departments relevant to the complaints. Questions: how to better support transactional search? How to initiate a business process based on the results of a search?
  • 167. Future Directions: Big Data Analytics. Observations: rich information and knowledge lie in big data, and many employees (not just data analysts) may benefit from the ability to perform analytics on the company's big data. Questions: how to build a low-cost, interactive platform that allows a large number of employees to issue analytical queries? How to give employees the ability to analyze big data if they have little knowledge of SQL or MapReduce programming?
  • 168. Future Directions: Tooling for Search Quality Maintenance. Observations: most enterprise search engines have to be manually evaluated and tuned by a search administrator with domain knowledge, in an ad-hoc fashion. Questions: can we automate this process, or at least minimize manual involvement? Can we fully utilize explicit user feedback? (Explicit user feedback is easier to obtain in enterprise search, and there is less spam.)
  • 169. Thanks. Acknowledgements: IBM Research: Sriram Raghavan, Fred Reiss, Shiv Vaithyanathan, Ron Fagin. IBM CIO's Office: Nicole Dri, Brian C. Meyer. LogicBlox: Benny Kimelfeld*. TripAdvisor: Adriano Crestani Campos*. Facebook: Zhuowei Bao*. NJIT: Yi Chen. UNSW: Wei Wang. (* work done while at IBM)