Google search vs Solr search for Enterprise search

04/18/2023 Presented by Veera Shekar G

Google Search vs Advanced Search (Enterprise Search Implementation)

11/05/2015

• How a normal search engine processes content.
• You will understand how a search engine works.
• I am a beginner at this subject.
• Top 5 requirements for an effective enterprise search implementation.
• Problems with implementations.

Introduction

• Topic 1: How a search engine works.
▫ We will look at the architecture and component details.

• Topic 2: Google Search.
▫ Phases of implementation and indexing architecture.

• Topic 3: Top 5 requirements for implementing enterprise search.
▫ Options available for implementation.

Session Outline

• A normal search engine architecture.
• Factors that determine the architecture of a search engine.
• The indexing process.

Topic 1: Objectives

• The architecture of a search engine can be viewed as two-layered.

Topic 1: Content – Normal Search engine Architecture

• The architecture of a search engine is determined by two requirements:
▫ Effectiveness (quality of results).
▫ Efficiency (response time and throughput).

Topic 1: Content - Factors

• Text acquisition: identifies and stores documents for indexing.
• Text transformation: transforms documents into index terms or features.
• Index creation: takes index terms and creates data structures (indexes) to support fast searching.
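The three indexing steps above can be sketched in a few lines of Python. This is a minimal illustration only, assuming a tiny in-memory document set, a simple lowercase tokenizer and an inverted index built from a dictionary of sets. The sample documents and the function names (transform, build_index, search) are hypothetical and are not taken from any particular search engine.

```python
import re
from collections import defaultdict

# Text acquisition: hypothetical sample documents standing in for crawled content.
DOCUMENTS = {
    1: "Solr indexes enterprise content",
    2: "Google crawls web content",
}


def transform(text):
    """Text transformation: lowercase the text and split it into index terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Index creation: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in transform(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term of the query."""
    terms = transform(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(DOCUMENTS)
print(sorted(search(index, "content")))       # documents containing "content"
print(sorted(search(index, "solr content")))  # documents containing both terms
```

The index is what makes searching fast: a query looks up its terms directly instead of scanning every document.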

Topic 1: Content

• A search engine has two main processes: the indexing process and the querying process.

•Questions?

Topic 1: Wrap-up

• High-level architecture of Google Search.
• Web crawlers.
• Technologies used.

Topic 2: Google Search

Topic 2: Content - High Level Architecture

• A web crawler is a program that, given one or more seed URLs, downloads the web pages associated with these URLs, extracts any hyperlinks contained in them, and recursively continues to download the web pages identified by these hyperlinks.

• Web crawlers are an important component of web search engines, where they are used to collect the corpus of web pages indexed by the search engine.
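The crawl loop described above can be sketched as follows. This is an illustration under stated assumptions, not how Google's crawler is built: the "web" is a hypothetical in-memory dictionary (FAKE_WEB) so the sketch runs without network access, the fetch function is passed in as a parameter, and the recursion is written as a breadth-first queue with a set of already-seen URLs.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical pages; a real crawler would download these over HTTP.
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}


def crawl(seeds, fetch, max_pages=100):
    """Crawl breadth-first from the seed URLs and return {url: html}."""
    corpus = {}
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(corpus) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be downloaded
            continue
        corpus[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


corpus = crawl(["http://example.test/"], FAKE_WEB.get)
print(sorted(corpus))
```

The seen set is what stops the crawler from downloading the same page again when pages link back to each other.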

Topic 2: Content - Web Crawlers

• Google visualizes its infrastructure as a three-layer stack:
▫ Products: search, advertising, email, maps, video, chat, Blogger.
▫ Distributed systems infrastructure: GFS, MapReduce, and BigTable.
▫ Computing platforms: a bunch of machines in a bunch of different data centers.

• Make it easy for people in the company to deploy at a low cost.
• Look at price-performance data on a per-application basis. Spend more money on hardware to avoid losing log data, but spend less on other types of data. Having said that, they don't lose data.

Topic 2: Content – Technologies Stack

• Google technology stack.
• Web crawlers.

Topic 2: Wrap-up

• Top 5 requirements for implementing enterprise search.
• Options available for each requirement.

Topic 3: Objectives

• Diverse Content: Ability to crawl, index and search diverse content repositories, for example the Web, Microsoft SQL databases and SharePoint content management systems.
• Secured Search: Ability to crawl secured content and make it accessible only to authorized people and/or groups, for example via single sign-on and forms-based authentication.
• User Interface: Ability to provide various user interface (UI) components to serve end users with precise results: guided navigation, related search terms, related articles and best bets, plus AutoSuggest with terms combined from real-time search and custom (user-configurable) terms in data stores.
• Desktop Search: Ability to integrate with content stored on the desktop.
• Social Search: Ability to find other people, ratings and expertise within the organization.

Topic 3: Content - Top 5 requirements for implementing Enterprise search

• Google Web crawler for crawling and indexing Web content (GOOTB).
• Google DB connector for crawling and indexing Microsoft SQL database (GOOTB).
• Google SharePoint connector for crawling and indexing SharePoint content (GOOTB).
• Google forms authentication for index-time authorization and serve-time authentication (GOOTB).
• Google front-end configuration for:
▫ Faceted search, aka guided navigation (limited OOTB).
▫ Related search terms (GOOTB).
▫ Related articles (GOOTB).
▫ Best bets (GOOTB).
▫ Autosuggest (GOOTB and custom application).
• Google desktop search component integration (external Google component).
• Google results integration with the internal rating system.

Topic 3: Content – Google implementing requirements

• Google Web Crawler.
• Disadvantage: As efficient and good as it sounds, one disadvantage of the Web crawler is Google’s inability to reveal the exact page that is currently being processed.

• Alternative: The OS console monitor and/or tracking log files are some ways that could help track URL crawl status.

• At any point in time, a developer should be able to view the current URL being crawled and any security issues encountered. Almost all tools provide this feature, such as Solr, FAST, Endeca and Autonomy.

Topic 3: Content – Web crawler

• Database Connector.
• Disadvantage: Google does not allow end implementers to schedule DB crawls, and diagnostics for connector/XML-fed content are poor. Google’s way of removing content from the index is quite primitive and time-consuming.

• Alternative: Compared to GSA, Apache Solr was found to be a better option for indexing the database via the Data Import Handler.

• Solr provides an effective way to remove content from the index, either via the admin console or via XML import (/update with delete option).
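As a sketch of the XML delete option mentioned above, the following Python builds the two kinds of delete message accepted by Solr's XML update format: delete by document id and delete by query. Only the message body is built here. In practice it would be posted to the /update handler of the relevant core and followed by a commit, and the URL depends on the deployment, so it is not shown. The document ids and the field name source are hypothetical.

```python
import xml.etree.ElementTree as ET


def delete_by_ids(ids):
    """Build a Solr XML delete message that removes documents by unique key."""
    root = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(root, "id").text = str(doc_id)
    return ET.tostring(root, encoding="unicode")


def delete_by_query(query):
    """Build a Solr XML delete message that removes every document matching a query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")


# Hypothetical ids and field name, for illustration only.
print(delete_by_ids(["doc-1", "doc-2"]))
print(delete_by_query("source:hr_db"))
```

Building the message with an XML library rather than string concatenation means that special characters in ids or queries are escaped correctly.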

Topic 3: Content – Database Connector

• Google provides connectors to very few CMS systems out of the box.
• Disadvantage: Even if Google executes a bulk late binding, performance issues at query time are inevitable when the document volume is high.

• Alternative: One alternative is to treat the site/page/document-level security as additional metadata and develop an application that post-filters the results based on end-user security attributes. This is again a primitive method and has its own disadvantages in terms of query-time latency.
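The post-filtering alternative can be sketched as below, assuming each search result carries its allowed groups as a metadata list and the application knows the groups of the logged-in user. The field name allowed_groups, the sample results and the function post_filter are hypothetical.

```python
def post_filter(results, user_groups):
    """Keep only the results whose allowed groups overlap the user's groups."""
    user_groups = set(user_groups)
    return [r for r in results if user_groups & set(r["allowed_groups"])]


# Hypothetical results as returned by the search engine, before security trimming.
results = [
    {"url": "/hr/policy.doc", "allowed_groups": ["hr"]},
    {"url": "/eng/design.doc", "allowed_groups": ["engineering", "hr"]},
    {"url": "/finance/q1.xls", "allowed_groups": ["finance"]},
]

visible = post_filter(results, ["engineering"])
print([r["url"] for r in visible])
```

Because the filter runs after retrieval, a page of results can come back shorter than requested, and the application may have to fetch further pages to fill it. That extra work is one source of the query-time latency noted above.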

Topic 3: Content – SharePoint Connector (for Document Management system)

• At query time, Google uses the query-time configuration to make a HEAD request so that the logged-in user (within a specific domain) sees only the content that he or she is authorized to view.
• Disadvantage: With this late binding security model, performance degradation is inevitable at higher QPS and/or higher result counts.

• Alternative: There are tools that support an early binding security model that allows the search engine to cache the user security groups along with the content.

Topic 3: Content – Forms Authentication

• One disadvantage of Apache Solr is that it does not handle secured content. The only way to serve secured content is to store the security tags/groups as a metadata field and implement a field-constrained (or metadata-constrained) search.

• That is where ACLs come into the picture.
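A field-constrained search of this kind can be sketched as a Solr filter query (the fq parameter) built from the user's groups. The sketch below only builds the request parameters. The field name acl_groups is a hypothetical schema choice, and how the user's groups are obtained is outside the sketch.

```python
from urllib.parse import urlencode


def quote_value(value):
    """Quote a value as a phrase, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def secured_params(user_query, user_groups, acl_field="acl_groups"):
    """Build Solr query parameters restricted to documents the user's groups may see."""
    if not user_groups:
        # Refuse to build an unrestricted query for a user with no groups.
        raise ValueError("user has no security groups")
    groups = " OR ".join(quote_value(g) for g in user_groups)
    return {"q": user_query, "fq": "%s:(%s)" % (acl_field, groups)}


params = secured_params("vacation policy", ["hr", "all-staff"])
print(params["fq"])
print(urlencode(params))  # query string to append to the select URL
```

Keeping the restriction in fq rather than in q leaves the user's own query untouched, so the security constraint does not affect relevance scoring.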

Note

• GSA provides an open source component called “search-as-you-type”, which allows end implementers to fetch real-time results from the appliance.

• Disadvantage: OneBox modules are designed to respond within one second. This could result in no results from TermFederator if there is any delay at the database.
• Alternative: “TermsComponent” in Apache Solr is an effective autosuggest tool. Terms stored in any local text file can be made available to Solr at startup. A separate component is designed to merge the terms alphabetically.
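The idea of merging real-time terms with custom terms alphabetically can be sketched as follows. This is not Solr's TermsComponent or the TermFederator itself, only an illustration of the merge step. It assumes both sources supply lowercase terms, and the function name suggest and the sample terms are hypothetical.

```python
import heapq


def suggest(prefix, realtime_terms, custom_terms, limit=5):
    """Merge two term lists alphabetically and return the terms matching a prefix."""
    prefix = prefix.lower()
    merged = heapq.merge(sorted(realtime_terms), sorted(custom_terms))
    suggestions = []
    for term in merged:
        # Duplicates are adjacent in the merged stream, so compare with the last one kept.
        if term.startswith(prefix) and (not suggestions or suggestions[-1] != term):
            suggestions.append(term)
            if len(suggestions) == limit:
                break
    return suggestions


realtime = ["search appliance", "security", "solr"]  # e.g. terms from the index
custom = ["sharepoint", "solr", "sso"]               # e.g. user-configured terms
print(suggest("so", realtime, custom))
print(suggest("s", realtime, custom, limit=3))
```

Since the custom terms are held locally in this sketch, a slow database cannot delay or empty the suggestion list, which is the weakness described for the OneBox approach.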

Topic 3: Content – Auto Suggest

• Best bets, aka KeyMatches, aka AdWords.
• Related search terms, the same as synonyms.
• Faceted search, aka guided navigation: GSA does not support faceted search, but this feature can be achieved via metadata-constrained search at query time, similar to how it is implemented in Solr.

• Disadvantage: Facet counts in GSA are not available OOTB.
• Alternative: Faceted search is one of Apache Solr’s strongest features and is implemented within many e-commerce websites. (Oracle) Endeca and (HP) Autonomy maintain a content hierarchy for guided navigation.
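To illustrate what facet counts and metadata-constrained search mean, here is a small sketch over in-memory documents. It does not use the GSA or Solr APIs. The documents, the field names brand and category, and the function names are hypothetical.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many documents carry each value of a metadata field."""
    return Counter(d[field] for d in docs if field in d)


def constrain(docs, field, value):
    """Metadata-constrained search: keep documents whose field equals the value."""
    return [d for d in docs if d.get(field) == value]


docs = [
    {"title": "Laptop A", "brand": "acme", "category": "laptops"},
    {"title": "Laptop B", "brand": "globex", "category": "laptops"},
    {"title": "Phone C", "brand": "acme", "category": "phones"},
]

print(facet_counts(docs, "brand"))     # counts shown next to each facet value
laptops = constrain(docs, "category", "laptops")
print(facet_counts(laptops, "brand"))  # counts after the user picks a facet
```

Guided navigation repeats these two steps: show the counts, let the user pick a value, constrain the results, and recompute the counts on what remains.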

Topic 3: Content – User Interface

• The InfoValuator component captures end-user ratings and saves a combination of user identity, content URI and rating value in the backend data store.
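A rating store of this shape could look like the sketch below. The slide states only which three values are saved, so the schema, the use of SQLite, and the rule that a user's new rating replaces the previous one are all assumptions made for illustration, not the actual InfoValuator implementation.

```python
import sqlite3


def open_store():
    """Create an in-memory store keyed by (user identity, content URI)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ratings ("
        " user_id TEXT NOT NULL,"
        " content_uri TEXT NOT NULL,"
        " rating INTEGER NOT NULL,"
        " PRIMARY KEY (user_id, content_uri))"
    )
    return conn


def save_rating(conn, user_id, content_uri, rating):
    """Save a rating; a second rating by the same user replaces the first."""
    conn.execute(
        "INSERT OR REPLACE INTO ratings (user_id, content_uri, rating)"
        " VALUES (?, ?, ?)",
        (user_id, content_uri, rating),
    )


def average_rating(conn, content_uri):
    """Return the average rating for a content URI, or None if it is unrated."""
    row = conn.execute(
        "SELECT AVG(rating) FROM ratings WHERE content_uri = ?", (content_uri,)
    ).fetchone()
    return row[0]


conn = open_store()
save_rating(conn, "alice", "/docs/policy.pdf", 4)
save_rating(conn, "bob", "/docs/policy.pdf", 2)
save_rating(conn, "alice", "/docs/policy.pdf", 5)  # alice changes her rating
print(average_rating(conn, "/docs/policy.pdf"))
```

An average per content URI is one value that could be fed back into the search results, which is the integration with the internal rating system listed among the requirements.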

Topic 3: Content – InfoValuator

• There is no single search engine that fulfills all enterprise search requirements. HP Autonomy claims this lofty perch, but it comes with a huge cost overhead, with the base cost crossing half a million dollars.

• Google is not the right fit for many of the requirements that we have seen so far. Custom search application development is inevitable, and if it is well planned, we can use basically any tool on the market to implement enterprise search as a full-fledged application.

Summary of Session
