Technical White paper - IntelliSearch ESP 2.0

March 2007

TECHNICAL WHITE PAPER

Enterprise Search Platform

one access – any file – any source

Table of Contents

Introduction
Platform architecture – Open, modular, scalable
    IntelliSearch ESP architecture
    Performance
Platform Search technology
Platform Administration tools
    Administration - Search Engine
    Administration – Reporting
Other technical issues
    End-user Interaction integrations
    Multi-tier Search Architecture
    Available file formats
System Requirements
    Software
    Hardware

Introduction

IntelliSearch's award-winning platform gives enterprises access to information in any file, on any file server, mail server, application or website. IntelliSearch ESP processes all forms of structured and unstructured information. The platform has user-friendly interfaces, with advanced search and monitoring techniques, to ensure quick and relevant search results. The platform enables distribution of information to a large number of channels such as web, email, SMS and more. Distribution is configurable on a regular or ad hoc basis.

IntelliSearch Enterprise Search Platform (ESP) is a standalone search solution that securely covers all enterprise sources and is easy to use and deploy. IntelliSearch deploys a combination of technologies to enable a contextual understanding of text, web pages, e-mails, documents and people's interests, including all formats on all platforms, and offers a unique solution to a growing number of applications and a host of platforms and devices that are increasingly dependent on utilizing unstructured information.

IntelliSearch ESP can power any application dependent on finding and analyzing unstructured information. IntelliSearch ESP is built to provide:

• Accuracy

• Speed and performance

• Scalability

• Security

• Language Independence

• Easy integration

• Support for any content format

IntelliSearch can power any application dependent upon unstructured information, including:

• Business Intelligence

• Content Publishing

• E-Commerce

• Electronic Customer Relationship Management

• ERP / Custom application

• Enterprise information portals

• Internet Portals

• Knowledge Management

• Online Publishing

This document describes the IntelliSearch ESP technical architecture.

Platform architecture – Open, modular, scalable

IntelliSearch ESP architecture

IntelliSearch ESP is a modular and scalable platform written in .NET. The platform can be set up to access any file of any format from any internal or external source. The search user interfaces can be integrated into any third-party application through web services. IntelliSearch provides indexing of the following sources:

• File servers

• Mail servers

• Portals

• Standard applications

• Custom applications

• Databases (ODBC)

• Metadata

• Disks

• Public websites

• Password protected websites

• External newsfeeds

It comes with its own user interface and administration tools. The platform is depicted below:

[Figure: The IntelliSearch platform. Content sources (files/documents, databases, Internet, media, custom applications) enter through the Content API (database connector, file converter, web crawler, custom connectors) and the file processing pipeline. Search, alert and filter sit at the core, together with tuning, security access, and management and application services. The query/result processing pipeline and the Query API deliver queries, results and alerts to vertical applications, portals and mobile devices. Tools and tool building framework: deployment, business application, administration.]

The IntelliSearch ESP consists of 12 main components. Each of these is described below:

Content APIs (connectors) represent a family of built-in connectors. Connectors are ready-made interfaces to third-party systems built on our generic content API. The connector family provides access to documents that reside in external proprietary systems and applications. Examples of built-in connectors are Windows NT file systems (NTFS), EMC Documentum Content Server, IBM Lotus Notes and Microsoft Exchange. All connectors are pre-configured (additional licensing may be required for some of the above connectors). IntelliSearch actively monitors the market for popular applications and aims to support all such third-party applications. In addition, our indexing interface allows customers and system integrators to develop their own connectors to proprietary systems that may exist within the organisation.

Query APIs provide a search interface to external applications and devices. The platform can interface with any application through Web Services/SOAP or HTTP POST, and has a specially built interface for mobile devices.

Web Crawler is a process activated on a set schedule. The crawler accesses any web page format, e.g. XML, HTML or WML. When activated, the crawler spawns a configurable number of processor threads that fetch documents from various data sources. Whenever the crawler encounters embedded, non-HTML documents during crawling, it uses filters to automatically detect the document type and to filter and index the document.
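As a rough illustration of the thread model described above, the following Python sketch runs a configurable number of worker threads that fetch documents and route non-markup content to a filter step. The function names (`fetch`, `detect_type`, `crawl`) and the in-memory `FAKE_WEB` data are hypothetical stand-ins chosen for this example; they are not part of the IntelliSearch API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory "web": URL -> (content type, body).
FAKE_WEB = {
    "http://example.com/index.html": ("text/html", "<html>Welcome</html>"),
    "http://example.com/report.pdf": ("application/pdf", "%PDF-1.4 report"),
    "http://example.com/feed.xml": ("text/xml", "<feed>news</feed>"),
}

def fetch(url):
    """Stand-in for a network fetch; returns (url, content_type, body)."""
    content_type, body = FAKE_WEB[url]
    return url, content_type, body

def detect_type(content_type):
    """Decide whether a document is markup or needs a file filter."""
    markup = {"text/html", "text/xml", "text/vnd.wap.wml"}
    return "markup" if content_type in markup else "filtered"

def crawl(urls, num_threads=4):
    """Fetch all URLs with a configurable number of worker threads."""
    index = {}
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for url, content_type, body in pool.map(fetch, urls):
            index[url] = {"route": detect_type(content_type), "body": body}
    return index

index = crawl(sorted(FAKE_WEB), num_threads=2)
```

Here `num_threads` plays the role of the configurable thread count; the PDF entry is routed to the filter step while the HTML and XML entries are treated as markup.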

File Converter is based on the Microsoft iFilter interface, which enables indexing of most popular file formats. Examples are pdf, xls, ppt, doc and jpeg. A complete list is provided in a separate section. IntelliSearch ESP also has a built-in OCR converter that enables OCR conversion on the fly.

File processor analyzes and indexes content to make it searchable. It converts and processes content through a pre-processing pipeline consisting of tokenization, spell checking, stemming, dictionaries, vectorization and custom dictionaries. How it works is described in a separate section.

Search User Interface is an out-of-the-box user interface. It also provides a web services API for building custom applications for querying indexed data, and contains interfaces for Basic Search Form, Advanced Search Form, Query Result Display, authentication and authorization, and so on.

Alert Manager is an out-of-the-box user interface enabling the end user to set personal alerts.

Filter is the security mechanism that returns only the results that the end user is allowed to see.

Result processing is the process responsible for returning the result to the end user. It converts and processes results through a result pipeline. Tasks include organization for categorization, auto-clustering, dynamic drill-down, passing results on to the application, and pushing results to the alert engine and then to the external environment (e.g. mail, queue).

Tuning and administration is where the administrator sets up search parameters such as relevance and prioritization. Examples are absolute and relative query boosting, relative document boosting, and custom processing logic (pre-index, query). The administration tool is a browser-based application used to configure and schedule the crawler, configure the server, and run several reporting features.
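To make the boosting terms above more concrete, here is a minimal Python sketch. It assumes that a relative boost multiplies a document's base score and that an absolute boost pins a document to the top of the hit list; both interpretations, and the function name `apply_boosts`, are assumptions made for illustration rather than documented IntelliSearch behaviour.

```python
def apply_boosts(hits, relative_boosts=None, absolute_boosts=None):
    """Re-rank (doc_id, score) hits.

    relative_boosts: doc_id -> multiplier applied to the base score.
    absolute_boosts: list of doc_ids pinned to the top, in order.
    Both semantics are assumptions made for this sketch.
    """
    relative_boosts = relative_boosts or {}
    absolute_boosts = absolute_boosts or []
    # Apply relative multipliers, then sort by descending score.
    scored = [(doc, score * relative_boosts.get(doc, 1.0)) for doc, score in hits]
    scored.sort(key=lambda item: item[1], reverse=True)
    # Pinned documents come first, but only if they matched the query.
    present = {doc for doc, _ in scored}
    pinned = [doc for doc in absolute_boosts if doc in present]
    rest = [doc for doc, _ in scored if doc not in pinned]
    return pinned + rest

hits = [("policy.doc", 0.50), ("memo.doc", 0.40), ("handbook.pdf", 0.30)]
ranked = apply_boosts(
    hits,
    relative_boosts={"handbook.pdf": 2.0},
    absolute_boosts=["memo.doc"],
)
```

In this example the relative boost lifts `handbook.pdf` above `policy.doc`, while the absolute boost places `memo.doc` first regardless of its score.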

Security. IntelliSearch ESP's unique combination of sophisticated mathematical algorithms automates processing and conceptual analysis of large volumes of content without sacrificing critical security aspects. IntelliSearch ESP provides three basic forms of security:

- Authentication: This governs who is able to log in to the system. IntelliSearch ESP allows direct connections to the preferred authentication directory, such as Notes, Active Directory, LDAP, Exchange, Netware or Oracle.

- Entitlement: This governs which items in the results list can be seen by the user. In all corporate environments it is essential that underlying security entitlement models be respected. IntelliSearch remains synchronized with all underlying security models. Updates and changes are immediately reflected in the IntelliSearch entitlement model.

- Authorization: This governs who is able to view documents after clicking on the links in the results list. It is not required when entitlement is used.
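The entitlement idea can be sketched in a few lines of Python. The example assumes each indexed document carries the set of groups allowed to see it and that a user is identified by group memberships; the `Document` class and `filter_results` function are hypothetical names, and a real deployment would take this information from the underlying security models mentioned above.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    # Groups entitled to see this document (assumed to be synced from the source system).
    allowed_groups: set = field(default_factory=set)

def filter_results(results, user_groups):
    """Keep only documents that share at least one group with the user."""
    user_groups = set(user_groups)
    return [doc for doc in results if doc.allowed_groups & user_groups]

results = [
    Document("salaries.xls", {"hr"}),
    Document("handbook.pdf", {"hr", "staff"}),
    Document("roadmap.ppt", {"management"}),
]
visible = filter_results(results, user_groups=["staff"])
```

Because filtering happens before the results list is returned, a user never sees a link to a document they are not entitled to open.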

Performance

IntelliSearch ESP is a high-performance, scalable platform. Below are the current platform performance figures:

• Up to 50 million documents on one server*

• Number of users: more than 1,000 queries per second supported

• Latency: less than 1 second data input and query latency

* Hardware configuration dependent; see the System Requirements section

Platform Search technology

For the user, search is all about speed and relevance. Technically, it is about text strings, how they are interpreted by the search engine, and how the search result is presented to the user.

IntelliSearch ESP uses advanced search functionality to find all relevant documentation regardless of misspellings, synonyms and inflected word forms. IntelliSearch ESP supports both keyword search and relevance search, and can switch between the two automatically. Keyword search is simple search, while relevance search uses a statistical algorithm that looks for text uniqueness and finds matching relevant documents. ESP search supports exact matches, wildcards, paragraphs, integers, Boolean expressions and truncation.

This, combined with unlimited text strings, enables precise search results. Other advanced search mechanisms that improve the search results are spell checks, use of base forms of a word, and use of synonyms and dictionaries. The process for matching a search string to the search engine's index is shown below:

[Figure: The search string processing. Pipeline stages: Tokenizer, Spell checker, Base form reduction, Synonyms, Vectorization, Custom dictionaries. Annotations: Tokenizer: ensures correct treatment of characters (e.g. on demand: no lower casing). Stemming + Synonym: reduction to base form, represented symbolically; thesaurus support for narrower and broader terms; Norsk - Nynorsk. Relevance: applying vectorization for relevance indexing. Adaptive query evaluation: ranking profiles, geo position.]
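The stage order in the search string processing can be sketched as a small Python pipeline. The dictionaries below are toy data, the function names are hypothetical, and the bag-of-words term count used for the vectorization step is an assumption standing in for whatever representation the product actually uses.

```python
from collections import Counter

# Toy dictionaries; a real deployment would use full language resources.
SPELLING = {"serach": "search", "documnets": "documents"}
BASE_FORMS = {"documents": "document", "searching": "search", "files": "file"}
SYNONYMS = {"document": ["file"]}

def tokenize(text, lowercase=True):
    """Split on whitespace and strip punctuation; lower casing is optional."""
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    tokens = [t for t in tokens if t]
    return [t.lower() for t in tokens] if lowercase else tokens

def spell_check(tokens):
    """Replace known misspellings with their corrections."""
    return [SPELLING.get(t, t) for t in tokens]

def to_base_form(tokens):
    """Reduce each token to its base form where one is known."""
    return [BASE_FORMS.get(t, t) for t in tokens]

def expand_synonyms(tokens):
    """Append known synonyms after each token."""
    expanded = []
    for t in tokens:
        expanded.append(t)
        expanded.extend(SYNONYMS.get(t, []))
    return expanded

def vectorize(tokens):
    """Represent the query as a bag-of-words term-frequency vector."""
    return Counter(tokens)

def process_query(text):
    tokens = tokenize(text)
    tokens = spell_check(tokens)
    tokens = to_base_form(tokens)
    tokens = expand_synonyms(tokens)
    return vectorize(tokens)

vector = process_query("Serach documnets, files!")
```

Running the same stages over documents at index time and over the search string at query time is what lets a misspelled or inflected query still match the indexed terms.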

Platform Administration tools

Administration - Search Engine

Administration is carried out by five user groups: end user, advertiser, business manager, administrator and developer. IntelliSearch ESP comes with functionality to suit each user group's administration needs. Depending on the group, administration access is provided in the end-user interface, in an administrator tool, or in a developer tool.

The multiple levels of administration for the various users are shown below:

[Figure: Administration framework, multiple levels of control. User groups: End users, Advertiser, Business managers, Administrator, Developer. Control levels: Application model, User profiles, Business rules, Core algorithmic model. Control mechanisms shown: sorting, navigation, feedback, alerts, media windows; banner upload and positioning, keyword ads, editing of information page; alert parameters, boosting/prioritization, ad/banner/keyword pricing (business managers); profile, security settings, manual data cleansing; algorithm "weights", categorization (developer).]

For the administrator, a tool is provided to do the following:

• Define and crawl data sources.

• Define crawler parameters such as URL boundary rules, crawling depth, proxy settings, etc.

• Create and modify schedules for the crawler.

• Set query options. Query options allow users to limit their searches. Searches can be limited to document attributes (e.g. title, author) and data source groups. Data source groups are logical entities exposed to the search engine user.

• Adjust relevancy ranking of the search hit list. ESP allows administrators to influence the order in which documents are ranked in the search hit list. Use this to promote important documents to higher scores and make them easier to find.

• Define suggested links for specific search terms.

• Define alternative words for specific search terms.

• Set up authentication mechanisms for certain data sources.
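As an illustration of two of the crawler parameters listed above, the following Python sketch combines a URL boundary rule with a maximum crawling depth. The configuration keys (`allowed_hosts`, `allowed_path_prefix`, `max_depth`) and function names are hypothetical; the actual administration tool may express these rules differently.

```python
from urllib.parse import urlparse

def within_boundary(url, allowed_hosts, allowed_path_prefix="/"):
    """Return True if the URL stays inside the configured host and path."""
    parts = urlparse(url)
    return parts.hostname in allowed_hosts and parts.path.startswith(allowed_path_prefix)

def should_crawl(url, depth, allowed_hosts, max_depth, allowed_path_prefix="/"):
    """Combine the boundary rule with a maximum crawling depth."""
    return depth <= max_depth and within_boundary(url, allowed_hosts, allowed_path_prefix)

# Hypothetical crawler configuration for one data source.
config = {
    "allowed_hosts": {"intranet.example.com"},
    "allowed_path_prefix": "/docs",
    "max_depth": 2,
}

ok = should_crawl("http://intranet.example.com/docs/a.html", 1, **config)
too_deep = should_crawl("http://intranet.example.com/docs/a/b/c.html", 3, **config)
outside = should_crawl("http://www.example.org/docs/a.html", 1, **config)
```

A link is followed only if both checks pass, which keeps the crawler from wandering off the configured site or descending further than intended.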

Administration – Reporting

IntelliSearch provides tools to capture the following statistics:

• All search-strings

• Samples of available statistics are:

– Top searches

– Searches with no results

– Vendor/product/services search statistics

– Click through stats per vendor

– Correlation between ranking and click-throughs

– Banner ad showing

– Call-me button statistics

Data extraction is configurable, and IntelliSearch ESP offers export to Excel and to all analysis tools through web services and XML.
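Two of the statistics listed above, top searches and searches with no results, can be derived from a log of search strings as in the Python sketch below. The log format (query string plus number of results) and the function names are assumptions for illustration; they do not describe the IntelliSearch export format.

```python
from collections import Counter

# Hypothetical search log: (query string, number of results returned).
search_log = [
    ("annual report", 12),
    ("vacation policy", 3),
    ("annual report", 12),
    ("intelisearch", 0),
    ("vacation policy", 3),
    ("annual report", 11),
]

def top_searches(log, n=2):
    """Most frequent query strings with their counts."""
    return Counter(query for query, _ in log).most_common(n)

def searches_without_results(log):
    """Distinct query strings that returned zero results, in sorted order."""
    return sorted({query for query, hits in log if hits == 0})

top = top_searches(search_log)
empty = searches_without_results(search_log)
```

Queries that repeatedly return no results are natural candidates for the alternative words and suggested links that the administrator can define.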

Other technical issues

End-user Interaction integrations

IntelliSearch ESP makes it possible to set up the following interaction options in the search result:

• Company links

• Text messaging (mobile)

• V-card

• Call-back button

• Integrated chat room

• Integrated discussion forums

• Integrated feedback system

Multi-tier Search Architecture

IntelliSearch ESP supports a multi-tier architecture. Customers in a multi-continent environment may choose to set up separate physical search clusters for performance reasons. IntelliSearch ESP provides multi-index support for multiple search centres. This enables a superb end-user search experience in a global company without sacrificing relevancy and freshness. IntelliSearch ESP synchronizes indexes at regular intervals set by the customer. Below is an example of a multi-tier architecture.

[Figure: Example of a multi-tier architecture. Components shown: Clients; a Search Cluster containing Client Handlers and Search Servers; Other Search Clusters; a Core Services Host; and Customer data in Databases A, B, C and D.]
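One way to picture multi-index support is a query that collects ranked hits from several indexes and merges them into a single list. The Python sketch below assumes each index returns hits already sorted by descending score and that scores are comparable across indexes; both are simplifying assumptions, and `merge_ranked` is a hypothetical name rather than an IntelliSearch function.

```python
import heapq
from itertools import islice

def merge_ranked(result_lists, limit=10):
    """Merge per-index hit lists, each already sorted by descending score."""
    merged = heapq.merge(*result_lists, key=lambda hit: hit[1], reverse=True)
    return list(islice(merged, limit))

# Hypothetical hits from two regional indexes: (document, score).
europe = [("eu-contract.doc", 0.92), ("eu-memo.doc", 0.40)]
asia = [("asia-plan.ppt", 0.75), ("asia-notes.txt", 0.10)]

top = merge_ranked([europe, asia], limit=3)
```

Merging lazily and stopping at `limit` means only as many hits as are needed for the first result page are consumed from each index.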

Available file formats

When creating the index, the IntelliSearch platform uses the Microsoft iFilter interface to extract text and property information from files. The filtering interface extracts chunks of text from documents, filtering out embedded formatting and retaining information about the position of the text. It also extracts chunks of values, which are properties of an entire document or of well-defined parts of a document.

The following file formats are available for indexing:

Available filters included in IntelliSearch:

• Microsoft Office Word
• Microsoft Office Excel
• Microsoft Office PowerPoint
• Microsoft Office Visio
• HTML
• XML
• RTF - Rich-Text Format
• Text
• WordPad
• Adobe Acrobat PDF
• Word Perfect 8
• JPEG Filter
• DjVu
• MP3
• Microsoft Scheduler+
• News NNTP

Other filters available at a charge:

• Flash
• Open Office
• Microsoft Project
• SolidWorks
• Pro/Engineering
• vCard
• XMP - JPEG, GIF, TIFF, PNG, PS, EPS, PSD, AI and SVG
• Mail MSG files
• AutoCad 2002
• Windows Media/Audio
• AutoCad
• Coreldraw
• Pro Engineering
• Visio 2002

Archive formats: ZIP, SFX, SPLIT ZIP, JAR, JAR SFX, CAB, LHA, LHA SFX, LZH, LZH SFX, GZIP, TAR, TZ, TAZ, TGZ, UUE/XXE/ENC.

Any other formats not on this list can be delivered on demand.

System Requirements

When choosing hardware for your IntelliSearch ESP server, please follow the specifications given in this document. Before purchasing or installing a server, please read the implementation guide.

Software

Operating System: Windows 2003 R2 64-bit
File System: NTFS
Computer Role: Domain member
IP Address: Fixed
Applications: MySQL or MS SQL Server 2005 (for statistics only), .NET Framework 3.0, Lotus Notes Client (for Lotus Notes indexing)

Hardware

When choosing hardware, the most important parameters to consider are:

• The number of documents

• The number of connectors

• File size: large text documents require more disk space

• Total number of users: this is a substantial factor. One CPU can handle roughly 100 queries per second. Scaling for thousands of users requires several CPUs.
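The rule of thumb above translates into simple arithmetic. The Python sketch below estimates a CPU count from an expected peak query rate; how many queries per second a given user population generates is not stated in this document, so the peak rate is left as an input you must estimate yourself, and `cpus_needed` is a hypothetical helper name.

```python
import math

QUERIES_PER_SECOND_PER_CPU = 100  # rough figure quoted above

def cpus_needed(peak_queries_per_second):
    """Estimate CPU count from peak query load, rounding up, minimum one."""
    return max(1, math.ceil(peak_queries_per_second / QUERIES_PER_SECOND_PER_CPU))

estimate = cpus_needed(350)
```

For a peak of 350 queries per second the estimate is 4 CPUs; treat this as a starting point only, since the number of documents and connectors also affects sizing.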

                   Recommended                            Minimum
Documents (K)      CPU*         Memory   Hard Drive       CPU          Memory   Hard Drive
0-50               2.5 GHz      1 GB     50 GB            2 GHz        512 MB   50 GB
50-100             3 GHz        2 GB     75 GB            2.5 GHz      1 GB     75 GB
100-500            3 GHz        3 GB     100-200 GB       2.5 GHz      1.5 GB   100-200 GB
500-1,000          2*3 GHz      4 GB     200-250 GB       3 GHz        2 GB     200-250 GB
1,000-2,000        2*3.5 GHz    6 GB     250-400 GB       2*3 GHz      3 GB     250-400 GB
2,000-5,000        4*4 GHz      8 GB     400-500 GB       4*3.5 GHz    4 GB     400-500 GB
5,000-50,000       4*4 GHz      16 GB    500 GB+

All CPUs require 64-bit capability, either x64 or IA-64. Hard drive performance is a major factor for search engine performance. IntelliSearch recommends SAS or SCSI drives running at 10K rpm or faster. There is no performance gain with striping (RAID) solutions. IntelliSearch does not support virtual machines in a production environment.

For further information please contact [email protected]

* Actual CPU frequencies depend on processor family