IBM Confidential

IBM Software Group | Yamato Software Development Laboratory

IBM Content Analytics with Enterprise Search

BigInsights Integration

Mar 28th, 2012

Challenge and Approach

Challenge
– Achieve massive scale-out
– Utilize cloud environment as resource pool

Approach
– Keep compatibility with current version to respect existing customers
  • No end user impact
  • Seamless administration
– Utilize current assets
  • UIMA Infrastructure
  • UIMA Annotators (LW, System-T, Takmi,…)
  • Various data source crawlers
  • …
– Utilize BigInsights as scale-out infrastructure

Seamless Scale-out Scenario

ICA V3.0 offers 3 types of system configuration according to the volume of data

POC with small data can be done on a single workstation

Production system will be deployed to 1 to N servers

Production system analyzing big data will utilize BigInsights

* BigInsights is supported only on Linux

About InfoSphere BigInsights

IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. … BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research….

InfoSphere BigInsights is a prerequisite
– Version 1.3 is officially supported

Feature Overview: Collection on BigInsights

Search & Text Analytics Capability
– UIMA
– System-T
– Gumshoe

Scale Out
– IBM Hadoop
– ILEL BigIndex

Flexible Job Flow
– Orchestrator (a.k.a. MetaTracker)

Easy Data Manipulation
– JAQL

Robust File System
– GPFS (Shared Nothing Cluster version, not yet released)

ICA V3.0 Analytics Flow on BigInsights

[Figure: ICA V3.0 analytics flow on BigInsights. Labels recovered from the diagram:]

Systems
– IBM Content Analytics (regular OS)
– IBM InfoSphere BigInsights

Components
– Various data sources, Crawler, Importer, Custom Data
– RDS, RDS Cache
– Text Analytics / Search Runtime, Exporter, UI, Other App.
– Indexing Service Process, Slave Index
– Document Processing Flow: Local Analysis (UIMA base), Global Analysis
– Job Request, Orchestrator, BigIndex, HDFS/GPFS

Document Processing Flow (job flow controlled by Orchestrator (MetaTracker), operation by JAQL)
– Pre-Processing
– UIMA Analysis
– System-T Analysis
– Gumshoe LA, Gumshoe GA
– Indexing
– ICA GA

Other labels
– UIMA Annotators: LanguageWare, TAKMI, User Custom
– Link Analysis, Dup Doc Elimination, Facet Grouping, Custom GA
– Gumshoe Relevancy
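To make the idea of an Orchestrator-controlled job flow concrete, here is a minimal Python sketch. It only illustrates chaining named stages over a batch of documents. The stage names are the labels from the diagram; their order, the `make_stage` and `run_job_flow` helpers, and the `trace` field are assumptions for illustration and are not part of ICA or BigInsights.

```python
from typing import Callable, Dict, List, Tuple

Document = Dict[str, object]
Stage = Callable[[List[Document]], List[Document]]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records its name on each document."""
    def stage(docs: List[Document]) -> List[Document]:
        return [{**doc, "trace": list(doc.get("trace", [])) + [name]} for doc in docs]
    return stage


# Stage names come from the diagram labels; the ordering is an assumption.
STAGE_NAMES = ["Pre-Processing", "UIMA Analysis", "System-T Analysis", "Indexing", "ICA GA"]
JOB_FLOW: List[Tuple[str, Stage]] = [(name, make_stage(name)) for name in STAGE_NAMES]


def run_job_flow(docs: List[Document], flow: List[Tuple[str, Stage]] = JOB_FLOW) -> List[Document]:
    """Apply each stage to the whole batch in order, as a job-flow controller might."""
    for _name, stage in flow:
        docs = stage(docs)
    return docs
```

Calling `run_job_flow([{"uri": "A"}])` returns the document with a `trace` list naming each stage it passed through, which is enough to see how a batch moves through the flow.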

How Jaql and Hadoop Map/Reduce work with ICA

Example: Omit duplicate documents in RDS by Jaql/Hadoop

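parseRds = fn(rdsFiles: [{path: string, offset: long}], options, output: schematype = null) (
  mapReduce({
    // Input: all listed RDS files (keepRemoved=false)
    input: rdsFileDescriptor(rdsFiles[*].path, keepRemoved=false),
    // Output: the given target, or a temporary Hadoop location if none was passed
    output: (if (isnull(output)) (HadoopTemp()) else (output)),
    // Map: emit [uri, record] so that records are grouped by URI
    map: fn(v) (v -> transform [$.uri, $]),
    // Reduce: keep only the newest record per URI, and only if its code
    // is positive and not 4050; otherwise emit nothing for that URI
    reduce: fn(k, v) (
      d = v -> sort by [$.sequenceNumber desc],
      if (d[0].code > 0 and d[0].code != 4050) ([d[0]]) else ([])
    ),
    // (the slide shows the function only up to this point)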

[Figure: data flow of the example]

RDS (input 1)
{uri="A", seqno=0,…} {uri="B", seqno=1,…}, …

Input Format
Key="A", value={uri="A", seqno=0,…}
Key="B", value={uri="B", seqno=1,…}
…

RDS (input 2)
{uri="A", seqno=100,…} {uri="C", seqno=101,…}, …

Input Format
Key="A", value={uri="A", seqno=100,…}
Key="C", value={uri="C", seqno=101,…}
…

Map / Reduce
Key="A", Values=[{uri="A", seqno=100,…}, {uri="A", seqno=0,…}]
Key="B", Values=[{uri="B", seqno=1,…}]
Key="C", Values=[{uri="C", seqno=101,…}]

Output Format (Json)
{uri="A", seqno=100,…} …
{uri="B", seqno=1,…} {uri="C", seqno=101,…}, …
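The same grouping logic can be followed step by step in plain Python. This is only an in-memory sketch of the map and reduce functions from the Jaql example, not how ICA or Hadoop executes them. The function names are hypothetical, and the `code` value 200 used in the usage note is a made-up sample, since the slide does not show `code` values or explain what 4050 means.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]

EXCLUDED_CODE = 4050  # taken from the Jaql snippet; its meaning is not stated on the slide


def map_phase(records: Iterable[Record]) -> Iterator[Tuple[str, Record]]:
    """Emit (uri, record) pairs so that records can be grouped by URI."""
    for rec in records:
        yield rec["uri"], rec


def reduce_phase(uri: str, values: List[Record]) -> List[Record]:
    """Keep the record with the highest sequenceNumber if its code passes the filter."""
    newest = max(values, key=lambda r: r["sequenceNumber"])
    if newest["code"] > 0 and newest["code"] != EXCLUDED_CODE:
        return [newest]
    return []


def dedup_rds(rds_files: Iterable[Iterable[Record]]) -> List[Record]:
    """Group all records by URI (the shuffle step), then reduce each group."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for records in rds_files:
        for key, value in map_phase(records):
            groups[key].append(value)
    output: List[Record] = []
    for key in sorted(groups):
        output.extend(reduce_phase(key, groups[key]))
    return output
```

With two inputs shaped like the figure (A/0 and B/1 in the first, A/100 and C/101 in the second, all with the sample code 200), `dedup_rds` keeps A with sequence number 100, B with 1, and C with 101. A URI whose newest record carries code 4050 or a non-positive code is dropped entirely.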

Differences : In general

Time to refresh index
– Regular collection: Quick
– BigInsights collection: Lazy

Scalability
– Regular collection: Under 10 servers
– BigInsights collection: Over 100 servers

Flexibility
– Regular collection: System must have peak capacity
– BigInsights collection: System resources can be allocated as required

Best for the use case
– Regular collection: Documents are continuously added/removed/updated; can have a powerful server
– BigInsights collection: Large numbers of documents are processed at once; already have BigInsights; needs flexibility

Easy Configuration

Specify BigInsights server information
– Admin user can confirm the setting on Topology View

Specify “Use IBM BigInsights” while creating a collection
– Then configuration files and ICA libraries, UIMA PEARs (including custom PEARs) and other required modules will be distributed to the BigInsights servers automatically

Storage requirement with BigInsights

ICA
– ES_NODE_ROOT should be shared on all nodes to share configuration and other resources

BigInsights
– Jaql and Map/Reduce use local storage as temporary storage
– HDFS also uses local storage as a part of the file system
– BigIndex also consumes local storage to merge indexes

It is strongly suggested to use GPFS with fibre as storage, for performance and reliability reasons, on a small cluster

Storage requirement with BigInsights : HDFS

HDFS
– Storage on each data node is used as a part of the file system
– Capacity can be increased by adding storage to each data node or by adding new data nodes with storage
– Each block is replicated (default: 3)
– Each searcher process downloads the index from HDFS to its local file system
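Because every block is stored several times, the space available for data is smaller than the sum of the disks. The following Python sketch shows that arithmetic under simplifying assumptions: it divides raw capacity by the replication factor and ignores space reserved for non-HDFS use, such as the temporary storage taken by Jaql, Map/Reduce and BigIndex. The function name and the node sizes are hypothetical.

```python
from typing import Sequence


def usable_hdfs_capacity(node_capacities_tb: Sequence[float], replication: int = 3) -> float:
    """Rough usable capacity in TB: total raw storage divided by the replication factor."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return sum(node_capacities_tb) / replication


# Hypothetical cluster: 10 data nodes with 4 TB each, default replication of 3.
before = usable_hdfs_capacity([4.0] * 10)
# Adding two more 4 TB data nodes raises raw capacity from 40 TB to 48 TB.
after = usable_hdfs_capacity([4.0] * 12)
```

In this hypothetical cluster, 40 TB of raw storage gives roughly 13.3 TB for data, and the two extra nodes raise that to 16 TB. Each searcher's local copy of the index needs additional local disk space on top of this.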

Storage requirement with BigInsights : Shared storage

Shared storage
– High-performance storage (e.g. GPFS with fibre) is required
– Each searcher must share the storage
– ICA servers should use the same storage as ES_NODE_ROOT
– Using NFS has some requirements; please check the release notes

Custom Global Analysis

Custom Global Analysis by JAQL

Global Analysis
– Obtain new information by examining all documents in a collection
  • Link Counting
  • Duplicated Document Detection
  • etc.

Custom Global Analysis by JAQL
– Users can integrate their own Global Analysis logic using JAQL
  • Input is the result of ICA document processing (field, facet, content)
  • Output can be stored as a document field or facet

User Benefits
– New data manipulation point across documents
  • Crawler plug-ins and UIMA Annotators can manipulate data only within each document
– Manipulate data using Map/Reduce from an SQL-like script
  • JAQL frees developers from Java programming of Map/Reduce
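Link counting shows why a global analysis step needs a view across documents: the number of links pointing at a document can only be computed by looking at all the other documents. The sketch below expresses that in map/reduce style in plain Python. It is not JAQL and not the ICA interface. The `uri` and `links` input fields, the `inlink_count` output field, and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Document = Dict[str, object]


def map_links(doc: Document) -> Iterator[Tuple[str, int]]:
    """Map step: emit (target_uri, 1) for every outgoing link of one document."""
    for target in doc.get("links", []):
        yield target, 1


def count_inlinks(docs: Iterable[Document]) -> Counter:
    """Reduce step: sum the emitted ones per target URI across all documents."""
    counts: Counter = Counter()
    for doc in docs:
        for target, one in map_links(doc):
            counts[target] += one
    return counts


def attach_inlink_counts(docs: List[Document]) -> List[Document]:
    """Store the result on each document as a new field (hypothetical name)."""
    counts = count_inlinks(docs)
    return [{**doc, "inlink_count": counts.get(doc["uri"], 0)} for doc in docs]
```

For three documents where A links to B and C, and B links to C, the attached counts are 0 for A, 1 for B, and 2 for C. A crawler plug-in or annotator that sees one document at a time could not produce these numbers.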