IBM Confidential
IBM Software Group | Yamato Software Development Laboratory
IBM Content Analytics with Enterprise Search
BigInsights Integration
Mar 28th, 2012
Challenge and Approach
Challenge
– Achieve massive scale-out
– Utilize cloud environment as resource pool
Approach
– Keep compatibility with current version to respect existing customers
  • No end user impact
  • Seamless administration
– Utilize current assets
  • UIMA Infrastructure
  • UIMA Annotators (LW, System-T, Takmi, …)
  • Various data source crawlers
  • …
– Utilize BigInsights as scale-out infrastructure
Seamless Scale-out Scenario
ICA V3.0 offers 3 types of system configuration according to the volume of data
POC with small data can be done on a single workstation
Production system will be deployed to 1 to N servers
Production system analyzing big data will utilize BigInsights
* BigInsights is supported only on Linux
About InfoSphere BigInsights
IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise. … BigInsights enhances this technology to withstand the demands of your enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. …
InfoSphere BigInsights is a prerequisite
– Version 1.3 is officially supported
Feature Overview: Collection on BigInsights
Search & Text Analytics Capability
– UIMA
– System-T
– Gumshoe
Scale Out
– IBM Hadoop
– ILEL BigIndex
Flexible Job Flow
– Orchestrator (a.k.a. MetaTracker)
Easy Data Manipulation
– JAQL
Robust File System
– GPFS (Shared Nothing Cluster version, not yet released)
ICA V3.0 Analytics Flow on BigInsights
[Figure: document processing flow between IBM Content Analytics (regular OS) and IBM InfoSphere BigInsights. Labels recoverable from the diagram:]
IBM Content Analytics (regular OS): Various data sources, Crawler, Importer, Text Analytics / Search Runtime, Exporter, Indexing Service Process (Local Analysis (UIMA base), Global Analysis), RDS, RDS Cache, Slave Index, UI, Other App.
IBM InfoSphere BigInsights: Pre-Processing, UIMA Analysis (UIMA Annotators: LanguageWare, TAKMI, User Custom), System-T Analysis (Gumshoe LA, Gumshoe GA), Indexing (Gumshoe Relevancy), ICA GA (Link Analysis, Dup Doc Elimination, Facet Grouping, Custom GA), BigIndex, Custom Data, HDFS/GPFS
Control: Job Request sent to Orchestrator; job flow controlled by Orchestrator (MetaTracker); operation by JAQL
How Jaql and Hadoop Map/Reduce work with ICA
Example: Omit duplicated documents in RDS by Jaql/Hadoop
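parseRds = fn(rdsFiles: [{path: string, offset: long}], options, output: schematype = null) (
  mapReduce({
    input: rdsFileDescriptor(rdsFiles[*].path, keepRemoved=false),
    output: (if (isnull(output)) (HadoopTemp()) else (output)),
    // map: key every RDS record by its URI
    map: fn(v) (v -> transform [$.uri, $]),
    // reduce: per URI, take the record with the highest sequence number
    // and emit it only if its code passes the check below
    reduce: fn(k, v) (
      d = v -> sort by [$.sequenceNumber desc],
      if (d[0].code > 0 and d[0].code != 4050) ([d[0]]) else ([])
    ),
    …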
[Figure: data flow of the duplicate elimination]
RDS file 1: {uri="A", seqno=0, …}, {uri="B", seqno=1, …}, …
  Input Format: Key="A", value={uri="A", seqno=0, …}; Key="B", value={uri="B", seqno=1, …}; …
RDS file 2: {uri="A", seqno=100, …}, {uri="C", seqno=101, …}, …
  Input Format: Key="A", value={uri="A", seqno=100, …}; Key="C", value={uri="C", seqno=101, …}; …
Map, then Reduce:
  Reducer 1: Key="A", Values=[{uri="A", seqno=100, …}, {uri="A", seqno=0, …}]
    Output Format (Json): {uri="A", seqno=100, …}, …
  Reducer 2: Key="B", Values=[{uri="B", seqno=1, …}]; Key="C", Values=[{uri="C", seqno=101, …}]
    Output Format (Json): {uri="B", seqno=1, …}, {uri="C", seqno=101, …}, …
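The same map and reduce logic can be traced outside Hadoop. The following is a minimal Python sketch, not ICA code: it assumes each RDS record is a dictionary with uri, sequenceNumber, and code fields (names taken from the Jaql example above), and the function names are hypothetical. The meaning of code 4050 is not stated on the slide, so the check is copied as-is.

```python
from collections import defaultdict

EXCLUDED_CODE = 4050  # value excluded by the slide's reduce step; its meaning is not given in the source


def map_phase(records):
    """Map step: key every record by its URI, like `transform [$.uri, $]`."""
    for rec in records:
        yield rec["uri"], rec


def reduce_phase(values):
    """Reduce step: keep the record with the highest sequence number,
    and emit it only if its code passes the same check as the Jaql example."""
    latest = max(values, key=lambda r: r["sequenceNumber"])
    if latest["code"] > 0 and latest["code"] != EXCLUDED_CODE:
        return [latest]
    return []


def dedup(rds_files):
    """Group mapped records by key (the shuffle), then reduce each group."""
    groups = defaultdict(list)
    for records in rds_files:
        for key, value in map_phase(records):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(groups[key]))
    return result
```

With two input files holding A(seqno 0), B(seqno 1) and A(seqno 100), C(seqno 101), the sketch keeps A(100), B(1), and C(101), which matches the flow in the figure.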
Differences : In general
                      | Regular collection | BigInsights collection
Time to refresh index | Quick | Lazy
Scalability           | Under 10 servers | Over 100 servers
Flexibility           | System must have peak capacity | System resources can be allocated as required
Best for the use case | Documents are continuously added/removed/updated; a powerful server is available | A large number of documents is processed at once; BigInsights is already in place; flexibility is needed
Easy Configuration
Specify BigInsights server information
– Admin user can confirm the setting on Topology View
Specify “Use IBM BigInsights” while creating a collection
– Configuration files, ICA libraries, UIMA PEARs (including custom PEARs), and other required modules will then be distributed to the BigInsights servers automatically
Storage requirement with BigInsights
ICA
– ES_NODE_ROOT should be shared on all nodes to share configuration and other resources
BigInsights
– Jaql and Map/Reduce use local storage as temporary storage
– HDFS also uses local storage as a part of the file system
– BigIndex also consumes local storage to merge indexes
For a small cluster, it is strongly suggested to use GPFS with fibre as storage, for performance and reliability reasons
Storage requirement with BigInsights : HDFS
HDFS
– Storage on each data node will be used as a part of the file system
– Capacity can be increased by adding storage to each data node or by adding new data nodes with storage
– Each block is replicated (default: 3 replicas)
– Each searcher process downloads the index from HDFS to its local file system
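Because every block is stored several times, the raw disk across data nodes overstates how much data HDFS can hold. A minimal Python sketch of that arithmetic follows. It assumes every data node contributes the same amount of disk, and it ignores the local space that Jaql, Map/Reduce, BigIndex, and the searchers also consume. The function name and the node figures are hypothetical; only the default replication of 3 comes from the slide.

```python
def usable_hdfs_capacity_tb(data_nodes, disk_per_node_tb, replication=3):
    """Approximate logical HDFS capacity: raw capacity divided by the replication factor."""
    if data_nodes <= 0 or disk_per_node_tb < 0 or replication <= 0:
        raise ValueError("node count and replication must be positive, disk size non-negative")
    raw_tb = data_nodes * disk_per_node_tb
    return raw_tb / replication
```

For example, 10 data nodes with 6 TB each give 60 TB of raw disk but only about 20 TB of logical capacity at the default replication.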
Storage requirement with BigInsights : Shared storage
Shared storage
– High-performance storage (e.g. GPFS with fibre) will be required
– Each searcher must share the storage
– ICA servers should use the same storage as ES_NODE_ROOT
– Using NFS has some requirements; please check the release notes
Custom Global Analysis
Custom Global Analysis by JAQL
Global Analysis
– Obtain new information by examining all documents in a collection
  • Link Counting
  • Duplicated Document Detection
  • etc.
Custom Global Analysis by JAQL
– Users can integrate their own Global Analysis logic using JAQL
  • Input is the result of ICA document processing (field, facet, content)
  • Output can be stored as a document field or facet
User Benefits
– New data manipulation point across documents
  • Crawler plug-ins and UIMA Annotators can manipulate data only within each document
– Manipulate data using Map/Reduce from an SQL-like script
  • JAQL frees developers from writing Map/Reduce programs in Java
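Link counting, one of the global analyses named above, shows why a cross-document step is needed: the count for a document depends on every other document in the collection. The following is a minimal Python sketch of the idea, not ICA or JAQL code. The field names uri, links, and inlink_count, and all function names, are hypothetical.

```python
from collections import Counter


def map_links(doc):
    """Map step: emit (target_uri, 1) for every outgoing link of one document."""
    for target in doc.get("links", []):
        yield target, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per target URI."""
    counts = Counter()
    for uri, n in pairs:
        counts[uri] += n
    return counts


def add_inlink_counts(docs):
    """Return copies of the documents with the count stored as a new field."""
    pairs = (pair for doc in docs for pair in map_links(doc))
    counts = reduce_counts(pairs)
    return [{**doc, "inlink_count": counts.get(doc["uri"], 0)} for doc in docs]
```

The result is written back as a per-document field, which corresponds to the "output can be stored as a document field or facet" point above.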