Voquette Company Confidential SCORE



Presentation Overview

• Industry Requirements

• Capabilities

• System Architecture and Technologies

• Examples and Scenarios

• Measures (Quality, Performance, Scalability, Robustness)

• Deployment Information

• Questions & Answers: What if

• Business Development Issues

• Milestones and Schedules


Intelligence Content Management Challenges

1. The Problem: massive, disparate information

• Multiple isolated sources of intelligence information (FBI, CIA, etc.) that are not shared or integrated

• Large variety (format, media) of open source, partner, FAA and IC information

2. The Difficulty: inability to have timely actionable info

• Amount of data too overwhelming to use constructively

• Manual methods of aggregating data not scalable

=> Lack of a “complete picture” to make decisions

• Inability to make timely, accurate and actionable conclusions based on information-at-hand

3. The Solution: Voquette’s Semantic Technology

• Technology to analyze and integrate data from disparate sources to provide a near-real-time, reliable, scalable and actionable solution for intelligence and security applications


New Technical Challenges in Enterprise Content Management

1. Aggregation

• Feed handlers/Agents that understand content representation and media semantics

• Push-pull, Web-DB-Files, Structured-Semi-structured-Unstructured data of different types from proprietary, partner and open source

2. Homogenization and Enhancement

• Enterprise-wide common and customizable view (information organization)

• Domain model, taxonomy/classification, metadata standards

• Semantic Metadata: created automatically if possible

• Semantic associations/inferences (link analysis)

3. Semantic Applications (in near real-time)

• Search, personalization, alerts, knowledge browsing/inference for improved relevance, intelligent personalization, customization


Voquette’s Unique Capabilities

• Semantics (understanding of content and user needs)

• Extreme relevance

• Knowledge inferencing (semantic associations)

• Near real-time

• Multiple applications/usage patterns (not just search)

• Automation

• Scalability in all aspects


Voquette Semantic Technology System Architecture

Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content

Fast main-memory based query engine with APIs and XML output

CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata

Distributed agents that automatically extract/mine knowledge from trusted sources

Toolkit to design and maintain the Knowledgebase

Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel

WorldModel specifies enterprise’s normalized view of information (ontology)


Workflow Process

• WorldModel™ (Domain Model), Taxonomy/Classification, Knowledge base schema

• Classifiers

• Knowledge and Content Extraction Agents

• Automated or human-supervised run-time (for classification and metadata enhancement, knowledge base maintenance)

• Semantic Applications

All components support incremental extensions.


Technological Innovation

• Semantic approach (classification/taxonomy, domain model, entities and relationships) [All components]

• Semantic associations/knowledge inferences

• Classification committee (multiple technologies, rather than one size fits all) [CACS]

• Scalability throughout with distributed architecture and implementation (number of content and knowledge sources, indexing, etc.)

• Main memory implementation, incremental checkpointing [SSE]
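
The “classification committee” idea above can be illustrated with a small voting sketch. The three member classifiers, their cue words and the category names are hypothetical stand-ins; the slides do not say which techniques CACS combines or how their votes are weighted. The sketch only shows several independent classifiers voting, with the majority label winning.

```python
from collections import Counter
from typing import Callable, List, Optional

# A classifier maps text to a category label, or None if it abstains.
Classifier = Callable[[str], Optional[str]]

def keyword_classifier(text: str) -> Optional[str]:
    # Hypothetical rule-based member: fires on explicit cue words.
    lowered = text.lower()
    if "earnings" in lowered or "quarterly" in lowered:
        return "Company News"
    if "tournament" in lowered:
        return "Sports"
    return None

def statistical_classifier(text: str) -> Optional[str]:
    # Hypothetical statistical member: counts cue words per category.
    cues = {
        "Company News": ["shares", "ceo", "earnings", "nasdaq"],
        "Sports": ["golf", "coach", "league", "tournament"],
    }
    words = text.lower().split()
    scores = {cat: sum(words.count(w) for w in ws) for cat, ws in cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def knowledge_base_classifier(text: str) -> Optional[str]:
    # Hypothetical knowledge-based member: known entity names imply a topic.
    known_companies = ["Cisco Systems", "Nortel Networks"]
    return "Company News" if any(c in text for c in known_companies) else None

def committee_vote(text: str, members: List[Classifier]) -> Optional[str]:
    """Return the label chosen by most non-abstaining members (ties: first seen)."""
    votes = [label for label in (m(text) for m in members) if label is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

MEMBERS = [keyword_classifier, statistical_classifier, knowledge_base_classifier]
label = committee_vote("Cisco Systems shares rose after quarterly earnings", MEMBERS)
print(label)
```

Letting members abstain (return `None`) keeps a weak classifier from outvoting the others on text it knows nothing about.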


Example

Domain: Intelligence

Sub-domain: People, Org, Places (Other Sub-domains: Financing, Methods & Training, Materials)


Intelligence WorldModel™

What is it? WorldModel™: Template infrastructure to organize and index content contextually

What does it consist of? Domains (categories) and domain-specific attributes, with geo-spatial and temporal info

Setting up a Terrorist Intelligence WorldModel™

[Diagram: Terrorism Intelligence WorldModel™ (simplified). Root: Terrorism Intelligence. Attributes: Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time]

What are the information pieces of possible interest? (that can be modeled as WorldModel™ attributes)

• Groups: Nationalist, Terrorist, Political groups

• Person: Terrorist, Suicide Bomber, Hijacker, Personality

• Event: Flight hijacking, WTC Crash, Kidnapping, Terrorist training

• Bank: Swiss bank, Belgian bank (where groups have accts)

• Attack Material: Knives, Plastic Explosives, RDX, AK47 Gun

• Name Alias: Aliases of terrorists (Osama BL = Usama BL)

• Alias Email Addresses: Email addresses for alias names

• Location: Location related to event of interest

• Time: Date/time related to event of interest


Intelligence Extractor Agents

What is it? Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand

How do they work?

• Intelligence extractor agents use the Intelligence WorldModel™ definition for meaningful metadata extraction from trusted Intelligence content

• Extractor agents exploit the structure of Intelligence content and automatically “pick up” meaningful Intelligence metadata information (as defined in the WorldModel™)

[Diagram: Extractor Agent for CIA confidential content, guided by the Terrorism Intelligence WorldModel™ (Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time). Metadata extracted: syntax metadata, group name, person, attack material, bank name, location/date/time, name aliases]


Intelligence Knowledge Base

What is it? Knowledge Base: Network of Intelligence objects (significant pieces of information) and a representation of the real-world relationships (associations) between them

Relationship definitions, with examples:

• Group originated in Country (“Al Qaeda” originated in “Afghanistan”)

• Group works with Group (“Irish IRA” works with “Colombian Group”)

• Person works for Group (“Nabil Almarabh” works for “Al Qaeda”)

• Person leads Group (“Bin Laden” leads “Al Qaeda”)

• Person has alias Alias (“Bin Laden” has alias “Mohammed”)

• Alias has email Alias Email add (“Mohammed” has email “[email protected]”)

• Person involved in Event (“Bin Laden” involved in “WTC Crash”)

• Event occurred at Location (“WTC Crash” occurred at “New York, USA”)

• Event occurred at Time (“WTC Crash” occurred at “0903, 9/11/01”)

• Group accounts in Bank (“Al Qaeda” accounts in “Swiss bank”)

[Diagram: the Terrorism Intelligence WorldModel™ (Group, Person, Event, Bank, Attack Material, Name Alias, Alias Email Address, Location, Time) mapped to the Intelligence Knowledge Base Definition. Entity types: Group, Alias, Person, Country, Bank, Location, Time, EmailAdd, Event. Relationships: Account in, Has alias, Has email, Involved in, Occurred at, Works for/leads, Originated in, Is funded by/works with]


Categorization and Auto-Cataloging System (CACS)

What is it? CACS: Module that categorizes content and automatically creates metadata of content

How does it work? Uses a hybrid of statistical, machine learning and Intelligence knowledge-base techniques

Application in Intelligence: CACS could be trained to intelligently process Intelligence content to classify the content piece as a terrorism-related event (WTC Crash, Flight hijacking, etc.)

[Diagram: Structured OR unstructured Intelligence content is fed to CACS, which exchanges information with the Intelligence Knowledge Base Definition for metadata creation. Entity types: Group, Alias, Person, Country, Bank, Location, Time, EmailAdd, Event. Relationships: Account in, Has alias, Has email, Involved in, Occurred at, Works for/leads, Originated in, Is funded by/works with]

Event: Pentagon Attack

Metadata extracted:

• Terrorist Group: Al Qaeda

• Person: Bin Laden

• Location: Washington, USA

• Time: 0918 hrs

• Affiliation Country: Afghanistan

• Allied Group: Saudi Misaal

• Person Alias: Mohammed


Intelligence Semantic Engine

What is it? Semantic Engine: Fast main memory-based front end query engine that enables the end-user to retrieve highly relevant and personalized content via custom APIs

Features and Functionality

• Minimal input from security agent: system intelligent enough to provide all possible relevant content to security agent (type in “Bin Laden” and get all relevant information on him and other items related to him)

• Applications: Search, personalization, alerts, notifications, directory

[Diagram: a Confidential Agent submits a user query to the Semantic Engine and gets highly relevant content returned. The engine is fed by Content Enhancement Technology and supports Search, Personalization, Directory, Alerts/Notifications, Intelligent Inference, Analyst WorkBench and Custom Apps.]


Scenario 1: Intelligent Analysis of Confidential Email


Scenario 1: Intelligent Analysis of Email (Contd.)

• Information underlined in blue marks important metadata elements automatically picked up by the Intelligence extractor agents

• Information shown in red boxes marks names of terrorists (stored in our Knowledge Base) that are also automatically picked up by the Intelligence extractor agents

• CACS can determine by content analysis that this email describes a “Terrorist Meeting”

• Intelligent inferencing is possible due to the semantic associations in the Knowledge Base

“Mohamed Atta met with Abdulaziz Alomari” (picked up from an explicit mention in the email)

[Diagram: Voquette Knowledge Associations. Each person “works for” a group (Al Qaeda, Saudi Misaal); each group “originated in” a country (Afghanistan, Saudi Arabia)]

Inference: Al Qaeda and Saudi Misaal have possibly started working together as allied groups

Inference: Afghanistan and Saudi Arabia have groups that probably collaborate; look for other relationships
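
The inference pattern on this slide (two people are reported to have met, each works for a different group, so the groups may be collaborating) can be sketched as a simple rule over knowledge-base triples. The triple layout, the function names and the placeholder entity names are all assumptions made for illustration. As the slide's own wording (“possibly”, “probably”) suggests, the output is a set of leads for an analyst to verify, not conclusions.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)

# Placeholder knowledge base; entity names are illustrative only.
KB: Set[Triple] = {
    ("person_a", "works for", "group_x"),
    ("person_b", "works for", "group_y"),
    ("group_x", "originated in", "country_1"),
    ("group_y", "originated in", "country_2"),
}

def objects_of(subject: str, relation: str) -> List[str]:
    """All objects linked to `subject` by `relation`, in sorted order."""
    return sorted(o for s, r, o in KB if s == subject and r == relation)

def infer_from_meeting(first: str, second: str) -> List[str]:
    """Given an extracted 'met with' fact, propose candidate group and country links."""
    hypotheses = []
    for g1 in objects_of(first, "works for"):
        for g2 in objects_of(second, "works for"):
            if g1 == g2:
                continue  # same group: nothing new to infer
            hypotheses.append(f"{g1} possibly works with {g2}")
            for c1 in objects_of(g1, "originated in"):
                for c2 in objects_of(g2, "originated in"):
                    if c1 != c2:
                        hypotheses.append(
                            f"{c1} and {c2} may host collaborating groups"
                        )
    return hypotheses

# Fact picked up from an explicit mention in a document.
leads = infer_from_meeting("person_a", "person_b")
for lead in leads:
    print(lead)
```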


Scenario 2: Analyst Workbench

• Voquette’s Semantic Technology enables highly relevant and comprehensive terrorist research

• Example: A security agent wishes to perform research on “Bin Laden” (as he is prime suspect)

• News/Information directly about Bin Laden is retrieved (that mentions his name explicitly)

• News/Information on Al Qaeda is retrieved (Bin Laden - Al Qaeda association in KB)

• News/Information on WTC Crash is retrieved (WTC Crash - Bin Laden association in KB)

• News/Information on Mohammed is retrieved (Mohammed - Bin Laden ‘alias assoc.’ in KB)

• News/Information (intelligence) on Afghanistan is retrieved (Al Qaeda - Afghanistan in KB)

• News/Information (intelligence) on Swiss bank is retrieved (Al Qaeda - Swiss bank in KB)

• Combined together, this correlated information is extremely valuable in bringing together multiple actionable perspectives and points of view on one screen

• Result: Less time spent, faster and much better decision making, more security!


Knowledge Inferencing Workflow

[Figure: workflow diagram. Surviving labels: Syntax Metadata, Semantic Metadata, “led by”, “Same entity”, Human-assisted inference]


Analyst Usage Scenarios/Interfaces for Knowledge Inference

Analysts can possibly use:

• Search

• Knowledge Base Browser / Directory

• Personalization/Alerts

• APIs for custom applications

All options support Reference Pages, Semantic Associations, Knowledge-based browsing


Intelligence Analyst Browsing Scenario


Core Competencies of Voquette’s Semantic Technology

Content Aggregation, Integration and Normalization

• Create a Customized WorldModel™ (domain model with customized domain attributes)

• Content Aggregation and integration from multiple sources, formats and media (text/audio/video)

• Support push or pull delivery/ingestion of content

• Patented extractor agent technology

• Metadata extraction from structured, semi-structured and unstructured text (fully automated)

• Automatically homogenize content feed tags (fully automated)

Categorization and Auto-Cataloging

• Automatically categorize structured and unstructured text

• Create contextually relevant semantic metadata from unstructured text (fully automated)

• Uniquely uses a hybrid of statistical, machine learning and knowledge-base techniques for classification


Core Competencies of Voquette’s Semantic Technology

Content Enhancement using Knowledge Base

• Create and maintain a Customized Knowledge Base for any domain

• Automatically create content tags based on the text itself (fully automated)

• Automatically enhance content tags based on information outside of the text (fully automated) by exploiting the Knowledge Base

• Provide the end user not only the relevant content he asked for, but also relevant content that he did not explicitly ask for but needs to know

Semantic Engine

• Fast, main-memory based Semantic Engine

• Response time on the order of tens of milliseconds

• Performance: 1 million queries per hour per server

• Real Time Indexing (stories indexed for search/personalization within a minute)

• Near real-time search/personalization of new content and breaking news

• Information retrieval based on quality and not quantity

• Semantic Applications: Search, Directory, Personalization, Alert, Notifications, Custom enterprise applications
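
As a quick sanity check on the performance figures above, the per-hour throughput can be converted to a per-second rate and combined with the response time to estimate how many queries are in flight at once (Little's law). The 30 ms value is an assumed reading of “10s of milliseconds”, not a number from the slides.

```python
QUERIES_PER_HOUR = 1_000_000   # from the slide
ASSUMED_LATENCY_S = 0.030      # assumption: "10s of milliseconds" read as 30 ms

queries_per_second = QUERIES_PER_HOUR / 3600
# Little's law: average number in flight = arrival rate x time in system
in_flight = queries_per_second * ASSUMED_LATENCY_S

print(f"{queries_per_second:.1f} queries/s, about {in_flight:.1f} in flight")
```

Under that assumption the stated throughput corresponds to roughly 278 queries per second and about 8 concurrent queries per server.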


SCORE Implementation Architecture

Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content

Fast main-memory based query engine with APIs and XML output

CACS provides automatic classification (w.r.t. WorldModel) from unstructured text and extracts contextually relevant metadata

Distributed agents that automatically extract/mine knowledge from trusted sources

Toolkit to design and maintain the Knowledgebase

Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel

WorldModel specifies enterprise’s normalized view of information (ontology)


Example

Domain: Financial Services

Sub-domain: Equity Market (other potential sub-domains: Fixed Income, Mutual Funds, …)


Content Enhancement Workflow

[Figure: workflow diagram. Surviving labels: Syntax Metadata, Semantic Metadata]


Content Asset Index Evolution

1. Extractor Agent for Bloomberg

• Scans text for analysis

• Metadata extracted automatically

• Creates asset (index) out of extracted metadata

Asset after extraction:

• Syntax Metadata: Producer: BusinessWire; Source: Bloomberg; Date: Sept. 10 2001; Location: San Jose, CA; URL: http://bloomberg.com/1.htm; Media: Text

• Semantic Metadata: Company: Cisco Systems, Inc.

2. Categorization & Auto-Cataloging System (CACS)

• Scans text for analysis

• Classifies document into pre-defined category/topic

• Appends topic metadata to asset

Asset after classification:

• Syntax Metadata: as above

• Semantic Metadata: Company: Cisco Systems, Inc.; Topic: Company News

3. Knowledge Base

• Leverages knowledge to enhance metatagging

• Knowledge Base entry for Cisco Systems: Ticker: CSCO; Exchange: NASDAQ; Industry: Telecomm.; Sector: Computer Hardware; Executives: John Chambers (CEO of); Competition: Nortel Networks (Competes with); Headquarters: San Jose

Enhanced Content Asset:

• Syntax Metadata: as above

• Semantic Metadata: Company: Cisco Systems, Inc.; Topic: Company News; Ticker: CSCO; Exchange: NASDAQ; Industry: Telecomm.; Sector: Computer Hardware; Executive: John Chambers; Competition: Nortel Networks; Headquarters: San Jose, CA

4. The enhanced content asset is indexed and passed (XML Feed) to the Semantic Engine
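
The enrichment steps in this slide can be sketched as successive updates to one metadata record. The dictionary layout and function names are hypothetical, the classifier is a stub, and the knowledge-base entry simply restates the Cisco example from the slide. This is a sketch under those assumptions, not Voquette's implementation.

```python
from typing import Dict

KNOWLEDGE_BASE: Dict[str, Dict[str, str]] = {
    # Values copied from the slide's Cisco example.
    "Cisco Systems, Inc.": {
        "Ticker": "CSCO",
        "Exchange": "NASDAQ",
        "Industry": "Telecomm.",
        "Sector": "Computer Hardware",
        "Executive": "John Chambers",
        "Competition": "Nortel Networks",
        "Headquarters": "San Jose, CA",
    }
}

SYNTAX_FIELDS = ("Producer", "Source", "Date", "Location", "URL", "Media")

def extract(document: Dict[str, str]) -> Dict[str, Dict[str, str]]:
    """Step 1 (extractor agent): build the asset from fields found in the source."""
    syntax = {k: document[k] for k in SYNTAX_FIELDS}
    return {"syntax": syntax, "semantic": {"Company": document["Company"]}}

def classify(asset, text: str):
    """Step 2 (CACS stand-in): append a topic; a real classifier would analyze `text`."""
    asset["semantic"]["Topic"] = "Company News"  # stubbed result
    return asset

def enhance(asset):
    """Step 3: add knowledge-base attributes that the text itself never mentions."""
    known = KNOWLEDGE_BASE.get(asset["semantic"]["Company"], {})
    for attribute, value in known.items():
        # setdefault keeps anything already extracted from the document itself.
        asset["semantic"].setdefault(attribute, value)
    return asset

doc = {
    "Producer": "BusinessWire", "Source": "Bloomberg", "Date": "Sept. 10 2001",
    "Location": "San Jose, CA", "URL": "http://bloomberg.com/1.htm", "Media": "Text",
    "Company": "Cisco Systems, Inc.",
}
asset = enhance(classify(extract(doc), text="..."))
print(asset["semantic"])
```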


Voquette WorldModel™

What is it? WorldModel™: Template infrastructure to organize and index content contextually

What does it consist of? Domains (categories) and domain-specific attributes

Examples

Equity WorldModel™

• Domain: Equity

• Equity-specific attributes: Company, Ticker, Industry, Sector, Executive, Headquarters

Sports WorldModel™

• Domain: Sports

• Sports-specific attributes: Sport Name, Location

• Sub-Domain: Golf

• Golf-specific attributes: Golfer, Tourney, Golf Course

• Sub-Domain: Football

• Football-specific attributes: Player, Team, League, Coach
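
The Equity and Sports definitions above map naturally onto a small tree of domains, each with its own attribute list. The class and method names below are hypothetical, and the rule that a sub-domain also carries its parent's attributes is an assumption (the slide lists the attributes per level but does not state how they combine).

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Domain:
    name: str
    attributes: List[str]
    sub_domains: Dict[str, "Domain"] = field(default_factory=dict)

    def add_sub_domain(self, sub: "Domain") -> "Domain":
        self.sub_domains[sub.name] = sub
        return self

    def attributes_for(self, sub_domain: str = "") -> List[str]:
        """Domain-level attributes, plus the sub-domain's own if one is named."""
        if not sub_domain:
            return list(self.attributes)
        # Assumption: sub-domains inherit the parent domain's attributes.
        return self.attributes + self.sub_domains[sub_domain].attributes

equity = Domain(
    "Equity",
    ["Company", "Ticker", "Industry", "Sector", "Executive", "Headquarters"],
)

sports = Domain("Sports", ["Sport Name", "Location"])
sports.add_sub_domain(Domain("Golf", ["Golfer", "Tourney", "Golf Course"]))
sports.add_sub_domain(Domain("Football", ["Player", "Team", "League", "Coach"]))

print(sports.attributes_for("Golf"))
```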


Voquette Extractor Agents

What is it? Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand

How do they work?

• Extractor agents use the WorldModel™ definition for metadata extraction

• Extractor agents exploit the structure of content and automatically “pick up” meaningful metadata information

• Write once, extract permanently; schedulable according to needs

• Can work on Web content, feeds, XML, corporate databases, etc.

• Extractor agents are specific to the structure of the content at hand

[Diagram: Extractor Agent for CNNfN, guided by the Equity WorldModel™ (Company, Ticker, Industry, Sector, Executive, Headquarters). Metadata extracted: syntax metadata, company, ticker, industry, sector, executives, headquarters]
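
A minimal sketch of the extractor-agent idea described above: a per-source mapping, written once, tells the agent which fields of a structured item correspond to WorldModel™ attributes. The feed item, its tag names and the mapping are made up for illustration; only the Equity attribute list comes from the slides.

```python
import xml.etree.ElementTree as ET
from typing import Dict

EQUITY_ATTRIBUTES = ["Company", "Ticker", "Industry", "Sector", "Executive", "Headquarters"]

# Hypothetical mapping for one source: its tag names -> WorldModel attributes.
# "Write once": a mapping like this is authored per source structure and reused.
SOURCE_TAG_MAP = {
    "co_name": "Company",
    "symbol": "Ticker",
    "industry": "Industry",
    "hq": "Headquarters",
}

def extract_metadata(item_xml: str, tag_map: Dict[str, str]) -> Dict[str, str]:
    """Pick up only the fields that the WorldModel defines for this domain."""
    root = ET.fromstring(item_xml)
    metadata = {}
    for element in root.iter():
        attribute = tag_map.get(element.tag)
        if attribute in EQUITY_ATTRIBUTES and element.text:
            metadata[attribute] = element.text.strip()
    return metadata

# Made-up feed item; the tag names are illustrative only.
SAMPLE_ITEM = """
<story>
  <headline>Networking vendor announces new router line</headline>
  <co_name>Cisco Systems, Inc.</co_name>
  <symbol>CSCO</symbol>
  <hq>San Jose, CA</hq>
  <byline>Staff</byline>
</story>
"""

metadata = extract_metadata(SAMPLE_ITEM, SOURCE_TAG_MAP)
print(metadata)
```

Tags that the mapping does not cover (here `headline` and `byline`) are ignored, which is what keeps the extracted metadata limited to what the domain model considers meaningful.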


Voquette Knowledge Base

What is it? Knowledge Base: Network of entity objects (significant pieces of information) and a representation of the real-world relationships (associations) between them

What does it consist of? Entities (person, location, organization, etc.) and Entity-Relationships

How does it work?

• Structured closely to the structure of the WorldModel™

• Entity and relationship template definitions for the domain at hand

• Work with knowledge extractor agents to collect instances of entities from trusted sources

• Automatically create relationships between instances using type definitions

Equity WorldModel™: Company, Ticker, Industry, Sector, Executive, Headquarters

Equity Knowledge Base Definition

• Entity types: Company, Ticker, Exchange, Industry, Sector, Executives, Headquarters

• Relationships: CEO of, Belongs to, Trades on, Represented by, Located at

Knowledge Base (example instance)

• Company: Cisco Systems

• Ticker: CSCO

• Exchange: NASDAQ

• Industry: Telecomm.

• Sector: Computer Hardware

• Executives: John Chambers (CEO of)

• Competition: Nortel Networks (Competes with)

• Headquarters: San Jose
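
The step “automatically create relationships between instances using type definitions” can be sketched as follows. The relationship names come from the slide, but which entity types each one connects is an assumption (the diagram's arrows did not survive extraction), and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# Relationship templates: (subject type, relationship, object type).
# Endpoint types are assumed; only the relationship names are from the slide.
RELATIONSHIP_TEMPLATES: List[Tuple[str, str, str]] = [
    ("Company", "represented by", "Ticker"),
    ("Ticker", "trades on", "Exchange"),
    ("Company", "belongs to", "Industry"),
    ("Industry", "belongs to", "Sector"),
    ("Executive", "CEO of", "Company"),
    ("Company", "located at", "Headquarters"),
]

def build_relationships(record: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Instantiate every template whose two entity types are present in the record."""
    triples = []
    for subject_type, relation, object_type in RELATIONSHIP_TEMPLATES:
        if subject_type in record and object_type in record:
            triples.append((record[subject_type], relation, record[object_type]))
    return triples

# One record of entity instances, as a knowledge extractor agent might deliver it.
cisco = {
    "Company": "Cisco Systems",
    "Ticker": "CSCO",
    "Exchange": "NASDAQ",
    "Industry": "Telecomm.",
    "Sector": "Computer Hardware",
    "Executive": "John Chambers",
    "Headquarters": "San Jose",
}

triples = build_relationships(cisco)
for triple in triples:
    print(triple)
```

Because the templates are defined on types rather than on instances, adding a new company record yields its relationships with no extra authoring.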


Voquette Categorization and Auto-Cataloging System (CACS)

What is it? CACS: Module that categorizes content and automatically creates metadata of content

How does it work? Uses a hybrid of statistical, machine learning and knowledge-base techniques

Features

• Core competency: not only categorizes, but also catalogs (extracts metadata)

• Unique solution for semantic metadata extraction from unstructured content

• Flexibly adaptable for diverse domains

[Diagram: Structured or unstructured content is fed to CACS, which exchanges information with the Equity Knowledge Base Definition for metadata creation. Entity types: Company, Ticker, Exchange, Industry, Sector, Executives, Headquarters. Relationships: CEO of, Belongs to, Trades on, Represented by, Located at]

Topic: Company News

Metadata extracted:

• Company: Convera

• Ticker: CNVR

• Exchange: NASDAQ

• Industry: Content Management

• Sector: Computer Software

• Headquarters: Vienna, VA

• Executives: Ronald Whittier
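
The cataloging half of CACS, producing metadata from unstructured text with help from the knowledge base, can be approximated by spotting known entity names in the text and attaching their stored attributes. The lookup table restates the Convera example from the slide; the function name, the sample sentence and the plain substring match are illustrative assumptions, and the topic is passed in because classification is treated as a separate step.

```python
from typing import Dict, Optional

# Tiny knowledge base keyed by company name; values restate the slide's example.
COMPANIES: Dict[str, Dict[str, str]] = {
    "Convera": {
        "Ticker": "CNVR",
        "Exchange": "NASDAQ",
        "Industry": "Content Management",
        "Sector": "Computer Software",
        "Headquarters": "Vienna, VA",
        "Executives": "Ronald Whittier",
    }
}

def catalog(text: str, topic: str) -> Optional[Dict[str, str]]:
    """Return topic plus metadata for the first known company named in the text."""
    for name, attributes in COMPANIES.items():
        if name.lower() in text.lower():
            return {"Topic": topic, "Company": name, **attributes}
    return None  # nothing recognized: leave the document uncataloged

# Made-up sentence standing in for unstructured content.
story = "Shares of Convera moved after the company announced a new search product."
record = catalog(story, topic="Company News")
print(record)
```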


Voquette Semantic Engine

What is it? Semantic Engine: Fast main memory-based front end query engine that enables the end-user to retrieve highly relevant and personalized content via custom APIs

Features and Functionality

• Minimal input from user: system intelligent enough to provide only relevant content to user

• Deep levels of personalization

• Applications: Search, personalization, alerts, notifications, directory, routing, syndication

• Custom applications: Research Dashboard (demo)

[Diagram: End Users submit a user query to the Semantic Engine and get highly relevant content returned. The engine is fed by Content Enhancement Technology and supports Search, Personalization, Directory, Alerts/Notifications, Syndication, Dashboard and Custom Apps.]


Semantic Application Example – Research Dashboard

• Focused relevant content organized by topic (semantic categorization)

• Automatic Content Aggregation from multiple content providers and feeds

• Related relevant content not explicitly asked for (semantic associations)

• Competitive research inferred automatically

• Automatic 3rd party content integration
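
“Related relevant content not explicitly asked for” can be sketched as a search that first returns stories tagged with the queried entity and then follows knowledge-base associations one step out. The two associations restate the slides' Cisco example; the story index, headlines and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Knowledge-base associations (from the slides' Cisco example).
ASSOCIATIONS: List[Tuple[str, str, str]] = [
    ("Cisco Systems", "competes with", "Nortel Networks"),
    ("John Chambers", "CEO of", "Cisco Systems"),
]

# Hypothetical indexed stories: headline -> entities tagged on the story.
INDEX: Dict[str, Set[str]] = {
    "Cisco unveils new switch": {"Cisco Systems"},
    "Nortel revises outlook": {"Nortel Networks"},
    "Chambers speaks at industry event": {"John Chambers"},
    "Weather update": set(),
}

def related_entities(entity: str) -> Dict[str, str]:
    """Entities one association away from `entity`, with the linking relationship."""
    related = {}
    for subject, relation, obj in ASSOCIATIONS:
        if subject == entity:
            related[obj] = relation
        elif obj == entity:
            related[subject] = relation
    return related

def semantic_search(entity: str) -> Dict[str, List[str]]:
    """Direct hits first, then stories reached through knowledge-base associations."""
    results: Dict[str, List[str]] = {"direct": [], "associated": []}
    neighbors = set(related_entities(entity))
    for headline, tags in INDEX.items():
        if entity in tags:
            results["direct"].append(headline)
        elif tags & neighbors:
            results["associated"].append(headline)
    return results

results = semantic_search("Cisco Systems")
print(results)
```

The “competes with” association is what lets a query about one company surface competitor news, which is the mechanism behind the dashboard's automatically inferred competitive research.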


COMTEX Content Enhancement - Value-added metatagging

COMTEX Tagging: limited tagging (mostly syntactic)

Content ‘Enhancement’: rich semantic metatagging

Value-added Voquette Semantic Tagging: value-added relevant metatags added by Voquette to existing COMTEX tags:

• Private companies

• Type of company

• Industry affiliation

• Sector

• Exchange

• Company Execs

• Competitors


COMTEX Content Enhancement - Tag Normalization

Voquette Knowledge Base: Company name: Merrill Lynch & Co.

• Source A Document: <company_name=Merrill Lynch, Inc.>

• Source A Document with normalized tag: <company_name=Merrill Lynch & Co.>

• Source B Document: <company_name=Merrill Lynch Corp.>

• Source B Document with normalized tag: <company_name=Merrill Lynch & Co.>
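
Tag normalization as shown on this slide amounts to mapping each source's variant of a name to the canonical form held in the knowledge base. The sketch below uses a plain lookup table seeded with the slide's Merrill Lynch variants; storing variants this way and the function name are assumptions for illustration.

```python
from typing import Dict

# Canonical name from the knowledge base, with variants seen in source feeds.
# The variants restate the slide; keeping them in a lookup table is an assumption.
CANONICAL_NAMES: Dict[str, str] = {
    "Merrill Lynch, Inc.": "Merrill Lynch & Co.",
    "Merrill Lynch Corp.": "Merrill Lynch & Co.",
    "Merrill Lynch & Co.": "Merrill Lynch & Co.",
}

def normalize_tag(tags: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the tags with company_name mapped to its canonical form."""
    normalized = dict(tags)
    name = tags.get("company_name")
    if name is not None:
        # Unknown names pass through unchanged rather than being guessed at.
        normalized["company_name"] = CANONICAL_NAMES.get(name, name)
    return normalized

source_a = {"company_name": "Merrill Lynch, Inc."}
source_b = {"company_name": "Merrill Lynch Corp."}
print(normalize_tag(source_a), normalize_tag(source_b))
```

Once both sources carry the same tag value, a single query on the canonical name retrieves documents from either feed.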


Classification & Extraction Technology Comparisons

Manual

• Classification: Yes

• Metadata: Yes

• Features and Advantages: Intelligent, adaptable to changing business needs, high levels of accuracy, rapid integration and deployment, minimal upfront investment

• Disadvantages and Limitations: Extremely slow, high cost of maintenance and ownership; may not be possible to scale with very high volume; difficult to have uniformity across humans

Information Retrieval/Document Indexing

• Classification: No

• Metadata: No

• Features and Advantages: Keyword-based search

• Disadvantages and Limitations: Typically poor relevance if used alone on a large data set

Clustering

• Classification: May be

• Metadata: N/A

• Features and Advantages: User/Enterprise does not need to give taxonomy

• Disadvantages and Limitations: Many clusters might be meaningless; broad commercial success not yet demonstrated

Lexical/Natural language (NLP)

• Classification: N/A

• Metadata: No

• Features and Advantages: Often better than keyword based search; natural language querying/phrases; good for summarizing document

• Disadvantages and Limitations: Does not help beyond search and summarization; generally cannot associate one document with other (no inferencing)

Rules-based

• Classification: Yes

• Metadata: No

• Features and Advantages: Works well with complex taxonomies, high consistency

• Disadvantages and Limitations: Intelligence bounded, high cost of maintenance, high computation cost and possible scalability limitations


Classification & Extraction Technology Comparisons (Contd.)

• Machine Learning/AI (Bayesian, HMM, Neural Network)

– Classification: Yes

– Metadata: No

– Features and Advantages: User/enterprise can define the taxonomy; combined with indexing, can lead to better keyword-based search by limiting search to a node in the taxonomy; broad variety of technology choices and good experience in applying the technology

– Disadvantages and Limitations: User needs to provide a training set; retraining needed if the taxonomy is changed; success dependent on training; usually handles unstructured documents/data only, not structured or semi-structured content

• Thesaurus, Reference data, (Ontology)

– Classification: N/A

– Metadata: Limited

– Features and Advantages: Metadata limited to terms in reference data or ontology

– Disadvantages and Limitations: How is reference data kept up to date? Context is limited and applications are limited to narrow areas; sometimes “one size fits all”, good for Web search but not necessarily for enterprise applications; power of relationships missing

• Domain Model and Information Extractors

– Classification: Yes

– Metadata: Yes

– Features and Advantages: For structured and semi-structured data (feeds, Web sites); domain model allows user/enterprise to define contextually relevant metadata; allows more precise query formulation (attribute-value); homogenization/integration; semantic search

– Disadvantages and Limitations: Needs substantial toolkit support for writing extractors and mapping heterogeneous sources to a uniform domain model

• Knowledge Base (Entities/Classes plus Relationships)

– Classification: Enhances

– Metadata: Enhances

– Features and Advantages: Extremely powerful, especially when combined with Domain Model; automatic metadata enhancement; very highly relevant search; beyond search (personalization, semantic associations)

– Disadvantages and Limitations: Requires creation and maintenance of knowledge base and access to trusted sources for mining/synthesizing knowledge
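To make the last two rows of the comparison concrete, the following is a minimal sketch of attribute-value querying over extracted metadata, plus a knowledge-base lookup that enhances that metadata. Everything in it is a hypothetical illustration: the sample documents, the `company` and `topic` attributes, the `KNOWLEDGE_BASE` dictionary and the function names are assumptions made for this example, not the product's schema or API.

```python
# Hypothetical illustration only: data and names are assumptions, not a real API.

DOCS = [
    {"id": 1, "text": "Apple shares rose after earnings.",
     "metadata": {"company": "Apple Computer", "topic": "earnings"}},
    {"id": 2, "text": "Apple harvest was strong this year.",
     "metadata": {"topic": "agriculture"}},
]

# Toy knowledge base: entity name -> facts related to that entity.
KNOWLEDGE_BASE = {
    "Apple Computer": {"ticker": "AAPL", "industry": "Computer Hardware"},
}


def keyword_search(docs, word):
    """Syntactic search: match any document whose text contains the word."""
    return [d["id"] for d in docs if word.lower() in d["text"].lower()]


def attribute_search(docs, **criteria):
    """Attribute-value search: every requested attribute must match exactly."""
    return [
        d["id"] for d in docs
        if all(d["metadata"].get(k) == v for k, v in criteria.items())
    ]


def enhance(doc, kb):
    """Return a copy of the metadata with known facts about the company added."""
    enriched = dict(doc["metadata"])
    enriched.update(kb.get(enriched.get("company"), {}))
    return enriched


print(keyword_search(DOCS, "apple"))                     # both documents match
print(attribute_search(DOCS, company="Apple Computer"))  # only the company story
print(enhance(DOCS[0], KNOWLEDGE_BASE))                  # ticker and industry added
```

In this toy data the keyword query cannot tell the company from the fruit, while the attribute-value query can; the enhancement step adds facts that never appeared in the document text.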


ROI Comparative Effort Chart

• Categorization of Web pages

– Traditional Effort: 50 pages/day/editor

– CET Effort: 1,000 pages/day (with human supervision); at least an order of magnitude higher without supervision

– Comments: Much higher quality metadata generation, in addition to higher quantity

• Metatagging of news feeds

– Traditional Effort: 10-20 feeds (syntactic + semantic metadata); 100 feeds (syntactic metadata)

– CET Effort: 5,000-10,000 feeds/day (fully automatic)

– Comments: No human supervision needed

• Metatagging of internal/enterprise research content

– Traditional Effort: 50-100 assets/day/research editor

– CET Effort: 500-1,000 assets (with human supervision)

– Comments: Human supervision supports higher quality metadata

• Metatagging of content from multiple internal or external sources

– Traditional Effort: Content editors using internally developed tools typically manage 1 to 5 sources

– CET Effort: Single person can supervise automatic tagging of content from 20-50 sources


Deployment System Architecture

[Diagram: two deployment tiers]

• Toolkits (Workstation): Knowledge Base Toolkit and Extractor Toolkit, on NT (any system supporting JVM); add workstations for more developers and more sources

• Enterprise S/W (Server): Categorization and Auto Cataloging System, Semantic Engine and WorldModel™ Knowledge Base, on Linux/Solaris; add servers for higher performance, redundancy and more content


Measures

• Quality

– Categorization accuracy: around 90% (domain and training dependent)

– Metadata extraction: limited only by WorldModel™ and KB (for which we have automated maintenance support)

– Relevance: near 100% (unlike IR techniques, typical precision/recall limitations do not apply when we have metadata)

• Scalability

– Millions of documents per server (for Semantic Engine)

– Unlimited number of documents due to a distributed index seamlessly spanning multiple servers

– Few to hundreds of content sources (distributed SW agents)


Measures (Continued)

• Performance

– Inclusion of new content source: 2 to 8 hrs

– Building WorldModel™ and Knowledge Base: 2 to 8 weeks per domain for an effort leading to useful results (approx. 1 million entities and relationships)

– Extraction: several documents per second (processing time)

– Near real-time search/personalization of new content and breaking news (sub-minute, due to incremental indexing)

– 1 million queries per hour per server, with query response/inference times of 1 ms to tens of ms due to main-memory indexing/data structures

• Robustness

– Semantic Engine has not needed a reboot for over 400 days!

– Many other engineering solutions (HW/SW redundancy) to meet any SLA
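As a rough consistency check on the performance figures above, the sketch below converts the quoted hourly query rate to queries per second and applies Little's law to estimate how many queries would be in flight at a given response time. It assumes a steady arrival rate, which the slide does not state; the function names are made up for this example.

```python
QUERIES_PER_HOUR = 1_000_000  # figure quoted on the slide


def queries_per_second(per_hour):
    """Convert an hourly query rate to queries per second."""
    return per_hour / 3600


def queries_in_flight(per_hour, response_ms):
    """Little's law: mean requests in flight = arrival rate x mean response time.

    Assumes a steady arrival rate (an assumption of this sketch).
    """
    return queries_per_second(per_hour) * (response_ms / 1000)


print(round(queries_per_second(QUERIES_PER_HOUR), 1))     # about 277.8 queries/s
print(round(queries_in_flight(QUERIES_PER_HOUR, 1), 2))   # at 1 ms response time
print(round(queries_in_flight(QUERIES_PER_HOUR, 10), 2))  # at 10 ms response time
```

Under these assumptions the quoted rate is roughly 278 queries per second, which at a 10 ms response time means fewer than three queries being processed at once on average.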


Quantitative Measures

Voquette vs. The Rest: Reading and Classification (pages read and classified)

• Per Minute: Voquette 600 - 10,000 (batch mode); Average Human 1

• Per Hour: Voquette 36,000 - 600,000; Average Human 60

• Per Day: Voquette 864,000 - 14.4 Million; Average Human 480

• Per Year: Voquette 315 Million - 5.2 Billion; Average Human 120,000

Voquette vs. The Rest: Reading, Classification, Metadata Extraction, Normalization, Enhancement (pages read, classified, metadata extracted, normalized & enhanced)

• Per Minute: Voquette 30; Average Human 1

• Per Hour: Voquette 1,800; Average Human 60

• Per Day: Voquette 43,200; Average Human 480

• Per Year: Voquette 16 Million; Average Human 120,000
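The hourly, daily and yearly figures follow from the per-minute rates by straight multiplication. The sketch below reproduces that arithmetic under assumptions inferred from the numbers rather than stated on the slide: the system runs 24 hours a day for 365 days, and a human works 8 hours a day for 250 days a year.

```python
def extrapolate(per_minute, hours_per_day, days_per_year):
    """Scale a per-minute page rate to per-hour, per-day and per-year totals."""
    per_hour = per_minute * 60
    per_day = per_hour * hours_per_day
    per_year = per_day * days_per_year
    return per_hour, per_day, per_year


# Assumed schedules (inferred, not stated on the slide).
SYSTEM = {"hours_per_day": 24, "days_per_year": 365}
HUMAN = {"hours_per_day": 8, "days_per_year": 250}

for label, rate, schedule in [
    ("classification, low", 600, SYSTEM),
    ("classification, high (batch)", 10_000, SYSTEM),
    ("full metadata pipeline", 30, SYSTEM),
    ("average human", 1, HUMAN),
]:
    print(label, extrapolate(rate, **schedule))
```

With these assumptions the batch-mode upper bound of 10,000 pages per minute scales to about 14.4 million pages per day and about 5.26 billion pages per year.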


Quantitative Comparison (Continued)

Voquette Specifications: Semantic Engine & Knowledge Base Specs

• Queries per hour per server: 1 Million

• Query Response Time (lightly loaded server): 1 to 10 ms

• Query Response Time (heavily loaded server): 100 to 200 ms

• Semantic associations created per hour: 10,000

• Semantic associations per domain: Over 1 million