©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.
Apollo – Automated Content Management System
Srikanth Kallurkar
Quantum Leap Innovations
Work performed under AFRL contract FA8750-06-C-0052
Capabilities
• Automated domain-relevant information gathering
  – Gathers documents relevant to domains of interest from the web or proprietary databases.
• Automated content organization
  – Organizes documents by topics, keywords, sources, time references, and features of interest.
• Automated information discovery
  – Assists users with automated recommendations on related documents, topics, keywords, sources, …
Comparison to the existing manual information gathering method (what most users do currently)

[Flowchart: the user performs a “Keyword Search” against a generalized search index via a search engine interface]
1. Develop information need (User)
2. Form keywords
3. Search (query sent to the search engine interface)
4. Results (returned by the generalized search index over the data)
5. Examine results
6. Satisfied? Yes → take a break. No → step 7.
7. Refine the query (conjure up new keywords) and search again, or 7a. give up.

The goal is to maximize the results for a user keyword query.
Key: User Task | Search Engine Task | Data
Apollo Information Gathering method (what users do with Apollo)

[Flowchart: the user explores, filters, and discovers documents, assisted by Apollo features, against a specialized domain model]
1. Develop information need (User)
2. Explore features (Vocabulary, Location, Time, Sources, …)
3. Filter
4. Results (returned by the specialized domain model over the data)
5. Examine results
6. Satisfied? Yes → take a break. No → step 7.
7. Discover new/related information via Apollo features.

The focus is on informative results seeded by a user-selected combination of features.
Key: User Task | Apollo Task | Data
Apollo Domain Modeling (behind the scenes)
1. Bootstrap the domain
2. Define the domain, topics, and subtopics
3. Get training documents (Option A/B/AB)
   – Build representative keywords and query search engine(s)
   – A. From the web; B. From a specialized domain repository (select a small sample)
   – Curate (optional)
4. Build the domain signature
   – Identify salient terms per domain, topic, and subtopic
   – Compute the classification threshold
5. Organize documents (Option A/B/AB)
   – A. From the web; B. From a specialized domain repository
   – Filter documents
   – Classify into the defined topics/subtopics
   – Extract features: vocabulary, location, time, …
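The slides do not spell out how salient terms are identified in step 4; a minimal sketch, assuming a simple frequency-based salience score over a topic's training documents (the function name and scoring rule are illustrative, not Apollo's actual implementation):

```python
from collections import Counter

def salient_terms(training_docs, top_k=10):
    """Rank single-word terms by frequency across the training
    documents of one domain/topic (illustrative salience measure)."""
    counts = Counter()
    for doc in training_docs:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(top_k)]

docs = ["global warming raises sea levels",
        "greenhouse gas emissions drive global warming"]
print(salient_terms(docs, top_k=3))
```

In practice a salience measure would also discount terms common across all topics (e.g., TF-IDF-style weighting), so that each topic's signature contains terms that distinguish it from its siblings.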
Apollo Data Organization

Snapshot of the Apollo process to collect a domain-relevant document:
• A data source (e.g., web site, proprietary database, …) supplies documents (e.g., published article, news report, journal paper, …).
• Each document is classified into the defined domain topics/subtopics.
• If the document is in the domain, Apollo extracts features (domain-relevant vocabulary, locations, time references, sources, …) and stores the document; otherwise it is discarded.

Snapshot of the Apollo process to evolve domain-relevant libraries:
• Documents from many data sources flow through the collection/organization process into the Apollo library of domain-relevant documents (Domain A, Domain B, Domain C, …).
• Within each domain, documents are organized by features (Feature A → Doc 1, Doc 2, … Doc N).
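The collection/organization pass above can be sketched as a small pipeline. This is an illustrative skeleton, not Apollo's implementation: the classifier and feature extractor are toy stand-ins, and the library is a plain dict mapping each feature to the documents that contain it.

```python
def collect(documents, in_domain, extract_features, library):
    """Illustrative collection pass: keep a document only if the
    domain classifier accepts it, then index it by its features."""
    for doc in documents:
        if not in_domain(doc):
            continue  # discard out-of-domain documents
        for feature in extract_features(doc):
            library.setdefault(feature, []).append(doc)
    return library

# Toy stand-ins for Apollo's classifier and feature extractor.
library = collect(
    ["ice core data", "stock tips"],
    in_domain=lambda d: "ice" in d,
    extract_features=lambda d: d.split(),
    library={},
)
print(sorted(library))  # features observed in the kept documents
```

Indexing documents by feature at collection time is what makes the later discovery step (look up all documents sharing a feature) a constant-time operation.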
Apollo Information Discovery
• The user selects a feature via the Apollo interface (e.g., the user selects the phrase “global warming” from the domain “climate change”).
• Apollo builds the set of documents from the library that contain the feature (a set of n documents containing the phrase “global warming”).
• Apollo collates all other features from the set and ranks them by domain relevance.
• The user is presented with co-occurring features (e.g., the user sees the phrases “greenhouse gas emissions” and “ice core” co-occurring with “global warming” and explores documents containing them).
• The user can use discovered features to expand or restrict the focus of the search based on driving interests.
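The discovery steps above reduce to collating features over the matching document set. A minimal sketch, assuming the library maps each document to its set of features, and using co-occurrence frequency as an illustrative stand-in for Apollo's domain-relevance ranking:

```python
from collections import Counter

def cooccurring_features(library, selected, top_k=5):
    """Rank features that co-occur with the selected feature,
    by co-occurrence count (illustrative relevance ranking)."""
    docs_with = [feats for feats in library.values() if selected in feats]
    counts = Counter(f for feats in docs_with
                     for f in feats if f != selected)
    return [f for f, _ in counts.most_common(top_k)]

library = {
    "doc1": {"global warming", "greenhouse gas emissions"},
    "doc2": {"global warming", "ice core"},
    "doc3": {"sea level", "ice core"},
}
print(cooccurring_features(library, "global warming"))
```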
Illustration: Apollo Web Content Management Application for the
domain “Climate Change”
“Climate Change” Domain Model
• Vocabulary (phrases, keywords, idioms) identified for the domain from training documents collected from the web
• These are the building blocks of the model of the domain
• Modeling error stems from noise in the training data and can be reduced by input from human experts
Apollo Prototype (screenshot callouts)
• Domain
• Keyword filter
• Extracted “keywords” or phrases across the collection of documents
• Extracted “locations” across the collection of documents
• Document results of filtering
• Automated document summary
Inline Document View (screenshot callouts)
• Filter interface
• Additional features
• Features extracted only for this document
Expanded Document View (screenshot callouts)
• Features extracted for this document
• Cached text of the document
Automatically Generated Domain Vocabulary
• Vocabulary collated across the domain library
• Font size and thickness show domain importance
• Importance changes as the library changes
Experiment Setup
• The experiment setup comprised the Text Retrieval Conference (TREC) document collection from the 2002 filtering track [1]. The document collection statistics were:
  – The collection contained documents from Reuters Corpus Volume 1.
  – There were 83,650 training and 723,141 testing documents.
  – There were 50 assessor and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; the relevant documents were taken to be those to which both category labels had been assigned.
  – The main metrics were T11F (FBeta with a coefficient of 0.5) and T11SU (a normalized linear utility).

1. http://trec.nist.gov/data/filtering/T11filter_guide.html
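The FBeta measure with coefficient 0.5 mentioned above is the van Rijsbergen F-measure, which with β = 0.5 weights precision more heavily than recall. A small sketch (note that TREC averages T11F per topic, which generally differs from computing it once from averaged precision and recall):

```python
def t11f(precision, recall, beta=0.5):
    """F-beta measure; beta = 0.5 favors precision over recall.
    Returns 0.0 when both precision and recall are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(t11f(0.5, 0.5))  # equal P and R give F = 0.5 for any beta
```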
Experiment
• Each topic was set up as an independent domain in Apollo.
• Only the relevant documents from the topic’s training set were used to create the topic signature.
• The topic signature was used to output a vector – called the filter vector – comprising single-word terms weighted by their ranks.
• A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector.
• Different distributions were assumed to estimate the appropriate thresholds.
• In addition, the number of documents to be selected was set to a multiple of the training sample size.
• The entire testing set was indexed using Lucene.
• For each topic, the test documents were compared via the cross product with the topic filter vector, in the document order prescribed by TREC.
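The scoring and thresholding steps above can be sketched as follows. Two assumptions are labeled in the comments: “cross product” is read here as an inner product between a document’s term counts and the rank-weighted filter vector, and the mean/standard-deviation threshold rule shown is illustrative, since the slides do not state the exact formula.

```python
import statistics

def filter_scores(docs, filter_vector):
    """Score each document by the inner product of its term counts
    with the rank-weighted filter vector ("cross product" in the
    slides is read here as an inner product -- an assumption)."""
    scores = []
    for doc in docs:
        terms = doc.lower().split()
        scores.append(sum(terms.count(t) * w
                          for t, w in filter_vector.items()))
    return scores

def threshold(scores, k=1.0):
    """Illustrative threshold rule: mean minus k standard deviations
    of the training-document scores (the exact rule is not given)."""
    return statistics.mean(scores) - k * statistics.pstdev(scores)

filter_vector = {"warming": 2.0, "emissions": 1.0}  # rank-weighted terms
train = ["global warming warming", "emissions and warming"]
scores = filter_scores(train, filter_vector)
print(scores, threshold(scores))
```

At test time, a document would be accepted for a topic when its score against that topic’s filter vector exceeds the topic’s threshold.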
Initial Results
• Initial results show that Apollo’s filtering effectiveness is very competitive with TREC benchmarks.
• Precision and recall can be improved by leveraging additional components of the signatures.

50 Assessor Topics         | Avg. Recall | Avg. Precision | Avg. T11F (FBeta)
Apollo                     | 0.35        | 0.63           | 0.499
TREC benchmark KerMit [2]  | –           | 0.43           | 0.495

2. Cancedda et al., “Kernel Methods for Document Filtering,” in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, MD, 2002.
Topic Performance

[Chart: Recall vs. T11F — per-topic Recall and FBeta; x-axis: topics 1–49; y-axis: 0–1]
[Chart: Precision vs. T11F — per-topic Precision and FBeta; x-axis: topics 1–49; y-axis: 0–1.2]
Apollo Filtering Performance
• Apollo’s training time was linear in the number and size of the training documents (num training docs vs. avg. training time).
• On average, the filtering time per document was constant (avg. test time per doc).

[Chart: num training docs per topic; x-axis: topics 1–49; y-axis: 0–160]
[Chart: avg. training time (ms) per topic; x-axis: topics 1–49; y-axis: 0–4000]
[Chart: avg. test time per doc (ms) per topic; x-axis: topics 1–49; y-axis: 0–0.5]