©2008 Srikanth Kallurkar, Quantum Leap Innovations, Inc. All rights reserved.
Apollo – Automated Content Management System
Srikanth Kallurkar
Quantum Leap Innovations
Work performed under AFRL contract FA8750-06-C-0052
Capabilities
• Automated domain-relevant information gathering
  – Gathers documents relevant to domains of interest from the web or proprietary databases.
• Automated content organization
  – Organizes documents by topics, keywords, sources, time references, and features of interest.
• Automated information discovery
  – Assists users with automated recommendations on related documents, topics, keywords, sources, …
Comparison to the existing manual information gathering method (what most users do currently)

[Flowchart: the user performs a “Keyword Search” against a generalized search index via a search engine interface]
1. Develop information need (User)
2. Form keywords
3. Search (query sent to the search engine interface)
4. Results (returned by the generalized search index over the data)
5. Examine results
6. Satisfied? Yes → take a break. No → step 7.
7. Refine the query (conjure up new keywords) and search again, or 7a. give up.

The goal is to maximize the results for a user keyword query.
Key: User Task | Search Engine Task | Data
Apollo Information Gathering method (what users do with Apollo)

[Flowchart: the user explores, filters, and discovers documents, assisted by Apollo features, against a specialized domain model]
1. Develop information need (User)
2. Explore features (Vocabulary, Location, Time, Sources, …)
3. Filter
4. Results (returned by the specialized domain model over the data)
5. Examine results
6. Satisfied? Yes → take a break. No → step 7.
7. Discover new/related information via Apollo features.

The focus is on informative results seeded by a user-selected combination of features.
Key: User Task | Apollo Task | Data
Apollo Domain Modeling (behind the scenes)
1. Bootstrap the domain
2. Define the domain, topics, and subtopics
3. Get training documents (Option A/B/AB)
   – Build representative keywords and query search engine(s)
   – A. From the web; B. From a specialized domain repository (select a small sample)
   – Curate (optional)
4. Build the domain signature
   – Identify salient terms per domain, topic, and subtopic
   – Compute the classification threshold
5. Organize documents (Option A/B/AB)
   – A. From the web; B. From a specialized domain repository
   – Filter documents
   – Classify into the defined topics/subtopics
   – Extract features: vocabulary, location, time, …
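The slides do not spell out how salient terms are identified in step 4; a minimal sketch, assuming a simple frequency-based salience score over a topic's training documents (the function name and scoring rule are illustrative, not Apollo's actual implementation):

```python
from collections import Counter

def salient_terms(training_docs, top_k=10):
    """Rank single-word terms by frequency across the training
    documents of one domain/topic (illustrative salience measure)."""
    counts = Counter()
    for doc in training_docs:
        counts.update(doc.lower().split())
    return [term for term, _ in counts.most_common(top_k)]

docs = ["global warming raises sea levels",
        "greenhouse gas emissions drive global warming"]
print(salient_terms(docs, top_k=3))
```

In practice a salience measure would also discount terms common across all topics (e.g., TF-IDF-style weighting), so that each topic's signature contains terms that distinguish it from its siblings.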
Apollo Data Organization

Snapshot of the Apollo process to collect a domain-relevant document:
• A data source (e.g., web site, proprietary database, …) supplies documents (e.g., published article, news report, journal paper, …).
• Each document is classified into the defined domain topics/subtopics.
• If the document is in the domain, Apollo extracts features (domain-relevant vocabulary, locations, time references, sources, …) and stores the document; otherwise it is discarded.

Snapshot of the Apollo process to evolve domain-relevant libraries:
• Documents from many data sources flow through the collection/organization process into the Apollo library of domain-relevant documents (Domain A, Domain B, Domain C, …).
• Within each domain, documents are organized by features (Feature A → Doc 1, Doc 2, … Doc N).
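The collection/organization pass above can be sketched as a small pipeline. This is an illustrative skeleton, not Apollo's implementation: the classifier and feature extractor are toy stand-ins, and the library is a plain dict mapping each feature to the documents that contain it.

```python
def collect(documents, in_domain, extract_features, library):
    """Illustrative collection pass: keep a document only if the
    domain classifier accepts it, then index it by its features."""
    for doc in documents:
        if not in_domain(doc):
            continue  # discard out-of-domain documents
        for feature in extract_features(doc):
            library.setdefault(feature, []).append(doc)
    return library

# Toy stand-ins for Apollo's classifier and feature extractor.
library = collect(
    ["ice core data", "stock tips"],
    in_domain=lambda d: "ice" in d,
    extract_features=lambda d: d.split(),
    library={},
)
print(sorted(library))  # features observed in the kept documents
```

Indexing documents by feature at collection time is what makes the later discovery step (look up all documents sharing a feature) a constant-time operation.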
Apollo Information Discovery
• The user selects a feature via the Apollo interface (e.g., the user selects the phrase “global warming” from the domain “climate change”).
• Apollo builds the set of documents from the library that contain the feature (a set of n documents containing the phrase “global warming”).
• Apollo collates all other features from the set and ranks them by domain relevance.
• The user is presented with co-occurring features (e.g., the user sees the phrases “greenhouse gas emissions” and “ice core” co-occurring with “global warming” and explores documents containing them).
• The user can use discovered features to expand or restrict the focus of the search based on driving interests.
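The discovery steps above reduce to collating features over the matching document set. A minimal sketch, assuming the library maps each document to its set of features, and using co-occurrence frequency as an illustrative stand-in for Apollo's domain-relevance ranking:

```python
from collections import Counter

def cooccurring_features(library, selected, top_k=5):
    """Rank features that co-occur with the selected feature,
    by co-occurrence count (illustrative relevance ranking)."""
    docs_with = [feats for feats in library.values() if selected in feats]
    counts = Counter(f for feats in docs_with
                     for f in feats if f != selected)
    return [f for f, _ in counts.most_common(top_k)]

library = {
    "doc1": {"global warming", "greenhouse gas emissions"},
    "doc2": {"global warming", "ice core"},
    "doc3": {"sea level", "ice core"},
}
print(cooccurring_features(library, "global warming"))
```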
Illustration: Apollo Web Content Management Application for the
domain “Climate Change”
“Climate Change” Domain Model
• Vocabulary (phrases, keywords, idioms) identified for the domain from training documents collected from the web
• These are the building blocks of the model of the domain
• Modeling error stems from noise in the training data and can be reduced by input from human experts
Apollo Prototype (screenshot callouts)
• Domain
• Keyword filter
• Extracted “keywords” or phrases across the collection of documents
• Extracted “locations” across the collection of documents
• Document results of filtering
• Automated document summary
Inline Document View (screenshot callouts)
• Filter interface
• Additional features
• Features extracted only for this document
Expanded Document View (screenshot callouts)
• Features extracted for this document
• Cached text of the document
Automatically Generated Domain Vocabulary
• Vocabulary collated across the domain library
• Font size and thickness show domain importance
• Importance changes as the library changes
Experiment Setup
• The experiment setup comprised the Text Retrieval Conference (TREC) document collection from the 2002 filtering track [1]. The document collection statistics were:
  – The collection contained documents from Reuters Corpus Volume 1.
  – There were 83,650 training and 723,141 testing documents.
  – There were 50 assessor and 50 intersection topics. The assessor topics had relevance judgments from human assessors, whereas the intersection topics were constructed artificially from intersections of pairs of Reuters categories; the relevant documents were taken to be those to which both category labels had been assigned.
  – The main metrics were T11F (FBeta with a coefficient of 0.5) and T11SU (a normalized linear utility).

1. http://trec.nist.gov/data/filtering/T11filter_guide.html
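The FBeta measure with coefficient 0.5 mentioned above is the van Rijsbergen F-measure, which with β = 0.5 weights precision more heavily than recall. A small sketch (note that TREC averages T11F per topic, which generally differs from computing it once from averaged precision and recall):

```python
def t11f(precision, recall, beta=0.5):
    """F-beta measure; beta = 0.5 favors precision over recall.
    Returns 0.0 when both precision and recall are 0."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(t11f(0.5, 0.5))  # equal P and R give F = 0.5 for any beta
```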
Experiment
• Each topic was set up as an independent domain in Apollo.
• Only the relevant documents from the topic’s training set were used to create the topic signature.
• The topic signature was used to output a vector – called the filter vector – comprising single-word terms weighted by their ranks.
• A comparison threshold was calculated from the mean and standard deviation of the cross products of the training documents with the filter vector.
• Different distributions were assumed to estimate the appropriate thresholds.
• In addition, the number of documents to be selected was set to a multiple of the training sample size.
• The entire testing set was indexed using Lucene.
• For each topic, the test documents were compared via the cross product with the topic filter vector, in the document order prescribed by TREC.
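The scoring and thresholding steps above can be sketched as follows. Two assumptions are labeled in the comments: “cross product” is read here as an inner product between a document’s term counts and the rank-weighted filter vector, and the mean/standard-deviation threshold rule shown is illustrative, since the slides do not state the exact formula.

```python
import statistics

def filter_scores(docs, filter_vector):
    """Score each document by the inner product of its term counts
    with the rank-weighted filter vector ("cross product" in the
    slides is read here as an inner product -- an assumption)."""
    scores = []
    for doc in docs:
        terms = doc.lower().split()
        scores.append(sum(terms.count(t) * w
                          for t, w in filter_vector.items()))
    return scores

def threshold(scores, k=1.0):
    """Illustrative threshold rule: mean minus k standard deviations
    of the training-document scores (the exact rule is not given)."""
    return statistics.mean(scores) - k * statistics.pstdev(scores)

filter_vector = {"warming": 2.0, "emissions": 1.0}  # rank-weighted terms
train = ["global warming warming", "emissions and warming"]
scores = filter_scores(train, filter_vector)
print(scores, threshold(scores))
```

At test time, a document would be accepted for a topic when its score against that topic’s filter vector exceeds the topic’s threshold.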
Initial Results
• Initial results show that Apollo’s filtering effectiveness is very competitive with TREC benchmarks.
• Precision and recall can be improved by leveraging additional components of the signatures.

50 Assessor Topics         | Avg. Recall | Avg. Precision | Avg. T11F (FBeta)
Apollo                     | 0.35        | 0.63           | 0.499
TREC benchmark KerMit [2]  | –           | 0.43           | 0.495

2. Cancedda et al., “Kernel Methods for Document Filtering,” in NIST Special Publication 500-251: Proceedings of the Eleventh Text Retrieval Conference, Gaithersburg, MD, 2002.
Topic Performance

[Chart: Recall vs. T11F — per-topic Recall and FBeta; x-axis: topics 1–49; y-axis: 0–1]
[Chart: Precision vs. T11F — per-topic Precision and FBeta; x-axis: topics 1–49; y-axis: 0–1.2]
Apollo Filtering Performance
• Apollo’s training time was linear in the number and size of the training documents (num training docs vs. avg. training time).
• On average, the filtering time per document was constant (avg. test time per doc).

[Chart: num training docs per topic; x-axis: topics 1–49; y-axis: 0–160]
[Chart: avg. training time (ms) per topic; x-axis: topics 1–49; y-axis: 0–4000]
[Chart: avg. test time per doc (ms) per topic; x-axis: topics 1–49; y-axis: 0–0.5]