Adaptive Multi-modal Data Mining and Fusion For Autonomous Intelligence Discovery Edward J. Wegman, Ph.D. Yasmin H. Said, Ph.D.

Adaptive Multi-modal Data Mining and Fusion For Autonomous … · 2012-12-31 · Outline of Presentation • Problem Description • Background in Text Mining ... –Native Arabic

Adaptive Multi-modal Data Mining and Fusion For Autonomous Intelligence Discovery

Edward J. Wegman, Ph.D.

Yasmin H. Said, Ph.D.

Outline of Presentation

• Problem Description

• Background in Text Mining

• Outline of System

• Arabic Language Tool

• Geospatial Tool

• Integration of Text and Images

• Streaming Documents

Problem Description

• Consider the plight of an analyst who faces multimedia sources that stream in data constantly.

• Data can be structured text, unstructured text, voice, images, and video.

• The data are likely not in English, likely massive in scale, and streaming.

• Our premise: The analyst needs a tool that integrates and filters the data, then presents for his or her consideration the data most likely to be useful.

• The tool should be a query system that operates transparently and without significant human fine-tuning.

Text Mining

• The proposed tool has its roots in text mining.

• Text mining uses statistical, mathematical, and computer science techniques to extract subtle and unanticipated information and relationships from sets of documents.

• These sets of documents are called corpora.

• Two important methods:

– Cross-corpus discovery.

– Clustering.

Cross Corpus Discovery

• Test case examples

– 1,200 Science News abstracts.

– 350 Naval Research ILIR documents.

The Approach: Text Data Mining Via MST Exploration

• Multi-Discipline Document Set

• Feature Extraction (denoising, stemming, BPM, TPM)

• Interpoint Distance Calculation

• Minimal Spanning Tree (MST) Calculation

• MST Layout Via Spring Based Models

• Cluster Determination and Exploration

• Cross Corpora Associations

Feature Extraction - Bigram and Trigram Proximity Matrix

“The wise young man sought his father in the crowd.”
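As a minimal sketch of the idea, the bigram proximity matrix for the example sentence can be built by counting how often one lexicon word is immediately followed by another after denoising. The stop list, the omission of stemming, and the function names below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative stop list (an assumption); a real denoising step would use a fuller list.
STOP_WORDS = {"the", "his", "in", "a", "an", "of"}


def tokenize(sentence):
    """Lowercase, strip punctuation, and drop stop words (stemming is omitted here)."""
    words = [w.strip('.,;:!?"').lower() for w in sentence.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def bigram_proximity_matrix(tokens):
    """Return (lexicon, matrix); matrix[i][j] counts lexicon[i] immediately followed by lexicon[j]."""
    lexicon = sorted(set(tokens))
    index = {word: i for i, word in enumerate(lexicon)}
    matrix = [[0] * len(lexicon) for _ in lexicon]
    for first, second in zip(tokens, tokens[1:]):
        matrix[index[first]][index[second]] += 1
    return lexicon, matrix


tokens = tokenize("The wise young man sought his father in the crowd.")
lexicon, bpm = bigram_proximity_matrix(tokens)
for word, row in zip(lexicon, bpm):
    print(f"{word:>8}", row)
```

The matrix is not symmetric: word order is preserved, which is what distinguishes it from a plain bag-of-words count. A trigram version would index by word triples in the same way.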

MST Classifier Complexity Characterization

Insight: the number of cross-class edges can be used as a surrogate for classification complexity. These cross-class (corpora) edges will be used in our scheme to facilitate the cross-corpora discovery process.
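The insight above can be sketched in a few lines: build a minimal spanning tree from an interpoint distance matrix, then list the edges whose endpoints carry different class (corpus) labels. Prim's algorithm, the toy one-dimensional data, and the function names are assumptions chosen for illustration; the slides do not say which MST algorithm was used.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a dense symmetric distance matrix; returns a list of (i, j) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection cost for each vertex
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the cheapest vertex not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


def cross_class_edges(edges, labels):
    """Keep only the MST edges that join documents from different classes."""
    return [(i, j) for i, j in edges if labels[i] != labels[j]]


# Toy data: six "documents" placed on a line, three per corpus.
positions = [0, 1, 2, 10, 11, 12]
labels = ["A", "A", "A", "B", "B", "B"]
dist = [[abs(p - q) for q in positions] for p in positions]

mst = minimum_spanning_tree(dist)
crossing = cross_class_edges(mst, labels)
print(len(crossing), "cross-class edge(s):", crossing)
```

Well-separated classes yield few cross-class edges; heavily interleaved classes yield many. The documents at the ends of a cross-class edge are the natural candidates for cross-corpora associations.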

The Environment (Opening Screen)

Mathematics and Computer Sciences vs. Physical Sciences and Technology: Second Strongest Association in MST

Anthropology and Archaeology vs. Medical Sciences: Strongest Associated Articles in the MST

Anthropology and Archaeology vs. Medical Sciences: Strongest Associated Articles Comparison

A Duplicate in the ILIR Database in the Advanced Naval Materials Category

NAVSTO, FY99/FY00

Duplicate entries for L. MERWIN and C. RICE

ORGANICALLY MODIFIED CERAMICS FOR CORROSION CONTROL

Two Closely Related Articles in the Human Performance Factors and the Information Technology and Operations Categories

• NUWC, FY01, Dr. Susan S. Kirschenbaum: ADAPTIVE GROUPWARE FOR PLANNING

• NUWC, FY99, S. S. KIRSCHENBAUM: TRAINING A SYSTEM

Two Articles in the Information Technology and Operations Category that are Identical

• NAVSTO, FY99/FY00, L. VENETSKY: DIRECT ADAPTIVE, GRADIENT DESCENT, AND GENETIC ALGORITHM TECHNIQUES FOR FUZZY CONTROLLERS

• NAVSTO, FY99/FY00, L. VENETSKY: MISSION SCENARIO CLASSIFICATION USING PARAMETER SPACE CONCEPT LEARNING

Document Clustering

• An obvious statement: “It is extremely useful to group documents that are similar.”

• Ultimately, “document” should be interpreted in a multimedia sense.

Test Data for this Example

• Our test bed for text data was collected by the Linguistic Data Consortium in 1997.

– The data consisted of 15,863 news reports collected from Reuters and CNN from July 1, 1994 to June 30, 1995.

• Features

– The human classifiers claimed 25 clusters in their limited document database.

– Just as before, we denoise and stem the text data.

Text Example - Clusters

Cluster 0, Size: 157, ISim: 0.142, ESim: 0.008

Descriptive: ireland 12.2%, ira 9.1%, northern.ireland 7.6%, irish 5.5%, fein 5.0%, sinn 5.0%, sinn.fein 5.0%, northern 3.2%, british 3.2%, adam 2.4%

Discriminating: ireland 7.7%, ira 5.9%, northern.ireland 4.9%, irish 3.5%, fein 3.2%, sinn 3.2%, sinn.fein 3.2%, northern 1.6%, british 1.5%, adam 1.5%

Phrases 1: ireland 121, northern 119, british 116, irish 111, ira 110, peac 107, minist 104, govern 104, polit 104, talk 102

Phrases 2: northern.ireland 115, sinn.fein 95, irish.republican 94, republican.armi 91, ceas.fire 87, polit.wing 76, prime.minist 71, peac.process 66, gerri.adam 59, british.govern 50

Phrases 3: irish.republican.armi 91, prime.minist.john 47, minist.john.major 43, ira.ceas.fire 35, ira.polit.wing 34, british.prime.minist 34, sinn.fein.leader 30, rule.northern.ireland 27, british.rule.northern 27, declar.ceas.fire 26

Cluster 1, Size: 323, ISim: 0.128, ESim: 0.008

Descriptive: korea 19.8%, north 13.2%, korean 11.2%, north.korea 10.8%, kim 5.8%, north.korean 3.7%, nuclear 3.5%, pyongyang 2.0%, south 1.9%, south.korea 1.5%

Discriminating: korea 12.7%, north 7.4%, korean 7.2%, north.korea 7.0%, kim 3.8%, north.korean 2.4%, nuclear 1.7%, pyongyang 1.3%, south.korea 1.0%, simpson 0.8%

Phrases 1: korea 305, north 303, korean 285, south 243, unit 215, nuclear 204, offici 196, pyongyang 179, presid 167, talk 165

Phrases 2: north.korea 291, north.korean 233, south.korea 204, south.korean 147, kim.sung 108, presid.kim 83, nuclear.program 79, kim.jong 74, light.water 71, presid.clinton 69

Phrases 3: light.water.reactor 56, unit.north.korea 55, north.korea.nuclear 53, chief.warrant.offic 49, presid.kim.sung 46, leader.kim.sung 39, presid.kim.sam 37, north.korean.offici 36, warrant.offic.bobbi 35, bobbi.wayn.hall 29

Outline of System

Four core capabilities:

• Text and image mining for feature extraction

• Multi-modal data fusion

• Agent-based adaptive information filtering

• Cognitively friendly information visualization

[System architecture diagram. Labeled components:

• Inputs: structured text (relational database); unstructured text (email, Internet chat record); speech audio (intercepted phone calls, recorded conversations) with a speech recognition engine; static imagery (geo-spatial); video (geo-spatial).

• Feature extraction: structured text feature extractor, unstructured text feature extractor, image feature extractor.

• Mining: text miner, image miner.

• Filtering: textual information filter and image filter, each with filter parameters; text filtering agent and image filtering agent.

• Analyst side: personal user agent, human analyst. Two links in the diagram are labeled KQML.]

Arabic Language Tool

• Our fundamental premise is that Arabic language documents, open source and otherwise, provide valuable insight.

• Open source documents are streaming.

• Not enough Arabic language experts are available to translate everything.

• We need a system for English language queries to an Arabic language text database.

Arabic Language Tool

• Basic functionality

– Arabic language documents are processed in the background: stemmed, denoised, clustered, and decomposed into bigrams.

• Bigrams are attached as metadata.

– English language query is translated to Arabic.

• Query is divided into multiple bigrams.

– Reduced Arabic language document set is presented to analyst for consideration and translation.
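One way to read the functionality above is as a bigram-overlap filter: the translated query is split into bigrams, and only documents whose bigram metadata shares at least one of them are passed to the analyst. The sketch below assumes that reading. The translation step is left out, the English stems stand in for stemmed Arabic tokens, and `filter_documents` and `min_shared` are hypothetical names.

```python
def bigrams(tokens):
    """Set of adjacent token pairs; this is the metadata attached to each document."""
    return set(zip(tokens, tokens[1:]))


def filter_documents(query_tokens, doc_bigrams, min_shared=1):
    """Return document ids sharing at least min_shared bigrams with the query, best match first."""
    query_grams = bigrams(query_tokens)
    scored = []
    for doc_id, grams in doc_bigrams.items():
        shared = len(query_grams & grams)
        if shared >= min_shared:
            scored.append((shared, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Background processing: bigram metadata is computed once per document.
corpus = {
    "doc1": ["north", "korea", "nuclear", "program"],
    "doc2": ["sinn", "fein", "leader"],
    "doc3": ["nuclear", "program", "talk"],
}
doc_bigrams = {doc_id: bigrams(tokens) for doc_id, tokens in corpus.items()}

# Query time: the (already translated and stemmed) query is split into bigrams.
reduced_set = filter_documents(["north", "korea", "nuclear", "program"], doc_bigrams)
print(reduced_set)
```

Because the bigrams are precomputed, query-time work is limited to set intersections, which suits a corpus that keeps growing.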

Arabic Language Tool

• Status

– Native Arabic speaker Eiman Alshammari is the graduate student developing the tool.

– We met with the Arabic Language Data Mining Group in Cairo and secured cooperation and an Arabic language corpus.

• Professor Aly Fahmy, Dean of the Faculty of Computers and Information, Cairo University.

• Dr. Amir Atiya, Associate Professor of Computer Engineering, Cairo University.

• Dr. Ahmed S. Moussa, Program Manager, Smart Village.

– We met with representatives of King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia.

• Dr. Turki Saud Mohammed Al-Saud, Vice President Research Institutes

• Dr. Mansour M. Alghamidi, Director, Computers and Electronics

• Dr. Ibrahim A. Al-Kharashi, Arabic Language Projects

– Project is underway … Eiman is anxious to graduate.

Geospatial Tool

• Basic Functionality

– Develop a geospatial visualization tool for both display and query.

– Locate source IP addresses.

– Locate imagery and video sources geospatially based on geospatial metadata.

– Query geospatial coordinates for multimedia documents in the database.
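A coordinate query of the kind listed above can be sketched as a bounding-box filter over documents that carry latitude and longitude metadata. The `GeoDocument` record, its field names, and the approximate city coordinates are illustrative assumptions; the slides do not describe the tool's data model.

```python
from dataclasses import dataclass


@dataclass
class GeoDocument:
    doc_id: str
    kind: str      # e.g. "image", "video", "text"
    lat: float     # degrees north
    lon: float     # degrees east


def query_bounding_box(docs, south, west, north, east):
    """Return documents inside the box; boxes crossing the antimeridian are not handled."""
    return [d for d in docs if south <= d.lat <= north and west <= d.lon <= east]


# Approximate, illustrative coordinates.
database = [
    GeoDocument("img-cairo", "image", 30.04, 31.24),
    GeoDocument("vid-riyadh", "video", 24.71, 46.68),
    GeoDocument("txt-fairfax", "text", 38.85, -77.31),
]

hits = query_bounding_box(database, south=20.0, west=25.0, north=35.0, east=50.0)
print([d.doc_id for d in hits])
```

A linear scan is enough for a sketch; a production tool would more likely rely on a spatial index.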

Geospatial Tool

• Status

– Felix Mihai and In-ja Youn are the graduate students developing this tool.

– The basic map functionality is available.

– IP locator is underway.

– Geospatially located satellite image database is also available (MISR imagery).

– Graduate student funding is a problem, for Felix in particular.

Integration of Text and Images

• Functionality Desired

– Attach metadata to images and to text either endogenously or exogenously

• Be able to query an image for related text documents

– E.g., Who is this a picture of? What is this a picture of?

• Be able to query a text document to identify related images

– E.g., Find me a picture of this named person. Find me a picture of this facility.

Integration of Text and Images

• Two approaches

– The bigram proximity matrix (for text documents) and the gray level co-occurrence matrix (for images) have the same basic structure.

• Work is underway to develop and exploit this characteristic.

– Integrated text and image documents (such as news documents, video with voice) may be deconstructed to provide metadata for each other.

• Not yet implemented (Google Images does this for webpages).
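The structural parallel in the first approach is easy to see in code: a gray level co-occurrence matrix counts how often gray level i sits next to gray level j at a fixed pixel offset, just as a bigram proximity matrix counts how often word i is followed by word j. The sketch below uses a horizontal offset of one pixel and a tiny 3-level image; both are illustrative choices, and the function name is hypothetical.

```python
def gray_level_cooccurrence(image, levels, dr=0, dc=1):
    """Count pairs (image[r][c], image[r + dr][c + dc]) for every pixel whose neighbor is in bounds."""
    glcm = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                glcm[image[r][c]][image[r2][c2]] += 1
    return glcm


# A 3x3 image quantized to gray levels 0, 1, 2.
image = [
    [0, 0, 1],
    [1, 2, 2],
    [0, 1, 2],
]

glcm = gray_level_cooccurrence(image, levels=3)
for row in glcm:
    print(row)
```

Each image row plays the role of a sentence and each gray level the role of a lexicon word, so distance measures defined on one kind of matrix can in principle be applied to the other.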

Integration of Text and Images

• Status

– Peter Mburu is the graduate student identified to work on this part of the project.

• Work has just begun … this is a hard problem.

• Peter is very bright, but not yet in candidacy.

Streaming Documents

• Functionality Desired

– Process streaming text documents.

• Vector space representation of a document.

• Streaming documents imply evolving lexicon.

– Recursive computation of document frequency.

– Use evolving lexicon.

• Track evolving sense of documents.

• Introduce new query terms.

• Classify new documents.
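A minimal sketch of the recursive document-frequency idea: as each document arrives, the lexicon and its document-frequency counts are updated in place, so term weights can be computed at any time without reprocessing the stream. The weighting used here, term count times log(N / df), is a common TF-IDF form adopted as an assumption; the class and method names are hypothetical.

```python
import math
from collections import Counter


class StreamingLexicon:
    """Lexicon and document frequencies updated one document at a time."""

    def __init__(self):
        self.doc_freq = Counter()   # word -> number of documents containing it
        self.num_docs = 0           # size of the corpus seen so far

    def update(self, tokens):
        """Recursive update: df_new(w) = df_old(w) + 1 for each distinct word in the document."""
        self.num_docs += 1
        self.doc_freq.update(set(tokens))

    def tfidf(self, tokens):
        """Weight each known word by count * log(N / df), using the current lexicon."""
        counts = Counter(tokens)
        return {
            word: count * math.log(self.num_docs / self.doc_freq[word])
            for word, count in counts.items()
            if self.doc_freq[word] > 0
        }


lexicon = StreamingLexicon()
stream = [
    ["korea", "nuclear", "talk"],
    ["ireland", "peace", "talk"],
    ["korea", "reactor"],
]
sizes = []
for document in stream:
    lexicon.update(document)
    sizes.append(len(lexicon.doc_freq))   # the lexicon grows as new words appear

weights = lexicon.tfidf(["korea", "reactor"])
print(sizes, weights)
```

Words that have appeared in every document so far receive weight zero, and the weights of older documents drift as the corpus grows, which is one reason the sense of a cluster can evolve over a stream.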

Streaming Documents

[Formula shown as an image on the slide; only its annotations survive:]

• Number of times word w is in document d.

• Number of documents that contain word w.

• Size of the corpus.

Streaming Documents

• Status

– Graduate students, Elizabeth Leeds Hohman and Loulwah Al-Samait, are separately working on streaming documents.

• Elizabeth is developing a graph-theoretic visual representation of streaming document clusters.

• Loulwah is developing a method for understanding the evolving sense of documents.

– Theory development is relatively advanced; system development is less so.

• Project has been underway about 4 months.

Work Left to Be Done!

• Lots!

– Progress is good, and a number of bright students are working on the project.

– The Arabic Text Tool should be in hand by December.

– The Geospatial Tool is fairly advanced, but Felix has no funding and is fragile.

– The Text and Image Integration is at early stages and is probably the most difficult conceptually.

– The Streaming Text Tools are advanced theoretically, but system development is not yet underway.

– Filtering tasks and system integration have not yet begun.

– But we have only been at it for four months.

Contact Information

Edward J. Wegman, Ph.D.

Center for Computational Data Science

George Mason University, MS 6A2

Fairfax, VA 22030-4444

[email protected]

Yasmin H. Said, Ph.D.

Center for Computational Data Science

George Mason University, MS 6A2

Fairfax, VA 22030-4444

[email protected]