43
AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas Southern University Texas State University, San Marcos

AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

  • View
    225

  • Download
    1

Embed Size (px)

Citation preview

Page 1: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

AT&T Labs – ResearchBell Labs/Lucent Technologies

Princeton UniversityRensselaer Polytechnic Institute

Rutgers, the State University of New JerseyTexas Southern University

Texas State University, San Marcos

Page 2: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 2

Background• DHS has established an Institute for Discrete Sciences (IDS).• Managed Out of Lawrence Livermore Nat. Lab.• DHS is establishing four “university affiliated centers” around the

country.• One of these will be a “coordinating” UAC.• The Rutgers-based team has been designated as a UAC and was asked

to become a coordinating UAC.• Other centers: Univ. of Illinois Urbana-Champaign, Univ. Southern

Cal., U. Pittsburgh• In addition to 6 formal partners, we have told DHS that NJ University

Consortium institutions will be involved.

Page 3: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 3

What is Discrete Science?• Discrete Science deals with

– Patterns– Arrangements– Assignments– Schedules

• Discrete Science – Seeks patterns in large amounts of data– Analyzes connections between entities

such as people and groups– Develops efficient ways to quickly

spot changes in standard patterns

Page 4: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 4

Why DyDAn?

• Homeland Security requires inferences from massive flows of data, arriving continuously.

• Buried in data are: quickly changing patterns.

• DyDAn: will develop novel technologies to find patterns & relationships in dynamic, nonstationary, massive datasets.

• DyDAn: will produce pioneering educational programs to nurture homeland security workforce of the future

Page 5: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 5

DyDAn Research

• Information Management and Knowledge Discovery

• Fundamental Topics in Discrete Mathematical Foundations

• Two research themes:– Analysis of Large, Dynamic Multigraphs– Continuous, Distributed Monitoring of

Dynamic, Heterogeneous Data

Page 6: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 6

DyDAn Research I: Analysis of Large, Dynamic Multigraphs

• Need to understand interactions between entities: people, objects, groups

• Interactions often modeled as graphs– Linking nodes (entities) with edges (connections)

• Multiple relationships between entities suggests multigraphs

• Add new entities, new & changing connections suggests dynamic multigraphs.

• Develop methods to represent, analyze, interrogate, & navigate dynamic multigraphs.

Page 7: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 7

DyDAn Research I: Analysis of Large, Dynamic Multigraphs

Page 8: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 8

DyDAn Research II: Continuous, Distributed Monitoring of Dynamic,

Heterogeneous Data• Need to understand massive amounts of data.• Data inherently distributed (multiple sources)• Data arrives rapidly – “continuously”• Seek anomalies, patterns, “emerging events”• Run continuous queries to monitor incoming

data stream.• Data takes numerous forms; requires data

mining methods that span the modalities.

Page 9: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 9

DyDAn Research Portfolio: Flexibility

• 9 initial projects, 5 in Area I, 4 in Area II

• Not all starting in year 1.

• All leverage off previous work and additional funding from Rutgers.

• Portfolio reviewed regularly with DHS, national lab partners, and other DHS centers; can readily change to newly-identified needs.

Page 10: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 10

DyDAn Research Portfolio: Large Graphs Projects

• Universal Information Graphs (initial emphasis) • Adding Semantics to and Interconnecting

Semantic Graphs (initial emphasis)• Analyzing Large, Dynamic Multigraphs Arising

from Blogs• Algorithms for Identifying Hidden Social

Structures• Statistical and Graph-theoretical Approaches to

Time-Varying Multigraphs (Initial emphasis)

Page 11: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 11

DyDAn Research Portfolio: Dynamic Data Projects

• Message Filtering and Entity Resolution• Continuous, Distributed Data Stream Modeling• Optimization and Data Analysis (Initial emphasis)• Dynamic Similarity Search in Multi-Modal Data

Page 12: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 12

DyDAn Data

• Emphasis on publicly available data.• How to acquire, publish, analyze, store data in a

private, secure way.• Privacy-preserving data analysis.• How to generate synthetic data sets that have the

characteristics of real data but mask protected aspects.

• Director of Data Analysis will work on all aspects of acquiring, sharing, publicizing analyzing data: privacy, legal, technical, etc.

Page 13: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 13

DyDAn Educational Programs• Great need to train people to work in homeland

security.

• Key DyDAn performers: record of integrating research and education from K-12 to postgraduate.

• Integration of research and education: students in all research projects.

• Integration of research and education: research themes into educational programs.

Page 14: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 14

DyDAn Educational Programs

• Workshops, tutorials, shortcourses: most open to all• New courses, certificate programs, faculty training

– Repository for information about homeland security courses nationally

– New homeland security certificate programs: RPI, RU, TSU– Website to disseminate our models nationally– Program for national college faculty

• Extensive program of “research experiences for undergraduates.”– Students from around the US in residence at DyDAn

Page 15: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 15

DyDAn Educational Programs

• Internships/Visits – by students/faculty to national labs, corporate

partner locations, and DyDAn. – by national lab, DHS, other UAC scientists to

DyDAn

• K-12 programs: – To build early awareness of educational and

career opportunities in homeland security– Annual high school teacher “short course” in

discrete math and homeland security

Page 16: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 16

Leadership as a Coordinating UAC• Building on extensive experience managing large,

complex scientific & educational enterprises. – Based at DIMACS (Center for Discrete Mathematics and

Theoretical Computer Science).

– An original NSF “science and technology center”– 13 partner institutions (5 universities, 8 companies)– Large portfolio of research & educational programs with

international scope

Page 17: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 17

A Resource for NJ• Connecting to the NJ Universities Homeland

Security Research Consortium: Seek to involve all Consortium universities

• Building on Relationships with State and Local Agencies

• Advisory Committee: State and National Representatives

• DyDAn Events open to NJ university, industry, and government participants and designed with their help.

• Connecting NJ to DHS officials and efforts nationally.

Page 18: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 18

DyDAn Research• Information Management and Knowledge

Discovery• Fundamental Topics in Discrete Mathematical

Foundations• Two research themes:

– Analysis of Large, Dynamic Multigraphs– Continuous, Distributed Monitoring of

Dynamic, Heterogeneous Data

Page 19: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 19

James Abello & Fred Roberts (Rutgers Univ.)

Kiran Chilakamarri (Texas Southern University)

Nate Dean(Texas State University- San

Marcos)

Project: Universal Information Graphs

Page 20: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 20

Overview and Connection to Problems of Homeland Security

•A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, …•Problem: Coordinate information from multiple sources, to identify “interesting” collaborative information networks.

Web Internet

Call DetailAir Traffic

AttackGraphs Market

Baskets

Page 21: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 21

Overview and Connection to Problems of Homeland Security

•Model each data source as a large multidigraph•Edges give information•Too much information to actually fuse all these multidigraphs into one.•Challenge: Fuse collection of multidigraphs in useful ways.

.

Page 22: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 22

Alex Borgida (Rutgers University)

Lila Ghemri (Texas Southern University)

Peter F. Patel-Schneider (Bell Labs Research)

Project: Adding Semantics to and Interconnecting Semantic Graphs

Page 23: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 23

Overview and Connection to Problems of Homeland Security

• Information of interest to DHS is often stored using “shallow” representations.– Much of the information is in English tags– Susceptible to ambiguity, incompleteness, etc.– These representations are nonetheless very useful

• Alleviate such shallowly represented information by augmenting with rich ontologies that describe and prescribe how a domain works– can discover information inherent in shallow information– can expose inconsistencies in shallow information

• Problem - reasoning with rich information is computationally expensive.

Page 24: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 24

Planned Work through DyDAn

• Extend OWL Web Ontology Language, a powerful ontology language for use with shallow information

• Extend and specialize theory of Distributed Description Logics (DDLs), designed to limit interactions to lessen computational load

• Develop and extend a highly-optimized reasoner to improve its performance with large amounts of shallow information

• Study how dynamic change interacts with reasoners

Page 25: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 25

Project: Analysis of Large, Dynamic Multigraphs Arising

from BlogsJames Abello (Rutgers &

Ask.com) Graham Cormode

(AT&T Labs – Research)S. Muthukrishnan (Rutgers Univ.)

Page 26: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 26

Multigraphs in Security Applications• Intelligence data is well-modeled by large, evolving

multigraphs– Nodes: entities Edges: connections

– Many links between same pair of entities denote different interactions at different times

– Relationships change (slowly, rapidly) over time.

• Examples:– (User IDs, emails/telephone calls),

– (Text reports/blogs/webpages, implicit/explicit links)

• Our research: acquiring and analyzing multigraphs from different applications.

Page 27: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 27

Overview and Connection to Problems of Homeland Security

• Blogs are an example of open source data– Large, highly-interconnected source of timely posts

on observations, experiences, events, politics etc. from citizen observers (sloggers).

– Chaotic source of information. What (mis)information is being propagated?

– Challenging: find trustworthy sources.

• Goal: develop techniques for labeling multigraphs in intelligence applications, apply them to blogs.

Page 28: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 28

Project: Algorithms for Identifying Hidden Social

Structures in Virtual Communities

Yuliy Baryshnikov (Bell Labs)

Mark Goldberg (RPI)

Malik Magdon-Ismail (RPI)

William (Al) Wallace (RPI)

Page 29: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 29

Overview and Connection to Problems of Homeland Security

• Prior to their acting, the perpetrators discuss and plan using a variety of communication media.

• Challenge: Find hidden groups, coalitions and leaders by non-semantic analysis of large communication networks.

• Ideal result: Find a suspicious group based on its pre-event communication activity, before they act.

• Useful forensic result: Ex-post discovery of the relationship between the act and communication burst.

Page 30: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 30

Project: Statistical and Graph-Theoretical Approaches to Time-

Varying Multigraphs

Colin Goodall, AT&T Labs – Research

Robert Bell, AT&T Labs – Research

David Madigan, Rutgers University

Page 31: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 31

• A COI (Community of Interest) isan effective summary of significantconnections in a graph.

• Use COI for very large scale analysisof a dynamic graph:– Stark change in COI indicates an anomaly– Has an entity changed its id?– New cliques?

• Goal: To analyze and apply automated anomaly detection to COI’s of dynamic multigraphs in telecomm, blogs, and intelligence data.

Overview and Connection To Problems of Homeland Security

Page 32: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 32

Project: Message Filtering and Entity Resolution

Endre Boros (Rutgers)

Lila Ghemri (TSU)

Tin Kam Ho (Bell Labs)

Paul Kantor (Rutgers)

David Madigan (Rutgers)

Richard Mammone (Rutgers)

Debasis Mitra (Bell Labs)

Page 33: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 33

Continuous Message Filtering and Entity Resolution in the Distributed Environment

• Vast amount of data flow into or through monitoring points

• Chaff must be discarded, meaningful messages and patterns of messages must be detected– in real time

– with limited communication among the processing and monitoring nodes

– with minimal interruption of normal communication and privacy

– and maximum effectiveness

Page 34: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 34

Overview of Research Problem

10 million messages a day. Billions of possible identifications

Multiple modeling and learning technologies

Multiple optimization and combination technologies

Thousands of potentially important messages, identifications, etc.

Operations

New Research• Work on

automatically learning to identify topics, events, and actors in messages

• Recognizing the same entity (an actor, a target, and organization) under different aliases.

Page 35: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 35

Connection to Problems of Homeland Security

• Millions/ tens of millions of messages should be screened for patterns of interest

• Actors often hide behind false identities– Reveal themselves by

• Language• Style• Connections to other actors

• Goal is to maximize screening effectiveness, minimize false positives – disruption to individuals and to commerce

• Early, positive impact on detection of agents and organizations

Page 36: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 36

Project: Continuous, Distributed Data Stream Monitoring

Moses Charikar (Princeton)

Graham Cormode (AT&T Labs – Research)

S. Muthukrishnan (Rutgers)

Page 37: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 37

Overview and Connection to Problems of Homeland Security

• Data is massive, distributed, and evolving– inconvenient or impossible to collect together in one place– still need to monitor, identify patterns, correlate

• Example: – monitoring streams of text: emails, blogs, newsfeeds, field

reports. – identify patterns — profiles, clusters, outliers — that occur

across multiple sites

• Technical challenge: – must be accurate, avoid false positives. – optimize how (much) data is communicated between agents.

Page 38: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 38

Alexandre d'Aspremont (Princeton)Yuliy Baryshnikov (Bell Labs)

Savas Dayanik (Princeton)Paul Kantor (Rutgers)

Kai Li (Princeton)Warren Powell (Princeton)

Seyed Roosta (TSU)

Project: Optimization and Data Analysis

Page 39: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 39

Overview and Connection to Problems of Homeland Security

• Dynamic data raises optimization issues:– Has rate of messages sent by X changed?– Has flow of cash into/out of organization Y changed?– Has there been unexpected change in travel plans?

• View as learning problem; use optimal learning strategies.

• Research challenges: – Optimal detection of changes in signals– Optimal recursive estimation– Rapid classification/pattern identification

Page 40: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 40

Project: Dynamic Similarity Search in Multi-Modal Data

Moses Charikar, Perry Cook, Kai Li, and Olga Troyanskaya

(Princeton)

Ken Clarkson, Tin Kam Ho, and Haobo Ren

(Bell Labs)

Page 41: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 41

Overview and Connection to Problems of Homeland Security

• Data arising in homeland security comes from many modalities– Often such data are sensor data (audio, images, video, etc.)

which are noisy and require similarity match and similarity search

– Feature extractions are difficult and such features are high dimensional

• Multi-modal data of interest are massive– Current content-based similarity search and classification are

limited to small scale– “Curse of dimensionality”

Page 42: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

Slide 42

Overview and Connection to Problems of Homeland Security

• How to build similarity search systems for multi-modal data is not well understood– How to manage and search at scale

– How to integrate annotations/attributes based search with content-based search

Page 43: AT&T Labs – Research Bell Labs/Lucent Technologies Princeton University Rensselaer Polytechnic Institute Rutgers, the State University of New Jersey Texas

We are looking forward to collaborating with the DHS Institute for Discrete Sciences and to involving the NJ homeland security community in the new center