AT&T Labs – Research
Bell Labs/Lucent Technologies
Princeton University
Rensselaer Polytechnic Institute
Rutgers, the State University of New Jersey
Texas Southern University
Texas State University, San Marcos
Slide 2
Background
• DHS has established an Institute for Discrete Sciences (IDS).
• Managed out of Lawrence Livermore National Laboratory.
• DHS is establishing four “university affiliated centers” (UACs) around the country.
• One of these will be a “coordinating” UAC.
• The Rutgers-based team has been designated as a UAC and was asked to become a coordinating UAC.
• Other centers: Univ. of Illinois Urbana-Champaign, Univ. of Southern California, Univ. of Pittsburgh
• In addition to 6 formal partners, we have told DHS that NJ University Consortium institutions will be involved.
Slide 3
What is Discrete Science?
• Discrete Science deals with
– Patterns
– Arrangements
– Assignments
– Schedules
• Discrete Science
– Seeks patterns in large amounts of data
– Analyzes connections between entities such as people and groups
– Develops efficient ways to quickly spot changes in standard patterns
Slide 4
Why DyDAn?
• Homeland Security requires inferences from massive flows of data, arriving continuously.
• Buried in the data are quickly changing patterns.
• DyDAn will develop novel technologies to find patterns & relationships in dynamic, nonstationary, massive datasets.
• DyDAn will produce pioneering educational programs to nurture the homeland security workforce of the future.
Slide 5
DyDAn Research
• Information Management and Knowledge Discovery
• Fundamental Topics in Discrete Mathematical Foundations
• Two research themes:
– Analysis of Large, Dynamic Multigraphs
– Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data
Slide 6
DyDAn Research I: Analysis of Large, Dynamic Multigraphs
• Need to understand interactions between entities: people, objects, groups
• Interactions often modeled as graphs
– Linking nodes (entities) with edges (connections)
• Multiple relationships between entities suggest multigraphs.
• Adding new entities and new & changing connections suggests dynamic multigraphs.
• Develop methods to represent, analyze, interrogate, & navigate dynamic multigraphs.
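As a rough illustration of the idea on this slide, a dynamic multigraph can be sketched as an adjacency structure that keeps every parallel edge together with a relation label and a timestamp. The class name, method names, and the undirected treatment of edges are assumptions made for this sketch; they are not taken from the DyDAn projects.

```python
from collections import defaultdict


class DynamicMultigraph:
    """Toy dynamic multigraph: parallel, labeled, timestamped edges between entities."""

    def __init__(self):
        # adjacency[u][v] holds one (relation, timestamp) pair per parallel edge
        self.adjacency = defaultdict(lambda: defaultdict(list))

    def add_edge(self, u, v, relation, timestamp):
        # Edges are treated as undirected here (an assumption of this sketch)
        self.adjacency[u][v].append((relation, timestamp))
        if u != v:
            self.adjacency[v][u].append((relation, timestamp))

    def edges_between(self, u, v, since=None):
        # .get avoids creating empty entries for entities that were never added
        edges = self.adjacency.get(u, {}).get(v, [])
        if since is None:
            return list(edges)
        return [edge for edge in edges if edge[1] >= since]

    def neighbors(self, u):
        return set(self.adjacency.get(u, {}))


g = DynamicMultigraph()
g.add_edge("alice", "bob", "email", 1)
g.add_edge("alice", "bob", "call", 5)   # second, parallel edge between the same pair
g.add_edge("bob", "carol", "email", 7)
print(g.edges_between("alice", "bob", since=3))
```

Keeping the timestamp on each edge is what makes the structure "dynamic": queries such as `edges_between(..., since=t)` can restrict attention to recent interactions.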
Slide 7
DyDAn Research I: Analysis of Large, Dynamic Multigraphs
Slide 8
DyDAn Research II: Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data
• Need to understand massive amounts of data.
• Data inherently distributed (multiple sources)
• Data arrives rapidly – “continuously”
• Seek anomalies, patterns, “emerging events”
• Run continuous queries to monitor incoming data stream.
• Data takes numerous forms; requires data mining methods that span the modalities.
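To make "continuous query" concrete, here is a minimal sketch of one: a sliding-window count that is re-evaluated on every arriving item and flags any key seen at least a threshold number of times within the window. The class name, the parameters, and the assumption that timestamps arrive in non-decreasing order are all illustrative choices, not part of the DyDAn work.

```python
from collections import Counter, deque


class SlidingWindowMonitor:
    """Continuous query: is `key` seen at least `threshold` times in the last `window` time units?"""

    def __init__(self, window, threshold):
        self.window = window          # window length; assumed to be positive
        self.threshold = threshold
        self.events = deque()         # (timestamp, key) pairs in arrival order
        self.counts = Counter()       # per-key counts inside the current window

    def observe(self, timestamp, key):
        """Ingest one item (timestamps assumed non-decreasing); return True on an alert."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        # Expire items that have fallen out of the window
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]
        return self.counts[key] >= self.threshold


monitor = SlidingWindowMonitor(window=10, threshold=3)
stream = [(1, "a"), (2, "a"), (3, "b"), (4, "a"), (20, "a")]
print([monitor.observe(t, k) for t, k in stream])
```

The monitor never stores more than one window of data, which is the property that matters when the stream is too large to keep.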
Slide 9
DyDAn Research Portfolio: Flexibility
• 9 initial projects, 5 in Area I, 4 in Area II
• Not all starting in year 1.
• All build on previous work and draw on additional funding from Rutgers.
• Portfolio reviewed regularly with DHS, national lab partners, and other DHS centers; can readily adapt to newly identified needs.
Slide 10
DyDAn Research Portfolio: Large Graphs Projects
• Universal Information Graphs (initial emphasis)
• Adding Semantics to and Interconnecting Semantic Graphs (initial emphasis)
• Analyzing Large, Dynamic Multigraphs Arising from Blogs
• Algorithms for Identifying Hidden Social Structures
• Statistical and Graph-theoretical Approaches to Time-Varying Multigraphs (initial emphasis)
Slide 11
DyDAn Research Portfolio: Dynamic Data Projects
• Message Filtering and Entity Resolution
• Continuous, Distributed Data Stream Modeling
• Optimization and Data Analysis (initial emphasis)
• Dynamic Similarity Search in Multi-Modal Data
Slide 12
DyDAn Data
• Emphasis on publicly available data.
• How to acquire, publish, analyze, store data in a private, secure way.
• Privacy-preserving data analysis.
• How to generate synthetic data sets that have the characteristics of real data but mask protected aspects.
• Director of Data Analysis will work on all aspects of acquiring, sharing, publicizing, and analyzing data: privacy, legal, technical, etc.
Slide 13
DyDAn Educational Programs
• Great need to train people to work in homeland security.
• Key DyDAn performers: record of integrating research and education from K-12 to postgraduate.
• Integration of research and education: students in all research projects.
• Integration of research and education: research themes into educational programs.
Slide 14
DyDAn Educational Programs
• Workshops, tutorials, short courses: most open to all
• New courses, certificate programs, faculty training
– Repository for information about homeland security courses nationally
– New homeland security certificate programs: RPI, RU, TSU
– Website to disseminate our models nationally
– Program for national college faculty
• Extensive program of “research experiences for undergraduates.”
– Students from around the US in residence at DyDAn
Slide 15
DyDAn Educational Programs
• Internships/Visits
– by students/faculty to national labs, corporate partner locations, and DyDAn.
– by national lab, DHS, other UAC scientists to DyDAn
• K-12 programs:
– To build early awareness of educational and career opportunities in homeland security
– Annual high school teacher “short course” in discrete math and homeland security
Slide 16
Leadership as a Coordinating UAC
• Building on extensive experience managing large, complex scientific & educational enterprises.
– Based at DIMACS (Center for Discrete Mathematics and Theoretical Computer Science).
– An original NSF “science and technology center”
– 13 partner institutions (5 universities, 8 companies)
– Large portfolio of research & educational programs with international scope
Slide 17
A Resource for NJ
• Connecting to the NJ Universities Homeland Security Research Consortium: Seek to involve all Consortium universities
• Building on Relationships with State and Local Agencies
• Advisory Committee: State and National Representatives
• DyDAn Events open to NJ university, industry, and government participants and designed with their help.
• Connecting NJ to DHS officials and efforts nationally.
Slide 18
DyDAn Research
• Information Management and Knowledge Discovery
• Fundamental Topics in Discrete Mathematical Foundations
• Two research themes:
– Analysis of Large, Dynamic Multigraphs
– Continuous, Distributed Monitoring of Dynamic, Heterogeneous Data
Slide 19
James Abello & Fred Roberts (Rutgers Univ.)
Kiran Chilakamarri (Texas Southern University)
Nate Dean (Texas State University, San Marcos)
Project: Universal Information Graphs
Slide 20
Overview and Connection to Problems of Homeland Security
• A variety of different massive data sources are available to analysts: Web, Internet, Calls, Email, Transportation, …
• Problem: Coordinate information from multiple sources to identify “interesting” collaborative information networks.
[Figure: example data sources: Web, Internet, Call Detail, Air Traffic, Attack Graphs, Market Baskets]
Slide 21
Overview and Connection to Problems of Homeland Security
• Model each data source as a large multidigraph
• Edges give information
• Too much information to actually fuse all these multidigraphs into one.
• Challenge: Fuse collection of multidigraphs in useful ways.
Slide 22
Alex Borgida (Rutgers University)
Lila Ghemri (Texas Southern University)
Peter F. Patel-Schneider (Bell Labs Research)
Project: Adding Semantics to and Interconnecting Semantic Graphs
Slide 23
Overview and Connection to Problems of Homeland Security
• Information of interest to DHS is often stored using “shallow” representations.
– Much of the information is in English tags
– Susceptible to ambiguity, incompleteness, etc.
– These representations are nonetheless very useful
• Alleviate the weaknesses of shallowly represented information by augmenting it with rich ontologies that describe and prescribe how a domain works
– can discover information inherent in shallow information
– can expose inconsistencies in shallow information
• Problem: reasoning with rich information is computationally expensive.
Slide 24
Planned Work through DyDAn
• Extend the OWL Web Ontology Language, a powerful ontology language for use with shallow information
• Extend and specialize theory of Distributed Description Logics (DDLs), designed to limit interactions to lessen computational load
• Develop and extend a highly-optimized reasoner to improve its performance with large amounts of shallow information
• Study how dynamic change interacts with reasoners
Slide 25
Project: Analysis of Large, Dynamic Multigraphs Arising from Blogs
James Abello (Rutgers & Ask.com)
Graham Cormode (AT&T Labs – Research)
S. Muthukrishnan (Rutgers Univ.)
Slide 26
Multigraphs in Security Applications
• Intelligence data is well-modeled by large, evolving multigraphs
– Nodes: entities; Edges: connections
– Many links between same pair of entities denote different interactions at different times
– Relationships change (slowly, rapidly) over time.
• Examples:
– (User IDs, emails/telephone calls),
– (Text reports/blogs/webpages, implicit/explicit links)
• Our research: acquiring and analyzing multigraphs from different applications.
Slide 27
Overview and Connection to Problems of Homeland Security
• Blogs are an example of open source data
– Large, highly-interconnected source of timely posts on observations, experiences, events, politics, etc. from citizen observers (bloggers).
– Chaotic source of information. What (mis)information is being propagated?
– Challenging: find trustworthy sources.
• Goal: develop techniques for labeling multigraphs in intelligence applications, apply them to blogs.
Slide 28
Project: Algorithms for Identifying Hidden Social
Structures in Virtual Communities
Yuliy Baryshnikov (Bell Labs)
Mark Goldberg (RPI)
Malik Magdon-Ismail (RPI)
William (Al) Wallace (RPI)
Slide 29
Overview and Connection to Problems of Homeland Security
• Prior to their acting, the perpetrators discuss and plan using a variety of communication media.
• Challenge: Find hidden groups, coalitions and leaders by non-semantic analysis of large communication networks.
• Ideal result: Find a suspicious group based on its pre-event communication activity, before they act.
• Useful forensic result: Ex-post discovery of the relationship between the act and communication burst.
Slide 30
Project: Statistical and Graph-Theoretical Approaches to Time-Varying Multigraphs
Colin Goodall, AT&T Labs – Research
Robert Bell, AT&T Labs – Research
David Madigan, Rutgers University
Slide 31
Overview and Connection to Problems of Homeland Security
• A COI (Community of Interest) is an effective summary of significant connections in a graph.
• Use COI for very large scale analysis of a dynamic graph:
– Stark change in COI indicates an anomaly
– Has an entity changed its id?
– New cliques?
• Goal: To analyze and apply automated anomaly detection to COIs of dynamic multigraphs in telecom, blogs, and intelligence data.
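One simple way to read "stark change in COI indicates an anomaly" is sketched below. It assumes, purely for illustration, that an entity's COI is its top-k contacts by interaction count in a time window, and that change is measured by Jaccard overlap between the COIs of consecutive windows. Both the definition and the function names are assumptions of this sketch and may differ from the project's actual COI construction.

```python
from collections import Counter


def coi(interactions, entity, k=3):
    """Top-k contacts of `entity` by interaction count (illustrative COI definition)."""
    weights = Counter()
    for u, v in interactions:
        if u == v:
            continue  # ignore self-interactions
        if u == entity:
            weights[v] += 1
        elif v == entity:
            weights[u] += 1
    # Sort by descending weight, then by name, so ties break deterministically
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return {contact for contact, _ in ranked[:k]}


def coi_overlap(coi_a, coi_b):
    """Jaccard similarity of two COIs: 1.0 means identical, 0.0 means disjoint."""
    if not coi_a and not coi_b:
        return 1.0
    return len(coi_a & coi_b) / len(coi_a | coi_b)


last_week = [("x", "a"), ("x", "a"), ("b", "x"), ("x", "c")]
this_week = [("x", "d"), ("x", "e"), ("f", "x")]
overlap = coi_overlap(coi(last_week, "x"), coi(this_week, "x"))
print(overlap)  # a value near 0 would flag "x" for a closer look
```

The same overlap score can also be compared across entities: an old id that goes quiet while a new id appears with a nearly identical COI is one hint that an entity has changed its id.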
Slide 32
Project: Message Filtering and Entity Resolution
Endre Boros (Rutgers)
Lila Ghemri (TSU)
Tin Kam Ho (Bell Labs)
Paul Kantor (Rutgers)
David Madigan (Rutgers)
Richard Mammone (Rutgers)
Debasis Mitra (Bell Labs)
Slide 33
Continuous Message Filtering and Entity Resolution in the Distributed Environment
• Vast amounts of data flow into or through monitoring points
• Chaff must be discarded; meaningful messages and patterns of messages must be detected
– in real time
– with limited communication among the processing and monitoring nodes
– with minimal interruption of normal communication and privacy
– and maximum effectiveness
Slide 34
Overview of Research Problem
10 million messages a day. Billions of possible identifications
Multiple modeling and learning technologies
Multiple optimization and combination technologies
Thousands of potentially important messages, identifications, etc.
Operations
New Research
• Work on automatically learning to identify topics, events, and actors in messages
• Recognizing the same entity (an actor, a target, an organization) under different aliases.
Slide 35
Connection to Problems of Homeland Security
• Millions to tens of millions of messages should be screened for patterns of interest
• Actors often hide behind false identities
– Reveal themselves by
• Language
• Style
• Connections to other actors
• Goal is to maximize screening effectiveness and minimize false positives (disruption to individuals and to commerce)
• Early, positive impact on detection of agents and organizations
Slide 36
Project: Continuous, Distributed Data Stream Monitoring
Moses Charikar (Princeton)
Graham Cormode (AT&T Labs – Research)
S. Muthukrishnan (Rutgers)
Slide 37
Overview and Connection to Problems of Homeland Security
• Data is massive, distributed, and evolving
– inconvenient or impossible to collect together in one place
– still need to monitor, identify patterns, correlate
• Example:
– monitoring streams of text: emails, blogs, newsfeeds, field reports.
– identify patterns (profiles, clusters, outliers) that occur across multiple sites
• Technical challenge:
– must be accurate, avoid false positives.
– optimize how (much) data is communicated between agents.
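The trade-off between accuracy and communication can be illustrated with a deliberately simple scheme, sketched here under stated assumptions: each site counts events locally and sends an update only when its count has grown by a fixed slack since its last report. The coordinator's estimate of the global total is then off by less than (number of sites × slack). The class names and the slack rule are hypothetical and are not claimed to be the project's protocol.

```python
class Site:
    """A monitoring site that reports only after `slack` new local events."""

    def __init__(self, slack):
        self.slack = slack
        self.count = 0
        self.last_reported = 0

    def observe(self):
        """Record one local event; return the new count if a report is due, else None."""
        self.count += 1
        if self.count - self.last_reported >= self.slack:
            self.last_reported = self.count
            return self.count
        return None


class Coordinator:
    """Tracks the last reported count of each site and tallies messages received."""

    def __init__(self, num_sites):
        self.known = [0] * num_sites
        self.messages = 0

    def receive(self, site_id, count):
        self.known[site_id] = count
        self.messages += 1

    def estimate(self):
        return sum(self.known)


sites = [Site(slack=5), Site(slack=5)]
coordinator = Coordinator(num_sites=2)
# Site 0 sees 12 events, site 1 sees 3 events
for site_id, num_events in [(0, 12), (1, 3)]:
    for _ in range(num_events):
        update = sites[site_id].observe()
        if update is not None:
            coordinator.receive(site_id, update)
print(coordinator.estimate(), coordinator.messages)
```

A larger slack means fewer messages but a looser estimate; choosing that balance is one small instance of "optimize how (much) data is communicated between agents."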
Slide 38
Alexandre d'Aspremont (Princeton)
Yuliy Baryshnikov (Bell Labs)
Savas Dayanik (Princeton)
Paul Kantor (Rutgers)
Kai Li (Princeton)
Warren Powell (Princeton)
Seyed Roosta (TSU)
Project: Optimization and Data Analysis
Slide 39
Overview and Connection to Problems of Homeland Security
• Dynamic data raises optimization issues:
– Has rate of messages sent by X changed?
– Has flow of cash into/out of organization Y changed?
– Has there been unexpected change in travel plans?
• View as learning problem; use optimal learning strategies.
• Research challenges:
– Optimal detection of changes in signals
– Optimal recursive estimation
– Rapid classification/pattern identification
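A question such as "has the rate of messages sent by X changed?" can be illustrated with the classical one-sided CUSUM detector, shown here as a minimal sketch. CUSUM is a standard textbook technique chosen for illustration; the slides do not say which detection methods the project uses, and the baseline, slack, and threshold values below are made up.

```python
def cusum_alarm(values, baseline, slack, threshold):
    """Return the index of the first upward-shift alarm, or None if there is none.

    Accumulates how far each value exceeds `baseline + slack`, resetting to zero
    whenever the running sum would go negative; alarms once the sum exceeds `threshold`.
    """
    running = 0.0
    for index, value in enumerate(values):
        running = max(0.0, running + (value - baseline - slack))
        if running > threshold:
            return index
    return None


# Hypothetical daily message counts for X: steady near 10, then a jump to about 15
daily_messages = [10, 11, 9, 10, 15, 16, 15, 17]
print(cusum_alarm(daily_messages, baseline=10, slack=1, threshold=8))
```

The slack term keeps ordinary fluctuation from accumulating, while the threshold sets how much evidence is required before raising an alarm; together they trade detection delay against false alarms.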
Slide 40
Project: Dynamic Similarity Search in Multi-Modal Data
Moses Charikar, Perry Cook, Kai Li, and Olga Troyanskaya
(Princeton)
Ken Clarkson, Tin Kam Ho, and Haobo Ren
(Bell Labs)
Slide 41
Overview and Connection to Problems of Homeland Security
• Data arising in homeland security comes from many modalities
– Often such data are sensor data (audio, images, video, etc.) which are noisy and require similarity match and similarity search
– Feature extraction is difficult and the resulting features are high-dimensional
• Multi-modal data of interest are massive
– Current content-based similarity search and classification are limited to small scale
– “Curse of dimensionality”
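One well-known way to sidestep exact search in high dimensions is to compress each feature vector into a short bit signature using random hyperplanes, so that vectors pointing in similar directions tend to agree on most bits. The sketch below illustrates that general idea only; it is not presented as the project's system, and the function names, dimensions, and seed are arbitrary choices.

```python
import random


def make_hyperplanes(dim, bits, seed=0):
    """Draw `bits` random Gaussian hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    return tuple(
        1 if sum(h_i * x_i for h_i, x_i in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(sig_a, sig_b):
    """Number of differing bits; smaller values suggest a smaller angle between vectors."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


planes = make_hyperplanes(dim=3, bits=16)
query = [1.0, 2.0, 3.0]
candidate = [2.0, 4.0, 6.0]          # same direction as the query, different length
print(hamming(signature(query, planes), signature(candidate, planes)))
```

Signatures depend only on direction, so the scaled candidate above receives the same signature as the query. Comparing short signatures instead of full feature vectors is what lets this kind of search scale to large collections, at the cost of approximate answers.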
Slide 42
Overview and Connection to Problems of Homeland Security
• How to build similarity search systems for multi-modal data is not well understood
– How to manage and search at scale
– How to integrate annotation/attribute-based search with content-based search
We are looking forward to collaborating with the DHS Institute for Discrete Sciences and to involving the NJ homeland security community in the new center.