DISTRIBUTED AND INTERACTIVE SIMULATIONS AT LARGE SCALE FOR TRANSCONTINENTAL EXPERIMENTATION

19 October 2010
Dan M. Davis, for Gottschalk, Yao, Lucas, & Wagenbreth
[email protected]
(310) 448-8434
Approved for public release; distribution is unlimited.
DS-RT 2010
Co-Authors

Prof. Robert F. Lucas, Dr. Ke-Thia Yao and Gene Wagenbreth
Information Sciences Institute
University of Southern California
Marina del Rey, California 90292
{ rflucas, kyao, genew } @isi.edu

Dr. Thomas D. Gottschalk
Center for Advanced Computing Research
California Institute of Technology
Pasadena, California 91125
Overview
Needs of user community at JFCOM
Three Technologies
General Purpose GPUs Implementation
High Bandwidth Studies with Interest Management
Distributed Data Management
Lessons Learned
Theses
Transcontinentally Distributed Simulations require advanced technologies and innovative paradigmatic approaches
Academic researchers often have such capabilities, which are typically not yet in the journeyman programmer’s toolbox
GPGPUs provide usable acceleration, but conservative assessments of potential speed-up are warranted
Transcontinental High Performance Communications (“10 Gig”) are possible with Interest Management
With distributed systems and a data glut from High Performance Computing simulations, new approaches to Data Management are required
JFCOM as a Distributed Simulation User

U.S. Joint Forces Command, Norfolk, Virginia
One of DoD’s combatant commands
Key role in transforming defense capabilities
Two JFCOM Directorates use agent-based simulations:
J7 - Training
Trains forces
Develops doctrine
Leads training requirements analysis
Provides an interoperable training environment
J9 - Concept Development and Experimentation
Develops innovative joint concepts and capabilities
Provides proven solutions to problems facing the joint force
Simulations are typically GENSER Secret and characterized by:
Interactive use by hundreds of personnel
Distributed transcontinentally, but must run in real time
Vast majority of users at the terminals are uniformed warfighters
Plan View Display

3D Stealth View
Simulation Federates

Agent-based models use rules for entity behavior
Autonomous-agent entities
Can be Human-In-The-Loop (HITL) and run in real time
Large compute clusters required to run large-scale simulations
Standard interface is HLA RTI communication (IEEE 1516)
Supplanted the old DIS
Publish/Subscribe model
USC/Caltech Software Routers scale better
Common Codes in use at JFCOM:
Joint Semi-Automated Forces (JSAF)
“Culture”, a stripped-down civilian instantiation of JSAF
Simulating Location & Attack of Mobile Enemy Missiles (SLAMEM)
OneSAF
Computing Infrastructure

Koa and Glenn deployed spring ’04
At MHPCC & ASC-MSRC
2 Linux Clusters, 32-bit
256 nodes, 512 cores each
Joshua deployed 2008
GPGPU-enhanced
256 nodes, 1024 cores, 64-bit
DREN Connectivity
Users in VA, CA & Army bases
Application tolerates network latency
Real-time interactive supercomputing
Joshua GPGPU-Enhanced Cluster Configuration

256 Nodes
Nodes - (2) AMD Santa Rosa 2220 2.8 GHz dual-core processors, 1024 cores total
GPUs - (1) NVIDIA 8800 video card
Node Chassis - 4U chassis
Memory - 16 GB DIMM DDR2 667 per node
GigE inter-node communications
Delivery to: Joint Advanced Tactics and Training Laboratory (JATTL) in Suffolk, VA
Perspective: Entity Growth vs. Time
[Chart: Number and Complexity of JSAF Entities vs. time. Milestones: SAF Express (1997), UE 98-1 (1997), J9901 (1999), AO-00 (2000), JSAF/SPP Tests (2004), JSAF/SPP Urban Resolve (2004), JSAF/SPP Capability (2006), JSAF/SPP Joshua (2008); entity counts grow from 3,600 through 12,000, 50,000, 107,000, 250,000, 1,000,000 and 1,400,000 to 10,000,000. Eras: SPP Proof of Principle (DARPA / Caltech); DC Clusters at MHPCC & ASC-MSRC; DHPI GPU-Enhanced Cluster.]

Experiments continue to require orders of magnitude larger and more complex battlespaces: SCALE and FIDELITY
Why GPUs?

GPU kernel performance can be 100X the host’s
Don’t forget Gene Amdahl’s Law; 2-3X overall is typical
This speed-up is expected to grow
Early SAF work (UNC, SAIC, USC):
Line of Sight
Route Finding
Collision Detection
Sparse Matrix Factorization
ISI verified similar bottlenecks in JSAF
New ideas for use in scenario generation for new multi-spectral sensors
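Amdahl’s Law explains the gap between a 100X kernel speed-up and the 2-3X overall figure: only the accelerated fraction of the runtime benefits. A minimal sketch (the runtime fractions below are illustrative, not Joshua measurements):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of total runtime
    is accelerated by a factor s (Amdahl's Law)."""
    return 1.0 / ((1.0 - f) + f / s)

# A 100X GPU kernel buys only ~2-3X overall when the accelerated
# kernel is roughly half to two-thirds of the runtime:
print(round(amdahl_speedup(0.6, 100.0), 2))   # 2.46
print(round(amdahl_speedup(0.5, 100.0), 2))   # 1.98
```

Even an infinitely fast kernel caps the overall gain at 1/(1-f), which is why conservative assessments of potential speed-up are warranted.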
Route Planning Performance Impact
Time Spent in Route Planning is a Critical Bottleneck
Benefits of GPGPU Computing

Joshua has provided many benefits; some are not easily quantified
Training, analysis or evaluation in cities otherwise off-limits due to:
Security issues
Public resistance to combat troops in their city
Diplomatic sensitivity about U.S. interest in cities of potential conflict
Joshua does save personnel costs, e.g. an Army Division costs ~$20M per day
The DHPI cluster can run such a program using only ~100 technicians
Cost saving may be ~$19.5M each day
Good visibility with the leadership elite:
Congressional visits
A Lieutenant General noted that it was probably the only time in his career he would have an opportunity to command so large a unit
1,500 soldiers across the country participated, all connected
Transcontinental Network High Bandwidth Research

The issue here was the potential exploitation of high-bandwidth nets, e.g. “10 Gig” (10 Gigabit per second) nets
The nodes of this WAN were located at:
ISI-East in Virginia
University of Illinois at Chicago
ISI-West in California
Previous work indicated the utility of interest-managed communications on cluster meshes, high-bandwidth Local Area Networks (LANs), and lower bandwidth WANs
Interest-limited message exchange was done using Caltech’s MeshRouter formalism
Interest Managed Software Router Diagram
Transcontinental Testbed for “10 Gig” Proof of Concept
Tests on Local ISI Cluster

Prepared for the wide-area tests
Ran a number of variations of configurations using the ISI-W cluster
Tests involved configurations with:
A single router process on its own node
A single publish process on its own node
Multiple subscriber processes on either
Performance results for these configurations are summarized on the next slide
Bandwidth numbers reflect only data rates into the subscriber processes
Bandwidth Results for Various Test Configurations

Number of     Subscriber   Per-Node        Total           Router
Subscribers   Nodes        Bandwidth       Bandwidth       Load
 1            1            680 Mbit/sec    690 Mbit/sec    41% busy
 2            1            672 Mbit/sec    1.3 Gbit/sec    54% busy
 4            1            504 Mbit/sec    2.0 Gbit/sec    59% busy
 6            2            344 Mbit/sec    2.1 Gbit/sec    55% busy
 8            2            288 Mbit/sec    2.3 Gbit/sec    51% busy
16            2            160 Mbit/sec    2.6 Gbit/sec    44% busy
Benchmark Tests: Two Forms

Application processes:
Publish Processors, which publish messages of specified length and interest state; nominal total publication rate (MByte/sec) controlled by a data file
Subscribe Processors, which receive messages for a specified interest state, collect messages from multiple publishers, and measure actual incoming message rates
Routers direct individual messages from publishers to subscribers according to the interest declarations
Router processes were instrumented to determine the fraction of time spent on management
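The publish/subscribe arrangement described above can be sketched as follows. This is a toy illustration of the interest-management idea only, not the Caltech MeshRouter or RTI-s code; all names and messages are invented:

```python
from collections import defaultdict

class InterestRouter:
    """Toy interest-managed router: messages carry an interest state,
    and the router forwards each message only to subscribers that
    declared interest in that state, rather than broadcasting."""

    def __init__(self):
        self.subscribers = defaultdict(list)   # interest state -> inboxes
        self.messages_routed = 0

    def subscribe(self, interest, inbox):
        self.subscribers[interest].append(inbox)

    def publish(self, interest, message):
        for inbox in self.subscribers[interest]:
            inbox.append(message)
            self.messages_routed += 1

router = InterestRouter()
ground, air = [], []
router.subscribe("ground-tracks", ground)
router.subscribe("air-tracks", air)

router.publish("ground-tracks", "tank at grid 123")
router.publish("air-tracks", "helo at grid 456")

print(ground)   # ['tank at grid 123'] -- the air subscriber never sees it
```

Because the router consults the interest declarations before forwarding, subscribers receive only traffic they asked for, which is what keeps aggregate bandwidth bounded as entity counts grow.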
Aggregate Bandwidth versus Number of Subscribers
10 Gig WAN Test Results

WAN tests were then performed
This entailed a great deal of trouble-shooting of the various routers along the WAN
A number of variants of the basic configuration were explored:
The number of distinct interest states
The number of processors associated with a single router process
The number of replicas of the basic “Router plus Associated Pub/Sub” nodes at each site
Typical performance numbers are from a test with eight participating nodes at each of the UIC and ISI sites
Results are summarized on the next slide
Performance Measures for Typical WAN Test

Message     Client BW     Single MR BW   Aggregate BW
Length      (bytes/sec)   (bytes/sec)    (bits/sec)
0.4 KByte    3.2 M         16.0 M         1.0 G
0.8 KByte    6.4 M         32.0 M         2.1 G
1.6 KByte   12.8 M         64.0 M         4.1 G
2   KByte   14.3 M         71.5 M         4.6 G
100 KByte    0.8 M          4.0 M         0.3 G
Final Results: ~5 Gigabits Over a “10 Gig” Line

Aggregate throughput is rather poorer for:
Individual message sizes that are too small
Individual message sizes that are too large
This is consistent with experience on a single SPP
Constraints are largely due to the nature of the RTI-s communications primitives
As long as messages are near the optimal size, the aggregate bandwidth for the WAN test is ~4.6 Gbit/sec
WAN tests with varying configurations give similar results:
Max total WAN BW = 4.6 - 4.9 Gbit/sec
Data Requirements

Larger Scale: global scale vs. theatre
Higher Fidelity: greater complexity in models of entities (sensors, people, vehicles, etc.)
Urban Operations: millions of civilians
All of the above produce dramatic increases in data relative to previously experienced events
Terrain Large, but NOT a Significant Data Issue
Growth and the Impending Data Armageddon

JFCOM had an immediate need for more entities (>10X)
Limited memory on nodes and in Tesla
Tertiary storage very limited
A TeraByte a week with existing practice
Keeping only 20% of current data
Need 10X more entities
Need 10X behavior improvement
Net growth needed: almost three orders of magnitude
Now doing face validity
Need a more quantitative, statistical approach (future work at Caltech - Dr. Thomas Gottschalk)
Data mining efforts now commencing
Two Key Challenges

Collect the “fire hoses” of data generated by large-scale, distributed, sensor-rich environments
Without interfering with communication
Without interfering with simulator performance
Maximally exploit the collected data efficiently
Without overwhelming users
Without losing critical content
Goal: a unified distributed logging/analysis infrastructure that helps the users, does not burden the computing/networking managers, and does not unduly tax network traffic loads
Limitation of the First System: It Did Not Scale!

Two separate data analysis systems:
One for near-real time during the event
Another for post-event processing
For near-real time: too much data to access over the wide-area network without crashing
For post-event processing: 1-2 weeks to stage data to a centralized data store
Discards Green entities (80%)
Handling Dynamic Data

Data is NOT static during runs, but users need access
Logger continuously inserts new data from the simulation
Need distributed queries to combine remote data sources
Distributed logger inserts data into the SDG data store at each site
Problems:
Local cache invalid with respect to inserts
Cannot preposition data to optimize queries
ISI strategy: explore trade-offs
Compute on demand for better efficiency
Compute on insert for faster queries
Variable fidelity: periodic updates
Dynamic pre-computation: detect frequent queries
Handling Distributed Data

Analyze data in place
Data is generated on distributed nodes
Leave data where it is generated
Distribute data access so data appears to be at a single site
Take advantage of HPC hardware capabilities
Large-capacity data storage
High-bandwidth network
Data archival
Exploit JSAF query characteristics
Limited number of joins
Counting/aggregation type queries
Data product is several orders of magnitude smaller than raw data
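The counting/aggregation query characteristic is what makes analyze-in-place viable: each site reduces its raw logs locally, and only the small aggregate crosses the WAN. A sketch with invented contact reports (not JSAF data or the SDG API):

```python
from collections import Counter

# Hypothetical per-site raw logs of (sensor_type, detected) contact reports.
site_va = [("UAV", True), ("UAV", False), ("radar", True)] * 1000
site_ca = [("UAV", True), ("radar", False)] * 1000

def local_aggregate(reports):
    """Run the counting query where the data lives; only this small
    aggregate, not the raw log, needs to cross the network."""
    return Counter(reports)

# Combine the partial aggregates at the query site.
total = local_aggregate(site_va) + local_aggregate(site_ca)
print(total[("UAV", True)])   # 2000
```

The data product here is a handful of counters regardless of how many thousands of raw reports each site holds, matching the slide's orders-of-magnitude reduction.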
Notional Diagram of Scalable Data Grid
Multidimensional Analysis: Sensor/Target Scoreboard

Summarizes sensor contact reports
Positive and negative sensor detections
Displays two-dimensional views of the data
Provides three levels of drill-down:
Sensor platform type vs. target class
Sensor platforms vs. sensor modes
List of contact reports
Multidimensional Analysis

Raw data has other dimensions of potential interest:
Detection status
Time, location
Terrain type, entity density
Weather condition
Each dimension can be aggregated at multiple levels:
Time: minutes, hours, days
Location: country, grid square
Collapse and expand multiple dimensions for viewing

[Diagram: data cube with sensor, target and time axes]
Multidimensional Analysis

Dimensions:
* : any
t : sensor platform type
p : sensor platform
c : target class
o : target object
m : sensor mode

[Diagram: dimension lattice with nodes *, t, p, c, tc, pc, o, to, po, m, tm, pm, cm, tcm, pcm, om, tom, pom]

Sensor/Target Scoreboard drill-downs in the context of multidimensional analysis:
Data classified along 3 dimensions
Drill-down to 3 nodes in the dimension lattice
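The dimension lattice can be made concrete with a small roll-up sketch: for each subset of the three dimensions, the remaining dimensions collapse to “*” (any). The reports and field names below are invented for illustration, not the SDG implementation:

```python
from collections import Counter
from itertools import combinations

# Hypothetical contact reports: (platform_type, target_class, sensor_mode).
reports = [
    ("UAV", "vehicle", "EO"),
    ("UAV", "vehicle", "IR"),
    ("UAV", "person",  "EO"),
    ("radar", "vehicle", "RF"),
]

dims = ("t", "c", "m")   # platform type, target class, sensor mode

def cube(reports):
    """Counts for every node of the dimension lattice: each subset of
    dimensions is kept, and the rest are rolled up to '*' (any)."""
    counts = {}
    for keep_n in range(len(dims) + 1):
        for keep in combinations(range(len(dims)), keep_n):
            key = "".join(dims[i] for i in keep) or "*"
            counts[key] = Counter(
                tuple(r[i] if i in keep else "*" for i in range(len(dims)))
                for r in reports
            )
    return counts

c = cube(reports)
print(c["tc"][("UAV", "vehicle", "*")])   # 2 -- drill-down to type x class
print(c["*"][("*", "*", "*")])            # 4 -- fully rolled up
```

A drill-down is then just a lookup at a deeper lattice node, e.g. moving from the fully rolled-up “*” node to the “tc” (platform type x target class) scoreboard.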
Cube Dimension Editor
SDG: Broader Applicability?

Scalable Data Grid: a distributed data management application/middleware that effectively:
Collects and stores high volumes of data at very high data rates from geographically distributed sources
Accesses, queries and analyzes the distributed data
Utilizes the distributed computing resources on HPC
Provides a multidimensional framework for viewing the data
Potential application areas:
Large-scale distributed simulations
Instrumented live training exercises
High-volume instrumented physics research
Virtually any distributed data environment using HPC
K-Means Testing

K-Means is a popular data mining algorithm
The K-Means algorithm requires three inputs:
- an integer k to indicate the number of desired clusters
- a distance function over the data instances
- the set of n data instances to be clustered
Typically, a data instance is represented as a vector
The output of the algorithm is a set of k points representing the means of the k clusters
Each of the n data instances is assigned to the nearest cluster mean based on the distance function
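The algorithm just described can be sketched in a few lines of Python. This is a toy implementation (with a simple first-k initialization and Euclidean distance), not the K-Means code actually tested on the SDG; the data points are invented:

```python
from math import dist

def k_means(points, k, iters=20):
    """Plain k-means: initialize the k means (here, the first k
    points), then alternate assigning each point to its nearest
    mean and recomputing each mean as its cluster's centroid."""
    means = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(p, means[i]))].append(p)
        means = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
    return means, clusters

# Three well-separated 2-D blobs; with k = 3 each blob gets one mean.
pts = [(0, 0), (10, 10), (20, 0), (1, 0), (0, 1), (11, 10), (21, 1)]
means, clusters = k_means(pts, 3)
print(sorted(len(c) for c in clusters))   # [2, 2, 3]
```

Note that plain k-means is sensitive to initialization; production data-mining runs typically use several random restarts or a smarter seeding scheme.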
K-Means Clustering of Three Data Points

[Scatter plot: “Clustering of 3 Clusters”; x axis roughly -6000 to 2000, y axis roughly -2000 to 5000]
Conclusions

Large-scale Distributed Simulations demand more power
Technology can deliver that power
This paper set out three emerging technologies:
GPGPU Acceleration of Linux Clusters
High Bandwidth Interest Managed Communications
Distributed Data Management
This work shows Simulation Researchers’ new abilities:
To generate simulations
To more effectively move the data around
To more efficiently store the data
To better analyze the data
Researchers would benefit from understanding the technologies’ power, limitations and costs
Research Funded by JFCOM and AFRL
This material is based on research sponsored by the U.S. Joint Forces Command via a contract with the Lockheed Martin Corporation and SimIS, Inc., and on research sponsored by the Air Force Research Laboratory under agreement numbers F30602-02-C-0213 and FA8750-05-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government. Approved for public release; distribution is unlimited.