Session Outline
My Brief Background: Education and Work Experiences
Ph.D. Thesis Research: Message-based MVC Architecture for Distributed and Desktop Applications
Recent Research Project: High Performance Multi-core Runtime
My Brief Background I
1987 ─ 1991 Computer Science program at Beihang University
CS was viewed as a promising field to get into at the time. Four years of foundation courses, computer hardware & software courses, labs, projects, and an internship. Programming languages used included assembly language, Basic, Pascal, Fortran 77, Prolog, Lisp, and C. Programming environments included DOS, Unix, Windows, and Macintosh.
1995 ─ 1998 Computer Science graduate program at Beihang University
Graduate Research Assistant at the National Lab of Software Development Environment. Participated in a team project, SNOW (shared memory network of workstations), working on an improved algorithm for a parallel I/O subsystem based on the two-phase method and MPI I/O.
1991 ─ 1998 Faculty at Beihang University: Assistant Lecturer & Lecturer, teaching Database and Introduction to Computing courses.
My Brief Background II
1998 ─ 2000 M.S., Computer Information Science program at Syracuse University
2000 ─ 2005 Ph.D., Computer Information Science program at Syracuse University
The thesis project involved surveying, designing, and evaluating a new paradigm for the next generation of rich media software applications that unifies legacy desktop and Internet applications with automatic collaboration and universal access capabilities. Attended conferences to present research papers and exhibit projects.
Awarded the Syracuse University Fellowship from 1998 to 2001 and named Outstanding Graduate Student of the College of Electrical Engineering and Computer Science in 2005.
May 2005 ─ present Visiting Researcher at Community Grids Lab, Indiana University
June ─ November 2006 Software Project Lead at Anabas Inc.: Analysis of Concurrency and Coordination Runtime (CCR) and Decentralized System Services (DSS) for Parallel and Distributed Computing
Message-based MVC (M-MVC)
Research Background
Architecture of Message-based MVC
Collaboration Paradigms
SVG Experiments
Performance Analysis
Summary of Thesis Research
Research Background
Motivations
CPU speed (Moore's Law) and network bandwidth (Gilder's Law) continue to improve, bringing fundamental changes
Internet and Web technologies have evolved into a global information infrastructure for sharing resources
Applications are getting increasingly sophisticated
Internet collaboration is enabling virtual enterprises
Large-scale distributed computing requires a new application architecture that adapts to fast technology change, with properties such as simplicity, reusability, scalability, reliability, and performance
The general area is technology support for synchronous and asynchronous resource sharing:
e-learning (e.g. video/audio conferencing)
e-science (e.g. large-scale distributed computing)
e-business (e.g. virtual organizations)
e-entertainment (e.g. online games)
Research on a generic model for building applications. Application domains:
Distributed (Web): Service Oriented Architecture and Web Services
Desktop (Client): Model-View-Controller (MVC) paradigm
Internet collaboration: Hierarchical Web Service pipeline model
Architecture of Message-based MVC
A comparison of MVC, Web Service Pipeline, and Message-based MVC
Features of the Message-based MVC Paradigm
M-MVC is a general approach to building applications with a message-based paradigm. It emphasizes a universal modularized service model with messaging linkage, and it converges desktop applications, Web applications, and Internet collaboration.
MVC and Web Services are the fundamental architectures for desktop and Web applications, and the Web Service pipeline model provides the general collaboration architecture for distributed applications; M-MVC is a uniform architecture integrating these models.
M-MVC allows automatic collaboration, which simplifies the architecture design.
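As an illustration of the messaging linkage, here is a minimal sketch of our own (not thesis code): the message types are invented, and in-process queues stand in for the publish/subscribe broker (such as NaradaBrokering) that a real deployment would use.

// Illustrative M-MVC sketch: View and Model interact only through messages.
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

record UiEvent(string Type, int X, int Y);      // hypothetical event message
record RenderCommand(string Description);       // hypothetical rendering message

class MessageBasedMvcSketch
{
    static void Main()
    {
        var events = new BlockingCollection<UiEvent>();        // View -> Model
        var renders = new BlockingCollection<RenderCommand>(); // Model -> View

        // Model: consumes semantic events and produces rendering messages.
        var model = Task.Run(() =>
        {
            foreach (var e in events.GetConsumingEnumerable())
                renders.Add(new RenderCommand(
                    string.Format("redraw after {0} at ({1},{2})", e.Type, e.X, e.Y)));
            renders.CompleteAdding();
        });

        // View: publishes raw UI events instead of calling Model methods.
        events.Add(new UiEvent("mousedown", 10, 20));
        events.Add(new UiEvent("mouseup", 10, 20));
        events.CompleteAdding();

        // Display stage: consumes rendering messages.
        foreach (var r in renders.GetConsumingEnumerable())
            Console.WriteLine(r.Description);
        model.Wait();
    }
}

Because both links are plain message streams, additional Views can subscribe to the same Model's output, which is what makes the automatic collaboration noted above possible.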
[Figure: a) the MVC model (Model, View, Controller); b) the three-stage pipeline (Controller, View, Display), whose messages contain control information; and Message-based MVC, shown as the decomposition of the SVG browser: the View (user interface, comprising Raw UI and Display) and the Model (Web Service, comprising Semantic and High Level UI) exchange events and rendering output as messages through input and output ports.]
Collaboration Paradigms I
SMMV vs. MMMV as MVC interaction patterns
Flynn's Taxonomy classifies parallel computing platforms into four types: SISD, MISD, SIMD, and MIMD.
SIMD – A single control unit dispatches instructions to each processing unit.
MIMD – Each processor is capable of executing a different program independently of the other processors, enabling asynchronous processing.
SMMV generalizes the concept of SIMD.
MMMV generalizes the concept of MIMD.
In practice, the SMMV and MMMV patterns can be applied in both asynchronous and synchronous applications, and thus form general collaboration paradigms.
[Figure: a) Single Model Multiple View — one Model serving Views 1 through n; b) Multiple Model Multiple View — Models 1 through m paired with Views 1 through n.]
Collaboration Paradigms II
Monolithic collaboration: CGL applications of PowerPoint, OpenOffice, and data visualization.
Collaboration paradigms deployed with the M-MVC model:
SMMV (e.g. instructor-led learning)
MMMV (e.g. participatory learning)
[Figure: collaboration deployments over NaradaBrokering. Monolithic collaboration: a master SVG browser client and other clients run identical programs receiving identical events through the broker. SMMV: one Model as a Web Service serves a master client View and other client Views through a broker. MMMV: each View (master client and other clients) has its own Model as a Web Service, all linked by brokers.]
SVG Experiments I
Monolithic SVG experiments:
Collaborative SVG Browser
Collaborative SVG Chess game, with Players and Observers
SVG Experiments II
Decomposed the SVG browser into stages of a pipeline, with timing points:
T0: A given user event, such as a mouse click, is sent from the View to the Model.
T1: A user event can generate multiple associated DOM change events transmitted from the Model to the View; T1 is the arrival time at the View of the first of these.
T2: The arrival of the last of these events from the Model, and the start of the processing of the set of events in the GVT tree.
T3: The start of the rendering stage.
T4: The end of the rendering stage.
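The statistics reported in the tables below (mean ± error, with standard deviation) can be derived from per-event timestamps. Here is a small sketch, assuming "error" denotes the standard error of the mean, which is our reading of the tables.

// Sketch: summarize one timing column, e.g. T1 - T0, from timestamp pairs.
using System;
using System.Linq;

static class TimingStats
{
    // t0[i] and t1[i]: send time and first-return arrival time of event i (ms).
    public static (double Mean, double Error, double StdDev) Summarize(
        double[] t0, double[] t1)
    {
        double[] deltas = t0.Zip(t1, (send, ret) => ret - send).ToArray();
        double mean = deltas.Average();
        // Sample standard deviation; assumes at least two events were logged.
        double stddev = Math.Sqrt(
            deltas.Sum(d => (d - mean) * (d - mean)) / (deltas.Length - 1));
        double error = stddev / Math.Sqrt(deltas.Length); // standard error of mean
        return (mean, error, stddev);
    }
}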
[Figure: the decomposed SVG browser spread across machines A, B, and C, with the Model (Service) and Views (Clients) linked by the NaradaBrokering notification service. A user input event (sent at T0) travels from a View to the Model, where a JavaScript event processor mutates the DOM tree; the resulting DOM change events are mirrored back to each View's DOM tree (first arrival T1, last arrival T2), the GVT tree processes the set of events, and rendering runs from T3 to T4.]
Performance Analysis I
Average Performance of Mouse Events
Each cell gives mean ± error, with the standard deviation in parentheses; all times are in milliseconds. "Mousedown" columns use mousedown events only; "all events" columns average mousedown, mousemove, and mouseup.

Test | Distance | NB location | Mousedown: first return - send (T1-T0) | All events: first return - send (T1-T0) | All events: last return - send (T'1-T0) | All events: end rendering (T4-T0)
-----|----------|-------------|------------------|------------------|-------------------|-------------------
1 | Switch connects | Desktop server | 33.6 ± 3.0 (14.8) | 37.9 ± 2.1 (18.7) | 48.9 ± 2.7 (23.7) | 294.0 ± 20.0 (173.0)
2 | Switch connects | High-end desktop server | 18.0 ± 0.57 (2.8) | 18.9 ± 0.89 (9.07) | 31.0 ± 1.7 (17.6) | 123.0 ± 8.9 (91.2)
3 | Office area | Linux server | 14.9 ± 0.65 (2.8) | 21.0 ± 1.3 (10.2) | 43.9 ± 2.6 (20.5) | 414.0 ± 24.0 (185.0)
4 | Within-city (campus area) | Linux cluster node server | 20.0 ± 1.1 (4.8) | 29.7 ± 1.5 (13.6) | 49.5 ± 3.0 (26.3) | 334.0 ± 22.0 (194.0)
5 | Inter-city | Solaris server | 17.0 ± 0.91 (4.3) | 24.8 ± 1.6 (12.8) | 48.4 ± 3.0 (23.3) | 404.0 ± 20.0 (160.0)
6 | Inter-city | Solaris server | 20.0 ± 1.3 (6.4) | 29.6 ± 1.7 (15.3) | 50.5 ± 3.4 (26.0) | 337.0 ± 22.0 (189.0)
Performance Analysis II
Immediate bouncing back event
Each cell gives mean ± error, with the standard deviation in parentheses; all times are in milliseconds. "All events" columns average mousedown, mousemove, and mouseup.

Test | Distance | NB location | Bounce back - send time | All events: first return - send (T1-T0) | All events: last return - send (T'1-T0) | All events: end rendering (T4-T0)
-----|----------|-------------|-------------------------|------------------|-------------------|-------------------
1 | Switch connects | Desktop server | 36.8 ± 2.7 (19.0) | 52.1 ± 2.8 (19.4) | 68.0 ± 3.7 (25.9) | 405.0 ± 23.0 (159.0)
2 | Switch connects | High-end desktop server | 20.6 ± 1.3 (12.3) | 29.5 ± 1.5 (13.8) | 49.5 ± 3.1 (29.4) | 158.0 ± 12.0 (109.0)
3 | Office area | Linux server | 24.3 ± 1.5 (11.0) | 36.3 ± 1.9 (14.2) | 54.2 ± 2.9 (21.9) | 364.0 ± 22.0 (166.0)
4 | Within-city (campus area) | Linux cluster node server | 15.4 ± 1.1 (7.6) | 26.9 ± 1.6 (11.6) | 46.7 ± 2.9 (20.6) | 329.0 ± 25.0 (179.0)
5 | Inter-city | Solaris server | 18.1 ± 1.3 (8.8) | 31.8 ± 2.2 (14.5) | 54.6 ± 4.9 (32.8) | 351.0 ± 27.0 (179.0)
6 | Inter-city | Solaris server | 21.7 ± 1.4 (9.8) | 37.8 ± 2.7 (19.3) | 55.6 ± 3.4 (23.6) | 364.0 ± 25.0 (176.0)
Performance Analysis III
Basic NB performance with 2 hops and 4 hops. Each cell gives mean ± error, with the standard deviation in parentheses, in milliseconds.

Test | 2 hops (View - Broker - View) | 4 hops (View - Broker - Model - Broker - View)
-----|-------------------------------|------------------------------------------------
1 | 7.65 ± 0.61 (3.78) | 13.4 ± 0.98 (6.07)
2 | 4.46 ± 0.41 (2.53) | 11.4 ± 0.66 (4.09)
3 | 9.16 ± 0.60 (3.69) | 16.9 ± 0.79 (4.85)
4 | 7.89 ± 0.61 (3.76) | 14.1 ± 1.1 (6.95)
5 | 7.96 ± 0.60 (3.68) | 14.0 ± 0.74 (4.54)
6 | 7.96 ± 0.60 (3.67) | 16.8 ± 0.72 (4.47)

Configurations measured:
NB on Model; Model and View on two 1.5 GHz desktop PCs; local switch network connection.
NB on View; Model and View on two desktop PCs, with a "high-end" graphics Dell (3 GHz Pentium) for the View and a 1.5 GHz Dell for the Model; local switch network connection.
Comparison of performance results to highlight the importance of the client
[Figure: two histograms of message transit time (T1-T0, in milliseconds) in the M-MVC Batik browser, plotted as events per 5 ms bin for all events and separately for mousedown, mouseup, and mousemove. Configuration: NB on View; Model and View on two desktop PCs; local switch network connection; NB version 0.97; TCP blocking protocol; normal thread priority for NB; JMS interface; no echo of messages from Model.]
[Figure: the corresponding histogram with NB on a local 2-processor Linux server; Model and View on two 1.5 GHz desktop PCs; local switch network connection.]
[Figure: the corresponding histogram with NB on the 8-processor Solaris server ripvanwinkle; Model and View on two 1.5 GHz desktop PCs; remote network connection through routers.]
Comparison of performance results with local and remote NB locations
Observations
This client-to-server-and-back transit time is only 20% of the total processing time in the local examples. The overhead of the Web service decomposition is not directly measured in the tests shown in these tables.
The changes in T1-T0 in each row reflect the different network transit times as we move the server from local to organization locations.
The overhead of NaradaBrokering itself is 5-15 milliseconds in simple stand-alone measurements, depending on the operating mode of the broker. It consists of forming message objects, serialization, and network transit time over four hops (client to broker, broker to server, server to broker, broker to client).
The contribution of NaradaBrokering to T1-T0 is about 30 milliseconds in preliminary measurements, due to the extra thread scheduling inside the operating system and the interfacing with the complex SVG application.
We expect the main impact to come from the algorithmic effect of breaking the code into two, the network and broker overhead, and thread scheduling by the OS.
We expect our architecture to work dramatically better on multi-core chips.
Further, the Java runtime has poor thread performance and can be made much faster.
Summary of Thesis Research
Proposing an "explicit Message-based MVC" paradigm (M-MVC) as the general architecture of Web applications.
Demonstrating an approach to building "collaboration as a Web service" through monolithic SVG experiments.
Bridging the gap between desktop and Web applications by leveraging an existing desktop application with a Web service interface through "M-MVC in a publish/subscribe scheme".
As an experiment, we converted a desktop application into a distributed system by changing the architecture from method-based MVC to message-based MVC.
Proposing Multiple Model Multiple View (MMMV) and Single Model Multiple View (SMMV) collaboration as the general architecture of the "collaboration as a Web service" model.
Identifying some of the key factors that influence the performance of message-based Web applications, especially those with rich Web content, high client interactivity, and complex rendering.
High Performance Multi-core Runtime
Multi-core architectures are expected to be the future of "Moore's Law", with single-chip performance coming from parallelism across multiple cores rather than from increased clock speed and sequential architecture improvements.
This implies parallelism should be used in all applications, not just the familiar scientific and engineering areas.
The runtime could be message passing for all cases. It is interesting to compare, and to try to unify, the runtimes for MPI (classic scientific technology), Objects, and Services, which are all message based.
We have finished an analysis of the Concurrency and Coordination Runtime (CCR) and the DSS service runtime.
Research Question: What is the "core" multicore runtime, and what is its performance?
Many parallel and/or distributed programming models are supported by a runtime consisting of long-running or dynamic threads exchanging messages.
Those coming from distributed computing often have overheads of a millisecond or more when ported to multicore (see the M-MVC thesis results earlier).
We need microsecond-level performance on all models, like the best MPI. Examination of Microsoft CCR suggests this will be possible: current CCR thread spawning in MPI mode has a 2-4 microsecond overhead, and two-way service-style messages take around 30 microseconds.
What are the messaging primitives (adding to MPI), and what is their performance?
Messaging Model | Software | Typical Applications
----------------|----------|----------------------
Streamed (streamed dataflow; SOA) | CCA, CCR, DSS, Apache Synapse, Grid Workflow | Dataflow as in AVS; Image Processing; Grids; Web Services
Spawned (tree search) | CCR | Optimization; Computer Chess
Spawned (queued) | openRTI, CERTI | Discrete Event simulations; Ordered Transactions; "war game" style simulations
Rendezvous (message parallelism) | MPI: openMPI, MPICH2 | Loosely Synchronous applications including engineering & science; rendering
Publish-Subscribe (Enterprise Service Bus) | NaradaBrokering, Mule, JMS | Content Delivery; Message Oriented Middleware
Overlay Networks (Peer-to-Peer) | Jabber, JXTA, Pastry | Skype; Instant Messengers
Intel Fall 2005 Multicore Roadmap
March 2006 Sun T1000 8 core Server and December 2006 Dell Intel-based 2 Processor, each with 4 Cores
Summary of CCR and DSS Project
CCR is a message-based runtime supporting interacting concurrent threads with high efficiency.
It replaces the CLR thread pool with iteration.
DSS is a Service (not a Web Service) environment designed for robotics, which has many control and analysis modules implemented as services and linked by workflow. DSS is built on CCR and released by Microsoft.
We used a 2-processor, 2-core AMD Opteron and a 2-processor, 2-core Intel Xeon and looked at CCR and DSS performance.
For CCR we chose message patterns similar to those used in MPI. For DSS we chose simple one-way and two-way message exchanges between 2 services.
This is a first step in examining the possibility of linking scientific and more general runtimes and seeing whether we can get very high performance in all cases. We see, for example, about 50 times better performance than the Java runtime used in the thesis.
Implementing CCR Performance Measurements
CCR is written in C#, and we built a suite of test programs in this language.
Multi-threaded performance analysis tools:
On the AMD machine there is the free CodeAnalyst Performance Analyzer. It allows one to see how work is assigned to threads, but it cannot look at the microsecond resolution needed for this work.
The Intel thread analyzer (VTune) does not currently support C# or Java.
The Microsoft Visual Studio 2005 Team Suite Performance Analyzer does not support WOW64 or x64 yet.
We looked at several thread message exchange patterns similar to the basic Exchange and Shift in MPI.
We took a basic computation whose smallest unit took about 1.4 (AMD) to 1.5 (Intel) microseconds.
We typically ran 10^7 such units on each core, taking 14 or 15 seconds.
We divided this run into from 1 to 10^7 stages; at the end of each stage the threads sent messages (in various patterns) to the threads that continued the computation.
We measured total execution time as a function of the number of stages used, with a 1-stage run having no overheads. A sketch of this measurement loop is shown below.
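The following is a minimal sketch of the loop, assuming the Microsoft.Ccr.Core assembly is referenced; the compute kernel, the constants, and the CountdownEvent join are illustrative stand-ins rather than the actual test code (the real tests implemented the join with CCR receives).

// Sketch only: kernel, sizes, and the CountdownEvent join are stand-ins.
using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.Ccr.Core;

class PipelineTiming
{
    const int Cores = 4;

    // Stand-in for the basic unit of computation (~1.4-1.5 microseconds).
    static double ComputeUnit(double x)
    {
        for (int i = 0; i < 50; i++) x = 0.5 * x + 1.0;
        return x;
    }

    static void Main()
    {
        const int totalUnits = 10_000_000;   // 10^7 units per core
        const int stages = 1000;             // the real runs sweep 1 .. 10^7
        int unitsPerStage = totalUnits / stages;

        using (var dispatcher = new Dispatcher(Cores, "pool"))
        using (var queue = new DispatcherQueue("stages", dispatcher))
        {
            var work = new Port<int>();
            var stageDone = new CountdownEvent(Cores);

            // Persistent receiver: every posted message is dispatched as a task.
            Arbiter.Activate(queue, Arbiter.Receive(true, work, stage =>
            {
                double x = stage;
                for (int u = 0; u < unitsPerStage; u++) x = ComputeUnit(x);
                stageDone.Signal();          // join point for this stage
            }));

            var sw = Stopwatch.StartNew();
            for (int s = 0; s < stages; s++)
            {
                stageDone.Reset(Cores);
                for (int c = 0; c < Cores; c++) work.Post(s); // one task per core
                stageDone.Wait();            // all cores finish before next stage
            }
            sw.Stop();
            Console.WriteLine("{0} stages: {1:F2} s", stages, sw.Elapsed.TotalSeconds);
        }
    }
}

Plotting the measured time against the number of stages, as in the figures below, gives the per-stage overhead as the slope above the constant computation component.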
Typical Thread Analysis Data View
[Figure: the Pipeline message flow — threads Thread0-Thread3, each with its own port, post messages that trigger the handlers running each successive stage.]
Pipeline, which is the simplest loosely synchronous execution in CCR. Note that CCR supports a thread spawning model, whereas MPI usually uses fixed threads with message rendezvous; a minimal spawning sketch is shown below.
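A minimal sketch of the spawning style, assuming Microsoft.Ccr.Core is referenced: posting to a port with an activated receiver causes a handler task to be scheduled on a dispatcher thread. The port name and message are illustrative.

// Minimal CCR spawning sketch.
using System;
using Microsoft.Ccr.Core;

class SpawnSketch
{
    static void Main()
    {
        using (var dispatcher = new Dispatcher(4, "pool"))
        using (var queue = new DispatcherQueue("q", dispatcher))
        {
            var port = new Port<string>();
            // persist: false => one-shot receiver, i.e. a spawned task per post
            Arbiter.Activate(queue, Arbiter.Receive(false, port,
                msg => Console.WriteLine("stage ran with message: " + msg)));
            port.Post("pipeline stage 0");
            Console.ReadLine();   // crude wait so the pool thread can run
        }
    }
}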
[Figure: threads Thread0-Thread3 each post a message that is gathered at an EndPort.]
Idealized loosely synchronous endpoint (broadcast) in CCR; an example of an MPI collective in CCR.
[Figure: four communication patterns, each among threads Thread0-Thread3 with their ports: (a) Pipeline; (b) Shift; (c) Two Shifts; (d) Exchange, where messages are exchanged with a 1D torus topology for loosely synchronous execution in CCR, each thread writing its exchanged messages to the other threads' ports and reading the messages posted to its own.]
Four communication patterns used in the CCR tests: (a) and (b) use CCR Receive, while (c) and (d) use CCR Multiple Item Receive; a sketch of the latter is shown below.
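Here is a hedged sketch of the Multiple Item Receive join for the Exchange pattern, assuming Microsoft.Ccr.Core; the exact MultipleItemReceive overload may differ between CCR releases, and the port wiring is illustrative.

// Sketch of an Exchange-style join using Multiple Item Receive.
using System;
using Microsoft.Ccr.Core;

class ExchangeSketch
{
    const int Threads = 4;

    static void Main()
    {
        using (var dispatcher = new Dispatcher(Threads, "pool"))
        using (var queue = new DispatcherQueue("q", dispatcher))
        {
            var ports = new Port<double>[Threads];
            for (int i = 0; i < Threads; i++) ports[i] = new Port<double>();

            for (int i = 0; i < Threads; i++)
            {
                int me = i;
                // Fire once, after the Threads-1 partner messages have arrived.
                Arbiter.Activate(queue, Arbiter.MultipleItemReceive(false,
                    ports[me], Threads - 1,
                    (double[] items) => Console.WriteLine(
                        "thread {0} joined on {1} messages", me, items.Length)));
            }

            // Exchange step: every thread posts to every other thread's port.
            for (int src = 0; src < Threads; src++)
                for (int dst = 0; dst < Threads; dst++)
                    if (dst != src) ports[dst].Post(src);

            Console.ReadLine();   // crude wait for the dispatcher threads
        }
    }
}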
[Figure: fixed amount of computation (4·10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron multicore, each stage separated by reading and writing CCR ports in Pipeline mode (4-way pipeline pattern, 4 dispatcher threads). Run time in seconds is plotted against stages (millions); the overhead above the constant computation component averages 8.04 microseconds per stage from 1 to 10 million stages.]
[Figure: the same measurement on Dell Xeon multicore, where the overhead averages 12.40 microseconds per stage from 1 to 10 million stages.]
Summary of Stage Overheads for AMD Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 28 microseconds (500,000 stages). Overheads are in microseconds; columns give the number of parallel computations.

Pattern | 1 | 2 | 3 | 4 | 8
--------|---|---|---|---|---
Straight Pipeline (match) | 0.77 | 2.4 | 3.6 | 5.0 | 8.9
Straight Pipeline (default) | 3.6 | 4.7 | 4.4 | 4.5 | 8.9
Shift (match) | N/A | 3.3 | 3.4 | 4.7 | 11.0
Shift (default) | N/A | 5.1 | 4.2 | 4.5 | 8.6
Two Shifts (match) | N/A | 4.8 | 7.7 | 9.5 | 26.0
Two Shifts (default) | N/A | 8.3 | 9.0 | 9.7 | 24.0
Exchange (match) | N/A | 11.0 | 15.8 | 18.3 | Error
Exchange (default) | N/A | 16.8 | 18.2 | 18.6 | Error
Summary of Stage Overheads for Intel Machine
These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns; each stage takes about 30 microseconds. AMD overheads are in parentheses. These measurements are equivalent to MPI latencies. Overheads are in microseconds; columns give the number of parallel computations.

Pattern | 1 | 2 | 3 | 4 | 8
--------|---|---|---|---|---
Straight Pipeline (match) | 1.7 (0.77) | 3.3 (2.4) | 4.0 (3.6) | 9.1 (5.0) | 25.9 (8.9)
Straight Pipeline (default) | 6.9 (3.6) | 9.5 (4.7) | 7.0 (4.4) | 9.1 (4.5) | 16.9 (8.9)
Shift (match) | N/A | 3.4 (3.3) | 5.1 (3.4) | 9.4 (4.7) | 25.0 (11.0)
Shift (default) | N/A | 9.8 (5.1) | 8.9 (4.2) | 9.4 (4.5) | 11.2 (8.6)
Two Shifts (match) | N/A | 6.8 (4.8) | 13.8 (7.7) | 13.4 (9.5) | 52.7 (26.0)
Two Shifts (default) | N/A | 23.1 (8.3) | 24.9 (9.0) | 13.4 (9.7) | 31.5 (24.0)
Exchange (match) | N/A | 28.0 (11.0) | 32.7 (15.8) | 41.0 (18.3) | Error
Exchange (default) | N/A | 34.6 (16.8) | 36.1 (18.2) | 41.0 (18.6) | Error
AMD Bandwidth Measurements
• Previously we measured latency, since those measurements corresponded to small messages. We did a further set of bandwidth measurements by exchanging larger messages of different sizes between threads.
• We used three types of data structures for receiving data:
– An array in the thread, equal to the message size
– An array outside the thread, equal to the message size
– Data stored sequentially in a large array (a "stepped" array)
• For both AMD and Intel, total bandwidth is 1 to 2 gigabytes/second.
Bandwidths in gigabytes/second summed over 4 cores; each buffer-type column gives small / large message bandwidths.

Number of stages | Array inside thread | Array outside threads | Stepped array outside thread | Approx. compute time per stage (µs)
-----------------|---------------------|-----------------------|------------------------------|-------------------------------------
250000 | 0.90 / 0.96 | 1.08 / 1.09 | 1.14 / 1.10 | 56.0
2500 | 0.89 / 0.99 | 1.16 / 1.11 | 1.14 / 1.13 (1.13 up to 10^7 words) | 56.0

Additional runs: 5000 stages, 1.19 / 1.15, compute time 2800 µs; 200000 stages, 1.15 / 1.13 (1.13 up to 10^7 words), compute time 70 µs.
Intel Bandwidth Measurements
• For bandwidth, the Intel did better than the AMD, especially when one exploited the on-chip cache with small transfers.
• For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large), or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double-precision words per core at a gigabyte/second total bandwidth takes 3200 µs. The data to be copied (the message payload in CCR) is fixed, and its creation time is outside the timed process. A sketch of the stepped-array copy is shown below.
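Here is a hedged sketch of the stepped-array copy being timed, with illustrative sizes (the real runs swept message sizes and stage counts as described above, and reused the same fixed payload each stage).

// Sketch of the stepped-array bandwidth measurement.
using System;
using System.Diagnostics;

class SteppedCopySketch
{
    static void Main()
    {
        const int words = 100_000;     // "small" message: 1e5 double words
        const int stages = 1000;
        const int slots = 100;         // reuse stepped slots to bound memory here

        var payload = new double[words];          // fixed payload, made before timing
        var stepped = new double[words * slots];  // large array receiving the copies

        var sw = Stopwatch.StartNew();
        for (int s = 0; s < stages; s++)
            Array.Copy(payload, 0, stepped, (s % slots) * words, words);
        sw.Stop();

        double gb = (double)words * stages * sizeof(double) / 1e9;
        Console.WriteLine("{0:F2} GB/s (single core)", gb / sw.Elapsed.TotalSeconds);
    }
}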
Bandwidths in gigabytes/second summed over 4 cores; each buffer-type column gives small / large message bandwidths.

Number of stages | Array inside thread | Array outside threads | Stepped array outside thread | Approx. compute time per stage (µs)
-----------------|---------------------|-----------------------|------------------------------|-------------------------------------
250000 | 0.84 / 0.75 | 1.92 / 0.90 | 1.18 / 0.90 | 59.5
2500 | 0.83 / 0.76 | 1.89 / 0.89 | 1.16 / 0.89 | 59.5
2500 | 1.74 / 0.9 | 2.0 / 1.07 | 1.78 / 1.06 | 5950

Additional runs: 200000 stages, 1.21 / 0.91, compute time 74.4 µs; 5000 stages, 1.75 / 1.0, compute time 2970 µs.
[Figure: typical bandwidth measurement showing the effect of cache as a slope change. Run time (seconds) for 5,000 stages is plotted against the size (millions of double words) of the array copied in each stage from a thread to stepped locations in a large array, on Dell Xeon multicore (4-way pipeline pattern, 4 dispatcher threads). Total bandwidth is 1.0 gigabytes/sec up to one million double words and 1.75 gigabytes/sec up to 100,000 double words.]
DSS Service Measurements
[Figure: average run time (microseconds) plotted against the number of round trips (1 to 10,000): timing of HP Opteron multicore as a function of the number of simultaneous two-way service messages processed (November 2006 DSS release).]
CGL measurements of Axis 2 show about 500 microseconds; DSS is about 10 times better.
References
Thesis for download: http://grids.ucs.indiana.edu/~xqiu/dissertation.html
Thesis project: http://grids.ucs.indiana.edu/~xqiu/research.html
Publications and presentations: http://grids.ucs.indiana.edu/~xqiu/publication.html
NaradaBrokering open source messaging system: http://www.naradabrokering.org
Community Grids Lab projects and publications: http://grids.ucs.indiana.edu/ptliupages/
Xiaohong Qiu, Geoffrey Fox, and Alex Ho, "Analysis of Concurrency and Coordination Runtime CCR and DSS for Parallel and Distributed Computing", technical report, November 2006.
Shameem Akhter and Jason Roberts, "Multi-Core Programming", Intel Press, April 2006.