Parallel Programming in the Real World! Andrew Brownsword (Intel), Niall Dalton, Goetz Graefe (HP Labs), Russell Williams (Adobe). Moderator: Luis Ceze, UW-CSE


Page 1:

Parallel Programming in the Real World!

•  Andrew Brownsword, Intel
•  Niall Dalton
•  Goetz Graefe, HP Labs
•  Russell Williams, Adobe

•  Moderator: Luis Ceze, UW-CSE

Page 2:

Andrew Brownsword

MIC SW Architect

Intel Corporation

Page 3:

“real” software is…

Large, unwieldy, and long-lived

Much longer & larger than intended

especially by the authors!

Written by many people

With widely varying skills, styles, agendas, and experiences

Many have moved on to [next_task..next_life]

Hard to change

Even with language-aware tools

Page 4:

“real” software is…

Expressed in and defined by programming models

Many levels of abstraction, concreteness, explicitness, constraints

○ Best balance depends on goals & changes over time

Compilers only understand the programming model…

…not the abstractions built in it

Page 5:

“real” parallel software is…

Supposed to be fast

But performance is extremely fragile

Supposed to be robust

Data races, deadlocks, livelocks, etc.

Composability

Supposed to be maintainable

Critical aspects often hidden in the details
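The robustness pitfalls above (races, deadlocks) often have structural fixes. As an illustrative sketch (not from the talk): C++'s std::scoped_lock acquires multiple mutexes with a built-in deadlock-avoidance algorithm, so two transfers that grab the same pair of locks in opposite orders cannot deadlock.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Two accounts, each guarded by its own mutex. A naive transfer that locks
// "from" then "to" can deadlock when two threads transfer in opposite
// directions. std::scoped_lock acquires both locks all-or-nothing, so the
// lock order no longer matters.
struct Account {
    std::mutex m;
    long balance = 0;
};

void transfer(Account& from, Account& to, long amount) {
    std::scoped_lock lk(from.m, to.m);  // deadlock-free multi-lock acquisition
    from.balance -= amount;
    to.balance   += amount;
}

// Run opposing transfers concurrently; returns the (conserved) total balance.
long run_opposing_transfers() {
    Account a, b;
    a.balance = 100;
    b.balance = 100;
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

The same pattern composes poorly across module boundaries, which is exactly the composability concern the slide raises.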

Page 6:

1. Brief survey of the “real” landscape

2. Implications for programming models

Page 7:

Games

Page 8:

Games – code & platform

Medium-to-large codebases: 50K-5M lines of code, largely C++

May have large tool chains & online infrastructure

Many diverse sub-systems running at once: “soft” real-time, broad range of data set sizes

Frame-oriented scheduling (mostly)

Many sequencing dependencies between tasks

Target hardware “is what it is”: phones to servers, performance is critical

Multi-core, heterogeneous, SIMD, GPU, networks
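The frame-oriented scheduling with sequencing dependencies mentioned above can be sketched as a tiny dependency-ordered task runner (task names and structure are hypothetical; a real engine would dispatch independent tasks to worker threads rather than run them serially):

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Each named task lists the tasks that must complete before it this frame.
struct Task {
    std::vector<std::string> deps;
    std::function<void()> work;
};

// Runs one frame's tasks in dependency order (simple DFS topological sort,
// graph assumed acyclic). Returns the order tasks actually ran in.
std::vector<std::string> run_frame(std::map<std::string, Task>& tasks) {
    std::vector<std::string> order;
    std::map<std::string, bool> done;
    std::function<void(const std::string&)> run = [&](const std::string& name) {
        if (done[name]) return;
        done[name] = true;                       // mark before recursing
        for (auto& d : tasks[name].deps) run(d); // satisfy dependencies first
        tasks[name].work();
        order.push_back(name);
    };
    for (auto& [name, t] : tasks) run(name);
    return order;
}
```

For example, with animation depending on physics and graphics depending on animation, physics always runs first regardless of how the tasks were registered.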

Page 9:

Games – many systems

Graphics

Environment

Audio

Animation

Game logic

AI & scripting

Physics

User Interface

Inputs

Network

I/O (streaming)

Data conversion / processing

Note that this slide is a gross over-simplification!

Page 10:

Games – dev process

Short development cycles: severe code churn, high pressure to deliver quickly

Middleware & game “engine” use is common

Rapidly changing feature requirements: fast iteration during development is critical

Code & architecture maintenance nightmare

Substantial volume & variety of media data: content team size greatly exceeds engineering’s

Porting between diverse platforms is common

Page 11:

HPC

Page 12:

HPC – code & platform

Widely varying codebases & domains: 10K-10M+ lines of code

Diverse programming models ○ Fortran, C/C++, MPI, OpenMP dominate

Few kernels*, BIG data*: correctness & robustness are critical*

Epic, titanic, gargantuan data sets*

Varied target hardware: workstations to large clusters

Often purchased for the application

* Usually

Page 13:

HPC – dev process

On-going development cycles: code is generally never re-written, lasts for decades

Huge, poorly understood legacy code

Heavy library use (math, solvers, communications, etc.)

Correctness, verifiability, robustness: dependability of results is crucial

Very, very long running times are common

Portability across generations & platforms: tuning ‘knobs’ exposed rather than changing the code

Outlasts HW, tools, vendors, prog. models, authors

Page 14:

Programming Models

Page 15:

Programming Models

Design of the model has a formative impact on software written in it:

Abstractions to avoid over-specifying details

Concreteness to allow control over the solution

Explicitness to keep critical detail visible

Constraints to allow effective optimization

For parallelism: top desirable attributes, and top factors to address

Page 16:

Desirable Attributes…

Page 17:

Integration

With other models: existing models enable gradual adoption

Peer models: there is no single silver bullet

Layered models enable DSLs & interop

With runtimes: interaction & interop within processes

Resource management (processors, memory, I/O)

With tools: build systems, analysis tools, design tools

Debuggers, profilers, etc.

Page 18:

Portability

Hardware, OS & vendor independence

Standard, portable models live longer

Investment in software is very expensive: re-writing is often simply not an option

Even seemingly small changes can be extraordinarily expensive

○ Testing & validation costs

○ Architectural implications

Page 19:

Composability

Real software is large and complex

Built by many people

Built out of components

Subject to intricate system level behaviours

Programming models must facilitate and support these aspects

Page 20:

Factors To Address…

Page 21:

Concurrent Execution

Multiple levels to achieve performance

Vectorization (SIMD & throughput optimization)

Parallelization (multi/many-core)

Distributed (cluster-level)

Internet (loosely coupled, client/server, cloud services)

Programming model needs to express each level

Each level brings >10x potential

Cannot afford a different decomposition at each level
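One way to read “cannot afford a different decomposition at each level” is that a single chunking scheme should serve SIMD lanes, threads, and cluster nodes alike. A minimal sketch (my own illustration, not from the talk; the worker loop stands in for real threads or MPI ranks):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Innermost level: a contiguous loop the compiler can vectorize (SIMD).
double sum_chunk(const std::vector<double>& v, std::size_t lo, std::size_t hi) {
    double s = 0.0;
    for (std::size_t i = lo; i < hi; ++i) s += v[i];
    return s;
}

// The same chunking that would split work across cluster nodes (distributed
// level) also splits each node's share across threads (parallel level).
// Here the workers run sequentially; the decomposition is what matters.
double hierarchical_sum(const std::vector<double>& v,
                        std::size_t nodes, std::size_t threads_per_node) {
    double total = 0.0;
    std::size_t workers = nodes * threads_per_node;
    std::size_t chunk = (v.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t lo = std::min(w * chunk, v.size());
        std::size_t hi = std::min(lo + chunk, v.size());
        total += sum_chunk(v, lo, hi);
    }
    return total;
}
```

Because every level consumes the same contiguous ranges, retargeting from 2 nodes of 4 threads to 8 nodes of 1 thread changes only the two counts, not the decomposition.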

Page 22:

Data Organization & Access

FLOPS are cheap, bandwidth is not: severe and worsening imbalance

No sign of this changing

Optimizing data access is usually key to achieving performance & power goals

Existing models do very little to address this: access patterns are usually implicit, layouts explicit

Changing data layout requires changing code

Different hardware, algorithms & models demand different layouts
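A concrete instance of “changing data layout requires changing code” is the classic array-of-structures vs. structure-of-arrays choice (an illustrative sketch, not from the slides): the same reduction needs a different loop, and a different data type, for each layout.

```cpp
#include <cassert>
#include <vector>

// Array-of-structures: each particle's fields sit together in memory, so
// a pass over one field (mass) also drags x, y, z through the cache.
struct ParticleAoS { float x, y, z, mass; };

float total_mass_aos(const std::vector<ParticleAoS>& ps) {
    float m = 0.0f;
    for (const auto& p : ps) m += p.mass;  // strided access
    return m;
}

// Structure-of-arrays: each field is contiguous, so the same reduction
// streams through memory and vectorizes cleanly. But every loop that
// touched ParticleAoS has to be rewritten against the new layout.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

float total_mass_soa(const ParticlesSoA& ps) {
    float m = 0.0f;
    for (float v : ps.mass) m += v;        // contiguous access
    return m;
}
```

Because the layout leaks into every loop, switching it is a codebase-wide change, which is exactly why the slide calls this out as something existing models fail to abstract.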

Page 23:

Specialization

Hardware is diversifying

Heterogeneous processors (CPU, GPU, etc)

Fixed function hardware

System-on-chip

Driven by power & performance

Tight integration needed for fine-grained interactions & data

Page 24:

NYSE TAQ record counts

Page 25:

An elegant weapon… for a more civilized age

Page 26:

Trade lifecycle

•  Strategy Development & Testing
•  Strategy Deployment & Management
•  Data Storage & Analysis
•  Post Trade Analysis & Compliance

- Trade History, Exchange latency, Historical DB, Data Capture, Data Publishing, Analytics, Back Testing, Optimization, Research Systems, Matching, Execution, Risk Management

Page 27:

Follow the data

An emerging paradigm of computing based on an Event Driven Architecture (EDA) that offers the ability to analyze extremely large amounts of event (and other) data from disparate sources with very low latency and high throughput.

Streaming Data analytics (Complex Event Processing): sub-millisecond

In-Memory analytics (In-Memory Database): seconds

Mass historic data analytics (Columnar Data Store): minutes/hours

Page 28:

Different cores for different chores

FPGAs

GPUs

Many Cores

Low-Power SOCs

[Chart axes: Ease of Programming; Application-Specific to Broad Applicability]

Page 29:

Don’t fight the last war

The Era of the CLOCK WARS (1990–2000), the Era of the CORE WARS (2000–2010), the Era of the EFFICIENCY WARS (2010–2020)

Page 30:

Softer HW or harder SW?

TELAS 2™ Card Product Brief

An ultra-low-latency trading system on a card. TELAS 2™ (Trading Engine Latency-reduction and Acceleration System) is a revolutionary new product from Accensus. It combines flexible 1/10Gbps network connectivity with a powerful in-line Altera® Stratix® IV FPGA and the new Tilera TILE-Gx8036™ 36-core processor running Linux, along with up to 24GB (8GB FPGA, 16GB processor) of DDR3 RAM on a single PCI Express card.

PCI Express card comprising:
• 4 SFP/SFP+ cages providing flexible 1/10Gbps connectivity
• Tilera TILE-Gx8036™ processor (64-bit, 36 cores, 1.2GHz) with two dedicated 1600MHz DDR3 SODIMMs
• Altera® Stratix® IV FPGA (531,200 eq. LEs) with one dedicated DDR3 SODIMM
• High-precision oven-compensated quartz oscillator (OCXO) providing ultra-accurate timestamping capability
• Generation 2 (5Gbps) x8 PCI Express bus providing 40Gbps between card and host

Key features:
• Optimised ultra-low-latency path between the “wire” and the CPU
• FPGA is inline between the card’s network interfaces and the CPU, allowing it to process/manipulate raw frames in-line with the CPU’s network interfaces
• FPGA and CPU are also connected by a dedicated 20Gbps PCI Express bus, allowing efficient DMA transfer between them
• Low power footprint of <75W, allowing multiple cards to be used in a single server if required
• Tilera’s Multicore Development Environment™ (MDE) included with Accensus user-space libraries

Accensus also offers a subscription service to a range of Exchange Line Handlers designed and implemented to run on the TELAS 2™ card. These are implemented on the FPGA, thus leaving all CPU cores available for application development. Each Line Handler provides full packet-to-message decode, line arbitration, and transformation to optimised C structs. Also planned for early 2012 are a user-space TCP/IP stack with FPGA offload optimised for order management systems and advanced time stamping functionality. Please contact [email protected] for further information.

The TELAS 2™ architecture is designed from the outset to be ultra-low-latency, allowing data to be received from the wire at up to 40Gbps, optionally processed in-line by the FPGA chip, then delivered to the processor where it is processed by applications written in C/C++ running on an optimized Linux OS. The data can remain on the card for processing or be transferred to the host if required. The transmit path is simply the reverse. This architecture allows application developers complete freedom to implement ultra-low-latency applications either in C/C++ or in VHDL/Verilog, or split across both if so desired. User applications may reside completely on the card or be split across the card and the host server.

TELAS 2™ is built to be installed in commodity servers, providing both a significantly lower-latency network path to and from the card CPU than that of the server and supplemental server capacity in the form of significant additional compute and memory resource, thus extending existing rack space and power utilisation.

© 2011 Accensus LLC, 200 South Wacker Drive, Chicago, IL 60606, USA


Page 31:

Data reduction

target NIC. For verification, we have used the NEC Japan stock information (see Section II) as events. The date of each event is replaced by a sequence number in order to continuously input events at the cycle level. As shown in Fig. 10, the logic successfully detects five change points, and it achieves 20Gbps event processing performance; the clock frequency is 156MHz and the data width is 128 bits (i.e., 156MHz × 128 bits = 19,968Mbps). It should be noted that it is difficult for SQL-based systems to detect such change points without having support for procedural languages.

[Figure 10. Implementation of our motivating example: smoothing & change-point analysis of price (yen) over time on the NIC, with change points detected in the smoothed data.]

B. Logic Usage

We have evaluated how much the above implementation increases slice logic utilization of our NIC FPGA. Since the number of occupied slices increases only 2.7% (see Table II), the entire logic, including our event processing adapter, can be efficiently implemented.

TABLE II. LOGIC USAGE IN OUR MOTIVATING EXAMPLE

Item                       | Available | Increase
Number of slice registers  | 207,360   | +2,492 (1.2%)
Number of slice LUTs       | 207,360   | +4,290 (2.1%)
Number of occupied slices  | 51,840    | +1,378 (2.7%)

VI. CONCLUSION

The requirements for fast complex event processing will necessitate hardware acceleration which uses reconfigurable devices. Key to the success of our work is automatic logic generation with our C-based event language. With this language, we have achieved both higher event processing performance and higher flexibility for application designs than those with SQL-based CEP systems. We have demonstrated the world’s fastest (20Gbps) event processing performance in a financial trading application on an FPGA-based NIC. In future work, we intend to employ partial reconfiguration for run-time function replacement.


Page 32:

[Photo callouts: Air Vents, Console Port, Management Port, USB Port, 16 Base SFP/SFP+ Ports, 8 FX SFP/SFP+ Ports, Clock Input]

Page 33:

somedata = (1,2,3); otherdata = [1,2,3]; dict = [`a=1, `b=2];
// Note these are different types, list vs. vector.
type area = `FX | `Equities Int | `FixedIncome Double Double

results@node0 with f = #(id :: symbol; profit :: double)

f = {(id, area, pnl)
  var profit = area? `FX : pnl*.98 | `Equities x : pnl-x | `FixedIncome x y : pnl*(x-y);
  insert (id, profit) into results }

jobs = select id, area, parameters from strategies where date==today()

simulate = {(job)
  var pnl = sum(random * 1..10);
  if (pnl > 100) {send (job.id, job.area, pnl) to results; (`ok, pnl)}
  else (`fail, pnl) }

job_status = @[select distinct processors from places] {
  <-[(x){begin simulate(x)} each jobs] }

failed_jobs = select (status, pnl) from job_status where status==`fail

Page 34:

// run some code in place A; block until it’s done
@A {code}

// start an activity f in place B and return immediately
@B begin {f}

// run some code in place C, taking ownership of data
@c with data {…}  // bind data to place

// distribute data over place1 and place2
@[place1, place2] data;

// redundant copies of data in place1 and place2; also works for redundant computation
@[place1], [place2] data;

// run f in the fastest place we can
var c = select core-id from processors where max frequency
@c {f(`somedata)}

// queue work in parent
@parent {code}

// reply
@reply {code}

Page 35:

Open Problems

•  Machines are already beyond our ability to program productively with high performance

•  It’s getting harder to observe, understand, debug & tune our broken programs/machines

•  Where is the inconsistency coming from?
•  What implicit effects are we suffering from?
•  How do we cope with increasing diversity?
•  What do we need to give up to get some help?

Page 36:

Hot topics in parallelism in data management

Goetz Graefe Hewlett-Packard Laboratories

Palo Alto, Cal. – Madison, Wis.

Page 37:

June 8, 2012

History

•  Concurrency among independent transactions
   Each transaction single-threaded
   1960s, 1970s, …

•  Parallel query processing (within a transaction)
   Teradata 1983-84: specialized hardware
   Gamma 1984-88: off-the-shelf hardware
   Pipelines for algebraic execution
   Partitioning intermediate results

Page 38:


Transactions

ACID = atomicity, consistency, isolation, durability

•  User transactions
   Database contents: queries & updates
   Locks held to transaction commit
   Rollback using recovery log

•  System transactions
   Database representation changes, e.g., B-tree node split
   In-memory data structures, “latches”

Page 39:


Two types of transactions

                      User transactions            | System transactions
Invocation source     User request                 | System-internal
Database effects      Logical database contents    | Physical database representation
Data location         Database or buffer pool      | In-memory page images
Invocation overhead   New thread                   | Same thread
Locks                 Acquire & retain             | Test for conflicts
Commit overhead       Force log to stable storage  | No forcing
Logging               Full “redo” & “undo”         | Usually “redo” only
Failure recovery      Rollback                     | Completion
Hardware opportunity  Non-volatile memory          | Transactional memory

Page 40:


Two types of concurrency control

             Locks                                 | Latches
Separate…    User transactions                     | Threads
Protect…     Database contents                     | In-memory data structures
During…      Entire transactions                   | Critical sections
Modes:       Shared, exclusive, update, intention, | Shared, exclusive
             escrow, schema                        |
Deadlock…    Detection & resolution                | Avoidance
… by…        Waits-for graph analysis, timeout,    | Coding discipline, lock leveling
             transaction abort, partial rollback,  |
             lock de-escalation                    |
Kept in…     Lock manager’s hash table             | Protected data structure
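The lock/latch contrast can be sketched in a few lines of C++ (an illustrative analogy, not real database code): a latch is a plain mutex held only for the critical section that touches an in-memory structure, while a transaction lock is acquired once and retained until commit.

```cpp
#include <cassert>
#include <mutex>

// Latch: protects the in-memory page image for a few instructions at a time.
struct Page {
    std::mutex latch;
    int value = 0;
};

void latched_increment(Page& p) {
    std::lock_guard<std::mutex> g(p.latch);  // critical-section scope only
    ++p.value;
}

// Lock: acquired during the transaction and retained (not scoped) until
// commit, mimicking "acquire & retain" from the table above.
struct Transaction {
    std::mutex* lock = nullptr;
    void acquire(std::mutex& m) { m.lock(); lock = &m; }
    void commit() {
        if (lock) { lock->unlock(); lock = nullptr; }
    }
};
```

The deadlock row of the table follows from this shape: scoped latches lend themselves to coding discipline (lock leveling), whereas long-held transaction locks need runtime detection and rollback.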

Page 41:


Current trends and challenges

•  Scalability
   Query processing versus map-reduce (Hadoop etc.)
   Data mining, business intelligence, analytics
   Utilities (load, reorganization, …)

•  Implementation techniques
   Low-level synchronization
   Transactional memory
   Non-volatile memory
   Other novel hardware

Page 42:

© 2011 Adobe Systems Incorporated. All Rights Reserved.

Me: Russell Williams. My product: Photoshop

•  Huge cross-platform code base on single threaded framework

•  Parallel computation since mid-90s using basic parallel_for

•  Scaling falls off beyond 4 cores for many operations.

•  Must trade off throughput for latency

•  Proliferation of thread pools
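The “basic parallel_for” used since the mid-90s can be sketched in a dozen lines (a from-scratch sketch, not Photoshop’s or any library’s implementation): split the index range into one contiguous chunk per hardware thread and join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Apply body(i) for every i in [begin, end), one contiguous chunk per
// hardware thread. Blocks until all chunks finish.
void parallel_for(std::size_t begin, std::size_t end,
                  const std::function<void(std::size_t)>& body) {
    std::size_t n = end - begin;
    std::size_t workers =
        std::max<std::size_t>(1, std::thread::hardware_concurrency());
    std::size_t chunk = (n + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        std::size_t lo = std::min(w * chunk, n);
        std::size_t hi = std::min(lo + chunk, n);
        if (lo >= hi) break;
        pool.emplace_back([&body, begin, lo, hi] {
            for (std::size_t i = begin + lo; i < begin + hi; ++i) body(i);
        });
    }
    for (auto& t : pool) t.join();
}
```

Spawning and joining a pool per call is exactly the latency/throughput trade-off and thread-pool proliferation the slide complains about; real implementations reuse a shared pool instead.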

Page 43:


Challenges — structure of the problem


•  Asynchrony vs. parallel compute

•  Available parallelism
   •  Amdahl’s law vs. events, views, PCI bus
   •  On server, parallelize per user. On desktop: one user
   •  Bandwidth limited — FLOPS / memory reference

•  80-core chips not coming; software can’t use ’em.

Page 44:


Challenges — structure of the solutions


•  Heterogeneous environment
   •  C machine vs. data parallel, high latency, high throughput
   •  Different cache / memory hierarchies

•  Rapidly changing hardware landscape
   •  Discrete -> Integrated GPU
   •  SSE -> AVX -> AVX2 -> AVX3

•  __m128i vDst = _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea));

•  Variety, rapid evolution, and fragmentation of tools
   •  CUDA / DX / OpenCL / C++ AMP, vectorizing compilers
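The intrinsic line above (as reconstructed here, since the original glyph was garbled) computes, per 32-bit lane: convert an integer sum to float, multiply by a precomputed reciprocal of the filter area, and convert back with round-to-nearest. A scalar sketch of one lane, for illustration; the SIMD version does four lanes at once:

```cpp
#include <cassert>
#include <cmath>

// One lane of: _mm_cvtps_epi32(_mm_mul_ps(_mm_cvtepi32_ps(vSum0), vInvArea)).
// Multiplying by a precomputed 1/area replaces a per-pixel divide;
// std::nearbyint under the default FP environment rounds to nearest,
// matching _mm_cvtps_epi32's default rounding.
int average_lane(int sum, float inv_area) {
    float scaled = static_cast<float>(sum) * inv_area;
    return static_cast<int>(std::nearbyint(scaled));
}
```

This is the kind of kernel that must be re-expressed for each of the SSE/AVX/AVX2 generations the slide lists, which is precisely the maintenance burden being described.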

Page 45:


Desktop GFlops (8-core 3.5GHz Sandy Bridge + AMD 6950)

[Bar chart, 0–3000 GFlops, broken down into Scalar, Multi-thread, CPU Vector Unit, and GPU, for: Straight C++; Intrinsics / Auto-vec; TBB / GCD; OpenGL / OpenCL / C++ AMP]