Group May 09-06 Bryan McCoy Kinit Patel Tyson Williams Advisor/Client: Zhao Zhang

HIGH PERFORMANCE

BIOINFORMATICS

Group May 09-06

Bryan McCoy

Kinit Patel

Tyson Williams

Advisor/Client: Zhao Zhang

What is Bioinformatics? Genetic sequencing Massive amounts of data Many simple operations Perfect for distributed

computing

Problem Current solutions are not realistically feasible

Too expensive○ Super computers○ High powered servers

Too slow○ Some inputs can takes several days

Need for high speed, low cost solutions

Our Solution Cell Processor

Based on Phase 1 Cluster of PlayStation 3s MPI

Message Passing Interface

IBM Cell Broadband Engine 1 Power Processing Element (PPE) 8 Synergistic Processing Elements (SPEs)

Only 6 SPEs are accessible on a PlayStation 3 4 high speed rings for processor communication

DNAPenny Compares DNA strands

from different species Score indicates

evolution similarities between two species

Branch and bound search algorithm

Functional requirements

FR1. Ported applications shall run on the Cell B.E.

FR2. The results returned shall be the same as the original program.

FR3. The applications shall return their runtime.

FR4. The applications shall execute in parallel on multiple Cell B.E.s.

Non-Functional Requirements NF1. The Cells shall all run on the Linux

OS. NF2. The resulting runtimes of the

ported applications shall be faster than on the original applications.

NF3. The ported application shall be coded in the C language.

Market Survey Results of the survey point to a huge speed

up of computationally intensive programs. Dr. Gaurav Khanna at the University of

Massachusetts Dartmouth used cluster of 8 PS3s to replace a supercomputer.

Universitat Pompeu Fabra, in Barcelona, deployed in 2007 a BOINC system called PS3GRID for collaborative biological computing.

Risk Assessment

Slow network speed Software support Limited RAM Hardware Failure

Lower quality entertainment hardware Limited prior experience Software development schedule

Resource Requirements

3 PlayStation 3s High performance network switch Cell programming books Front node (desktop computer) Time

Software Environment Use Fedora 9 OS as it

is currently supported by the Cell SDK 3.1.

Uses the command line for user interface.

Use the IBM XLC compiler and/or the current GCC compiler.

Hardware Environment 3 PlayStation 3s High speed Crossbar switch Private network Front Node (desktop computer)

Proxy serverNetwork File Store (NFS)

I/O Input

Inputs are DNA sequences stored in a text file.

Text is a CustalW alignment organized in Phylip format, a standard format for biological applications.

OutputOutputs are

○ The parsimony score○ The best trees○ The execution time

The score and best trees are output to the screen and to text files.

The execution time is output to a CSV (Comma Separated Value) file.

Work Breakdown Structure

Port Apps to Cluster PS3s

Problem Definition

Research Cell/B.E

Research Bioperf Suite

Research Distributed Parallel

Algorithms

Research Previously Done

Work

End Product Design

Design Requirements

Design Process

Design Documents

Considerations and Selections

Decide Which Linux to Install

Decide which applications to port

End Product Implementation

Hardware Implementation

Prototyping Implementation

Software Implementation

End Product Testing

Ensure Correctness of Output Results

Benchmarking

Final Documentation and

Demonstration

Create Final Report

Create Project Poster

Prepare for Presentation

Work Schedule Gant chart

Deliverables

Source Code Compiled Executable Runtime Comparisons Final Report Poster Final Presentation

Costs Time

Approximately 555 man hours total.

Freely donated.

Total cost $0.

Equipment3 PlayStation 3s

○ Provided by clientCrossbar router

○ Provided by clientStandard desktop computer

○ Provided by department

Total cost $0.

Development: Initial Overview Use MPI to distribute the program to the

multiple PlayStations. Each PlayStation would search one

branch of the tree. 1 function (supplement) took 90% of the

runtimePhase 1 ported this function to the SPEs

Development: Difficulties Found a bug in supplement. The bug did not affect results but did affect

runtime. We contacted the original developer, Dr.

Felsenstein at the University of Washington, who fixed the bug.

The fix significantly improved runtime. However, the fix negated all work done by

Phase 1 as supplement no longer took a significant amount of runtime.

Development: Reworking After the bug fix, no

single function took a significant amount of runtime.

We decided to distribute branches of the tree search to different processors.

Development: Results

Completed our goals Divided work among 3 PlayStation 3s.Produced faster code that comparable

sequential environment. Due to time constraints, we were not

able to port the code to the SPEs.

Testing

Used script to test multiple inputs. Averaged the runtimes. Used several different code revisions

and machines to provide comparisons. Projected the speedup that could be

attained if code was ported to SPEs.

Results: Actual Our current code is

20.76 times faster than the it was at the beginning of the semester.

Surpassed our original projections, which assumed the use of the SPEs.

Code revision Runtime (sec)

X Speedup X Speedup

(compared to desktop)

# of available cores

Original (Core2)

1861.66 0.116 1

With Bug Fixes (Core 2)

218.8 1 1

Original (1 PPE)

4953.51 1 0.044 1

With Bug Fixes (1 PPE)

662.70 7.47 0.330 1

MPI with Bug Fixes (3 PPEs)

238.57 20.76 0.917 3

MPI with Bug Fixes (3 PPEs, 18 SPEs)

(Projected)

34.08 145.35 6.420 21

Original Projections

334.82 14.79 0.653 21

Results: MPI The speedup for

MPI was 2.78. Excellent speedup

for 3 nodes.


X Speedup X Speedup



Original (Core2)

1861.66 0.116 1


218.8 1 1

Original (1 PPE)

4953.51 1 0.044 1


662.70 7.47 0.330 1


238.57 20.76 0.917 3


(Projected)

34.08 145.35 6.420 21


334.82 14.79 0.653 21

Results: Comparison Our final code came

close to a high powered desktop.Core 2 Quad at

2.66 GHz Our projected

results indicate a speedup of 6.4.


X Speedup X Speedup



Original (Core2)

1861.66 0.116 1


218.8 1 1

Original (1 PPE)

4953.51 1 0.044 1


662.70 7.47 0.330 1


238.57 20.76 0.917 3


(Projected)

34.08 145.35 6.420 21


334.82 14.79 0.653 21

Results: Projected

Using all SPEs, the speedup should be 145.35 Assuming SPEs run as fast

as the PPEs○ Before SPE vectorization


X Speedup X Speedup



Original (Core2)

1861.66 0.116 1


218.8 1 1

Original (1 PPE)

4953.51 1 0.044 1


662.70 7.47 0.330 1


238.57 20.76 0.917 3


(Projected)

34.08 145.35 6.420 21


334.82 14.79 0.653 21

Conclusions

Achieved our goal of using MPI to get runtime improvement.

Contributed a major fix to a widely used application.

Surpassed our initial runtime goal. Projected results show an even larger

runtime improvement still possible.

Acknowledgements May08-24 group (phase I)

Kyle Byerly Shannon McCormick Matt Rohlf Bryan Venteicher

DNAPenny Author Dr. Felsenstein

Advisor Zhao Zhang

Environment Help Steve Nystrom

Questions?