
Finding Discriminating DNA Probe Sequences by Implementing a Parallelized Solution in a Cluster
REU, Summer 2008
Camilo A. Silva
Professor and Advisor: Dr. S. Masoud Sadjadi
BIOinformatics Group Members: Guangyuan Liu, Michael Robinson, Camilo A. Silva

objectives
Problem
Motivation
Initial goals
Project schedule
Challenges
Lessons learned
Accomplishments
Project status
Wrapping up
Continuation of project
Future work
Conclusion
Acknowledgements
References

problem
What is the most efficient parallel program structure for an MPI application running on a cluster, and what kind of algorithm does the program need in order to be both self-healing and self-optimizing while maintaining optimal performance? How can the program then be exposed as a web application with a user-friendly interface?

motivation
Our project is a discriminating probe finder. Its characteristics:
- It finds all possible probes of a given length in one genome and compares them against another genome
- It also finds variations such as the reverse, inverse, and complement of each probe whenever specified by a parameter
- The output is the set of probes present in one genome but absent from the other; these are known as discriminating probes
ABLE TO BE RUN ON A CLUSTER
IMPLEMENT SELF-MANAGING FUNCTIONS

initial goals
Implement the parallelization of the finding-discriminating-probes application
Create a self-managing system for the application
Implement a web application for the project

project schedule
6/12-6/23: MPI theory preparation + autonomic computing
6/25 (Wed): parallelization programming starts
7/7-7/13: test simulated MPI programs; learn about MPI-IO and explain to my group members how to use MPI
7/15: deadline to have the MPI implementation ready for the project
7/16-7/23: learn about MPI error handling and MPI debugging
7/23-7/27: first parallelized jobs assigned to GCB over the past weekend
7/27: MPI parallelized program completed with an implementation of a self-healing attribute
7/27-8/14: make modifications to the parallelized program if necessary
7/31-8/14: write and complete the paper to be submitted on August 15, 2008

challenges
Having to learn to be an independent researcher
Communicating my ideas to my team members efficiently
Completing all my time projections on time
Studying massive amounts of detailed material in short periods of time
Debugging, debugging, debugging

lessons learned
The ability to work virtually in a global team environment is an opportunity to take advantage of
If someone has the willingness to explore new lands and learn new magic: just do it, read it, and practice it
Problems can be solved by communicating with others
ENJOY and LOVE what YOU DO!

accomplishments
Three topics will be discussed: parallelization, self-management, and results

parallelization
The master node acquires information from the user about the different genomes to be compared by project18
The master node administers the data and creates jobs for each slave node
Each slave node receives its data from the master node and starts executing project18
After a node has completed its task, it reports completion to the master node, which determines whether more tasks remain; if so, the next task is given to that node
When the program has finished, all results are stored in a predefined directory where they are available for review

parallel program design
[diagram not recoverable from transcript: start/end message flow between the master node and the slave nodes]
M.N. = master node; 1-7 = slave nodes; F = finish; C = completion

a brief pseudo-code of the parallelization

// libraries + definitions
#include <mpi.h>

int main() {
    // variable definitions, MPI_Init, rank lookup
    if (rank == MASTER_NODE) {
        // ask the user for input, create the queue, and initialize all tasks
        while (/* there are items left in the queue */) {
            // receive completion signals, keep fault control and task control
            // active, and assign newly available tasks to available nodes
        } // end while
    } // end if
    else {
        // receive the number of items left in the queue
        while (/* there are items left */) {
            // receive the genome parameter from the master node
            // execute project18 and create the output files
            // submit a completion code to node 0
        } // end while
    } // end else
    MPI_Finalize();
} // end of main

void taskControl() {
    // makes sure that each task is completed accordingly and is successful
}

self-management
The self-management feature added to my program is a self-healing property. It is a simple component that checks for errors whenever messages are sent from the master node to the slave nodes. If an error is detected, the affected task is recorded in an array that carries the error status of each assigned task. Each task whose message produced an error is then reassigned accordingly.

self-healing
Each time a message is sent from the master node to the slave nodes, an MPI error handler is active, checking for errors. If there is an error in the message being sent, it is recorded in an array indexed by task number (TASK_NUMS). Afterwards, the master node resends the message to the assigned slave node.

results
The parallelization of the program works: our group leader Michael Robinson tested it with some of the small files two weeks ago.
The self-healing property has not yet been exercised, because as of now no errors have occurred.
The results shown below are from the parallelized program that Gary completed.

results table
(all runs started July 25 at 17:00; end times reconstructed from start time + time used; the FOUND and GENOMES columns are truncated in the transcript)
NODE  END (next day)  TIME USED  FOUND
1     19:38           26:38:00   109,…
2     20:07           27:07:00   137,253…
3     21:43           28:43:00   91,…
4     22:52           29:52:00   18,…
5     23:18           30:18:00   38,…

results statistics
[chart not recoverable from transcript]

project status
The parallelization part of the project is complete.
The self-healing part of the program could be enhanced by having two autonomic agents: one that checks the connectivity of the nodes and another that checks the functional status of each slave node.
There is also an error to fix that seems to be linked to a memory leak; it appears whenever more tasks are assigned than there are nodes.
One of the most important remaining items is data validation.
Finally, performance tests will be completed in the next couple of days for the data analysis.

wrapping up
My goal is to help my group write the final draft of the paper.
If necessary, I will modify my parallel program to fit the testing needs; for example, instead of asking the user for input, all input should be read from a file.

continuation of project
I would like to have the opportunity to enhance my program with two self-healing autonomic components that would help find faults in both the connectivity and the task functionality of the slave nodes
Find a way to self-optimize my program

future work
One of my initial goals was to create a web interface that could initialize the tasks in the cluster. This would be fun and interesting work to perform in the future.

conclusion
Throughout my first research experience I had the opportunity to learn what it takes to be an independent researcher, as well as to work with a team on a specific task. Of my initial four goals, I successfully accomplished two. Although I thought I would be able to do everything I projected, I did not take into account the amount of reading and learning I had to do before programming in parallel, nor the debugging and testing. Still, I am glad to know that I am not the same person I was two months ago. Now I am more knowledgeable in a specific topic, and I feel a desire to continue doing research and to contribute to science!

acknowledgements
Special thanks to: David Villegas, Javier Delgado, Javier Figueroa, Juan C. Martinez, Dr. S. Masoud Sadjadi, Dr. Hector Duran, Dr. Scott Graham, Dr. Masoud Milani, the REU + PIRE staff, and my group members Guangyuan "Gary" Liu and Michael Robinson. And God, for giving me the strength to study hard. And to all of YOU for being here listening to me!
references
MPI general, MPI error handling, MPI debugging, and MPI-IO resources [links lost in transcript; described as a great site containing all the information needed on MPI]
"Advanced MPI-IO" by Rajeev Thakur (IBM) [link lost in transcript]
Grid + cluster info [link lost in transcript]