ARMS: Active Resource Management Services for Big Data Processing
Presentation Two
4/09/2013 1
Agenda
• Title
• Outline
• Members
• Mentor
• Societal Issue
• History
• Dr. Li
• Cluster Computing
• Case Study
• Accuracy
• Current Major Functional Component Diagram
• Current Process Flow
• Problem Statement
• Proposed Major Functional Component Diagram
• Proposed Process Flow
• Dinosolve Walkthrough
• Dinosolve Issues
• Software
• Hardware
• Solution Statement
• Competition Identified
• 508 Compliance
• Objectives
• Benefits of Solution
• Milestones
• Sitemap
• Database Schema
• Entity Relationship Diagram
• Risks
• Conclusion
• References
• Appendix
Group Members and Roles
• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)
Dr. Yaohang Li
• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
  • Computational Biology: applies computational simulation techniques to solve biological problems
  • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
  • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem
How do researchers manage the massive amounts of data they are collecting in
order to benefit their research?
“Every day, [mankind] creates 2.5 quintillion (2.5*10^18) bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” - IBM
http://www-01.ibm.com/software/data/bigdata/
Data Management Examples
• Large Hadron Collider [2]
  • 150 million sensors report 40 million times per second
• Watson on Jeopardy
  • 200 million pages
  • Structured and unstructured
  • 4 terabytes of information
• DinoSolve Protein Prediction Server
  • Proteins are made up of one or more amino acids
  • 20 different amino acids
  • If a protein is made up of 5 amino acids, then the number of possible proteins is 20^5, or 3,200,000
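The DinoSolve arithmetic above follows from counting sequences: with 20 amino acids to choose from at each of n positions, there are 20^n possible sequences. A minimal Python sketch of that count follows; the function name `sequence_count` is an illustrative choice and is not part of DinoSolve.

```python
AMINO_ACID_COUNT = 20  # number of different amino acids, as stated on the slide


def sequence_count(length: int, alphabet_size: int = AMINO_ACID_COUNT) -> int:
    """Number of distinct sequences of the given length over the alphabet."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return alphabet_size ** length


# Show how quickly the search space grows with sequence length.
for n in (5, 10, 20):
    print(f"length {n}: {sequence_count(n):,} possible sequences")
```

For a length of 5 this reproduces the slide's figure of 3,200,000, and every added position multiplies the count by 20.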
Big Data Analysis Hardware
• Physical Resource Management Subsystem
• Scheduling and Queuing Subsystem
• Job Management Subsystem

Dr. Li’s Cluster Configuration
[Diagram: server cluster made up of servers (database server, web server) and hardware (one Dell PowerEdge R410 head node and three Dell PowerEdge R410 computational nodes, each computational node with two Intel E5506 processors)]
Dinosolve Issues
As Dinosolve continues to grow in popularity, these issues are expected to occur:
• Limited hardware resources for computation
  • CPU cycles
  • Memory
  • Disk space
  • Network bandwidth
• Server crashes
The goal is to prepare the system to keep supporting the research community as the number of requests grows, and to enhance the design of the user interface.
Job Management Subsystem
Physical Resource Management
Scheduling and Queueing
[Each of these three slides repeats the server cluster diagram: database server, web server, Dell PowerEdge R410 head node, and three Dell PowerEdge R410 computational nodes]
Dr. Li’s Grants
• DinoSolve
  • Secured a five-year, $400,000 CAREER Award from the National Science Foundation
• Dr. Li
  • Principal or co-principal investigator
  • Research grants totaling more than $15.3 million
Dr. Yaohang Li and Dinosolve
• Dinosolve examines a protein's sequence of amino acids and determines whether the protein can be manipulated by the addition of a disulfide bond
• Each computational result improves the prediction accuracy of future results
• 40^20 (more than 10^32) different possible combinations for even the shortest sequence
What is the problem?
300 simultaneous requests will cause the web server to crash
[Chart: system throughput (Mb/sec)]
Dinosolve Case Study
• Bioinformatics [7]
• Disulfide bond prediction program
• Disulfide bond creation is important to the research community
http://www.merriam-webster.com/dictionary/bioinformatics
Dinosolve Users
• Drug design
  • Pharmaceutical companies
• Antibody design
  • To combat viruses
• Bio-energy development
  • Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
  • Research to cure cancer, HIV, and other diseases
Accuracy of Popular Tools
          Dinosolve   DiANNA   Scratch Protein Predictor
Accuracy  90.8%       81%      87%
More users use Dinosolve because of the enhanced accuracy
Current Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, MySQL Database Server, DinoSolve Algorithm]
Current Process Flow
[Flowchart with swimlanes: User (Researcher), Web server, DinoSolve engine, Database]
• Start: visit DinoSolve
• Enter sequence and email address
• Validity check: input valid?
  • No: display error
  • Yes: send sequence
• Execute algorithm (Big Data calculation)
• CYS?
  • CYS < 2: display "No Reaction"
  • CYS > 1: bond formed
• Store results (Big Data storage)
• Email link to results
• View results
• End
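The decision points in the current process flow (validity check, then the cysteine count) can be sketched in Python. This is an illustration of the flowchart's branching only, under stated assumptions: the names `triage` and `Outcome` are hypothetical, the email check is a deliberately crude placeholder, and the prediction algorithm itself is not shown. It is not DinoSolve's actual code.

```python
from dataclasses import dataclass

# One-letter codes of the 20 standard amino acids; "C" is cysteine (CYS).
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


@dataclass
class Outcome:
    status: str  # "error", "no_reaction", or "predict"
    detail: str


def triage(sequence: str, email: str) -> Outcome:
    """Mirror the flowchart: validity check first, then the CYS branch."""
    seq = sequence.strip().upper()
    if not seq or not set(seq) <= AMINO_ACIDS:
        return Outcome("error", "sequence must contain only amino acid letters")
    if "@" not in email:  # placeholder check, an assumption for this sketch
        return Outcome("error", "email address looks invalid")
    cys = seq.count("C")
    if cys < 2:
        # A disulfide bond links two cysteines, so fewer than two means no bond.
        return Outcome("no_reaction", f"{cys} cysteine(s) found")
    return Outcome("predict", f"{cys} cysteines found; run the prediction")


print(triage("ACDC", "researcher@example.org"))
print(triage("ACD", "researcher@example.org"))
print(triage("AC1D", "researcher@example.org"))
```

Rejecting bad input before any computation keeps invalid requests from consuming cluster time, which matters given the resource limits described earlier.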
RWP Major Functional Component Diagram
[Diagram components: Researcher, Internet, Web Server, MySQL Database Server, SGE scheduler, two Execution Hosts]
Proposed (RWP) Process Flow
[Flowchart with swimlanes: User (Researcher), Web server, SGE scheduler, SGE execution host, Database]
• Start: visit DinoSolve
• Enter sequence and email address
• Validity check: input valid?
  • No: display error
  • Yes: send sequence
• Accept and schedule job
• Execute algorithm (Big Data calculation)
• CYS?
  • CYS < 2: display "No Reaction"
  • CYS > 1: bond formed
• Store results (Big Data storage)
• Email link to results
• View results
• End
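The "accept and schedule job" step is what separates the proposed flow from the current one: requests wait in a queue instead of all running at once. The sketch below simulates a simple first-in, first-out policy in Python to show the effect. It is an assumption-laden illustration, not SGE's actual scheduling policy: the function name `simulate_fifo` and the 10-unit job duration are hypothetical, while the 300 requests and 24 execution slots are the figures quoted elsewhere in this presentation.

```python
import heapq


def simulate_fifo(durations, slots):
    """Finish time of each job when jobs start, in arrival order,
    on whichever execution slot becomes free first."""
    if slots < 1:
        raise ValueError("need at least one execution slot")
    free_at = [0.0] * slots  # min-heap: time at which each slot is next free
    heapq.heapify(free_at)
    finish_times = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        end = start + duration
        finish_times.append(end)
        heapq.heappush(free_at, end)
    return finish_times


# 300 simultaneous requests, each assumed to take 10 time units, on 24 slots.
finish = simulate_fifo([10] * 300, slots=24)
print("last job finishes at t =", max(finish))
print("jobs running at any moment never exceed 24")
```

Under this model the load on the execution hosts is capped at the number of slots, and extra requests only lengthen the wait rather than overloading the servers.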
Objectives
• Configure, utilize, and optimize the SGE
• Provide an aesthetically pleasing and professional user interface
• Achieve 508 compliance
• Improve the existing database schema and add user accounts
Benefits from Goals
• Efficient utilization of available resources and increased throughput of the cluster
• Professional user interface leading to a rise in popularity
• Accessibility
• Security and efficient access to previous submissions
User interface will be improved to be more aesthetically pleasing
Working with Dinosolve
• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...
Protein sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein

Working with Dinosolve
• Confirmation of request
• Now wait for results

Working with Dinosolve
• Check your e-mail
• Click the link provided
• The results are displayed
Why is it important to be compliant?
If an entity wishes to receive government funding, then any electronic form the entity uses must be 508 compliant
508 Compliance
• Section 508 of the Rehabilitation Act of 1973, as amended in 1998
  • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32]
  • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]
http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
Compliance of Popular Tools
                              Dinosolve   DiANNA   Scratch Protein Predictor
508.22 compliance percentage  67%         85%      67%
Milestones
[Milestone tree: ARMS branches into Software, Testing, and Hardware; legend: Unaffected / Within Scope]

Three Computational Nodes
• Each processor has four execution slots
[Diagram: Dell PowerEdge R410 server (computational node) with two Intel E5506 processors]

Processors
• Each computational node has two processors
• 6 processors yield 24 execution slots
Software Milestones
[Milestone tree: Software, with nodes Database, Sun Grid Engine, User Interface, Algorithm, Scheduler, and Disulfide Bond Predictor; legend: Unaffected / Within Scope]

Testing Milestones
[Milestone tree: Testing branches into Server Cluster Performance and Database Performance]
• Cluster Performance
  • Stress testing
  • Prevention of denial of service attacks
• Database Performance
  • Stress testing
  • Prevention of SQL injection attacks against the MySQL database
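One widely used way to prevent SQL injection is to pass user input through query parameters rather than building SQL strings by hand. The Python sketch below demonstrates the idea with the standard library's `sqlite3` module as a stand-in, since it needs no server; the project's database is MySQL, and the table `submissions`, its columns, and the function `find_submissions` are hypothetical names chosen for this example.

```python
import sqlite3


def find_submissions(conn, email):
    """Look up submissions by email using a parameterized query."""
    # The "?" placeholder sends the value as data, never as SQL text.
    cursor = conn.execute(
        "SELECT id, sequence FROM submissions WHERE email = ?", (email,)
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER PRIMARY KEY, email TEXT, sequence TEXT)"
)
conn.executemany(
    "INSERT INTO submissions (email, sequence) VALUES (?, ?)",
    [("a@example.org", "ACDC"), ("b@example.org", "CCGH")],
)

# A classic injection attempt is treated as a literal (non-matching) email.
hostile = "x' OR '1'='1"
print(find_submissions(conn, "a@example.org"))
print(find_submissions(conn, hostile))
```

The same pattern applies with MySQL drivers for Python, which typically use `%s` as the placeholder instead of `?`. A stress or security test can then assert that hostile input returns no rows.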
Complete Milestone Tree
[Milestone tree: ARMS branches into Software (Database, Algorithm, Sun Grid Engine, Scheduler, Disulfide Bond Predictor, User Interface), Hardware (server cluster with database server, web server, Dell PowerEdge R410 head node, and three Dell PowerEdge R410 computational nodes, each with two Intel E5506 processors), and Testing (Server Cluster Performance, Database Performance); legend: Unaffected / Within Scope]
Sitemap
[Sitemap: DinoSolve Homepage links to References, Information, Statistics, Help, User, Contact, and Admin]
Database Schema
Entity Relationship Diagram
Risks
• T1: Larger volumes of queries could cause slower processing speeds, and the slowdown may stem from limited hardware capacity
• T2: Improper synchronization of cluster resources could lead to a deadlock
• T3: Race conditions between the HPCR cluster and the MySQL database
• T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system
• C1: Users may not like the new design
• C2: SGE does not enforce exclusive access to the reserved processors
[Risk matrix: impact versus probability, plotting T1, T2, T3, T4, C1, and C2]
Technical Risks and Mitigations
T1: Larger volumes of queries could cause slower processing speeds, and the slowdown may stem from limited hardware capacity
Probability: 1 Impact: 5
Mitigation: Create indexes and use specialized data structures and aggregate tables.
T2: Improper synchronization of cluster resources can lead to a deadlock
Probability: 2 Impact: 4
Mitigation: Modify and read application data. Alter execution logic and basic software configuration of SGE.
[Risk matrix: impact versus probability, plotting T1 and T2]
Technical Risks and Mitigations
T3: Race conditions between the HPCR cluster and the MySQL database.
Probability: 3 Impact: 3
Mitigation: Use software control on the SGE.
T4: A local attacker could exploit these vulnerabilities and cause a crash or execute arbitrary code on the system
Probability: 2 Impact: 2
Mitigation: Keep virus protection up to date. Use very specific types of passwords. Run current scripts, because attackers look for dated scripts that are more likely to contain holes. Limit access to certain files.
[Risk matrix: impact versus probability, plotting T3 and T4]
Risks
C1: Users may not like the new design.
Probability: 3 Impact: 3
Mitigation: Create a new, more aesthetically pleasing design
C2: SGE does not enforce exclusive access to the reserved processors.
Probability: 4 Impact: 4
Mitigation: Qsub and knowledge of node memory capacity
[Risk matrix: impact versus probability, plotting C1 and C2]
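The probability and impact scores above can be combined into a single ranking. A common convention, which the slides do not state and is assumed here, is to score each risk as probability times impact. The Python sketch below applies that convention to the six risks; the function name `rank_by_exposure` is hypothetical.

```python
# (probability, impact) pairs as given on the risk slides.
RISKS = {
    "T1": (1, 5),
    "T2": (2, 4),
    "T3": (3, 3),
    "T4": (2, 2),
    "C1": (3, 3),
    "C2": (4, 4),
}


def rank_by_exposure(risks):
    """Sort risks by probability * impact, highest first (ties by name)."""
    scored = [(p * i, name) for name, (p, i) in risks.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


for score, name in rank_by_exposure(RISKS):
    print(f"{name}: exposure {score}")
```

Under this assumed scoring, C2 (the lack of exclusive processor access in SGE) ranks highest, which matches its position in the upper right of the risk matrix.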
With the updated user interface and a correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server
References for history
1. http://www-01.ibm.com/software/data/bigdata/
2. http://en.wikipedia.org/wiki/Big_data
3. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
4. http://en.wikipedia.org/wiki/Computer_cluster
References for case study
5. Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.
6. Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium
7. bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics
8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/D-index.cfm?definition=disulfide_bond
9. Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.
10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/
References for competition
11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor http://scratch.proteomics.ics.uci.edu/
15. DiANNA server http://clavius.bc.edu/~clotelab/DiANNA/
Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf
Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/
IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF
Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf
Reference for 508 Compliance
26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973
Appendix
• Competition Matrix for Resource Management Systems
• 508.22 Compliance Statistics for Dinosolve
Competing Resource Management Systems
Features of systems                             PBS                LSF                SGE
Supported platforms                             Unix               Unix & NT          Unix
Multi-cluster support                           Yes                Yes                No
System level checkpoint restart                 No                 Yes                Yes
User level checkpoint restart                   No                 No                 Yes
Large computational grid support                No                 No                 No
Massive scalability                             Yes                Yes                Yes
Parallel job support with Sun HPC ClusterTools  Loose Integration  Tight Integration  Loose Integration
Distribution format of end product              Source             Binary only        Binary and Source
Free?                                           Yes                No                 Yes
POSIX 1003.2d compliance                        Yes                No                 Yes