ARMS Active Resource Management Services For Big Data Processing, Presentation Two, 4/09/2013


Page 1: ARMS  Active Resource Management Services For Big Data Processing

ARMS Active Resource

Management Services

For Big Data Processing

Presentation Two

4/09/2013 1

Agenda

• Title
• Outline
• Members
• Mentor
• Societal Issue
• History
• Dr. Li
• Cluster Computing
• Case Study
• Accuracy
• Current Major Functional Component Diagram
• Current Process Flow
• Problem Statement
• Proposed Major Functional Component Diagram
• Proposed Process Flow
• Dinosolve Walkthrough
• Dinosolve Issues
• Software
• Hardware
• Solution Statement
• Competition Identified
• 508 Compliance
• Objectives
• Benefits of Solution
• Milestones
• Sitemap
• Database Schema
• Entity Relationship Diagram
• Risks
• Conclusion
• References
• Appendix

Group Members and Roles

• Scott Pardue (Team Leader)
• Michael Rajs (Risk Manager)
• Adam Willis (Algorithm Specialist)
• Sybil Acotanza (Documentation Specialist)
• Jordan Heinrichs (Database Designer)
• David Crook (User Interface Designer)

Dr. Yaohang Li

• Associate Professor in the Department of Computer Science at Old Dominion University
• Research interests include:
  • Computational Biology: applies computational simulation techniques to solve biological problems
  • Markov Chain Monte Carlo (MCMC) methods: statistical algorithms for sampling from probability distributions
  • Parallel Distributed Grid Computing: uses multiple computers communicating via the Internet to solve a problem

How do researchers manage the massive amounts of data they are collecting in order to benefit their research?

“Every day, [mankind] creates 2.5 quintillion (2.5*10^18) bytes of data — so much that 90% of the data in the world today has been created in the last two years alone.” - IBM

http://www-01.ibm.com/software/data/bigdata/


Data Management Examples

• Large Hadron Collider [2]
  • 150 million sensors report 40 million times per second
• Watson on Jeopardy
  • 200 million pages
  • Structured and unstructured
  • 4 terabytes of information
• DinoSolve Protein Prediction Server
  • Proteins are made up of one or more amino acids
  • 20 different amino acids
  • If a protein is made up of 5 amino acids, then the number of possible proteins is 20^5, or 3,200,000
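The 20^5 figure in the DinoSolve example can be checked with a few lines of Python. This is only an illustrative sketch: the function name is hypothetical, and it assumes every position in the sequence can independently hold any of the 20 amino acids.

```python
# Size of the sequence space, assuming each position can hold
# any of the 20 amino acids independently of the others.
AMINO_ACID_COUNT = 20


def possible_sequences(length: int) -> int:
    """Return the number of distinct sequences of the given length."""
    if length < 0:
        raise ValueError("length must be non-negative")
    return AMINO_ACID_COUNT ** length


if __name__ == "__main__":
    # Show how quickly the count grows with sequence length.
    for length in (1, 2, 5, 10):
        print(f"length {length}: {possible_sequences(length):,} possible sequences")
```

The count grows exponentially with sequence length, which is why even short proteins produce a large search space.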


Big Data Analysis Hardware

• Physical Resource Management Subsystem
• Scheduling and Queuing Subsystem
• Job Management Subsystem

Dr. Li’s Cluster Configuration

[Diagram: server cluster]
• Servers: Database Server, Web Server
• Hardware:
  • Head node: Dell PowerEdge R410 Server
  • Three computational nodes: Dell PowerEdge R410 Servers
  • Six Intel E5506 processors

Dinosolve Issues

As Dinosolve continues to grow in popularity, the following issues are expected to occur:

• Limited hard resources for computation
  • CPU cycles
  • Memory
  • Disk space
  • Network bandwidth
• Server crashes

The goal is to prepare the system to keep supporting the research community as the number of requests grows, and to enhance the design of the user interface.

Job Management Subsystem

[Diagram: server cluster with database server, web server, head node, and three computational nodes]

Physical Resource Management

[Diagram: server cluster with database server, web server, head node, and three computational nodes]

Scheduling and Queueing

[Diagram: server cluster with database server, web server, head node, and three computational nodes]

Dr. Li’s Grants

• DinoSolve
  • Secured a five-year, $400,000 CAREER Award from the National Science Foundation
• Dr. Li
  • Principal or co-principal investigator
  • Research grants totaling more than $15.3 million

Dr. Yaohang Li and Dinosolve

• Dinosolve examines a protein sequence of amino acids and determines if the protein can be manipulated by an addition of a disulfide bond

• Each computational result enhances the prediction accuracies for future results

• 40^20 (more than 10^32) different possible combinations for even the shortest sequence

What is the problem?

300 simultaneous requests will cause the web server to crash

[Chart: system throughput (Mb/sec)]

Dinosolve Case Study

• Bioinformatics [7]
• Disulfide bond prediction program
• Disulfide bond creation is important to the research community

http://www.merriam-webster.com/dictionary/bioinformatics

Dinosolve Users

• Drug design
  • Pharmaceutical companies
• Antibody design
  • To combat viruses
• Bio-energy development
  • Creation of new fuels to replace diminishing fossil fuels
• Genetic mapping [5]
  • Research to cure cancer, HIV, and other diseases

Accuracy of Popular Tools

Tool                        Accuracy
Dinosolve                   90.8%
DiANNA                      81%
Scratch Protein Predictor   87%

More users use Dinosolve because of the enhanced accuracy

Current Major Functional Component Diagram

[Diagram components: Researcher, Internet, Web Server, DinoSolve Algorithm, MySQL Database Server, Email]

Current Process Flow

[Flowchart with swimlanes: User (Researcher), Web server, DinoSolve engine, Database]

• Start
• Visit DinoSolve
• Enter sequence and email address
• Validity check: input valid?
  • If not, display error
• Send sequence
• CYS?
  • No (CYS < 2): display “No Reaction”
  • Yes (CYS > 1): execute algorithm (Big Data calculation)
• Bond formed
• Store results (Big Data storage)
• Email link to results
• View results
• End
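The CYS decision in this flow can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the DinoSolve implementation: the function names and return strings are hypothetical, and the only domain fact used is that cysteine is written as the one-letter code C.

```python
def count_cysteines(sequence: str) -> int:
    """Count cysteine residues (one-letter code 'C') in a protein sequence."""
    return sequence.upper().count("C")


def route_request(sequence: str) -> str:
    """Mirror the flowchart's CYS branch.

    Fewer than two cysteines: no disulfide bond can form, so the flow
    displays "No Reaction". Two or more: the prediction algorithm runs.
    The return strings are placeholders chosen for this sketch.
    """
    if count_cysteines(sequence) < 2:
        return "no reaction"
    return "run prediction"


if __name__ == "__main__":
    for seq in ("ACDEFG", "ACDCFG"):
        print(seq, "->", route_request(seq))
```

Checking the cysteine count before running the algorithm keeps requests that cannot form a bond from consuming compute time.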


RWP Major Functional Component Diagram

[Diagram components: Researcher, Internet, Web Server, MySQL Database Server, Email, SGE scheduler, two Execution Hosts]

Proposed Process Flow (RWP Process Flow)

[Flowchart with swimlanes: User (Researcher), Web server, SGE scheduler, SGE execution host, Database]

• Start
• Visit DinoSolve
• Enter sequence and email address
• Validity check: input valid?
  • If not, display error
• Send sequence
• Accept and schedule job
• CYS?
  • No (CYS < 2): display “No Reaction”
  • Yes (CYS > 1): execute algorithm (Big Data calculation)
• Bond formed
• Store results (Big Data storage)
• Email link to results
• View results
• End
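The "Accept and schedule job" step could hand each request to the SGE scheduler through `qsub`. The sketch below only builds the command line and does not run it. The script name `run_dinosolve.sh`, the job name, and the file paths are hypothetical placeholders. The flags used (`-N`, `-cwd`, `-j y`, `-M`, `-m e`) are standard SGE `qsub` options, but how the real system submits jobs is not stated in the slides.

```python
import shlex
from typing import List, Optional


def build_qsub_command(
    job_name: str,
    script_path: str,
    sequence_file: str,
    email: Optional[str] = None,
) -> List[str]:
    """Build (but do not execute) an SGE qsub command for one request."""
    cmd = [
        "qsub",
        "-N", job_name,  # job name shown in the queue
        "-cwd",          # run the job from the current working directory
        "-j", "y",       # merge stderr into stdout
    ]
    if email:
        # -M sets the mail recipient; -m e sends mail when the job ends.
        cmd += ["-M", email, "-m", "e"]
    # The job script and its argument (hypothetical names for this sketch).
    cmd += [script_path, sequence_file]
    return cmd


if __name__ == "__main__":
    command = build_qsub_command(
        job_name="dinosolve_job",
        script_path="run_dinosolve.sh",
        sequence_file="sequence.txt",
        email="researcher@example.org",
    )
    print(shlex.join(command))
```

Building the argument list separately from execution makes the submission step easy to test without a running cluster.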


Objectives

• Configure, utilize, and optimize the SGE
• Aesthetically pleasing and professional user interface
• 508 Compliance
• Improve the existing database schema and add user accounts

Benefits from Goals

• Efficient utilization of available resources and increased throughput of the cluster
• Professional user interface leading to a rise in popularity
• Accessibility
• Security and efficient access of previous submissions

User interface will be improved to be more aesthetically pleasing


Working with Dinosolve

• Input title
• Input protein sequence
• Input e-mail address
• Submit, then wait for confirmation...

Protein Sequence: a string of alphabetic characters, each of which represents a particular amino acid in the protein
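A validity check for such a sequence might look like the sketch below. It assumes the input uses the 20 standard one-letter amino acid codes. Whether Dinosolve accepts other letters or applies length limits is not stated here, so the rule set and function name are illustrative only.

```python
# One-letter codes of the 20 standard amino acids.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(sequence: str) -> bool:
    """Return True if the input is non-empty and uses only standard codes.

    Whitespace around the input is ignored and lowercase letters are
    accepted; both choices are assumptions made for this sketch.
    """
    cleaned = sequence.strip().upper()
    return bool(cleaned) and all(ch in STANDARD_AMINO_ACIDS for ch in cleaned)


if __name__ == "__main__":
    for candidate in ("ACDEFGHIK", "acdc", "ACD123", ""):
        print(repr(candidate), "->", is_valid_sequence(candidate))
```

Rejecting malformed input at the web server keeps invalid requests from reaching the compute nodes.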


Working with Dinosolve

• Confirmation of request
• Now wait for results

Working with Dinosolve

• Check your e-mail
• Click the link provided
• The results are displayed

Why is it important to be compliant?

If an entity wishes to receive government funding, then any electronic form the entity uses must be 508 compliant

508 Compliance

• 1998 amendment to the Rehabilitation Act of 1973
  • Requires Federal agencies to make their electronic and information technology accessible to people with disabilities [32]
  • Enacted to eliminate barriers in information technology, to make available new opportunities for people with disabilities, and to encourage development of technologies that will help achieve these goals [32]

http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973

Compliance of Popular Tools

Tool                        508.22 compliance percentage
Dinosolve                   67%
DiANNA                      85%
Scratch Protein Predictor   67%

Milestones

[Milestone tree: ARMS branches into Software, Testing, and Hardware; legend: Unaffected, Within Scope]

Three Computational Nodes

• Each processor has four execution slots

[Diagram: Dell PowerEdge R410 Server computational node with two Intel E5506 processors]

Processors

• Each computational node has two processors
• 6 processors yield 24 execution slots

[Diagram: server cluster with database server, web server, head node, and three computational nodes carrying six Intel E5506 processors]
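The slot count on these slides (three computational nodes, two processors per node, four execution slots per processor) works out as shown in the sketch below. The second function is an illustration added here, not a figure from the slides: it assumes one job occupies one slot and estimates how many rounds of execution a batch of jobs would need.

```python
import math

NODES = 3                # computational nodes in the cluster
PROCESSORS_PER_NODE = 2  # Intel E5506 processors per node
SLOTS_PER_PROCESSOR = 4  # execution slots per processor


def total_slots(nodes: int, processors_per_node: int, slots_per_processor: int) -> int:
    """Total execution slots available to the scheduler."""
    return nodes * processors_per_node * slots_per_processor


def waves_needed(jobs: int, slots: int) -> int:
    """Rounds needed to run all jobs, assuming one job per slot (an assumption)."""
    return math.ceil(jobs / slots)


if __name__ == "__main__":
    slots = total_slots(NODES, PROCESSORS_PER_NODE, SLOTS_PER_PROCESSOR)
    print(f"execution slots: {slots}")
    print(f"rounds for 300 queued jobs: {waves_needed(300, slots)}")
```

With a scheduler in front of the cluster, requests beyond the available slots wait in a queue instead of all running at once.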

Software Milestones

[Milestone tree: Software branches into Database, Sun Grid Engine, User Interface, Algorithm, Scheduler, and Disulfide Bond Predictor; legend: Unaffected, Within Scope]

Testing Milestones

• Server Cluster Performance
  • Stress testing
  • Prevention of denial of service attacks
• Database Performance
  • Stress testing
  • Prevention of SQL injection attacks
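A common defense against injection attacks on the database is to use parameterized queries instead of building SQL strings by hand. The sketch below shows the idea with Python's built-in `sqlite3` module as a stand-in for the MySQL server, so that it runs anywhere. The table and column names are hypothetical and are not taken from the Dinosolve schema.

```python
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create a hypothetical submissions table for this sketch."""
    conn.execute(
        "CREATE TABLE submissions ("
        "id INTEGER PRIMARY KEY, email TEXT NOT NULL, sequence TEXT NOT NULL)"
    )


def add_submission(conn: sqlite3.Connection, email: str, sequence: str) -> int:
    """Insert a row; the ? placeholders keep user input out of the SQL text."""
    cursor = conn.execute(
        "INSERT INTO submissions (email, sequence) VALUES (?, ?)",
        (email, sequence),
    )
    conn.commit()
    return cursor.lastrowid


def find_by_email(conn: sqlite3.Connection, email: str) -> list:
    """Look up submissions; the email is bound as data, never parsed as SQL."""
    return conn.execute(
        "SELECT id, sequence FROM submissions WHERE email = ?",
        (email,),
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_schema(connection)
    add_submission(connection, "researcher@example.org", "ACDC")
    # A classic injection attempt is treated as an ordinary string value.
    print(find_by_email(connection, "x' OR '1'='1"))
    print(find_by_email(connection, "researcher@example.org"))
```

MySQL client libraries for Python offer the same placeholder mechanism, though the placeholder symbol differs between drivers.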

Complete Milestone Tree

[Milestone tree: ARMS branches into Software (Database, Algorithm, Sun Grid Engine, Scheduler, Disulfide Bond Predictor), Hardware (server cluster with database server, web server, head node, three computational nodes, and six Intel E5506 processors), Testing (Server Cluster Performance, Database Performance), and User Interface; legend: Unaffected, Within Scope]

Sitemap

[Sitemap: DinoSolve Homepage links to References, Information, Statistics, Help, User, Contact, and Admin]

Database Schema

Entity Relationship

Risks

• T1: Larger volumes of queries could cause slower processing speeds, depending on hardware capacity
• T2: Improper synchronization of cluster resources could lead to a deadlock
• T3: Race conditions between the HPCR cluster and the MySQL database
• T4: A local attacker could exploit vulnerabilities and cause a crash or execute arbitrary code on the system
• C1: Users may not like new design
• C2: SGE does not enforce exclusive access to the reserved processors

[Risk matrix: Impact versus Probability, plotting T1, T2, T3, T4, C1, and C2]

Technical Risks and Mitigations

T1: Larger volumes of queries could cause slower processing speeds, depending on hardware capacity
Probability: 1 Impact: 5
Mitigation: Create indexes; use specialized data structures and aggregate tables.

T2: Improper synchronization of cluster resources can lead to a deadlock
Probability: 2 Impact: 4
Mitigation: Modify and read application data. Alter execution logic and basic software configuration of SGE.

T3: Race conditions between the HPCR cluster and the MySQL database
Probability: 3 Impact: 3
Mitigation: Use software control on the SGE.

T4: A local attacker could exploit vulnerabilities and cause a crash or execute arbitrary code on the system
Probability: 2 Impact: 2
Mitigation: Keep virus protection up to date. Use very specific types of passwords. Run current scripts, because attackers look for outdated scripts that are likely to contain holes. Limit access to certain files.

Risks

C1: Users may not like new design
Probability: 3 Impact: 3
Mitigation: Create a new, more aesthetically pleasing design

C2: SGE does not enforce exclusive access to the reserved processors
Probability: 4 Impact: 4
Mitigation: Qsub and knowledge of node memory capacity
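The probability and impact values listed for each risk can be combined into a single ranking. The sketch below multiplies the two numbers, which is a common convention for risk matrices but is an assumption here: the slides give the two values for each risk and do not say how, or whether, they are combined.

```python
from typing import List, NamedTuple


class Risk(NamedTuple):
    code: str
    probability: int
    impact: int

    @property
    def score(self) -> int:
        # Probability times impact: an assumed scoring rule, not from the slides.
        return self.probability * self.impact


# Probability and impact values as given on the risk slides.
RISKS = [
    Risk("T1", 1, 5),
    Risk("T2", 2, 4),
    Risk("T3", 3, 3),
    Risk("T4", 2, 2),
    Risk("C1", 3, 3),
    Risk("C2", 4, 4),
]


def rank_risks(risks: List[Risk]) -> List[Risk]:
    """Sort risks from highest to lowest score; ties keep their input order."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


if __name__ == "__main__":
    for risk in rank_risks(RISKS):
        print(f"{risk.code}: probability {risk.probability}, "
              f"impact {risk.impact}, score {risk.score}")
```

Under this scoring rule, C2 (exclusive access to reserved processors) ranks highest and T4 ranks lowest.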

With the updated user interface and correctly configured Sun Grid Engine, Dr. Li hopes to establish a reputable, reliable, and aesthetically pleasing Disulfide Bonding Prediction Server

References for case study

5.  Li, Y. (2010, September 1). CAREER: Novel Sampling Approaches for Protein Modeling Applications [Abstract]. National Science Foundation Award Abstract #1066471.

6.  Li, Y., & Yaseen, A. (2012). Enhancing Protein Disulfide Bonding Prediction Accuracy with Context-based Features. Biotechnology and Bioinformatics Symposium

7.  bioinformatics. 2011. In Merriam-Webster.com. Retrieved February 15, 2013, from http://www.merriam-webster.com/dictionary/bioinformatics

8. Cronk, J. D. (2012). Disulfide Bond. Retrieved February 15, 2013, from Biochemistry Dictionary: http://guweb2.gonzaga.edu/faculty/cronk/biochem/D-index.cfm?definition=disulfide_bond

9.  Yan, Y., & Chapman, B. (2008). Comparative Study of Distributed Resource Management Systems–SGE, LSF, PBS Pro, and LoadLeveler. Technical Report-Citeseerx.

10. Li, Y., & Yaseen, A. (2012). Dinosolve. Retrieved from http://hpcr.cs.odu.edu/dinosolve/


References for competition

11. Arvind Krishna, “Why Big Data? Why Now?”, IBM, 2011. URL: http://almaden.ibm.com/colloquium/resources/Why%20Big%20Data%20Krishna.PDF
12. Yonghong Yan, Barbara M. Chapman, Comparative Study of Distributed Resource Management Systems - SGE, LSF, PBS Pro, and LoadLeveler, Department of Computer Science, University of Houston, May 2005 (pdf)
13. Dr. Li’s site http://hpcr.cs.odu.edu/dinosolve/
14. Scratch Predictor http://scratch.proteomics.ics.uci.edu/
15. DiANNA server http://clavius.bc.edu/~clotelab/DiANNA/

Portable Batch System (PBS)
16. http://resources.altair.com/pbs/documentation/support/PBSProUserGuide12-2.pdf
17. http://www.pbsworks.com/SupportDocuments.aspx?AspxAutoDetectCookieSupport=1
18. http://resources.altair.com/pbs/documentation/support/PBSProRefGuide12-2.pdf
19. http://resources.altair.com/pbs/documentation/support/PBSProAdminGuide12-2.pdf
20. http://www.pbsworks.com/(S(tykrsyqbemmlf3o5zwrmjrgf))/images/solutions-en-US/PBS-Pro_Datasheet-USA_WEB.pdf
21. http://agendafisica.files.wordpress.com/2011/05/pbs.pdf

Moab HPC Suite
22. http://www.adaptivecomputing.com/publication/420/wppa_open/

IBM Platform LSF
23. http://public.dhe.ibm.com/common/ssi/ecm/en/dcd12354usen/DCD12354USEN.PDF

Apache Hadoop with Zookeeper
24. http://zookeeper.apache.org/doc/current/zookeeperOver.html
25. http://www.cloud-net.org/~swsellis/tech/solaris/performance/doc/blueprints/0102/jobsys.pdf

Reference for 508 Compliance

26. http://en.wikipedia.org/wiki/Section_508_Amendment_to_the_Rehabilitation_Act_of_1973


Appendix

• Competition Matrix for Resource Management Systems
• 508.22 Compliance Statistics for Dinosolve

Competing Resource Management Systems

Features of systems                              PBS                LSF                SGE
Supported platforms                              Unix               Unix & NT          Unix
Multi-cluster support                            Yes                Yes                No
System level checkpoint restart                  No                 Yes                Yes
User level checkpoint restart                    No                 No                 Yes
Large computational grid support                 No                 No                 No
Massive scalability                              Yes                Yes                Yes
Parallel job support with Sun HPC ClusterTools   Loose Integration  Tight Integration  Loose Integration
Distribution format of end product               Source             Binary only        Binary and Source
Free?                                            Yes                No                 Yes
POSIX 1003.2d compliance                         Yes                No                 Yes