NGS induction --- case study: the BRIDGES project Micha Bayer Grid Services Developer, BRIDGES project National e-Science Centre, Glasgow Hub


NGS induction --- case study: the BRIDGES project

Micha Bayer

Grid Services Developer, BRIDGES project

National e-Science Centre, Glasgow Hub

The BRIDGES project

Biomedical Research Informatics Delivered by Grid-Enabled Services

2-year e-Science project, started 1 October 2003

aim: provide data integration and grid-based compute power for Cardiovascular Functional Genomics project

CFG project investigates genetic predisposition for hypertensive heart disease

my role on project: develop grid applications for end users

BRIDGES requirements and the NGS

functional:

high throughput compute tasks, e.g. large BLAST jobs

non-functional:

interfaces to applications should be targeted at the less computer literate --- users range in computer literacy from fairly advanced to mildly technophobic

security requirements should not cause any extra work or inconvenience for users as this may put them off altogether

resources provided by BRIDGES compete with familiar, similar resources already on offer at established bioinformatics institutions (EBI, NCBI, EMBL) -> need to make things “palatable” so people do use it

How to get your job onto the NGS

[Diagram: routes for getting jobs onto the NGS clusters (Leeds, Oxford, RAL, Manchester)]

standard solutions: NGS portal, GSI-SSH

custom solutions: project portal, standalone GUI client

Custom grid applications

if possible/appropriate, get a developer to write bespoke interface to a grid app running on NGS

only worthwhile if application is used frequently and/or by many users and is relatively unchanging/simple

best to hide complexity of grid from users altogether

users should not even have to choose between resources

automatic scheduling of jobs to resources that currently have spare capacity is desirable

best option for delivery is portlet in project-specific web portal – just need web browser for access then
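
The idea of automatically sending jobs to whichever resource has spare capacity can be illustrated with a minimal sketch. It is not the BRIDGES scheduler: the `Resource` class, the `free_processors` figures and the greedy rule are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    free_processors: int  # hypothetical figure, e.g. polled from each cluster's queue


def assign_subjobs(resources, n_subjobs):
    """Greedily give each subjob to the resource with the most free processors left."""
    if not resources:
        raise ValueError("no resources available")
    free = {r.name: r.free_processors for r in resources}
    plan = {r.name: 0 for r in resources}
    for _ in range(n_subjobs):
        # Ties go to the resource listed first.
        target = max(free, key=free.get)
        plan[target] += 1
        # May drop below zero: surplus subjobs would simply queue on that resource.
        free[target] -= 1
    return plan


if __name__ == "__main__":
    clusters = [Resource("Leeds", 3), Resource("Oxford", 1)]
    print(assign_subjobs(clusters, 4))
```

The user never sees the choice of cluster; the portlet would only need the job itself as input.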

Project web portals

portals are configurable, personalized collections of web applications delivered to a web browser as a single page

NGS encourage projects to maintain their own web portals to deliver apps to their users

applications can then be provided through user-friendly, specific portlet interfaces

allows the hiding of grid complexity from users

requires developer time

BRIDGES portal currently uses IBM WebSphere (free to academia)

More on portals

increasingly important technology – not just for grid computing (cf. Yahoo)

gives end users a customized view of software and hardware resources specific to their particular application domain

also provides a single point of access to Grid-based resources following user authentication (“single-sign-on”)

content is provided by portlets (Java servlet extension) – JSR168 standard provides for exchangeability

some portal packages currently available: IBM WebSphere, GridSphere, Jetspeed, uPortal, Jportlet, Apache Pluto

Authentication and User Management (1)

model adopted in BRIDGES:

requirement was for users not to have to obtain and manage certificates

we applied for a single project account at NGS – users do not need individual NGS accounts

this account maps to a single user (“BRIDGES”) on the NGS with home directories on all nodes (like normal users)

authentication for this user on NGS is by means of the host certificate of the machine where the jobs are submitted from (under control of BRIDGES project)

users authenticate via the BRIDGES web portal using standard username and password pairs

Authentication and User Management (2)

users can create accounts for themselves in the BRIDGES WebSphere portal (“self-care”)

alternatively one could of course issue usernames and passwords to the users

information gathered is kept in WebSphere's secure user database

current info is very basic but will be extended to include more detail (e.g. URL of user's project or departmental website where the user is listed)

provides at least a basic means of accounting for user activity

no need for physically visiting the Registration Authority/presenting ID

may need to resort to stricter security if system is abused e.g. if impersonation takes place etc.

Authorisation with PERMIS

PERMIS = grid authorisation software developed at Salford University (http://sec.isi.salford.ac.uk/permis/)

BRIDGES uses PERMIS to differentially allow users access to resources

typical use is with GT3.3 service but lookup-type use is also possible with other services (in our case GT3.0.2)

code in our service calls a PERMIS authorisation service running on a machine at NeSC

user's roles are queried and access to resource is permitted or denied accordingly

gives BRIDGES staff full control over who is allowed to use NGS resource through our applications
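
The role lookup described above can be sketched as follows. This is not the PERMIS API: the policy table, role names and usernames are all hypothetical stand-ins for the decision that the PERMIS authorisation service returns.

```python
# Hypothetical role-to-resource policy; in BRIDGES the decision comes from PERMIS.
POLICY = {
    "project_member": {"NGS", "ScotGRID", "NeSC Condor Pool"},
    "guest": {"NeSC Condor Pool"},
}

# Hypothetical users and their roles (portal usernames, not certificates).
USER_ROLES = {
    "alice": {"project_member"},
    "bob": {"guest"},
}


def allowed_resources(username):
    """Collect every resource granted by any of the user's roles."""
    allowed = set()
    for role in USER_ROLES.get(username, set()):
        allowed |= POLICY.get(role, set())
    return allowed


def is_authorised(username, resource):
    """Permit access only if one of the user's roles grants the resource."""
    return resource in allowed_resources(username)


if __name__ == "__main__":
    for user in ("alice", "bob", "mallory"):
        print(user, sorted(allowed_resources(user)))
```

Unknown users and unknown roles get an empty set, so the default is to deny.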

[Diagram: end user and the resources to which access is controlled: NGS (Leeds, Oxford, RAL, Manchester), ScotGRID, NeSC Condor Pool]

Security in BRIDGES – summary

[Diagram: end user, BRIDGES web portal, NeSC machine with PERMIS authorisation service (GT3.3), NeSC grid server with host credentials, NGS clusters (Leeds, Oxford, RAL, Manchester)]

steps shown in the diagram:

authenticate at BRIDGES web portal with username and password only

job request is passed on securely with username

get user authorisations

make host proxy, authenticate with NGS and submit job

Host authentication for job submission

allows us to submit jobs to NGS as user “BRIDGES”

apply for host certificate for the grid server machine as normal (UK e-Science Certification Authority)

results in a passwordless private key and host certificate for the machine

Java CoG Kit code can then be used to generate a host proxy locally

this is used for job submission

Use case: Microarray reporter sequence BLAST jobs

microarray chips contain up to 400,000 reporter sequences

these need to be compared to existing annotated sequence databases

takes approx. 3 weeks to compute against human genome on average desktop machine

“Job processing – please wait....”(and wait....and wait....)

BLAST

Basic Local Alignment Search Tool

used for comparing biological sequences (DNA, protein) against a set of target sequences

returns a sorted list of matches

most widely used algorithm for this sort of thing

compute intensive

How do I get my application to run efficiently on a grid?

applications to be deployed on a compute grid need to be parallelised to really benefit (can of course just run them as single jobs too)

for this one must be able to partition a job into several subjobs

these then get processed separately at the same time on multiple processors

need to combine results of individual subjobs at the end
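
The partition, process and combine pattern above can be sketched on a single machine with a thread pool standing in for the grid. The function names are illustrative, and `run_subjob` is a placeholder for whatever the real application computes.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n_parts):
    """Split items into at most n_parts contiguous, near-equal chunks."""
    if not items:
        return []
    n_parts = max(1, min(n_parts, len(items)))
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def run_subjob(chunk):
    """Placeholder work: square each number in the chunk."""
    return [x * x for x in chunk]


def run_parallel(items, n_workers):
    """Process the chunks concurrently, then combine results in input order."""
    chunks = partition(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(run_subjob, chunks))
    return [value for part in partial_results for value in part]


if __name__ == "__main__":
    print(run_parallel(list(range(10)), 3))
```

`pool.map` returns results in submission order, so the combine step is a plain concatenation here; on a real grid the subjobs finish in arbitrary order and the results need to be matched back to their inputs.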

Parallel BLAST – grid style

partition your job by putting one or several query sequences into a separate input file (= 1 subjob)

distribute all input files, the executable and target data onto your grid clusters (“stage-in”)

results are returned to the server and combined there

if 100 free processors are available, and 100 subjobs are to be run, the time taken is 1/100th of the time it would have taken to run the whole job on a single machine (plus overheads for scheduling, data transfer and result combining)
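
A minimal sketch of the partitioning step and of the timing estimate, assuming FASTA-formatted query sequences and subjobs of equal cost. The function names and the `seqs_per_subjob` parameter are illustrative and not taken from the GridBLAST code.

```python
import math


def split_fasta(text, seqs_per_subjob):
    """Group FASTA records into subjob inputs of up to seqs_per_subjob sequences."""
    if seqs_per_subjob < 1:
        raise ValueError("seqs_per_subjob must be at least 1")
    records, current = [], []
    for line in text.splitlines():
        # A header line starts a new record; flush the previous one first.
        if line.startswith(">") and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return [
        "\n".join(records[i:i + seqs_per_subjob]) + "\n"
        for i in range(0, len(records), seqs_per_subjob)
    ]


def estimated_wall_time(serial_time, n_subjobs, n_processors, overhead=0.0):
    """Ideal wall time if equal-cost subjobs run in waves of n_processors."""
    if n_subjobs < 1 or n_processors < 1:
        raise ValueError("n_subjobs and n_processors must be at least 1")
    waves = math.ceil(n_subjobs / n_processors)
    return serial_time * waves / n_subjobs + overhead


if __name__ == "__main__":
    queries = ">a\nACGT\n>b\nGG\nTT\n>c\nAA\n"
    for subjob_input in split_fasta(queries, 2):
        print(repr(subjob_input))
    print(estimated_wall_time(21 * 24, 100, 100))  # serial time given in hours
```

Under these idealised assumptions, the three-week desktop job mentioned earlier would take roughly 5 hours on 100 processors of similar speed, before scheduling, transfer and combining overheads are added.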

To stage or not to stage?

file staging is the copying – at runtime – of files onto the remote resource

example: BLAST jobs

we need:

input file

target data file (“database” – really a flat text file)

executable (BLAST)

target files and executable are unchanging components for this kind of job

it is best to store these locally on the remote resources to avoid staging overhead (target data are in the region of several GB in size and growing exponentially)

rather than individual users keeping multiple copies of publicly available data in their home directories, get sys admins to put up copies visible to all

must stage in input files since these vary from job to job
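
The staging decision can be sketched as a simple comparison between what a job needs and what is already installed on the remote resource. The file names, sizes and bandwidth below are hypothetical and are not measurements from BRIDGES.

```python
def plan_stage_in(job_files, preinstalled):
    """Return the files to copy at runtime: needed by the job but absent remotely."""
    return [name for name in job_files if name not in preinstalled]


def staging_seconds(sizes_bytes, bandwidth_bytes_per_s):
    """Rough transfer time for the staged files at a given sustained bandwidth."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return sum(sizes_bytes) / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Hypothetical file names for one subjob.
    needed = ["blast_executable", "target_db", "subjob_001.fasta"]
    installed = {"blast_executable", "target_db"}
    print(plan_stage_in(needed, installed))
    # Hypothetical cost of staging a 3 GB target file at 10 MB/s instead.
    print(staging_seconds([3 * 10**9], 10 * 10**6))
```

With the executable and target data held on the cluster, only the small per-job input file crosses the network.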

BRIDGES GridBLAST Job Submission

[Diagram: GridBLAST job submission architecture]

end user machine: GridBLAST client; sends job request, receives returned result

NeSC Grid Server (Titania): Apache Tomcat, GT 3 core grid service, BRIDGES Meta-Scheduler, with PBS, Condor and GT2.4 wrappers

jobs farmed out to compute nodes:

ScotGRID: masternode (PBS server side + BLAST), worker nodes

NeSC Condor pool: Condor Central Manager (Condor + BLAST), execution hosts

NGS: Oxford headnode (GT2.4 + BLAST), Leeds headnode (GT2.4 + BLAST), execution hosts

Current status of our system

software is still at prototype stage – haven’t benchmarked any really big jobs yet

Java Web Start client (launched from portal) connects to service – needs to be changed to portlet

user registration needs to be revised and users re-registered

happy to share portlet code etc with others once finished

How we worked with the NGS

BRIDGES was one of the first projects doing bio stuff on NGS

we established a basic infrastructure needed for BLAST on the NGS clusters in collaboration with NGS user support

good collaboration on our security requirements – very helpful and accommodating

our project account is the first of its kind and we jointly tailored a solution that would fit BRIDGES

ask for what you need! things are not cast in stone and it is supposed to be a public service

Public bioinformatics infrastructure on NGS – current status

we are in the process of establishing an infrastructure for BLAST jobs that can be used by all

this includes:

making BLAST and mpiBLAST executables publicly available

mirroring the entire NCBI BLAST databases repository

currently trialling this on Leeds node – will be replicated at other nodes eventually

data replication on all nodes necessary to avoid severe performance hits

input from others needed and welcome!

Contact details

BRIDGES website: http://www.brc.dcs.gla.ac.uk/projects/bridges/

Code repository (available soon): http://www.brc.dcs.gla.ac.uk/projects/bridges/public/code.htm

BRIDGES web portal: http://europa.nesc.gla.ac.uk:9081/wps/portal

Contacts:

Micha Bayer at NeSC in Glasgow -- [email protected]

Richard Sinnott at NeSC in Glasgow -- [email protected]