
  • Galaxy Australia

    A public bioinformatics compute resource, distributed nationally.

    Simon Gladman (presented by Andrew Lonie)

    Supported by

    Development partners

  • Outline

    ● A bit about Galaxy Australia

    ● Nationally distributed compute & Pulsar

    ● Dynamic job handling & DTD

    ● Reference data management & CVMFS

    ● Monitoring and stats collection

    ● Next?

  • What is Galaxy?

    ● Web-based bioinformatics platform
    ● Retains “histories” of analyses
    ● Has a large number of cutting-edge tools
    ● Conda environments/Singularity containers for dependency management
    ● Has an app store
    ● Has a large support community

  • Galaxy Australia Infrastructure Locations (2015–2017)

    Galaxy-QLD: 96 cores, 60 TB disk

    Galaxy-Mel: 64 cores, 50 TB disk

  • Galaxy Australia Infrastructure Locations (2018–now)

    Galaxy Australia main queue: 150 cores, 60 TB disk

    Pulsar-Mel: ~90 cores

    Pulsar-NCI: ~60 cores

  • Galaxy Australia - Usage

    • Public and free Galaxy server in Australia

    • Currently running > 50,000 jobs per month.

  • Galaxy-Aus Architecture (Brisbane)

    Main site:

    ● Head Node (m2.xlarge, 16 vCPU | 64 GB RAM | 500 GB /mnt): 2 web handlers (6 threads each), 2 job mules in 1 farm, some CloudMan changes
    ● DB Server (m1.large, 4 vCPU | 16 GB RAM): PostgreSQL on a 100 GB volume
    ● CEPH filesystem; NFS server for job working directories and TMP
    ● Worker Nodes (m1.xxlarge, Slurm queue: main)
    ● Kubernetes for Interactive Environments

    Melbourne Pulsar (training jobs):

    ● Head Node (m2.xlarge); Worker Nodes (m1.xlarge); NFS & CVMFS reference data

    Canberra Pulsar (large/long jobs):

    ● Head Node (m2.xlarge); Worker Nodes (m1.xxlarge); NFS & CVMFS reference data

    Jobs are routed to the local Slurm queue or a Pulsar queue depending on DB/disk activity.

    Workers are a minimal GVL image, version controlled and updated, with Telegraf. They can be swapped into the cluster by draining and re-deploying.

  • Galaxy-Aus Architecture (Brisbane), updated

    The same layout with updates: 4 job mules in 1 farm on the head node, GPFS storage, and a separate services server running the Slurm controller and NFS server.

  • What is Pulsar?

    ● Python server application
    ● Allows a Galaxy server to run jobs on a remote system
    ● No shared file system required
    ● Configurable
    ● Securable
    ● Can submit jobs to an HPC queueing system
    ● Automatically handles dependency management

  • How Pulsar works

    1. User clicks “Execute”
    2. Galaxy packs up and sends:
       ○ Data
       ○ Config files
       ○ Tool name & version
       ○ Parameters and other job metadata
    3. Pulsar accepts the job
    4. Pulsar checks whether the tool is installed locally
       ○ If not, it installs the tool with Conda or Docker
    5. Pulsar submits the job to the local queue
    6. Pulsar waits until the job completes
    7. Pulsar packs up the result and sends it back to Galaxy
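    A minimal sketch of that round trip, assuming hypothetical helper and environment names (the real pulsar-client API differs); the point is that inputs travel with the job, so no shared filesystem is needed:

    # Illustrative sketch of the Pulsar round trip; all names are
    # hypothetical and the real pulsar-client API differs.
    import shutil
    import subprocess
    import tempfile
    from pathlib import Path

    def run_remote_job(tool, version, params, input_files):
        # 2. "Pack up": stage inputs into a fresh working directory,
        #    standing in for the HTTP(S) transfer Galaxy performs.
        workdir = Path(tempfile.mkdtemp(prefix="pulsar_job_"))
        for f in input_files:
            shutil.copy(f, workdir)

        # 4. Make sure the tool exists locally, e.g. in a Conda env
        #    (hypothetical env naming; harmless if it already exists).
        env = f"{tool}-{version}"
        subprocess.run(
            ["conda", "create", "-y", "-n", env, f"{tool}={version}"],
            check=False,
        )

        # 5./6. Submit to the local Slurm queue and block until done
        #    (sbatch --wait returns when the job finishes).
        subprocess.run(
            ["sbatch", "--wait", "--chdir", str(workdir),
             "--wrap", f"conda run -n {env} {tool} {params}"],
            check=True,
        )

        # 7. "Send back": Galaxy-side code would fetch results from here.
        return workdir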

  • What we use it for

    Pulsar-Mel

    ● Training jobs / workshops
    ● Runs lots of small jobs, usually 2–4 cores
    ● Allows a tutorial or workshop to run without interfering with the main queue

    Pulsar-NCI

    ● Large, long-running jobs: jobs that could run for days
    ● Usually 16+ core jobs, with lots of disk I/O
    ● Genome assemblies / transcriptome assemblies

  • Dynamic Tool Destinations

    ● Rules mapping tools to destinations
    ● Written in YAML
    ● Rules can match on:
      ○ File size
      ○ User
      ○ Tool parameters
    ● More complex rules can be written as Python functions (see the sketch below)
    ● Can be altered without a Galaxy restart
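    As a flavour of the Python-function side, a minimal dynamic rule might look like the following; the destination ids and the 10 GiB threshold are assumptions, and the exact job API should be checked against the Galaxy version in use. Galaxy matches the function's argument names against what it can supply and uses the returned string as the destination id.

    # Hypothetical dynamic destination rule; Galaxy loads functions like
    # this from its job rules directory. Ids and threshold are illustrative.
    def route_by_input_size(job):
        # Total size of all input datasets for this job, in bytes.
        total = sum(
            inp.dataset.get_size()
            for inp in job.input_datasets
            if inp.dataset is not None
        )
        # Big inputs go to the large/long-job Pulsar at NCI;
        # everything else stays on the local Slurm "main" queue.
        if total > 10 * 1024 ** 3:  # > 10 GiB
            return "pulsar-nci"
        return "slurm-main"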

  • The problem of reference data: there’s a lot, and it’s hard to deal with

    ● 1000s of genomes available, from multiple sources
    ● > 200 available on Galaxy Australia, and growing
    ● Multiple tool indices for each genome
    ● User requests for more
    ● Large compute burden to create the indices; admin burden to manage them

  • So we started talking with other large Galaxies!

    Share the burden:

    ● Admin of references
    ● Compute requirements

    We decided to share it all via CernVM-FS:

    ● Currently controlled by the Galaxy Project
    ● Moving to a community model soon

  • CVMFS global structure

    ● Stratum 0 (data.galaxyproject.org, Penn State): the canonical source; transactional updates
    ● Stratum 1 (Melbourne, Germany, elsewhere): multiple servers mirroring the Stratum 0; continuous updates
    ● User servers (Galaxy AU, Galaxy EU, Galaxy Main, GVL instances, other compute servers): mount the repo from a Stratum 1 chosen via the GEO API, falling back to the other Stratum 1s (sketched below)
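    The primary/fallback choice, as an illustration only: the real CVMFS client handles this natively (ordered by its GEO API), and the mirror hostnames here are made up.

    # Sketch of a client picking the first reachable Stratum 1 mirror.
    # Hostnames are hypothetical; .cvmfspublished is the small manifest
    # a CVMFS server exposes at the repository root.
    import urllib.request

    MIRRORS = [
        "http://cvmfs1.melbourne.example.org",   # geographically closest first
        "http://cvmfs1.germany.example.org",
        "http://cvmfs1.elsewhere.example.org",
    ]
    REPO = "data.galaxyproject.org"

    def pick_stratum1():
        for base in MIRRORS:
            url = f"{base}/cvmfs/{REPO}/.cvmfspublished"
            try:
                with urllib.request.urlopen(url, timeout=5):
                    return base              # primary mount
            except OSError:
                continue                     # unreachable: try the next one
        raise RuntimeError("no Stratum 1 reachable")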

  • stats.genome.edu.au

  • Structure

    ● Telegraf runs as a daemon on every machine: the Galaxy Australia head node and worker nodes, the Pulsar nodes, and other servers
    ● InfluxDB collects the metrics (Galaxy-Aus, Pulsar-Mel, Pulsar-NCI)
    ● Grafana is the app & web server behind stats.genome.edu.au (queried as sketched below)
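    For a sense of how those metrics are consumed, a sketch using the influxdb Python client; the database name is an assumption, while the cpu measurement and usage_idle field follow Telegraf's standard cpu input plugin.

    # Query mean CPU busy-time per host over the last hour from the
    # metrics store (pip install influxdb). Database name is assumed.
    from influxdb import InfluxDBClient

    client = InfluxDBClient(host="stats.genome.edu.au", port=8086,
                            database="telegraf")

    result = client.query(
        "SELECT 100 - mean(usage_idle) AS busy "
        "FROM cpu WHERE time > now() - 1h GROUP BY host"
    )

    # ResultSet keys are (measurement, tags); values are point dicts.
    for (measurement, tags), points in result.items():
        for point in points:
            print(tags["host"], round(point["busy"], 1), "% busy")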

  • All very nice tech, but is it used??

  • What’s next?

    ● More compute nodes around the country

    ● Cloudstor?

    ● BYO resources

    ● More internationalisation of resources

  • https://twitter.com/galaxyaustralia

    https://www.embl-abr.org.au/galaxyaustralia/

    https://usegalaxy.org.au/