MGT Documentation Michael Payne and Sandeep Kaur Oct 01, 2021


Contents:

1 MGT overview
1.1 References

2 MGTdb website features
2.1 Navigating the website
2.2 Account features

3 Data visualisation, download and export
3.1 Tabular
3.2 Interactive graphics
3.3 Report
3.4 Downloads

4 Generating alleles file locally
4.1 Installation
4.2 Quickstart
4.3 Inputs
4.4 Outputs
4.5 Parameters
4.6 Examples

5 Setting up a local application
5.1 Required software
5.2 Setting up databases and website
5.3 Running the website

6 Analysis pipeline
6.1 Reads to Alleles
6.2 Alleles to MGT types

7 Database schema

8 Additional information

9 Indices and tables


CHAPTER 1

MGT overview

MGT stands for multilevel genome typing, a novel technique to stably characterize bacteria (Payne et al. 2020). MGT is based on multilocus sequence typing (MLST). MGT provides multiple MLST-like schemes within the one system, thus enabling the study of isolates at different levels of relatedness (such as closely related or distantly related).

Each isolate is assigned a sequence type (ST) at each of the MGT levels, and the STs assigned across the levels together form the isolate's genome type (GT). For example, the isolate SRR1966720 (shown in the table in the screenshot below) is assigned an ST at each MGT level: 19-9-28-37-39-40-5344-6819-7219. This is the GT of the isolate (similar to a barcode).

The identifier is assigned based on variations in certain grouped sets of genes (9 in this case, MGT1 to MGT9, hence the 9 numbers in the identifier). The groups are defined based on mutation rates, where the earlier (smaller) groups change more slowly than the later (larger) ones.

An MGT barcode assigned to a bacterial genome is stable and can be used to uniquely and consistently identify strains with a particular genome type. Similar isolates can be identified by shared ST values at the MGT levels. Identical STs have the same colour in the table (enabling easy visual identification). Additionally, to identify strains which are closely related, clonal clusters (CCs) or outbreak detection clusters (ODCs) can be used. In this clustering approach, STs that have at most one allele difference from any other ST in the group are grouped together into a CC. ODCs are similar to CCs, except that the maximum number of allele differences is indicated by the number following "ODC". For example, ODC10 allows a maximum of 10 allele differences.
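The clustering described above can be illustrated with a minimal Python sketch. It is not the MGTdb implementation: it assumes single-linkage grouping (an ST joins a cluster if it is within the allowed number of allele differences of any member), counts every mismatching locus as a difference, and uses made-up four-locus profiles. The function names `allele_differences` and `cluster_sts` are hypothetical.

```python
from itertools import combinations


def allele_differences(profile_a, profile_b):
    """Count the loci at which two allelic profiles carry different alleles."""
    return sum(1 for a, b in zip(profile_a, profile_b) if a != b)


def cluster_sts(profiles, max_diff):
    """Group STs by single linkage.

    profiles: dict mapping an ST identifier to a tuple of allele numbers.
    max_diff: maximum number of allele differences allowed between linked STs
              (1 for a CC in this sketch, 10 for an ODC10-like cluster).
    Returns a sorted list of sorted lists of ST identifiers.
    """
    parent = {st: st for st in profiles}

    def find(st):
        # Follow parent links to the cluster representative (with path halving).
        while parent[st] != st:
            parent[st] = parent[parent[st]]
            st = parent[st]
        return st

    for st_a, st_b in combinations(profiles, 2):
        if allele_differences(profiles[st_a], profiles[st_b]) <= max_diff:
            parent[find(st_a)] = find(st_b)

    clusters = {}
    for st in profiles:
        clusters.setdefault(find(st), []).append(st)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-locus profiles, for illustration only.
example_profiles = {
    1: (1, 1, 1, 1),
    2: (1, 1, 1, 2),  # one difference from ST 1
    3: (1, 1, 2, 2),  # one difference from ST 2, two from ST 1
    4: (5, 6, 7, 8),  # differs from every other ST at all loci
}

print(cluster_sts(example_profiles, max_diff=1))  # [[1, 2, 3], [4]]
```

Note that ST 3 ends up in the same cluster as ST 1 even though the two differ at two loci, because both are linked through ST 2. This chaining is a property of single-linkage grouping, and is one reason cluster assignments can change as new STs are added.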


This screenshot of MGTdb shows GTs of a few Salmonella Typhimurium isolates.

In this screenshot, the same strains as in the previous figure are shown, but instead of the stable MGT STs, the unstable CCs at each MGT level are shown, along with ODCs, which are calculated for the largest MGT level. CCs and ODCs enable further relatedness analysis.


1.1 References

Payne M, Kaur S, Wang Q, Hennessy D, Luo L, Octavia S, Tanaka MM, Sintchenko V, Lan R. Multilevel genome typing: genomics-guided scalable resolution typing of microbial pathogens. Eurosurveillance. 2020 May 21;25(20):1900519.


CHAPTER 2

MGTdb website features

MGTdb is a web service and database system which enables a user to upload isolates, obtain MGT assignments and explore these in the context of the local and global database of isolates. MGTdb is currently available for three organisms, namely Salmonella Typhimurium (Payne et al. 2020) at https://mgtdb.unsw.edu.au/salmonella, Salmonella Enteritidis (Luo et al. 2021) at https://mgtdb.unsw.edu.au/enteritidis and Vibrio cholerae (Cheney et al. 2021) at https://mgtdb.unsw.edu.au/vibrio.

The MGT website is available at: https://mgtdb.unsw.edu.au

2.1 Navigating the website

The various features of the website are shown below.

2.1.1 Summary of global epidemiology

On a species homepage, the distribution of the top 5 STs (sequence types) by continent can be seen. Hovering reveals the ST number as well as the number of isolates with this ST. In the figure below, the top ST value and count are shown for South America (ST=23, count=27).


2.1.2 Search

As you navigate the website, you will come across the isolate filter at various points.

The isolate filter feature. Isolates can be searched for by metadata, name, MGT-ST assignment and MGT-CC assignment, or by any combination of these. The search results are displayed in a table, as shown in the image below.

These isolates can also be downloaded via the button shown at the bottom of the table. Currently, there is a maximum limit of 10,000 for the download.

On selecting an isolate, its details are displayed, as shown below.


Features of this isolate can be selected by clicking on them (selected here are isolate type: “clinical” and sequence type: “6”). An exact match is then performed, and the resulting strains are shown in a table (similar to the previous figure).

2.2 Account features

You can create an account with MGT which enables you to submit isolates, and get an MGT assignment for them.

2.2.1 Creating an account

You can set up an account at the website. Simply click on Register at the top navigation bar, and enter in your details.

Doing so will send you an email with a link (check your junk email if you don’t receive an email within a few minutes). Click the link, or copy and paste it into your browser, and your account will be activated; you can then log in with your supplied details.


Note: Certificates will be added soon to secure your communication with the website over HTTPS.

2.2.2 Uploading isolates

Once you log in, you can add projects, and isolates to a project.

Creating a project is straightforward. Simply click on ‘Projects’ in the top navigation bar. Then select the organism you want to create a project for.

Click on ‘+ Add new project’ to create a new project. Only a project name is required to create a project.

Once a project is created, navigate to the project detail. Clicking on ‘+ Upload new isolate’ then enables you to add isolates.

Webpage for adding an isolate to the MGT database. The relevant information can be supplied here.


Note that for the files to be uploaded, either the forward and reverse Illumina read files or an alleles file should be supplied. The advantage of providing an alleles file is that the uploaded file is a lot smaller (useful if internet speeds are an issue).

To generate the alleles file locally, see section Generating alleles file locally. Apart from the uploaded files, Collection year, Country, Continent, Privacy status and Isolate name are compulsory fields.

Once the information is received on the server, it is submitted as a job, in which the alleles are first extracted (if reads are provided), following which an MGT assignment is made and added to the database. Details about this process can be found in Analysis pipeline.

Once an MGT assignment is made, an email notification is sent.

2.2.3 Privacy

We take the privacy of your uploaded isolates very seriously. Your isolates are made public only if you specify this. Furthermore, if you delete your isolates, then all associated isolate metadata and the uploaded files are deleted.

Note that when a project is deleted, all the isolates in that project and their associated data are deleted.

2.2.4 References

Cheney L, Payne M, Kaur S, Lan R. Multilevel Genome Typing Describes Short- and Long-Term Vibrio cholerae Molecular Epidemiology. mSystems. 2021 Aug 24;6(4):e00134-21.

Luo L, Payne M, Kaur S, Hu D, Cheney L, Octavia S, Wang Q, Tanaka MM, Sintchenko V, Lan R. Elucidation of global and national genomic epidemiology of Salmonella enterica serovar Enteritidis through multilevel genome typing. Microbial Genomics. 2021 Jul 22;7(7):000605.

Payne M, Kaur S, Wang Q, Hennessy D, Luo L, Octavia S, Tanaka MM, Sintchenko V, Lan R. Multilevel genome typing: genomics-guided scalable resolution typing of microbial pathogens. Eurosurveillance. 2020 May 21;25(20):1900519.


CHAPTER 3

Data visualisation, download and export

MGTdb enables a number of different visualisations and downloads.


3.1 Tabular

All initially loaded or filtered isolates are shown in a table, as below.


Data shown in screenshot: https://mgtdb.unsw.edu.au/salmonella/isolate-list?country=Australia&searchType=and

Key features of the table are:

• The table shows 100 isolates at one time.

• All initially loaded or filtered isolates can be sorted by clicking on a cell in the table header. By default, rows without any values (i.e. null values) are added to the end of the sorted result.

• The table shows the isolates and their metadata. By default the table shows the STs assigned to the isolate at every MGT level, and ODCs.

• The user can switch this view by clicking on the button ‘Clonal complexes view’ - this updates the table to show CCs at all MGT levels in place of STs.

• The cells containing the ST, CC and ODC values are coloured according to the identifier - this makes it easy to recognize identical values across the table. This feature can be switched off by clicking on ‘Display color’ when required.

• dST.

3.2 Interactive graphics

All isolates in the database, or filtered isolates, can be graphically summarized. Clicking on ‘Graphical view’ loads a page which can summarise the data in three ways:

1. Distribution of counts of isolates with STs, CCs or ODCs over time or location

2. Distribution of counts of isolates with STs, CCs or ODCs over time and location

3. Distribution of isolate counts within particular STs, CCs or ODCs


The first two utilize temporal and spatial metadata associated with isolates. The third does not utilize such metadata, but only the associated MGT assignments (STs, CCs and ODCs).


To visualise any of these graphs, the data needs to be initially loaded by clicking on ‘Load data’. The data is fetched, transformed and plotted (the first graphical view is shown below for all public data in the Salmonella Typhimurium database).


Key features of these graphs are as follows:

• Same colored bars indicate the same ST (or CC, ODC) value.

• Hovering over the bar reveals the ST (or CC, ODC) value.

• The buttons at the top of the graph enable the graph to be interactively explored. This re-transforms and re-plots the data.

• As not all isolates for which the plot has been generated will contain metadata annotations, the counts of those that do contain the required metadata and are included in the plot are indicated below.

• The displayed graph can be downloaded by clicking on the link ‘Download SVG’ below the plot. On most browsers, simply clicking on the link downloads the SVG; on some browsers, however, right-clicking on the link is needed to trigger the download.

Note: It is strongly recommended that these graphs be used with data filtered to a smaller set (compared to the complete data in the database). Otherwise, depending on your browser and computer, this process may take a long time, as the data for plotting is loaded via JavaScript into the user's browser, and rendering a plot with tens of thousands of bars may be computationally intensive.

3.3 Report

A report can be generated on MGTdb for any given country (or project - if logged in). The report summarizes data at every MGT level in the past 10 years.


For every organism, the ‘Summary report’ button in the header navigates to a page where the user can select any country (or project) and generate the report.

Once the ‘Generate report’ button is clicked, MGTdb retrieves isolates with the requested metadata from the database, and generates the charts. The report depicts static charts - the first four charts summarize data for all MGT levels in general, followed by six charts summarising the data at each MGT level. The figure below shows the first chart from the report.


By scrolling all the way to the bottom of the report, users can find a button to download the report as an HTML document.

3.4 Downloads

Filtered data (or all initially loaded data, albeit with row limits) can be downloaded using buttons available below the table (see Tabular for details of the table).


3.4.1 Download the entire set as CSV

The set of searched (or initially loaded) isolates, along with the associated metadata, ST, CC and ODC assignments (i.e. the data shown in the table, in both the ST and CC views) can be downloaded as a CSV file. Currently, a maximum of 1000000 isolates can be downloaded in one go. Depending on the number of isolates requested and your internet speed, this process may take a few minutes.

3.4.2 Download MGT9 allelic profiles

The set of searched (or initially loaded) isolates, along with the highest level MGT allelic profiles, can be downloaded in CSV format. Currently, allelic profiles for a maximum of 10000 isolates (the first 10000) are downloaded per request.

When the “In GrapeTree format” option is selected, the ST and dST columns are removed, and the negative alleles (i.e. alleles with missing information) are converted to positive alleles. This allows the downloaded CSV file to be used as input to the GrapeTree tool (the STs, dSTs and other information downloaded from “Download the entire set as CSV” can be used as metadata in GrapeTree). Thus, by visualising the downloaded files in GrapeTree, epidemiological investigations can be conducted using the minimum spanning tree built from the allelic profiles of the downloaded isolates.

A few minor modifications are required to the files to be used by GrapeTree:

1. For the allelic profiles file:

a. Add ‘#’ at the start of line 1.

b. Replace commas (‘,’) with tabs (‘\t’) - a find and replace-all will work.

c. Import as profile into GrapeTree and run the MST_v2 algorithm.

2. For the file downloaded via the button ‘Download the entire set as CSV’ - this can serve as the metadata file in GrapeTree:

a. Rename the column ‘Isolate’ to ‘ID’.

b. Import as metadata into GrapeTree.

In the future we plan to make these modifications directly in MGTdb.
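Until then, the file modifications listed above can be scripted. The following is a minimal Python sketch rather than an official MGTdb tool: the function names are hypothetical, and the column names in the example (other than ‘Isolate’) are made up. It parses the CSV with the standard `csv` module instead of a plain find-and-replace, so that any quoted fields containing commas survive the conversion.

```python
import csv
import io


def profiles_csv_to_grapetree(csv_text):
    """Add '#' at the start of line 1 and replace comma separators with tabs."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    rows[0][0] = "#" + rows[0][0]
    return "\n".join("\t".join(row) for row in rows) + "\n"


def metadata_csv_for_grapetree(csv_text):
    """Rename the 'Isolate' column to 'ID'; all other content is unchanged."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    rows[0] = ["ID" if name == "Isolate" else name for name in rows[0]]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


# Hypothetical file contents, for illustration only.
example_profiles = "Isolate,locus_a,locus_b\niso1,1,2\niso2,1,3\n"
example_metadata = "Isolate,Country\niso1,Australia\n"

print(profiles_csv_to_grapetree(example_profiles))
print(metadata_csv_for_grapetree(example_metadata))
```

To apply this to downloaded files, read each file into a string (for example with `pathlib.Path.read_text`), pass it through the matching function, and write the result to a new file before importing it into GrapeTree.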

3.4.3 View in Microreact

The set of searched (or initially loaded) isolates, along with the associated metadata, ST, CC and ODC assignments, can be downloaded as a CSV in Microreact format. Microreact is a powerful tool that allows exploring the epidemiology of the searched (or loaded) isolates in a map-based context.


The isolates are sent by MGTdb to Microreact via Microreact’s API, and the link returned by Microreact is displayed in place of the button (as shown in the screenshot above). This link opens the data in Microreact in a new tab (as shown in the screenshot below). Currently, there is a limit of 2000 isolates per request (a limit set by Microreact).


3.4.4 Allelic profiles and alleles for all public isolates

Allelic profiles and allele sequences for all publicly available isolates can be downloaded from the organism home page (seen as blue links in the figure below). These files are currently uploaded daily.


1. The allelic profile assignments compressed file lists all the publicly available isolates and their MGT9 allelic profile assignments, in a tabular format.

2. The allele sequences download is a compressed folder containing files of the allele sequences of every locus that forms part of any MGT scheme.


CHAPTER 4

Generating alleles file locally

To reduce the amount of data to be uploaded to the MGT database, some of the MGT pipeline processing can be performed locally.

These steps include:

1. Species, serovar checking using kraken and SISTR (for salmonella)

2. Genome assembly using shovill and skesa

3. Genome QC

4. Extraction of alleles from genome using known allele fasta file

5. Assignment of 7 gene MLST sequence type

The resulting file is often several orders of magnitude smaller than the raw reads, facilitating rapid upload and analysis.

4.1 Installation

This pipeline has many dependencies, so conda is the best way to handle them all. The included .yaml file can be used to create the required environment, which needs to be activated before running the script.

1. Clone the repo

git clone https://github.com/LanLab/MGT_reads2alleles.git

2. Download latest miniKraken Database:

From the kraken website - https://ccb.jhu.edu/software/kraken/ (warning 2.9GB!)

OR

wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_4GB.tgz

3. unzip archive

4. Add database folder variable with:


export KRAKEN_DEFAULT_DB="/home/user/minikraken_db_folder"

5. Install conda environment:

install miniconda3 -> https://conda.io/miniconda.html

conda env create -f /path/to/fq_to_allele.yaml -n deployable_fq_to_genome

This may take a while

conda activate deployable_fq_to_genome

4.2 Quickstart

python reads_to_alleles.py read_1.fastq.gz,read_2.fastq.gz ref_alleles.fasta output.fasta

The above script will run with all other settings including species and serovar as default (see below).

4.3 Inputs

Reads files

Paired end fastq files (gzipped or not) in format strain_name_1.fastq(.gz) and strain_name_2.fastq(.gz)

Reference alleles

Fasta file provided with the script, containing intact alleles for each locus (may be initial “1” alleles only or include other intact alleles)

4.4 Outputs

An Alleles file in fasta format: strainID_alleles.fasta. 4 different types of “allele” are recorded.

1. A header stating the 7 gene MLST type predicted by mlst

2. A header in the format locus:0_reason_for_failed_call to denote loci with uncallable alleles

3. A header in the format locus:allele to describe exact matches to alleles in the reference alleles file

4. A header in the format locus:new with sequence to describe new intact alleles or alleles with missing data

This allele file can be submitted (optionally with metadata) to the MGT database for full MGT assignment
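As a rough illustration of the header formats listed above, the following Python sketch counts how many loci in an alleles file fall into each category. It is not part of reads_to_alleles.py. It assumes each record starts with a standard FASTA `>` header line, that exact matches carry a purely numeric allele after the colon, and that anything else (such as the MLST header, whose exact layout is not specified here) is reported as "other". The function names and the example records are hypothetical.

```python
import re
from collections import Counter


def classify_header(header):
    """Return a category name for one FASTA header line (with or without '>')."""
    name = header.lstrip(">").strip()
    _locus, separator, call = name.partition(":")
    if not separator:
        return "other"          # no 'locus:call' structure
    if call.startswith("0_"):
        return "uncallable"     # locus:0_reason_for_failed_call
    if call == "new":
        return "new"            # locus:new, followed by the sequence
    if re.fullmatch(r"\d+", call):
        return "exact"          # locus:allele
    return "other"


def summarise_alleles(fasta_text):
    """Count header categories across the text of an alleles file."""
    return Counter(
        classify_header(line)
        for line in fasta_text.splitlines()
        if line.startswith(">")
    )


# Made-up records, for illustration only.
example = """>STMMW_14461:1
>locusB:0_example_reason
>locusC:new
ACGTACGT
"""

print(summarise_alleles(example))
```

A quick summary of this kind can help spot a strain with an unusually high number of uncallable loci before it is uploaded.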

4.5 Parameters

usage: reads_to_alleles.py [-h] -i INPUTREADS --refalleles REFALLELES -o OUTPATH [optional args]

required arguments:

-i INPUTREADS, --inputreads INPUTREADS  Input paired fastq(.gz) files, comma separated (i.e. name_1.fastq,name_2.fastq) (default: None)

--refalleles REFALLELES  File path to MGT reference allele file. By default sistr results will be used to determine which subfolder within the default folder (default: /species_specific_files/)


-o OUTPATH, --outpath OUTPATH  Path to output file name (required) (default: None)

optional arguments:

-h, --help  show this help message and exit

-s SPECIES, --species SPECIES  String to find in kraken species confirmation test (default: Salmonella enterica)

--no_serotyping  Do not run serotyping of Salmonella using SISTR (ON by default) (default: None)

-y SEROTYPE, --serotype SEROTYPE  Serotype to match in SISTR, semicolon separated (default: Typhimurium;I 4,[5],12:i:-)

-t THREADS, --threads THREADS  number of computing threads (default: 4)

-m MEMORY, --memory MEMORY  memory available in GB (default: 8)

-f, --force  overwrite output files with same strain name? (default: False)

--min_largest_contig MIN_LARGEST_CONTIG  Assembly quality filter: minimum allowable length of the largest contig in the assembly in bp (default: 60000)

--max_contig_no MAX_CONTIG_NO  Assembly quality filter: maximum allowable number of contigs in the assembly (default: 700)

--genome_min GENOME_MIN  Assembly quality filter: minimum allowable total assembly length in bp (default: 4500000)

--genome_max GENOME_MAX  Assembly quality filter: maximum allowable total assembly length in bp (default: 5500000)

--n50_min N50_MIN  Assembly quality filter: minimum allowable n50 value in bp (default: 20000)

--kraken_db KRAKEN_DB  path for kraken db (if the KRAKEN_DEFAULT_DB variable has already been set then ignore) (default: )
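To make the five assembly quality filters concrete, the sketch below applies them to a list of contig lengths using the default thresholds listed above. It is an illustration only and not code from reads_to_alleles.py: the function names are hypothetical, and whether the script treats each bound as inclusive or exclusive is an assumption here (values exactly on a threshold pass).

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the longest
    contig downwards, first reaches half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def assembly_qc_failures(
    contig_lengths,
    min_largest_contig=60000,
    max_contig_no=700,
    genome_min=4500000,
    genome_max=5500000,
    n50_min=20000,
):
    """Return the names of the filters an assembly fails (empty list = pass)."""
    failures = []
    total = sum(contig_lengths)
    if not contig_lengths or max(contig_lengths) < min_largest_contig:
        failures.append("min_largest_contig")
    if len(contig_lengths) > max_contig_no:
        failures.append("max_contig_no")
    if total < genome_min:
        failures.append("genome_min")
    if total > genome_max:
        failures.append("genome_max")
    if n50(contig_lengths) < n50_min:
        failures.append("n50_min")
    return failures


# Made-up assemblies, for illustration only.
good_assembly = [2_000_000, 1_500_000, 1_000_000, 400_000]
fragmented_assembly = [5_000] * 800

print(assembly_qc_failures(good_assembly))
print(assembly_qc_failures(fragmented_assembly))
```

The defaults suit a genome of roughly 4.5 to 5.5 Mb, which is why the genome length bounds in particular would need to be changed when working with an organism of a different genome size.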

4.6 Examples

example1:

running strain 1234 against Salmonella Typhimurium MGT with 8 cores and 30 GB RAM

python /path/to/reads_to_alleles.py 1234_1.fastq.gz,1234_2.fastq.gz MGT_alleles_file locus_position_file output_file_name --serotype "Typhimurium;I 4,[5],12:i:-" --species "Salmonella enterica" -t 8 -m 30

example2:

running strain abcd against Vibrio cholerae MGT with 4 cores and 50 GB RAM (serotyping is currently only for Salmonella)

python /path/to/reads_to_alleles.py abcd_1.fastq.gz,abcd_2.fastq.gz MGT_alleles_file locus_position_file output_file_name --no_serotyping --species "Vibrio cholerae" -t 4 -m 50


CHAPTER 5

Setting up a local application

We provide all the code required to set up your own MGT website and, if desired, assign MGTs for your own organism.

Setting up locally requires python3, pip3 (to install various python packages) and PostgreSQL database system.

5.1 Required software

5.1.1 1. PostgreSQL (database system)

Download and install the PostgreSQL database system by following instructions from their website (https://www.postgresql.org).

5.1.2 2. Python and pip

Python3 can be installed by following instructions on its website: https://www.python.org

Pip3 (python’s package manager) can be installed by following instructions on its website: https://pip.pypa.io/en/stable/

Then use pip3 to install the following libraries (substituting <library_name> for the libraries listed below):

pip3 install <library_name>

Libraries:

• django

• django-registration

• biopython

• django-tables2

• djangorestframework


• django_registration

• django-countries

• psycopg2-binary

Additionally, the website can be set up via Apache.

A local testing version of the website can be run using Django’s testing server. However, if you want to host your MGT website permanently, we recommend setting up Apache to run it. Instructions can be found at https://docs.djangoproject.com/en/2.1/howto/deployment/wsgi/modwsgi/.

5.2 Setting up databases and website

5.2.1 1. The code and variables to update

The code is available at the following git repository.

https://github.com/crystalsand10/MGT_code_release

Clone this repository. (If you don’t have git, download the code as a zip file from the repository location, and then extract it.)

To get the website running a few variables need to be updated for your chosen species of interest.

1. Assuming you want to set up the database for your organism called NewBacteria, run the following two commands:

find . -type f -exec sed -i.bak "s/Salmonella/NewBacteria/g" {} \;
find . -type f -exec sed -i.bak "s/salmonella/newBacteria/g" {} \;

Make sure the cases are typed in correctly.

2. To run the MGT website, at least two databases are required: one to store user login information, and a second to store MGT isolate information for a single species (if you have multiple species, then you’ll have to set up multiple databases).

In the set up below, newBacteria is the database name, and NewBacteria is the code application name. Substitute whichever names you like (however, make sure the names have no spaces). An example is using “Salmonella” as your application name, and “salmonella” as your database name.

Log into the installed PostgreSQL server, and create two new databases (if logged in via a terminal, the commands are below; default is a reserved word in PostgreSQL, so it is double-quoted):

CREATE DATABASE "default";
CREATE DATABASE newBacteria;

3. For the remaining variables to be updated follow the README.md in your downloaded MGT folder.

Then set up information in the databases as described below.

5.2.2 2. Setting up the databases

A link to the database schema for a new organism (e.g. newBacteria) is provided here Database schema.

a. Go to the downloaded code, such that you are in the same folder as the file manage.py. Then run database migrations as follows:


python3 manage.py makemigrations NewBacteria
python3 manage.py migrate --database=newBacteria

This will create initial tables.

b. Then create a new postgres user (the website user) and give it restricted access to the newly created database tables:

CREATE USER mgtWebsite WITH password '<PASSWORD>';
GRANT SELECT ON ALL TABLES IN SCHEMA public TO mgtWebsite;
GRANT INSERT, UPDATE, DELETE ON "Salmonella_isolate" TO mgtWebsite;
GRANT INSERT, UPDATE, DELETE ON "Salmonella_project" TO mgtWebsite;
GRANT INSERT, UPDATE ON "Salmonella_user" TO mgtWebsite;

c. Add data to the database. You will need to create a number of input files for this purpose and run scripts as follows:

1. Update the refFileInfo.json. A sample is available for download at json. In this file, provide information for your bacteria (this information is displayed at various points in the website). Multiple chromosomes (for example, as found for Vibrio cholerae) can be added.

python3 populateReference.py ../ Mgt Salmonella Files/refFileInfo.json

(The location of the chromosome file must be supplied in refFileInfo.json; it will be used to move the file(s) to the location provided in SETTINGS.py)

Note: Chromosome is required, since some bacteria such as Vibrio have more than one chromosome.

2. Then add the loci that you’d like your MGT to be based on. An example file is available here. This is a simple tab-separated file, which is supplied to the script as follows:

python3 populateLoci.py ../ Mgt Salmonella Files/lociLocationsInRef.txt

The input file is a tab-separated file, describing the loci locations as follows:

Column 1 = locus tag name
Column 2 = locus start position in reference
Column 3 = locus end position in reference
Column 4 = gene direction with regards to reference
Column 5 = chromosome number
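Before running populateLoci.py, it can be useful to check that the loci locations file has the five columns described above. The following Python sketch is a hypothetical helper, not part of the MGT code base. It only checks the column count and that the positions are integers; the encoding of the gene direction column is not specified here, so it is kept as a plain string, and the example row uses made-up values.

```python
import csv
import io


def read_loci_locations(tsv_text):
    """Parse tab-separated loci rows into dicts, checking the five columns."""
    loci = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, found {len(row)}")
        tag, start, end, direction, chromosome = row
        loci.append(
            {
                "tag": tag,
                "start": int(start),  # raises ValueError if not an integer
                "end": int(end),
                "direction": direction,
                "chromosome": chromosome,
            }
        )
    return loci


# Made-up row, for illustration only.
example_row = "STMMW_14461\t1000\t2000\t+\t1\n"
print(read_loci_locations(example_row))
```

A malformed row raises a `ValueError` naming the offending line, which is usually easier to act on than a failure part-way through populating the database.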

3. Add the schemes

python3 populateSchemes.py ../ Mgt Salmonella Files/schemesInfo.txt Files/Schemes

schemesInfo.txt is a tab-separated file, describing the following info:

Column 1 = Scheme name (e.g. MGT1, MGT2 etc.; must not contain spaces)
Column 2 = cut-off threshold, i.e. maximum number of missing loci allowed
Column 3 = name of the file containing the loci to be included in this scheme (the file contains one locus tag name per line)
Column 4 = Scheme database
Column 5 = description of the scheme (optional)

4. Generate code for clonal complex tables and add clonal complex tables information to Tables_cc:

python3 setUpCcs.py ../ Mgt Salmonella Files/tables_ccs.txt > autoGenCcs.out.py

The tables_cc.txt file contains information for one clonal cluster per line:

Column 1 = Scheme name (as provided in the previous step)
Column 2 = The table(s) in which this ODC should be displayed
Column 3 = The order of display in the corresponding table
Column 4 = The display name of the clonal cluster column

e.g. stmcgMLST 2 4 "stmcgmlst 10 allele"

e.g. (when the same value is to appear in multiple tables) stmcgMLST 1,2 10,1 "stmcgmlst 1 allele","stmcgmlst 1 allele"

Once run, copy and add the output (autoGenCcs.out.py) to NewBacteria/models/autoGenCcs.py and rerun migrations (step 2a).

5. Generate code for the allelic profiles tables, and the MGT table:

python3 setUpApsAndMgt.py ../ Mgt Salmonella Files/tables_aps.txt > autoGenAps

The tables_ap.txt file contains two columns:

Column 1 = scheme name
Column 2 = scheme display order

Once again, copy and paste the output to Salmonella/models/autoGenAps.py and rerun migrations (step 2a).

6. In the next few steps we add data into the various tables. One way to add alleles to the database is:

python3 addAlleles.py ../ Mgt Salmonella Files/Alleles/

The Alleles folder contains one fasta file for each of the loci. An example is STMMW_14461.fasta.

7. Add SNPs:

python3 addSnps.py ../ Mgt Salmonella Files/snpMuts.txt

Here the snpMuts.txt file contains SNP mutations in a standard mutations format described here.

8. Populate allelic_profile tables:

python3 addAllelicProfiles.py ../ Mgt Salmonella Files/schemeToApMapping.txt Files/AllelicProfiles

The schemeToApMapping.txt file contains a simple mapping from the scheme name to its corresponding allelic profiles file name.

The allelic profiles file maps each allelic combination to a unique ST and dST.

9. Populate clonal complex tables and assign them to allelic profiles:

python3 addClonalComplexes.py ../ Mgt Salmonella Files/ccInfo.txt Files/ClonalComplexes

The ccInfo.txt file contains information regarding the files for each clonal complex, and the clonal complex itself added to the database earlier. The columns are:

Column 1 = scheme name
Column 2 = A file containing information regarding the clonal complex assignment to an allelic profile (e.g. MGT7_cc.txt).
Column 3 = A file containing information regarding newly computed clonal complex merges (e.g. MGT7_cc_merges.txt).
Column 4 = Format of tableNumber_orderNumber.

10. Next, we need to make sure that a user is registered, so that isolates and associated metadata can be added. To do so, run the web application and register.

The web application can be run locally as: python3 manage.py runserver

A dummy email server can be run locally as: python -m smtpd -n -c DebuggingServer localhost:25

11. Then, using the registered user name, we populate the isolate tables:

python3 addIsolates.py ../ Mgt Salmonella Files/isolate_info.tab

An example file is provided here. Depending on how you organise your isolate_info.tab file, the column numbers of each metadata field need to be changed at the start of the addIsolates.py script.

12. Next we populate the MGT table, and assign isolates to MGT:

python3 addMgts.py ../ Mgt Salmonella Files/mgt_annotations.tab

A sample mgt_annotations.tab is provided here.

13. Finally, we generate a view table, which contains allelic profiles and clonal complexes for easy querying.

python3 genViewSqlAndClass.py ../ Mgt Salmonella mlstWebsite

When this script is run, two files are written out:

1. “runOnDb.sql” : run the two SQL statements in PostgreSQL (you can follow the method in step 14 below).

2. “autoGenView” : copy and paste this to autoGenViews.py in the models folder.

14. Run postgreSQL commands from file:

psql -U <postgresUsername> -d newBacteria -a -f runOnDb.sql

5.3 Running the website

To run the website locally, simply run:

python3 manage.py runserver

Otherwise, set DEBUG=False in the settings and follow the instructions in section “Additionally setting up the website via Apache”.

Remember, you should set up regular backups of your data in the database.

CHAPTER 6

Analysis pipeline

6.1 Reads to Alleles

The reads to alleles pipeline has two components:

1. Reads -> Genome

2. Genome -> Alleles

This section of the pipeline can be run locally or as part of the full pipeline on the MGT server.

• When a user uploads raw paired end reads this section is run together with the rest of the pipeline.

• When run locally it produces a strainID_alleles.fasta file which can be submitted to the MGT server. This approach saves upload time for large read sets and can speed up the processing time in the database. See Generating alleles file locally for more information.

6.1.1 Reads to Genome

1. Input reads are processed by kraken and samples with > 10% reads from non-target species are recorded.

2. Reads are then passed to the shovill pipeline using SKESA for assembly.

3. The assembly is then filtered for quality using species specific metrics produced by assembly-stats.

4. If serotyping is enabled (species dependent), SISTR is used to predict serotype from the assembly.

5. If the genome passes all filters, it is passed to allele calling.
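The contamination check in step 1 amounts to comparing the share of non-target reads against a 10% threshold. The sketch below illustrates that arithmetic only; it does not reproduce how the MGT pipeline reads kraken output. The function names, the per-species read-count dictionary and the example counts are hypothetical, and whether unclassified reads count towards the total is an assumption left to the caller.

```python
def non_target_fraction(read_counts, target_species):
    """Fraction of reads assigned to anything other than the target species."""
    total = sum(read_counts.values())
    if total == 0:
        raise ValueError("no reads to evaluate")
    non_target = total - read_counts.get(target_species, 0)
    return non_target / total


def is_flagged(read_counts, target_species, threshold=0.10):
    """True when more than `threshold` of the reads come from non-target species."""
    return non_target_fraction(read_counts, target_species) > threshold


# Made-up read counts per species for two samples.
clean_sample = {"Salmonella enterica": 950, "Escherichia coli": 50}
mixed_sample = {"Salmonella enterica": 850, "Escherichia coli": 150}

clean_flag = is_flagged(clean_sample, "Salmonella enterica")
mixed_flag = is_flagged(mixed_sample, "Salmonella enterica")
```

With these counts the first sample has 5% non-target reads and is not flagged, while the second has 15% and is.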

6.1.2 Genome assembly to alleles

1. The 7 gene MLST ST is identified using mlst.

2. The genome is used to search a collection of alleles (at least one per locus) using BLASTn.

3. Results from this BLAST are used to extract loci from the genome if they are intact or to reconstruct the loci if they are disrupted in some way.

4. Extracted, reconstructed and uncallable alleles, as well as 7 gene ST, are written to a fasta file which is passed to the next stage of the pipeline.

Stages of the reads to allele pipeline are displayed in the following flow diagram.

6.2 Alleles to MGT types

This section of the pipeline must be run in conjunction with an MGT postgres server as it reads and writes from/to the database heavily.

Stages of the alleles to MGTdb pipeline are displayed in the following flow diagram.

Terminology

• Negative alleles = alleles that have missing data, denoted in the sequence by ‘N’. There may be multiple negative alleles with the same call but different missing regions; for example, -4_1 and -4_2 are two different negative alleles that show no mutations from allele 4.

• New positive alleles = intact alleles that do not match reference alleles (used to search in reads to alleles section).

• Positive matches = alleles that are intact and match to existing reference alleles.

• AP = Allele(ic) profile: the collection of all alleles used to define the ST for a given MGT scheme.

• CC = Clonal complex (single linkage groups).

• ST = Sequence types.

• dST = Degenerate sequence types - sequence types that are the same as each other except for alleles with differing negative alleles or 0s, e.g. ST 4 dST 1 (written 4.1) and ST 4 dST 2 (written 4.2) have APs that differ only by 0s or different negative versions of the same allele.

• ODC = Outbreak Detection Clusters (single, double, 5 and 10 cutoff linkage groups derived from MGT9 allele profiles).

• CCs and ODCs are assigned using the same method with different cutoff values, so they are often shown in the same processes.
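The relationship between negative alleles, 0s and dSTs described above can be illustrated with a small comparison of two allelic profiles. This is a sketch, not the code used by the MGT pipeline. It assumes that an AP is a list of allele labels written as in the terminology ("4", "-4_1", "0"), and that a positive allele and the negative versions of that same allele are treated as compatible; both helper names are hypothetical.

```python
def base_allele(label):
    """Return the allele number a label refers to, or None for a 0 (missing) call."""
    if label == "0":
        return None
    if label.startswith("-"):
        # "-4_1" is negative version 1 of allele 4
        return int(label[1:].split("_")[0])
    return int(label)


def differ_only_by_missing_data(ap_a, ap_b):
    """True if two APs differ only by 0s or by negative versions of the same allele."""
    if len(ap_a) != len(ap_b):
        raise ValueError("allelic profiles must cover the same loci")
    for label_a, label_b in zip(ap_a, ap_b):
        base_a, base_b = base_allele(label_a), base_allele(label_b)
        if base_a is None or base_b is None:
            continue  # a 0 call is compatible with any allele
        if base_a != base_b:
            return False
    return True


same_st = differ_only_by_missing_data(["1", "-4_1", "7"], ["1", "-4_2", "0"])
different_st = differ_only_by_missing_data(["1", "4", "7"], ["1", "5", "7"])
```

Under these assumptions the first pair would share an ST and be told apart only by dST, while the second pair differs at a called allele and would belong to different STs.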

Fig. 1: Flow diagram of Illumina reads to initial allele calls file

Fig. 2: Flow diagram of initial allele calls to MGT assignments

CHAPTER 7

Database schema

Data for each organism is stored in its own postgres database. The schema is as follows:

The schemes (or levels) defined for the organism are stored in the Scheme table. Allelic profiles for each MGT level are stored in separate tables indicated by apN_L. Clonal complexes for each MGT level, and at each level of threshold, are stored in separate tables as well, indicated by ccJ_K.

User information is stored in a separate database (schema not shown here).

CHAPTER 8

Additional information

Setting up readthedocs on your mac: https://docs.readthedocs.io/en/latest/intro/getting-started-with-sphinx.html

RST (ReStructured Text format) information:

http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#rst-primer

Links to MGT readthedocs page:

• http://mgt-docs.readthedocs.io

• http://mgt-docs.rtfd.io

CHAPTER 9

Indices and tables

• genindex

• modindex

• search
