Introduction to Galaxy

Tutorial
Research Informatics Support Systems
Minnesota Supercomputing Institute
University of Minnesota
Version 1.1, 10/15/2013

Copyright 2013 Regents of the University of Minnesota


1 Introduction
    1.1 Scope of this tutorial
2 Starting Galaxy
    2.1 Accessing Galaxy
    2.2 Import Fastq files for one sample into current history
    2.3 Set file attributes: database/build and datatype
3 Evaluating Fastq File Quality
    3.1 Running fastQC
    3.2 Viewing FastQC Results
    3.3 Reviewing FastQC
4 Cleaning Fastq Datasets
    4.1 Introduction to Cleaning
5 Simple Removal of Low Quality Tails
    5.1 FASTQ Quality Trimmer
    5.2 Running FastQC on Trimmed Data
    5.3 Checking the FastQC Results
6 Simple Removal of Adapter Sequences
    6.1 Running Cutadapt
    6.2 Running FastQC
    6.3 Resyncing Files
7 Verifying Cleaning Results
    7.1 Reviewing FastQC
8 Workflows
    8.1 Extract workflow from current history
    8.2 Edit the workflow
    8.3 Running the Workflow
9 Cleaning up Histories
    9.1 Deleting Intermediate Files
    9.2 Deleting Old Histories
10 Sharing Your Work
    10.1 Sharing Workflows
    10.2 Sharing Histories
11 Appendix I: Iterative tail/adapter removal


1 Introduction

1.1 Scope of this tutorial

This is a practical, hands-on tutorial with two primary aims:

• Give participants experience with the basic functionality of Galaxy
    o Starting Galaxy
    o Galaxy Layout
    o Loading files into current history
    o Creating reusable workflows
    o Sharing histories and workflows with others

• Basic processing and quality control on sequencing datasets
    o Evaluating read quality
    o Adapter removal
    o Low quality read removal
    o Read trimming

Reference materials
Galaxy screencasts: galaxycast.org


2 Starting Galaxy

★ Galaxy Interface (Sect 2.1)

Tools pane: All of the software available in Galaxy is listed in the tools pane, grouped into categories. Click on the first tool category “Get Data” to expand it and show the individual tools in the category. Click the “Get Data” category again to collapse it.

History pane: The history pane shows a list of all input and output files used in the current analysis.

Center pane: The center pane displays information for specific tools or files selected from the tools and history panes.

★ Tutorial Dataset (Sect 2.2)
This tutorial goes over basic Galaxy features and describes some strategies for cleaning up sequencing datasets. The sample dataset used in this tutorial was created from an SRA submission representing a shotgun re-sequencing of the human genome. These fastq files were artificially degraded to provide a single example containing a variety of potential quality issues. While the FastQC results closely resemble problems found in real datasets, not all datasets will be as recoverable as the tutorial dataset.

★ Data Libraries (Sect 2.2)
In Galaxy, data files are stored in Data Libraries. Data libraries can be public (available to all Galaxy users) or private (available to the members of your lab). Sequence data generated by the UMGC can be loaded into Galaxy by sending a request to MSI; the data will then be available in a data library. Additional users can be given access to a data library by sending a request to MSI. If you have a large dataset located on an MSI file system, it can also be loaded into a Galaxy data library by sending a request to MSI. The standard naming convention for paired-end files is to label the forward (or left) read as ‘R1’ and the reverse (or right) read as ‘R2’.

★ Galaxy File Attributes (Sect 2.3)
Galaxy does not rely on file extensions to determine file types. Instead, each file in Galaxy has a set of attributes that describe what format the file is in and what reference genome (if any) it is associated with. Fastq files, for example, come in several slightly different file formats. When a fastq file is uploaded to Galaxy it is identified as a generic fastq file. The Illumina fastq files used in this tutorial (as well as those currently generated by the UMGC) are in the Sanger fastq format. See en.wikipedia.org/wiki/FASTQ_format for more information.
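For readers who want to look inside a fastq file outside Galaxy, the following is a minimal sketch of reading the common four-line-per-read layout (header starting with '@', sequence, '+' separator, quality string). The names `FastqRecord` and `read_fastq` are hypothetical helpers for illustration, and the sketch assumes records are not wrapped over multiple lines.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-read fastq stream (assumed layout)."""
    while True:
        header = handle.readline()
        if not header:  # end of file
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed fastq record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Two made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\n!!5I\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

The quality string has one character per base, which is why the sketch checks that the two lengths agree.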


2.1 Accessing Galaxy

★ Galaxy Interface

a. Open a web browser and navigate to the MSI Galaxy website, galaxy.msi.umn.edu
b. Log in with your MSI username and password
c. The side panels can be collapsed via arrows in the bottom corners to provide a better view of the center panel
d. The total quantity of data you have stored in Galaxy is displayed in the top right

[Figure: Galaxy interface with the Tools pane, Center pane and History pane labeled; callouts a-d mark the steps above.]

e. You can more easily find the tool you would like to use by searching for its name and keywords using the ‘search tools’ field. Throughout the tutorials we will direct you on how to find the tools manually but searching is often the quicker option.


2.2 Import Fastq files for one sample into current history

★ Data Libraries
★ Tutorial Dataset

a. At the top of the screen select “Shared Data -> Data Libraries”
b. Select “RISS-tutorial-galaxy101” from the list of data libraries
c. Expand the “FastQ” folder and check the boxes next to the first two files (“Tutorial_file_R1.fastq” and “Tutorial_file_R2.fastq”)
d. Near the bottom of the page click the “Go” button to import the selected datasets to the current history


2.3 Set file attributes: database/build and datatype

★ Galaxy File Attributes

a. View the history panel by clicking on “Analyze Data”
b. In the history pane click on the pencil icon next to the “2: Tutorial_file_R2.fastq” file in order to set the file attributes
c. This is human data so select “hg19_canonical” in the “Database/Build” box. A list of all available databases will appear as you type
d. Click save
e. Click the Datatype tab
f. Enter “fastqsanger” in the “New Type” box. A list of available data types will appear as you type.
g. Click save
h. Repeat this process (steps b-g) for “1: Tutorial_file_R1.fastq”


3 Evaluating Fastq File Quality

★ Quality Scores (Sect 3.3)

The quality evaluation discussed in this tutorial is geared towards Illumina-generated reads. Other sequencing methods may produce reads with a different pattern of errors or a different format of quality file; they have their own subsections in the “NGS: QC and manipulation” section.

Fastq files include not only sequence information, but also information about the estimated chance of a miscalled base. This error estimate is referred to as a ‘quality score’. A quality score for a specific position is encoded as a single ASCII character. Since ASCII characters have a standardized numerical association (see http://en.wikipedia.org/wiki/ASCII), the specific character shown can be used to compute an expected error rate.

The estimated error rate, P, is calculated as

$$P = 10^{-Q/10}$$

where Q is the quality score decoded from the ASCII character. Since the first 32 ASCII characters are what are known as ‘control’ characters and don’t produce a visible character on a computer screen, the modern version of fastq treats the character with ASCII code 33 (‘!’) as Q = 0 and counts up from there. This is known as the Sanger phred encoding. Earlier Illumina machines treated the character with ASCII code 64 (‘@’) as Q = 0. As a result, care should be taken to determine the fastq version of older files. A quality score of 10 indicates an error rate of 10%, 20 is 1%, 30 is 0.1% and so on.
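The decoding described above can be sketched in a few lines of Python. The function names are hypothetical, and `guess_offset` is only a rough heuristic under the assumption that any character below ASCII code 64 rules out the older offset-64 encoding; a file of uniformly high-quality Sanger reads could still be misjudged, so treat its answer as a hint.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert a quality string to integer Q values (offset 33 = Sanger)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Estimated chance of a miscalled base: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def guess_offset(quality_strings: list[str]) -> int:
    """Heuristic guess of the encoding offset from the lowest character seen."""
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    return 33 if lowest < 64 else 64


scores = phred_scores("!5I")  # '!' = 33, '5' = 53, 'I' = 73
for q in scores:
    print(q, error_probability(q))
```

Decoding an offset-64 file with offset 33 would inflate every score by 31, which is why setting the correct datatype in Galaxy matters before running quality-based tools.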

★ FastQC Metrics (Sect 3.3)

1. Basic Statistics
   Gives the name of the input file, the encoding used for the quality score (Sanger vs older encodings), total sequence count, average sequence length and GC content percentage.

2. Per base sequence quality
   A particularly important figure showing the average quality score at different positions across all reads. In general, quality is lower at the start and end of reads. Sudden dips in the middle of a read can signify failed cycles in the sequencing run.

3. Per sequence quality scores
   Histogram charting the average quality across a read. Bimodal distributions may indicate a subset of reads that are low quality and should be removed.

4. Per base sequence content
   The frequency of particular nucleotides at different positions in the reads. Extremely high nucleotide bias can be a sign of trouble. Short stretches with high bias can be caused by the presence of linkers, barcodes or adapters. There is usually some minor bias in the first 11-13 bp of an RNA-seq experiment due to not-quite-random hexamer priming. This bias is accounted for in RNA-seq analysis software.

5. Per base GC content
   Average GC content by position in the read.

6. Per sequence GC content
   Histogram showing the frequency of reads with certain GC%. Illumina sequencers tend to underrepresent extremely high and extremely low GC% sequences. Large deviations from the expected distribution can be a sign that the GC bias is having an effect on sequencing results.

7. Per base N content
   Rate of N (any nucleotide) calls by position in the read.

8. Sequence Length Distribution
   Histogram of sequence lengths.

9. Sequence Duplication Levels
   Frequency of exact sequence duplicates in the dataset. High duplication rates can be caused by PCR artifacts and/or low library diversity.

10. Overrepresented sequences
    Calls out specific over-represented sequences.

11. Kmer Content
    Shows the rate of over-represented k-mers in the dataset. K-mers over-represented at the 5’ and/or 3’ ends can be an indication of adapter contamination.
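To make a few of these metrics concrete, here is a simplified sketch of how per base mean quality, GC content and exact-duplicate counts could be computed on a handful of reads. This is an illustration only and not FastQC's actual implementation (FastQC samples and bins its data); the reads and function names are made up for the example.

```python
from collections import Counter

# (sequence, Sanger-encoded quality) pairs, invented for illustration.
reads = [
    ("ACGTAC", "IIII5!"),
    ("ACGTAC", "III5!!"),
    ("GGGCCC", "IIIIII"),
]


def per_base_mean_quality(reads, offset=33):
    """Mean Q at each read position, across all reads covering it."""
    longest = max(len(qual) for _, qual in reads)
    means = []
    for pos in range(longest):
        scores = [ord(qual[pos]) - offset for _, qual in reads if pos < len(qual)]
        means.append(sum(scores) / len(scores))
    return means


def gc_percent(sequence):
    """Percentage of G and C bases in one read."""
    return 100.0 * sum(base in "GC" for base in sequence) / len(sequence)


def duplication_levels(reads):
    """Map 'number of copies' -> 'how many distinct sequences have that many'."""
    copies = Counter(seq for seq, _ in reads)
    return Counter(copies.values())


means = per_base_mean_quality(reads)
print(means)
print([gc_percent(seq) for seq, _ in reads])
print(duplication_levels(reads))
```

In this toy data the mean quality falls toward the 3' end, the same pattern the "Per base sequence quality" plot makes visible on real reads.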


3.1 Running fastQC

a. From the “Tools” panel click on the “NGS: QC and manipulation” group
b. Click “FastQC:Read QC”
c. Select the file to analyze from the drop-down menu, in this case “Tutorial_file_R2.fastq”
d. Rename the output file to be something recognizable (We used ‘Precleaning Right’)
e. Click Execute
f. Repeat these steps (a-e) on the Left read file “Tutorial_file_R1.fastq”

3.2 Viewing FastQC Results

a. Click the ‘eye’ icon on “3: FastQCPrecleaningRight…” to view the fastQC results


3.3 Reviewing FastQC

★ Quality Scores
★ FastQC Metrics

a. Scroll to “Per base sequence quality”
b. Note the low quality 3’ tail on a large number of reads
c. Note the likely failed cycle (one position is affected but the visualization is averaging over 4 cycles)
d. Scroll to “Per sequence quality scores”
e. Note the bimodal distribution with a population of low quality reads

In the following review of the FastQC results we present the results of both the Left and Right reads side by side. We do this to accentuate the difference between the two datasets and to highlight the importance of checking the quality of both sets of reads. It is quite normal for one set of reads to be considerably different in quality from the other. Usually the left reads are of higher quality because they are sequenced first. The left and right reads are shown on the left and right respectively.


f. Scroll to “Sequence Duplication Levels”
g. Note the presence of duplicate reads, here up to 5 copies. Some duplication is expected and this is relatively low duplication. There may be issues if the ‘10+’ column is extremely high
h. Scroll down to “Kmer Content”
i. Note the presence of over-represented k-mers at the 3’ end. This is indicative of 3’ adapter contamination. These sequences may also be identified as an “Overrepresented Sequence”


4 Cleaning Fastq Datasets

4.1 Introduction to Cleaning

★ Low Quality Tails/Cycles (Sect 5.1)

For a variety of reasons, including decay of reagents as they sit on the sequencer, the quality of base calls tends to decrease as sequencing progresses. As a result the 5’ ends will tend to have higher quality than the 3’ ends, and forward reads will tend to have better quality than reverse reads. Because lower quality indicates a higher chance of a miscalled base, a large number of low quality bases can impair the accuracy of mapping algorithms.

As a result, when there is considerable evidence of low quality tails, the tails are trimmed before further analysis is performed. A simple and effective method is to remove a fixed number of bases from the end of each read. However, removing the last 20% of all reads results in a 20% reduction in genome coverage. More sophisticated methods only remove the tails that show evidence of low quality. Both methods will be discussed within this section.

A similar but distinct problem is the issue of failed cycles. On rare occasions the sequencing machine may simply fail to properly incorporate nucleotides or fail to read the fluorescence of the most recent incorporation. This results in extremely low quality scores at a fixed position in all reads in the lane. For general alignment these failed cycles can be ignored, but in specific instances it may be necessary to exclude that position from the analysis.
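The two trimming strategies described above can be contrasted with a short sketch. This is an illustration under stated assumptions, not the algorithm of any particular Galaxy tool: the quality-based version assumes Sanger encoding (offset 33) and drops 3' bases until the lowest score in the last `window` bases reaches the threshold. Both function names are hypothetical.

```python
def trim_fixed(seq: str, qual: str, n: int) -> tuple[str, str]:
    """Remove a fixed number of bases from the 3' end of every read."""
    if n <= 0:
        return seq, qual
    return seq[:-n], qual[:-n]


def trim_quality_3prime(seq: str, qual: str, window: int = 3,
                        threshold: int = 20, offset: int = 33) -> tuple[str, str]:
    """Drop 3' bases until the minimum score in the trailing window >= threshold."""
    scores = [ord(ch) - offset for ch in qual]
    end = len(scores)
    while end > 0:
        tail = scores[max(0, end - window):end]
        if min(tail) >= threshold:
            break
        end -= 1
    return seq[:end], qual[:end]


seq, qual = "ACGTACGTAC", "IIIIIII5!!"   # last two bases have Q = 0
good_seq, good_qual = "ACGTACGTAC", "IIIIIIIIII"

print(trim_fixed(seq, qual, 2))
print(trim_quality_3prime(seq, qual))
print(trim_fixed(good_seq, good_qual, 2))        # loses 20% of a good read
print(trim_quality_3prime(good_seq, good_qual))  # leaves a good read intact
```

The fixed trim shortens every read, including good ones, which is the coverage cost mentioned above; the quality-based trim only shortens reads whose tails are actually poor.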

★ Adapter Contamination (Sect 6.1)

Illumina DNA libraries consist of the DNA of interest (green) with ligated adapters (red + yellow) on the 5’ and 3’ ends to provide priming sites for the Illumina sequencing reactions.

The Forward adapter (left) provides a region (distal) that binds to the flow cell and a region (proximal) to which the sequencing primer binds.

The Reverse adapter (right) provides similar regions with the addition of a barcode sequence (yellow).

Adapter contamination occurs when the DNA fragment of interest is shorter than the read length and the sequencing process continues into the opposing adapter region.

Here you can see the distal region bound to the flow cell, the proximal region bound with a primer and an arrow showing the resulting sequence read.

In order to remove adapter contamination, the adapter sequences used in the experiment need to be supplied.

Depending on the experimental design, adapter contamination can occur on the 5’ end, 3’ end or both. In the example shown here only the 3’ end is contaminated.

The standard Illumina TruSeq adapters are:
Forward 5’ AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
Reverse 5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
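Section 6.1 names the reverse complement of the universal (forward) adapter as the contaminant to remove from the R2 reads. A minimal sketch of computing a reverse complement from the forward adapter listed above follows; the function name is a hypothetical helper, and only the bases A, C, G, T and N are handled.

```python
# Translation table for complementing bases; N stays N.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


FORWARD_ADAPTER = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"
r2_contaminant = reverse_complement(FORWARD_ADAPTER)
print(r2_contaminant)
```

Applying the function twice returns the original sequence, which is a quick sanity check when preparing adapter sequences by hand.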


5 Simple Removal of Low Quality Tails

5.1 FASTQ Quality Trimmer

★ Low Quality Tails/Cycles

a. Open the ‘NGS: QC and manipulation’ section of the toolbar
b. Click ‘FASTQ Quality Trimmer’
c. Under “FASTQ File:” select the ‘Right’ read file “2: Tutorial_file_R2.fastq”
d. Set ‘Trim ends’ to “3’ only”
e. Set ‘Window size’ to 3
f. Set ‘Quality Score’ to 20
g. Click ‘Execute’
h. Repeat previous steps (b-g) for “1: Tutorial_file_R1.fastq”


5.2 Running FastQC on Trimmed Data

a. From the “Tools” panel click on the “NGS: QC and manipulation” group
b. Click “FastQC:Read QC”
c. Select the file to analyze from the drop-down menu, in this case “5: FASTQ Quality Trimmer on data 2”
d. Rename the output file to be something recognizable (We used ‘PostTrim R’)
e. Click Execute
f. Repeat these steps (a-e) on the Left read file “6: FASTQ Quality Trimmer on data 1”


5.3 Checking the FastQC Results

a. Click the ‘eye’ icon on “7: PostTrimR…” to view the fastQC results
b. Scroll down to “Per base sequence quality”
c. Note that the quality of the tails has improved significantly
d. Scroll down to “Kmer Content”
e. Observe that the high number of k-mers at the 3’ end has not been affected.


6 Simple Removal of Adapter Sequences

6.1 Running Cutadapt

★ Adapter contamination

a. Open the “Fasta manipulation” section
b. Click “Cutadapt” to load the adapter trimmer tool
c. Under “Fastq file to trim:” select the ‘Right’ readset “5: FASTQ Quality Trimmer on data 2”
d. Click “Add new 3’ Adapters”
e. In the box labeled “Choose 3’ adapter” enter the adapter sequence that would contaminate the R2 (reverse/right) reads. For this tutorial and for standard Illumina runs this would be the “TruSeq Universal Adapter Reverse Complement”


f. Set “Minimum overlap length” to 5
g. Set “Output filtering options:” to “Set Filters”
h. Set “Minimum Length” to 25
i. Click “Execute” to run the tool
j. To quickly set up the ‘cutadapt’ run for the R1 (forward/left) reads we will apply a new technique. Click on the name of one of the ‘cutadapt’ results to expand its box.
k. Click on the blue circular arrow to load ‘cutadapt’ with all of the settings of the previous run.


l. Change the “Fastq file to trim:” dropdown to “6: FASTQ Quality Trimmer on data 1”
m. Change the “Choose 3’ adapter:” dropdown to the appropriate adapter contaminant. For this tutorial and for standard Illumina runs this is the “TruSeq Index Adapter”
n. Click “Execute”
o. Open the ‘Cutadapt’ report by clicking on the ‘eye’ icon
p. Note the length distribution of removed sequence; this gives an indication of the false trimming rate.
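The roles of the “Minimum overlap length” and “Minimum Length” settings used above can be illustrated with a simplified sketch. It is not Cutadapt's algorithm (Cutadapt tolerates mismatches and uses alignment); this version assumes exact matches only, and the function names and example reads are hypothetical. The last helper shows why short overlaps are risky: assuming uniformly random bases, a k-base stretch matches the start of the adapter by chance with probability 4^-k.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Cut the read where a full or 3'-truncated adapter begins (exact match)."""
    for start in range(len(read)):
        overlap = min(len(adapter), len(read) - start)
        if overlap < min_overlap:
            break  # remaining tail is too short to count as adapter
        if read[start:start + overlap] == adapter[:overlap]:
            return read[:start]
    return read


def keep_long_reads(reads: list[str], min_length: int = 25) -> list[str]:
    """Discard reads that became too short after trimming."""
    return [read for read in reads if len(read) >= min_length]


def random_match_probability(k: int) -> float:
    """Chance that k random bases equal the first k adapter bases."""
    return 4.0 ** -k


adapter = "AGATCGGAAGAGC"
insert = "ACGT" * 7 + "AC"             # 30 bases of made-up genomic sequence
with_adapter = insert + "AGATCGG"      # 7 adapter bases at the 3' end
short_overlap = insert + "AGA"         # only 3 adapter-like bases: left alone
mostly_adapter = "ACGTACGT" + adapter  # 8-base insert followed by adapter

trimmed = [trim_3prime_adapter(r, adapter) for r in
           (with_adapter, short_overlap, mostly_adapter)]
print(trimmed)
print(keep_long_reads(trimmed))
print(random_match_probability(5))
```

With a minimum overlap of 5, roughly one position in a thousand would match by chance under this model, which is why the report's count of very short removed sequences serves as a rough gauge of false trimming.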


6.2 Running FastQC

a. From the “Tools” panel click on the “NGS: QC and manipulation” group
b. Click “FastQC:Read QC”
c. Select the file to analyze from the drop-down menu, in this case “10: Cutadapt on data 5”
d. Rename the output file to be something recognizable (We used ‘PostCutadaptR’)
e. Click Execute
f. Repeat these steps (a-e) on the Left read file “12: Cutadapt on data 6”

6.3 Resyncing Files

a. From the “Tools” panel click on the “MSI” group
b. Click “resync: Paired-end resynchronization”
c. Select as “Input 1,” “12: Cutadapt on data 6”
d. Select as “Input 2,” “10: Cutadapt on data 5”
e. Click on execute

Trimming and other quality control measures can result in reads, usually those of zero length, being removed from the dataset. Additionally, some processing steps may shuffle reads within the data files. Many programs expect to find the same read names in the same order for both the left and right readsets. To ensure that the read names are in ‘sync’ you can run the ‘re-sync’ tool.
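The idea behind resynchronization can be sketched as follows. This is not the implementation of the MSI ‘resync’ tool, only an illustration under the assumption that mates share exactly the same read name (real files may carry suffixes such as /1 and /2 that would need stripping first). Records are (name, sequence, quality) tuples and the function name is hypothetical.

```python
def resync(left: list[tuple], right: list[tuple]) -> tuple[list, list]:
    """Keep only reads present in both sets, ordered as in the left set."""
    right_by_name = {record[0]: record for record in right}
    left_out, right_out = [], []
    for record in left:
        mate = right_by_name.get(record[0])
        if mate is not None:
            left_out.append(record)
            right_out.append(mate)
    return left_out, right_out


# Made-up example: read2 was dropped from the right set during cleaning,
# and the right set is in a different order.
left_reads = [("read1", "ACGT", "IIII"), ("read2", "GGCC", "IIII"),
              ("read3", "TTAA", "IIII")]
right_reads = [("read3", "TTAA", "5555"), ("read1", "ACGT", "5555")]

left_synced, right_synced = resync(left_reads, right_reads)
print([r[0] for r in left_synced], [r[0] for r in right_synced])
```

After resynchronization both lists contain the same names in the same order, which is what downstream paired-end tools expect.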


7 Verifying Cleaning Results

7.1 Reviewing FastQC

a. Click on the eye icon next to “13: PostCutadaptR…” to view the results of adapter trimming. For illustration purposes we are showing the R1 and R2 reads’ results adjacent to one another.
b. Scroll down to ‘Per base sequence quality’
c. Note the improvement in the average tail quality

[Figure: “Per base sequence quality” plots for the left and right reads, precleaning and postcleaning.]

d. Scroll down to “Per sequence quality scores”
e. Verify removal of low quality peak

[Figure: “Per sequence quality scores” plots for the left and right reads, precleaning and postcleaning.]

f. Scroll down to “Sequence Length Distribution”
g. Note the degree of trimming that has occurred.

[Figure: “Sequence Length Distribution” plots for the left and right reads, precleaning and postcleaning.]

h. Scroll down to ‘Kmer Content’
i. Note the absence of overrepresented tail k-mers

[Figure: “Kmer Content” plots for the left and right reads, precleaning and postcleaning.]

8 Workflows

8.1 Extract workflow from current history

★ Galaxy Workflows

a. At the top of the history pane click on the small gear icon and select “Extract Workflow” from the pop-up menu
b. In the “Workflow name” box enter “QC and Cleanup”
c. Click “Create Workflow”

★ Galaxy Workflows (Sect. 8.1)
Galaxy workflows provide an easy method to automate an analysis pipeline. We will demonstrate how to generate a workflow, modify parts of the workflow and use it to analyze a second set of samples. Workflows and histories can also be shared with other Galaxy users.

★ Workflow Parameters (Sect. 8.2)
The workflow we set up in this section will run FastQC, quality trimming and adapter trimming on both the left (R1) and right (R2) read sets. By default all parameters in the workflow will be the same as those used in the history. Settings that may be different for every analysis, such as the adapter sequences, can be made settable at run-time.
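The distinction between parameters fixed from the history and parameters set at run-time can be pictured with a small conceptual model. This is purely illustrative and is not Galaxy's workflow format or API: the `RUNTIME` marker, the dictionary layout and `resolve_parameters` are all hypothetical, with parameter values taken from the steps in this tutorial.

```python
RUNTIME = object()  # hypothetical marker meaning "ask for this when the workflow runs"

WORKFLOW = [
    {"tool": "FASTQ Quality Trimmer",
     "params": {"window_size": 3, "quality_score": 20}},
    {"tool": "Cutadapt",
     "params": {"adapter": RUNTIME, "minimum_overlap": 5, "minimum_length": 25}},
]


def resolve_parameters(workflow, runtime_values):
    """Fill run-time parameters; fixed ones keep the values from the history."""
    resolved = []
    for index, step in enumerate(workflow):
        params = {}
        for key, value in step["params"].items():
            if value is RUNTIME:
                if (index, key) not in runtime_values:
                    raise ValueError(
                        f"step {index} ({step['tool']}) needs a value for {key!r}")
                value = runtime_values[(index, key)]
            params[key] = value
        resolved.append({"tool": step["tool"], "params": params})
    return resolved


run = resolve_parameters(WORKFLOW, {(1, "adapter"): "TruSeq Index Adapter"})
for step in run:
    print(step["tool"], step["params"])
```

Marking only the adapter as a run-time value keeps the trimming thresholds identical between analyses while still allowing each run to supply its own adapter.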


8.2 Edit the workflow

★ Workflow parameters

a. Click on “Workflow” at the top of the Galaxy window
b. Click on the workflow that was just created and select “Edit” from the drop-down menu
c. By dragging the boxes around you can make the workflow easier to interpret.
d. Find one of the “input datasets” and find and click on the “FastQC:Read QC” run attached to it
e. Look on the “Details” panel, note whether this is the ‘Left’ or ‘Right’ read
f. Click on the attached “Input dataset”


g. Based on step e, label the input. We labeled the input as “Left Read Input”. Do the same for the other “input dataset”, naming it “Right Read Input”
h. Subsequent runs may require different adapter sequences. There is an option to allow the adapter to be set when the workflow is run. Click on a “Cutadapt” box
i. Click on the small downward arrow adjacent to “Enter custom 3’ adapter sequence”
j. Click the “Set at runtime” option
k. Repeat these steps (h-j) for the other ‘Cutadapt’ box in the workflow.
l. Click the ‘gear’ icon at the top of the workflow and select “Save”


8.3 Running the Workflow

a. Click “Analyze Data” to return to your history
b. Before finishing with this history give it a name so you can find it again. We named our history “Galaxy 101 History.” Name the history by clicking on its title. (You must press enter to save the change. Clicking outside the box will not work)
c. Create a new history by clicking on the gear icon at the top of the history pane and selecting “Create New” from the pop-up menu
d. Name the new history “Workflow Test”
e. Import the ‘Tutorial_file_workflow_R1.fastq’ and ‘Tutorial_file_workflow_R2.fastq’ files by clicking on “Shared Data -> Data Libraries” at the top of the screen and selecting the files from the “RISS-tutorial-galaxy101” data library
f. Load a workflow by clicking on “Workflow” at the top of the screen


g. Click on the workflow that was just created and select “Run” from the drop-down menu
h. Select the “1: Tutorial_file_workflow_R1.fastq” file in the “Left Read Input” menu
i. Select the “2: Tutorial_file_workflow_R2.fastq” file in the “Right Read Input” menu
j. Scroll down to “Step 8: Cutadapt” and at “Choose 3’ adapter” select the appropriate adapter, in this case “TruSeq Index Adapter”
k. Scroll down to “Step 10: Cutadapt” and at “Choose 3’ adapter” select the appropriate adapter, in this case “TruSeq Universal Adapter Reverse Complement”
l. Scroll down to the bottom of the main view and click “Run Workflow”
m. Click “Analyze Data” to return to the history view
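Step k selects the reverse complement of an adapter rather than the adapter itself. As a minimal sketch of what that term means, the Python function below reverses a DNA sequence and swaps each base for its complement. The function name and the example sequence are illustrative assumptions; the sequence is not a real adapter.

```python
# Complement of each DNA base; N (an unknown base) maps to itself.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (case-insensitive input)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


# Hypothetical example sequence, chosen only for illustration.
example = "AACGTG"
print(reverse_complement(example))  # CACGTT
```

Applying the function twice returns the original sequence, which is a quick way to check that a reverse-complemented adapter was entered correctly.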


9 Cleaning up Histories

Running tools and workflows in Galaxy uses hard drive storage space, at times a lot of it. We estimate that even a relatively simple RNA-seq analysis will use 4-5 times the storage of the raw sequencing files. The good news is that most of these files are temporary files that can be safely deleted when the analysis is finished. If you decide later that the files are needed, they can be quickly regenerated by re-running the workflow. Your current Galaxy-wide storage usage is shown in the top right corner of the Galaxy window.

Additionally, each individual history shows its storage usage at the top of the panel.

Note that if you are planning to extract a workflow from your current history, you should do so before deleting the intermediate files.

9.1 Deleting Intermediate Files

We will demonstrate deleting intermediate files on the “Workflow Test” history you created as part of the tutorial.

a. Click on “Analyze Data” to return to the history view
b. Click the ‘Gear’ icon to open the options menu and select “Saved Histories”


c. Click on the arrow next to the “Workflow Test” history and choose “Switch”
d. Click the X to delete a file. We only need the final ‘resync’ files and can delete all others.
e. Note that the size of the history has not changed
f. Click on the ‘gear’ icon and select “Include Deleted Datasets”
g. Click the circled link to permanently ‘purge’ the datasets
h. Note that the storage usage has now been reduced
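The steps above show that deleting a dataset does not free storage until the dataset is purged. The Python sketch below is a toy model of that accounting, together with the 4-5x storage estimate quoted for a simple RNA-seq analysis. It is not Galaxy code: the class names, method names, dataset names, and sizes are all hypothetical and chosen only to illustrate the behavior.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Dataset:
    name: str
    size_gb: float
    deleted: bool = False  # hidden from the history but still stored
    purged: bool = False   # permanently removed from storage


@dataclass
class History:
    datasets: List[Dataset] = field(default_factory=list)

    def storage_gb(self) -> float:
        """Storage counted against the user: every dataset not yet purged."""
        return sum(d.size_gb for d in self.datasets if not d.purged)

    def delete_all_except(self, keep: Set[str]) -> None:
        """Mark every dataset as deleted unless its name is in `keep`."""
        for d in self.datasets:
            if d.name not in keep:
                d.deleted = True

    def purge_deleted(self) -> None:
        """Permanently remove datasets that were previously deleted."""
        for d in self.datasets:
            if d.deleted:
                d.purged = True


def estimated_usage_range(raw_gb: float) -> Tuple[float, float]:
    """Apply the tutorial's rough 4-5x estimate to the raw file size."""
    return raw_gb * 4, raw_gb * 5


# Hypothetical history: two raw files, two intermediates, two final files.
history = History([
    Dataset("R1.fastq", 2.0), Dataset("R2.fastq", 2.0),
    Dataset("cutadapt_R1", 1.5), Dataset("cutadapt_R2", 1.5),
    Dataset("resync_R1", 1.0), Dataset("resync_R2", 1.0),
])

before = history.storage_gb()
history.delete_all_except({"resync_R1", "resync_R2"})
after_delete = history.storage_gb()   # unchanged: deleted is not purged
history.purge_deleted()
after_purge = history.storage_gb()    # only the 'resync' files remain

print(before, after_delete, after_purge)
print(estimated_usage_range(4.0))
```

In this model, usage only drops after `purge_deleted()` is called, mirroring steps e through h.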


9.2 Deleting Old Histories

If you have histories you no longer need, they can be deleted from the saved histories menu. As with deleting intermediate files, you must ‘purge’ the histories in order to remove them from the storage space. Here we will delete the “Galaxy 101 History” we used to make our workflow.

a. Click on “Analyze Data” to return to the history view
b. Click the ‘Gear’ icon to open the options menu and select “Saved Histories”
c. Click the ‘downward arrow’ next to “Galaxy 101 History” and select “Delete Permanently.” The workflow we created from this history will be unaffected by this action.
d. Click “Ok” to permanently remove the history


10 Sharing Your Work

Galaxy provides tools for privately sharing your histories with other Galaxy users. All you need to know is the email address of the user with whom you want to share. In this section we will create a link to share our “QC and Cleanup” workflow as well as the final files in the “Workflow Test” history.

10.1 Sharing Workflows

a. Click on “Workflow” to switch to the workflow list
b. Click the ‘Downward Arrow’ next to “QC and Cleanup” and select “Share or Publish”
c. Click “Make Workflow Accessible via Link”
d. Share this link with your collaborators. Click the link to see what they would see.
e. Shared workflows can be saved or imported into your Galaxy account by clicking the icons in the top right corner


10.2 Sharing Histories

a. Click on “Analyze Data” to switch to your history
b. Click the ‘Gear’ icon to open the options menu and select “Saved Histories”
c. Click the ‘Downward Arrow’ next to “Workflow Test” and select “Share or Publish”
d. Click “Make History Accessible via Link”
e. Share this link with your collaborators. Click the link to see what they would see.
f. Shared histories can be imported into your Galaxy account by clicking the icon in the top right corner


11 Appendix I: Iterative tail/adapter removal

This is an advanced workflow for trimming adapters and low-quality read tails. While producing largely the same result as the main tutorial, this workflow provides a little extra sensitivity for cases where a read has both an adapter and a very low quality tail. To illustrate the problem, consider a read made up of a ‘genomic’ sequence (red in the original figure) followed by an ‘adapter’ sequence (green), with a gradient denoting decreasing quality. If adapter trimming is attempted first, the low quality may prevent the adapter sequence from being recognized.

As such, we attempt to trim adapters, trim the low quality tails, and then trim adapters again. This balances maintaining sufficient sequence for adapter recognition with removing low quality bases that may be errors.

a. Open the ‘FASTA Manipulation’ section
b. Click ‘Cutadapt’ to load the adapter trimmer tool
c. Under “Fastq file to trim:” select the ‘Right’ read file “Tutorial_file_R.fastq”
d. Click “Add new 3’ Adapters”
e. Under “Source” select “Enter custom sequence”
f. In the box labeled “Enter custom 3’ adapter sequence” enter AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAGCTGATGGCGCG


g. Set “Minimum overlap length” to 5
h. Set “Output filtering options:” to “Set Filters”
i. Set “Minimum Length” to 15
j. Click ‘Execute’


k. Open the ‘NGS: QC and manipulation’ section of the toolbar
l. Click ‘FASTQ Quality Trimmer’
m. Set “FASTQ File” to “Cutadapt on data 2”
n. Set ‘Trim ends’ to “3’ only”
o. Set ‘Window size’ to 3
p. Set ‘Quality Score’ to 20
q. Click ‘Execute’


r. Open the ‘FASTA Manipulation’ section
s. Click ‘Cutadapt’ to load the adapter trimmer tool
t. Under “Fastq file to trim:” select the result from the previous quality trimming step, “7: FASTQ Quality Trimmer on data 6”
u. Click “Add new 3’ Adapters”
v. Under “Source” select “Enter custom sequence”
w. In the box labeled “Enter custom 3’ adapter sequence” enter AATGATACGGCGACCACCGAGATCTACACGCCTCCCTCGCGCCATCAGCTGATGGCGCG


x. Set “Minimum overlap length” to 5
y. Set “Output filtering options:” to “Set Filters”
z. Set “Minimum Length” to 25
aa. Click ‘Execute’
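The three passes above (trim adapter, trim low-quality tail, trim adapter again) can be sketched in Python to show why the order matters. This is a simplified illustration, not the Cutadapt or FASTQ Quality Trimmer implementation: adapter matching here is exact (no mismatches allowed), the window's minimum score is assumed as the quality aggregate, and the function names, adapter, read, and quality values are all hypothetical.

```python
def trim_adapter_3prime(seq, qual, adapter, min_overlap=5):
    """Remove a 3' adapter using exact matching only (a simplification)."""
    cut = seq.find(adapter)  # case 1: the whole adapter is inside the read
    if cut == -1:
        # Case 2: the read ends with a prefix of the adapter that is
        # at least min_overlap bases long. Longest candidates come first.
        first = max(0, len(seq) - len(adapter) + 1)
        last = len(seq) - min_overlap
        for start in range(first, last + 1):
            if adapter.startswith(seq[start:]):
                cut = start
                break
    if cut == -1:
        return seq, qual
    return seq[:cut], qual[:cut]


def trim_quality_3prime(seq, qual, window=3, threshold=20):
    """Drop 3' bases until the last `window` scores are all >= threshold."""
    end = len(seq)
    while end > 0 and min(qual[max(0, end - window):end]) < threshold:
        end -= 1
    return seq[:end], qual[:end]


def iterative_trim(seq, qual, adapter, first_min_len=15, final_min_len=25):
    """Adapter trim, quality trim, adapter trim; None if the read gets too short."""
    seq, qual = trim_adapter_3prime(seq, qual, adapter)
    if len(seq) < first_min_len:
        return None
    seq, qual = trim_quality_3prime(seq, qual)
    seq, qual = trim_adapter_3prime(seq, qual, adapter)
    if len(seq) < final_min_len:
        return None
    return seq, qual


# Hypothetical data: 30 genomic bases, 7 correct adapter bases, then
# 4 miscalled low-quality bases where the adapter should continue.
ADAPTER = "AGATCGGAAGAGC"
genomic = "ACGTACGTACGTTGCATGCACCTTGGAACC"
read = genomic + "AGATCGG" + "CTTC"
quals = [35] * 37 + [5] * 4

single_pass = trim_adapter_3prime(read, quals, ADAPTER)
three_pass = iterative_trim(read, quals, ADAPTER)
print(len(single_pass[0]), len(three_pass[0]))
```

In this example the single adapter pass leaves the read untouched because the miscalled tail hides the adapter, while the three-pass version recovers only the genomic portion.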
