Galaxy for Proteomics Data Analysis: An Interactive Demonstration ASMS 2016 ANNUAL MEETING June 8 2016
Instructions for accessing the ASMS Galaxy-P Docker Container (also Section 7 below)
Galaxy is now available in Docker containers. Docker containers are an easy way to package software for installation on other systems. The Docker Toolbox now includes Kitematic, a user interface for running Docker containers on Windows and Mac OS X systems. Kitematic makes it easy to run any published Docker container on these systems. To try a pre-configured Galaxy instance on your Mac OS X or Windows machine, follow these steps:
1. Install the Docker Toolbox on your computer. (Note that you may need to enable Virtualization Technology for Docker to run. To do this on Windows, see: http://www.howtogeek.com/213795/how-to-enable-intel-vt-x-in-your-computers-bios-or-uefi-firmware/)
2. Once the Docker Toolbox is installed, launch Kitematic (the interface for downloading and running Docker containers).
3. Search for "asmsgalaxyp". This searches Docker Hub, a repository for Docker containers. Click the “Create” button on the container entry. Kitematic will download and install the container.
4. Once the instance has started (it may take a few minutes to load), click anywhere on the web preview pane (upper right of page), and you have a running Galaxy instance!
INDEX
1 Introduction
1.1 Scope and objective of this tutorial section
1.2 Outline of tutorial
2 Basics of the Galaxy user interface
2.1 Tool Panel, Viewing Pane, History Panel
2.2 Histories in Galaxy
3 Generating a History I: Building a Protein Sequence Database
3.1 Getting the data: Shared data library
3.2 Using the FASTA database downloader Tool and editing History items
3.3 Using the Merge FASTA database tool
4 Generating a History II: Sequence Database Searching and Protein Identification
4.1 Using SEARCHGUI for sequence database searching on a Dataset Collection
4.2 Using PeptideShaker for identifying peptides and proteins
4.3 Galaxy functions: Viewing tool results, re-running steps in a History
4.4 Extracting a workflow from a history
5 PeptideShaker Outputs
5.1 PSM Report
5.2 Current history
5.3 Import tutorial datasets into current history
6 Running a workflow
6.1 Inputs for the session workflow
6.2 Workflow for the session
6.3 Workflow functions
6.4 Running the workflow
6.5 Switching to a completed history
6.6 Quick overview of history functions
6.7 Generating a PSM summary of peptides derived from RNA-Seq derived db
6.8 Converting peptide list into a FASTA format
6.9 BLAST-P searches and filtering
6.10 PSM Evaluation and Genome Visualization
7 Instructions for accessing the ASMS Galaxy-P Docker Container
8 Presenters and acknowledgements
1 Introduction
1.1 Scope and objectives of this tutorial section
There are several objectives for this tutorial:
● Describe the basics of the Galaxy user interface
● Learn about Histories, workflows and related functions in Galaxy
● Learn how to generate a History
● Learn about useful functions in Galaxy for managing data and building analyses
● Learn about sharing Histories and workflows with other Galaxy users
More details on the workings of Galaxy are available online through the core Galaxy project at: https://wiki.galaxyproject.org/Learn
1.2 Outline of tutorial
2 Basics of the Galaxy user interface
2.1 The Galaxy user interface
Galaxy employs a web-based user interface. The interface is accessed via a URL that directs users to either a locally installed instance or an instance running on a remote server. The diagram below shows the basics of the Galaxy user interface:
The Tool Pane displays an organized list of software tools available to users in a particular Galaxy instance. The layout of the tools view can be customized. New tools can be added to a Tool Pane, although this takes some advanced understanding of Galaxy. Most Galaxy instances initially require users to register with an email address or username and to create a password. This is necessary so that data analysis Histories and Workflows can be assigned to each individual user of an instance. Users can register by selecting the “User” dropdown menu above the Main Viewing Pane (sometimes called the Center Pane).
2.2 Histories in Galaxy
In Galaxy, a record of any analysis run “lives” as a History. The History contains all the software tools used in an analysis, along with all parameters used for each tool, as well as the input and output data from the analysis. Intermediate input and output data are also saved for each History item within a multi-step data analysis. Histories may be short (a few analysis steps) or very long (hundreds of sequential analysis steps). Starting a new History does not delete existing ones: older Histories are saved when a user chooses to generate a new History for a data analysis. The active History is shown in the History Pane of the user interface.
3 Generating a History I: Building a protein sequence database
3.1 Getting the data: Shared Data Library
As a first step to getting familiar with Histories in Galaxy, we will build a simple History aimed at creating a protein sequence database that can be used for matching tandem mass spectrometry (MS/MS) data to peptide sequences. Ultimately we will use MS/MS data (with permission) from a published proteomics study in mice (J Proteomics Bioinform. 2014, 7: 1000302). First, let’s create a new History. Click on the “wheel” icon (History Options) in the History Pane. Then select “Create New” from the dropdown menu.
After creating a new History, you can rename it. Click on “unnamed History” and a text box will appear. Give the History a name of your choosing. Be sure to hit Enter after entering the name, or the name will not be changed.
With the renamed (but still blank) History now in place, let’s go get some data to build a History for creating a protein sequence database. To start, we are going to utilize the Shared Data Library as a means to bring a dataset into a History. Select the “Shared Data” tab above the main viewing pane. Then select “Data Libraries”.
a) When the Data Library is loaded, click “Training data” → “ASMS”. Then select the file that ends in “Customized_Splice_isoform_Protein_Database.fasta”. Information on this file will be displayed in the Main Viewing Pane.
b) Click on the button “to History” and this data file will be imported into your active History.
This imported data file is a FASTA formatted database of protein sequences generated from RNA-seq data (from J Proteomics Bioinform. 2014, 7: 1000302), focusing on possible proteins expressed from splice variants encoded in the transcriptomics data. Such a database can be used in proteogenomics studies, where MS/MS data from peptides can be used to confirm expression of novel protein sequence variants. We will follow up on this later in this tutorial. (For more information, see BMC Genomics, 2014, 15:703, which describes the use of Galaxy for generating and using these novel protein sequence databases.) Once the splice isoform database is loaded, it will show up as item number 1 in the History. You may want to rename this item to something shorter and more informative (as has been done in the screenshot).
3.2 Using the FASTA database downloader Tool and editing History items
With the splice isoform proteins loaded, we are next going to import two other protein sequence databases. One will be the Uniprot database of annotated and reviewed proteins known to be expressed in mice. The other will be a database of contaminant proteins known to be commonly found in proteomic samples. Ultimately, these three different databases will be merged and used for matching of peptide sequences to the MS/MS data.
First, let’s download the Uniprot Mouse Database. Follow these steps:
a) In the search box under “Tools” type “Protein Database Downloader” and double-click on the tool
b) From the following drop-down menus set these parameters:
Download From → UniProtKB
Taxonomy → Mus musculus (Mouse)
Reviewed → UniProtKB
Proteome Set → Reference Proteome Set
Include Isoform Data → Yes
c) Click “Execute”
After clicking Execute, a second step in the History will appear, labeled generically as “Protein database”. Next, let’s download the contaminant protein sequence database into the history.
a) Click again on the “Protein Database Downloader” tool
b) From the “Download from” drop-down menu select “cRAP (contaminants)”
c) Click Execute
Edit Attributes in Galaxy (i.e. renaming History items)
After downloading the cRAP contaminants database, the History will contain three items. Items two and three will both be named simply “Protein Database”. To avoid confusion, let’s use the “Edit Attributes” function in Galaxy to rename these History items to something more informative. Next to each History item, you will see a pencil icon (Edit Attributes). When you click on this icon, an editing pane will appear in the Main Viewing Pane. A new name for the corresponding History item can be entered in this editing pane. Clicking on Save will change the name of the History item.
For this analysis, rename History item 2 to something such as “Mouse Uniprot Database” and History item 3 to “Contaminant Database”.
3.3 Using the Merge FASTA database tool
As a final step, let’s merge all three of our FASTA databases into a single database that we can use for matching peptide sequences to MS/MS data.
a) In the search box under “Tools” type “FASTA Merge File and Filter Unique Sequences”, double-click on the tool, and click “Add FASTA file” until three input fields are available
b) From the drop-down menus select the following parameters:
1: Input FASTA Files → Mouse Uniprot Database
2: Input FASTA Files → Contaminant Database
3: Input FASTA Files → Customized_Splice_isoform_Protein_Database
c) Click Execute
A fourth History item will now appear, called “Merged and Filtered FASTA from data 1, data 3, and data 2”. Use the Edit Attributes tool again to name this something more informative, such as “Merged Protein Database”.
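To make the merge step concrete, here is a minimal Python sketch of what merging FASTA databases and keeping only unique sequences could look like. It is an illustration only, not the Galaxy tool's code: the function names are hypothetical, and the choice to keep the first header seen for a duplicated sequence is an assumption (the actual tool may treat duplicate headers differently).

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    seq_parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_parts)
            header = line[1:]
            seq_parts = []
        elif header is not None:
            seq_parts.append(line)
    if header is not None:
        yield header, "".join(seq_parts)


def merge_unique(databases):
    """Merge several FASTA record streams, keeping one entry per sequence.

    Assumption: the first header seen for a sequence is the one kept.
    """
    seen = set()
    merged = []
    for records in databases:
        for header, seq in records:
            if seq in seen:
                continue
            seen.add(seq)
            merged.append((header, seq))
    return merged


# Tiny made-up databases standing in for the three History items.
uniprot = [">sp|P1|EXAMPLE", "MKT", "LLV"]
contaminants = [">cRAP_example", "MKTLLV"]   # same sequence as above
splice_isoforms = [">preB_example", "MAAA"]

merged = merge_unique(
    parse_fasta(db) for db in (uniprot, contaminants, splice_isoforms)
)
for header, seq in merged:
    print(">" + header)
    print(seq)
```

The duplicated contaminant sequence is dropped, so the merged database holds two entries instead of three.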
4 Generating a History II: Sequence Database Searching and Protein Identification
4.1 Using SEARCHGUI for sequence database searching of a Dataset Collection
Now that we have made a merged FASTA sequence database, we will use this database to match MS/MS spectra to peptide sequences via a sequence database search. For this, we will use the sequence database searching program called SEARCHGUI (Proteomics, 2011, 11:996-9). SEARCHGUI bundles several open-source and freely available sequence database searching programs, facilitating analysis of MS/MS data using more than one algorithm and increasing confidence in results. SEARCHGUI has been deployed in Galaxy. Here we will use it to match MS/MS spectra to sequences in our merged database. To carry out such a search, we will need data files containing MS/MS data. To import these into your History, click on the “Shared Data” dropdown menu, and select “Data Libraries” from this list. Click “Training data” → “ASMS” and select the file ending in “Example_MGF_File_1.mgf”. Click on the “to History” button to import it into your active History. Next, go back to the Data Library and click on the file ending in “Example_MGF_File_2.mgf”. Click on the “to History” button to import it into your active History. You should now have these two files added to your History (Items 5 and 6). You may want to rename these items to something shorter and more informative.
MGF files are “Mascot Generic Format” files, which have been converted from raw mass spectrometry data files. These files contain the peak list information from each MS/MS spectrum recorded in the raw data files, and are compatible with analysis using SEARCHGUI.
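For readers who have not opened an MGF file before, the sketch below shows the general layout: each spectrum sits between BEGIN IONS and END IONS, with KEY=value parameter lines followed by m/z and intensity pairs. The parser is a simplified illustration with a hypothetical function name; it assumes every peak line holds at least an m/z and an intensity value, and the example spectrum is made up.

```python
def parse_mgf(lines):
    """Parse MGF text into a list of spectra (parameters plus peak list)."""
    spectra = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None and line:
            if "=" in line and not line[0].isdigit():
                # Parameter line such as TITLE=..., PEPMASS=..., CHARGE=...
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:
                # Peak line: m/z followed by intensity.
                parts = line.split()
                current["peaks"].append((float(parts[0]), float(parts[1])))
    return spectra


example_mgf = """BEGIN IONS
TITLE=example scan 1
PEPMASS=500.25
CHARGE=2+
100.1 10.0
200.2 20.5
END IONS
""".splitlines()

spectra = parse_mgf(example_mgf)
print(len(spectra), spectra[0]["params"]["TITLE"], spectra[0]["peaks"])
```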
Dataset Collections
We have chosen to analyze two MGF files. For their analysis, we will use a Galaxy function called “Dataset Collections” to group these files into a single collection for analysis by SEARCHGUI. The Dataset Collections function is useful when a user needs to analyze multiple files using the same software tool and parameters. Once the files are defined as a collection, the software tool will analyze each file within this collection using the same parameters, eliminating the need to set up separate analysis steps for each file one by one. More information on Dataset Collections can be found here: https://wiki.galaxyproject.org/Histories#Dataset_Collections
To define a Dataset Collection:
a) Click on the check box (Operations on multiple datasets) button in your History.
b) Once selected, a check box will appear beside each item in your History. Check the boxes next to your two MGF files (History items 5 and 6).
c) Hit the button “For all selected files”.
d) A dropdown menu will appear, where you will select “Build dataset list”.
e) A dialogue window will appear. This shows the files that will be a part of the dataset list. There is also a field for naming this dataset collection. Enter “Collection of MGF files” here and click “Create List”. A new item in your History will now appear (Item #7), which is a Dataset Collection containing the two MGF files.
f) Click on the “Operations on multiple datasets” button again to leave the Dataset Collections mode and go back to normal History operations.
Setting up SEARCHGUI for a sequence database search
Now we will use the SEARCHGUI program to match the MS/MS spectra in these two MGF files to peptide sequences. In the Tool pane search window, type in “searchgui”. Click on the SEARCHGUI tool, and a parameter window will be displayed in the Main Viewing pane. We will walk through a number of these settings in order to utilize SEARCHGUI on these example MGF files. To set up the SEARCHGUI analysis, follow these steps:
a) In the “Protein Database” window, select “Merged Protein Database” (History Item 4)
b) Select “Yes” for “Create a concatenated target/decoy database before running PeptideShaker” (this must be selected for PeptideShaker to run successfully and estimate a false-discovery rate for peptide sequence matches, PSMs)
c) For the gene mappings window, select “no”.
d) For Input Peak lists (mgf), first click the single folder button (Dataset collection). Then select the Dataset Collection of MGF files (Item 7 in History).
e) The “DB-‐Search Engines” window contains a selection of Sequence database searching programs that are available in SEARCHGUI. Any combination of these programs can be used for generating PSMs from MS/MS data. For the purpose of this tutorial, we will select all four available programs.
f) These values can be used for the following windows:
-- Precursor Ion Tolerance Units: Parts per million (ppm)
-- Precursor Ion Tolerance: 10
-- Fragment Tolerance (Daltons): 0.1 (this is high resolution MS/MS data)
-- Enzyme: Trypsin
-- Maximum Missed Cleavages: 2
g) Scroll down the page. For the Fixed Modifications Window, three selections should be made in the input window: Carbamidomethylation of C, iTRAQ 4-plex of K, and iTRAQ 4-plex of peptide N-term. Typing the first few letters of each entry in the window will bring up each selection.
h) For Variable Modifications, select the following in the input window: Oxidation of M, and iTRAQ 4-plex of Y (a modification that sometimes occurs with iTRAQ).
i) For the remaining parameters, we will use the default values as they appear in the tool parameters.
j) Hit “Execute” to run SEARCHGUI.
A new History item (#8) will now appear, and be colored yellow, indicating that SEARCHGUI is running. The History item will turn green when the analysis is complete.
Once the database search is completed, the SEARCHGUI tool will output a file (called a SEARCHGUI archive file) that will serve as an input for the next section.
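The precursor tolerance entered above is relative (parts per million), while the fragment tolerance is absolute (Daltons). As a rough illustration of what a 10 ppm setting means in m/z units, the following sketch computes the accepted window around a precursor and the ppm error between a measured and a theoretical value. The function names and the example m/z values are hypothetical and are not part of SEARCHGUI.

```python
def ppm_window(mz, tol_ppm):
    """Return the (low, high) m/z bounds for a relative tolerance in ppm."""
    delta = mz * tol_ppm / 1e6
    return mz - delta, mz + delta


def ppm_error(measured_mz, theoretical_mz):
    """Return the signed mass error of a measurement in ppm."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


# A 10 ppm tolerance at m/z 500 is a window of about +/- 0.005 m/z units.
low, high = ppm_window(500.0, 10.0)
print(round(low, 4), round(high, 4))

# A measurement 0.005 above a theoretical m/z of 500 is about 10 ppm off.
print(round(ppm_error(500.005, 500.0), 2))
```

The same tolerance in ppm therefore corresponds to a wider absolute window for heavier precursors, which is why ppm units are commonly used for high resolution precursor measurements.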
4.2 Using PeptideShaker for identifying peptides and proteins
PeptideShaker (Nat Biotechnol., 2015, 33:22-4) is a companion tool that works with output from SEARCHGUI. It organizes the PSMs output by SEARCHGUI and contained in the SEARCHGUI archive, providing an assessment of confidence of the data, inferring protein identities from the matched peptide sequences, and producing outputs that can be visualized by users to interpret results. PeptideShaker has been wrapped in Galaxy to work in combination with SEARCHGUI outputs. To use PeptideShaker to organize the results of our SEARCHGUI analysis, again go to the search window in the Tool pane and type in “PeptideShaker”. Click on PeptideShaker in the tool menu.
Follow these steps to run PeptideShaker:
a) In the “Compressed SearchGUI results” field select item #8 in your History (this is the SEARCHGUI archive file).
b) For the species type, select “No species restriction”.
c) For both the “Specify Advanced PeptideShaker Processing Options” and “Specify Advanced Filtering Options” fields select “Default” options.
d) The “Output Options” window shows the many options available for outputs from PeptideShaker. For this example, let’s select the following options for outputs:
-- mzidentML File (a community standard for reporting sequence database search results)
-- PSM report (all information about PSMs from SEARCHGUI, tabular text format)
-- Peptide Report (all information on peptide sequences identified from PSMs, tabular text format)
-- Protein Report (all information on inferred proteins from identified peptides, tabular text format)
-- Certificate of Analysis (a text file with information on parameters used in PeptideShaker analysis and summary of results)
-- Hierarchical report (an expanded output with information on proteins and peptides identified, tabular text format)
e) Hit “Execute”
A number of new items will appear in your History, each corresponding to one of the outputs selected in the PeptideShaker parameters. These will be colored yellow until the job completes, at which point they will turn green.
4.3 Galaxy functions: Viewing and downloading results, editing History items, re-running analysis steps in a History
Now that we have successfully built a History to conduct sequence database searching and generate outputs of identified proteins and peptides, let’s take a look at some functions within the Galaxy framework that users may find highly useful. This is not a comprehensive listing of Galaxy functions, but a selection of those likely to be of most practical value.
i) Viewing and downloading results, re-running analyses. The results contained in a History item can be viewed by clicking on the name of the item. For example, let’s click on Item #9 in our History, the mzidentML output file. When this name is clicked, the History item expands to show the format of the file and other information. A number of additional buttons are also revealed for viewing information on the file.
Let’s focus on some useful functions within this expanded view.
a) “The eye”. Clicking on “the eye” (View data) provides a view of the formatted file contents in the Main Viewing pane, for compatible, non-binary formatted file types. Binary formatted file types are automatically downloaded when clicking on the View data button.
b) In the expanded view a Download button with a hard disk icon is available. Clicking this button will automatically download the file to the Download folder on the local hard drive.
c) A button containing the letter “i” (View details) is also revealed. Clicking the View details button will bring up a summary of information about this file such as format, size, date created, Galaxy tools used in its generation and also an inheritance chain, for files that were copied from other Histories.
d) A very valuable function is the re-run or “Run this job again” button containing the circular, two-arrow icon. Clicking on this button will bring up the tool parameters used for the initial analysis in the Main Viewing Pane. The tool can be executed again using these same parameters, or the parameters can be changed and the analysis re-run. A new History item will be produced with the output. This is an efficient way to test outcomes using altered parameters for a Galaxy tool.
4.4 Extracting a Workflow from a History
Finally, let’s learn about a valuable function in Galaxy, extracting the workflow from a completed History. Workflows differ from Histories in that they are a series of defined tools or actions, but lack the input and output data (https://wiki.galaxyproject.org/Learn/AdvancedWorkflow). Histories contain not only all the tools and actions, but also all the input and output data. Workflows can be easily extracted from a completed History. Click on the wheel icon (History options) at the top of the History pane, and select “Extract workflow” from the drop-down menu.
A workflow window will open in the Main Viewing Pane. Here, the name of the extracted workflow can be specified, and the tools included in the workflow can be selected. Clicking on “Create workflow” will create and store the specified workflow. The extracted workflow can be accessed by clicking on the “Workflow” tab above the Main Viewing Pane, and will be listed under “Your workflows”. By clicking on any workflow in this list, you can choose to run the workflow, edit the workflow, or share the workflow with other Galaxy users.
5 PeptideShaker Outputs
This session of the workshop will take you through the processing of search results (PSM Report) generated via SearchGUI / PeptideShaker analysis. This will include a blueprint workflow for a) Generating a PSM summary of peptides derived from RNA-Seq derived db; b) Converting peptide list into a FASTA format (as an input for BLAST-P analysis); c) BLAST-P searches and filtering.
Outline of tutorial
Reference materials
Salivary proteogenomics workflow manuscript: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4261978/
Hibernation proteogenomics manuscripts: http://www.ncbi.nlm.nih.gov/pubmed/26435507 and http://www.ncbi.nlm.nih.gov/pubmed/26903422
Multi-omics overview: http://www.nature.com/nbt/journal/v33/n2/full/nbt.3134.html
Protein Report
- valid proteins
- coverage
- molecular weight

Peptide Report
- valid peptides
- potential novel proteoforms based on accession numbers
- sequences
- modifications and localization score
- confidence

Spectrum (PSM) Report
- valid spectra
- potential novel proteoforms based on accession numbers
- sequences
- modifications and localization score
- confidence
- m/z, charge state, Δm/z

Summary (Parameters)
- valid peptides
- valid proteins
- valid spectra

Archive (zipped file)
- CPS file to visualize data

mzIdentML
- PSM Visualization
- SWATH Analysis
- Skyline
- Scaffold
5.1 PSM Report (PeptideShaker Output)
PSM Report
c1: Column 1: Rank of protein group
c2: Protein(s): Accession numbers of protein groups
c3: Sequence: Amino acid sequence of the identified peptide
c4: Variable modifications
c5: Fixed Modifications
c6: Spectrum File: Input MGF file of the identified PSM
c7: Spectrum Title: Fraction number, scan number and charge state
c8: Spectrum Scan Number
c9: Retention Time
c10: m/z: Mass to charge ratio
c11: Measured Charge
c12: Identification Charge
c13: Theoretical Mass: Calculated from identified peptide sequence
c14: Isotope Number
c15: Precursor m/z Error [ppm]
c16: Localization Confidence
c17: Probabilistic PTM score
c18: D-score
c19: Confidence
c20: Validation: Confidence > 85 and delta ppm within 6 ppm are CONFIDENT PSMs
The PSM Report contains information about the peptide-spectral matching of all spectra within the dataset. The report contains the Sequence of the peptide (c3), the Spectrum Title information (c7) and its associated Confidence score (c19).
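The validation rule listed for c20 can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the column list, not PeptideShaker's own validation logic. It assumes the PSM Report is tab-separated with one header line, uses the column positions given above, and reads “within 6 ppm” as an absolute error of at most 6 ppm; the function names and example rows are made up.

```python
import csv
import io

PPM_ERROR_COL = 15    # c15: Precursor m/z Error [ppm]
CONFIDENCE_COL = 19   # c19: Confidence


def is_confident(row, min_confidence=85.0, max_abs_ppm=6.0):
    """Apply the rule: Confidence > 85 and |ppm error| within 6 ppm."""
    confidence = float(row[CONFIDENCE_COL - 1])
    ppm_error = float(row[PPM_ERROR_COL - 1])
    return confidence > min_confidence and abs(ppm_error) <= max_abs_ppm


def confident_psms(handle, has_header=True):
    """Return the rows of a tab-separated PSM report that pass the rule."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)
    return [
        row for row in reader
        if len(row) >= CONFIDENCE_COL and is_confident(row)
    ]


def example_row(sequence, ppm, confidence):
    """Build a made-up 20-column row with only the columns used here."""
    row = [""] * 20
    row[2] = sequence
    row[PPM_ERROR_COL - 1] = str(ppm)
    row[CONFIDENCE_COL - 1] = str(confidence)
    return row


header = ["c%d" % i for i in range(1, 21)]
rows = [
    header,
    example_row("PEPTIDEA", 2.5, 99.0),    # passes both criteria
    example_row("PEPTIDEB", -7.1, 99.0),   # fails the ppm criterion
    example_row("PEPTIDEC", 1.0, 60.0),    # fails the confidence criterion
]
report_text = "\n".join("\t".join(r) for r in rows)

kept = confident_psms(io.StringIO(report_text))
print([row[2] for row in kept])
```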
5.2 Current history
For this session, your history should contain output generated from searching 9 RAW files, instead of the two files that were used in Session 1. It should contain a) Parameters, b) PSM Report and c) Protein Report outputs from the PeptideShaker analysis. You will need to import the tutorial datasets into your current history.
We will be processing the PSM Report and using its outputs.
5.3 Import tutorial datasets into current history
Tutorial Dataset: At the top click on “Shared Data” and then Histories.
Select “History 3” from the list of histories.
Import the History into your account
6 Running workflow for this session
6.1 Inputs for the session 2 workflow
For Session 2, the only input needed is the PSM Report. See Section 5.3 for how to get the right inputs for this workflow.
6.2 Workflow for the session
Select History 3 as your active history.
Select “Shared Data” and click on “Workflows”.
Select ASMS 2016:… workflow and import it into your account.
6.3 Workflow functions
There are various options for using the workflow. We will be using Edit and Run later in the workshop. Here is a short description of some of the functions:
1) Share or Publish: A user can share a link with another user (who has an account on the same server). Workflows can also be published for all users to view and use.
2) Download or Export: This feature lets you transfer workflows between two Galaxy instances. A user can download the workflow as a .ga file that preserves the names, parameters and sequence of the tools that are used in a workflow. You can also obtain a hyperlink that can be used to download a workflow. This function can also be used to store your workflows in myExperiment with your login and password. Lastly, this feature can be used to generate a workflow image for presentation or publication.
3) Copy: This function is used to copy a workflow so that a modified version can be generated from a master copy. The modified workflow might have alternative parameters, tools or sequence of tools.
4) Rename: This function changes the name of the workflow.
5) View: This offers a linear overview of the workflow with its parameters, along with annotations if any.
6) Delete: You can also delete older versions of a workflow. Use this with caution! It might result in deleting hours of your work!
7) Edit: This is a powerful function that provides an overview of the workflow. We can change parameters, change names of outputs, and add or remove tools using the Edit mode. Click on Edit to open this option.
You can explore various options in edit mode, including renaming inputs, changing parameters, and adding or removing tools. Please ensure that you DO NOT SAVE if you plan to use the current version of the workflow. If you would like to retain the changes that you have made, please do not forget to SAVE the workflow.
Generating a PSM summary of peptides derived from RNA-Seq derived db
Converting peptide list into a FASTA format
BLAST-P searches and filtering
The workflow for this session is a multi-step workflow. In this workflow, 33 processing steps are used to take the PSM Report from PeptideShaker and prepare it for BLAST-P analysis to verify novel proteoforms.
Overview of Workflow
- Step 1: Input dataset (PSM Report)
- Steps 2-8: Selects peptides with accession number from RNASeq-derived protein FASTA file. (See Section 6.7 below for details)
- Step 9: PSM Report of peptides identified from RNASeq-derived proteins.
- Steps 10-18: Conversion of peptide list into a FASTA format
- Step 19: Short BLAST-P on NCBI remote nr mouse database
- Step 20: BLAST-P on NCBI remote nr mouse database
- Steps 21-28: Identifies mismatched peptides.
- Step 29: Peptides corresponding to novel proteoforms.
- Steps 30-32: Conversion to PSM Report of peptides corresponding to novel proteoforms.
- Step 33: PSM Report of peptides corresponding to novel proteoforms.
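The conversion of a peptide list into FASTA format (Steps 10-18) is done in the workflow with a chain of text manipulation tools. The idea can be summarized in a short Python sketch. This is not the workflow's implementation: the function name and the “peptide_N” header scheme are hypothetical, and removing duplicate peptides before writing entries is an assumption made for the example.

```python
def peptides_to_fasta(peptides, prefix="peptide"):
    """Turn a list of peptide sequences into FASTA text.

    Each distinct peptide becomes one entry with a generated header.
    """
    lines = []
    seen = set()
    for peptide in peptides:
        peptide = peptide.strip().upper()
        if not peptide or peptide in seen:
            continue
        seen.add(peptide)
        lines.append(">%s_%d" % (prefix, len(seen)))
        lines.append(peptide)
    return "\n".join(lines) + ("\n" if lines else "")


# Made-up peptide sequences; the duplicate is written only once.
fasta_text = peptides_to_fasta(["ELVISK", "ELVISK", "LIVESK"])
print(fasta_text, end="")
```

A FASTA file in this shape, one short entry per peptide, is the kind of input that a BLAST-P search expects as its query.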
You can run a workflow through the Edit interface or through the Workflows list. Let us use the Run function in the Workflows list to run the workflow. Please remember that the history containing the workflow inputs must be your active history when you run the workflow.
6.4 Running the Workflow
Select ‘PSM Report’ for Step 1.
Run Workflow.
If your workflow ran successfully, we will use the resulting history to go through the steps. If not, get ‘History 4’ from the Data Library and import it into your Saved Histories.
6.5 Switching to a completed history
Unhide Hidden Datasets from your completed workflow OR from the “END History for Session 4” that you have as your current history. Once “unhidden” you should see 33 datasets within your history.
To view other steps in detail, search specific tools using the left panel by clicking on any of the eye icons. Go through all steps.
6.6 Quick overview of history options
Once you click on the wheel icon in the top right corner, you will see multiple options grouped under HISTORY LISTS, HISTORY ACTIONS, DATASET ACTIONS, DOWNLOADS and OTHER ACTIONS. Here is a brief overview of each of the options:

HISTORY LISTS
• Saved Histories: Opens a list of all of the user's histories in the main viewer pane / central pane.
• Histories Shared with Me: Gives access to histories that other users with an account on the same server have shared specifically with you.

HISTORY ACTIONS
• Create New: Generates a new history.
• Copy History: Copies all datasets in a history.
• Share or Publish: Allows a user to share the history via a link, or to share it with another user by typing in their login email.
• Show structure: Shows details of workflow parameters that were used in each step.
• Extract Workflow: Helps extract a workflow for subsequent analysis on similar datasets / replicates. (See section 4.4 for more information)
• Delete: If you want to delete the history. (Use with caution!)
• Permanently delete: If you really hate the history and want to permanently delete it. (Use with extreme caution!)

DATASET ACTIONS
• Copy Datasets: Helps to copy selected datasets from one history to another.
• Dataset Security: Sets permissions and roles that let various users access or edit the history.
• Resume paused jobs: Resumes jobs that have been paused.
• Collapse Expanded Datasets: Collapses all expanded datasets.
• Unhide Hidden Datasets: Helps in unhiding all the hidden datasets from a workflow. (See section 6.5 for details)
• Delete Hidden Datasets: Deletes datasets that have been hidden.
• Purge Deleted Datasets: Purges deleted datasets.
6.7 Generating a PSM summary of peptides derived from the RNA-Seq-derived database
Step 8: Selects peptides with accession numbers starting with preB or proB.
Step 9: PSM Report of potential novel PEPTIDES. Click on the eye icon to view details of the PSM Report in the main panel.
6.8 Converting peptide list into a FASTA format (as an input for BLAST-P analysis)
Let us focus on steps 2 to 12.
Step 2: Removes the first line of the PSM Report. (The report now has no header row, so we will need to refer to columns by number.)
Note: To view details of a step, click on the step number and then click on the ‘rerun’ icon. DO NOT hit rerun though!
Step 3: Sorts the PSM Report by Spectrum Title (column 7) in ascending order and by Confidence (column 19) in descending order. This ensures that the highest-ranking PSM for each spectrum title is at the top.
Step 4: Ranks columns based on the new sorting performed in Step 3.
Step 5: Group - helps in selecting only one PSM per Spectrum Title.
Step 6 and 7: Join and Cut - generates a PSM Report with one PSM per spectrum title.
Step 8: Selects peptides with accession numbers starting with preB or proB. (Details in figure below)
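The logic of Steps 2 to 8 can be sketched outside Galaxy in a few lines of Python. This is only an illustration of the idea, not the Galaxy tools themselves: the function names, the toy four-column report, and the position of the accession column are assumptions made for the example. The real PSM Report uses column 7 for Spectrum Title and column 19 for Confidence (1-based); the sketch uses 0-based indices into a much smaller table.

```python
import csv
import io


def top_psm_per_spectrum(report_text, title_col, confidence_col):
    """Keep the highest-confidence row for each spectrum title (Steps 2-7)."""
    rows = list(csv.reader(io.StringIO(report_text), delimiter="\t"))
    body = rows[1:]  # Step 2: drop the header line
    # Step 3: spectrum title ascending, confidence descending
    body.sort(key=lambda r: (r[title_col], -float(r[confidence_col])))
    best = {}
    for row in body:
        # Steps 4-7: the first row seen for a title is its top-ranked PSM
        best.setdefault(row[title_col], row)
    return list(best.values())


def select_by_accession_prefix(rows, accession_col, prefixes=("preB", "proB")):
    """Step 8: keep rows whose accession starts with one of the prefixes."""
    return [r for r in rows if r[accession_col].startswith(prefixes)]


# Hypothetical toy report: accession, peptide, spectrum title, confidence
toy_report = "\n".join([
    "Protein\tSequence\tSpectrum Title\tConfidence",
    "preB_001\tPEPTIDEK\tscan=1\t98.5",
    "sp|P12345\tPEPTLDEK\tscan=1\t75.0",
    "proB_042\tLONGERPEPTIDER\tscan=2\t99.0",
    "sp|Q99999\tANOTHERK\tscan=3\t97.0",
])

top = top_psm_per_spectrum(toy_report, title_col=2, confidence_col=3)
novel_candidates = select_by_accession_prefix(top, accession_col=0)
```

In the toy data, two PSMs compete for the spectrum "scan=1"; only the higher-confidence one survives, and the prefix filter then keeps the rows that came from the RNA-Seq-derived database.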
Let us focus on steps 10 to 18.
Step 10: Generates a peptide list along with ranking number by cutting column 3 (c3) and column 21 (c21) from the Step 9 tabular format file.
Step 11: Generates a peptide list by cutting column 1 (c1) from the Step 10 tabular format file.
Step 12: Generates a FASTA format from Step 11 in the following format:
    >PEPTIDE
    PEPTIDE
Step 13: Computes sequence length on data 12.
Step 14: Generates a FASTA format from data 13:
    >PEPTIDE_length
    PEPTIDE
Step 15: Filters sequences from length 8 to 30 aas from the list of sequences.
Step 16: Filters sequences from length 31 to 50 aas from the list of sequences.
Step 17: Converts Step 15 output to a format so that it can be searched by short BLAST-P search:
    >PEPTIDE_sequence length=length aa
    PEPTIDE
Step 18: Converts Step 16 output to a format so that it can be searched by short BLAST-P search:
    >PEPTIDE_sequence length=length aa
    PEPTIDE
Step 17 and 18: Converts Step 15 and 16 output to a format so that it can be searched by using short BLAST-P search.
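As a rough illustration of what Steps 11 to 18 achieve, the Python sketch below turns a peptide list into FASTA records and splits them into the two length bins (8 to 30 aas and 31 to 50 aas). The function name and the example peptides are hypothetical, and the header layout is one reading of the ">PEPTIDE_sequence length=length aa" pattern shown above; the Galaxy workflow reaches the same result through several separate text-manipulation tools.

```python
def peptides_to_fasta(peptides, min_len, max_len):
    """Build FASTA text for peptides whose length lies in [min_len, max_len]."""
    records = []
    for pep in peptides:
        if min_len <= len(pep) <= max_len:
            # Header carries the peptide itself plus its length in amino acids
            records.append(f">{pep}_sequence length={len(pep)} aa\n{pep}")
    return "\n".join(records)


# Hypothetical peptide list: 8 aas, 5 aas (too short for either bin), 40 aas
long_pep = "ACDEFGHIKLMNPQRSTVWY" * 2
peptides = ["PEPTIDEK", "SHORT", long_pep]

short_fasta = peptides_to_fasta(peptides, 8, 30)   # input for short BLAST-P
long_fasta = peptides_to_fasta(peptides, 31, 50)   # input for BLAST-P
```

Peptides outside both bins (here the 5-residue one) appear in neither output, which mirrors the effect of the two length filters in Steps 15 and 16.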
6.9 BLAST-P searches and filtering
BLAST-P search and output processing are carried out in steps 19 to 33.
Step 19: Performs a short BLAST-P search on the peptide FASTA sequences and generates an XML output. The short BLAST-P uses parameters suited to short peptide sequences (8-30 aas). Please use the rerun option to look at the parameters used.
Step 20: Performs a BLAST-P search on the peptide FASTA sequences of 31 aas or longer and generates an XML output. The BLAST-P uses parameters suited to long peptide sequences (31-50 aas). Please use the rerun option to look at the parameters used.
Step 21 and 23: Converts BLAST XML output into a tabular output with various metrics such as a) ID of your sequence (c1); b) Percentage of identical matches (c3); c) Total number of gaps (c17); d) Alignment length (c4); and e) Query length (c23).
Step 22 and 24: Query sequences with no hits for data 19 and 20 respectively.
Step 25: Calculates the percentage of alignment length versus actual query length and adds it as column 25.
Step 26: Selects peptides for which the percentage of identical matches (c3) is less than 100 OR the total number of gaps (c17) is at least one OR the percentage of alignment length versus actual query length is less than 100.
Steps 27-29: Generates a list of peptides corresponding to novel proteoforms.
Steps 30-33: Generates a PSM Report of peptides corresponding to novel proteoforms.
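The calculation in Step 25 and the selection rule in Step 26 can be written as a small Python sketch. The function names and the toy hit values are assumptions for illustration; in the workflow these values come from columns c3, c17, c4 and c23 of the tabular BLAST output.

```python
def alignment_coverage(align_len, query_len):
    """Step 25: alignment length as a percentage of the query length."""
    return 100.0 * align_len / query_len


def is_imperfect_match(pident, gaps, align_len, query_len):
    """Step 26: keep a hit unless it is an identical, gapless, full-length match."""
    return (
        pident < 100
        or gaps >= 1
        or alignment_coverage(align_len, query_len) < 100
    )


# Hypothetical hits: (query id, % identity, gaps, alignment length, query length)
hits = [
    ("pep1", 100.0, 0, 8, 8),   # perfect match to a known protein: dropped
    ("pep2", 87.5, 0, 8, 8),    # one mismatch: kept
    ("pep3", 100.0, 0, 7, 8),   # aligns over only 7 of 8 residues: kept
    ("pep4", 100.0, 1, 9, 8),   # contains a gap: kept
]

kept = [h[0] for h in hits if is_imperfect_match(*h[1:])]
```

A peptide that matches a known protein perfectly over its whole length is already explained by the reference database, so only the imperfect matches move forward as candidates for novel proteoforms.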
BLAST-P SEARCH
BLAST (Basic Local Alignment Search Tool) is a web-based tool used to compare biological sequences. BLAST-P matches protein sequences against a protein database. More specifically, it looks at the amino acid sequence of proteins and can detect and evaluate the differences between, say, an experimentally derived sequence and all known amino acid sequences in a database. It can then find the most similar sequences, allowing identification of known proteins or of potential peptides associated with novel proteoforms.
6.10 PSM Evaluation and Genome Visualization
Once peptides corresponding to novel proteoforms are identified, they are subjected to Peptide-Spectral Match (PSM) evaluation. This involves PSM visualization, which reveals whether a reported high-scoring spectrum is in fact a result of several unmatched ions. Validation of PSMs is often considered the final step before reporting protein identifications. Good-quality PSMs are subsequently placed on the genome. The localization of each peptide can reveal intriguing genomic architecture. In essence, proteogenomics involves the mapping of an experimental proteome to an established genome. Clustering of proteoforms in a particular genomic region may indicate a point of interest for further research. For an excellent review on proteogenomics, see Nesvizhskii et al (2014).
What are Proteoforms?
Due to the genomic complexity and redundancy of proteins and the associated post-translational modifications that can occur during or after their expression, there can be a number of proteoforms associated with a protein. A proteoform is the product that results from a protein’s specific genetic code and all the modifications molding it (e.g. post-translational modifications) or its transcription (e.g. alternatively spliced RNA and allelic variations). For more information about proteoforms, please read the manuscript by Smith and Kelleher (2013).

Why are they so important?
Proteoforms contribute to biological diversity. Because of chemical differences, proteoforms differ not only in structure but in function as well. This leads to several different process modulations that affect cells differently, contributing to variation between and within individuals.

Identifying Peptides Corresponding to Novel Proteoforms
Proteoforms retain a lot of similarity with one another, which can make it hard to distinguish them from one another. Since the advent of proteomics, peptides corresponding to novel proteoforms are continually being identified after verification through BLAST analysis. Once validated, these proteoforms help in a more complete annotation of the genome and also in identifying a role for such novel biomarkers in disease and physiological states such as cancer.
7 Instructions for accessing the ASMS Galaxy-P Docker Container
Galaxy is now available in Docker containers. Docker containers are an easy way to package software for installation on other systems. The Docker Toolbox now includes Kitematic, a user interface for running Docker containers on Windows and Mac OS X systems. Kitematic makes it easy to run any published Docker container on these systems.
To try a pre-configured Galaxy instance on your Mac OS X or Windows machine, follow these steps:
1. Install the Docker Toolbox on your computer (note you may need to enable Virtualization Technology for Docker to run. To do this on Windows, see: http://www.howtogeek.com/213795/how-to-enable-intel-vt-x-in-your-computers-bios-or-uefi-firmware/)
2. Once the Docker Toolbox is installed, launch Kitematic (the interface for downloading and running Docker containers).
3. Search for "asmsgalaxyp". This searches Docker Hub, a repository for Docker containers. Hit the “Create” button in the Docker container. Kitematic will download the container and install.
4. Once the instance has started (it may take a few minutes to load), click anywhere on the web preview pane (upper right of page), and you have a running Galaxy instance!
8 Presenters and acknowledgements
Presenters:
The main presenters are members of the Galaxy-P research team at the University of Minnesota, working on an ongoing project developing Galaxy for multi-omic applications (National Science Foundation Grant 1458524). We have in-depth experience in Galaxy and its use for multi-omics data analysis (z.umn.edu/galaxypreferences).
Speakers in our session include:
● Tim Griffin, Professor, and Faculty Director, CMSP, University of Minnesota. Dr. Griffin is the Principal Investigator on the project developing Galaxy for multi-omics.
● Pratik Jagtap, Assistant Professor, and Managing Director, CMSP, University of Minnesota. ABRF Member. Member of Protein Research Group (PRG).
Thanks to
Also thanks to the Amazon Web Services Education Research Grant.