
Documentation for ASMS 2016 Galaxy-P Workshop

Galaxy for Proteomics Data Analysis: An Interactive Demonstration
ASMS 2016 ANNUAL MEETING, June 8, 2016

Instructions for accessing the ASMS Galaxy-P Docker Container (also Section 7 below)

Galaxy is now available in Docker containers. Docker containers are an easy way to package software for installation on other systems. The Docker Toolbox now includes Kitematic, a user interface for running Docker containers on Windows and Mac OS X systems. Kitematic makes it easy to run any published Docker container on these systems.

To try a pre-configured Galaxy instance on your Mac OS X or Windows machine, follow these steps:

1. Install the Docker Toolbox on your computer. (Note that you may need to enable Virtualization Technology for Docker to run. To do this on Windows, see: http://www.howtogeek.com/213795/how-to-enable-intel-vt-x-in-your-computers-bios-or-uefi-firmware/)
2. Once the Docker Toolbox is installed, launch Kitematic (the interface for downloading and running Docker containers).
3. Search for "asmsgalaxyp". This searches Docker Hub, a repository for Docker containers. Hit the “Create” button for the Docker container. Kitematic will download the container and install it.
4. Once the instance has started (it may take a few minutes to load), click anywhere on the web preview pane (upper right of page), and you have a running Galaxy instance!
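Because the instance may take a few minutes to load, it can help to poll it instead of refreshing the browser by hand. The following is a minimal sketch, not part of the workshop material. It assumes the Galaxy release inside the container answers at the `/api/version` path; the base URL has to be copied from Kitematic's web preview pane. The function names `galaxy_is_up` and `wait_until_ready` are hypothetical.

```python
import time
import urllib.error
import urllib.request


def galaxy_is_up(base_url, timeout=5.0):
    """Return True if the server at base_url answers /api/version with HTTP 200.

    Assumption: the Galaxy instance exposes /api/version; any connection
    error or HTTP error is treated as "not ready yet".
    """
    url = base_url.rstrip("/") + "/api/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe, attempts=60, delay=5.0, sleep=time.sleep):
    """Call probe() until it returns True; return the attempt number, or None."""
    for attempt in range(1, attempts + 1):
        if probe():
            return attempt
        if attempt < attempts:
            sleep(delay)  # wait before the next probe
    return None
```

To use it, pass a probe bound to your own address, for example `wait_until_ready(lambda: galaxy_is_up(url))`, where `url` is the address Kitematic displays for the running container.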

   

ASMS 2016: Galaxy for Proteomics Data Analysis: An Interactive Demonstration

INDEX

1 Introduction
   1.1 Scope and objective of this tutorial section
   1.2 Outline of tutorial

2 Basics of the Galaxy user interface
   2.1 Tool Panel, Viewing Pane, History Panel
   2.2 Histories in Galaxy

3 Generating a History I: Building a Protein Sequence Database
   3.1 Getting the data: Shared data library
   3.2 Using the FASTA database downloader Tool and editing History items
   3.3 Using the Merge FASTA database tool

4 Generating a History II: Sequence Database Searching and Protein Identification
   4.1 Using SEARCHGUI for sequence database searching on a Dataset Collection
   4.2 Using PeptideShaker for identifying peptides and proteins
   4.3 Galaxy functions: Viewing tool results, re-running steps in a History
   4.4 Extracting a workflow from a history

5 PeptideShaker Outputs
   5.1 PSM Report
   5.2 Current history
   5.3 Import tutorial datasets into current history

6 Running a workflow
   6.1 Inputs for the session workflow
   6.2 Workflow for the session
   6.3 Workflow functions
   6.4 Running the workflow
   6.5 Switching to a completed history
   6.6 Quick overview of history functions
   6.7 Generating a PSM summary of peptides derived from RNA-Seq derived db.
   6.8 Converting peptide list into a FASTA format
   6.9 BLAST-P searches and filtering
   6.10 PSM Evaluation and Genome Visualization

7 Instructions for accessing the ASMS Galaxy-P Docker Container

8 Presenters and acknowledgements


1 Introduction  

1.1 Scope and objectives of this tutorial section

There are several objectives for this tutorial:
● Describe the basics of the Galaxy user interface
● Learn about Histories, workflows and related functions in Galaxy
● Learn how to generate a History
● Learn about useful functions in Galaxy for managing data and building analyses
● Learn about sharing Histories and workflows with other Galaxy users

More details on the workings of Galaxy are available online through the core Galaxy project at: https://wiki.galaxyproject.org/Learn

1.2 Outline of tutorial

2 Basics of the Galaxy user interface

2.1 The Galaxy user interface

Galaxy employs a web-based user interface. The interface is accessed via a URL that directs users to either a locally installed instance or an instance running on a remote server. The diagram below shows the basics of the Galaxy user interface:


The Tool Pane displays an organized list of software tools available to users in a particular Galaxy instance. The layout of the tools view can be customized. New tools can be added to a Tool Pane, although this takes some advanced understanding of Galaxy.

Initially, most Galaxy instances require users to register with an email address or username and to create a password. This is necessary so that data analysis Histories and Workflows can be assigned to each individual user of an instance. Users can register by selecting the “User” dropdown menu above the Main Viewing Pane (sometimes called the Center Pane).

2.2 Histories in Galaxy

In Galaxy, a record of any analysis run “lives” as a History. The History contains all the software tools used in an analysis, along with all parameters used for each software tool, as well as the input and output data from the analysis. Intermediate input and output data are also saved for each History item within a multi-step data analysis. Histories may be short (a few analysis steps) or very long (hundreds of sequential analysis steps).

Histories are never deleted; rather, older Histories are saved when a user chooses to generate a new History for a data analysis. The active History is shown in the History Pane of the user interface.


3 Generating a History I: Building a protein sequence database

3.1 Getting the data: Shared Data Library

As a first step to getting familiar with Histories in Galaxy, we will build a simple History aimed at creating a protein sequence database that can be used for matching tandem mass spectrometry (MS/MS) data to peptide sequences. Ultimately we will use MS/MS data (with permission) from a published proteomics study in mice (J Proteomics Bioinform. 2014, 7: 1000302).

First, let’s create a new History. Click on the “wheel” icon (History Options) in the History Pane. Then select “Create New” from the dropdown menu.

After creating a new History, you can re-name it. Click on “unnamed History” and a text box will appear. Give the History a name of your choosing. Be sure to hit Enter after entering the name, or the name will not be changed.


With the re-named (but still blank) History now in place, let’s go get some data to build a History for creating a protein sequence database.

To start, we are going to utilize the Shared Data Library as a means to bring a dataset into a History. Select the “Shared Data” tab above the main viewing pane. Then select “Data Libraries”.

a) When the Data Library is loaded, click “Training data” → “ASMS”. Then select the file that ends in “Customized_Splice_isoform_Protein_Database.fasta”. Information on this file will be displayed in the Main Viewing Pane.


b) Click on the button “to History” and this data file will be imported into your active History.

This imported data file is a FASTA-formatted database of protein sequences generated from RNA-seq data (from J Proteomics Bioinform. 2014, 7: 1000302), focusing on possible proteins expressed from splice variants encoded in the transcriptomics data. Such a database can be used in proteogenomics studies, where MS/MS data from peptides can be used to confirm expression of novel protein sequence variants. More on this will follow later in this tutorial. (For more information, see BMC Genomics, 2014, 15:703, which describes the use of Galaxy for generating and using these novel protein sequence databases.)

Once the splice isoform database is loaded, it will show up as item number 1 in the History. You may want to re-name this item something shorter and more informative (as has been done in the screenshot).

3.2 Using the FASTA database downloader Tool and editing History items

With the splice isoform proteins loaded, we are next going to import two other protein sequence databases. One will be the UniProt database of annotated and reviewed proteins known to be expressed in mice. The other will be a database of contaminant proteins known to be commonly found in proteomic samples. Ultimately, these three different databases will be merged and used for matching peptide sequences to the MS/MS data.


First, let’s download the UniProt Mouse Database. Follow these steps:

a) In the search box under “Tools” type “Protein Database Downloader” and double-click on the tool

b) From the following drop-down menus set these parameters:
-- Download From → UniProtKB
-- Taxonomy → Mus musculus (Mouse)
-- Reviewed → UniProtKB
-- Proteome Set → Reference Proteome Set
-- Include Isoform Data → Yes

c) Click “Execute”

After clicking Execute, a second step in the History will appear, labeled generically as “Protein database”.

Next, let’s download the contaminant protein sequence database into the History.

a) Click again on the “Protein Database Downloader” tool

b) From the “Download from” drop-down menu select “cRAP (contaminants)”

c) Click “Execute”


 

Edit Attributes in Galaxy (i.e. renaming History items)

After downloading the cRAP contaminants database, the History will contain three items. Steps two and three will be simply named “Protein Database”. To avoid confusion, let’s use the “Edit Attributes” function in Galaxy to re-name these History steps to something more informative. Next to each History step, you will see a pencil icon (Edit Attributes). When you click on this icon, an editing pane will appear in the Main Viewing Pane. A new name for the corresponding History item can be entered in this editing pane. Clicking on Save will change the name of the History item.

For this analysis, re-name History item 2 to something such as “Mouse Uniprot Database” and History item 3 to “Contaminant Database”.


3.3 Using the Merge FASTA database tool

As a final step, let’s merge all three of our FASTA databases into a single database that we can use for matching peptide sequences to MS/MS data.

a) In the search box under “Tools” type “FASTA Merge File and Filter Unique Sequences”, double-click on the tool, and then click “Add FASTA file”

b) From the drop-down menus select the following parameters:
-- 1: Input FASTA Files → Mouse Uniprot Database
-- 2: Input FASTA Files → Contaminant Database
-- 3: Input FASTA Files → Customized_Splice_isoform_Protein_Database

c) Click “Execute”

A fourth History item will now appear, called “Merged and Filtered FASTA from data 1, data 3, and data 2”. Use the Edit Attributes tool again to name this something more informative, such as “Merged Protein Database”.
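To make the idea behind merging and filtering concrete, here is a minimal sketch of combining several FASTA inputs while keeping only the first entry for each distinct sequence. It is an illustration under stated assumptions, not the Galaxy tool's implementation: the exact uniqueness rule used by the tool is not described in this tutorial, and the function names and example records below are hypothetical.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        yield header, "".join(chunks)


def merge_unique(*fasta_texts):
    """Merge inputs in order, keeping the first entry for each distinct sequence."""
    seen = set()
    merged = []
    for text in fasta_texts:
        for header, sequence in read_fasta(text):
            if sequence not in seen:
                seen.add(sequence)
                merged.append((header, sequence))
    return merged


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"
```

In this sketch the input order decides which header survives when two databases contain the same sequence, so listing the reviewed mouse database first keeps its annotation for any shared entries.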


4 Generating a History II: Sequence Database Searching and Protein Identification

4.1 Using SEARCHGUI for sequence database searching of a Dataset Collection

Now that we have made a merged FASTA sequence database, we will use this database to match MS/MS spectra to peptide sequences via a sequence database search. For this, we will use the sequence database searching program called SEARCHGUI (Proteomics, 2011, 11:996-9). SEARCHGUI bundles several open-source and freely available sequence database searching programs, facilitating analysis of MS/MS data using more than one algorithm and increasing confidence in results. SEARCHGUI has been deployed in Galaxy. Here we will use it to match MS/MS spectra to sequences in our merged database.

To carry out such a search, we will need data files containing MS/MS data. To import these into your History, click on the “Shared Data” dropdown menu, and select “Data Libraries” from this list. Click “Training data” → “ASMS” and select the file ending in “Example_MGF_File_1.mgf”. Click on the “to History” button to import it into your active History. Next, go back to the Data Library and click on the file ending in “Example_MGF_File_2.mgf”. Click on the “to History” button to import it into your active History. You now should have these two files added to your History (Items 5 and 6). You may want to re-name these items something shorter and more informative.

MGF files are “Mascot Generic Format” files, which have been converted from raw mass spectrometry data files. These files contain the peak list information from each MS/MS spectrum recorded in the raw data files, and are compatible with analysis using SEARCHGUI.
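For readers who have not looked inside an MGF file, the sketch below parses the basic layout: each spectrum sits between `BEGIN IONS` and `END IONS`, with `KEY=value` lines (such as `TITLE`, `PEPMASS` and `CHARGE`) followed by peak lines of m/z and intensity. This is a simplified reader for illustration only; it ignores parameters outside spectrum blocks, and the function name `parse_mgf` is hypothetical.

```python
def parse_mgf(text):
    """Return a list of spectra, each a dict with 'params' and 'peaks'."""
    spectra = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line and not line[0].isdigit():
                # Header line inside a spectrum block, e.g. PEPMASS=500.25 1000
                key, value = line.split("=", 1)
                current["params"][key.upper()] = value
            else:
                # Peak line: m/z followed by an optional intensity
                fields = line.split()
                mz = float(fields[0])
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((mz, intensity))
    return spectra
```

Calling `len(parse_mgf(text))` on the contents of one of the example files would give the number of MS/MS spectra it holds, which is a quick sanity check before starting a search.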


Dataset Collections

We have chosen to analyze two MGF files. For their analysis, we will use a Galaxy function called “Dataset Collections” to group these files into a single collection for analysis by SEARCHGUI. The Dataset Collections function is useful when a user needs to analyze multiple files using the same software tool and parameters. Once the files are defined as a collection, the software tool will analyze each file within this collection using the same parameters, eliminating the need to set up separate analysis steps for each file one-by-one (more information on Dataset Collections can be found here: https://wiki.galaxyproject.org/Histories#Dataset_Collections).

To define a Dataset Collection:

a) Click on the check box (Operations on multiple datasets) button in your History.

b) Once selected, a check box will appear beside each item in your History. Check the boxes next to your two MGF files (History steps 5 and 6).

c) Hit the button “For all selected files”.

d) A dropdown menu will appear, where you will select “Build dataset list”.

e) A dialogue window will appear. This shows the files that will be a part of the dataset list. There is also a window for naming this dataset collection. Enter “Collection of MGF files” here and click “Create List”. A new step in your History will now appear (Step #7), which is a Dataset Collection containing the two MGF files.

f) Click on the “Operations on multiple datasets” button again to leave the Dataset Collections view and go back to normal History operations.
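The benefit of a Dataset Collection can be pictured as mapping one tool, with one set of parameters, over every file in a named group. The sketch below is only a conceptual model of that behaviour, not Galaxy code; `run_over_collection` and `count_spectra` are hypothetical helpers, and the collection is represented as a plain dictionary of file name to file contents.

```python
def run_over_collection(tool, collection, **params):
    """Apply one tool with one parameter set to every dataset in a collection."""
    return {name: tool(data, **params) for name, data in collection.items()}


def count_spectra(mgf_text, marker="BEGIN IONS"):
    """Toy 'tool': count spectrum blocks in MGF-formatted text."""
    return sum(1 for line in mgf_text.splitlines() if line.strip() == marker)
```

The result keeps one output per input file, which mirrors how a tool run on a collection in Galaxy produces a matching collection of outputs rather than a single merged result.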


Setting up SEARCHGUI for a sequence database search

Now we will use the SEARCHGUI program to match the MS/MS spectra in these two MGF files to peptide sequences. In the Tool pane search window, type in “searchgui”. Click on the SEARCHGUI tool, and a parameter window will be displayed in the Main Viewing pane. We will walk through a number of these settings in order to utilize SEARCHGUI on these example MGF files.

To set up the SEARCHGUI analysis, follow these steps:

a) In the “Protein Database” window, select “Merged Protein Database” (History Item 4)

b) Select “Yes” for “Create a concatenated target/decoy database before running PeptideShaker” (this must be selected for PeptideShaker to run successfully and estimate a false-discovery rate for peptide-spectrum matches, PSMs)

c) For the gene mappings window, select “no”.


d) For Input Peak lists (mgf), first click the single folder button (Dataset collection). Then select the Dataset Collection of MGF files (Item 7 in History).

e) The “DB-Search Engines” window contains a selection of sequence database searching programs that are available in SEARCHGUI. Any combination of these programs can be used for generating PSMs from MS/MS data. For the purpose of this tutorial, we will select all four available programs.

f) These values can be used for the following windows:
-- Precursor Ion Tolerance Units: Parts per million (ppm)
-- Precursor Ion Tolerance: 10
-- Fragment Tolerance (Daltons): 0.1 (this is high resolution MS/MS data)
-- Enzyme: Trypsin
-- Maximum Missed Cleavages: 2

g) Scroll down the page. For the Fixed Modifications window, three selections should be made in the input window: Carbamidomethylation of C, iTRAQ 4-plex of K, and iTRAQ 4-plex of peptide N-term. Typing the first few letters of each entry in the window will bring up each selection.

h) For Variable Modifications, select the following in the input window: Oxidation of M, and iTRAQ 4-plex of Y (a modification that sometimes occurs with iTRAQ).


   

i) For the remaining parameters, we will use the default values as they appear in the tool parameters.

j) Hit “Execute” to run SEARCHGUI.

A new History item (#8) will now appear, colored yellow to indicate that SEARCHGUI is running. The History item will turn green when the analysis is complete.

Once the database search is completed, the SEARCHGUI tool will output a file (called a SEARCHGUI archive file) that will serve as an input for the next section.
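The tolerance values entered in step f) above can be made concrete with a little arithmetic: a relative tolerance in ppm scales with the mass being matched, while a tolerance in Daltons is a fixed width. The sketch below shows both checks under that general definition; how each search engine applies its tolerances internally is not covered by this tutorial, and the function names are hypothetical.

```python
def ppm_window(mz, tolerance_ppm):
    """Return the (low, high) bounds around mz for a tolerance given in ppm."""
    delta = mz * tolerance_ppm / 1_000_000
    return mz - delta, mz + delta


def within_ppm(observed, theoretical, tolerance_ppm):
    """True if observed lies within tolerance_ppm of the theoretical value."""
    return abs(observed - theoretical) <= theoretical * tolerance_ppm / 1_000_000


def within_da(observed, theoretical, tolerance_da):
    """True if observed lies within a fixed tolerance in Daltons."""
    return abs(observed - theoretical) <= tolerance_da
```

At 10 ppm, a value of 1000 is matched within roughly plus or minus 0.01, whereas the 0.1 Da fragment tolerance stays the same width regardless of the fragment's mass.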

4.2 Using PeptideShaker for identifying peptides and proteins

PeptideShaker (Nat Biotechnol., 2015, 33:22-4) is a companion tool that works with output from SEARCHGUI. It serves to organize the PSMs output by SEARCHGUI and contained in the SEARCHGUI archive, providing an assessment of confidence in the data, inferring protein identities from the matched peptide sequences, and producing outputs that can be visualized by users to interpret results. PeptideShaker has been wrapped in Galaxy to work in combination with SEARCHGUI outputs.

To use PeptideShaker to organize the results of our SEARCHGUI analysis, again go to the Tool window in the Tool pane and type in “PeptideShaker”. Click on PeptideShaker in the tool menu.
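The SEARCHGUI setup asked for a concatenated target/decoy database so that a false-discovery rate can be estimated. The basic idea can be sketched as follows: decoy entries are artificial sequences (here, reversed targets), and the number of decoy matches above a score cutoff approximates the number of false target matches. This is a simplified illustration of the target-decoy concept only; it is not PeptideShaker's actual statistical procedure, and the `_REVERSED` tag, the score scale, and the function names are assumptions.

```python
def reverse_decoy(header, sequence, tag="_REVERSED"):
    """Build a decoy entry by reversing a target sequence (tag is an assumption)."""
    return header + tag, sequence[::-1]


def fdr_at_threshold(psms, threshold):
    """Estimate FDR as decoy hits / target hits among PSMs scoring >= threshold.

    psms is a list of (score, is_decoy) pairs; higher scores are better.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0
```

Raising the threshold usually lowers the estimated FDR at the cost of fewer accepted PSMs, which is the trade-off behind any confidence cutoff applied to search results.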


Follow these steps to run PeptideShaker:

a) In the “Compressed SearchGUI results” field select item #8 in your History (this is the SEARCHGUI archive file).

b) For the species type, select “No species restriction”.

c) For both the “Specify Advanced PeptideShaker Processing Options” and “Specify Advanced Filtering Options” fields select “Default” options.

d) The “Output Options” window shows the many options available for outputs from PeptideShaker. For this example, let’s select the following options for outputs:

-- mzIdentML File (a community standard for reporting sequence database search results)

-- PSM Report (all information about PSMs from SEARCHGUI, tabular text format)

-- Peptide Report (all information on peptide sequences identified from PSMs, tabular text format)

-- Protein Report (all information on inferred proteins from identified peptides, tabular text format)

-- Certificate of Analysis (a text file with information on parameters used in the PeptideShaker analysis and a summary of results)

-- Hierarchical Report (an expanded output with information on proteins and peptides identified, tabular text format)


e) Hit “Execute”

A number of new items will appear in your History, each corresponding to one of the outputs selected in the PeptideShaker parameters. These will be colored yellow while the job is running and will turn green when it completes.
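Several of the outputs selected above are tabular text files, which are easy to inspect outside Galaxy once downloaded. The sketch below reads a tab-separated report with a header row and lists the distinct values in one column. The column names used in the example (`Sequence`, `Confidence`) are hypothetical placeholders; check the header row of your own report for the actual names.

```python
import csv
import io


def read_report(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def unique_values(rows, column):
    """Return the sorted distinct non-empty values found in one column."""
    return sorted({row[column] for row in rows if row.get(column)})
```

For a downloaded file, the text can be obtained with `open(path, encoding="utf-8").read()` before passing it to `read_report`.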

4.3 Galaxy functions: Viewing and downloading results, editing History items, re-running analysis steps in a History

Now that we have successfully built a History to conduct sequence database searching and generate outputs of identified proteins and peptides, let’s take a look at some functions within the Galaxy framework that users may find highly useful. This is not a comprehensive listing of Galaxy functions, but covers some of those likely to be of the highest practical value.

i) Viewing and downloading results, re-running analyses. The results contained in a History item can be viewed by clicking on the name of the item. For example, let’s click on Item #9 in our History, the mzIdentML output file. When this name is clicked, the History item expands to show the format of the file and other information. A number of additional buttons are also revealed for viewing information on the file.

Let’s focus on some useful functions within this expanded view.

a) “The eye”. Clicking on “the eye” (View data) provides a view of the formatted file contents in the Main Viewing pane, for compatible, non-binary file types. Binary file types are automatically downloaded when clicking on the View data button.

b) In the expanded view a Download button with a hard disk icon is available. Clicking this button will automatically download the file to the Download folder on the local hard drive.

c) A button containing the letter “i” (View details) is also revealed. Clicking the View details button will bring up a summary of information about this file, such as format, size, date created, and Galaxy tools used in its generation, as well as an inheritance chain for files that were copied from other Histories.

d) A very valuable function is the re-run or “Run this job again” button containing the circular, two-arrow icon. Clicking on this button will bring up the tool parameters used for the initial analysis in the Main Viewing Pane. The tool can be executed again using these same parameters, or the parameters can be changed and the analysis re-run. A new History item will be produced with the output. This is an efficient way to test outcomes using altered parameters for a Galaxy tool.


4.4 Extracting a Workflow from a History

Finally, let’s learn about a valuable function in Galaxy: extracting the workflow from a completed History. Workflows differ from Histories in that they are a series of defined tools or actions, but lack the input and output data (https://wiki.galaxyproject.org/Learn/AdvancedWorkflow). Histories contain not only all the tools and actions, but also all the input and output data. Workflows can be easily extracted from a completed History. Click on the wheel icon (History options) at the top of the History pane, and select “Extract workflow” from the drop-down menu.

A workflow window will open in the Main Viewing Pane. Here, the name of the extracted workflow can be specified, and the tools included in the workflow can be selected. Clicking on “Create workflow” will create and store the specified workflow. The extracted workflow can be accessed by clicking on the “Workflow” tab in the Main Viewing Pane, and will be listed under “Your workflows”. By clicking on any workflow in this list, you can choose to run the workflow, edit the workflow, or share the workflow with other Galaxy users.

   

   

 

         


5. PeptideShaker Outputs

This session of the workshop will take you through the processing of search results (PSM Report) generated by SearchGUI / PeptideShaker analysis. It includes a blueprint workflow for a) generating a PSM summary of peptides derived from the RNA-Seq-derived database; b) converting the peptide list into FASTA format (as an input for BLAST-P analysis); and c) BLAST-P searches and filtering.

Outline  of  tutorial    

   

Reference materials
Salivary proteogenomics workflow manuscript: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4261978/
Hibernation proteogenomics manuscripts: http://www.ncbi.nlm.nih.gov/pubmed/26435507 and http://www.ncbi.nlm.nih.gov/pubmed/26903422
Multi-omics overview: http://www.nature.com/nbt/journal/v33/n2/full/nbt.3134.html


Protein Report
- valid proteins
- coverage
- molecular weight

Peptide Report
- valid peptides
- potential novel proteoforms based on accession numbers
- sequences
- modifications and localization score
- confidence

Spectrum (PSM) Report
- valid spectra
- potential novel proteoforms based on accession numbers
- sequences
- modifications and localization score
- confidence
- m/z, charge state, Δm/z

Summary (Parameters)
- valid peptides
- valid proteins
- valid spectra

Archive (zipped file)
- CPS file to visualize data

mzIdentML
- PSM Visualization
- SWATH Analysis
- Skyline
- Scaffold

 

5.1      PSM  Report  (PeptideShaker  Output)  


PSM Report
c1 (Column 1): Rank of protein group
c2: Protein(s): Accession numbers of protein groups
c3: Sequence: Amino acid sequence of the identified peptide
c4: Variable Modifications
c5: Fixed Modifications
c6: Spectrum File: Input MGF file of the identified PSM
c7: Spectrum Title: Fraction number, scan number and charge state
c8: Spectrum Scan Number
c9: Retention Time
c10: m/z: Mass to charge ratio
c11: Measured Charge
c12: Identification Charge
c13: Theoretical Mass: Calculated from identified peptide sequence
c14: Isotope Number
c15: Precursor m/z Error [ppm]
c16: Localization Confidence
c17: Probabilistic PTM score
c18: D-score
c19: Confidence
c20: Validation: PSMs with Confidence > 85 and delta ppm within 6 ppm are CONFIDENT PSMs

The PSM Report contains information about the peptide-spectral matching of all spectra within the dataset. The report includes the Sequence of the peptide (c3), the Spectrum scan information (c7) and its associated Confidence score (c19).
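To make the column layout concrete, here is a minimal Python sketch that labels the 20 columns of one tab-separated PSM Report row and applies the rule stated for c20. The short column keys, the helper names and the sample row are illustrative assumptions and are not PeptideShaker output; "within 6 ppm" is read here as an absolute error of at most 6 ppm.

```python
# Short keys for columns c1..c20, in the order listed above (the names are our own).
PSM_COLUMNS = [
    "rank", "proteins", "sequence", "variable_mods", "fixed_mods",
    "spectrum_file", "spectrum_title", "scan_number", "retention_time", "mz",
    "measured_charge", "identification_charge", "theoretical_mass",
    "isotope_number", "precursor_mz_error_ppm", "localization_confidence",
    "ptm_score", "d_score", "confidence", "validation",
]


def parse_psm_line(line):
    """Split one tab-separated PSM Report row into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PSM_COLUMNS):
        raise ValueError(f"expected {len(PSM_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(PSM_COLUMNS, fields))


def is_confident(psm, min_confidence=85.0, max_ppm=6.0):
    """Rule stated for c20: Confidence > 85 and precursor error within 6 ppm."""
    confidence = float(psm["confidence"])
    ppm_error = abs(float(psm["precursor_mz_error_ppm"]))
    return confidence > min_confidence and ppm_error <= max_ppm


# Made-up example row (20 tab-separated fields), for illustration only.
example_row = "\t".join([
    "1", "preB_0001", "PEPTIDEK", "", "", "fraction1.mgf",
    "fraction1.100.100.2", "100", "1200.5", "465.2",
    "2+", "2+", "928.4", "0", "1.8", "", "", "", "97.3", "Confident",
])
psm = parse_psm_line(example_row)
print(psm["sequence"], is_confident(psm))
```

Naming the columns once makes the later workflow steps, which refer to columns only by number (c3, c7, c19), easier to follow.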

 

   


 

5.2 Current history

Your current history should hold the output generated from searching 9 RAW files, instead of the two that were used in Session 1. It will contain a) Parameters, b) PSM Report and c) Protein Report outputs from the PeptideShaker analysis. You will need to import the tutorial datasets into your current history.

   We  will  be  processing  the  PSM  Report  and  using  its  outputs.    

5.3    Import  tutorial  datasets  into  current  history      

Tutorial Dataset: At the top, click on “Shared Data” and then “Histories”.

 


 Select  “History  3”  from  the  list  of  histories.  

   Import  the  History  into  your  account  

   

 

6 Running workflow for this session

6.1 Inputs for the session 2 workflow

For Session 2, the only input needed is the PSM Report. Read 2.3 to get the right inputs for this workflow.

6.2 Workflow for the session

Select History 3 as your active history.


Select “Shared Data” and click on “Workflows”.

Select the “ASMS 2016:…” workflow and import it into your account.

 


     

   

 

6.3              Workflow  functions      

There are various options for using the workflow. We will be using Edit and Run later in the workshop. Here is a short description of some of the functions:
1) Share or Publish: A user can share a link with another user (who has an account on the same server). Workflows can also be published for all users to view / use.
2) Download or Export: This feature lets you transfer workflows between two Galaxy instances. A user can download the workflow as a .ga file that preserves the names, parameters and sequence of the tools used in the workflow. One can also obtain a hyperlink that can be used to download the workflow. This function can also be used to store your workflows in myExperiment with your login and password. Lastly, this feature can be used to generate a workflow image for presentation or publication.
3) Copy: This function copies a workflow so that a modified version can be generated from a master copy. The modified workflow might have alternative parameters, tools or sequence of tools.
4) Rename: This function changes the name of the workflow.
5) View: This offers a linear overview of the workflow with its parameters, along with annotations, if any.
6) Delete: You can also delete older versions of a workflow. Use this with caution! It might result in deleting hours of your work!

Edit: This is a powerful function that provides an overview of the workflow. We can change parameters, change the names of outputs, and edit tools (add or remove) using the Edit mode. Click on Edit to open this option.

You can explore various options in Edit mode, including renaming inputs, changing parameters, adding or removing tools, etc. Please ensure that you DO NOT SAVE if you plan to use the current version of the workflow. If, however, you would like to retain the changes that you have made, please do not forget to SAVE the workflow.
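As a rough illustration of what a downloaded .ga file holds (tool names, parameters and step order, but no data), the Python sketch below parses a small JSON document and lists its steps in order. It assumes that a .ga file is JSON with a top-level "steps" mapping whose entries carry "name" and "tool_id" fields. The embedded example is hypothetical and far smaller than a real export, and the tool ids in it are made up.

```python
import json

# Hypothetical, heavily trimmed stand-in for an exported workflow file.
EXAMPLE_GA = """
{
  "a_galaxy_workflow": "true",
  "name": "Example PSM Report workflow",
  "steps": {
    "0": {"id": 0, "name": "Input dataset", "tool_id": null},
    "1": {"id": 1, "name": "Remove beginning", "tool_id": "example_remove_beginning"},
    "2": {"id": 2, "name": "Sort", "tool_id": "example_sort"}
  }
}
"""


def list_workflow_steps(ga_text):
    """Return (step number, step name, tool id) tuples in execution order."""
    workflow = json.loads(ga_text)
    steps = workflow.get("steps", {})
    # Step keys are strings, so sort them numerically rather than alphabetically.
    ordered = sorted(steps.items(), key=lambda item: int(item[0]))
    return [(int(key), step.get("name"), step.get("tool_id")) for key, step in ordered]


for number, name, tool_id in list_workflow_steps(EXAMPLE_GA):
    print(number, name, tool_id or "(input)")
```

To inspect a real export, the same function could be given the text of the downloaded file instead of the embedded example.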


[Figure: the workflow in three stages: generating a PSM summary of peptides derived from RNA-Seq derived db; converting peptide list into a FASTA format; BLAST-P searches and filtering]

The workflow for this session is a multi-step workflow. Its 33 processing steps take the PSM Report from PeptideShaker and manipulate it so that it can be put through BLAST-P analysis to verify novel proteoforms.

         

Overview of Workflow
- Step 1: Input dataset (PSM Report)
- Steps 2-8: Selects peptides with accession numbers from the RNASeq-derived protein FASTA file. (See Section 6.7 below for details)
- Step 9: PSM Report of peptides identified from RNASeq-derived proteins.
- Steps 10-18: Conversion of peptide list into a FASTA format
- Step 19: Short BLAST-P on NCBI remote nr mouse database
- Step 20: BLAST-P on NCBI remote nr mouse database
- Steps 21-28: Identifies mismatched peptides.
- Step 29: Peptides corresponding to novel proteoforms.
- Steps 30-32: Conversion to PSM Report of peptides corresponding to novel proteoforms.
- Step 33: PSM Report of peptides corresponding to novel proteoforms.

You can run a workflow through the EDIT interface or through the workflows interface. Let us use the Run function in the workflows section to run the workflow. Please remember that the Input History must be your active history when you run the workflow.

6.4              Running  the  Workflow        

Select ‘PSM Report’ for Step 1.


   Run Workflow.  

     


If your workflow ran successfully, we will use the history to go through the steps. If not, download ‘History 4’ from the Data Library (then import it into Saved Histories).


6.5        Switching  to  a  completed  history    

   Unhide  Hidden  Datasets  from  your  completed  workflow  OR  from  the  “END History for Session 4” that you have as your current history.  Once “unhidden” you should see 33 datasets within your history.  

To view other steps in detail, search for specific tools using the left panel, or click on any of the eye icons. Go through all steps.


6.6        Quick  overview  of  history  options          

       

Once you click on the wheel icon in the top right corner, you will see multiple options grouped as HISTORY LISTS, HISTORY ACTIONS, DATASET ACTIONS, DOWNLOADS and OTHER ACTIONS. Here is a brief overview of each of the options:

HISTORY LISTS
• Saved Histories: Opens all of the user's histories in the main viewer pane / central pane.
• Histories Shared with Me: Gives access to histories that other users with an account on the same server have shared specifically with you.

HISTORY ACTIONS
• Create New: Generates a new history.
• Copy History: Copies all datasets in a history.
• Share or Publish: Allows a user to share the history via a link, or to share it with another user by typing in that user's login email.
• Show structure: Shows details of the workflow parameters that were used in each step.
• Extract Workflow: Extracts a workflow for subsequent analysis on similar datasets / replicates. (See section 4.4 for more information)
• Delete: If you want to delete the history. (Use with caution!)
• Permanently delete: If you really hate the history and want to permanently delete it. (Use with extreme caution!)

DATASET ACTIONS
• Copy Datasets: Copies selected datasets from one history to another.
• Dataset Security: Sets permissions and roles that allow various users to access or edit the history.
• Resume paused jobs: Resumes jobs that have been paused.
• Collapse Expanded Datasets: Collapses all expanded datasets.
• Unhide Hidden Datasets: Unhides all the hidden datasets from a workflow. (See section 6.5 for details)
• Delete Hidden Datasets: Deletes datasets that have been hidden.
• Purge Deleted Datasets: Purges deleted datasets.

                                   


6.7        Generating  a  PSM  summary  of  peptides  derived  from  RNA-­‐Seq  derived  db      

       

Step 8: Selects peptides with accession numbers starting with preB or proB.
Step 9: PSM Report of potential novel peptides. Click on the eye icon to view details of the PSM Report in the main panel.

6.8      Converting  peptide  list  into  a  FASTA  format  (as  an  input  for  BLAST-­‐P  analysis)    

Let us focus on steps 2 to 8
Step 2: Removes the beginning line of the PSM Report. (Now we are without headers and will need to use column numbers in place of headers!)
Note: To view details of a step, click on the step number and then click on the ‘rerun’ icon. DO NOT hit rerun though!
Step 3: Sorts the PSM Report by Spectrum Title (column 7) in ascending order and by Confidence (column 19) in descending order. This ensures that the highest ranking PSM for each spectrum title is at the top.
Step 4: Ranks the rows based on the new sorting performed in Step 3.
Step 5: Group: helps in selecting only one PSM per Spectrum Title.
Step 6 and 7: Join and Cut: generate a PSM Report with one PSM per spectrum title.
Step 8: Selects peptides with accession numbers starting with preB or proB. (Details in figure below)
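The logic of steps 2 to 8 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Galaxy tools themselves: rows are tab-separated strings, Spectrum Title is column 7, Confidence is column 19, the accession is read from the start of column 2, and the sample rows and helper names are made up.

```python
def best_psm_per_spectrum(lines):
    """Steps 2-7: drop the header, sort, and keep the top PSM per Spectrum Title."""
    rows = [line.rstrip("\n").split("\t") for line in lines[1:]]  # Step 2: no header
    # Step 3: Spectrum Title (c7) ascending, Confidence (c19) descending.
    rows.sort(key=lambda r: (r[6], -float(r[18])))
    best = {}
    for row in rows:
        # Steps 4-7: the first row seen for a title is its highest-confidence PSM.
        best.setdefault(row[6], row)
    return list(best.values())


def select_rnaseq_derived(rows, prefixes=("preB", "proB")):
    """Step 8: keep PSMs whose accession (c2) starts with preB or proB."""
    return [row for row in rows if row[1].startswith(prefixes)]


def make_row(accession, sequence, title, confidence):
    """Build a made-up 20-column row; unused columns are left empty."""
    row = [""] * 20
    row[1], row[2], row[6], row[18] = accession, sequence, title, str(confidence)
    return "\t".join(row)


report = [
    "\t".join(f"c{i}" for i in range(1, 21)),  # header line, removed in Step 2
    make_row("preB_0001", "PEPTIDEK", "scan_2", 99.0),
    make_row("sp|P12345", "LESSGOODK", "scan_2", 60.0),
    make_row("sp|Q99999", "KNOWNPEPR", "scan_1", 95.0),
]
top = best_psm_per_spectrum(report)
novel_candidates = select_rnaseq_derived(top)
print([row[2] for row in novel_candidates])
```

The workflow reaches the same result with separate Sort, Rank, Group, Join and Cut tools; the dictionary here simply stands in for that chain.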


Let us focus on steps 10 to 18
Step 10: Generates a peptide list along with ranking number by cutting column 3 (c3) and column 21 (c21) from the Step 9 tabular format file.
Step 11: Generates a peptide list by cutting column 1 (c1) from the Step 10 tabular format file.
Step 12: Generates a FASTA format from Step 11 in the following format:
>PEPTIDE
PEPTIDE
Step 13: Computes sequence length on data 12.
Step 14: Generates a FASTA format from data 13:
>PEPTIDE_length
PEPTIDE
Step 15: Filters sequences from length 8 to 30 aas from the list of sequences.
Step 16: Filters sequences from length 31 to 50 aas from the list of sequences.
Step 17: Converts Step 15 output to a format so that it can be searched by short BLAST-P search:
>PEPTIDE_sequence length=length aa
PEPTIDE
Step 18: Converts Step 16 output to a format so that it can be searched by short BLAST-P search:
>PEPTIDE_sequence length=length aa
PEPTIDE
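As a hedged sketch of steps 12 to 16, the Python below turns a peptide list into FASTA records with a PEPTIDE_length header and splits them into the 8 to 30 aa and 31 to 50 aa sets. The function names and the sample peptides are assumptions for illustration; the final header rewrite of steps 17 and 18 is not reproduced.

```python
def to_fasta_records(peptides):
    """Steps 12-14: one (header, sequence) pair per peptide, header = PEPTIDE_length."""
    return [(f"{pep}_{len(pep)}", pep) for pep in peptides]


def filter_by_length(records, min_len, max_len):
    """Steps 15-16: keep sequences whose length lies in [min_len, max_len]."""
    return [(header, seq) for header, seq in records if min_len <= len(seq) <= max_len]


def format_fasta(records):
    """Render records as FASTA text: a '>' header line followed by the sequence."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


# Made-up peptides: 8 aas, 35 aas, and 5 aas (too short for either set).
peptides = ["PEPTIDEK", "ACDEFGHIKLMNPQRSTVWYACDEFGHIKLMNPQR", "SHORT"]
records = to_fasta_records(peptides)
short_set = filter_by_length(records, 8, 30)   # Step 15: 8 to 30 aas
long_set = filter_by_length(records, 31, 50)   # Step 16: 31 to 50 aas
print(format_fasta(short_set), end="")
```

Splitting by length matters because the two sets are searched with different BLAST-P parameter settings later in the workflow.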

 

               

 


   Step  17  and  18:  Converts  Step  15  and  16  output  to  a  format  so  that  it  can  be  searched  by  using  short  BLAST-­‐P  search.    

6.9    BLAST-­‐P  searches  and  filtering                                                    

 


BLAST-P search and output processing are carried out in steps 19 to 33.
Step 19: This step performs a short BLAST-P search on peptide FASTA sequences and generates an XML output. The short BLAST-P uses parameters for short peptide sequences (8-30 aas). Please use the rerun option to look at the parameters used.
Step 20: This step performs a BLAST-P search on peptide FASTA sequences with 31 aas or longer and generates an XML output. The BLAST-P uses parameters for long peptide sequences (31-50 aas). Please use the rerun option to look at the parameters used.
Step 21 and 23: Convert the BLAST XML output into a tabular output with various metrics such as a) ID of your sequence (c1); b) Percentage of identical matches (c3); c) Total number of gaps (c17); d) Alignment length (c4); and e) Query length (c23).
Step 22 and 24: Query sequences with no hits for data 19 and 20 respectively.
Step 25: Calculates the percentage of alignment length versus actual query length and adds it as column 25.
Step 26: Selects peptides for which the Percentage of identical matches (c3) is less than 100 OR the Total number of gaps (c17) is at least one OR the percentage of alignment length versus actual query length is less than 100.
Steps 27-29: Generate a list of peptides corresponding to novel proteoforms.
Steps 30-33: Generate a PSM Report of peptides corresponding to novel proteoforms.
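The filter described for steps 25 and 26 can be written out as a short Python sketch. The dictionary keys mirror the tabular columns named above (c1, c3, c4, c17, c23) but are our own labels, and the hits are made-up values. The sketch tests one hit at a time and does not attempt the list building of steps 27 to 29.

```python
def alignment_coverage(alignment_length, query_length):
    """Step 25: alignment length as a percentage of the query length."""
    return 100.0 * alignment_length / query_length


def is_mismatched(hit):
    """Step 26: identity < 100 OR at least one gap OR coverage < 100."""
    coverage = alignment_coverage(hit["alignment_length"], hit["query_length"])
    return hit["pident"] < 100 or hit["gaps"] >= 1 or coverage < 100


# Made-up hits for illustration only.
hits = [
    # Exact, full-length, ungapped match: the peptide is already known.
    {"query_id": "PEPTIDEK_8", "pident": 100.0, "alignment_length": 8,
     "gaps": 0, "query_length": 8},
    # One substituted residue: identity drops below 100.
    {"query_id": "PEPTLDEK_8", "pident": 87.5, "alignment_length": 8,
     "gaps": 0, "query_length": 8},
    # Identical over the aligned part, but only 7 of 9 residues align.
    {"query_id": "ACDEFGHIK_9", "pident": 100.0, "alignment_length": 7,
     "gaps": 0, "query_length": 9},
]
mismatched = [hit["query_id"] for hit in hits if is_mismatched(hit)]
print(mismatched)
```

The coverage test catches the third case, where a peptide looks 100% identical only because BLAST aligned part of it.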

 

BLAST-­‐P  SEARCH  

BLAST (Basic Local Alignment Search Tool) is a web-based tool used to compare biological sequences. BLAST-P matches protein sequences against a protein database. More specifically, it looks at the amino acid sequence of proteins and can detect and evaluate the differences between, say, an experimentally derived sequence and all known amino acid sequences in a database. It can then find the most similar sequences, allowing the identification of known proteins or of potential peptides associated with novel proteoforms.


6.10 PSM Evaluation and Genome Visualization

Once peptides corresponding to novel proteoforms are identified, they are subjected to Peptide-Spectral Match (PSM) evaluation. This involves PSM Visualization, which reveals whether a reported high-scoring spectrum is in fact a result of several unmatched ions. Validation of PSMs is often considered the final step before reporting protein identifications. Good quality PSMs are subsequently placed on the genome. The localization of each peptide can reveal intriguing genomic architecture. In essence, proteogenomics involves the mapping of an experimental proteome to an established genome. Clustering of proteoforms in a particular genomic region may implicate a point of interest for further research. For an excellent review on proteogenomics, read the review by Nesvizhskii et al (2014).

 

What are Proteoforms?
Due to the genomic complexity and redundancy of proteins and the associated post-translational modifications that can occur during or after their expression, there can be a number of proteoforms associated with a protein. A proteoform is the product that results from a protein’s specific genetic code and all the modifications molding it (e.g. post-translational modifications) or its transcription (e.g. alternatively spliced RNA and allelic variations). For more information about proteoforms, please read the manuscript by Smith and Kelleher (2013).

Why are they so important?
Proteoforms contribute to biological diversity. Because of chemical differences, proteoforms differ not only in structure, but in function as well. This leads to several different process modulations that affect cells differently, contributing to variation between and within individuals.

Identifying Peptides Corresponding to Novel Proteoforms
Proteoforms retain a lot of similarity with one another, which can make it hard to distinguish them from one another. Since the advent of proteomics, peptides corresponding to novel proteoforms are continually being identified after verification through BLAST analysis. Once validated, these proteoforms help in a more complete annotation of the genome and also in identifying a role for such novel biomarkers in disease and physiological states such as cancer.


7 Instructions for accessing the ASMS Galaxy-P Docker Container

Galaxy is now available in Docker containers. Docker containers are an easy way to package software for installation on other systems. The Docker Toolbox now includes Kitematic, a user interface for running Docker containers on Windows and Mac OS X systems. Kitematic makes it easy to run any published Docker container on these systems.

To try a pre-configured Galaxy instance on your Mac OS X or Windows machine, follow these steps:
1. Install the Docker Toolbox on your computer (note that you may need to enable Virtualization Technology for Docker to run. To do this on Windows, see: http://www.howtogeek.com/213795/how-to-enable-intel-vt-x-in-your-computers-bios-or-uefi-firmware/)
2. Once the Docker Toolbox is installed, launch Kitematic (the interface for downloading and running Docker containers).
3. Search for "asmsgalaxyp". This searches Docker Hub, a repository for Docker containers. Hit the “Create” button in the Docker container. Kitematic will download the container and install it.
4. Once the instance has started (it may take a few minutes to load), click anywhere on the web preview pane (upper right of page), and you have a running Galaxy instance!

 


8 Presenters  and  acknowledgements      

Presenters:  

The main presenters are members of the Galaxy-P research team at the University of Minnesota, working on an ongoing project developing Galaxy for multi-omic applications (National Science Foundation Grant 1458524). We have in-depth experience in Galaxy and its use for multi-omics data analysis (z.umn.edu/galaxypreferences).

Speakers in our session include:  

● Tim Griffin, Professor, and Faculty Director, CMSP, University of Minnesota. Dr. Griffin is the Principal Investigator on the project developing Galaxy for multi-omics.  

● Pratik Jagtap, Assistant Professor, and Managing Director, CMSP, University of Minnesota. ABRF Member. Member of Protein Research Group (PRG).  

 Thanks to  

 

   

 

Also  thanks  to  Amazon  Web  Services  Education  Research  Grant.