IMPLEMENTING A WEB-BASED INTRODUCTORY BIOINFORMATICS
COURSE FOR NON-BIOINFORMATICIANS THAT INCORPORATES
PRACTICAL EXERCISES
Antony T. Vincent1, Yves Bourbonnais1, Jean-Simon Brouard1, Hélène Deveau1, Arnaud Droit2,
Stéphane M. Gagné1, Michel Guertin1, Claude Lemieux1, Louis Rathier3, Steve J. Charette1 and
Patrick Lagüe1*
1Département de Biochimie, de Microbiologie et de Bio-informatique, Faculté des sciences et de génie,
Université Laval, Québec (Québec), Canada
2Centre Hospitalier de l’Université Laval, Faculté de Médecine, Université Laval, Québec (Québec),
Canada
3Équipe de soutien informatique, Faculté des sciences et de génie, Université Laval, Québec (Québec),
Canada
Correspondence:
Corresponding Author
SUPPLEMENTARY MATERIAL
TABLE OF CONTENTS
Table S1. Topics, tools and subjects covered in the BIF-1901 course
Brief descriptions of the assignments
BIF-1901 Syllabus (translated English version)
BIF-1901 Plan de cours (original French version)
Table S1. Topics, tools and subjects covered in the BIF-1901 course.
| Topic | Tools | Subjects covered |
|---|---|---|
| Informatics | | |
| GNU/Linux | NoMachine, FileZilla | Bash command line; Remote access to a server |
| Biological sequences | | |
| The databases | GenBank, PDB, UCSC, ExPASy | Creation of databases; Type of data; Specific vs. general databases; Cross information between databases; Ontology |
| Sequencing and assembly | Greedy algorithm: SSAKE, VCAKE, SHARCGS; Graph algorithm: Velvet, ABySS, Ray; Tablet | Sanger sequencing; Human genome project; Next-generation sequencing; Sequence assembly (with and without a reference); Quality score (PHRED) |
| Sequence alignments | LALIGN, BLAST, Muscle, Jalview | Homology; Global and local alignments; Substitution matrices and gap score; Multiple alignment |
| Databases of model organisms | Saccharomyces Genome Database, Dictybase, Pseudomonas Genome database, Mouse Genome Informatics, SignalP, Phobius, TMHMM, NetPhos | Using online resources on model organisms; Gene ontology (GO); Determining motifs and domains of a protein; Predict the cellular localization of a protein; Predict the post-translational modifications of a protein |
| Phylogenetics | Muscle, Jalview, Gblocks, ClustalX, PhyML, NJplot | Phylogenetic vocabulary; Rooted and unrooted trees; Multiple alignment; Filtration; Phylogenetic models; Phylogenetic methods |
| Structural bioinformatics | | |
| 3D structure of protein | PyMOL | Understand the structural determinants of a protein; Visualizing the 3D structure of a protein; Generate a publishing grade quality image |
| Protein structure prediction | chofas, SCRATCH, Jpred 3, Modeller, PyMOL | Crystallography; Nuclear magnetic resonance; Secondary prediction; 3D structure prediction |
| Molecular modeling and docking | Poseview, PyMOL, Autodock Tools, Autodock Vina, Molinspiration | Energy minimization; Conformational search; Molecular dynamics; Molecular docking; Understanding the notion of force field |
BRIEF DESCRIPTIONS OF THE ASSIGNMENTS
Assignment 1 (Module 1: Introduction to Bioinformatics, Software and Linux)
The main goals of the first assignment are to (1) assess the students’ knowledge of basic Bash
commands, and (2) ensure that they all have the computer skills and materials needed to complete
the course. Students have to use a bioinformatics server running the GNU/Linux operating system
through the NoMachine software installed on their computer. Before this assignment became part of
the course, some students waited until shortly before a homework deadline to access the server for
the first time. Some of these students then ran into technical issues that dramatically slowed their
learning and their ability to complete the assignment within the required timeframe.
To complete this assignment, students should connect to the server, open a terminal window and take
screen captures of their desktop showing the commands and the results of the following BASH
command lines introduced in a clip on the course’s portal:
1) Go to the “Desktop” folder, and use the command “pwd” to print the folder’s path.
2) Create a folder with the name based on your login name, and print the list of the files of the
“Desktop” folder.
3) Create an empty file named “homework1.txt” in the “Desktop” folder, and print the content of
the “Desktop” folder.
4) Copy the file “homework1.txt” from the “Desktop” folder to the folder based on your login
name, and print the content of this folder.
5) Remove the file “homework1.txt” from the “Desktop” folder and print the content of the
“Desktop” folder.
6) Go to the folder based on your login name and rename the file “homework1.txt” to
“loginname-homework1.txt”, where loginname is your login name.
7) Print all the running processes on the server (students have to find this command by themselves
as it is not taught in the course).
8) Print the history of the commands used to complete the assignment up to this step.
9) Print a human-readable file listing of the folder based on your login name.
10) Go back to the “Desktop” folder and create an archive (a “tar” file) of the folder based on your
login name and all the screen captures created in the previous steps. Deposit this archive file in
the appropriate section of the course’s portal.
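The assignment itself is carried out with Bash commands on the course server. For readers who prefer to see the same file operations scripted, the following is a minimal Python sketch of steps 2 to 6 and 10. The function name `run_assignment_steps`, the login `jdoe`, and the archive name are hypothetical, and the sketch works in a temporary directory rather than a real “Desktop” folder; it does not include the screen captures in the archive.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def run_assignment_steps(desktop: Path, login: str) -> Path:
    """Mirror the file operations of steps 2-6 and 10 inside `desktop`."""
    user_dir = desktop / login
    user_dir.mkdir()                                 # step 2: folder named after the login

    homework = desktop / "homework1.txt"
    homework.touch()                                 # step 3: create an empty file

    shutil.copy(homework, user_dir / homework.name)  # step 4: copy into the login folder
    homework.unlink()                                # step 5: remove the copy left on the Desktop

    renamed = user_dir / f"{login}-homework1.txt"
    (user_dir / "homework1.txt").rename(renamed)     # step 6: rename the copied file

    archive = desktop / f"{login}.tar"               # hypothetical archive name
    with tarfile.open(archive, "w") as tar:          # step 10: uncompressed tar archive
        tar.add(user_dir, arcname=login)
    return archive


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive_path = run_assignment_steps(Path(tmp), "jdoe")
        with tarfile.open(archive_path) as tar:
            print(tar.getnames())
```

Steps 1 and 7 to 9 (printing the working directory, the running processes, the command history and a file listing) are interactive shell tasks and are left out of this sketch.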
Assignment 2 (Module 4: Sequence Alignment and Database Search (1st part) and Module 5:
Sequence Alignment and Database Search (2nd part))
This assignment is a practical introduction to protein dataset retrieval and sequence alignment. To
complete this assignment, the students have to:
1) Choose a dataset of curated sequences from the NCBI Protein Clusters database
(https://www.ncbi.nlm.nih.gov/proteinclusters). The dataset should contain between 15 and 30
sequences, each between 300 and 600 amino acids in length. Because this dataset will also be
used for the third assignment on phylogenetic analysis and trees, the sequences should come
from taxonomically diverse organisms (several genera of bacteria, for example).
2) Create a FASTA file with the dataset of the previous step, and manually edit this file to add a
short description in the title of each sequence included in the file.
3) Transfer the FASTA file to the bioinformatics server, and perform a multiple sequence
alignment using the MUSCLE software.
4) Open the file created in step 3 in JALVIEW, and color the residues of the alignment having an
identity of at least 80% using the ClustalX color scheme. The result should be exported to a file
in the EPS format and deposited in the appropriate section of the course’s portal.
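The size criteria of step 1 (15 to 30 sequences, each 300 to 600 amino acids long) can be checked on the FASTA file before running the alignment. The following is a minimal sketch, not part of the course material; the function names `parse_fasta` and `check_dataset` are hypothetical, and the parser assumes every header is unique (a repeated header would overwrite the earlier record).

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {header: sequence} from FASTA-formatted text."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may span several lines
    return records


def check_dataset(
    records: Dict[str, str],
    min_seqs: int = 15,
    max_seqs: int = 30,
    min_len: int = 300,
    max_len: int = 600,
) -> List[str]:
    """List the ways a dataset departs from the criteria of step 1."""
    problems = []
    if not min_seqs <= len(records) <= max_seqs:
        problems.append(f"{len(records)} sequences (expected {min_seqs}-{max_seqs})")
    for header, seq in records.items():
        if not min_len <= len(seq) <= max_len:
            problems.append(f"{header}: {len(seq)} residues (expected {min_len}-{max_len})")
    return problems
```

An empty list from `check_dataset(parse_fasta(text))` means the dataset meets the size criteria; taxonomic diversity still has to be judged by the student.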
Assignment 3 (Module 7: Phylogenetic Analysis and Module 8: Building Phylogenetic Trees)
In this assignment, each student has to construct a maximum likelihood tree using the protein dataset
that was selected for the second assignment. Reusing the same dataset helps keep students invested in
the results, since they chose the sequences themselves and can concretely see their work progress. The
robustness of the inferred tree is also evaluated by analysis of bootstrap replicates. Detailed instructions
are provided in a clip deposited on the course’s portal. The assignment involves the following steps:
1) Removal of the ambiguous regions in the sequence alignment that was generated in the second
assignment using GBLOCKS.
2) Conversion of the filtered FASTA file to the PHYLIP format using the CLUSTALX software
program.
3) Inference of a maximum likelihood tree and evaluation of its robustness using the PHYML
program.
4) Visualization and editing of the phylogenetic tree, and production of an illustration using the
interactive NJPLOT program.
Students should deposit on monPortail the original alignment and the filtered one, the resulting
inferred tree as well as a short text discussing the resulting topology.
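In the assignment, the conversion of step 2 is done with CLUSTALX. To illustrate what that conversion amounts to, the following is a minimal sketch that writes an aligned set of sequences in the strict sequential PHYLIP layout (a header line with the number of sequences and the alignment length, then one line per sequence with the name padded to 10 characters). The function name `fasta_alignment_to_phylip` is hypothetical, and truncating names to 10 characters is an assumption of this sketch, not a description of what CLUSTALX does.

```python
from typing import Dict


def fasta_alignment_to_phylip(records: Dict[str, str]) -> str:
    """Format aligned sequences {header: sequence} as sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (lengths differ)")
    alignment_length = lengths.pop()

    # Keep the first word of each header, truncated to 10 characters.
    names = [(header.split() or ["unnamed"])[0][:10] for header in records]
    if len(set(names)) != len(names):
        raise ValueError("names are not unique after truncation to 10 characters")

    lines = [f" {len(records)} {alignment_length}"]
    for name, seq in zip(names, records.values()):
        lines.append(f"{name:<10}{seq}")
    return "\n".join(lines) + "\n"
```

The uniqueness check matters because long database identifiers often share their first 10 characters, in which case the names should be shortened by hand before conversion.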
Assignment 4 (Module 9: 3D Structure of Proteins and Module 10: Predicting the Structure of
Proteins)
This assignment is a practical introduction to PyMOL, the preparation of publication-quality
images, and the prediction of protein 3D structures. To complete this assignment, the students,
working in teams of 2 or 3, have to:
1) Build a publication-quality (ray-traced) image of the protein PDB 3RGK using PyMOL.
2) For a given amino acid sequence (UniProtKB R4QRB9), find the homologous proteins in the
PDB that have a 3D structure. For each homolog, provide the PDB code, the name of the protein,
and the percentages of identity and similarity. Also provide the alignments.
3) Predict the secondary structure of the amino acid sequence using three different methods.
4) Predict the 3D structure by homology modeling using the PDB 3VNF as a template and
MODELLER on the bioinformatics server.
5) Using PyMOL, build a publication-quality image that compares both the predicted structure and
the template structure. Save the PyMOL session used to build the image in a PSE file.
6) Deposit the files in the appropriate section of the course’s portal.
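The percentage of identity requested in step 2 is normally read from the output of the search tool. As an illustration of how it can be computed from a pairwise alignment, the following is a minimal sketch. The function name `percent_identity` is hypothetical, and the denominator used here (the full alignment length, gaps included) is only one common convention; other tools divide by the length of the shorter sequence or by the number of ungapped columns.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(
        1
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if a == b and a != gap  # a gap facing a gap is not counted as a match
    )
    return 100.0 * matches / len(aligned_a)
```

Similarity additionally counts conservative substitutions and therefore depends on the substitution matrix chosen, so it is not covered by this sketch.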
Assignment 5 (Module 11: Introduction to Molecular Modeling and Module 12: Molecular
Docking)
For this assignment, the students are required to complete two docking simulations using Autodock
Vina and its PyMOL plugin. The different steps of the simulations are presented in a series of clips on
the course’s portal. Each team of 2 or 3 students performs their docking simulations on the
bioinformatics server, following the steps presented in the clips but using a different protein-ligand
complex from the PDB database, and discusses the results in a report.
For example, a student team is assigned to the protein-ligand complex PDB 10GS (Human glutathione
S-transferase P1-1, complexed with the ligand TER117). The students are required to:
1) Present the protein-ligand complex in the Introduction of the report, including the biological
function of the protein and the nature of the ligand (inhibitor, natural substrate, cofactor, etc.).
2) Identify the ligand in the complex and determine the molecular interactions between the protein
and the ligand. A PyMOL image and a Table reporting the interactions are included in the
report.
3) Prepare the protein and the ligand for docking using PyMOL.
4) Determine the docking parameters using the Vina PyMOL plugin and run the docking
simulation.
5) Modify the ligand using PyMOL, with the expectation of increasing the ligand’s affinity for the
protein.
6) Run a docking simulation for the modified ligand.
7) Include PyMOL images and a table of the docking results in the report.
8) Include in the report a discussion of the docking results with respect to the protein-ligand
interactions identified in step 2 and the chemical modifications of step 5.
BIF-1901 SYLLABUS
(TRANSLATED ENGLISH VERSION)
SYLLABUS
BIF-1901: Introduction to Bioinformatics and Bioinformatics Tools
Faculty of Science and Engineering
Department of Biochemistry, Microbiology and Bioinformatics
Mode of instruction: Online course
Credits: 3
Note: This Syllabus is a translation of the essential sections of the original Lesson Plan of the course,
available online at www.ulaval.ca, and provided as supplementary material.
Description
The course aims to introduce students to the various fields of applications of bioinformatics. The course
emphasis is on learning the main tools of bioinformatics related to:
• databases of sequences and structures
• methods of sequencing and assembling genomes
• analysis and alignment of amino acid sequences
• phylogenetic analysis
• modelling the structure of proteins from their sequence
• modelling protein-ligand interactions (e.g., antibiotic, substrate, inhibitor) from molecular
docking
The students are also introduced to the concepts of systems biology.
This course is offered online. For more information, see the course page at www.distance.ulaval.ca.
Objectives
At the end of the course, students will be able to:
• Describe the major fields of application and challenges in bioinformatics
• Effectively retrieve information from biological databases and understand their importance in
bioinformatics
• Understand the main theoretical and practical aspects of:
o aligning and assembling DNA and protein sequences
o similarity searches in databases
o phylogenetics
o prediction of the 3D structure of protein
o molecular modelling
• Use various specialized resources to characterize patterns and domains of a protein, its location
in the cell, and its likely post-translational modifications
• Compare the experimental methods used to determine the 3D structure of proteins at the atomic
level and tools for predicting the structure of proteins
• Understand the importance of modelling and molecular docking in the study of biological
molecules
Educational approach: a note to students
This course is designed according to a pedagogical approach specific to online learning. The teaching
materials and the method allow students to adopt a relatively autonomous learning approach. You can
manage your own study time and be responsible for your own learning.
The person responsible for the course will remain available to support you throughout the session. The
role of this person is to facilitate the learning conditions and to help you in your approach so that you
achieve the objectives of the course. You can communicate with this person by various means: send an
e-mail for more personal questions, and use the forum for issues of general interest that will benefit the
entire class. The modalities of supervision (response time, availability, etc.) are further described in the
Lesson Plan.
The course website and the textbook (Essential Bioinformatics) contain all the teaching materials
required to take the online course, including booklets, lectures, demonstrations, exercises, and anything
else you might need. Each week, you are invited to consult the Content and Activities tab of the module
for the description of the learning and evaluation activities planned. In general, the schedule proposed
in the Content and Activities section is flexible and can adapt to your time schedule within the space of
the week. Online training allows you to learn at your own pace; however, by adopting a regular
learning schedule from the beginning of the course, you will be able to benefit from regular feedback
from the person providing supervision (i.e. plan to do your class requirements at the beginning of each
week to allow time to ask questions and resolve difficult content questions.) You remain, of course, the
only person who manages your schedule, but you must commit to performing the homework and
summative evaluations at the prescribed times (see the Evaluations and Results tab).
For the duration of the session, you will have access to a bioinformatics server dedicated to the needs
of the course. This server is hosted at the Faculty of Science and Engineering at IP
XXX.XXX.XXX.XXX and runs on Linux. To complete the assignments, you will need to log in
remotely to the server and use some of the bioinformatics programs installed there. All the information
needed to connect and work properly with the server will be presented in Module 1.
The expected duration of the course is 15 weeks. It is divided into 12 modules, each tackling a specific
theme. Typically, each module has a duration of one week. The amount of work required to complete
the modules and the evaluations is about 135 hours. On average, the weekly workload is about 9 hours,
but some modules are longer and others are shorter. To access the modules, go to the Content and
Activities tab from the course website.
In each module, you will find the following information:
Content tab
• Introduction
• Specific objectives
Activities tab
• Learning activities: instructions detailing the work to be done for a given module
• Learning resources: mandatory instructional material (text, videos, clips, etc.)
Complementary Tab
• For each module, links to interesting website addresses are offered.
Summative evaluation Tab
• This tab will only be available for the modules where you need to complete a summative
evaluation. It will contain hyperlinks and evaluation information.
The Student Aid Center offers advice on how to succeed academically. They can help you improve
your learning strategies, help with basic content, and help you in the management of your study time.
See (https://www.aide.ulaval.ca/cms/site/aide).
Terms and conditions
The feedback provided by the supervisor can take different routes. This course focuses on two means
of coaching: e-mail and discussion forums. It is important to be aware that responses to e-mail
questions will not be instantaneous. In this course, the supervisor will reply within two working
days. In order to avoid delays, it is recommended that you send e-mail only for personal questions, and
use the forum for general content questions. Please be clear and explicit in your questions and
comments (e.g., specify document names and references).
In addition, you can also use the discussion forums to discuss various content issues with other
students. As you study remotely, the forum is a tool that allows you to converse with your colleagues
and with the person providing the supervision. In this course, there will be three types of forums:
• the forums specific to each module, where you can ask questions about and discuss the content
of each module
• the General Questions forum, where you can ask questions about the administrative aspects of
the course
• the Technical Aspects forum where you can ask questions about the connection to the course
server, transfer of files or other technical aspects important to the success of the course
To facilitate timely responses to your needs and the management of the forums, be sure to ask your
questions in the right section. Be explicit in the title of your messages for the best conversations and
responses.
We are committed to answering your questions or validating your responses within two business days.
Face-to-face support will be available on Monday afternoons on campus.
Course Goal
The aim of this course is to introduce the student to the vocabulary and the theory of bioinformatics and
to provide them with hands-on experience with specialized bioinformatics tools and programs.
Introduction
The course BIF-1901: Introduction to bioinformatics and its tools aims to introduce students to the
various fields of application of bioinformatics. The course focuses on learning the main tools of
bioinformatics, including:
• sequence databases
• genome sequencing and assembly methods
• the alignment of nucleic acid and protein sequences
• modelling of protein structures, based on their sequences
• modelling of enzyme-substrate interactions by molecular docking
• phylogenetic analysis
This course is intended for last-year students enrolled in bachelor's programs in Biochemistry or
Microbiology. The course BCM-2000 Molecular Genetics II or BIO-2003 Molecular Biology is a
prerequisite to the course.
In addition, to take this course, you will have to master some basic computer skills. For example, at the
beginning of the session, it is your responsibility to ensure that you are confident in working remotely
in a Linux environment. All of the necessary information for the computer requirements will be given
to you in the first module of the course.
This lesson plan presents all the information necessary for participation in this course. It includes the
instructions for the teaching materials you will use, information about how much leeway you have
in the learning path you choose to follow, and the requirements you will have to fulfill.
Happy reading and good luck with the course!
Course content and activities
The table below presents a week-by-week plan of the course activities.
| Week | Activity | Date |
|---|---|---|
| 1 | Module 1: Introduction to Bioinformatics, Software and Linux | Sep 5, 2016 |
| 2 | Module 2: Biological Databanks | Sep 12, 2016 |
| 3 | Module 3: Sequencing and Assembly Techniques | Sep 19, 2016 |
| 4 | Module 4: Sequence Alignment and Database Search (1st part) | Sep 26, 2016 |
| 5 | Module 5: Sequence Alignment and Database Search (2nd part) | Oct 3, 2016 |
| 6 | Free Work | Oct 10, 2016 |
| 7 | Module 6: Model Organism Databases and Prediction of Patterns and Functions of Proteins | Oct 17, 2016 |
| 8 | Module 7: Phylogenetic Analysis | Oct 24, 2016 |
| 9 | Reading Week | Oct 31, 2016 |
| 10 | Module 8: Building Phylogenetic Trees | Nov 7, 2016 |
| 11 | Module 9: 3D Structure of Proteins | Nov 14, 2016 |
| 12 | Module 10: Predicting the Structure of Proteins | Nov 21, 2016 |
| 13 | Module 11: Introduction to Molecular Modeling | Nov 28, 2016 |
| 14 | Module 12: Molecular Docking | Dec 5, 2016 |
| 15 | Free Work | Dec 12, 2016 |
Note: Please refer to the Content and Activities section of the course website for further details.
Evaluations and results
Assignment instructions. Each assignment will be accompanied by detailed instructions. Instructions
may vary between modules as different professors are responsible for each. Unless otherwise stated,
homework should be uploaded to the drop box of the course website. Homework should be done
individually unless otherwise indicated. A late penalty of 10% will be applied to late homework.
DO NOT WAIT UNTIL THE LAST MINUTE to start the homework. Homework consists of
practical exercises that use computer tools, and time-consuming problems may arise, related to
particular hardware or software configurations. Sometimes a few hours or a few days are required to
deal with these problems. If a delay in handing in homework is the result of a delay related to a
breakdown that couldn’t get resolved in time because you started too late, you are still responsible for
the lateness of your assignment. Part of learning to use these computer tools is developing a realistic
and practical sense of how long they can take, and planning so that contingencies in your projects
are handled smoothly.
List of evaluations
Online evaluations

| Evaluation | Modules | From | To | Mode | Weight |
|---|---|---|---|---|---|
| Online evaluation #1 | 4 and 5 | Oct. 13, 2016, 9:00 AM | Oct. 14, 2016, 4:00 PM | Individual | 10% |
| Online evaluation #2 | 6 | Oct. 27, 2016, 9:00 AM | Oct. 28, 2016, 4:00 PM | Individual | 15% |

Homework

| Homework | Modules | Due | Mode | Weight |
|---|---|---|---|---|
| Homework #1 | 1 | Sept. 23, 2016 at 12:59 PM | Individual | 15% |
| Homework #2 | 4 and 5 | Oct. 14, 2016 at 12:59 PM | Individual | 10% |
| Homework #3 | 7 and 8 | Nov. 18, 2016 at 12:59 PM | Individual | 15% |
| Homework #4 | 9 and 10 | Dec. 2, 2016 at 12:59 PM | Team | 15% |
| Homework #5 | 11 and 12 | Dec. 16, 2016 at 12:59 PM | Team | 20% |
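The weights of the two online evaluations and five homework assignments listed above add up to 100%. The following is a minimal sketch of how a final percentage could be computed from them. The function name `final_grade` is hypothetical, and treating a missing evaluation as a score of zero is an assumption; the late penalty is not modelled because the syllabus does not specify how it accrues.

```python
# Weights (in percent of the final grade) taken from the list of evaluations.
WEIGHTS = {
    "Online evaluation #1": 10,
    "Online evaluation #2": 15,
    "Homework #1": 15,
    "Homework #2": 10,
    "Homework #3": 15,
    "Homework #4": 15,
    "Homework #5": 20,
}

# Sanity check: the evaluations account for the whole grade.
assert sum(WEIGHTS.values()) == 100


def final_grade(scores: dict) -> float:
    """Weighted final grade in percent.

    `scores` maps an evaluation name to the score obtained, in percent (0-100).
    Evaluations missing from `scores` count as 0 (an assumption of this sketch).
    """
    return sum(weight * scores.get(name, 0.0) for name, weight in WEIGHTS.items()) / 100.0
```

For instance, a student scoring 80% on every evaluation obtains a final grade of 80%, and Homework #5 alone contributes at most 20 percentage points.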
Grading Scale
Course Materials
Required Materials
Essential bioinformatics
Author: Xiong, Jin
Publisher: Cambridge University Press (Cambridge New York, 2006)
ISBN: 9780521600828
Software
Most of the required computer programs are already installed on the course server. However, you are
required to install the following (free) software:
Remote connection to the course server
• NoMachine (http://www.nomachine.com/download.php)
• FileZilla client (https://filezilla-project.org/download.php?show_all=1)
Modules 9 and 10: 3D structure of proteins
• PyMOL (http://pymol.org)
• Foldit (http://fold.it)
• Folding@home (http://folding.stanford.edu/English/Download)
Technological Specifications
To take this course you will also need one of the authorized calculators, which does not have internet
connectivity, as shown in the Policy on the Use of Electronic Devices during an Assessment
(https://www.fsg.ulaval.ca/fileadmin/fsg/documents/PDF/appareils_electro.pdf).
To be able to follow this course you will need to have, or have access to, the following tools:
• A computer with the applications required for web browsing
• An Internet connection (minimum intermediate speed, high speed recommended)
• Speakers or headphones
Mediagraphy and Annex
Bibliography
• Claverie, J. M. and C. Notredame. 2007. Bioinformatics for Dummies. 2nd Edition. Wiley
Publishing, Inc. ISBN: 9780470089859
• Edwards, D., Stajich, J. E. and D. Hansen. 2009. Bioinformatics: Tools and Applications.
Springer. ISBN: 0387927379
• Pevsner, J. 2003. Bioinformatics and Functional Genomics. 1st Edition. John Wiley & Sons,
Inc. ISBN: 0471210048
• Rodríguez-Ezpeleta, N., Hackenberg, M. and A. Aransay. 2012. Bioinformatics for High
Throughput Sequencing. Springer. ISBN: 9781461407812
• Xiong, J. 2006. Essential Bioinformatics. Cambridge University Press. ISBN: 9780521600828
• Zvelebil, M. and J. O. Baum. 2008. Understanding Bioinformatics. Garland Science, Taylor &
Francis Group, LLC. ISBN: 9780815340249
BIF-1901 PLAN DE COURS
(ORIGINAL FRENCH VERSION)