2018 02 20_biological_databases_part1_v_upload

FBW

20-02-2018

Biological Databases

Wim Van Criekinge

Bioinformatics, a scientific discipline? Or the new (molecular) biology?

[Diagram, built up over several slides. Labels: Math, Informatics, Computer Science, Statistics, Machine Learning, Text Mining, Python, Theoretical Biology, Computational Biology, (Molecular) Biology, Bioinformatics, Discovery Informatics, Biological Databases, Epigenetics. Lab for Bioinformatics and Computational Genomics]

The most valuable programming skills to have on a resume

New kid in the coding block …

Sander-Schneider

• HSSP: homology derived secondary structure

Usage of the databases

Annotation searches - Search for keywords, authors, features
• What is the protein sequence for human insulin?
• What does the 3D structure of calmodulin look like?
• What is the genetic location of the cystic fibrosis gene?
• List all intron sequences in rat.

Homology (similarity) searches - Search for similar sequences
• Is there any known protein sequence that is similar to x?
• Is this gene known in any other species?
• Has someone already cloned this sequence?

Pattern searches - Search for occurrences of patterns
• Does my protein sequence contain any known motif (that can give me a clue about the function)?
• Which known sequences contain this motif?
• Is any part of my nucleotide sequence recognized by a transcription factor?
• List all known start, splice and stop signals in my genomic sequence.

Predictions - Using the databases as knowledge databases
• What may the structure of my protein be?
– Secondary structure prediction.
– Modelling by homology.
• What is the gene structure of my genomic sequence?
• Which parts of my protein have a high antigenicity?

Comparisons
• Gene families
• Phylogenetic trees
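
A pattern search of the kind listed under "Pattern searches" can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular database: it translates a simple PROSITE-style pattern into a regular expression and reports every hit, including overlapping ones. The helper names `prosite_to_regex` and `find_motif` and the test sequence are made up for this example; the pattern `N-{P}-[ST]-{P}` is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex.

    Only the basic elements are handled: single residues, x (any residue),
    [..] (allowed residues), {..} (forbidden residues) and (n) or (n,m) repeats.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Braces first, so the braces produced from parentheses stay untouched.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)


def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    # A lookahead lets the regex engine report overlapping occurrences.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


# Made-up test sequence, not a real protein.
hits = find_motif("MNGTANPSQNNTS", "N-{P}-[ST]-{P}")
print(hits)  # [(2, 'NGTA'), (10, 'NNTS')]
```

The candidate at position 6 (`NPSQ`) is rejected because proline is forbidden in the second position, which is exactly what the `{P}` element expresses.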

Lesson 1

• Bioinformatics I Revisited in 5 slides
• Why bother making databases?
• Databases
– FF (flat files)
  • *.txt
  • Indexed version
– Relational (RDBMS)
  • Access, MySQL, PostgreSQL, Oracle
– OO (OODBMS)
  • AceDB, ObjectStore
– Hierarchical
  • XML
– Frame-based systems
  • E.g. DAML+OIL
– Hybrid systems

GenBank Format

LOCUS LISOD 756 bp DNA BCT 30-JUN-1993

DEFINITION L.ivanovii sod gene for superoxide dismutase.

ACCESSION X64011.1 GI:37619753

NID g44010

KEYWORDS sod gene; superoxide dismutase.

SOURCE Listeria ivanovii.

ORGANISM Listeria ivanovii

Eubacteria; Firmicutes; Low G+C gram-positive bacteria;

Bacillaceae; Listeria.

REFERENCE 1 (bases 1 to 756)

AUTHORS Haas,A. and Goebel,W.

TITLE Cloning of a superoxide dismutase gene from Listeria ivanovii

by functional complementation in Escherichia coli and

characterization of the gene product

JOURNAL Mol. Gen. Genet. 231 (2), 313-322 (1992)

MEDLINE 92140371

REFERENCE 2 (bases 1 to 756)

AUTHORS Kreft,J.

TITLE Direct Submission

JOURNAL Submitted (21-APR-1992) J. Kreft, Institut f. Mikrobiologie,

Universitaet Wuerzburg, Biozentrum Am Hubland, 8700

Wuerzburg, FRG
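
To make the flat-file layout concrete, here is a minimal sketch of how such a record could be read with plain Python. It assumes the usual GenBank column layout, which the slide text has lost: keywords sit in the first 12 columns and continuation lines are indented by 12 spaces. The function name `parse_flat_record` is hypothetical, and only part of the record above is reproduced. For real work a dedicated parser (for example Biopython's `SeqIO`) is the safer choice.

```python
# Part of the record above, re-aligned to the assumed GenBank column layout.
RECORD = """\
LOCUS       LISOD 756 bp DNA BCT 30-JUN-1993
DEFINITION  L.ivanovii sod gene for superoxide dismutase.
SOURCE      Listeria ivanovii.
  ORGANISM  Listeria ivanovii
            Eubacteria; Firmicutes; Low G+C gram-positive bacteria;
            Bacillaceae; Listeria.
REFERENCE   1 (bases 1 to 756)
  AUTHORS   Haas,A. and Goebel,W.
REFERENCE   2 (bases 1 to 756)
  AUTHORS   Kreft,J.
"""


def parse_flat_record(text: str) -> dict:
    """Collect keyword lines into a dict mapping keyword -> list of values."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:12].strip():
            # A keyword (or indented sub-keyword) starts within the first 12 columns.
            key, _, value = line.strip().partition(" ")
            fields.setdefault(key, []).append(value.strip())
            last_key = key
        elif last_key is not None:
            # Continuation line: append to the most recent value.
            fields[last_key][-1] += " " + line.strip()
    return fields


record = parse_flat_record(RECORD)
print(record["AUTHORS"])      # ['Haas,A. and Goebel,W.', 'Kreft,J.']
print(record["ORGANISM"][0])  # name plus the taxonomy lines, joined into one string
```

Note how much of the logic deals with layout rather than content: the attribute names are part of the data, records span several lines, and the link between a REFERENCE and its AUTHORS exists only through line order.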

Problems with Flat files …

• Wasted storage space

• Wasted processing time

• Data control problems

• Problems caused by changes to data structures

• Access to data difficult

• Data out of date

• Constraints are system based

• Limited querying, e.g. all single-exon GPCRs (<1000 bp)

• What is a relational database?
– Sets of tables and links (the data)
– A language to query the database (Structured Query Language)
– A program to manage the data (RDBMS)
• Flat files are not relational
– Data type (attribute) is part of the data
– Record order matters
– Multiline records
– Massive duplication
  • E.g. Organism: Homo sapiens, Eukaryota, …
– Some records are hierarchical
  • Xrefs
– Records contain multiple “sub-records”
– Implicit “key”
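
The query that is hard to answer from flat files ("all single-exon GPCRs shorter than 1000 bp") becomes a single statement once the data sit in tables. The sketch below uses Python's built-in `sqlite3` module. The two-table schema (`gene`, `exon`), the column names and the toy rows are all assumptions made up for illustration; they do not come from a real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE gene (
    id        INTEGER PRIMARY KEY,
    name      TEXT,
    family    TEXT,
    length_bp INTEGER
);
CREATE TABLE exon (
    gene_id    INTEGER REFERENCES gene(id),
    exon_start INTEGER,
    exon_end   INTEGER
);
""")

# Invented toy data: four genes with one or two exons each.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", [
    (1, "toy_gpcr_a", "GPCR", 900),
    (2, "toy_gpcr_b", "GPCR", 950),
    (3, "toy_gpcr_c", "GPCR", 2400),
    (4, "toy_kinase", "kinase", 800),
])
conn.executemany("INSERT INTO exon VALUES (?, ?, ?)", [
    (1, 1, 900),
    (2, 1, 400), (2, 500, 950),
    (3, 1, 2400),
    (4, 1, 800),
])

QUERY = """
SELECT g.name
FROM gene AS g
JOIN exon AS e ON e.gene_id = g.id
WHERE g.family = 'GPCR' AND g.length_bp < 1000
GROUP BY g.id, g.name
HAVING COUNT(*) = 1          -- exactly one exon
ORDER BY g.name
"""
single_exon_gpcrs = [name for (name,) in conn.execute(QUERY)]
print(single_exon_gpcrs)  # ['toy_gpcr_a']
```

The query states what is wanted and leaves the navigation to the RDBMS. With a flat file, the same question needs a custom parser and a loop over every record.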

• records
• fields
• linear file of homogeneous records

[Diagram: a flat file drawn as a stack of identical records, each with the fields name, surname, phone, address]

Introduction to Database Systems

• Historical background
– Hierarchical databases (IMS) - IBM 1968
  • Hierarchical structures between file records
– Network databases - CODASYL Group 1969
  • Network structures of record types
  • Linked chains between 'Owner' and 'Member' records
  • Included in Cobol, procedural language - Manual navigation
– Relational Data Model - E. F. Codd 1970
  • Mathematical foundation of databases
  • New non-procedural language SQL - Automatic navigation
– Object-relational databases
– Object-oriented databases

Relational

• The relational model is not only very mature; there is also a strong body of knowledge on how to make a relational back-end fast and reliable, and how to exploit different technologies such as massive SMP, optical jukeboxes, clustering, etc. Object databases are nowhere near this, and I do not expect them to get there in the short or medium term.
• Relational databases have a well-known and proven underlying mathematical theory, a simple one (set theory), that makes possible
– automatic cost-based query optimization,
– schema generation from high-level models and
– many other features that are now vital for mission-critical information systems development and operations.

The Benefits of Databases

• Redundancy can be reduced

• Inconsistency can be avoided

• Conflicting requirements can be balanced

• Standards can be enforced

• Data can be shared

• Data independence

• Integrity can be maintained

• Security restrictions can be applied

Relational Terminology

CUSTOMER Table (Relation)

ID   NAME              PHONE           EMP_ID
201  Unisports         55-2066101      12
202  Simms Atheletics  81-20101        14
203  Delhi Sports      91-10351        14
204  Womansport        1-206-104-0103  11

Row (Tuple)

Column (Attribute)

Relational Database Terminology

• Each row of data in a table is uniquely identified by a primary key (PK)

• Information in multiple tables can be logically related by foreign keys (FK)

Table Name: CUSTOMER (Primary Key: ID, Foreign Key: EMP_ID)

ID   NAME              PHONE           EMP_ID
201  Unisports         55-2066101      12
202  Simms Atheletics  81-20101        14
203  Delhi Sports      91-10351        14
204  Womansport        1-206-104-0103  11

Table Name: EMP (Primary Key: ID)

ID  LAST_NAME  FIRST_NAME
10  Havel      Marta
11  Magee      Colin
12  Giljum     Henry
14  Nguyen     Mai
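
The two tables can be recreated with Python's built-in `sqlite3` module to show what the keys do in practice. This is a sketch under one stated assumption: the lower-case table and column names are my own rendering of the slide's CUSTOMER and EMP tables. The rows are the ones shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys on request
conn.executescript("""
CREATE TABLE emp (
    id         INTEGER PRIMARY KEY,
    last_name  TEXT,
    first_name TEXT
);
CREATE TABLE customer (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    phone  TEXT,
    emp_id INTEGER REFERENCES emp(id)   -- foreign key to EMP
);
""")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", [
    (10, "Havel", "Marta"),
    (11, "Magee", "Colin"),
    (12, "Giljum", "Henry"),
    (14, "Nguyen", "Mai"),
])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)", [
    (201, "Unisports", "55-2066101", 12),
    (202, "Simms Atheletics", "81-20101", 14),
    (203, "Delhi Sports", "91-10351", 14),
    (204, "Womansport", "1-206-104-0103", 11),
])
conn.commit()

# The foreign key relates each customer to exactly one employee.
pairs = conn.execute("""
    SELECT c.name, e.last_name
    FROM customer AS c
    JOIN emp AS e ON e.id = c.emp_id
    ORDER BY c.id
""").fetchall()
print(pairs)

# A customer pointing at a non-existent employee is rejected by the DBMS.
try:
    conn.execute("INSERT INTO customer VALUES (205, 'No such rep', NULL, 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)  # True
```

The employee names are stored once, in EMP, and each customer row carries only the key. That is how the relational layout avoids the duplication seen in flat files.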

Relational operators

• Relational
– select: rel WHERE boolean-xpr
– project: rel [ attr-specs ]
– join: rel JOIN rel
– divide by: rel DIVIDEBY rel
• Set-based
– rel UNION rel
– rel INTERSECT rel
– rel MINUS rel
– rel TIMES rel
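
The operators are easier to remember once you have seen them act on data. The following is a toy sketch in plain Python, not how an RDBMS implements them: a relation is a list of dicts, and each operator returns a new relation. The function names mirror the operators above; DIVIDEBY is left out for brevity. The attribute names `CUST_ID` and `EMP_ID` are renamed from the CUSTOMER and EMP tables (an assumption of this example) so that the natural join has a shared attribute to match on.

```python
def select(rel, predicate):
    """rel WHERE boolean-xpr: keep the rows for which the predicate holds."""
    return [row for row in rel if predicate(row)]


def project(rel, attrs):
    """rel [ attr-specs ]: keep the named attributes and drop duplicate rows."""
    result = []
    for row in rel:
        reduced = {a: row[a] for a in attrs}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right):
    """rel JOIN rel: natural join on all attribute names the rows share."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[a] == r[a] for a in l.keys() & r.keys())
    ]


def union(a, b):
    return a + [row for row in b if row not in a]


def intersect(a, b):
    return [row for row in a if row in b]


def minus(a, b):
    return [row for row in a if row not in b]


def times(a, b):
    """rel TIMES rel: Cartesian product; attribute names must not overlap."""
    assert not (a and b) or not (a[0].keys() & b[0].keys())
    return join(a, b)  # with no shared attributes every pair of rows matches


customer = [
    {"CUST_ID": 201, "NAME": "Unisports", "EMP_ID": 12},
    {"CUST_ID": 202, "NAME": "Simms Atheletics", "EMP_ID": 14},
    {"CUST_ID": 203, "NAME": "Delhi Sports", "EMP_ID": 14},
    {"CUST_ID": 204, "NAME": "Womansport", "EMP_ID": 11},
]
emp = [
    {"EMP_ID": 10, "LAST_NAME": "Havel"},
    {"EMP_ID": 11, "LAST_NAME": "Magee"},
    {"EMP_ID": 12, "LAST_NAME": "Giljum"},
    {"EMP_ID": 14, "LAST_NAME": "Nguyen"},
]

customers_of_14 = project(select(customer, lambda r: r["EMP_ID"] == 14), ["NAME"])
joined = join(customer, emp)
without_customers = minus(project(emp, ["EMP_ID"]), project(customer, ["EMP_ID"]))

print(customers_of_14)    # [{'NAME': 'Simms Atheletics'}, {'NAME': 'Delhi Sports'}]
print(without_customers)  # [{'EMP_ID': 10}]
```

Every operator takes relations and returns a relation, so the operators can be chained freely. This closure property is what lets a query optimizer rearrange an expression without changing its result.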

Disadvantages

• size

• complexity

• cost

• Additional hardware costs

• Higher impact of failure

• Recovery more difficult

• RDBMS products
– Free
  • MySQL: very fast, widely used, easy to jump into, but limited, non-standard SQL
  • PostgreSQL: full SQL, limited OO, steeper learning curve than MySQL
– Commercial
  • MS Access: great query builder, GUI interfaces
  • MS SQL Server: full SQL, NT only
  • Oracle: everything, including the kitchen sink
  • IBM DB2, Sybase

Example 3-tier model in biological database

http://www.bioinformatics.be

Example of different interface to the same back-end database (MySQL)

BioSQL

Conclusions

• A database is a central component of any contemporary information system
• The operations on the database and the maintenance of database consistency are handled by a DBMS
• There exist stand-alone query languages and embedded languages, but both deal with definition (DDL) and manipulation (DML) aspects
• The structural properties, constraints and operations permitted within a DBMS are defined by a data model - hierarchical, network, relational
• Recovery and concurrency control are essential
• Linking of heterogeneous data sources is a central theme in modern bioinformatics
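
The distinction between DDL and DML, and the DBMS guarding consistency, can be seen in a few lines with Python's built-in `sqlite3` module. This is a minimal sketch; the `sample` table and its rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure and its constraints.
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY, label TEXT NOT NULL)")

# DML inside a transaction: the 'with' block commits on success.
with conn:
    conn.execute("INSERT INTO sample VALUES (1, 'control')")

# A transaction that violates a constraint is rolled back as a whole.
try:
    with conn:
        conn.execute("INSERT INTO sample VALUES (2, 'treated')")
        conn.execute("INSERT INTO sample VALUES (2, 'duplicate id')")  # breaks the primary key
except sqlite3.IntegrityError:
    pass  # the 'with' block has already rolled back both inserts

labels = [label for (label,) in conn.execute("SELECT label FROM sample ORDER BY id")]
print(labels)  # ['control']
```

The first insert of the failed transaction is undone together with the second one, so the table never holds a half-applied change. Here the DBMS does the consistency work that an application working on flat files would have to do by hand.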

What is to come ?

Basic outline

• Set up RDBMS
• OLTP access through CLI, dedicated client, PHP, Perl/Python
• OLAP access through Perl/Python, R ..

Integration

• Cytoscape

Semantic Web

• NoSQL/Hadoop
• SPARQL

Project

• Sciencecraft

• iGEM

• BioDesignChallenge

• mHealth

• Social Genetics

3/05/2016 Project Biological Databases

2015-2016

Biological Databases

Bruno Verstraeten, Arthur Zwaenepoel,

Jules Haezebrouck, Laurenz De Cock, Jonathan

Walgraeve, Cedric Bogaert, Dries Schaumont

What is Minecraft?

• Sandbox game
• Designed by Markus “Notch” Persson
• Mojang
• Bought by Microsoft in 2014
• 70 million copies sold (June 2015)

Minecraft programming from Python

Third party mods

• Extra content made by users

• Adding items, magic and features to the original game
• The true beauty of Minecraft

And now Sciencecraft

• Visualizing proteins in Minecraft
• Minecraft Tools Python package
• Data directly from PDB flat files or from the PDB server
• Spigot Minecraft server

The basics

1. Start a server with Minecraft Tools
2. Using Python, import the PDB file
3. Retrieve the coordinates from the file
4. Using the setBlock function, blocks of specific colours are placed in the Minecraft server to represent the protein
5. Fly around and take screenshots

Minecraft programming from Python

# Connect to Minecraft

from mcpi.minecraft import Minecraft

mc = Minecraft.create()

# Set x, y, and z variables to represent coordinates

x = 10.0

y = 110.0

z = 12.0

# Change the player's position

mc.player.setPos(x, y, z)
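
Steps 2 to 4 of "The basics" can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code. The atom coordinates are read from the fixed columns of PDB ATOM records. The first sample atom is atom 1 of entry 1O9K and the second line is invented. The colour scheme (wool block id 35, with data values 7 grey, 11 blue, 14 red, 4 yellow) and the function names are my own choices. `place_blocks` expects the `mc` object created by `Minecraft.create()` as shown above and calls its `setBlock(x, y, z, block_id, data)` method.

```python
PDB_TEXT = """\
HEADER    SAMPLE
ATOM      1  N   MET A 379      13.258 142.706  30.410  1.00 62.42           N
ATOM      2  CA  MET A 379      14.000 141.500  31.200  1.00 60.00           C
"""

WOOL = 35  # block id of wool; the data value selects the colour
ELEMENT_COLOURS = {"C": 7, "N": 11, "O": 14, "S": 4}  # assumed colour scheme


def parse_atoms(pdb_text):
    """Return (element, x, y, z) for every ATOM/HETATM record."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            # Fixed columns of the PDB format: x 31-38, y 39-46, z 47-54, element 77-78.
            x, y, z = float(line[30:38]), float(line[38:46]), float(line[46:54])
            atoms.append((line[76:78].strip(), x, y, z))
    return atoms


def atoms_to_blocks(atoms, origin=(0, 64, 0), scale=1.0):
    """Map atom coordinates (in angstrom) to integer block positions near origin."""
    mins = [min(atom[i] for atom in atoms) for i in (1, 2, 3)]
    blocks = []
    for element, *xyz in atoms:
        pos = [o + round((c - m) * scale) for o, c, m in zip(origin, xyz, mins)]
        blocks.append((*pos, WOOL, ELEMENT_COLOURS.get(element, 0)))
    return blocks


def place_blocks(mc, blocks):
    """Send the blocks to a running server through an mcpi Minecraft object."""
    for x, y, z, block_id, data in blocks:
        mc.setBlock(x, y, z, block_id, data)


blocks = atoms_to_blocks(parse_atoms(PDB_TEXT))
print(blocks)  # [(0, 65, 0, 35, 11), (1, 64, 1, 35, 7)]
```

At one block per angstrom, bonded atoms often land on neighbouring or identical blocks. Raising `scale` spreads the structure out and makes individual atoms easier to tell apart while flying around.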

Verotoxin

Apolipoprotein A1

Kinesin

Retrieving PDB data using SPARQL

• PDB available in RDF (wwPDB)

• Using Python SPARQLWrapper

Using SPARQL with Python – SPARQLWrapper

SPARQL endpoint

“Search engine”

• Naive regex based
• Returns a list of all PDB entries containing a certain keyword, with organism name and full description
• PDB entry can be retrieved with previous query
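
A naive regex-based search of this kind could look like the sketch below. The slides do not show the original query, so this is an assumption-laden stand-in: it matches the keyword against every literal in the store and uses no PDB-specific predicates. `build_keyword_query` and `run_query` are hypothetical names, and the endpoint URL is deliberately left as a parameter because the slides do not give one. The SPARQLWrapper calls used (`SPARQLWrapper(...)`, `setQuery`, `setReturnFormat`, `query().convert()`) belong to that package's public API.

```python
def build_keyword_query(keyword: str, limit: int = 50) -> str:
    """Build a SPARQL query matching the keyword (case-insensitive) in any literal."""
    # Escape backslashes and double quotes so the keyword stays inside the string literal.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT DISTINCT ?entry ?property ?text WHERE {\n"
        "  ?entry ?property ?text .\n"
        f'  FILTER(isLiteral(?text) && regex(str(?text), "{escaped}", "i"))\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def run_query(endpoint_url: str, query: str):
    """Send the query to a SPARQL endpoint and return the result bindings."""
    # Imported here so the query builder works without the third-party package.
    from SPARQLWrapper import SPARQLWrapper, JSON

    client = SPARQLWrapper(endpoint_url)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]


query = build_keyword_query("calmodulin")
print(query)
```

An unrestricted triple pattern with a regex filter forces the endpoint to scan every literal, so this is slow on a large store. Restricting `?property` to the title and organism predicates of the PDB RDF vocabulary would be the obvious refinement.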

Retrieve .xml.gz file:

Actual structure information in xml file

<?xml version="1.0" encoding="UTF-8" ?>
<PDBx:datablock datablockName="1O9K"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://pdbml.pdb.org/schema/pdbx-v40.xsd pdbx-v40.xsd">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:B_iso_or_equiv>62.42</PDBx:B_iso_or_equiv>
      <PDBx:Cartn_x>13.258</PDBx:Cartn_x>
      <PDBx:Cartn_y>142.706</PDBx:Cartn_y>
      <PDBx:Cartn_z>30.410</PDBx:Cartn_z>
      <PDBx:auth_asym_id>A</PDBx:auth_asym_id>
      <PDBx:auth_atom_id>N</PDBx:auth_atom_id>
      <PDBx:auth_comp_id>MET</PDBx:auth_comp_id>
      <PDBx:auth_seq_id>379</PDBx:auth_seq_id>
      <PDBx:group_PDB>ATOM</PDBx:group_PDB>
      <PDBx:label_alt_id xsi:nil="true" />
      <PDBx:label_asym_id>A</PDBx:label_asym_id>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>MET</PDBx:label_comp_id>
      <PDBx:label_entity_id>1</PDBx:label_entity_id>
      <PDBx:label_seq_id>8</PDBx:label_seq_id>
      <PDBx:occupancy>1.00</PDBx:occupancy>
      ….
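
Reading the coordinates out of such a file takes only the Python standard library. The sketch below is one possible way to do it, not the project's own code: it parses a shortened, well-formed copy of the record above with `xml.etree.ElementTree`, and `read_gzipped` shows how the same function could be applied to a downloaded `.xml.gz` file. The function names and the choice of extracted fields are assumptions for this example.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace used by the PDBx elements in the file shown above.
NS = {"PDBx": "http://pdbml.pdb.org/schema/pdbx-v40.xsd"}

# Shortened, closed version of the record above (atom 1 of entry 1O9K).
PDBML_SNIPPET = b"""<?xml version="1.0" encoding="UTF-8" ?>
<PDBx:datablock datablockName="1O9K"
    xmlns:PDBx="http://pdbml.pdb.org/schema/pdbx-v40.xsd">
  <PDBx:atom_siteCategory>
    <PDBx:atom_site id="1">
      <PDBx:Cartn_x>13.258</PDBx:Cartn_x>
      <PDBx:Cartn_y>142.706</PDBx:Cartn_y>
      <PDBx:Cartn_z>30.410</PDBx:Cartn_z>
      <PDBx:label_atom_id>N</PDBx:label_atom_id>
      <PDBx:label_comp_id>MET</PDBx:label_comp_id>
    </PDBx:atom_site>
  </PDBx:atom_siteCategory>
</PDBx:datablock>
"""


def read_atom_sites(xml_bytes):
    """Return one dict per atom_site element with its name, residue and coordinates."""
    root = ET.fromstring(xml_bytes)
    atoms = []
    for site in root.iterfind("PDBx:atom_siteCategory/PDBx:atom_site", NS):
        atoms.append({
            "id": site.get("id"),
            "atom": site.findtext("PDBx:label_atom_id", namespaces=NS),
            "residue": site.findtext("PDBx:label_comp_id", namespaces=NS),
            "x": float(site.findtext("PDBx:Cartn_x", namespaces=NS)),
            "y": float(site.findtext("PDBx:Cartn_y", namespaces=NS)),
            "z": float(site.findtext("PDBx:Cartn_z", namespaces=NS)),
        })
    return atoms


def read_gzipped(path):
    """Same as read_atom_sites, for a .xml.gz file on disk."""
    with gzip.open(path, "rb") as handle:
        return read_atom_sites(handle.read())


atoms = read_atom_sites(PDBML_SNIPPET)
print(atoms[0])
```

Full PDBML files are large. For whole structures, `ET.iterparse` with element clearing would keep memory use down, at the cost of slightly more code.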

BIOSCIENCE ENGINEERING TOGETHER: PARTICIPATING AT IGEM

INTERNATIONAL GENETICALLY ENGINEERED MACHINE

➤ Annual synthetic biology competition
➤ Making new organisms: biobricks
➤ Hosted by MIT: five teams in 2004, 130 teams in 2016

PAST IGEM WINNERS

➤ 2014: biosensor for olive oil quality
➤ 2015: 3D printing of biofilms
➤ 2016: system for the control of co-culture stability

UGENT 2016 TEAM

SOLVING WATER SHORTAGE

FOUR WORK PACKAGES

WP1: Shape
WP2: Filament
WP3: Biofunction
WP4: Measurement

WP1: SHAPE OPTIMISATION

Fogstand beetle

WP2: FILAMENT

WP3: BIOFUNCTION

[Figure labels: lysate, membrane]

WP4: FUNCTIONAL ASSAY

OUR INPUT

IN BOSTON: IGEM CONFERENCE

Presenting, learning and having fun in Boston

FOLLOW UP

➤ Maker City
➤ BrainBooster session CropDesign
➤ Biodesign competition
➤ Bachelor project on 3D printing
➤ PLOS iGEM collection

Next steps

Ideas for the next iGEM teams …

v2

• Thermodynamically compatible test setting
• Real-life testing with UCSC (make a bigger version?) and/or green “plantable” versions for field tests (self-watering plant?)
• Introduce a temperature gradient? Blend the current dewpal with a solar and/or wind energy source …

http://waterseer.org

http://fontus.at

… why go to the trouble of collecting water out of the air?

Why not simply cause more rain to fall?

With our INP ???

http://science.howstuffworks.com/environmental/earth/geophysics/manufacture-water1.htm

An even wilder idea …

Environmental remediation projects (e.g. ISS water recovery, changing the freezing point … contact Prof. Arne Verliefde)

Molecular diagnostics - Liquid Biopsy project

Capture (methylated) cancer molecules from blood and/or urine in oncology; we can run head-to-head samples from clinical trials in bladder and/or prostate cancer

Alternative projects
