JASPAR BioPython & MANTA

Anthony Mathelier, David Arenillas & Wyeth Wasserman
[email protected] & [email protected]

Wasserman Lab
www.cmmt.ubc.ca

Webinar about JASPAR BioPython module and MANTA


Outline

● JASPAR BioPython module
  – What is JASPAR?
  – How to construct matrices from JASPAR files using the JASPAR BioPython module.

● MANTA
  – What is stored in MANTA?
  – How to interrogate the MANTA DB using Python and our web application.


http://jaspar.genereg.net

Mathelier et al. JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 2014 PMID 24194598

Modelling Transcription Factor Binding Sites (TFBS)

Example: FOXD1

Known binding sites (site in upper case, flanking sequence in lower case):

gctaaGTAACAATgcgca
cttaaGTAAACATcgctc
ccaatGTAAACAAacgga
gaaagGTAAACAAtgggc
     GTAAACATgtact
cttgtGTAAACAAaaagc
cttaaGTAAACACgtccg
cttatGTCAACAGtgggt
    tGTAAACATtgcat
     GTAAACAAtgcga
cttagGTAAACATtttcg
     TTAAGTAAaca
 caaaATAAACAAcgtgc
gctaaCTAAACAGagaga
gtgttGTAAACATtggaa
 taatGTAAACAAtgcgg
gaaagGTAAACATaagaa
cctaaGTAAACACaacgc
cctaaGTAAACATtctta
    tGTAAACAGaggtc

PFM – Position Frequency Matrix

A [ 1  0 19 20 18  1 20  7 ]
C [ 1  0  1  0  1 18  0  2 ]
G [17  0  0  0  1  0  0  3 ]
T [ 1 20  0  0  0  1  0  8 ]

[Figure: sequence logo of the FOXD1 motif]
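As a minimal sketch of how the PFM on this slide follows from the known binding sites, the counts can be reproduced in plain Python by tallying each base at each aligned position. The list below holds the 20 upper-case site cores from the slide with the lower-case flanks dropped; the helper name `build_pfm` is ours, not part of any library.

```python
from collections import Counter

# The 20 upper-case FOXD1 site cores from the slide (flanks dropped).
SITES = [
    "GTAACAAT", "GTAAACAT", "GTAAACAA", "GTAAACAA", "GTAAACAT",
    "GTAAACAA", "GTAAACAC", "GTCAACAG", "GTAAACAT", "GTAAACAA",
    "GTAAACAT", "TTAAGTAA", "ATAAACAA", "CTAAACAG", "GTAAACAT",
    "GTAAACAA", "GTAAACAT", "GTAAACAC", "GTAAACAT", "GTAAACAG",
]


def build_pfm(sites):
    """Count how often each base occurs at each aligned position."""
    width = len(sites[0])
    columns = [Counter(site[i] for site in sites) for i in range(width)]
    return {base: [col[base] for col in columns] for base in "ACGT"}


pfm = build_pfm(SITES)
for base in "ACGT":
    print(base, pfm[base])
```

Each column sums to 20, the number of sites, which is a quick sanity check on any PFM built this way.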

Scoring putative TFBS sequences

PFM

A  [ 1  0 19 20 18  1 20  7 ]
C  [ 1  0  1  0  1 18  0  2 ]
G  [17  0  0  0  1  0  0  3 ]
T  [ 1 20  0  0  0  1  0  8 ]

PWM – Position Weight Matrix (aka PSSM – Position Specific Scoring Matrix), derived from the PFM

A  [-1.5 -2.5  1.7  1.8  1.6 -1.5  1.8  0.4 ]
C  [-1.5 -2.5 -1.5 -2.5 -1.5  1.6 -2.5 -1.0 ]
G  [ 1.6 -2.5 -2.5 -2.5 -1.5 -2.5 -2.5 -0.6 ]
T  [-1.5  1.8 -2.5 -2.5 -2.5 -1.5 -2.5  0.6 ]

To score a putative site, sum the PWM score at each position.

Sequence: A C G A G T T A A A C A A G C T A

Score = 9.2 (for the window TTAAACAA: -1.5 + 1.8 + 1.7 + 1.8 + 1.6 + 1.6 + 1.8 + 0.4)
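The PFM-to-PWM conversion and the scoring step can be sketched in plain Python. The slide does not say which pseudocount or background was used, so the choices below are assumptions: a uniform background of 0.25 and a pseudocount of sqrt(N) × background, where N is the number of sites. With those assumptions the log2-odds values round to the PWM shown on the slide. Only the forward strand is scanned, and the function names are ours.

```python
import math

PFM = {
    "A": [1, 0, 19, 20, 18, 1, 20, 7],
    "C": [1, 0, 1, 0, 1, 18, 0, 2],
    "G": [17, 0, 0, 0, 1, 0, 0, 3],
    "T": [1, 20, 0, 0, 0, 1, 0, 8],
}
BACKGROUND = 0.25  # assumption: uniform base composition


def pfm_to_pwm(pfm, bg=BACKGROUND):
    """Convert counts to log2-odds weights with a sqrt(N) pseudocount (assumed)."""
    n_sites = sum(pfm[base][0] for base in "ACGT")
    root = math.sqrt(n_sites)
    return {
        base: [math.log2((count + root * bg) / (n_sites + root) / bg)
               for count in pfm[base]]
        for base in "ACGT"
    }


def score(pwm, site):
    """Sum the weight of the observed base at each position of the site."""
    return sum(pwm[base][i] for i, base in enumerate(site))


def best_window(pwm, seq):
    """Return (score, start) of the best forward-strand window in seq."""
    width = len(pwm["A"])
    return max((score(pwm, seq[s:s + width]), s)
               for s in range(len(seq) - width + 1))


pwm = pfm_to_pwm(PFM)
best_score, start = best_window(pwm, "ACGAGTTAAACAAGCTA")
print(round(best_score, 1), start)
```

A full scan would normally also score the reverse complement of each window; that is left out here to keep the sketch short.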


Overview of the JASPAR 2014 database


JASPAR Biopython modules

➢ Bio.motifs.jaspar

➢ Read / write motifs encoded in the JASPAR flat file formats: sites, PFM and jaspar

➢ Bio.motifs.jaspar.db

➢ Search / fetch motifs from a JASPAR formatted database.

http://biopython.org*

*Cock et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009 Jun 1;25(11):1422-3. PMID: 19304878

Extend Biopython's Bio.motifs module to support construction of TFBS matrices from JASPAR supported formats.


Constructing a matrix from a JASPAR sites formatted file

The JASPAR sites format consists of a list of known binding sites for a motif.
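The code shown on the original slide did not survive extraction. A hedged sketch of reading a sites file with Biopython's `Bio.motifs.read` might look like this. The header wording in the sample text is illustrative only, and the three sequences are FOXD1 sites from the earlier slide. Biopython is imported inside the function so the sample data can be inspected without it installed.

```python
import io

# Illustrative sites-format text: one header line plus one sequence line per
# site. Upper-case letters mark the site itself; the header text is made up.
SITES_TEXT = """\
>FOXD1 1
gctaaGTAACAATgcgca
>FOXD1 2
cttaaGTAAACATcgctc
>FOXD1 3
ccaatGTAAACAAacgga
"""


def read_sites(text):
    """Build a single motif from sites-format text (requires Biopython)."""
    from Bio import motifs  # third-party, imported lazily

    return motifs.read(io.StringIO(text), "sites")
```

With Biopython available, `read_sites(SITES_TEXT).counts` should give the frequency matrix built from the upper-case portion of each site. For a file on disk, pass an open file handle to `motifs.read` instead of the `StringIO` wrapper.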


Constructing a matrix from a JASPAR pfm formatted file

The JASPAR pfm format simply describes a frequency matrix for a single motif.
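As a sketch under the same assumptions (Biopython installed, code from the original slide not recoverable), a pfm file holds four rows of counts in A, C, G, T order and can be read with `Bio.motifs.read`. The sample rows are the FOXD1 PFM from the earlier slide; the helper name `read_pfm` is ours.

```python
import io

# Four whitespace-separated rows of counts, in A, C, G, T order.
PFM_TEXT = """\
 1  0 19 20 18  1 20  7
 1  0  1  0  1 18  0  2
17  0  0  0  1  0  0  3
 1 20  0  0  0  1  0  8
"""


def read_pfm(text):
    """Build a single motif from pfm-format text (requires Biopython)."""
    from Bio import motifs  # third-party, imported lazily

    return motifs.read(io.StringIO(text), "pfm")
```

Because the pfm format carries no header, the resulting motif has no name or matrix ID; only the counts are populated.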


Constructing matrices from a JASPAR jaspar formatted file

The JASPAR jaspar format allows for multiple motifs. Each record consists of a header line followed by four lines defining the frequency matrix.

Note the use of the parse method rather than the read method to read multiple motifs.
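A hedged sketch of the multi-motif case: `Bio.motifs.parse` returns a list-like record of motifs, whereas `Bio.motifs.read` expects exactly one. The two records below are illustrative; the IDs `EX0001.1` and `EX0002.1` and the second matrix are placeholders, not real JASPAR entries.

```python
import io

# Two records: a header line (">ID NAME") followed by four count rows each.
# IDs and the second matrix are placeholders for illustration only.
JASPAR_TEXT = """\
>EX0001.1 FOXD1
A  [ 1  0 19 20 18  1 20  7 ]
C  [ 1  0  1  0  1 18  0  2 ]
G  [17  0  0  0  1  0  0  3 ]
T  [ 1 20  0  0  0  1  0  8 ]
>EX0002.1 toy
A  [ 4  0  1 ]
C  [ 0  5  1 ]
G  [ 1  0  2 ]
T  [ 0  0  1 ]
"""


def parse_jaspar(text):
    """Return all motifs found in jaspar-format text (requires Biopython)."""
    from Bio import motifs  # third-party, imported lazily

    return motifs.parse(io.StringIO(text), "jaspar")
```

With Biopython available, iterating over `parse_jaspar(JASPAR_TEXT)` should yield one motif per record, with `matrix_id` and `name` taken from the header line.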


Constructing matrices from a JASPAR jaspar formatted file cont'd

The frequency portions of the file can be specified in a simpler format identical to the pfm format.


The JASPAR DB module

Connect to a JASPAR database:

Modelled after the Perl TFBS modules*.

Specifically, the Bio.motifs.jaspar.db.JASPAR5 BioPython class is modelled after the TFBS::DB::JASPAR5 Perl class.

Fetch a specific motif by its JASPAR ID:

* Lenhard et al. TFBS: Computational framework for transcription factor binding site analysis. Bioinformatics. 2002 PMID 12176838
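The connection and fetch code from the original slide was lost in extraction. A minimal sketch using the `JASPAR5` class could look like the following. The connection settings are placeholders to be replaced with your own, and `MA0004` is used only as an example ID. The import sits inside `connect` because it needs Biopython plus a MySQL driver.

```python
# Placeholder connection settings: replace with your own values.
DB_SETTINGS = {
    "host": "yourhostname",
    "name": "yourdatabase",
    "user": "yourusername",
    "password": "yourpassword",
}


def connect(settings=DB_SETTINGS):
    """Open a connection to a JASPAR-formatted MySQL database."""
    # Imported lazily: needs Biopython and the MySQLdb driver.
    from Bio.motifs.jaspar.db import JASPAR5

    return JASPAR5(**settings)


def fetch_by_id(jdb, jaspar_id="MA0004"):
    """Fetch a single motif by its JASPAR ID from an open connection."""
    return jdb.fetch_motif_by_id(jaspar_id)
```

Typical use would be `jdb = connect()` followed by `motif = fetch_by_id(jdb)`, after which the motif can be printed or scored like any other Bio.motifs motif.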

JASPAR DB module cont'd

Fetch multiple motifs according to various attributes.

Example: fetch the motifs of all the vertebrate and insect transcription factors from the CORE JASPAR collection which are part of the Forkhead family and which have an information content of at least 12 bits:

Note that selection criteria (such as 'tax_group' and 'tf_family') which allow multiple values may be specified either as a single value or as a list of values.
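The query described in this example can be sketched as keyword arguments to `JASPAR5.fetch_motifs`. The criteria below follow the wording of the slide; `jdb` is assumed to be an already open `JASPAR5` connection, and the helper name is ours.

```python
# Selection criteria from the slide. 'tax_group' is given as a list and
# 'tf_family' as a single value, to show that both forms are accepted.
FORKHEAD_CRITERIA = {
    "collection": "CORE",
    "tax_group": ["vertebrates", "insects"],
    "tf_family": "Forkhead",
    "min_ic": 12,
}


def fetch_forkhead_motifs(jdb, criteria=FORKHEAD_CRITERIA):
    """Fetch all motifs matching the criteria from an open JASPAR5 connection."""
    return jdb.fetch_motifs(**criteria)
```

The call returns a collection of motifs that can be iterated over, for example to print each motif's name and matrix ID.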


For more information...

For an overview and examples of using these modules, please see the JASPAR sub-section under the “Reading motifs” section of the BioPython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html

For more technical information see the Bio.motifs.jaspar section of the BioPython API docs: http://biopython.org/DIST/docs/api


MANTA

MongoDB for Analysis of TFBS Alteration

Mathelier et al. Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas. Genome Biology. 2015. PMID 25903198


MANTA

DB

...gctaaGTAACAATgcgca...

...cttaaGTAAACATcgctc...

...ccaatGTAAACAAacgga...

Adapted from Szalkowski and Schmid (2010). Briefings in Bioinformatics.

MANTA Statistics

ChIP-seq experiments 477

Transcription factors 103

TFBSs 9,510,336

Unique bases covered 76,160,599 (~2.25% of the human genome)


Variations may impact TF binding

[Figure: two panels, each showing the TF, the 5’ UTR and an exon.]

Binding sequence: the TF recognizes the binding site and transcription is initiated.

AGCTAGCTATATTTAAACAACACTGTCTAGCATTGCCTGATAGATGAGCCGTCGCAGCTGGA

Mutated binding sequence: the TF fails to recognize the binding site and transcription fails to initiate.

AGCTAGCTATATTTAATCCACACTGTCTAGCATTGCCTGATAGATGAGCCGTCGCAGCTGGA

Assessing the impact of variations on TF binding

[Figure, built up over several slides: schematic of the DNA with a TFBS and an SNV.]

Record best TFBS hit with the mutated sequence.

[Figure: density plot of the alt/ref ratio. x-axis: alt/ref (0.80 to 1.10); y-axis: Density (0 to 7); the "Alternative" is marked on the plot.]
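The slides do not spell out how the alt/ref value is computed, so the following is only a sketch under stated assumptions: scores are normalised to relative scores between 0 and 1, the reference score is that of the known TFBS, and the alternative score is the best forward-strand hit overlapping the SNV in the mutated sequence. The PWM values are the rounded ones from the scoring slide, and all function names are ours.

```python
# Rounded PWM values copied from the scoring slide.
PWM = {
    "A": [-1.5, -2.5, 1.7, 1.8, 1.6, -1.5, 1.8, 0.4],
    "C": [-1.5, -2.5, -1.5, -2.5, -1.5, 1.6, -2.5, -1.0],
    "G": [1.6, -2.5, -2.5, -2.5, -1.5, -2.5, -2.5, -0.6],
    "T": [-1.5, 1.8, -2.5, -2.5, -2.5, -1.5, -2.5, 0.6],
}
WIDTH = 8
MAX_SCORE = sum(max(PWM[b][i] for b in "ACGT") for i in range(WIDTH))
MIN_SCORE = sum(min(PWM[b][i] for b in "ACGT") for i in range(WIDTH))


def relative_score(site):
    """Raw PWM score rescaled to the 0..1 range (an assumed normalisation)."""
    raw = sum(PWM[base][i] for i, base in enumerate(site))
    return (raw - MIN_SCORE) / (MAX_SCORE - MIN_SCORE)


def best_hit(seq, snv_pos):
    """Best relative score over forward-strand windows covering snv_pos."""
    first = max(0, snv_pos - WIDTH + 1)
    last = min(snv_pos, len(seq) - WIDTH)
    return max(relative_score(seq[s:s + WIDTH]) for s in range(first, last + 1))


def impact_ratio(seq, tfbs_start, snv_pos, alt):
    """Ratio of the best mutated-sequence hit to the reference TFBS score."""
    ref_score = relative_score(seq[tfbs_start:tfbs_start + WIDTH])
    mutated = seq[:snv_pos] + alt + seq[snv_pos + 1:]
    return best_hit(mutated, snv_pos) / ref_score


SEQ = "ACGAGTTAAACAAGCTA"  # the TFBS TTAAACAA starts at index 5
weakening = impact_ratio(SEQ, 5, 6, "A")       # T>A at the second site position
strengthening = impact_ratio(SEQ, 5, 12, "T")  # A>T at the last site position
print(round(weakening, 2), round(strengthening, 2))
```

A ratio below 1 indicates that the alternative allele weakens the best hit, and a ratio above 1 that it strengthens it, which matches the way the density plot is centred around 1.00.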


Example of Application of MANTA

Mathelier et al. Cis-regulatory somatic mutations and gene-expression alteration in B-cell lymphomas. Genome Biology. 2015. PMID


The MANTA Database

Implemented with MongoDB (http://www.mongodb.org)

Consists of 3 collections:

Experiments

- experiment name, type, TF name, JASPAR matrix ID, etc.

Peaks

- peak position (chromosome, start, end), score, position of maximum peak height, etc.

TFBSs / SNVs

- position (chromosome, start, end), strand, score for the unmutated TFBS plus similar information and impact score for each position / alt. allele mutation.


MANTA DB with Python

Example: connect to MANTA DB and fetch all TFBS affected by an SNV at position 6425005 on chromosome 19.
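The query code from the original slide is not recoverable, and the slides do not give the MANTA schema beyond the collection summary. The sketch below therefore uses hypothetical names throughout: the database name `manta`, the collection name `tfbs_snvs`, the field names `chrom`, `start` and `end`, and the chromosome value `"19"` are all assumptions. Only the pymongo calls (`MongoClient`, `find`) are real API.

```python
def build_snv_query(chrom, position):
    """MongoDB filter for TFBSs whose interval covers the given position.

    The field names "chrom", "start" and "end" are hypothetical.
    """
    return {
        "chrom": chrom,
        "start": {"$lte": position},
        "end": {"$gte": position},
    }


def fetch_affected_tfbs(query, host="localhost", port=27017,
                        db_name="manta", collection="tfbs_snvs"):
    """Run the query against a MANTA-like MongoDB (requires pymongo)."""
    from pymongo import MongoClient  # third-party, imported lazily

    client = MongoClient(host, port)
    try:
        return list(client[db_name][collection].find(query))
    finally:
        client.close()


query = build_snv_query("19", 6425005)
print(query)
```

Against a real MANTA instance, the names above would need to be replaced with those used in the actual database; the source code repository listed on the web interface slide is the place to check them.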


MANTA Web Interface

URL: http://manta.cmmt.ubc.ca/manta

Source code: https://github.com/wassermanlab/MANTA


Thanks!

Any questions?

Contacts:
Anthony Mathelier, [email protected]
David Arenillas, [email protected]

URLs:
Wasserman Lab: www.cisreg.ca
BioPython: http://biopython.org
MANTA: manta.cmmt.ubc.ca/manta