IPRStats: a Visualization Tool for InterProScan Iddo Friedberg Microbiology and Computer Science & Software Engineering Miami University http://github.com/devrkel/IPRStats.git


Page 1: Friedberg bosc2010 iprstats

IPRStats: a Visualization Tool for InterProScan

Iddo Friedberg
Microbiology and Computer Science & Software Engineering
Miami University
http://github.com/devrkel/IPRStats.git

Page 2: Friedberg bosc2010 iprstats

Microbes are Everywhere

● 10^30 prokaryotic cells on Earth (give or take a couple)
● Dominate the biosphere
● 90% of the cells in your body are prokaryotic (10^14)
● Found in the most hostile environments

Page 3: Friedberg bosc2010 iprstats

Microbes do Everything

● Nutrient reservoir:
  ● 4×10^10 tons carbon (rivaling plants)
  ● 1×10^10 tons nitrogen
  ● 1×10^9 tons phosphorus

almost

Page 4: Friedberg bosc2010 iprstats

Of course there is health...

● Communicable diseases
● Heart disease
● Gastric cancer
● Irritable Bowel Syndrome

Page 5: Friedberg bosc2010 iprstats

...and Wellness

Page 6: Friedberg bosc2010 iprstats

Microbial Genomics

Phage phi-X174 1978: 5.5Kbp

H. influenzae 1995: 1.7Mbp

Page 7: Friedberg bosc2010 iprstats

Classic microbial genomics

Page 10: Friedberg bosc2010 iprstats

Microbes live in Communities & only 1% can be cultured

Page 11: Friedberg bosc2010 iprstats

What is Metagenomics?

• Culture-independent approach to study microbial communities
  – < 1% of microbes can be cultured
  – DNA directly isolated from environmental sample and sequenced
• Examining genomic content of organisms in community/environment to better understand:
  – Diversity of organisms
  – Their roles and interactions in the ecosystem

Page 12: Friedberg bosc2010 iprstats

Metagenomics is the Application of Genomics to Communities

Page 13: Friedberg bosc2010 iprstats

Some things we can learn using Metagenomics

● Taxonomic content: Taxon diversity in a habitat (using taxonomic markers)

• Functional content: biological functions, qualitative and quantitative profiles

• Coping with the environment: differences in functional content between habitats

• Decompose the biotic / abiotic elements in a habitat: metadata analysis

Page 14: Friedberg bosc2010 iprstats

A Metagenomic project

● Sequencing
● Assembly
● Diversity analysis
● Annotation
  ● Gene finding
  ● Function prediction
● Diversity analysis
● Comparative analysis

Page 16: Friedberg bosc2010 iprstats

A Metagenomic project

● Sequencing
● Assembly
● Annotation
  ● Gene finding
  ● Function prediction
● Diversity analysis
● Comparative analysis

Population analysis tools

Page 17: Friedberg bosc2010 iprstats

InterProScan

● Signature search against an integrated resource of domains and functional sites

● Easy to install, cluster-enabled (pleasantly parallel)

● Maintained by EBI

● Can annotate whole genomes

● PIR, Pfam, TIGRFAMs, PANTHER, ProDom, PRINTS, ...

● Needs a visualization tool for population / metagenomic annotation

Page 18: Friedberg bosc2010 iprstats

IPRStats

[Diagram: IPRStats main window and data flow]

● Menu: File, Help
● Databases: PFAM, PIR, GENE3D, HAMAP, PANTHER, PRINTS, PRODOM, PROFILE, PROSITE, SMART, SUPERFAMILY, TIGRFAMs
● Diagram labels: Open XML file, Python SAX Parser, Aggregate Queries, Resulting Tables, Charting, Full Databases
● GUI: wxPython
● Excel export: xlwt
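
The parse-then-aggregate flow named on this slide (Python SAX Parser, Aggregate Queries, Resulting Tables) can be sketched roughly as below. This is a minimal illustration, not the IPRStats source: the XML element and attribute names (`protein`, `match`, `dbname`, `id`), the sample document, the `matches` table layout, and the function names are all assumptions.

```python
import sqlite3
import xml.sax

# Hypothetical sample input. The element and attribute names are assumed
# for illustration and may differ from real InterProScan output.
SAMPLE_XML = b"""<?xml version="1.0"?>
<interpro_matches>
  <protein id="seq1" length="120">
    <interpro id="IPR000001" name="Kringle" type="Domain">
      <match id="PF00051" name="Kringle" dbname="PFAM"/>
      <match id="SM00130" name="KR" dbname="SMART"/>
    </interpro>
  </protein>
  <protein id="seq2" length="95">
    <interpro id="IPR000001" name="Kringle" type="Domain">
      <match id="PF00051" name="Kringle" dbname="PFAM"/>
    </interpro>
  </protein>
</interpro_matches>
"""


class MatchHandler(xml.sax.ContentHandler):
    """Streams <match> elements into a SQLite table, one row per match."""

    def __init__(self, conn):
        super().__init__()
        self.conn = conn
        self.protein = None  # id of the <protein> currently being read

    def startElement(self, name, attrs):
        if name == "protein":
            self.protein = attrs.get("id")
        elif name == "match":
            self.conn.execute(
                "INSERT INTO matches VALUES (?, ?, ?)",
                (self.protein, attrs.get("dbname"), attrs.get("id")),
            )


def load_matches(xml_bytes):
    """Parse the XML and return an in-memory database holding the matches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches (protein TEXT, dbname TEXT, signature TEXT)"
    )
    xml.sax.parseString(xml_bytes, MatchHandler(conn))
    conn.commit()
    return conn


def signature_counts(conn, dbname):
    """Aggregate query: number of hits per signature in one member database."""
    cur = conn.execute(
        "SELECT signature, COUNT(*) FROM matches WHERE dbname = ? "
        "GROUP BY signature ORDER BY COUNT(*) DESC, signature",
        (dbname,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    db = load_matches(SAMPLE_XML)
    print(signature_counts(db, "PFAM"))
```

A SAX handler sees one element at a time, so the whole XML file never has to sit in memory, which matters for population-scale annotation files.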

Page 19: Friedberg bosc2010 iprstats

IPRStats Architecture

[Diagram: IPRStats components]

● IPRStats(wx.Frame)
● Menu(wx.MenuBar)
● PropertiesDlg(wx.Dialog)
● Table(wx.PyGridTableBase)
● Chart(wx.StaticBitmap)
● Results(sqlite or pytables)
● StatsData
● Settings
● IPS importers: XML
● IPS exporters: standalone HTML, XLS(using xlwt)
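
As one illustration of the exporter side of this architecture, a standalone HTML export of a results table might look something like the sketch below. The function name `export_html` and its parameters are hypothetical and are not taken from the IPRStats code; only the Python standard library is used.

```python
import html


def export_html(title, header, rows):
    """Return one self-contained HTML page (no external CSS or scripts)."""
    head_cells = "".join(f"<th>{html.escape(str(h))}</th>" for h in header)
    body_rows = "\n".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(cell))}</td>" for cell in row)
        + "</tr>"
        for row in rows
    )
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><h1>{html.escape(title)}</h1>\n"
        f'<table border="1">\n<tr>{head_cells}</tr>\n{body_rows}\n</table>\n'
        "</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical rows: (signature id, hit count)
    print(export_html("PFAM", ["Signature", "Count"], [("PF00051", 2)]))
```

Escaping every cell keeps signature names that contain `<` or `&` from breaking the generated page.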

Page 20: Friedberg bosc2010 iprstats

What is PyTables?

- Package for creating data structures that can handle large amounts of data
- Uses NumPy (for in-memory) and HDF5 (for disk storage) structures
- Uses Numexpr (JIT compiler) for evaluating expressions (like queries)
- In the context of IPRScan, it provides a way of accessing a huge table of data without requiring that all the data be in memory

Pros
- HDF5 provides very fast, compact and efficient indexing
- NumPy provides efficient in-memory storage
- Minimizes disk and memory usage
- Very fast read times compared to SQLite and MySQL

Cons
- Large memory overhead (particularly in comparison to smaller datasets)
- Many large, complex dependencies including HDF5, NumPy, Numexpr and Cython
- Slow write times (particularly important since IPRStats bottlenecks with writing)
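
To make the out-of-memory access described above concrete, here is a small sketch of writing match records to an HDF5 table with PyTables and querying them with an expression. It assumes the third-party `tables` package (PyTables 3.x) is installed; the `Match` column layout and the helper names are invented for illustration and are not the IPRStats schema.

```python
import os
import tempfile

import tables  # PyTables: third-party, requires the HDF5 library


class Match(tables.IsDescription):
    """One signature match. Column names and widths are illustrative."""
    protein = tables.StringCol(32)
    dbname = tables.StringCol(16)
    signature = tables.StringCol(16)


def build_table(path, records):
    """Write (protein, dbname, signature) string tuples to an HDF5 table."""
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "matches", Match)
        row = table.row
        for protein, dbname, signature in records:
            # String columns hold bytes, so encode explicitly.
            row["protein"] = protein.encode()
            row["dbname"] = dbname.encode()
            row["signature"] = signature.encode()
            row.append()
        table.flush()


def signatures_for(path, dbname):
    """Return the sorted signatures recorded for one member database."""
    with tables.open_file(path, mode="r") as h5:
        table = h5.root.matches
        # The condition is evaluated on disk chunks rather than on a fully
        # loaded table; `target` is supplied through condvars.
        hits = table.where(
            "dbname == target", condvars={"target": dbname.encode()}
        )
        return sorted(r["signature"].decode() for r in hits)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "matches.h5")
    build_table(demo_path, [("seq1", "PFAM", "PF00051")])
    print(signatures_for(demo_path, "PFAM"))
```

The query reads only matching rows from disk, which is the property the slide highlights; the write loop is also where the slow-write cost listed under Cons would show up.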

Page 21: Friedberg bosc2010 iprstats

Multiple graph formats

Pie charts

Bar graphs

Page 22: Friedberg bosc2010 iprstats

Page 23: Friedberg bosc2010 iprstats

Page 24: Friedberg bosc2010 iprstats

Conclusions & Future

● A lightweight, machine-independent visualization tool for InterProScan annotations

● License: AFL
● Todo:
  ● Comparative population analysis
  ● Large dataset handling
  ● More graphic options
  ● Anything else you like...

– http://github.com/devrkel/IPRStats.git

Page 25: Friedberg bosc2010 iprstats

Thanks

● David Ream
● Han Wang
● Ian Fleming
● David Vincent
● Ryan Kelly
● EBI
● Miami University startup funding
● Miami University Undergraduate Summer Scholars Program

Page 26: Friedberg bosc2010 iprstats

The Friedberg Lab is Recruiting

● Graduate students
● Postdocs
● Catch me later, email me, or look at iddo-friedberg.net to learn more