Gautier bosc2010 pythonbioconductor

  • Bioconductor with Python, What else?

    ISMB / BOSC

    Laurent Gautier [laurent@cbs.dtu.dk]

    DMAC / CBS

    July 10th, 2010

    1 / 20

  • Disclaimer

    This is not about the comparative merits of scripting languages.

    This is about being able to natively access libraries implemented in a different language.

    2 / 20

  • About Bioconductor

    Set of open-source packages for R
    Started circa 2002 with a focus on microarrays
    Rooted in statistics, data analysis, and visualization
    Several hundred packages, addressing NGS, HTS, flow cytometry, protein-protein interactions, ...
    Biannual releases
    Presence on the publication circuit (> 2,300 citations for the BioC publication, > 600 for limma, > 500 for affy)

    3 / 20

  • About Python

    Simple and clear all-purpose scripting language
    Sometimes used in introductions to programming
    Popular for agile development

    Bioinformatics libraries:
        biopython (libraries for bioinformatics)
        galaxy (web front-end to pipelines)
        PyCogent, pygr, bx-python (biological sequences-oriented)

    Large selection of libraries:
        Web development: Zope, Django, Google App Engine
        Scientific computing: Scipy / Numpy
        Cloud computing: Disco, execnet
        Interface with C: ctypes, Cython

    4 / 20

  • A view on R/bioconductor and Python in bioinformatics

    [Figure: diagram with the following labels.
    Bioinformatics data: automation, storage / retrieval, samples, microarray, NGS, annotation, flow-cytometry, proteomics, other assays...
    R/Bioconductor: statistical analysis, visualization, interactive programming.
    Python: non-interactive abilities, data storage / retrieval, web, algorithm development, scientific computing.
    Communities: computer scientists, physicists, biologists, statisticians.]

    Python is an all-purpose scripting language.

    5 / 20

  • Running R code from Python (an example)

    Aim: Running edgeR from Python

    Method: Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140

    Data (read counts per lane; lanes belong to the Control and Treated groups):

                    lane1 lane2 lane3 lane4 lane5 lane6 lane8
    ENSG00000230758     0     0     1     0     0     0     0
    ENSG00000182463     0     2     4     1     5     5     0
    ENSG00000124208    82   124   102   136    90   120    40
    ENSG00000230753     0     0     0     3     0     0     0
    ENSG00000224628     7     8     8    18     8     7     1
    ENSG00000125835   138   209   227   295   281   220    54
    ENSG00000125834    25    31    48    56    67    61    15
    ENSG00000197818    17    27    16    26    41    39     9
    ENSG00000243473     0     0     0     2     0     0     0
    ENSG00000226325     0     0     2     0     3     1     0
    ...

    7 / 20

  • from rpy2.robjects.packages import importr
    from bioc import edger

    # R's "base" package, needed for colSums
    base = importr("base")

    # counts: table of read counts; grp: group (Control / Treated) of each lane
    summarized = edger.DGEList.new(counts = counts,
                                   lib_size = base.colSums(counts),
                                   group = grp)

    disp = edger.estimateCommonDisp(summarized)

    tested = edger.exactTest(disp)

    results = edger.topTags(tested)

                     logConc  logFC PValue   FDR
    ENSG00000127954   -31.03  37.97   0.00  0.00
    ENSG00000151503   -12.96   5.40   0.00  0.00
    ENSG00000096060   -11.78   4.90   0.00  0.00
    ENSG00000091879   -15.36   5.77   0.00  0.00
    ENSG00000132437   -14.15  -5.90   0.00  0.00
    ENSG00000166451   -12.62   4.57   0.00  0.00
    ENSG00000131016   -14.80   5.27   0.00  0.00
    ENSG00000163492   -17.28   7.30   0.00  0.00
    ENSG00000113594   -12.25   4.05   0.00  0.00
    ENSG00000116285   -13.02   4.11   0.00  0.00
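
The `lib_size` argument passed to `DGEList` is the total count per lane, which is what R's `colSums` returns for a count table. Below is a minimal pure-Python sketch of that computation (not the rpy2 call itself), using three rows from the count table excerpt. The names `lanes`, `column_sums` and the dictionary layout are illustrative assumptions, not part of rpy2 or edgeR.

```python
# Three genes taken from the count table excerpt (one count per lane).
lanes = ["lane1", "lane2", "lane3", "lane4", "lane5", "lane6", "lane8"]
counts = {
    "ENSG00000124208": [82, 124, 102, 136, 90, 120, 40],
    "ENSG00000125835": [138, 209, 227, 295, 281, 220, 54],
    "ENSG00000125834": [25, 31, 48, 56, 67, 61, 15],
}


def column_sums(table):
    """Sum each column of a {row_name: [values]} table, like R's colSums."""
    return [sum(column) for column in zip(*table.values())]


# Map each lane name to its total count over the three genes.
lib_size = dict(zip(lanes, column_sums(counts)))
print(lib_size)
```

On the full table the sums run over all genes, so the real library sizes are much larger than these three-row totals.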

    8 / 20

  • R code / Python code

    library(edgeR)
    summarized

  • Bioconductor library IRanges

    10 / 20

  • Bioconductor library Biostrings

    11 / 20

  • Separate communities

    12 / 20

  • Bilingual community

    13 / 20

  • Interpreters/Translators

    14 / 20

  • Cost of translation

    R package       Python module      Lines of code
    AnnotationDbi   annotationdbi.py             168
    Biobase         biobase.py                   341
    Biostrings      biostrings.py                591
    BSgenome        bsgenome.py                  112
    edgeR           edger.py                     107
    GEOquery        geoquery.py                  102
    GGbase          ggbase.py                    104
    GGtools         ggtools.py                    77
    goseq           goseq.py                      43
    GSEABase        gseabase.py                  149
    IRanges         iranges.py                   295
    ShortRead       shortread.py                 301
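
To put the table in perspective, the line counts can be totalled. This is a short sketch using only the numbers listed in the table; the variable names are illustrative.

```python
# Lines of code per Python module, as listed in the table.
lines_of_code = {
    "annotationdbi.py": 168,
    "biobase.py": 341,
    "biostrings.py": 591,
    "bsgenome.py": 112,
    "edger.py": 107,
    "geoquery.py": 102,
    "ggbase.py": 104,
    "ggtools.py": 77,
    "goseq.py": 43,
    "gseabase.py": 149,
    "iranges.py": 295,
    "shortread.py": 301,
}

total = sum(lines_of_code.values())
mean = total / len(lines_of_code)
largest = max(lines_of_code, key=lines_of_code.get)

print(f"{len(lines_of_code)} modules, {total} lines in total")
print(f"about {mean:.0f} lines per module, largest is {largest}")
```

The twelve modules add up to 2,390 lines, roughly 200 lines per wrapped package.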

    15 / 20

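  • R within Python

    R runs embedded in Python
    R objects remain in the R workspace, but can be accessed from Python
    Python-level shells to access the R objects
    The rpy2 package is used to achieve this

    from rpy2.robjects.packages import importr
    from rpy2.robjects import conversion

    biostrings = importr("Biostrings")

    # XString and _setExtractDelegators are defined elsewhere in the same module
    class AAString(XString):

        _aastring_constructor = biostrings.AAString

        @classmethod
        def new(cls, x):
            """ :param x: a string of amino-acids """
            # convert the Python string to an R object, then call the R constructor
            res = cls(cls._aastring_constructor(conversion.py2ri(x)))
            _setExtractDelegators(res)
            return res

    # build the object through the `new` constructor defined above
    aas = AAString.new("PROTEIN")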

    16 / 20

  • What is needed to continue

    More interpreters/translators
        Many Bioconductor packages
        Keep existing translations up-to-date

    Keeping up-to-date
        Frequent API-breaking changes in Bioconductor
        Tailored interfaces increase maintenance
        Meta-programming and reflection can alleviate this

    17 / 20

  • Example with meta-programming:

    import rpy2.robjects.methods

    class AssayData(rpy2.robjects.methods.RS4):
        """ Abstract class. That class is a ClassUnionRepresentation
        in R, that is a way to create a parent class for existing
        classes. This is currently not modelled in Python. """
        __rname__ = "AssayData"
        # Python 2 metaclass syntax
        __metaclass__ = rpy2.robjects.methods.RS4_Type

        __accessors__ = (("featureNames", "Biobase", "featurenames",
                          True, "maps Biobase::featureNames"),
                         ("sampleNames", "Biobase", "samplenames",
                          True, "maps Biobase::sampleNames"),
                         ("storageMode", "Biobase", "storagemode",
                          True, "maps Biobase::storageMode"))
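
To illustrate how a metaclass can turn such an `__accessors__` declaration into attributes, here is a minimal pure-Python sketch. It is not rpy2's implementation: `AccessorType`, `FakeRObject` and `_r_call` are hypothetical stand-ins, a dictionary lookup replaces the call into R, and reading the five fields as (R function, R package, Python name, expose as property, docstring) is an assumption. It uses Python 3 metaclass syntax.

```python
class AccessorType(type):
    """Metaclass that builds one attribute per entry in ``__accessors__``."""

    def __new__(mcs, name, bases, namespace):
        for r_name, r_package, py_name, as_property, doc in namespace.get(
            "__accessors__", ()
        ):
            # Default arguments bind the current loop values to this function.
            def accessor(self, _r_name=r_name, _r_package=r_package):
                return self._r_call(_r_package, _r_name)

            accessor.__doc__ = doc
            if as_property:
                namespace[py_name] = property(accessor, doc=doc)
            else:
                namespace[py_name] = accessor
        return super().__new__(mcs, name, bases, namespace)


class FakeRObject(metaclass=AccessorType):
    """Hypothetical stand-in for an R object: results are kept in a dict."""

    def __init__(self, values):
        self._values = values

    def _r_call(self, package, function):
        # A real binding would call the R function here.
        return self._values[(package, function)]


class AssayData(FakeRObject):
    __accessors__ = (
        ("featureNames", "Biobase", "featurenames", True,
         "maps Biobase::featureNames"),
        ("sampleNames", "Biobase", "samplenames", True,
         "maps Biobase::sampleNames"),
    )


ad = AssayData({
    ("Biobase", "featureNames"): ["g1", "g2"],
    ("Biobase", "sampleNames"): ["lane1", "lane2"],
})
print(ad.featurenames, ad.samplenames)
```

With this pattern, adding an accessor is a one-line change to the declaration, which is the kind of maintenance saving the previous slide refers to.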

    18 / 20

  • Example of a complete application

    A web-server to run edgeR.

    from bottle import route, run, request, abort
    # read_count_data is assumed to live in my_edger with the other helpers
    from my_edger import get_toptags, make_results_page, read_count_data

    @route("/")
    def index():
        return

    @route("/edger", method="POST")
    def run_edger():
        # uploaded count file, sent in the form field named "data"
        data = request.files.get("data")
        if data:
            counts, grp = read_count_data(data.file.name)
            top_tags = get_toptags(counts, grp)
            return make_results_page(top_tags)
        else:
            abort(404, "Invalid count file.")

    run(host="localhost", port=8080)
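
The slide does not show `read_count_data`. As a rough sketch of what such a helper might do, the code below parses a whitespace-delimited count file into a count table and a list of group labels. The file layout (group labels on the first line, lane names on the second), the function `parse_count_lines`, and the made-up example rows are all assumptions for illustration. Converting the result into R objects for edgeR is left out.

```python
def parse_count_lines(lines):
    """Parse a whitespace-delimited count table.

    Assumed (hypothetical) layout: line 1 holds one group label per lane,
    line 2 holds the lane names, and every other line holds a gene ID
    followed by one integer count per lane.
    """
    rows = [line.split() for line in lines if line.strip()]
    grp, lanes, body = rows[0], rows[1], rows[2:]
    if len(grp) != len(lanes):
        raise ValueError("expected one group label per lane")
    counts = {}
    for fields in body:
        gene, values = fields[0], [int(v) for v in fields[1:]]
        if len(values) != len(lanes):
            raise ValueError("wrong number of counts for %s" % gene)
        counts[gene] = values
    return counts, grp


def read_count_data(path):
    """Read a count file from disk and return (counts, grp)."""
    with open(path) as handle:
        return parse_count_lines(handle)


# Made-up example data, not taken from the slides.
example = [
    "Control Control Treated",
    "lane1 lane2 lane3",
    "geneA 0 2 5",
    "geneB 7 8 1",
]
counts, grp = parse_count_lines(example)
print(grp, counts)
```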

    19 / 20

  • Acknowledgements

    Users, and communities from R, Bioconductor, Python, Biopython (Vincent Davis, Nicolas Rapin, Brad Chapman)

    URLs
    http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
    http://bitbucket.org/lgautier/rpy2-bioc-extensions
    http://packages.python.org/rpy2-bioconductor-extensions/
    http://rpy2.sourceforge.net/

    20 / 20
