Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays Jillian Aurisano Master of Science Defense April 16, 2014


1. Bacterial Gene Neighborhood Investigation Environment: A Scalable Genome Visualization for Big Displays. Jillian Aurisano, Master of Science Defense, April 16, 2014

2. Science has historically looked like this:
3. Up until very recently: Observations! Expertise. Explore, make observations. Collect samples.
4. No one looks under a microscope anymore. It's all DNA. How do scientists make discoveries?
5. How do we bring experts into the loop? From direct collection of data, direct observation of results, and direct interpretation and analysis, to automated data collection, automated filtering, and automated analysis. Need visualization to bring experts into the loop. But how do we handle big data? What's our Big Data microscope? Picard: "Computer, scan everything, run diagnostics, and tell us the answer." Computer: "Results are inconclusive."
6. Can Big Displays help? Evidence suggests that these environments can have a positive impact on perception and cognition. But how do we use them to effectively address big data problems? Can existing visualizations simply be scaled up to fit, or are new approaches needed?
7. In this thesis I will examine a specific big data visualization problem: comparative gene neighborhood analysis in bacterial genomics. I worked closely over several years with a team of computational biologists. This work has led to the design and implementation of a new visualization approach designed to scale to big data and big displays: BactoGeNIE (Bact(o)erial Gene Neighborhood Investigation Environment).
8. Outline: 1) Describe comparative bacterial gene neighborhood analysis to understand how to bring experts into the loop. 2) Examine the potential impact of Big Displays on Big Data visualization. 3) Evaluate scalability in existing comparative genomics visualizations. My work, BactoGeNIE: 4/5/6) Describe my design, implementation, and results. 7) Think about the future. In the process, learn something about scaling up visual approaches to big data and big displays.
9. Warning: Biology is used in this thesis!
10. Genome sequencing boom. Sequencing costs are decreasing faster than Moore's Law, so we are able to produce massive volumes of sequence data. Bacterial genomes are small, so we are generating thousands of complete bacterial genome sequences. (Wetterstrand K.A., DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program, 2012)
11. What is a genome? What is a gene? Genomes consist of one or more long molecules of DNA. DNA consists of chained nucleotide molecules (A, C, T, G), also called base pairs. All the genes in an organism are in its genome. Genes determine traits in an organism. Genes code for proteins, and proteins do the work to make traits happen.
12. How are genomes sequenced? Sequencing, Assembly, Annotation. Output: genome feature files, raw sequence files. (Michael Schatz, Cold Spring Harbor)
13. Lots of genome sequences -> opportunity. Big challenge: it is hard to figure out what a novel gene does. Traditionally: do wet-lab research to figure it out, but this is expensive and time-consuming. Alternatively: sequence the gene and use computational methods to predict the function of the protein; if the gene is novel, this may not provide an answer. Can complete genome sequences help? Comparative gene neighborhood analysis.
14. From genome structure to gene-product function. In bacteria, genes whose products are involved in similar functions are often placed close to each other in the genome. Research suggests that it is possible to predict gene-product function in bacteria based on commonly recurring gene neighbors. But we need to examine lots of genomes for statistical significance. [Figure: gene1, gene2, gene3, gene4 linked to an unknown biological process]
15. Comparing gene neighborhoods across different genomes. Genes with similar sequences likely produce proteins with similar functions. Orthologs: similar genes from different genomes. Algorithms exist to compare genes between different genomes. (DeMeo et al. BMC Molecular Biology 2008 9:2 doi:10.1186/1471-2199-9-2)
16. Role for visualization in this problem. Why not use automated methods to find common sets of genes around gene targets? Why visualization? 3 Es: Exploration, Expertise, Errors.
17. Exploration: find patterns and anomalies without knowing in advance what you are looking for. Automated methods: target gene B; common subsequences in strains 1, 2, 3: {A, B, C, D}. [Figure panels: Duplication, Truncation, Deletion, Inversion, each showing gene order in strains 1, 2, and 3]
18. Expertise: experts make connections that will be missed by automated methods. Not just the anomaly, but the significance of the anomaly. Knowledge about strains and protein families is involved in finding significant anomalies. [Figure: Strain A, Strain B, Strain C]
19. Errors: verify automated methods. Uncertainty and errors in data generation. [Figure: two examples comparing data, automated output, and ground truth for strains 1, 2, and 3. In one, automated methods report common subsequences {A, B, C, D} for strains 1 and 3 and {A, D} for strain 2; in the other, {A, B, C, D} for strains 1 and 3 and {A, B} for strain 2. Panel labels: Breaks in assembly, Missed gene boundaries]
20. To address this problem, visualization must help bring experts into the data mining loop: 1) help experts identify sources of error, 2) allow experts to explore the data, 3) enable researchers to integrate expertise into data analysis. So an overview visualization is not enough; we need gene-neighborhood details. Visualization must scale to enable comparisons between hundreds to thousands of genomes.
21. Big displays: Opportunity for big data? The question is: can these environments be used to visualize big data sets better? Evidence suggests yes: physical navigation over virtual navigation; reduced need to pan and zoom; reduced need for context switching; utilize embodied cognition; multiple levels of detail accessible through physical movement; externalize more information that can be accessed simultaneously. (Lance Long)
22. Porting from small to big displays. Maybe porting genome visualizations to these environments is sufficient? Ruddle2013: export high-resolution graphical output from existing genomics visualizations and display these large images on a big display. There is evidence that this had a positive impact on researcher reasoning. However, effective visualization on big displays involves more than simply scaling up the representation.
23. Pixel-Density Scalability. As pixel density increases, does a visual approach take advantage of increased pixels per inch to show more entities and relationships, or to show data at higher detail? Evaluation: High-density representation? Does it use increased pixels per inch to show more entities and relationships at higher detail? Simultaneous detail and overview? With increased pixel density, does the representation show details and overviews at the same time, without relying on Focus+Context?
24. Display-Size Scalability. As display size increases, does a visual approach take advantage of the increased space to depict more entities or relationships? Evaluation: encode big data spatially; cluster related elements (spatial memory, direct visual comparisons); physical navigation over virtual navigation (overviews at a distance, details up close).
25. Perceptual and Analytic Task Scalability. Does a visual approach scale up to enable the performance of an analytic task across more data, more space, more pixels? Does perception suffer if you scale the approach up? Analytic tasks performed pre-attentively; analytic tasks aided by visual queries; aids to visual search for performing analytic tasks.
26. Examining current genomic data visualizations. Does it address this problem? Does it show gene neighborhoods? Is it comparative? Does this visualization allow comparison between more than a few gene neighborhoods? If you scale the visual approach up, does it: allow more comparisons of gene neighborhoods (Analytic Task Scalability); take advantage of big displays in size and pixel density (Display Resolution Scalability and Display Size Scalability); in the process, remain sensible to a human viewer (Perceptual Scalability)?
27. Line-based comparative approaches. On load, align 1-2 genes to a chosen gene in a reference genome. Draw a line or a band to connect orthologs. In many cases, genome browsers are repurposed to be comparative by adding a comparative track. Tools: PSAT, GBrowse_syn, SynView, ACT, CGAT, Combo, MizBee, Mauve. (Pan, X. et al. (2005). SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics (Oxford, England). McKay et al. Using the Generic Synteny Browser (GBrowse_syn). Current Protocols in Bioinformatics. Hoboken, NJ, USA: John Wiley & Sons.)
28. Line-based approaches expanded: Mauve. Like parallel coordinates. Draw lines between orthologs. Color genes by their block with that genome (not colored by orthology). Example shows 9 genomes. (Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140)
29. Line-based approaches: Critique. Pixel-density scalable? Not a high-density representation; need space for the comparative track. Display size scalable? Hard to follow lines across a display; hard to compare similar neighborhoods across the display; no overview from a distance, details up close. Perceptual scalability for comparing gene neighborhoods? Lots of visual clutter; comparisons not pre-attentive; no aid to visual search. Number of genomes: published up to 9; private groups have adapted frameworks for 10-50 genomes on a big display. (Darling, Aaron CE, et al. "Mauve: multiple alignment of conserved genomic sequence with rearrangements." Genome research 14.7 (2004): 1394-140)
30. PSAT: Color and alignment. Orthologs are encoded using color. The strand on which a gene is positioned is encoded by orientation to the center line. Text is given by default. (Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.)
31. PSAT: Critique. Pixel-density scalability: not a high-density representation because of text labels. Perceptual scalability for comparing gene neighborhoods? Can't scale to a large number of genes: not enough colors. (Fong, Christine, et al. "PSAT: a web tool to compare genomic neighborhoods of multiple prokaryotic genomes." BMC bioinformatics 9.1 (2008): 170.)
32. GeneRiViT: Alignment and color. Align against an arbitrary gene. Color by presence/absence. Examples show 4 genomes. Critique: no discussion of scalability; overview visualization; doesn't address our problem. (Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.)
33. Dot plots. Coordinates of genes in two genomes are used as the x and y axes. Orthologous genes in other genomes are plotted. Each genome is given a unique color. Critique: doesn't provide a gene-neighborhood view; overview tool; hard to follow beyond a few genomes. (Price, A. et al. "Gene-RiViT: A visualization tool for comparative analysis of gene neighborhoods in prokaryotes." Biological Data Visualization (BioVis), 2012 IEEE Symposium on. IEEE, 2012.)
34. Overview Visualization: Sequence Surveyor. Not this domain problem, but an interesting approach. Each gene is drawn as a rectangle. Several possible variables for position: ordinal position. Several possible variables for color: position in one reference genome. Uses a color ramp for a wide range of colors. (Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.)
35. Overview Visualization: Sequence Surveyor. Pixel-density scalable: high-density representation, high-detail representation. Display size scalability: may be difficult to compare patterns from one side of the display to another. Perceptual scalability: colors allow for pre-attentive identification of patterns; avoids visual clutter. (Albers, D. et al. "Sequence surveyor: Leveraging overview for scalable genomic alignment visualization." Visualization and Computer Graphics, IEEE Transactions on 17.12 (2011): 2392-2401.)
36. Copy number variations on big displays. Orchestral: visualization of a different data type. Effective use of color to enable pre-attentive identification of similarities across genomes. High-density representation. Details up close, overview from a distance. (Ruddle, Roy A., et al. "Leveraging wall-sized high-resolution displays for comparative genomics analyses of copy number variation." Biological Data Visualization (BioVis), 2013 IEEE Symposium on. IEEE, 2013.)
37. BactoGeNIE Demo
38. Program details. Implemented in C++ using Qt and the QGraphicsView framework. Upload: genome feature files and Fasta files (raw gene sequences). The Cd-hit algorithm processes sequence files to compute ortholog clusters. MySQL database to store big datasets. Loads 1000 contigs into memory; the rest are stored in the database. Optimized for PubMed datasets. Prototyped on E. coli draft genomes. Capable of displaying any contigs from thousands of E. coli draft genomes. On the EVL Cyber-commons wall, around 400 contigs are in view.
39. BactoGeNIE: High-density representation. Compressed genome encoding. No text labels; instead, labels on demand. No comparative track. Encode orthology using: user-applied color (pre-attentive orthology identification); coordinated highlighting (scalable visual query); alignment (use space to encode similarity).
40. Use space to encode similarity. Goals: make it easier to perform comparisons across many genomes (Analytic Task Scalability); accommodate increased display size (Display Size Scalability); make similarities and differences easy to see (Perceptual Scalability). Sorting and alignment: sort by contig length; sort by gene content; dynamically align against any gene.
41. Interactivity. On hovering, a contig expands in height, so it is easier to select genes of interest in the high-density view. A pop-up menu for each gene gives info and allows for: application of color (tagging operation); scalable query (targeting operation, described next). The user can sort genomes by gene target or contig length.
42. Gene Targeting: a function to create high-resolution, comparative maps. The user selects a gene of interest. This gene is given a base color. Two color ramps are applied to adjacent genes, one upstream and one downstream. Orthologous genes in related genomes are given the same colors. Contigs containing this gene are brought to the top. The target gene is centered. Orthologs are aligned to the target.
43. Gene targeting function. Clustering to promote direct comparisons. Overviews at a distance, details up close. Pre-attentive identification of similarities and differences between gene neighborhoods. (Lance Long)
44. Examples
45. Pixel-density Scalability. BactoGeNIE fits the pixel-density scalability criteria: high-density data display, identifier display, and orthology encoding.
46. Display Size Scalability. BactoGeNIE is the only approach to use clustering and show multiple levels of detail.
47. Perceptual Scalability and Analytic Tasks. BactoGeNIE: similarity is pre-attentively accessible; avoids visual clutter; visual query for orthologs.
48. Graphical Scalability: Display Resolution vs Number of Genomes. [Chart: axes labeled Pixels (480 to 4320) and Genomes (0 to 1000); series: BactoGeNIE, GeneRiViT, SynBrowse, SynView, PSAT, Geco, Mauve]
49. Preliminary User Feedback. A version of BactoGeNIE was used by the computational biology team on an NxN pixels and MxM inches resolution tiled display wall. "This tool has been widely used by members of the team to show the comparative analyses of genomic context for several bacterial genomes." "Genome browsers such as JBrowse enable researchers to do comparative genome analyses for nearly 10-50 genomes. But fail to work when we are studying several hundreds of genomes of interest." "This tool is really unique and it's the only tool that I am aware of that can scale up to any number of genome comparisons." "The ability to load multiple tracks of genomes, and the zoom in and out options with color coding, annotation tracks makes it very convenient for scientists to quickly look at patterns." "This tool has a potential to serve both for visualization as well as data mining needs." Usage was of a version without the gene targeting approach. Future study will concentrate on this feature with a wider community of users.
50. Summary of contributions. A novel design that is the first to enable direct comparisons between hundreds of gene neighborhoods in one view. First interactive, large-scale comparative gene neighborhood approach, with on-the-fly sorting, dynamic alignment, user-selected color, and color ramps. First to show overviews with gene-neighborhood details that can be accessed through physical movement. Introduces a novel visualization approach, gene targeting, that translates genomic data into high-resolution genomic maps.
51. What's next? Design: integration with different levels of detail; multiple color ramps; advanced ordering in y, based on similarity to target or strain phylogeny. Implementation: scalability in rendering using parallelization on the GPU; port to SAGE. Evaluation: user studies and evaluations of perceptual scalability.
52. Scalable Design, Big Data, Big Displays. Need visualization to provide an interface between automated analysis and the expert. Porting existing visual approaches to big data and big displays will not always work. Need to design for increased pixel density, display size, and volume of analytical tasks.
53. Thanks! Acknowledgements: Jason Leigh, Andy Johnson, Khairi Reda, Lance Long, Uthman Shabazz, and everyone in the Electronic Visualization Laboratory; Barry Goldman, David Bush, Niran Iyer, Shawn Stricklin, and the rest of the computational biology team at Monsanto
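Supplementary sketch: the gene targeting steps listed on slide 42 (base color for the target, upstream and downstream ramps, shared colors for orthologs, target-containing contigs on top, alignment to the target) could be modeled roughly as below. This is an illustration only, not BactoGeNIE's C++/Qt implementation. It assumes each contig is simply an ordered list of ortholog cluster IDs, and the names `target_gene`, `ramp`, `order`, and `offsets` are hypothetical. Strand and inversions are ignored.

```python
from typing import Dict, List, Tuple


def target_gene(
    contigs: Dict[str, List[str]], ref: str, idx: int, radius: int = 3
) -> Tuple[Dict[str, int], List[str], Dict[str, int]]:
    """Hypothetical model of gene targeting on ortholog-cluster IDs.

    contigs maps a contig name to its ordered ortholog cluster IDs.
    ref/idx pick the target gene; radius is the ramp length per side.
    """
    ref_genes = contigs[ref]
    target = ref_genes[idx]

    # Ramp position per cluster: 0 = target (base color),
    # negative = upstream ramp, positive = downstream ramp.
    ramp: Dict[str, int] = {target: 0}
    for step in range(1, radius + 1):
        for pos, value in ((idx - step, -step), (idx + step, step)):
            if 0 <= pos < len(ref_genes):
                # Keep the first (closest) assignment for repeated clusters.
                ramp.setdefault(ref_genes[pos], value)

    # Contigs containing an ortholog of the target are brought to the top.
    with_target = [name for name, genes in contigs.items() if target in genes]
    without = [name for name in contigs if name not in with_target]
    order = with_target + without

    # Horizontal shift (in gene slots) that puts the target at column 0.
    offsets = {name: -contigs[name].index(target) for name in with_target}
    return ramp, order, offsets


example = {
    "strain1": ["A", "B", "C", "D"],
    "strain2": ["X", "Y"],
    "strain3": ["Q", "A", "B", "C"],
}
ramp, order, offsets = target_gene(example, "strain1", 1, radius=2)
```

Because colors are keyed by ortholog cluster rather than by position, any gene in another contig that shares a cluster with a ramped gene picks up the same ramp value, which is what would make a deletion or insertion near the target stand out as a break in the color sequence.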