
Case Study: A Virtual Environment for Genomic Data Visualization

R. Mark Adams & Blaze Stancampiano†
Variagenics Inc.

Michael McKenna & David Small‡
Small Design Firm Inc.

Abstract

With the completion of the human genome sequence, and with the proliferation of genome-related annotation data, the need for scalable and more intuitive means for analysis becomes critical. At Variagenics and Small Design Firm, we have addressed this problem with a coherent three-dimensional space in which all data can be seen in a single context. This tool aids in integrating information at vastly divergent scales while maintaining accurate spatial and size relationships. Our visualization was successful in communicating to project teams with diverse backgrounds the magnitude and biological implication of genetic variation.

CR Categories and Subject Descriptors: J.3 [Life and Medical Science]: Biology and Genetics; H.5.2 [Information Interfaces and Presentation]: User Interfaces --- Interaction styles, User-centered design; I.3.8 [Computer Graphics]: Applications.

Additional Keywords: Bioinformatics, Human Factors, 3-Dimensional Interaction, Multi-scale Model, Data Navigation, Virtual Environment.

1. INTRODUCTION

As the data used by biologists grows, new tools are needed for manipulation and analysis, tools that are adapted to the many dimensions of biological data. This data contains information at numerous scales, ranging from atomic structure to the genetic composition of entire populations. The smooth transition between these scales is useful for enabling the discovery of the relationships between the submicroscopic and the macroscopic, essential in the field of pharmacogenetics.

Variagenics has generated a database of genetic variation in thousands of genes located throughout the human genome, and after looking at the available systems for visualization of genetic data, has chosen to develop a novel system integrating the range of data relevant to our research. Existing genetic visualization systems are adapted largely for genomic purposes, specifically the discovery, categorization and analysis of novel genes. Our goal was different, namely to present genetic variation in a population, focusing on genes of largely known function and of pharmacological importance. Small Design Firm, which specializes in interactive information design, had been developing a unique data browser for a Human Genome museum exhibit (Museum of Science and Industry, Chicago), and was enlisted to provide the design and implementation of a 3D interactive browser for the genetic variation data.

† 60 Hampshire Street, Cambridge, MA 02139-1548. [email protected] & [email protected]

‡ 875 Massachusetts Ave., Suite 11, Cambridge, MA 02139-3070. [email protected] & [email protected]

2. PROBLEM

The nucleic acid sequence of the human genome contains about 3.2 billion base-pairs. Most of this sequence remains the same between individuals of the same species, but small differences between individuals do occur. In humans, single base-pair differences (SNPs, for single nucleotide polymorphisms) occur approximately every 1000 base-pairs between any two individual chromosomes. Most of these polymorphic sites represent normal variation, and are not usually disease-causing mutations. Although they are not responsible for disease, it is thought that some of these normally occurring genetic variants are responsible for differences in how individuals respond to drug treatment. The study of this phenomenon is called pharmacogenetics, and is the main focus of research at Variagenics. We are particularly interested in predicting individual response to the chemotherapy drugs used in the treatment of cancer. By carefully screening patients for the presence or absence of specific genetic markers for drug response, it is hoped that intelligent and biologically meaningful chemotherapy choices can be made, allowing for significantly safer and more effective medicine.
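
To give a sense of the magnitude implied by the two figures quoted at the start of this section, the following back-of-the-envelope sketch estimates how many single base-pair differences one would expect between two copies of the genome. It uses only the numbers stated above; the constant and function names are illustrative, and the result is an order-of-magnitude estimate rather than a measured value.

```python
# Back-of-the-envelope estimate using only the two figures quoted in the text.
GENOME_LENGTH_BP = 3_200_000_000   # about 3.2 billion base-pairs
BP_PER_SNP = 1_000                 # roughly one difference every 1000 base-pairs


def expected_snp_count(genome_length_bp: int, bp_per_snp: int) -> int:
    """Approximate number of single base-pair differences between two genome copies."""
    return genome_length_bp // bp_per_snp


if __name__ == "__main__":
    count = expected_snp_count(GENOME_LENGTH_BP, BP_PER_SNP)
    print(f"about {count:,} single base-pair differences")
```

On these assumptions the estimate comes to roughly 3.2 million sites, which is the scale of variation the visualization has to place in context.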

Variagenics, using high-throughput laboratory methods, has built a large database of genetic polymorphism, including data on thousands of genes. This data includes information of many different types on a wide variety of scales. In order to access the information in an integrated fashion, and to present the data to individuals unaccustomed to the conventional, and quite rough, tools, a simple and intuitive system needed to be developed. The intent was to provide a means whereby individuals untrained in the use of data mining tools, but with experience in molecular biology, genetics or pharmacology, could access the information relevant to them, and place it in the context of the other dimensions of data relating to pharmacogenomics. As specified, the tool had to present data at several different scales, including populations (macroscopic), chromosomal location (microscopic), pharmacological pathways (sub-microscopic), DNA sequences (sub-microscopic/atomic) and protein structure (atomic).

The tool had to be able to bring in data from a variety of outside sources, including the human genome sequence database, and the databases of 3D protein structure. XML-based file formats for input information allowed flexibility in integrating this data, and simplified processing and integration of Variagenics' in-house data with that from outside sources.

3. APPROACH

To accommodate both the wide range of scale implied by the underlying data and the gene- and population-associated annotations, an interactive 3D virtual environment approach was chosen that integrated the information in an intuitive way. In a 3D dataspace, intuitive notions of location and movement can translate into access to a variety of data types in a seamless and simultaneous manner [1]. This was found to be particularly appropriate for genomic data, and the display analogies, which are highly visual in nature, found resonance with the biologists using the software. For example, the use of a 3D spatial organization allowed the direct integration of features at a variety of scales into a single modeless presentation. Protein structures are represented using conventional methods, as are chromosome pairs. Since the user must zoom into the visualization to see the protein models, some impression of the relative scale relationship is preserved (see Figures 1-8).

In building the tool, we decided to forgo the conventional click-and-drag paradigm in favor of the simple, continuous 3D navigation afforded by a joystick. The user is given the ability to "fly" into the visualization, turning their point of view towards regions of interest, or bringing forth previously hidden realms of information from behind the currently visible plane. This control fits in well with the continuous space implied by the data environment, and presents the user with a reference for the scaling transitions experienced.

In a similar way, smooth scaling accompanying spatial transition allows very fine-grained data, such as DNA sequences, to be displayed alongside the protein and chromosomal location data, without requiring the mode change common to most genome visualization systems. The intuitive nature of this transition is particularly interesting, given the nonrepresentational nature of typographic renderings of DNA sequences. Scale and location are constants in the space, allowing the relationships between the scales to be better understood. This works even though the scale relationships are not perfectly linear; in general the scaling is accurate, although not strictly exact.

By virtue of the wide range of implied scales, the framework also provides a context in which to place new information. Since the system already spans the scale of most meaningful biological processes, from the atomic to the population, it is likely that the visualization tool will scale appropriately.

The application was developed using sdfWindows, a proprietary object-oriented graphics package, written in C++, that provides support for a variety of rendered objects, including high-quality, anti-aliased, scalable 3-dimensional typography. The package also supports level-of-detail rendering and animation procedures, based on environmental parameters and the real-time clock or specified frame durations. These capabilities, combined with transparency, texture mapping, and smooth real-time animation, can create compelling environments [2].

SdfWindows is built on OpenGL, which it uses for image output as well as keyboard, mouse and joystick input. A minimal Microsoft Windows layer is used to house the application.

This genetics application is PC-based, using mid-to-high end off-the-shelf hardware. The choice of the graphics card can have a significant impact on graphics frame rates, but a variety of recent cards give excellent and cost-effective performance, such as the NVIDIA GeForce2 or later.

The system was designed with a minimum of controls, to streamline interaction and allow many types of users to easily drive the application. A joystick (or gamepad) is used to control the virtual camera: up and down motions zoom in and out, and side-to-side motions swing the camera view horizontally.

Optionally, the joystick slider can be used to set a zoom in/out rate. Three joystick buttons are used to select target genes from the list of genes associated with a given pathway of drug action. Alternatively, the keyboard can be used to substitute for the joystick, using the four arrow keys for camera control, and three keys for gene selection.
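
The control scheme described above can be sketched as a mapping from raw input to a pair of camera rates. This is a minimal illustration in Python rather than the application's C++ code; `CameraCommand`, `from_joystick`, and `from_keyboard` are hypothetical names, and the sign conventions (stick up zooms in, slider scales the zoom rate) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CameraCommand:
    zoom_rate: float   # positive zooms in, negative zooms out (assumed convention)
    swing_rate: float  # positive swings the view to the right (assumed convention)


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def from_joystick(x: float, y: float, slider: float = 1.0) -> CameraCommand:
    """Map analog stick axes in [-1, 1] and a slider in [0, 1] to camera rates."""
    x = _clamp(x, -1.0, 1.0)
    y = _clamp(y, -1.0, 1.0)
    slider = _clamp(slider, 0.0, 1.0)
    # The slider scales only the zoom rate; swing follows the stick directly.
    return CameraCommand(zoom_rate=y * slider, swing_rate=x)


def from_keyboard(pressed: set) -> CameraCommand:
    """Map the four arrow keys to the same command; each key is all-or-nothing."""
    zoom = ("up" in pressed) - ("down" in pressed)
    swing = ("right" in pressed) - ("left" in pressed)
    return CameraCommand(zoom_rate=float(zoom), swing_rate=float(swing))
```

Both input paths produce the same command type, so the camera code does not need to know which device is in use. The difference is that the analog path can yield fractional rates, while the keyboard path yields only -1, 0, or 1.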

Figure 1: Pathway diagram with gene selection.

Figure 2: Distant view of all chromosomes.

Figure 3: Approaching chromosome 1.

Figure 4: Approaching the target gene.


The selected gene forms the focal point for the camera zoom. “Zooming” is actually the result of camera translations forwards and backwards using a logarithmic scale. The camera distance from the target determines the currently displayed content (in terms of level of detail and similar object control), and is used to trigger animated transitions. Although the system is modeless, as the camera passes from one environmental scale into another, smoothly varying changes are made, such as rotating the camera orientation or modifying object properties in order to better display the data being revealed at the new scale domain.
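
The zoom behavior described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the application's code: the scale-domain names and distance thresholds are invented placeholders, and smoothstep easing is one plausible choice for a "smoothly varying" transition, not something the text specifies.

```python
import math

# Hypothetical scale domains, ordered from far to near. The distance
# thresholds are illustrative placeholders, not values from the application.
SCALE_DOMAINS = [
    (1e3, "genome"),      # all chromosome pairs
    (1e1, "chromosome"),  # gene bars along one chromosome
    (1e-1, "gene"),       # protein structure
    (0.0, "sequence"),    # nucleotide sequence
]


def zoom(distance: float, rate: float, dt: float) -> float:
    """Move the camera toward (rate > 0) or away from (rate < 0) the target.

    The step is applied to log(distance), so a constant rate multiplies the
    distance by a constant factor per unit time rather than subtracting a
    fixed amount.
    """
    return math.exp(math.log(distance) - rate * dt)


def scale_domain(distance: float) -> str:
    """Pick the content to display from the camera-to-target distance."""
    for threshold, name in SCALE_DOMAINS:
        if distance >= threshold:
            return name
    return SCALE_DOMAINS[-1][1]


def transition_blend(distance: float, far: float, near: float) -> float:
    """Return 0 at `far` and 1 at `near`, easing smoothly in log-distance."""
    t = (math.log(far) - math.log(distance)) / (math.log(far) - math.log(near))
    t = max(0.0, min(1.0, t))
    return t * t * (3.0 - 2.0 * t)  # smoothstep easing
```

A blend value of this kind could drive, for example, the camera roll or an object property as the camera crosses from one scale domain into the next, so that nothing changes abruptly at a threshold.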

A set of XML files is generated from Variagenics’ master genetics database. The application uses the XML data at startup to generate the virtual environment content. The XML files include information on chromosomes, genes, cDNA sequences, proteins, nucleotide variances, variance-related observations, etc. Protein model structures are defined using the POV format for ease of file parsing.
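
As a rough illustration of the startup step, the sketch below parses a small XML document into plain objects from which scene content could be built. The actual schema is not given in the text, so every element name, attribute name, and value in `SAMPLE_XML` is invented for the example, as are the `Gene`, `Variance`, and `load_genes` names.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical input: the real schema is not described in the text. Element
# names, attribute names, and positions below are invented for illustration.
SAMPLE_XML = """
<genome>
  <chromosome id="1">
    <gene name="EXAMPLE_GENE" start="1000" end="5000">
      <variance position="1200" reference="C" variant="T"/>
      <variance position="4300" reference="A" variant="C"/>
    </gene>
  </chromosome>
</genome>
"""


@dataclass
class Variance:
    position: int
    reference: str
    variant: str


@dataclass
class Gene:
    name: str
    chromosome: str
    start: int
    end: int
    variances: list = field(default_factory=list)


def load_genes(xml_text: str) -> list:
    """Parse the XML once at startup into objects the scene can be built from."""
    root = ET.fromstring(xml_text.strip())
    genes = []
    for chrom in root.iter("chromosome"):
        for g in chrom.iter("gene"):
            gene = Gene(
                name=g.get("name"),
                chromosome=chrom.get("id"),
                start=int(g.get("start")),
                end=int(g.get("end")),
            )
            for v in g.iter("variance"):
                gene.variances.append(
                    Variance(int(v.get("position")), v.get("reference"), v.get("variant"))
                )
            genes.append(gene)
    return genes
```

Keeping the parsed data in simple objects like these would also let a second consumer, such as a print formatter, reuse the same input without depending on the 3D code.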

In addition to the virtual environment application, a companion application was developed that uses the same XML input data to format high-quality PostScript output for 2D print and Flash display.

4. RESULTS

The virtual environment genetics application allows the user to travel from the level of the entire genome down to the molecular scale of proteins and nucleotide values, and the variations within them, as can be seen in the application still-frames in Figures 1-8. Another form of visualization is shown in Figures 9-12, where an example of loss of heterozygosity (LOH) is depicted. LOH is a type of genetic variation that occurs frequently in cancer, in which a region has been deleted from one of a pair of chromosomes. The figures will be described in more detail below. The system is also shown in operation in the accompanying DVD video.

The application operates at interactive speeds on desktop PCs. Specifically, on a Dell Dimension 8200 1.9 GHz P4, with an NVIDIA GeForce3 graphics card, frame rates of 8-25 fps were obtained, independent of the image resolution, ranging up to 1600x1200. Frame rates of 5-18 fps were obtained on a Dell Dimension XPS B1000 1 GHz P3 with an NVIDIA GeForce2 Ultra card.

The user begins at the outermost scale, and selects a target gene from the list associated with the biochemical pathway being studied. Figure 1 shows the pathways that impact the activity of the chemotherapy drug 5-FU (5-fluorouracil), and the gene that encodes the protein MTHFR (methylenetetrahydrofolate reductase) is selected. Figure 2 shows the collection of 23 human chromosome pairs. Moving the camera inward, the user approaches the target on chromosome 1, and bars representing genes of interest are revealed in Figure 3. As the target gene nears in Figure 4, the camera rolls to a new orientation, and the banded chromosome object flattens to better show the gene bars. An indicator for the MTHFR-encoding gene becomes resolved in Figure 5. The user moves up to the gene and sees the MTHFR protein structure, with the 3D location of two single nucleotide variances indicated in Figure 6. Moving the camera in further also pans the camera left, where nucleotide sequences with variances are laid out along the gene in Figure 7. At this scale, panning the camera back and forth and up and down reveals more of the nucleotide sequence. Panning further to the left, the camera transitions “around the corner” to reveal a table of the distribution of genetic polymorphism within a sample population, shown in Figure 8. The user can quickly move back and forth between these different levels of scale, maintaining context.

Figure 5: Nearing the gene that encodes MTHFR.

Figure 6: The MTHFR protein structure.

Figure 7: To the left of the protein are nucleotides.

Figure 8: Population distribution of genetic polymorphism.

Figures 9-12 show a zoom into the “left” chromosome 1, where deleted regions are indicated. These examples are taken from cancer patients who have experienced LOH within their tumor cells. The other chromosome of the pair remains the sole means of protein production from genes in the deleted regions. As the camera moves in, successively smaller deleted regions are illustrated. Even in the smallest deletion, in Figure 12, the gene encoding MTHFR is missing, which has strong implications for how patients with this genetic variation respond to 5-FU.
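
The observation that the gene is missing even in the smallest deletion amounts to an interval-containment check, sketched below. The coordinates are hypothetical and chosen only to mirror a series of deleted ranges of decreasing size; they are not real genomic positions, and `Region`, `contains`, and `deleted_in_all` are illustrative names.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start: int  # inclusive, in base-pairs along the chromosome
    end: int    # inclusive


def contains(outer: Region, inner: Region) -> bool:
    """True if `inner` lies entirely within `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def deleted_in_all(gene: Region, deletions: List[Region]) -> bool:
    """True if the gene falls inside every deleted region in the list."""
    return all(contains(d, gene) for d in deletions)


# Hypothetical coordinates, largest deletion first; not real genomic positions.
DELETIONS = [
    Region(0, 90_000),
    Region(5_000, 40_000),
    Region(8_000, 20_000),
    Region(9_000, 13_000),
]
GENE = Region(10_000, 12_000)

if __name__ == "__main__":
    print(deleted_in_all(GENE, DELETIONS))
```

With these placeholder values the gene lies inside all four ranges, which corresponds to the situation shown in the figures: a patient with any of these deletions retains only one copy of the gene.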

We have found that analog-style joystick inputs can give the perception of faster frame rates as compared to keyboard controls for the virtual camera. As the camera gently drifts from small joystick or slider positions, every frame gives subtle but meaningful motion cues. The ability to easily set the camera motion rate also gives the feel of greater control and responsiveness.

In presentations to audiences with widely varying levels of expertise, the system was found to be intuitive and usable. Most users reported that the system enabled them to understand the scale of the human genome, and the nature of genetic polymorphism as they never had before, leading to real understanding of complex underlying issues around the pharmacogenomics of cancer. Additionally, most users found the tool to be not only informative, but also engaging. At a scientific meeting where a prototype was set up for general use, there were lines of scientists waiting to “play” with the system. This demonstrates that the system not only engages the intellect, but that it does so in a thoughtful and interesting manner.

5. ACKNOWLEDGMENTS

Thanks to Lisa Strausfeld for design work on the tabular data, to Justin Manor for development work on the application, to Daniel Chasman for assistance with the 3D protein models, and to David Tefft for XML processing work.

References

[1] David Small, Suguru Ishizaki, and Muriel Cooper. “Typographic Space.” In Human Factors in Computing Systems, CHI ’94 Conference Companion, pages 437-438. ACM, 1994.

[2] David Small. “Rethinking the Book.” In Gunnar Swanson, editor, Graphic Design & Reading, pages 189-201. Allworth Press, 2000. ISBN 1-58115-063-6.

Figure 9: Large deleted chromosome range.

Figure 10: Closer in, we see a slightly smaller deleted range.

Figure 11: Closer in is a smaller deleted range.

Figure 12: This deleted range contained the MTHFR gene.
