
Exome Sequence Analysis and Interpretation


A concise handbook for clinicians and clinical geneticists.

About the Authors

Sridhar Sivasubbu

Sridhar Sivasubbu’s laboratory is interested in exploiting the advantages of zebrafish to dissect molecular mechanisms of gene function, regulation and genome organization in vertebrates. Research activities in his lab include deciphering non-coding RNA mediated regulation of blood and blood vessel development, and developing zebrafish models for application in personalized and precision medicine in humans. His group is actively involved in mapping the genome and transcriptome of the wild zebrafish. His group was also responsible for the whole genome sequencing of human samples from India and other Southeast Asian countries.

Sridhar received his PhD from M.S. University, Tirunelveli, India and did postdoctoral research at the Center for Cellular and Molecular Biology, India and the University of Minnesota, USA. He has been a faculty member at the CSIR-Institute of Genomics & Integrative Biology (CSIR-IGIB) since 2006. Sridhar also served as the CEO of The Center for Genomic Application, a public-private partnership company established by CSIR-IGIB to enable research in the field of genomics and proteomics, where he spearheaded the application of next generation sequencing technology for commercial projects.

Vinod Scaria

Vinod Scaria is a clinician turned computational biologist. His laboratory is interested in understanding the function, organization and regulation of vertebrate genomes, and how genomic variations could potentially impact them. He is also involved in creating novel methods and resources for the analysis and annotation of genomes and for understanding the functional impact of genomic variations. He has been part of collaborative genomics projects aimed at understanding Asian genome diversity. He has also been part of whole genome sequencing and analysis projects, including the Indian, Sri Lankan and Malaysian genome projects, and is a member of the HUGO Pan-Asian Population Genomics Initiative task force. He has adopted novel and creative strategies, such as the use of social media and the participation of a large number of undergraduate students in collaborative projects, to accelerate genome annotation and the co-creation of resources for genome annotation.

Vinod did his undergraduate medical education at Calicut Medical College, University of Calicut, and his PhD in Computational Biology at the University of Pune. Vinod has over 80 publications in international peer-reviewed journals and two book chapters to his credit. He is also on the editorial boards of PLoS ONE, PeerJ, Journal of Translational Medicine and Journal of Orthopaedics (Elsevier). He received the CSIR Young Scientist Award for Biological Sciences in 2012. He was a member of the senate of the Academy of Scientific and Innovative Research (AcSIR).


Exome sequence analysis and interpretation

Handbook for Clinicians

1st Edition ________

Vinod Scaria Sridhar Sivasubbu


Scaria and Sivasubbu (2015). Exome Sequence Analysis and Interpretation


Like us on Facebook https://www.facebook.com/clinicalexome

1st Edition (2015) Version 1.01
Scaria V and Sivasubbu S
Exome sequence analysis and interpretation
The entire surplus from the sale of this book will go to support advancing research in genomics.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Cover Image: Artist’s impression of Nucleotides in a DNA strand. Oil on canvas by Pradha (2015)


Acknowledgements

A number of individuals have contributed to this book in personal as well as professional capacities. These include graduate students from our groups, especially Mr. Shamsudheen Karuthedath Vellarikkal, Mr. Rijith Jayarajan, Mr. Ankit Verma, Ms. Saakshi Jalali, Ms. Heena Dhiman and Mr. Kandarp Joshi, who helped in collating the content and figures that enrich the manuscript. The authors also thank and acknowledge the critical comments, editorial help and support from our colleagues, Dr. Vamsi Krishna, Dr. Adita Joshi, Dr. Srinivasan Ramachandran, Dr. Jameel Ahmad Khan and Dr. Abhay Sharma.

The authors thank the Genomics for Understanding Rare Diseases - India Alliance Network (GUaRDiAN) and its collaborators for critical insights, which significantly enriched the outlook and content of this book. The authors also thank the innumerable patients and families who have interacted with us through the network, without whom our insights and knowledge would have been limited.

The authors acknowledge the financial support from the Council of Scientific and Industrial Research (CSIR), India through grant BSC0212 (Wellness Genomics Project). The funding agencies had no role in the preparation of the content or in the decision to publish this book. The authors declare no competing financial interests.


Dedication

Dedicated to the innumerable patients and families who enriched our knowledge and insight through their close interactions, shared their distress with us like family members, and selflessly contributed samples to research; without them we would not be what we are, would not be doing what we do, and would not be writing what we have written.


Contents

Foreword
Case of the Bhai
The human genome project and how it changed everything
Genome variations and how they make us different?
A brief introduction to next generation sequencing
When you could sequence your own genomes
So what if we could sequence just the protein coding genome?
When should you do exome sequencing?
When should you probably not do exome sequencing?
First things first: putting insights before data
Educating the patient and getting an informed consent
Points to note when you outsource exome sequencing
Understanding the steps in analysis of exome sequence data
How good is the exome sequencing data?
Prioritizing, annotating and interpreting variants
Don't forget the validation
Ethical considerations in whole exome sequencing
The last word
Index


Foreword

I would easily pick ‘Next Generation Sequencing’ as one of the techniques that has had an immediate and immense application in research and healthcare. Within a span of five years, we have reached a point where hardly any scientist or physician can afford to be ignorant of ‘exome sequencing’. With newspapers and the internet screaming ‘genome’ everywhere, this handbook by Dr. Scaria and Dr. Sivasubbu is timely.

The introductory chapter on ‘Bhai’ is a story of ‘exome sequencing’ that is told lucidly enough even for the general public. It is important for everybody, let alone clinicians, to know what sequencing is and how the human genome project has improved our understanding of the role of genetic variants in health and disease. The authors then introduce readers to the exome, the clinical importance of sequencing it, and the situations where it is helpful in patient care. At the same time they warn physicians not to get carried away. In the next chapter they explain the basics of medical evaluation and how these remain evergreen even in the current era.

It is important that the patient is not taken for a ride by new diagnostic companies that did not exist the previous year. Both the clinician and the patient must be aware of what they are doing with the new test and what they can expect in the form of results. Both probably need to be thoroughly involved in the consenting process.

For a researcher, the authors explain how outsourcing is not easy despite the availability of several service providers, and detail in simple terms how the large volume of data can be analyzed. The chapters on quality control and interpretation of variants help readers understand the intricacies of this technique. Independent validation of the results is vital for applying this technique in clinical practice, especially prenatal diagnosis. To conclude, the authors elegantly touch upon the ethical issues that cry for attention. The ‘Did you know’ text boxes spread throughout the book highlight genetic milestones or common terms that should be ‘general knowledge’.

With their excellent medical and scientific backgrounds, and having pioneered this technique in our country both scientifically and socially, Dr. Scaria and Dr. Sivasubbu have done an incredible job of cracking the hard nut of ‘exome sequencing’, and the book is a ‘must read’ for all clinicians and students of genetics.

Girisha KM
Professor and Head
Department of Medical Genetics
Kasturba Medical College, Manipal
Manipal University


Chapter 1

Case of the Bhai

The day after the Indian genome sequence was announced [1], we received a phone call from an individual who introduced himself as Bhai [2]. A phone call from Bhai comes with a lot of connotations, the popular imagination being that of the underworld calling for extortion. Fortunately for us, Bhai was neither part of the underworld nor interested in extorting money; he must have been aware that we were not millionaires.

Bhai nevertheless had a bigger problem at hand. He said that he had a skin problem and wanted us to talk to his doctor. On closer discussion, it became evident that he suffered from an inherited genetic disease that had affected multiple members of his family. Days later, we learned from his physician that the family suffered from a rare genetic skin disease called Epidermolysis Bullosa (EB). EB encompasses distinct disease subtypes of variable severity, ranging from localized lesions to a more extensive or generalized form. The disease is caused by defects in a number of genes, mostly involved in maintaining the integrity of the skin layers. Mutations in any one of these genes result in fragility of the skin, causing eruptions resembling those that occur after burns. These eruptions, or 'bullae', sometimes break open, get infected and result in scarring and, at times, extensive pigmentation.

Did you know? Epidermolysis bullosa is a rare genetic disease of the skin presenting with blisters on the skin. The disease runs in families and has an incidence of approximately 1 in 50,000 individuals.

[1] The sequencing of the first genome of an Indian was announced on 8th December 2009. Source: http://www.pib.nic.in/newsite/erelease.aspx?relid=55470 Also published in: Patowary, Ashok, et al. "Systematic analysis and functional annotation of variations in the genome of an Indian individual." Human Mutation 33.7 (2012): 1133-1140.

[2] Bhai in Hindi and Gujarati means brother. It is a popular surname attached to most Gujarati names. In colloquial usage, it is also sometimes applied to an underworld don.

Bhai wanted the genetic lesion to be identified. This was a complicated task to begin with. We had two options. The first was to systematically characterize the mutation by sequencing every single exon, one by one, using conventional sequencing approaches, which would have cost us a lot in terms of time, money and effort. The second was to use a genome-scale approach, without a prior hypothesis, to sequence multiple genes in one go and then try to find the needle in the haystack. A paradigm shift in approach was on the anvil. We had worked extensively on setting up a new technology that allowed us to sequence whole genomes, or specifically those parts of the genome that consist of protein-coding genes [3]. We had also gained hands-on experience in systematically analyzing genome data for variants.

[3] One technology to sequence the part of the genome that encodes proteins is called exome sequencing. The concept of this methodology forms the basis of this book and is detailed in the later chapters.


What is a pedigree chart? A pedigree chart is a graphical document that details the ancestry of an individual. It is a very important tool for studying the inheritance of diseases in a family over generations.

There was another technical issue. A close study of the pedigree revealed that almost half of the family members, spanning almost three generations, were affected with the disease, suggesting autosomal dominant inheritance. That would essentially mean that the variant in question would potentially be heterozygous [4]. Heterozygous variations can be difficult to identify. On one hand, you require enough coverage [5] to call a heterozygous variation accurately. On the other hand, differentiating a potential causative variation from a number of other changes is a tedious and challenging task.
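Why coverage matters for heterozygous variants can be made concrete with a little probability. The sketch below is a simplified illustration, not the method used in the study: it assumes that each read covering a heterozygous position carries the variant allele independently with probability 0.5, that sequencing errors can be ignored, and that a caller needs some minimum number of variant-carrying reads (the threshold of 3 and the function name are hypothetical).

```python
from math import comb


def prob_alt_reads_at_least(depth: int, min_alt_reads: int,
                            alt_fraction: float = 0.5) -> float:
    """Binomial probability that at least `min_alt_reads` of `depth` reads
    carry the variant allele (assumed model: independent reads, no errors)."""
    p_fewer = sum(
        comb(depth, k) * alt_fraction ** k * (1 - alt_fraction) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - p_fewer


# Chance of seeing the variant allele in at least 3 reads at various depths.
for depth in (5, 10, 20, 30):
    print(depth, round(prob_alt_reads_at_least(depth, min_alt_reads=3), 4))
```

Under these assumptions, the chance of collecting enough variant-carrying reads rises steeply with depth, which is the intuition behind asking for adequate coverage before calling a heterozygous change.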

There were also well-established workarounds for these problems. One approach was to sequence two affected individuals, find the variations shared between the two datasets, and prioritize those that could potentially change amino acids or functional regions. The other approach was to sequence one affected individual, use computational approaches to prioritize variations that change amino acid sequences, and then check whether these variations are present in other family members, both affected and unaffected.

Did you know? Keratin 5 (KRT5) is a cytoskeletal protein important for the integrity of skin. Mutations in the KRT5 gene can cause Epidermolysis Bullosa Simplex.

[4] The human genome is diploid. That means we have two copies of each chromosome, and therefore two nucleotides correspond to each position in the genome. If the two nucleotides are not the same, only one copy carries the variation. Such a variation is called heterozygous.

[5] Coverage here denotes the number of times the sequence of the genome has been covered or repeated.

We decided to pursue the first path. We sequenced the protein-coding regions of the genes (the exomes) of two affected individuals from two generations. Systematic overlap of the single nucleotide changes, followed by filtering for alterations that could potentially have caused the disease, identified a variation in the KRT5 gene [6]. Reassuringly, KRT5 had previously been associated with the disease. The variation was further investigated in a number of affected and unaffected individuals using conventional Sanger sequencing [7] of the region around the variation. The same variation was present in all affected individuals but absent in all unaffected individuals tested, supporting our conclusion that the variant is causative of the disease in the family.

[6] Vellarikkal, Shamsudheen K., et al. "Exome sequencing reveals a novel mutation, p.L325H, in the KRT5 gene associated with autosomal dominant Epidermolysis Bullosa Simplex Koebner type in a large family from western India." Human Genome Variation 1 (2014).

[7] Sanger sequencing is a molecular technology for sequencing nucleic acids, developed by and named after Fred Sanger. The conceptual methodology is detailed in the next chapter.
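The overlap-and-filter strategy described in this chapter, followed by a check in other family members, can be sketched in a few lines. This is a toy illustration only: the variant tuples, chromosome labels, effect categories, family member names and function names below are all made-up placeholders, not data or code from the study.

```python
from typing import Dict, Iterable, Set, Tuple

# (chromosome, position, reference, alternate, predicted effect)
Variant = Tuple[str, int, str, str, str]

# Assumed set of effect labels treated as protein-altering in this sketch.
PROTEIN_ALTERING = {"missense", "nonsense", "frameshift"}


def shared_protein_altering(call_sets: Iterable[Set[Variant]]) -> Set[Variant]:
    """Keep variants found in every call set that are predicted to alter a protein."""
    shared = set.intersection(*list(call_sets))
    return {v for v in shared if v[4] in PROTEIN_ALTERING}


def segregates_with_disease(variant: Variant,
                            carriers: Dict[str, Set[Variant]],
                            affected: Iterable[str],
                            unaffected: Iterable[str]) -> bool:
    """True if the variant is in all affected and in no unaffected members."""
    in_all_affected = all(variant in carriers[p] for p in affected)
    in_no_unaffected = all(variant not in carriers[p] for p in unaffected)
    return in_all_affected and in_no_unaffected


# Placeholder exome calls for two affected relatives.
affected_parent = {("chrA", 101, "T", "A", "missense"),
                   ("chrA", 250, "G", "C", "synonymous"),
                   ("chrB", 77, "C", "T", "missense")}
affected_child = {("chrA", 101, "T", "A", "missense"),
                  ("chrA", 250, "G", "C", "synonymous"),
                  ("chrC", 12, "A", "G", "nonsense")}

candidates = shared_protein_altering([affected_parent, affected_child])

# Placeholder follow-up genotypes for the wider family.
carriers = {"parent": affected_parent, "child": affected_child,
            "aunt": {("chrA", 101, "T", "A", "missense")},
            "uncle": set(),
            "cousin": {("chrB", 77, "C", "T", "missense")}}
for variant in candidates:
    print(variant, segregates_with_disease(
        variant, carriers, ["parent", "child", "aunt"], ["uncle", "cousin"]))
```

Intersecting the call sets removes private variants, the effect filter removes changes unlikely to alter a protein, and the segregation check mirrors the confirmation step carried out in additional family members.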


Did you know? Frederick Sanger (1918-2013) received two Nobel Prizes in Chemistry: one in 1958 for determining the amino acid sequence of insulin, and a second in 1980 for the sequencing technology that eventually came to be named after him. The Sanger Centre, now the Sanger Institute at Hinxton, which took a lead role in the International Human Genome Project, was named in his honour. The Institute is now one of the world’s largest genome centers.

Chapter 2

The human genome project and how it changed everything

The quest to understand the sequence of DNA was pioneered by Frederick Sanger, who received the Nobel Prize in 1980 for the technique to determine it. This technique, popularly known as Sanger chemistry, is still practised today and is based on the concept that modified nucleotide bases can irreversibly terminate a DNA synthesis reaction wherever they get incorporated. The principle is simple. One could clonally amplify short stretches of DNA and use the single strands as templates for DNA synthesis. Apart from normal nucleotides, the synthesis mixture could be spiked with abnormal nucleotides, which are modified and labeled. These modified nucleotides, called di-deoxy nucleotides, terminate the synthesis reaction wherever they get incorporated by virtue of the complementary sequence in the template strand. This chain termination produces truncated products of different sizes. Successive products differ in length by one nucleotide and can be separated using gel electrophoresis. Earlier, radioactively labeled bases were used, which enabled their detection using radiography, but later non-radioactive modifications were developed that allowed bases to be labeled either with specific fluorophores or with light-emitting molecules. An overview of the technology is given in Figure 1. The methodology was developed in the 1970s, but it was not until a decade later that the technology matured and was automated, fuelling the quest to sequence whole genomes.
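The chain-termination principle can be illustrated with a small simulation. The sketch below is a toy model, not a description of any laboratory protocol: the template string, the number of molecules, the fraction of di-deoxy nucleotides and the function names are all assumptions chosen for illustration, and strand orientation is ignored for simplicity.

```python
import random

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def chain_termination_fragments(template: str, n_molecules: int = 2000,
                                ddntp_fraction: float = 0.1, seed: int = 0):
    """Simulate synthesis on many copies of `template`.

    At each position the complementary base is added; with probability
    `ddntp_fraction` it is a labeled di-deoxy base, which ends that copy.
    Returns the set of (fragment length, labeled terminal base) observed.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(n_molecules):
        for position, base in enumerate(template, start=1):
            if rng.random() < ddntp_fraction:
                fragments.add((position, COMPLEMENT[base]))
                break  # chain terminated; this copy grows no further
    return fragments


def read_sequence(fragments) -> str:
    """Sort fragments by size (as electrophoresis does) and read the labels."""
    return "".join(base for _, base in sorted(fragments))


template = "ATGCGTAC"
fragments = chain_termination_fragments(template)
print(read_sequence(fragments))
```

Because every fragment length ends in a known labeled base, ordering the fragments by size spells out the newly synthesized strand, which is the complement of the template.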

The Sanger sequencing technology saw a number of improvements. The major one was the automation and miniaturization of the technique, which gave birth to automated capillary sequencers. In capillary sequencers, electrophoresis takes place inside capillaries and the electrophoresis bands are detected using lasers. Automation significantly increased the throughput of Sanger technology, enabling the sequencing of larger genomes, and the technology is popularly dubbed first generation sequencing.


Figure 1. Conceptual overview of the Sanger sequencing technology. The technology relies on the incorporation of labeled di-deoxy nucleotides and chain termination.


Figure 2. One of the earliest sequencers that used the Sanger sequencing methodology. The readout was obtained from the vertical gel electrophoresis. Courtesy: The genomics museum at CSIR-Institute of Genomics and Integrative Biology, Delhi, India

Figure 3. Automated capillary sequencer. Courtesy: The genomics museum at CSIR-Institute of Genomics and Integrative Biology, Delhi, India


Did you know? Apart from the NIH-led Human Genome initiative, a parallel effort to sequence the human genome was initiated by a private company, Celera Genomics, led by Craig Venter. This effort was initiated in 1998 and was estimated to cost approximately 300 million US Dollars, far cheaper than the NIH-led effort. The draft assembly was released and published in the year 2001.

The planning for the ambitious Human Genome Project started as early as 1984, but the project itself, with appropriate funding, was initiated in the year 1990. The United States Department of Energy (DOE) and the National Institutes of Health (NIH) jointly funded the project. It was planned to be completed in 15 years with a total outlay of approximately 3 billion US Dollars. Apart from the United States of America, the project also encompassed an international consortium, which included researchers from other countries including the United Kingdom, France, Germany, Japan and China.

Did you know? The draft human genome was jointly announced by Bill Clinton, President of the United States of America, and Tony Blair, the British Prime Minister, on 26th of June 2000. The complete assembly of the genome was later announced on April 14th, 2003.

The sequencing of the human genome involved quite a cumbersome procedure. Initially, the genome of 3.3 billion bases was broken down into small fragments, each of approximately 150,000 bases, which were cloned into bacterial vectors. These were maintained and replicated by the bacterial machinery for DNA replication. Each of these vectors was then sequenced and assembled independently, before the pieces were put together to assemble the chromosomes. This methodology came to be known as the hierarchical shotgun approach.
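The idea of putting sequenced pieces back together by their overlaps can be illustrated with a toy example. The sketch below is a deliberately naive greedy merge on three short made-up reads; it is not the software used by the Human Genome Project, and the read sequences, the minimum overlap of 3 bases and the function names are assumptions for illustration only.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the pieces separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads)
                 if k not in (best_i, best_j)] + [merged]
    return reads


# Three overlapping made-up reads, given out of order.
reads = ["CGTTAGC", "ATGGCGTA", "CGTACGTT"]
print(greedy_assemble(reads))
```

Real genomes contain long repeats that defeat such a simple strategy, which is one reason the hierarchical approach first assigned fragments to known, mapped clones before assembling within each clone.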

Meanwhile, almost halfway through the publicly funded human genome project, a company named Celera Genomics was formed in the year 1998. The company used a radically different approach that involved sequencing both ends of short DNA fragments in a paired-end manner, an approach that had previously been used successfully to sequence small bacterial genomes. The company promised to complete the genome sequence at a much smaller cost of approximately 300 million US dollars, in competition with the international consortium.

Did you know? Another noteworthy event of this period was the Bermuda declaration of 1996, also known as the Bermuda principles for early access to DNA information. The declaration set rules for the early release into the public domain of data generated by the International Human Genome Project. This was a significant shift from the well-established practice of releasing data only after publication in a peer-reviewed journal. The declaration formed the basis of pre-publication release of genomic data, which is widely practiced even today.

The first chromosome to be sequenced was chromosome 22, one of the smallest chromosomes in the human genome. The chromosome sequence was published in the year 1999.

In June 2000, the draft human genome was announced by the then US President Bill Clinton jointly with the British Prime Minister Tony Blair. The papers describing the publicly funded genome and the Celera assembly were published in 2001 in the journals Nature and Science respectively. Further improvements of the drafts were announced in the year 2003.

The Human Genome Project was unique in many ways. For one, it was a mega-project that involved a large number of researchers, not only from the United States of America, which led the project, but also from other countries across the globe, chiefly Britain, France, China and Japan. The major aim of the project was to provide researchers with a working template for the human genome, along with tools and resources to start understanding the basis of genetic diseases in humans. The computational tools and methods developed as part of the human genome project also helped significantly in the completion of the genomes of other organisms, including many model organisms like mouse, rat, zebrafish, worm and fly, which have been extensively used to understand human diseases.


Chapter 3

Genome variations and how they make us different?

The completion of human genome sequencing led to two parallel large endeavors to understand the human genome. One effort spearheaded the functional characterization of the genome by identifying transcribed [8] and regulatory elements [9], whereas the second focused on understanding genomic variability.

[8] Protein coding genes are transcribed to messenger RNA and further translated to proteins.

[9] Regulatory elements are regions in the genome that regulate the expression of genes. They include promoters of genes and enhancers, among others.

Did you know? The Celera project included DNA from 5 donors selected from a pool of 21 individuals. The founder, Craig Venter, was also part of the pool.

The human genome is quite large: over three billion letters drawn from an alphabet of four nucleotides, Adenine (A), Thymine (T), Guanine (G) and Cytosine (C), placed on a string. Though the genome is quite similar between individuals, every one of us carries changes, and this variability in the genome sequence is largely what makes us different. The number of variations between individuals is quite large, approximately 3,000,000 (3 million). Given the size of the human genome, this amounts to roughly one variation in every thousand bases. Many of these variations have no impact on the functioning of the organism. Some of them are quite common in the human population [10]. We should also note that half of the genome is inherited from each parent [11]. The variations we inherit collectively make us look, behave and sometimes act like our parents. Many of the variations could therefore be surrogates for the features that we inherit. Geneticists call these features traits [12]. As you may have guessed, there are innumerable human traits, and cataloging them is often a complex and tough task.
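The rough figures quoted above (about three million variations in a genome of about three billion bases) can be checked with simple arithmetic. The sketch below uses those rounded numbers as given in the text; the variable names are illustrative only.

```python
# Rounded figures taken from the text above.
GENOME_SIZE_BASES = 3_000_000_000   # about three billion bases
VARIATIONS_PER_PERSON = 3_000_000   # about three million variations

# On average, how many bases lie between one variation and the next?
bases_per_variation = GENOME_SIZE_BASES // VARIATIONS_PER_PERSON

# What fraction of positions is the same between two individuals?
fraction_identical = 1 - VARIATIONS_PER_PERSON / GENOME_SIZE_BASES

print(f"About one variation every {bases_per_variation} bases")
print(f"About {fraction_identical:.1%} of positions are identical")
```

In other words, two genomes agree at the overwhelming majority of positions, and the differences that matter have to be found among a few million candidates.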

Understanding genomic variations and their association with human traits is by itself quite complicated. On one hand, we need to know the extent of genomic variability; on the other, we need to know which variation or set of variations is associated with a particular trait. Sequencing a large number of individuals to understand genomic variability would be a herculean task, given the costs involved and the complexity of executing such a large project. But without a grasp of genomic variability and an understanding of how it affects human traits, the fruits of genomics cannot be tasted.

[10] Variations that are common in the population, i.e., have a frequency of more than 1%, are popularly called polymorphisms. Single nucleotide variations that are polymorphic are therefore called single nucleotide polymorphisms or SNPs.

[11] The human genome is diploid and one copy of each chromosome is inherited from each parent.

[12] A trait is defined as a quality or feature, especially of an individual. Examples include hair color, eye color and height.

Now there were shortcuts available. The first was that one could create a crude map of genomic variations by putting together information from multiple sources. The first source that scientists laid their hands on was the sequence data itself. The genome assembly led by Craig Venter, popularly called the Celera assembly, was one large resource. Apart from that, scientists had also placed sequences of smaller regions, sometimes genes and parts of genes, in the public domain, and this created the next resource. So there was something to start with.

The genome is not inherited randomly from parent to child. Genes are inherited as blocks of the genome, one from each parent. Hence, the variations too are inherited in blocks. So if one could study common variations inherited in blocks, one could identify the blocks that are associated with a trait, and thus map the trait to the genomic region encompassing the block. Suppose one had a family in which a particular trait is inherited, say the lack of the pigment melanin in skin, hair and eyes (leading to a condition called albinism). One could then, in theory, study the blocks of the genome passed from each parent to child and observe whether all the people with albinism inherited the same block of the genome. This is a somewhat complex approach, which geneticists call linkage mapping. Since children inherit a large number of traits from their parents, differentiating each from one another becomes a humongous task. However, the task becomes easier if one is lucky enough to identify large families with numerous affected individuals spanning multiple generations.

Did you know? Polymerase chain reaction is a molecular technique developed in 1983 by Kary Mullis to amplify a piece of DNA. This technique won him the Nobel Prize for Chemistry in 1993.

Now, as we mentioned above, you could just study common variations and the blocks of the genome that harbor them. These are called tag variations13. You could then study just a small number of common variations to understand associations with common diseases. Well, before single nucleotide changes were employed, scientists used something simpler to tag genomic blocks. These were based on typing repeats in the genome. The locations of many of these repeats were common in the population and one could use simple techniques such as polymerase chain reaction (PCR) to type these repeats and their lengths.

13 Tag single nucleotide polymorphisms (SNPs) are representative variations which mark a stretch of the genome.


Did you know? The DNA samples in the HapMap project involved individuals from Yoruba tribe in Ibadan, Nigeria, Chinese from Beijing, Japanese from Tokyo and people with European ancestry maintained at the Centre d’Etude du Polymorphisme Humain (CEPH) in France.

The advancements in optics and in the miniaturization of components, including microprocessors and microelectronics, that followed the genome era were soon applied to the study of genomics. The earliest outcome was the microarray, which in the last decade revolutionized genomics. Scientists learned that one could immobilize small fragments of DNA onto glass slides14. These small fragments of DNA could be used to identify single nucleotide variations, by the mere fact that a fragment hybridizes effectively only when the complementary nucleotide is present. This became a quick and popular assay for typing variations in the genome. Further advancements in miniaturization allowed such fragments of DNA to be packed onto slides at higher densities, thereby enabling a larger number of variations to be typed.

The ready availability of microarrays to study variations provided a huge impetus towards the understanding of genomic variations and their associations with human traits. These studies extensively used genome wide approaches to mark blocks in the genome and are popularly known today as genome wide association studies (GWAS). The later years saw the discovery of a large number of variations and their associations with human traits and diseases. This approach continues to yield rich dividends in cataloguing genomic variations and their associations with traits and diseases.

14 These are popularly known as microarrays.

A number of global initiatives to map genomic blocks and their associations have provided us with a map of regions in the human genome associated with distinct human traits and diseases in various populations15. These efforts were notably the first popular approaches to collect genomic variations associated with human diseases.

Now coming back to the case of Bhai. While the genome wide association studies were moderately effective in mapping genomic blocks associated with common diseases and traits, these approaches were futile in the case of rare genetic diseases. This was primarily because the genome wide association studies relied on common variants and common traits, whereas rare genetic diseases are caused by rare variants. In the earlier sections, we had mentioned an approach using repeats, called microsatellites16. Microsatellite based studies were the mainstay in mapping genes associated with such rare diseases, but they were often cumbersome, time-consuming and costly, and their success was heavily dependent on identifying large families. A typical microsatellite study in a standard molecular biology laboratory would take months for data generation and analysis, which precluded its widespread application in clinical settings for want of expertise and infrastructure.

15 Welter, Danielle, et al. "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations." Nucleic Acids Research 42.D1 (2014): D1001-D1006. A visual representation of this map is available at URL: http://www.genome.gov/gwastudies/

16 Microsatellites are also called Simple Sequence Repeats (SSRs) or Short Tandem Repeats (STRs). They encompass small stretches of 2-5 nucleotides which occur in tandem.
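To make the definition in footnote 16 concrete, here is a minimal sketch that scans a DNA string for a unit of 2-5 nucleotides repeated in tandem. The function name, the minimum of three copies and the example sequence are assumptions made for illustration. Laboratory typing of microsatellites relies on PCR and fragment sizing, not on string matching.

```python
import re
from typing import List, Tuple


def find_microsatellites(seq: str, min_repeats: int = 3) -> List[Tuple[int, str, int]]:
    """Return (start position, repeat unit, copy number) for each tandem repeat found."""
    # A unit of 2-5 bases (shortest tried first), followed by at least
    # (min_repeats - 1) further copies of that same unit.
    pattern = re.compile(r"([ACGT]{2,5}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Made-up sequence containing four tandem copies of CAG.
example = "GGTCAGCAGCAGCAGTT"
print(find_microsatellites(example))
```

Note that this simple pattern would also report a run of a single base (for example AAAAAA as three copies of AA), so a real tool would need additional filtering.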


Chapter 4

A brief introduction to next generation sequencing

After the announcement of the genome sequencing, a silent revolution was taking shape on the technology front. A number of researchers were working hard to enable quick and cheap sequencing of nucleotides. The traditional Sanger sequencing lacked the speed and cost effectiveness needed to sequence genomes. Research labs around the globe were approaching the problem in a variety of ways. The field also saw the convergence of technologies from multiple areas including nanotechnology, microelectronics and computing. These efforts led to the emergence of a spectrum of approaches, each different in principle and with its own set of limitations and advantages, but similar in the goal of providing cheap, fast and high throughput sequencing of nucleotides. These technologies came to be popularly known as next generation sequencing (NGS), differentiating them from the first generation sequencing technology, which comprised automated Sanger chemistry.

Briefly, next generation sequencing refers to a gamut of sequencing technologies, which differentiate themselves from conventional Sanger sequencing in terms of the technology employed, significantly higher throughput of sequence generation, quality of the sequencing and reduction in per-base sequencing costs.


Did you know? The pyrosequencing methodology relies on the release of a pyrophosphate with each nucleotide addition. This pyrophosphate is acted upon by ATP sulfurylase, which produces ATP in the presence of adenosine 5´ phosphosulfate. This ATP drives the luciferase-mediated conversion of luciferin to oxyluciferin, generating light that is captured by the camera.

One of the earliest NGS technologies was massively parallel signature sequencing, or MPSS, developed by a company called Lynx Therapeutics as early as the year 2000. The MPSS technology is no longer in commercial use and is now of historical importance. One of the first commercial offerings in the NGS space came from 454 Life Sciences. The commercial 454 sequencers were launched in the year 2004. These systems used a pyrosequencing approach to sequence nucleotides. Short fragments of nucleotides were captured on beads and clonally amplified in an emulsion covering the beads. The beads were then deposited onto picotitre plates. The bases were added sequentially, and each incorporation released a pyrophosphate that was detected by imaging the wells on the plate, thus enabling scalability to sequence millions of short stretches of nucleotides. The sequencing technology became quite popular due to the longer read lengths and high quality data. The 454 sequencing technology was eventually acquired and marketed by Roche Diagnostics. The other two technologies that came to the commercial space were the SOLiD technology marketed by Life Technologies and the reversible termination sequencing technology developed by Solexa, which was later acquired and improved upon by Illumina in the year 2007. The SOLiD technology, which stands for sequencing by oligonucleotide ligation and detection, employed amplification of short stretches of DNA using emulsion PCR and a ligation-based chemistry to sequence them. The first commercial SOLiD sequencers were launched in the year 2007. Though methods like massively parallel signature sequencing and colony sequencing were historically the forerunners of the modern and more popular NGS approaches, many of these technologies are no longer in vogue and are primarily of historical interest or have very specialized applications. Nevertheless, the associated tools and methods, including miniaturization, massive parallelization and methods for assembling short sequences, still form the conceptual mainstay of the field. These methodologies are detailed in a later section of this book.

One of the most popular and field-tested technologies in practice to date is the one developed by Solexa. As legend goes, a couple of British scientists met at a bar in Cambridge over a pint of beer to chalk out a better chemistry to sequence nucleotides in high throughput. The informal summit at the Panton Arms, dubbed by many the Beer Summit, was where the most popular next generation technology was chalked out. Shankar Balasubramanian and David Klenerman put together their chemistry and laser detection expertise to develop the reversible terminator based sequencing technology. The startup Solexa put flesh on their concepts, and the Genome Analyzer, a commercial bench top next generation sequencer, was born. The basic technology can be summarized as follows. Short pieces of DNA are captured on a solid glass surface using small adapters, and these stretches are amplified on the slide to produce clonal bunches of DNA stretches.

These clonal bunches of single stranded DNA are then used as templates for DNA synthesis, cycle by cycle. In each cycle, a nucleotide attached to a fluorophore is added. This addition is recorded by imaging the slide. The fluorophore is then removed, and the cycle goes on for the entire stretch of the DNA template. The series of recorded images is then analyzed using computers to reconstruct the sequence of the stretch of DNA. The computer systematically goes through the images, cycle by cycle, and reconstructs the order of nucleotides from the fluorophore that lit up at each cycle.
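The cycle-by-cycle reconstruction described above can be sketched in a few lines. This is only an illustration of the idea, not the instrument's actual base-calling software: it assumes that the images have already been reduced to one intensity value per nucleotide per cycle for a single cluster, and the numbers are invented.

```python
from typing import Dict, List


def call_bases(cycles: List[Dict[str, float]]) -> str:
    """For each cycle, pick the nucleotide whose fluorophore signal is brightest."""
    return "".join(max(signal, key=signal.get) for signal in cycles)


# Hypothetical intensities for one clonal cluster over four cycles.
cluster_signal = [
    {"A": 0.91, "C": 0.04, "G": 0.03, "T": 0.02},
    {"A": 0.05, "C": 0.88, "G": 0.04, "T": 0.03},
    {"A": 0.02, "C": 0.06, "G": 0.10, "T": 0.82},
    {"A": 0.07, "C": 0.03, "G": 0.85, "T": 0.05},
]
print(call_bases(cluster_signal))
```

A real base caller also has to correct for overlap between fluorophore signals and for clusters drifting out of step, and it attaches a quality score to every call.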

Figure 1. Overview of the Illumina NGS sequencing methodology


A number of other technologies and conceptual methodologies also emerged in the later years. Of note are technologies developed by Helicos Biosciences17, Pacific Biosciences and Ion Torrent. The Helicos sequencer was released in the year 2009 but did not become popular. The company later filed for bankruptcy, consigning the technology to oblivion. Ion Torrent used a conceptually different technology, based on estimation of pH on silicon wafers. The sequencer was released in the year 2011 and the technology and product were later acquired by Life Technologies. Pacific Biosciences also released a commercial sequencer in the year 2011, based on single molecule sequencing chemistry without amplification. The technology has many advantages compared to others, in that the single molecule chemistry obviates the PCR bias incurred in other sequencing methodologies and, in addition, provides very long reads, sometimes extending to kilobases. Such long reads have enormous applications, such as the detection of structural variations. Nevertheless, the technology has not found widespread application in regular clinical settings, but is quite popular among the research community, especially in laboratories working on genomes that are difficult to assemble.

A number of newer technologies are presently on the anvil and not yet available in the commercial space, including nanopore sequencing, which uses protein nanopores to detect nucleotide bases.

17 Helicos Biosciences was co-founded in the year 2003, and its technology imaged individual DNA molecules. It also featured a chemistry, dubbed the ‘Virtual terminator’, which prevented incorporation of multiple nucleotides in each cycle.


Figure 2. The Illumina Hiseq 2500 Next Generation Sequencer Courtesy: CSIR Institute of Genomics and Integrative Biology, Delhi.

Figure 3. The Ion Torrent Proton Sequencer based on semiconductor chips. Courtesy: CSIR Institute of Genomics and Integrative Biology, Delhi.


Did you know? Gordon E Moore, one of the co-founders of Intel, predicted that the density of transistors in an integrated circuit would double every two years. This is commonly known as Moore’s law.

Chapter 5

When you could sequence your own genomes

Next generation sequencing was like a tsunami. Though the early adopters of the technology saw its huge potential, many of the traditionalists were quite slow to realize the potential and the future. Sanger sequencing was entrenched in many clinical laboratories and was widely acclaimed for its reliability, quality and ease of use, with automation being standard. During the early years, commercial next generation sequencing platforms were fraught with frequent machine downtimes and shorter read lengths, which practically limited their applications, and they usually produced reads of lower quality than traditional Sanger sequencing. Nevertheless, these technologies provided the much-needed throughput to bring whole genome sequencing within foreseeable reach.

The revolution in technological advancements and the resultant scale and throughput was phenomenal, so much so that at one point, the speed at which the sequencing technology improved in terms of throughput and cost reduction was comparable to Moore’s law in the case of microprocessors. The phenomenal increase in throughput and fall in cost is depicted in Figure 1.

Figure 1. The dwindling cost of whole genome sequencing over the years. The X-axis denotes the timeline, and the Y-axis denotes the costs in US$ on a logarithmic scale. Data from http://www.genome.gov/sequencingcosts/ Retrieved Feb 04, 2015
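To give a feel for what a Moore's-law-like pace means for cost, here is a small worked sketch. It assumes, purely for illustration, that cost halves every two years in the same way that transistor density doubles every two years. The starting cost is a hypothetical figure and not a value taken from the sequencing cost data.

```python
def cost_if_halving(start_cost: float, years: float, halving_period_years: float = 2.0) -> float:
    """Cost after `years` if it halves once every `halving_period_years`."""
    return start_cost * 0.5 ** (years / halving_period_years)


# Hypothetical starting cost of 1000 units, followed over a decade.
for year in range(0, 11, 2):
    print(year, cost_if_halving(1000.0, year))
```

Under this assumption the cost falls about 32-fold in ten years, which gives a yardstick against which the curve in the figure can be compared.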

What came next was the race to sequence human genomes. The first, of course, were the stalwarts themselves, Watson and Venter, who sequenced and made available their personal genomes. What came out of the sequencing was an astounding number of novel variations, which had not been reported before. The years that followed saw large genome centers shift drastically to next generation sequencers and rapidly adapt themselves to the avalanche of data. There were a few new players too, notably the Beijing Genomics Institute, which at one point was the largest genome facility, with over a hundred next generation sequencers.


The rapid technological advancements during this period led to a major paradigm shift, making genome sequencing accessible to small research labs. For the first time the power of genomics was being shared and tasted by less well endowed laboratories, which did not have the wherewithal to own and operate a large inventory of sequencers and computing infrastructure, let alone trained technicians and analysts.

Countries like India, which were not at the forefront of technology during the initial human genome sequencing initiative, were quick to adopt next generation sequencing. What followed was a flurry of human genome sequencing announcements from across the world. The Chinese announced the Han Chinese genome sequenced by the Beijing Genomics Institute, while the Japanese announced the Japanese genome and the Koreans announced the Korean genomes. India was not far behind. The team from the CSIR funded Institute of Genomics and Integrative Biology (CSIR-IGIB), Delhi announced the first Indian genome. The flurry of genome announcements continued: the African genomes, Sri Lankan, Malaysian, Russian, and so on. Those were exciting times! We would pore over online announcements of sequenced genomes, which were appearing almost every month, and see if we could put them together to derive scientific insights.

Being associated with the Indian genome sequencing activity was a humbling experience. While it taught us much of the nuts and bolts of genome sequencing and analysis, it also provided immense insights into how genome sequencing could be applied in clinical practice. The costs of genome sequencing were also dwindling drastically and a thousand dollar genome and its promises were widely discussed.

While sequencing an individual whole human genome would reveal approximately three million variations, the computational pipelines and datasets available for analysis can functionally annotate only a small portion of these variations. This is primarily because the functional annotation of variations depends on computational methods that predict whether a variation can change the protein sequence and structure, and thereby its functionality. This essentially means that the bulk of functional annotation can be done only for variations that fall in protein coding regions of the genome. This is detailed in the next chapter. Having said this, it should also be emphasized that methodologies to functionally annotate and prioritize variations in regions of the genome not coding for proteins also exist, though they have not been as popular. Some of the early methodologies for prioritizing variations in non-protein coding genes have come out of our own laboratories. In addition, a number of newer methodologies to annotate functional variations in regulatory regions of the genome also exist and have been widely used in the literature.


Figure 2. The CIRCOS representation of the first Indian genome announced in 2009 and the title page of the publication. (Patowary et al. "Systematic analysis and functional annotation of variations in the genome of an Indian individual." Human Mutation 33.7 (2012): 1133-1140.)


Chapter 6

So what if we could sequence just the protein coding genome?

The previous chapter discussed the limitations in the analysis of whole genome data. So the natural question is: if the present methods of functional annotation are largely limited to just the protein coding regions of the genome, then why not sequence just this part? Such an approach has the potential to significantly reduce the cost of sequencing, ease the handling and analysis of data, and possibly allow its implementation in clinical practice to aid diagnosis. This is popularly called exome sequencing.

An exome is defined as the protein-coding region of a genome. In the human genome, the exome is estimated to be approximately 1% of the genome, or roughly 30 million bases. Since proteins are the major workhorses of the cell and modulate biological functions and outcomes, sequencing just the protein coding region of the genome offers a quick, cost effective solution to screen for genetic mutations. A number of approaches have been on the anvil to extract and sequence just the protein-coding regions in the genome. Three major approaches are popularly employed to extract specific regions of the genome (also known as targets) for sequencing.
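The arithmetic behind these figures can be worked through in a short sketch. The genome size of about three billion bases is an assumption that is consistent with the 1% and 30 million figures above, and the sequencing depths are hypothetical values chosen only to show why a smaller target means less data.

```python
HUMAN_GENOME_BASES = 3_000_000_000  # assumption: approximate size of the human genome
EXOME_PERCENT = 1                   # from the text: roughly 1% of the genome


def exome_size(genome_bases: int = HUMAN_GENOME_BASES, percent: int = EXOME_PERCENT) -> int:
    """Approximate number of protein-coding bases."""
    return genome_bases * percent // 100


def raw_bases_needed(target_bases: int, mean_depth: int) -> int:
    """Bases that must be sequenced to read each target base `mean_depth` times on average."""
    return target_bases * mean_depth


exome = exome_size()
# Hypothetical depths, used only for illustration.
exome_yield = raw_bases_needed(exome, 100)
genome_yield = raw_bases_needed(HUMAN_GENOME_BASES, 30)
print(exome, exome_yield, genome_yield, genome_yield // exome_yield)
```

Even when the exome is read far more deeply than the genome in this example, the whole genome still needs about thirty times as much raw sequence.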

One approach would be to amplify the specific regions in question using standard polymerase chain reaction. Usually, the reactions are multiplexed and involve pools of primers that amplify selected regions of the genome. The products of the PCR could be pooled together and sequenced. This approach is widely used to amplify smaller regions of the genome, but scales poorly to larger targets such as whole exomes, because identifying optimum sets of PCR primers with comparable efficiencies and high specificity is challenging given the complexity of the human genome.

Figure 1. Conceptual outline of the gene structure with exons, introns and the un-translated regions. The blue regions denote the protein-coding regions, and the yellow regions denote the untranslated regions. The transcript is spliced to form the messenger RNA and then translated to functional protein.

Another popular approach has been the specific capture of DNA corresponding to the regions in question. This technique uses the principle of specific base pair complementarity to isolate specific regions in the genome. The capture reaction involves pools of single stranded oligonucleotides attached to solid surfaces, either beads or a glass surface. These oligonucleotides are designed to be complementary to the regions or targets that need to be captured. Briefly, the genome is fragmented using ultrasound or specific enzymes known as restriction enzymes, which cut the DNA at specific sites. This produces DNA fragments of approximately comparable sizes. The strands are then denatured and only fragments complementary to the probes are isolated from the pool, thus enriching for regions that fall in protein coding regions as compared to the whole genome.
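The principle of capture by complementarity can be sketched as follows. This is a toy model under stated assumptions: probes and fragments are short made-up strings, and a fragment counts as captured only if it contains the exact reverse complement of a probe. Real hybridization tolerates mismatches and depends on probe length, temperature and chemistry.

```python
from typing import List

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Sequence of the strand that would pair with `seq`, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def capture(fragments: List[str], probes: List[str]) -> List[str]:
    """Keep fragments that contain a stretch able to pair with at least one probe."""
    targets = [reverse_complement(probe) for probe in probes]
    return [frag for frag in fragments if any(t in frag.upper() for t in targets)]


# Hypothetical single stranded fragments and one hypothetical capture probe.
fragments = ["GGATGCAACC", "CCCCGGGGAA"]
probes = ["TTGCA"]
print(capture(fragments, probes))
```

Only the first fragment carries the stretch TGCAA that pairs with the probe, so it alone is retained, which mirrors how capture enriches the targeted regions from the pool.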

The targets are then processed for whole exome sequencing following standard protocols. An overview of the two popular approaches to enrich for protein coding regions in the genome is summarized in Figure 2.

Though the approach seems simple and logical, exome sequencing also has its share of limitations. The first limitation is that it by design excludes genomic variations falling outside protein coding regions, many of which are functional. The best examples are promoter variations, which change the expression of specific genes, and regulatory variations in the untranslated regions, which are known to modulate the expression of genes and the stability of transcripts.


Figure 2. Conceptual outline of the two popular methodologies for capturing specific regions in the genome. The first methodology involves capture probes immobilized on solid surfaces, while the second approach involves probes immobilized on beads.

Figure 3. Conceptual overview of the major steps in the primary analysis pipeline, which involves a sequence quality check and alignment of high quality reads to the reference genome.


The second major caveat of exome sequencing is that specific types of variations cannot be accurately typed. The best example is chromosomal abnormalities, especially those with no net change in copy number, such as translocations and inversions. Since the capture methodology enriches specific stretches of the genome without keeping the context of the genomic region they came from, it would be impossible to decipher such events unless the breakpoint occurs within a protein coding region, as in the case of the well-studied PML-RARA translocation in leukemia. Though new computational tools enable the characterization of copy numbers from exome sequencing data, it should be emphasized that exome sequencing is still not the most accurate methodology to look for chromosomal abnormalities that involve a copy number change.

These limitations aside, sequencing just the protein coding part of the genome has its advantages. The first is the cost, which is significantly lower than that of whole genome sequencing. The second is the relatively small amount of data, which is easier to handle and less complex to analyze, without reliance on the huge computing infrastructure required to analyze whole human genomes. The third is the ready availability of methods and tools, including online resources, to systematically analyze the data, which makes analysis and interpretation a bit easier for clinicians.


Advantages of whole exome sequencing in clinical settings
1. Fast: 1-4 weeks turnaround time
2. Holistic, as it covers the majority of known disease causing gene loci
3. Cheaper in specific situations

Chapter 7

When should you do exome sequencing?

So the obvious next question would be when should I do exome sequencing?

Let us go back to the case of Bhai, which illustrates the first scenario. The molecular diagnosis and confirmation of the disease would require sequencing of approximately 20 amplicons using the Sanger sequencing approach in a traditional diagnostic setup. Standardizing the PCR amplicons and performing the sequencing is a tedious, time-consuming and sometimes expensive proposition, which makes the accurate molecular characterization of many diseases a challenge.

The second is a scenario where there are a number of differential diagnoses. There are many examples of such cases in regular clinical settings. In such situations, the accurate molecular characterization and diagnosis of the disease would require sequencing of multiple loci and genes, which in several settings, as in the previous situation, might become tedious, time-consuming and expensive.


Exome sequencing is a new alternative approach in such scenarios for a number of reasons. Exome sequencing is quite fast, with commercial turnarounds in the range of weeks rather than months. The approach is holistic, in the sense that it covers a majority of genes involved in Mendelian diseases. In addition, in many cases that involve a number of genes or exons for confirmatory diagnosis, it might be cheaper than traditional approaches.

The third scenario is where there is no diagnosis and the presentation is quite rare, or there are multiple affected family members, or the situation involves consanguinity. After exclusion of chromosomal abnormalities and structural variations, exome sequencing might be an interesting approach to follow in such situations.

The fourth and probably the commonest case where exome sequencing is warranted is when a definitive clinical diagnosis has been made, but the specific variant or variants known to be associated with the disease are found to be unaltered. This would hint towards the involvement of a novel variant or a new gene locus, which would benefit significantly from a holistic approach like exome sequencing.

The fifth situation where exome sequencing would be extremely beneficial is in cases where a specific molecular diagnosis is expensive and possibly not available in the specific local situation or country, or in cases where the timelines for diagnosis would not be met by a conventional approach. Exome sequencing in such cases would be useful on the economic front as well as on grounds of speed and efficiency.

The GUaRDIAN Consortium
GUaRDIAN stands for Genomics for Understanding Rare Diseases-India Alliance Network. It is a consortium and network of clinicians, clinical geneticists and genomics researchers formed with the aim of using the power of genomics to understand the molecular basis of rare genetic diseases. More information on the consortium and how it could help you is available online at URL: http://guardian.meragenome.com

The sixth scenario is in the case of undiagnosed diseases with a clear or suggestive genetic cause. A number of international studies have suggested that whole exome sequencing would be a useful proposition to arrive at a definitive diagnosis in cases of undiagnosed diseases. Specific programs and studies have extensively undertaken exome sequencing to identify undiagnosed or rare diseases. These have provided insights and diagnoses for a significant number of cases in their cohorts.

There are a number of research settings that would benefit significantly from exome sequencing. These are especially cases of genetic diseases that show atypical presentations, or additional features of otherwise clinically diagnosed conditions, where the possibility of finding novel variants and novel loci exists.


The other research application of exome sequencing in clinical settings is in understanding the genetic basis of rare genetic diseases. A number of recent studies have shown that exome sequencing and whole genome sequencing could be appropriate genomics tools for dissecting the molecular basis of rare genetic diseases and discovering the novel mechanisms and gene loci involved.

Figure 1. The quadrant where the optimum use of whole exome and genome sequencing is recommended.

In addition, as described in Figure 1, exome sequencing has rightfully found its place in the discovery of rare mutations with large effect sizes and of genetic loci associated with common diseases. Exome sequencing has recently also been extensively used to discover rare variants associated with common diseases. This has been largely possible by sequencing individuals at the ends of the spectrum. A number of recent reports have shown that this approach is powerful and could provide a new opportunity to understand genetic variants with large effect sizes.


Chapter 8

When should you probably not do exome sequencing?

It should be noted that exome sequencing is not a magic bullet that can enable the diagnosis of all genetic diseases; nevertheless, it should be considered a new technological advancement that can provide valuable insights to aid the diagnosis of a majority of genetic diseases. Exome sequencing is not without caveats. These limitations should be clearly understood so that the expectations from whole exome sequencing remain realistic.

The major caveat is that the approach can only identify variations in the protein-coding regions of genes. A number of genetic diseases are known to be caused by mutations in non-protein-coding regions, including non-coding RNAs. Most of the newer exome capture panels also include untranslated regions, promoters and in some cases non-coding RNA genes. It should also be noted that many diseases are caused by variations in introns and splice junctions, which might not be captured by a typical exome capture panel. An informed decision is therefore warranted before selecting exome sequencing as a method of diagnosis for such diseases.

Contrary to expectations, not all genes are captured in typical exome sequencing. A number of exons that encompass repeats, lie in repeat-rich regions, or have a high content of Gs and Cs cannot be accurately captured and resolved by exome sequencing.

Exome sequencing is not useful in diseases associated with chromosomal abnormalities and structural variations in the chromosomes (with very few exceptions). A large number of syndromes involve large chromosomal abnormalities, including copy number and structural abnormalities. The capture methodology precludes the identification of such chromosomal abnormalities, especially ones that are not associated with a net change in the chromosome number. The exceptions are rare and mostly involve a breakpoint within a protein-coding gene. Though standard pipelines for exome analysis are built to analyze single nucleotide variations and insertion-deletion events, newer specialized pipelines are now available to detect copy number changes and breakpoints. It should be noted that such analysis is still in the research domain and has not been extensively applied in clinical settings.

A number of diseases are caused by repeat expansions. The best-studied examples include Huntington's disease and some spinocerebellar ataxias. The exome sequencing approach is not very effective in diagnosing such diseases. This limitation arises primarily because most next generation sequencers are not able to accurately resolve repeats, especially simple repeats.

A number of diseases are caused by mutations in the mitochondrial genes that show a unique feature called heteroplasmy, which means that mitochondria with multiple variations are present in the same cell. Standard exome capture and analysis methodologies largely ignore the mitochondrial genome, though some capture methodologies do systematically capture mitochondrial variations, and pipelines for the analysis of mitochondrial variations are also available. If you suspect a mitochondrial disease with a maternal pattern of inheritance, it would be worthwhile to start with mitochondrial sequencing. It should also be mentioned that not all mitochondrial abnormalities are caused by variations in the mitochondrial genome. The products of a number of nuclear genes are imported into the mitochondria, and mutations in these genes could also manifest as mitochondrial abnormalities, albeit with a Mendelian pattern of inheritance.

A handful of rare diseases are caused by uniparental disomy. Usually two copies of the genome are inherited, one from each parent. In some situations, both copies of the alleles are inherited from the same parent. Typical exome sequencing would not be able to identify whether the mutation came from one parent or both.


Before you decide on exome sequencing, collect the following information

Complete detailed family history and pedigree

Complete list of clinical phenotypes and results of clinical investigations

A complete list of differential diagnoses

Chapter 9

First things first: putting insights before data

"Chance favors a prepared mind" - Max Perutz

The diagnosis of a disease is only as good as the clinical work-up you have done on the patient. Before prescribing exome sequencing, you should have your options set and know exactly what your expectations are. Exome sequencing is not a panacea for all the limitations of genetic diagnosis.

A complete family history and pedigree.

Let's come back again to the case of Bhai. In the initial conversations with the primary physician and Bhai himself, the only information that could be gleaned was that only members of his immediate family and close relatives were affected. Multiple encounters and a close study of his distant family tree over several visits and trips revealed that the disease was running in a much larger family, scattered over cities. Multiple coordinated attempts put together a comprehensive family tree, which revealed that the disease had been running in the family for generations and involved more than a dozen affected members. The family still remains the largest reported family affected with Epidermolysis Bullosa in India.

There is nothing better than a detailed family history and pedigree to help clinch a clue and assist to a great extent in arriving at the right diagnosis. The index case or parents might not be forthcoming about the family history, or in many cases might not be aware of the family history of the disease. It would be worthwhile to spend some time with the patient or other members of the family and collect detailed information on all their relatives and their health status, including diseases, medications, deaths and causes of death, miscarriages, abortions, stillbirths, and deaths in the early neonatal period and childhood.

Consanguinity¹⁸ is another key question. In many cases the family might not be forthcoming about consanguinity, as it is sometimes the norm in many communities. Often all the relevant information cannot be gathered in a single sitting, as the patients or parents might not be aware of or might not recollect the facts, so it is useful to gather the details over multiple interactions. If the patient or parents are not educated, it would also sometimes be necessary to ask pointed, but not suggestive, questions regarding the diseases, deaths and causes thereof in the family.

18 Consanguinity means shared kinship or blood relation.

A good, detailed pedigree permits hypothesizing the mode of inheritance of the disease, which is extremely useful for prioritizing variations in the exome data. For example, the family tree of Bhai helped us clinch a diagnosis of autosomal dominant Epidermolysis Bullosa. The causal genetic variant is thus expected in the heterozygous state in the exome, which meant that we could readily prioritize candidates by sequencing two affected members of the family. A detailed chapter on prioritizing variants is available in the later part of this book. Similarly, a consanguineous marriage would suggest the possibility of a recessive¹⁹ disease and also suggests mapping regions of homozygosity (described in the later chapters as a methodology to prioritize variations after exome sequencing). The concurrence of disease in multiple individuals in an outbred family suggests the possibility of an autosomal dominant presentation, while a disease passed on through the maternal lineage over generations would suggest a mitochondrial mode of inheritance.
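The way a hypothesized mode of inheritance narrows the candidate list can be sketched in a few lines of Python. This is only an illustrative sketch, not the prioritization method described later in the book: the function name, the sample names, the variant identifiers and the toy genotypes are all hypothetical, genotypes are assumed to be written in the usual "0/1" style, and the use of an unaffected relative as an additional filter is an assumption that goes beyond the two affected members mentioned above.

```python
# Illustrative sketch: keep variants compatible with an autosomal dominant model.
# All names and genotypes below are hypothetical toy data.

HETEROZYGOUS = {"0/1", "1/0", "0|1", "1|0"}
HOMOZYGOUS_REFERENCE = {"0/0", "0|0"}


def dominant_candidates(genotypes, affected, unaffected=()):
    """Return variant ids that are heterozygous in every affected sample
    and homozygous reference in every unaffected sample.

    genotypes: dict mapping variant id -> dict of sample name -> genotype string.
    A missing genotype call counts as a failure, so the variant is dropped.
    """
    candidates = []
    for variant_id, calls in genotypes.items():
        in_all_affected = all(calls.get(s) in HETEROZYGOUS for s in affected)
        in_no_unaffected = all(
            calls.get(s) in HOMOZYGOUS_REFERENCE for s in unaffected
        )
        if in_all_affected and in_no_unaffected:
            candidates.append(variant_id)
    return candidates


toy_genotypes = {
    "variant_A": {"affected_1": "0/1", "affected_2": "0/1", "unaffected_1": "0/0"},
    "variant_B": {"affected_1": "0/1", "affected_2": "0/0", "unaffected_1": "0/0"},
    "variant_C": {"affected_1": "1/1", "affected_2": "0/1", "unaffected_1": "0/0"},
    "variant_D": {"affected_1": "0/1", "affected_2": "0/1", "unaffected_1": "0/1"},
}

shortlist = dominant_candidates(
    toy_genotypes, ["affected_1", "affected_2"], ["unaffected_1"]
)
print(shortlist)
```

In a real family, reduced penetrance or a late age of onset could make an apparently unaffected relative a carrier, so a filter of this kind is a starting point for prioritization rather than a rule.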

A complete list of clinical phenotypes and clinical investigations

Apart from the detailed pedigree, a thorough clinical examination and enumeration of the clinical findings is an important aspect that should not be overlooked. In cases of clinical presentations like facial dysmorphology²⁰ or skin abnormalities, a detailed description of the findings is necessary. It would also be worthwhile to have clinical photographs of the features to avoid ambiguity and to enable other clinicians or clinical geneticists to arrive at an independent conclusion. For patients with diseases that manifest as abnormal levels of metabolites in the blood, a detailed investigation of these levels is also an essential part of the clinical work-up.

19 Both copies of the gene would need to be mutated to manifest an autosomal recessive disease.

20 Dysmorphology is the study of birth defects, especially involving the morphology of the body.

Did you know? The Online Mendelian Inheritance in Man (OMIM) database is a comprehensive online database of human genes and disease phenotypes. The work of collecting Mendelian diseases and traits was originally initiated by Dr. Victor A. McKusick in the 1960s and was initially available as a book. The electronic version of the compendium has been available online in its present form since 1995 through the National Center for Biotechnology Information. The present OMIM is curated and maintained by the McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, USA.

A complete list of differential diagnoses

The clinical findings and investigation reports, together with the detailed pedigree, form the basic set of clues enabling one to arrive at a set of differential diagnoses. It would be worthwhile to list a detailed set of differential diagnoses before one prescribes exome sequencing in clinical settings. This would enable the prioritization of genes to be closely examined. Apart from the list of differential diagnoses, a list of genes that are involved in the disease also comes in handy while analyzing the exome data. A list of potential genes involved could be garnered from the Online Mendelian Inheritance in Man (OMIM) database. Furthermore, locus-specific variation databases listing variants in these genes and their pathogenic effects could be consulted.


Chapter 10

Educating the patient and getting an informed consent

Before prescribing exome sequencing, it is imperative to explain the entire method, its benefits and its pitfalls. It is also imperative to inform the patient about the potential risk of uncovering unanticipated facts from the exome sequence, including risks of late-onset diseases, cancers and sometimes paternity. It is therefore essential to take both parents into confidence before exome sequencing is prescribed. A detailed information sheet that explains a non-exhaustive set of circumstances and/or scenarios is appended at the end of the book.

The following major points need to be specifically discussed with the patient before exome sequencing.

1) Samples collected: The patient needs to be informed how the samples would be collected (saliva, blood) and what amount of sample would be collected.

2) The analysis performed on the samples also needs to be explained to the patient. If any additional genetic, epigenetic or biochemical tests are to be performed on the sample, this needs to be mentioned, along with how such a test would help in reaching the diagnosis.

3) Use of data and release: The patient needs to be informed whether the data would also be used for research and whether it would be released in a public database at any time. The benefits and risks of public release also need to be discussed.

Did you know? Manuel Corpas, a researcher, made the genomes of himself and his family available in a freely available and re-usable format on the internet, with the hope that people could download the data, analyze it and obtain new insights into the genome. This was popularly called the ‘Corpasome’. Such an approach could potentially keep the genome analysis and derivative information up-to-date and comprehensive at any point in time, with enormous benefits in understanding disease predispositions and/or prognosis. The paper describing the dataset was published with the following citation: Source Code for Biology and Medicine 2013, 8:13 doi:10.1186/1751-0473-8-13 http://www.scfbm.org/content/8/1/13

4) Risks and discomforts: The risks and discomfort due to the methodology of sample collection, or due to having the exome sequence available, should also be explained in detail. A few scenarios are explained below.

a. The availability of the sequence could put one in precarious situations, including identification of the individual, inference of paternity, inference of specific features of the genealogy and possible prediction of risks to self and children, and sometimes to other close relatives in the family.


b. The information on the exome could potentially be leaked from multiple sources, electronic or otherwise, which might have implications for the person and the family.

c. If a previous genetic screen has been performed for research or diagnosis, the exome sequence could make the patient identifiable in that earlier dataset.

5) Anonymity and privacy: The patient should be educated about the benefits and risks of being anonymous, and the potential advantages of being non-anonymous. Specific case scenarios of data being publicly released, as in the case of the ‘Corpasome’, could be discussed. If the patient wishes to remain anonymous, the methodologies and measures by which anonymity would be maintained in the specific clinical setting need to be detailed to the patient. The patient should also be educated that privacy and anonymity are not inter-dependent entities, and that modern technologies could maintain anonymity and privacy while still benefiting from public release of the data. A recent paper from our laboratory details this concept²¹.

6) Masking results: The patient could be asked for a list of conditions or types of conditions that might cause discomfort and that need not be screened for in the genetic data generated. Nevertheless, it also needs to be discussed with the patient whether any of the diseases that would benefit from reporting and are part of the ACMG guidelines (detailed in a later chapter) should also be excluded from the analysis or reporting.

21 "Personal genomes, participatory genomics and the anonymity-privacy conundrum." Journal of Genetics (in press) available at URL: http://link.springer.com/article/10.1007/s12041-014-0451-3

The detailed consent form provided to patients and other participants as part of the GUaRDIAN consortium is enclosed below and serves as a ready reference guide.


RESEARCH CONSENT FORM

Reference Code:

Son/daughter/wife of ……………………………………………. aged …………..

Residing at ……………………………………………………
…………………………………………………………………

Hereby consent to freely participate in the genetic study aimed at understanding the human genome.

I have been informed about the implications of my personal genome data being made publicly available through public databases as well as scientific communications.

I have been advised to discuss my participation in this study with my family members. I have been provided written information that may be circulated to them, if necessary.

I have been further informed that personal and medical data collected during this study will be associated with my publicly available genome and may be used for scientific analysis.

My participation in this study is entirely voluntary and I am free to withdraw from this study as and when I feel so inclined.

1. I choose to disclose / not to disclose my identity (select one option).

2. I choose to be / not to be informed of the results of the analysis that may impact my health (applicable only to those who have chosen to disclose their identity; select one option).

3. I choose to exclude the information attached on the "Exclusion Form" from analysis / public disclosure (applicable only to those who have chosen to disclose their identity).

(Signature / Thumb impression of volunteer) (Date)

Certified that the above consent has been signed in my presence. The purpose for which the sample will be used has been explained to the above volunteer. The individual is free to withdraw from the study as and when he/she feels so inclined.

(Signature of the investigator) (Date)


Exclusion Form

I choose to exclude the following information from the questionnaire with respect to analysis or public disclosure (please indicate the relevant question numbers from the attached questionnaire):

1. Analysis:

2. Public disclosure:

INFORMATION FOR THE VOLUNTEERS

1. Purpose of study

The principal scientific goal of this study is to explore avenues to study genetic variability between individuals and to correlate the variability with phenotypes. The data generated (i.e., human DNA sequence, medical information and physical traits) may be used for scientific and clinical research, such as the development of computational tools and interfaces for scientists, clinicians and individuals, in addition to developing general public awareness of the potential benefits and risks of having whole genome level information available to the public.

2. Enrolment procedures

A. Collection of baseline trait data: You are required to provide baseline trait data about yourself, including date of birth, medications, allergies, vaccines, personal and family medical history, race/ethnicity/ancestry and vital signs (e.g. height, weight, blood pressure etc.) in the attached questionnaire.

B. Monozygotic twin: If you have any identical twin(s), such sibling(s) will need to provide consent for your participation in this research.

3. Tissue (Blood/Saliva) collection

A. A blood sample will be collected from the upper arm by venipuncture. Twenty-five ml of blood will be drawn by an authorized medical doctor, or by an authorized technician under the supervision of an authorized medical doctor, in the presence of the principal investigator. The fresh blood sample will be collected in designated containers (which will be provided by CSIR/IGIB). Serum would be isolated from the collected blood sample for biochemical analysis.

B. A saliva sample will be collected by voluntary spitting. Two to four ml of saliva will be collected in designated containers (which will be provided by CSIR/IGIB).

4. Genomic analysis

Analysis of DNA/RNA, including but not limited to whole genome sequencing and other biochemical analysis, will be performed on tissue samples collected from the individual. The nature and extent of analysis will be determined by CSIR/IGIB at its sole discretion.

5. Public release of research data

Upon completion of genomic analysis, your DNA sequence data will be made available through the CSIR/IGIB website and other scientific communications (including but not limited to publication in scientific journals). This information is for research purposes only and may not be used by you for any medical or clinical purpose unless the relevant research data (DNA sequence) is first confirmed and discussed in consultation with a health care professional. By signing this consent form, you hereby agree and authorize CSIR/IGIB to proceed with the full public release of your DNA/RNA sequence data and other information (date of birth, medications, allergies, vaccines, personal and family medical history, race/ethnicity/ancestry and vital signs) voluntarily made available by you, without any legal restriction and without your further consent, through the CSIR/IGIB website and database or other formats of standard scientific communication (including but not limited to publication in scientific journals), and you hereby acknowledge the risk associated with the public release of such data and information. Your identity will be held confidential if you so choose, even though the identity-stripped information would be publicly available.

6. Risks and discomforts

A. Venipuncture: This procedure is associated with minimal discomfort and is free of significant adverse effects.

B. Data analysis: You are strongly advised to discuss this study and the potential risks, as outlined below, with your parents, siblings and descendants (hereinafter family members), as well as your health care provider(s). You are also advised to directly discuss any additional concerns with the Principal Investigator.

The following is a non-comprehensive list of hypothetical scenarios that could pose risks for you and your family members:

i) The data provided by you (such as traits and vital signs or DNA sequence data) may be used to identify you, resulting in higher than normal levels of contact from the press and other members of the public. This could result in a loss of privacy and personal time.

ii) Anyone with sufficient knowledge and resources could take your DNA sequence data and/or your personal trait information and utilize the data, with or without modification, to (1) infer paternity or other features of your genealogy, (2) reveal the possibility of a disease or risk for a disease. Such information could lead to social and financial consequences, including but not limited to employment and insurance.

iii) Your family members could also be subject to discrimination for employment, insurance or financial services on the basis of the public disclosure of your genetic and trait information.

iv) If you have previously made or plan to make genetic information available in a confidential setting, the data provided by you as part of this study may reveal your identity.

v) Any conclusions derived from the publicly available information may be speculative with respect to you and even less predictive with respect to your family members. The complete set of risks posed to you and your family members due to the public release of the DNA sequence and trait data is not known at this time. We encourage you to discuss this aspect with your family members.

7. Benefits

(i) At present there are no proven benefits to you for your participation in this study.

(ii) This study may benefit the medical and research community in particular, and humanity in general, and may help in establishing genetic causes and predisposition for common diseases.

(iii) You may experience satisfaction from participating in research that may benefit medical science.

8. Intellectual property rights and benefit sharing

You will not be financially compensated for your participation in this study. Neither you nor your heirs shall claim from CSIR/IGIB any financial benefits or rights for any information, data or discoveries, whether or not of a commercial nature, made using the information generated in this study. However, as per international (HUGO, UNESCO) and national guidelines (National Bioethical Committee, Ethical Guidelines for Biomedical Research on Human Participants), it is necessary for national/international entities deriving economic benefit from the knowledge resulting from the use of human genetic material to dedicate a percentage (e.g. 1%-3%) of their annual profit for the benefit of the community/public health.

9. Confidentiality

The results of this study may be published in a medical book, journal, website or webpage, or used for teaching purposes. Your name and other identifiers will be disclosed only if you have consented to disclosure of your identity. You may not be notified by CSIR/IGIB prior to such use.

10. Withdrawal of participation

Participation in this study is voluntary. You may withdraw your participation and/or your data from this study at any time, as described in the consent form. However, once the DNA sequence and associated information is in the public domain, it is likely to be disseminated widely and rapidly. Therefore it may not be possible to retract the data in response to a withdrawal request.


Chapter 11

Points to note when you outsource exome sequencing

A large number of commercial enterprises now provide whole exome sequencing as a service. As stated before, there is a large number of competing capture methodologies and sequencing technologies, which makes the choice of an appropriate technology cumbersome and sometimes extremely challenging. Nevertheless, there are a few questions that need to be kept in mind before outsourcing exome sequencing in clinical settings. This section is designed to provide a basic guideline on specific points to be considered, and not as a guide to selecting a particular methodology or technology.

The capture methodology and capture efficiencies

As mentioned before, it is good practice to keep note of the target genes and exons captured, as there are a number of capture methodologies that differ in the number of bases captured and in their capture efficiencies. This is important for patients with known genetic diseases, where you are keen to look for a known variant or variants to confirm the diagnosis. It is important to make sure that the genes and specific exons of interest are covered efficiently by the capture methodology in question. The capture efficiency of the target region should also be checked after the sequencing is done. Details of how to go about this are given in the later chapter on data analysis.


Sequencing technology, quality of reads and data throughput

A number of sequencing technologies are available in the commercial space. It is therefore important to take note of the sequencing technology employed before you settle on a methodology. A rule of thumb is to go with a methodology that provides an ample number of high-quality reads at an affordable cost. How to evaluate this after the sequencing is performed is detailed in the later chapter.

Depth of coverage of the target regions

In a regular clinical setting, for diagnosis of rare genetic diseases, it would be worthwhile to have at least 100x coverage of the exome. Because capture efficiencies vary across the genome, an average coverage of 100x would in practice leave almost all target regions adequately covered to enable variant calling. It is also imperative to check what percentage of the target region has sufficient coverage for accurate variant calling.
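The difference between average coverage and the percentage of the target that is adequately covered can be illustrated with a short Python sketch. The function name, the toy per-base depths and the threshold of 20 reads are assumptions made purely for illustration; the text above does not prescribe a specific minimum depth, and real values would come from the alignment over the capture targets.

```python
# Illustrative sketch: mean depth versus breadth of coverage on toy data.
# The 20x threshold is an assumed example value, not a recommendation.

def coverage_summary(depths, min_depth=20):
    """Return (mean depth, percentage of positions with depth >= min_depth)."""
    if not depths:
        raise ValueError("no target positions supplied")
    mean_depth = sum(depths) / len(depths)
    covered = sum(1 for depth in depths if depth >= min_depth)
    return mean_depth, 100.0 * covered / len(depths)


# Hypothetical read depths at eight target positions.
toy_depths = [120, 95, 0, 15, 200, 110, 30, 130]

mean_depth, percent_covered = coverage_summary(toy_depths)
print(f"mean depth: {mean_depth:.1f}x")
print(f"positions at >= 20x: {percent_covered:.1f}%")
```

In this toy example the mean depth looks comfortable, yet a quarter of the positions fall below the chosen threshold, which is why both numbers are worth asking for.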

Availability of raw data and alignments

While outsourcing exome sequencing, one should also insist that the raw data with qualities (preferably in FASTQ format) and the alignments be made available. This is an important consideration for a number of reasons. First and foremost, the field is still young, and so are the methodologies for analysis. Apart from the particular variant in question, the exome also contains a number of other variants, many of which could give additional insights and have clinical implications. Secondly, in many cases it is necessary to go back to the data and reanalyze it at a later point in time to arrive at an appropriate diagnosis in light of disease progression and new clinical findings.

Variant calls, formats and interoperability

A number of service organizations offer the variants in custom formats, usually tab-delimited files or even Excel sheets. One should insist that all variant calls be made available in a standard interoperable format. The commonly employed standard format for variant calls is the VCF format. The VCF²² format includes the information necessary to reanalyze the variants for prioritization, in particular the read coverage around the variant, the variant quality and, in the case of trios, the samples that carry the particular variant. Additionally, VCF files are interoperable and are accepted by most online resources and software that aid the analysis of exome datasets.
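To make the contents of a VCF file a little more concrete, the sketch below reads one header line and one variant line of a trio and pulls out the quality, the depth and the samples carrying the variant. It is a minimal sketch using only the standard column layout of the format (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT, followed by one column per sample); the function names, the sample names and the variant itself are hypothetical, and a dedicated VCF library would be preferable for real data.

```python
# Illustrative sketch: reading a single VCF record for a trio (toy data).

def parse_vcf_record(header_line, data_line):
    """Split one tab-delimited VCF data line into a dictionary."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    record = dict(zip(columns[:8], values[:8]))
    # INFO holds semicolon-separated key=value pairs or bare flags.
    info = {}
    for item in values[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True
    record["INFO"] = info
    # FORMAT names the colon-separated fields given for every sample.
    format_keys = values[8].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, sample.split(":")))
        for name, sample in zip(columns[9:], values[9:])
    }
    return record


def carriers(record):
    """Return the samples whose genotype contains a non-reference allele."""
    found = []
    for name, fields in record["samples"].items():
        alleles = fields["GT"].replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            found.append(name)
    return found


header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tproband\tfather\tmother"
line = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=100\tGT:DP\t0/1:35\t0/0:40\t0/1:25"

record = parse_vcf_record(header, line)
print(record["QUAL"], record["INFO"]["DP"], carriers(record))
```

A custom spreadsheet from a service provider may drop exactly these per-sample fields, which is one practical reason to ask for the VCF file itself.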

Details of the analysis pipeline with parameters

The results of an exome sequencing analysis can vary drastically depending on the analysis pipeline employed, and especially on the parameters used for sequence alignment and variant calling. To ensure that the data is reliable and reproducible, it is imperative that the report contains an accurate description of the analysis pipeline as well as the parameters used in alignment and variant calling.

22 VCF stands for Variant Call Format. This format came into existence after the 1000 Genomes project and is widely used in the community. A number of bioinformatics tools and resources for analyzing variant data take variant data input as VCF files.


Datasets used for annotation, versions and updating

Just as the analysis tools and parameters affect the variant calls, the datasets used and their versions also have a large impact on the conclusions derived. Many of the datasets of genomes, genes and variants are regularly updated and have non-trivial changes between released versions. It is thus important to keep note of the versions of the databases used, so that the results can be appropriately interpreted and the analysis appropriately reproduced.
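One simple way to act on the points about pipeline details and dataset versions is to check every report received against a list of items that should be documented. The sketch below is only an illustration of that idea: the field names are hypothetical and not part of any standard, and the values in the example report are placeholders rather than real tool or database names.

```python
# Illustrative sketch: check that a report documents how it was produced.
# Field names are hypothetical; adapt them to what your provider reports.

REQUIRED_FIELDS = [
    "reference_genome_build",
    "aligner_name",
    "aligner_version",
    "aligner_parameters",
    "variant_caller_name",
    "variant_caller_version",
    "variant_caller_parameters",
    "annotation_databases",  # expected: database name -> version
]


def missing_provenance(report):
    """Return the required fields that are absent or empty in the report."""
    return [field for field in REQUIRED_FIELDS if not report.get(field)]


# Placeholder values only; a real report would carry actual names and versions.
example_report = {
    "reference_genome_build": "<genome build>",
    "aligner_name": "<aligner>",
    "aligner_version": "<version>",
    "aligner_parameters": "<command line or parameter list>",
    "variant_caller_name": "<variant caller>",
    "variant_caller_version": "<version>",
    "annotation_databases": {},
}

missing = missing_provenance(example_report)
print("Ask the provider for:", ", ".join(missing))
```

Anything listed as missing is worth requesting before the results are interpreted or archived, since it may be difficult to recover later.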


Chapter 12

Understanding the steps in analysis of exome sequence data

The major steps in the analysis of exome sequence data can be summarized as follows. The first step is a quality check of the data. The second step is the alignment of the sequence reads to the reference genome. The third step is the analysis of the alignment to call variants, and the fourth step is the annotation and analysis of the variants. The steps involved in the entire process are summarized in Figure 1.

Figure 1. Summary of steps involved in the analysis of the exome.

The nucleotide data generated by the sequencer is usually available in a file format known as FASTQ (which stands for FASTA with Qualities). As the name suggests, the file contains sequences along with their base qualities. The base quality is important because it tells you how reliable each base of a sequence read is, and only good quality reads will give you good quality variants for further analysis. FASTQ files are quite large, and in most cases cannot be opened in a word processor or text editor. Nevertheless, it is worthwhile understanding what the file contains and what it means.

A FASTQ file has four lines for each read, and there can be millions of such reads in the file, arranged one after another. The first line starts with an ‘@’ followed by information on the read. This usually includes the sequencer, the run name and the date, and might not be of use to you in a regular case. The second line contains a string of A, T, G and C characters, which is the nucleotide sequence of the read. The third line starts with a ‘+’ and in some cases repeats the information in the first line, while in others it carries nothing more, to avoid redundancy. The fourth line contains characters that often read like gibberish; this is the representation of the quality of each base in the read. The number of characters is therefore exactly the same in the second and the fourth lines, as there is a quality value for each base. The gibberish is nothing but the ASCII character (see footnote 23) equivalent to the quality score.

23 ASCII stands for American Standard Code for Information Interchange. It assigns a numerical code to each character.

@EAS54_6_R1_2_1_413_324

CCCTTCTTGTCTTCAGCGTTTCTCCTTGGCAGGCCAAGGCCGATGGATCA

+

;;3;;;;;;;;;;;;7;;;;;;;88;;;;;;;;;;;9;7;;.7;393333

@EAS54_6_R1_2_1_540_792

TTGGCAGGCCAAGGCCGATGGATCA GTTGCTTCTGGCGTGGGTGGGGGG

+

;;;;;;;;;;;7;;;;;-;;;3;83;;3;;;;;;;;;;;;7;;;;;;;88

@EAS54_6_R1_2_1_443_348

GTTGCTTCTGGCGTGGGTGGGGGGGCCCTTCTTGTCTTCAGCGTTTCTCC

+EAS54_6_R1_2_1_443_348

;;;;;;;;;;;9;7;;.7;393333;;;;;;;;;;;7;;;;;-;;;3;83

Figure 2. The FASTQ file format with sequences of the reads and qualities of bases in the sequence read.
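The four-line record structure described above can be sketched in a few lines of Python. This is only an illustration, not a replacement for an established parser. It assumes the common "Phred+33" encoding, in which the quality score is the ASCII code of the character minus 33 (older Illumina files used an offset of 64); the function name `read_fastq` and the four-base read are hypothetical.

```python
import io
from itertools import islice
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str             # text after the leading '@' (line 1)
    sequence: str         # nucleotide string (line 2)
    qualities: List[int]  # one Phred score per base, decoded from line 4


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, four non-empty lines at a time.

    The offset of 33 is an assumption; check how your data were encoded.
    """
    stripped = (line.strip() for line in handle)
    non_empty = (line for line in stripped if line)
    while True:
        chunk = list(islice(non_empty, 4))
        if not chunk:
            return
        if len(chunk) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = chunk
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence and quality lengths differ: {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical four-base read; ';' has ASCII code 59, so it decodes to 26.
example = io.StringIO("@read_1\nACGT\n+\n;;3.\n")
for record in read_fastq(example):
    print(record.name, record.sequence, record.qualities)
    # prints: read_1 ACGT [26, 26, 18, 13]
```

Reading the file as a stream, rather than loading it whole, matters in practice because real FASTQ files hold millions of reads.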

The quality of bases across the read length is usually expressed as a Phred score. The Phred score (Q) is ten times the negative base-10 logarithm of the probability (P) that the base was called incorrectly, that is, Q = -10 x log10(P). So if a base has a one in hundred chance of being an error, which means an error probability of 0.01, the Phred score is 20 (since log10(0.01) = -2, and -10 x -2 = 20). Similarly, a Phred score of 30 means a base error probability of one in thousand, and so on. There are a number of ways you could evaluate the quality of data. One approach is to plot the distribution of qualities at every base position, and this plot serves as a ready reference to see whether the sequencing was good or not. The quality of sequences can be quite variable because of issues in the library preparation or sequencing. If the reads have quality problems, or contain sequences of the adapters used for sequencing, they are usually processed to remove the low-quality bases and adapter sequences, a step known as trimming. How to verify the quality is detailed in a later chapter.

The next step is to align the good quality reads to the reference human genome. The selection of the genome version or build is very important, as there are non-trivial differences in the positions of nucleotides and annotations of genes between the builds. A number of computational algorithms have been used extensively in the literature to align the reads. The purpose of alignment is to find the cognate position of the read in the genome, which offers a way to compare whether each nucleotide in the read is the same as or different from the reference. As you would have rightly imagined, each genomic position corresponding to protein coding exons would be covered by a number of reads. This is denoted as coverage, or how many times the nucleotide is covered by reads.

Once you have aligned the reads to the chromosome, you would find some positions that are different in the reads compared to the reference genome template. This information can be analyzed computationally to derive which positions have a variant. As you may have guessed, a higher coverage provides better accuracy of the variants called. In the case of a homozygous variant, all or at least the large majority of the reads would have the particular variant, while in the case of a heterozygous variant approximately half the reads would have the particular change with respect to the reference template. This entire process is called variant calling. As with alignment, a number of computational algorithms have been extensively used to accurately call variations in the genome.
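The relationship between a Phred score and the error probability, and the idea of trimming, can be illustrated with a short Python sketch. The function names and the threshold of 20 are assumptions made for illustration; real trimming tools typically use more elaborate rules (for example sliding windows and adapter matching) than the simple tail trimming shown here.

```python
import math
from typing import Sequence


def error_probability_to_phred(probability: float) -> float:
    """Q = -10 * log10(P)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(probability)


def phred_to_error_probability(phred: float) -> float:
    """P = 10 ** (-Q / 10), the inverse of the formula above."""
    return 10 ** (-phred / 10)


def trimmed_length(qualities: Sequence[int], min_quality: int = 20) -> int:
    """Number of bases kept after removing the low-quality tail of a read.

    Simplified rule (an assumption): drop bases from the 3' end until one
    reaches min_quality.
    """
    end = len(qualities)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return end


print(error_probability_to_phred(0.01))    # one error in a hundred
print(phred_to_error_probability(30))      # Phred score of 30
print(trimmed_length([30, 30, 25, 10, 8])) # keeps the first three bases
```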

Figure 3. The alignment of reads to the reference genome. The positions where the bases in the reads differ from those of the reference genome are highlighted.
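The intuition behind variant calling described earlier (nearly all reads carry a homozygous variant, about half carry a heterozygous one) can be sketched as a toy classifier on read counts at a single position. This is not how production variant callers work; they use statistical models that account for base and mapping qualities. All thresholds and names below are hypothetical values chosen for illustration.

```python
def naive_genotype(
    ref_reads: int,
    alt_reads: int,
    min_depth: int = 10,    # assumed minimum coverage for making a call
    ref_max: float = 0.1,   # at most this variant fraction: reference
    het_low: float = 0.3,   # heterozygous if fraction is in [het_low, het_high]
    het_high: float = 0.7,
    hom_min: float = 0.9,   # at least this variant fraction: homozygous variant
) -> str:
    """Classify one position from counts of reference and variant reads."""
    depth = ref_reads + alt_reads
    if depth < min_depth:
        return "no-call"  # too few reads to say anything
    fraction = alt_reads / depth
    if fraction >= hom_min:
        return "homozygous variant"
    if het_low <= fraction <= het_high:
        return "heterozygous"
    if fraction <= ref_max:
        return "homozygous reference"
    return "ambiguous"


print(naive_genotype(ref_reads=0, alt_reads=40))   # homozygous variant
print(naive_genotype(ref_reads=21, alt_reads=19))  # heterozygous
print(naive_genotype(ref_reads=3, alt_reads=2))    # no-call
```

The "no-call" branch shows why higher coverage improves accuracy: with only a handful of reads, the observed fraction is too noisy to separate the genotypes.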

The variations in the genome are usually made available in a standard format known as VCF, which stands for Variant Call Format. A number of analysis software packages recognize the format and annotate the variants with information that helps clinch the diagnosis. The structure of the VCF file is summarized in Figure 4.

Figure 4. The VCF file format representation of variations in the exome.
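To make the structure concrete, the following Python sketch splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and reads the genotype (GT) of each sample. It is a simplified illustration that ignores the header lines and many details of the specification; the example line, its coordinates and the sample names are invented.

```python
from typing import Dict, List, NamedTuple, Optional, Sequence


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alts: List[str]
    qual: Optional[float]      # None when QUAL is '.'
    filter_status: str
    info: Dict[str, str]       # flag-style INFO keys map to ''
    genotypes: Dict[str, str]  # sample name -> GT string such as '0/1'


def parse_vcf_line(line: str, sample_names: Sequence[str]) -> VcfRecord:
    """Parse a single VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filter_status, info = fields[:8]
    info_map: Dict[str, str] = {}
    for item in info.split(";"):
        if item and item != ".":
            key, _, value = item.partition("=")
            info_map[key] = value
    genotypes: Dict[str, str] = {}
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, for example GT:DP
        for name, column in zip(sample_names, fields[9:]):
            values = dict(zip(keys, column.split(":")))
            genotypes[name] = values.get("GT", ".")
    return VcfRecord(
        chrom, int(pos), variant_id, ref, alt.split(","),
        None if qual == "." else float(qual),
        filter_status, info_map, genotypes,
    )


# Invented example line for a child and mother; not real data.
line = "1\t12345\t.\tA\tG\t50.0\tPASS\tDP=42\tGT:DP\t0/1:20\t0/0:22"
record = parse_vcf_line(line, ["child", "mother"])
print(record.chrom, record.pos, record.ref, record.alts, record.genotypes)
```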

Chapter 13

How good is the exome sequencing data?

There are three major parameters that decide whether the exome sequencing data is good or not. The first is, of course, the quality of the sequencing reads, the second is the coverage depth across the target regions, and the third is the percentage of reads that align to the reference genome.

The first parameter, the quality of reads, is possibly the easiest to check. A number of tools, both online and offline, are available to check the quality of bases. It should be noted that the distribution of the quality of bases is as important as the mean quality of the bases. The scheme below shows the base quality plot for a good set of sequencing reads. The scheme also shows the advantage of looking at the distribution of the qualities, rather than only the mean quality, at each base position.

Figure 1. Quality plots for good quality reads.

Figure 2. Quality plots for bad quality reads. Note the low quality of reads towards the end.

The second important parameter to check is the coverage depth across the target region. On average, for identification of rare disease variants in clinical settings, it is recommended to have at least 100x coverage worth of high quality data. The number of reads required depends on the read length and the total length of the exome capture (approximately 50 Mb in the case of a whole exome). A 100x coverage of 50 Mb amounts to 5,000 Mb of sequence, so for 100-base reads this would mean about 50 million reads, and so on.
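The arithmetic behind the read count is simply target size times desired coverage divided by read length. A minimal Python sketch follows; the function name and the `on_target_fraction` parameter (the share of reads that actually fall on the captured regions) are assumptions added for illustration, and the default of 1.0 corresponds to the idealized calculation.

```python
import math


def reads_required(
    target_bases: int,
    mean_coverage: float,
    read_length: int,
    on_target_fraction: float = 1.0,  # hypothetical capture efficiency
) -> int:
    """Reads needed so that the target is covered mean_coverage times on average."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in the interval (0, 1]")
    total_bases = target_bases * mean_coverage
    return math.ceil(total_bases / (read_length * on_target_fraction))


# 50 Mb exome, 100x coverage, 100-base reads: 50 million reads.
print(reads_required(50_000_000, 100, 100))
# If only half the reads were on target, twice as many would be needed.
print(reads_required(50_000_000, 100, 100, on_target_fraction=0.5))
```

For paired-end sequencing, each read of a pair counts separately in this calculation.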

The third important parameter is the alignment percentage. It denotes the percentage of the total reads that aligned to the reference genome. On average, in a well-set experiment, more than 95 percent of the reads generated after capture should align to the human genome. At times, with good quality data, the alignment percentage could even cross 99 percent. A low alignment percentage could mean that any of a number of things have gone wrong. One of the major possibilities is contamination of the reagents. Other possibilities include inefficient capture or sequencing. Adapter contamination is one of the first things to look for in case the reads show an abnormal alignment percentage. Adapter contamination can also be identified in the FastQC report, which shows overrepresented sequences. Overrepresentation of particular sequences, especially repeat sequences, would mean an improper capture or library preparation.

Apart from the total alignment percentage, the coverage of the target sites is also an important consideration. For accurate variant calling, it is advised to have good coverage across the majority of the target sites. Skewed target coverage would mean an inefficient capture procedure.
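Both measures discussed above reduce to simple ratios, sketched below in Python. The per-base depths would normally come from an alignment tool's output; here they are a made-up list, and the minimum depth of 20 used to decide whether a target base is adequately covered is an assumption for illustration only.

```python
from typing import Sequence


def alignment_percentage(aligned_reads: int, total_reads: int) -> float:
    """Percentage of all reads that aligned to the reference genome."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * aligned_reads / total_reads


def fraction_covered(depths: Sequence[int], min_depth: int = 20) -> float:
    """Fraction of target bases covered by at least min_depth reads."""
    if not depths:
        raise ValueError("no target positions given")
    covered = sum(1 for depth in depths if depth >= min_depth)
    return covered / len(depths)


print(alignment_percentage(aligned_reads=950, total_reads=1000))  # 95.0
# Made-up depths at four target positions; two reach the threshold.
print(fraction_covered([0, 10, 25, 30], min_depth=20))            # 0.5
```

A high mean coverage combined with a low covered fraction is the signature of the skewed target coverage mentioned above.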

Chapter 14

Prioritizing, annotating and interpreting variants

As described in the previous chapter, the real analysis and interpretation starts once you have the compendium of variations called from the exome. Ideally we expect all variants to be in the standard VCF format, which makes them compatible and interoperable with most tools and resources available online for exome analysis. But before we go into the thick of exome analysis, it is imperative to understand conceptually how to prioritize variations. There are largely six approaches to prioritize variations from exome or whole genome sequencing data, and these are summarized in Figure 1. The highlighted region denotes the exome sequenced and the panel below suggests the approach to filter or prioritize variations.

Such prioritization strategies can be employed at any step, and the selection of the approach depends on the specific case. If there are multiple affected family members, as in the case of Bhai, a linkage-based strategy is useful. One could sequence multiple affected members of the same family and, if possible, unaffected members too. A segregation-based strategy could then be used to retain the variations shared by the affected individuals and exclude those present in the unaffected individuals. Such an approach is extremely useful in autosomal dominant diseases with multiple affected family members.

Figure 1. Summary of popular approaches or strategies to prioritize variants from exome sequencing.

The second scenario, as you can see, involves a consanguineous marriage, where you would expect an autosomal recessive pattern of inheritance of the disease-causing mutation. Here a homozygosity-based strategy, which takes into consideration all homozygous variants and prioritizes them through standard pipelines, would be the best approach to follow.

The third scenario involves a non-consanguineous marriage with a probable autosomal recessive pattern of inheritance, where filtering the exome for heterozygous variations could be the approach to follow. In some cases where the affected child is not available for testing, as in the case of abortions, sequencing both the parents for heterozygous variations associated with Mendelian diseases would be the alternative to follow.
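The filtering logic behind these strategies can be sketched with plain Python sets. This is a conceptual illustration, not the method of any particular tool. Each individual is represented as a hypothetical dictionary from a variant key to a genotype string ("0/0", "0/1" or "1/1"); a variant missing from an individual's dictionary is treated as absent, which is an assumption (in real data it could also mean the position was not covered).

```python
from typing import Dict, List

Genotypes = Dict[str, str]  # variant key -> "0/0", "0/1" or "1/1"


def carried(person: Genotypes) -> set:
    """Variants for which the person has at least one variant allele."""
    return {key for key, gt in person.items() if gt in ("0/1", "1/1")}


def segregating_variants(affected: List[Genotypes],
                         unaffected: List[Genotypes]) -> List[str]:
    """Segregation-based filter: shared by all affected, absent in unaffected."""
    if not affected:
        return []
    shared = carried(affected[0])
    for person in affected[1:]:
        shared &= carried(person)
    for person in unaffected:
        shared -= carried(person)
    return sorted(shared)


def homozygous_variants(patient: Genotypes) -> List[str]:
    """Homozygosity-based filter for the consanguineous scenario."""
    return sorted(key for key, gt in patient.items() if gt == "1/1")


def shared_carrier_variants(mother: Genotypes, father: Genotypes) -> List[str]:
    """Variants that are heterozygous in both parents (carrier scenario)."""
    het_mother = {key for key, gt in mother.items() if gt == "0/1"}
    het_father = {key for key, gt in father.items() if gt == "0/1"}
    return sorted(het_mother & het_father)


# Invented variant keys and genotypes, for illustration only.
sibling_1 = {"v1": "0/1", "v2": "0/1", "v3": "1/1"}
sibling_2 = {"v1": "0/1", "v2": "0/0", "v3": "0/1"}
healthy = {"v3": "0/1"}
print(segregating_variants([sibling_1, sibling_2], [healthy]))  # ['v1']
print(homozygous_variants(sibling_1))                           # ['v3']
```

In practice these lists are only a starting point; the surviving variants still have to be prioritized by predicted effect and allele frequency.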

The further analysis of the exome data involves three major steps. The first step is to identify variations that change the amino acid sequence of proteins and are predicted to be deleterious. The second step is to annotate the genes with respect to the disease candidates, and the third step is to prioritize variations using the strategy appropriate to the specific case.

The first step, obviously, is to find variations that change the amino acid sequence of the protein and are predicted to be deleterious. As you would have imagined, not all variations in the exome are important or have a functional consequence. The variations that change the amino acid sequence are called non-synonymous variations, while those that do not change the amino acid sequence are called synonymous variations. Not all non-synonymous variations are important. Only a small proportion of the non-synonymous variations in the exome change the amino acid sequence of a protein in a way that produces a functional effect. These are variations that cause an amino acid change in regions of the protein that are extremely important for its function or structure, and they are generally called deleterious variations. Whether a variation is potentially deleterious or not is largely derived from computational predictions based on the amino acid change caused by the variation in question. Two computational tools are popularly used to prioritize deleterious variations: SIFT and PolyPhen2. The algorithms use similar, but distinct, approaches to annotate variations as deleterious or not.

SIFT stands for Sorting Intolerant from Tolerant and uses the evolutionary conservation of the amino acid at the particular position in the protein to predict whether the variation is deleterious or not. The basic assumption is that if the amino acid at a particular position in the protein is highly conserved through evolution, a change to a less frequent amino acid at that position could be functionally deleterious and would therefore have been evolutionarily discarded. PolyPhen2 is another algorithm used to prioritize variations. The algorithm is a bit more complicated: apart from the conservation of the position, it also uses the structural context of the amino acid, and it applies machine-learning methods to predict whether the change is deleterious in nature or not.

Neither approach on its own might be quite effective in prioritizing the variations. So one approach that has been popularly employed by researchers is to use a consensus of both to prioritize variations that are deleterious in nature. You should note, however, that while a consensus approach might be highly specific, such a stringent approach might exclude some variations that are functionally relevant, and whether to use the tools in consensus or alone has to be decided on a case-to-case basis. The online applications that integrate these predictions are discussed later in this chapter.
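The trade-off between a consensus and an either-tool rule can be sketched in Python. The score conventions are those commonly documented for the two tools (a low SIFT score and a high PolyPhen2 score both suggest a damaging change), but the exact cutoffs used below, 0.05 for SIFT and 0.85 for PolyPhen2, are assumptions for illustration and should be checked against the documentation of the versions you use.

```python
def sift_deleterious(score: float, cutoff: float = 0.05) -> bool:
    """SIFT scores run from 0 to 1; low scores are predicted deleterious."""
    return score <= cutoff


def polyphen_damaging(score: float, cutoff: float = 0.85) -> bool:
    """PolyPhen2 scores run from 0 to 1; high scores are predicted damaging."""
    return score >= cutoff


def predicted_deleterious(sift_score: float, polyphen_score: float,
                          mode: str = "consensus") -> bool:
    """Combine the two predictions.

    mode="consensus": both tools must agree (more specific, less sensitive).
    mode="either":    one tool is enough (more sensitive, less specific).
    """
    sift_call = sift_deleterious(sift_score)
    polyphen_call = polyphen_damaging(polyphen_score)
    if mode == "consensus":
        return sift_call and polyphen_call
    if mode == "either":
        return sift_call or polyphen_call
    raise ValueError("mode must be 'consensus' or 'either'")


# A variant on which the tools disagree (made-up scores).
print(predicted_deleterious(0.01, 0.20, mode="consensus"))  # False
print(predicted_deleterious(0.01, 0.20, mode="either"))     # True
```

The disagreeing variant in the example is exactly the kind that a consensus rule would drop and that may still be functionally relevant.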

The second step is to annotate the variations and genes associated with the disease phenotypes under question. As mentioned in the earlier chapter, the complete clinical details come in handy here. A number of tools discussed later in this chapter can take in additional annotation of the patient phenotypes to prioritize variations.

There are two web-based resources that have been extensively used to clinically annotate exomes. These are Exomiser, maintained by the Sanger Institute, and PhenIX, both of which have been extensively used by clinicians worldwide to prioritize variations and possibly arrive at a diagnosis.

Exomiser has a web-based interface where you can upload the VCF variant file corresponding to the exome. The web interface also provides an option to upload exome variants from multiple samples in a family, with the associated pedigree information in a specified format. Briefly, you upload the VCF file and, optionally, the pedigree annotation if you have sequenced multiple individuals from a family. The resource also features additional options where you can input either the diagnosis of the patient or, if the diagnosis is uncertain, a set of phenotypes. There are additional optional parameters that you can specify. These include:

1. Minimum variant call quality: You could specify a Phred score, say 30.

2. Maximum minor allele frequency (%): This option allows you to exclude common variations by allele frequency. You could set a maximum allele frequency of 1%.

3. Remove off-target, intronic, synonymous variants, dbSNP variants and non-pathogenic variants: These options allow you to exclude these variations from the report.

4. Inheritance model: You could select the specific model if you are sure about the inheritance of the disease, and this option is then used to prioritize the variations. Otherwise you could select none to display all variants in the report.
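What the quality, allele frequency and consequence options amount to can be sketched as a simple filter in Python. This is a conceptual illustration only and does not reproduce Exomiser's implementation; the `Variant` fields, the consequence labels and the default thresholds are hypothetical. Variants without any allele frequency information (for example novel variants) are kept rather than discarded.

```python
from typing import List, NamedTuple, Optional, Sequence


class Variant(NamedTuple):
    key: str
    qual: float                        # Phred-scaled variant call quality
    allele_frequency: Optional[float]  # None if absent from population data
    consequence: str                   # e.g. "missense", "synonymous"


def passes_filters(
    variant: Variant,
    min_qual: float = 30.0,
    max_allele_frequency: float = 0.01,  # 1%
    excluded_consequences: Sequence[str] = ("synonymous", "intronic"),
) -> bool:
    """Apply quality, frequency and consequence filters to one variant."""
    if variant.qual < min_qual:
        return False
    frequency = variant.allele_frequency
    if frequency is not None and frequency > max_allele_frequency:
        return False  # too common in the population
    return variant.consequence not in excluded_consequences


def filter_variants(variants: Sequence[Variant]) -> List[Variant]:
    return [variant for variant in variants if passes_filters(variant)]


# Made-up variants for illustration.
candidates = [
    Variant("v1", 55.0, 0.0004, "missense"),   # kept: rare, good quality
    Variant("v2", 60.0, 0.12, "missense"),     # removed: common
    Variant("v3", 12.0, None, "missense"),     # removed: low quality
    Variant("v4", 48.0, None, "missense"),     # kept: novel, good quality
    Variant("v5", 70.0, 0.001, "synonymous"),  # removed: synonymous
]
print([variant.key for variant in filter_variants(candidates)])  # ['v1', 'v4']
```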

Figure 2. Screenshot of Exomiser with the different options.

Another similar resource that allows you to prioritize variations is PhenIX, maintained by the Charité in Berlin.

Figure 3. Screenshot of PhenIX with the different options.

PhenIX has an interface quite similar to that of Exomiser. The user can input the phenotypes or traits using the autofill option, upload the VCF file, and specify the inheritance model and the maximum allele frequency. Both tools prioritize the variations by the pathogenicity or deleterious effect of the variation(s) and by the similarity of the genes harboring these variations to the genes associated with the phenotypes provided by the user.

Apart from the deleteriousness of the variation, another parameter that helps clinch a diagnosis is the allele frequency of the variation in populations. Most deleterious variations are expected to be quite rare in the population, so an allele frequency of less than 1 per 100 is a legitimate threshold to choose when prioritizing variations. In many cases the variation may also be novel and have no allele frequency information available.

Chapter 15

Don't forget the validation

The validation of the findings from whole-exome sequencing is as important as the exome sequencing itself. Most researchers and clinicians are not aware that exome sequencing and analysis have their own limitations. It is therefore necessary to independently validate the variation before confirming the diagnosis.

There are three scenarios where validation is to be considered. In the first scenario, the variant is known and has previously been implicated in the disease. Here the validation is quite simple, in the sense that the finding needs to be verified independently in the sample or samples. Traditional Sanger sequencing is what is commonly used in the field, especially for single nucleotide variations. Polymerase chain reaction primers can be designed around the variant in question, and the region can be amplified and sequenced to confirm the diagnosis.

The second scenario is where you have identified a new variant in a known gene. The first line of evidence that would clinch the variant is its segregation in the affected members, together with a predicted deleterious effect. Full participation of the members of the family is required in such cases, and consent must be obtained (detailed in the ethical considerations section of this book). In some cases, participation of other family members would be impossible to obtain, due to privacy and anonymity concerns. Another approach to validate a genetic variant would be to look at its segregation in a trio. In some cases, especially sporadic cases, and in specific social circumstances, it might not be possible to approach other family members, or sometimes even the parents, but the pathogenicity nevertheless needs to be proven unequivocally. In such circumstances, a number of advanced methods have been adopted in the literature. These include validation of the finding using specific assays at the protein level or at the cellular level, using advanced gene cloning, expression and sometimes genetic engineering approaches. These technologies are specialized applications, mostly in the research domain, and clearly out of the purview of this book.

The third scenario is where you stumble upon a new gene and variant that causes a disease. While segregation and/or homozygosity mapping in cases of consanguinity, and filtering based on allele frequencies, could clinch a conclusive diagnosis, many cases leave a margin of error or doubt in the diagnosis and in the implication of the genes involved. Functional validation of such new genes is presently the realm of research laboratories, as no clear-cut and comprehensive methodologies exist to systematically validate the functional effects. Apart from the popular cell culture systems, a number of research laboratories employ model organisms to functionally validate the gene and model the disease process. Model systems are useful to validate physiological processes, especially in cases of developmental defects or structural abnormalities, which would be difficult to validate in cell culture systems. Nevertheless, cell culture systems are useful to validate specific processes, including metabolic pathways and genes involved in specific processes at a cellular level. The popular model organisms used to validate disease genes include vertebrate and non-vertebrate systems such as mouse, rat, zebrafish, fly and worm. Our group employs zebrafish, a popular vertebrate model organism, for functionally validating novel genes.

Chapter 16

Ethical considerations in whole exome sequencing

There are a number of ethical considerations that have to be accounted for while performing and analyzing exome sequencing in clinical settings. This is primarily because exome sequencing is unique in many ways compared to traditional diagnostic approaches. For example, in comparison to most traditional diagnostic approaches, the line between diagnostics and research is quite blurred in the case of exome sequencing. This is primarily because, unlike other diagnostic approaches, methodologies for exome testing and validation are still not well established. In addition, since most clinicians would use exome sequencing for understanding rare diseases, the diagnostic accuracy in many cases cannot be established, owing to the small numbers and the unique nature of each patient. It should also be kept in mind that the basic tenets of investigations in genetics have to be based on the strong principles of beneficence, reciprocity, justice and professional responsibility.

Three major areas are covered in the following sections of this chapter. These are educating and informing the patient, informed consent and handling of incidental findings, and the anonymity and privacy of the patient and family members.

Information and education

Educating the patient on the technology, the analysis process and the interpretation is an important component. Patients need to be educated about the possible pitfalls, fallacies and limitations of exome sequencing. In addition, the patient needs to be informed about incidental findings, which could have clinical, social and emotional implications, and should be equipped to make an informed decision on them. The patient also needs to be informed that genetic testing of this sort could reveal information not just about the patient or immediate family, but also information that might be critical and relevant to other relatives and possibly to the next generation. The pros and cons of such information being available, and its implications, also need to be addressed.

Incidental findings and reporting

Exome sequencing is unique compared to traditional research or diagnostic tests, where the data generated is commensurate with the question asked, or rather, where the chance of finding something incidental while performing a test is meager. The first set of diagnostics that started changing this paradigm was radiology, where whole body scans started churning out more information than was strictly required to answer the clinical question. The more data generated, in a generic form, the more incidental findings start to appear.

Exome sequencing is unique in this respect, in that the sequencing allows a comprehensive scan of all variants in protein coding regions. Apart from the variant or variants that help in the diagnosis, this would include other variants, many of which would have clinical implications or relevance. Many of the resulting findings may not have direct implications for the condition at hand, but might have long-term implications. One example is variants that are associated with drug metabolism or adverse drug reactions. In some situations, the information might have implications for early diagnosis or prognosis, as in the case of inherited cancers. In many cases a clear distinction between an incidental finding and the target mutation under study does not exist.

The traditional approach to such incidental findings in clinical settings has been one of 'didn't look, didn't find, don't report', where the onus was on the doctor to decide what needed to be looked at in the results and to report what he or she felt was good or relevant for the patient. This paradigm might not always be the right approach to follow, because the incidental findings by themselves could be of immense value to the patient, and possibly to another doctor treating the patient, as in the case of pharmacogenetic variants, which might help in adjusting the dosage of specific drugs.

In addition, the case of exome sequencing is unique compared to computed tomography (CT) scans in another way. While CT scans could reveal incidental findings in addition to the intended evidence, the relevance of those findings rarely changes with time. In the case of whole genome or exome sequencing, the field is still young, and researchers are discovering new variants and new attributions of clinical relevance almost every day. Reanalyzing the exome sequencing data at a later point of time could therefore reveal new findings of clinical relevance. This unique situation poses another interesting paradigm, where reporting of the exome is going to be a dynamic process, not an end point or static process, in contrast to many traditional clinical diagnostic approaches.

The American College of Medical Genetics and Genomics (ACMG) formed a working group to deliberate on guidelines for reporting incidental findings in clinical exome and genome sequencing, and its recommendations were published recently. The working group recommended the reporting of incidental findings for a set of specified disorders, variants and classes of variants, graded by evidence. This reporting is done irrespective of the primary indication for exome sequencing.

American College of Medical Genetics and Genomics Recommendations for Reporting Incidental Findings in Clinical Exome and Genome Sequencing

A comprehensive description of the methodology, recommendations, list of genes, variants and phenotypes is available in the document entitled ‘ACMG Recommendations for Reporting of Incidental Findings in Clinical Exome and Genome Sequencing’, accessible at URL: https://www.acmg.net/docs/ACMG_Releases_Highly-Anticipated_Recommendations_on_Incidental_Findings_in_Clinical_Exome_and_Genome_Sequencing.pdf

Apart from the incidental findings, the patient or family members may decide to mask the reporting of variations in specific regions or loci that might have non-trivial implications. The consent should include a section where the patient or family members can explicitly state this.

Anonymity and privacy

Utmost care for anonymity and privacy is another important component of ethical conduct towards the patient and family. It should be emphasized that anonymity and privacy are not two sides of the same coin, but are separate entities. A detailed discussion with the patient and family members is essential on this aspect. In many cases, the impact of the genetic testing is not limited to the index case or family, but might have implications for the genetic predisposition and disease manifestation in other family members too. Similarly, the identification of a mutation might not be relevant to the specific individual or family, but could be of relevance in terms of screening and carrier detection in other members of the family. As in the case of Bhai, the identification of a novel mutation in the KRT5 gene would have implications for genetic screening, and in some cases prenatal screening, of the other members of the family. In some cases the validation of the genetic variant would require the participation of other members of the family, including people who might not be affected by the disease.

With the advent of Internet support groups and patient groups, in many cases the patient or the family members do not wish to be anonymous, since sharing their experience might benefit the larger community and society. In other cases, the patient and family would like to remain anonymous, given the social stigma associated with the disease and the social implications for other members of the family. It is therefore the educated decision of the patient or family that needs to be given utmost importance. Questions in this direction need to be non-suggestive, and should take into consideration the social and emotional attachments and the long-term implications.

The last word

Exome sequencing is only a means, not an end. It is likely to have a limited lifetime: its current popularity and wide adoption stem largely from its cost advantage and the relative ease of analysis and interpretation. With falling costs and improving sequencing throughput, it is inevitable, not just plausible, that whole genome sequencing will become the mainstay in the diagnosis of genetic diseases.

Index

4

454 · 38

A

Albinism · 31

alignment · 52, 83, 85, 88, 89, 91, 92, 93

Anonymity · 73, 111

anonymous · 73, 111

autosomal · 15, 67

B

Beijing · 44, 45

Bhai · 9, 11, 13, 14, 34, 55, 65, 67, 95, 111

Bill Clinton · 25

C

capillary · 20, 22

capture · 50, 52, 53, 61, 62, 63, 81, 82, 92, 93

Celera · 24, 25, 31

chromosome · 15, 25, 30, 62, 88

computational · 16, 26, 46, 53, 88, 97, 121

computed tomography scans · 109

computer · 40, 53

coverage · 15, 82, 83, 88, 91, 92, 93

CSIR-IGIB · 45, 119, 121

D

deleterious · 97, 98, 101, 102, 103

diagnosis · 12, 49, 55, 56, 57, 61, 65, 66, 67, 71, 73, 81, 82, 83, 89, 99, 102, 103, 104, 108, 113

diagnostic · 11, 55, 107, 108, 110, 123

disease · 11, 13, 15, 16, 34, 55, 62, 63, 65, 66, 67, 69, 83, 92, 96, 97, 98, 100, 103, 104, 111

DNA · 19, 23, 24, 33, 39, 40, 41, 50

E

Epidermolysis Bullosa · 13, 66, 67

exome · 14, 16, 49, 51, 53, 55, 56, 57, 58, 61, 62, 63, 65, 67, 68, 71, 72, 73, 81, 82, 83, 91, 103, 107, 108, 109

Exome · 1, 3, 16, 56, 58, 61, 62, 65, 108, 110, 113

Exomiser · 99, 100, 101

expression · 29, 51, 104

F

FASTQ · 82, 85, 87

fluorophores · 20

G

genomic variations · 31, 33, 34, 51, 121

GWAS · 34

H

Helicos · 41

heterozygous · 15, 67, 88, 96

homozygosity · 67, 96, 104

Human Genome · 16, 25

I

imaging · 38, 40

inherit · 30, 31

inheritance · 15, 63, 66

Inheritance · 100

inversions · 53

Ion Torrent · 41, 42

K

Koreans · 45

KRT5 · 16, 111

L

leukemia · 53

M

Malaysian · 45, 121

Mendelian · 56, 63, 69, 96

microelectronics · 33, 37

microprocessor · 33

microsatellite · 35

molecular · 16, 35, 55, 56, 58, 119

molecular biology · 35

mutation · 14, 63, 109, 111

N

Nanopore · 41

next generation sequencing · 9, 37, 43, 45, 119

non-synonymous · 97

nucleotide · 16, 19, 30, 32, 33, 40, 41, 62, 85, 88, 103

nucleotides · 15, 19, 21, 29, 35, 37, 38, 39, 40, 41, 51, 88

O

outsourcing · 11, 81, 82

P

Pacific Biosciences · 41

PCR · 32, 39, 41, 50, 55

pedigree · 15, 65, 66, 67, 68

Phred · 87, 99

polymerase · 32, 49

PolyPhen2 · 97, 98

privacy · 73, 103, 107, 111

pyrophosphate · 38

R

regulatory · 29, 46, 51

restriction · 51

Russian · 45

S

Sanger · 16, 19, 20, 21, 22, 37, 43, 55, 99, 103

sequencing · 1, 9, 11, 12, 14, 16, 20, 21, 22, 23, 24, 29, 37, 38, 39, 40, 41, 43, 44, 45, 46, 49, 51, 53, 55, 56, 57, 58, 61, 62, 63, 65, 67, 68, 71, 73, 81, 82, 83, 87, 91, 93, 95, 96, 103, 107, 108, 109, 110, 113, 119, 121, 123

Shankar Balasubramanian · 39

shotgun · 24

SIFT · 97, 98

silicon · 41

Solexa · 39

SOLiD · 38

Sri Lankan · 45

T

Tony Blair · 25

trait · 30, 31, 34

translocations · 53

trimming · 87

U

United States · 23, 25

V

variation · 14, 15, 16, 29, 30, 46, 69, 97, 98, 101, 102, 103

VCF · 83, 89, 90, 95, 99, 101

Venter · 31, 44

W

Watson · 44

About the authors

Sridhar Sivasubbu

Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB)

Web: http://sridhar.rnabiology.org

Email: [email protected]

Sridhar Sivasubbu’s laboratory is interested in exploiting the advantages of zebrafish to dissect molecular mechanisms of gene function, regulation and genome organization in vertebrates. Research activities in his lab include deciphering non-coding RNA mediated regulation of blood and blood vessel development and development of zebrafish models for application in personalized and precision medicine in humans. His group is actively involved in mapping the genome and transcriptome of the wild zebrafish. His group was also responsible for the whole genome sequencing of human samples from India and other Southeast Asian countries.

Sridhar earned his PhD from M.S University, Tirunelveli, India, and did postdoctoral research at the Center for Cellular and Molecular Biology, India, and the University of Minnesota, USA. He has been on the faculty of the CSIR-Institute of Genomics & Integrative Biology (CSIR-IGIB) since 2006. Sridhar also served as the CEO of The Center for Genomic Application, a public-private partnership company established by CSIR-IGIB to enable research in the field of genomics and proteomics, where he spearheaded the application of next generation sequencing technology for commercial projects.

Vinod Scaria

Scientist, CSIR Institute of Genomics and Integrative Biology (CSIR-IGIB)

Web: http://vinodscaria.rnabiology.org

Email: [email protected]

Vinod Scaria is a clinician turned computational biologist. His laboratory is interested in understanding the function, organization and regulation of vertebrate genomes, and how genomic variations could potentially impact them. He is also involved in creating novel methods and resources for the analysis and annotation of genomes and in understanding the functional impact of genomic variations. He has been part of collaborative genomics projects aimed at understanding Asian genome diversity. He has also been part of whole genome sequencing and analysis projects including the Indian, Sri Lankan and Malaysian genome projects, and is a member of the HUGO Pan-Asian Population Genomics Initiative task-force. He has adopted novel and creative strategies, such as the use of social media and the participation of a large number of undergraduate students in collaborative projects, to accelerate genome annotation and co-create resources for genome annotation.

Vinod completed his undergraduate medical education at Calicut Medical College, University of Calicut, and his PhD in Computational Biology at the University of Pune. Vinod has over 80 publications in international peer-reviewed journals and two book chapters to his credit. He is also on the editorial boards of PLoS ONE, PeerJ, Journal of Translational Medicine and Journal of Orthopaedics (Elsevier). He received the CSIR Young Scientist Award for Biological Sciences in 2012. He was a member of the senate of the Academy of Scientific and Innovative Research (AcSIR).

Reaching the authors

This book was written keeping in mind how genomic technologies could translate to patient care. The authors would be happy to extend their expertise and resources to help in the diagnosis of patients with rare genetic diseases. Interested clinicians and patient groups may contact us for further discussion. You can reach us at:

Email: [email protected] OR [email protected]

Register yourself to the Clinical Exome Group

We have set up a Reader’s club to keep you updated about new versions of this book and recent developments in the field. It is also an opportunity to share and find answers to your issues with exome sequencing and analysis, and to discuss interesting cases with experts in the field. To register, follow this link: http://goo.gl/o9aAfC

You can also leave your comments on our Facebook page: https://www.facebook.com/clinicalexome

What readers have to say...

"The book is very well written, concise and provides an excellent collection of data capturing the transition of one era into another. Due emphasis was given towards the limitations of NGS along with its widely acknowledged benefits. It helps one to understand the basics of whole exome sequencing from a realistic viewpoint. Each chapter is well constructed and systematically elucidates situations where WES would be useful. Moreover, it provides an impetus for the clinicians to understand their contributions towards accurate phenotyping for better understanding of the genetic variations in a diagnostic set-up"

Yenamandra Vamsi Krishna, Department of Dermatology, All India Institute of Medical Sciences, Delhi

Let us know what you have to say about this book on our Facebook page: https://www.facebook.com/clinicalexome

Scaria V and Sivasubbu S (2015). Exome Sequence Analysis and Interpretation.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Cover Image: Artist’s impression of Nucleotides in a DNA strand. Oil on canvas by Pradha (2015)