IDENTIFICATION OF GENETIC VARIANT IN BUFFALO
GENOME USING ddRAD SEQUENCE
A
DISSERTATION
SUBMITTED TO ORISSA UNIVERSITY OF AGRICULTURE &
TECHNOLOGY, BHUBANESWAR
IN PARTIAL FULFILMENT OF THE REQUIREMENT FOR THE
DEGREE OF
MASTER OF SCIENCE IN BIOINFORMATICS
BY
ANJAN KUMAR PRADHAN
Adm. No.-28BI/15
DEPARTMENT OF BIOINFORMATICS
CENTRE FOR POST GRADUATE STUDIES
ORISSA UNIVERSITY OF AGRICULTURE AND TECHNOLOGY
BHUBANESWAR-751003
2017
Advisor Mrs. Sushma Rani Martha
CERTIFICATE –II
This is to certify that the dissertation entitled “Identification of Genetic Variant in Buffalo
Genome Using ddRAD Sequence”, submitted by Anjan Kumar Pradhan to the
Orissa University of Agriculture & Technology, Bhubaneswar, in partial fulfilment of the
requirements for the award of the degree of Master of Science in Bioinformatics, has been
approved by the student's advisory committee after an oral examination of the same in
collaboration with the external examiner.
ADVISORY COMMITTEE
1. Dr. D.C. Mishra Chairman ……………..
Senior Scientist
ICAR-IASRI, New Delhi
2. Mrs. Sushma Rani Martha Member ……………..
Asst. Professor
Department of Bioinformatics
3. Mr. Sukanta Kumar Pradhan Member ……………..
Head of the Department
Department of Bioinformatics
External Examiner ……………..
ACKNOWLEDGEMENT
It is my privilege to express my deep sense of gratitude to my advisor Mrs. Sushma Rani Martha, Asst.
Professor, Department of Bioinformatics, Orissa University of Agriculture and Technology, for her
constant guidance, support and help, without which the project would not have been successfully
completed.
I express my profound gratitude to Dr. D.C. Mishra, Senior Scientist, ICAR-IASRI, New Delhi, for allowing me to
carry out the dissertation work under his guidance.
I am thankful to the members of the advisory committee, Mrs. Sushma Rani Martha, Asst. Professor, Dept.
of Bioinformatics, OUAT; Mr. Sukant Kumar Pradhan, HOD, Dept. of Bioinformatics, OUAT; and Dr.
D.C. Mishra, Senior Scientist, ICAR-IASRI, New Delhi, for their support and encouragement to carry
out this work successfully.
I offer my sincere thanks to Sushree Didi for the encouragement.
I convey my heartfelt thanks to all my faculty members, Mr. Surya Narayan Ratha, Mrs. Sucharita
Balabanta Ray and Mr. Sujit Kumar Dash, for their guidance at each and every step during my
experimental laboratory work. I heartily thank Dr. K. K. Chaturbedi for the help, support and
innovative ideas given towards the completion of this project work.
I would like to express my heartiest and cordial regards to my beloved parents, who are a real inspiration
for me in every step of my life, giving me unbounded emotional support. I feel great pleasure in expressing my
love to all my sweetest friends for their strong mental support throughout the project period and in
these two years.
I feel honoured to be a part of this auspicious university, which provided me with a healthy atmosphere in
these two years. Last but not least, I express my gratitude to God for invaluable inspiration for the
accomplishment of this work.
Anjan Kumar Pradhan
CONTENTS
CHAPTER NO. PARTICULARS PAGE NO.
I. INTRODUCTION 1-3
II. REVIEW OF LITERATURE 4-12
III. MATERIALS AND METHODS 13-35
IV. RESULT AND DISCUSSION 36-50
V. CONCLUSION 51
REFERENCE
CURRICULUM VITAE
LIST OF FIGURES
FIGURE NO. PARTICULARS PAGE NO.
1. Difference between RAD & ddRAD 16
2. Per base sequence quality 20
3. Per tile sequence quality 20
4. Per sequence quality score 21
5. Per base sequence content 22
6. Per sequence GC content 22
7. Per base N content 23
8. Sequence length distribution 24
9. Sequence duplication levels 25
10. Adapter content 27
11. Kmer content 27
12. Trimming report 29
13. Stacks diagram 32
14. SNPs(Milk yield trait samples) 39
15. Haplotypes 40
16. SNPs(Lactation period trait samples) 42
17. Haplotypes 42
18. SNPs (Age at first calving) 49
19. Haplotypes 49
LIST OF TABLES
TABLE NO. PARTICULARS PAGE NO.
1. ddRAD sequence 14
2. Basic statistics 19
3. Over represented sequences 26
4. Milk yield trait 37
5. Marker (Milk yield trait) in the population 38
6. Lactation period trait 41
7. SNP Summary statistics in the population 43
8. Haplotype Summary statistics in the population 43
9. Hapstats. Summary statistics in the population 44
10. Sumstats. Summary statistics in the population 45
11. Sumstats_Summary statistics in the population 46
12. Marker (Lactation period trait) in the population 46
13. Age at first calving trait 47
13.1 Marker (Age at first calving trait) in the population 48
Name of the Student : Anjan Kumar Pradhan
Admission No : 28BI/15
Title of thesis : Identification of Genetic Variant in Buffalo Genome Using ddRAD Sequence
Degree for which thesis submitted : Master of Science in Bioinformatics
Name of the Dept. & University : Department of Bioinformatics,
Centre for Post Graduate Studies,
Orissa University of Agriculture
& Technology, Bhubaneswar,
Orissa, 751003
Year of submission : 2017
Name of the advisor : Mrs. Sushma Rani Martha
ABSTRACT
Bubalus bubalis (water buffalo) is an agro-economically important livestock species due to its
multipurpose use in India and other Asian countries. The aim of this study is to identify single
nucleotide polymorphisms (SNPs) in the buffalo genome using ddRAD sequencing through the
STACKS pipeline. We used double digest restriction-site associated DNA sequencing
(ddRAD) to identify and annotate genetic variants in buffalo samples for three traits:
milk yield, lactation period, and age at first calving. The Stacks pipeline uses ddRAD sequence data
to create genetic maps and conduct population analysis. It assembles loci de novo from an individual’s
sequence reads or by using a reference sequence. These loci are catalogued and compared against other
individuals’ loci to create a map of alleles. Stacks can identify thousands of markers and use this
information to study genomic structure and assembly. Stacks employs a Catalog to record all loci
identified in a population and matches individuals to that Catalog to determine which
haplotype alleles are present at every locus in each individual.
Keywords: ddRAD sequence, STACKS pipeline, Sequence Alignment Mapping, Genetic
Variant, NGS, Stacks Web Interface.
CHAPTER-I
INTRODUCTION
Water buffalo (Bubalus bubalis) was domesticated approximately 5000 years ago in India to
secure a supply of milk, meat and power [Rudolph, M. C. et al 2007]. It has been grouped into
(i) swamp buffalo, primarily developed for draught purposes, and (ii) river buffalo, primarily used for
milk production. Among the total of 13 recognized breeds of water buffalo, the majority are milch
breeds in India, and some of them have been listed on a state-level conservation plan by the
Ministry of Agriculture, Government of India [Ding, X. et al 2012]. As buffalo milk occupies
the highest share in the Indian dairy sector, future improvement in traits of economic
importance is dependent on the genetic variation present within and between breeds. Even though
they have an important role in the Indian agricultural economy, most of the breeds have not been
exploited for their full genetic potential. Recently, genomic selection in cattle has been adopted
globally to accelerate genetic gains [Van Horn, C. G., Caviglia 2005]. Molecular markers like
single nucleotide polymorphisms (SNPs) can play a significant role in livestock improvement
through conventional breeding programmes. However, the present genomic resources are
limited for river buffalo. Moreover, molecular genetic diversity in river buffalo has been explored
using cattle-based microsatellite markers [Mashek, D. G. and Coleman 2006]. Taking
advantage of the availability of the fully sequenced cattle genome and other related genomic
resources, and given the close evolutionary relationship between cattle and river buffalo, we
sequenced the river buffalo genomes on a large scale to detect genetic variants and, in particular,
identified large-scale SNPs, which may help in the study of river buffalo genomics. The genetic
component plays a major role in milk production and other functional traits of dairy animals
[Mercade, A. et al 2006].
The advent of next-generation sequencing has enabled a robust and more cost-effective
approach for the identification of high-throughput SNPs. Recently, exome/targeted capture
sequencing has been used to analyse disease traits in livestock species because it is efficient
and cost-effective [6]. In the present study we carried out targeted sequencing, for discovering
variants in and across targeted regions. To the best of our knowledge, there are no earlier
studies on targeted (exome) sequencing in river buffalo for high-throughput variant discovery.
Although there are many advantages to raising water buffalo as described above, these animals
remain underutilized. In particular, water buffalo breeders and farmers have been facing many
challenges and problems, such as poor reproductive efficiency, sub-optimal production
potential, higher than normal incidence of infertility, and lower rates of calf survival. Genome
research has created a broad basis for promoting and utilizing gene technologies in many fields
of livestock production. For example, genome biotechnology will provide a major opportunity
to advance sustainable animal production systems of higher productivity through manipulating
the variation within and between breeds to realize more rapid and better-targeted gains in
breeding value. This type of research will also make it possible to distinguish molecular
phenotypes and thus improve the use of genetic resources in domestic animals. Therefore, the
present review focuses on the currently available genome resources in water buffalo, thus
providing knowledge and technologies that can help optimize production potentials,
reproduction efficiency, product quality, nutritional value and resistance to diseases in the
species. Genetics is responsible for approximately half the observed change in performance
internationally in well-structured cattle breeding programs. Almost all, if not all, individual
characteristics, including animal health, have a genetic basis. Once genetic variation exists then
breeding for improvement is possible. Although the heritability of most health traits is low to
moderate, considerable exploitable genetic variation does exist.
Water buffalo provide more than 5% of the world’s milk supply, which contains less water and
more fat, lactose, protein, and minerals than cow milk [Schwehm, J. M 1998]. Water buffalo
milk is used to make butter, butter oil, high quality cheeses, and other high quality dairy
products. They have leaner meat that contains less fat and cholesterol than beef, while having a
comparable taste [Manjithaya, R. R. and Dighe 2004]. Their hide can be used to make good
quality leather products and they make good beasts of burden, providing 20% to 30% of all
farm power, and are superior draught animals in waterlogged conditions such as rice paddies.
Water buffalo utilize less digestible feeds than cattle making them easier to maintain using
locally available roughages. In addition, water buffalo are used as cash--to be sold when the
need arises; thus securing the economic status of many families. The husbandry system of
water buffalo depends on the purpose for which they are bred and maintained. They are often
referred to as "the living tractor of the East". It probably is possible to plough deeper with
buffalo than with either oxen or horses. India is considered the home tract of some of the best
buffalo breeds. Because buffaloes are preferred for milk, many she-buffaloes from the breeding tract
are moved to thickly populated urban and industrial centres to meet the milk requirements of the
population. There they are generally slaughtered after completing one or two lactations. Their
progeny are allowed to die due to neglect, and thus no replacement of superior germplasm is
possible. Indian buffaloes are an important source of milk supply today and yield nearly three
times as much milk as cows. More than half of the total milk produced in the country (55%)
was contributed by the 47.22 million milch buffaloes, whereas the 57.0 million cows
contributed only 45% of the total milk yield. Indian buffaloes are water buffaloes. There are
about 10 indigenous standard breeds of buffaloes, which are well known for their milking
qualities.
Bubalus bubalis (water buffalo) is an agro-economically important livestock species due to its
multipurpose use in India and other Asian countries. The aim of this study is to identify single
nucleotide polymorphisms (SNPs) in buffalo samples for three traits (milk yield, lactation
period and age at first calving) using ddRAD sequence data through the STACKS pipeline. Stacks
identifies loci in a set of individuals, aligned to a reference genome (including gapped
alignments), and then genotypes each locus. Stacks incorporates a maximum likelihood
statistical model to identify sequence polymorphisms and distinguish them from sequencing
errors. Stacks employs a Catalog to record all loci identified in a population and matches
individuals to that Catalog to determine which haplotype alleles are present at every locus in
each individual.
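The likelihood-based separation of true polymorphisms from sequencing errors can be illustrated with a small sketch. This is not the Stacks implementation: it assumes a simplified multinomial model at a single site, a fixed and hypothetical per-nucleotide error rate (`error_rate`), and a likelihood-ratio cut-off of 3.84 (chi-square, one degree of freedom, 5% level). The function names are illustrative.

```python
import math
from collections import Counter


def log_multinomial_coeff(counts):
    """Log of n! / (n1! n2! n3! n4!) for the four nucleotide counts."""
    n = sum(counts)
    return math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)


def call_genotype(bases, error_rate=0.01, threshold=3.84):
    """Classify one site from the nucleotides observed across reads.

    Returns ("homozygous" | "heterozygous" | "unknown", alleles).
    error_rate and threshold are assumed values, not Stacks defaults.
    """
    counts = Counter(bases)
    # Rank nucleotides by descending read count (ties broken alphabetically).
    ranked = sorted("ACGT", key=lambda b: (-counts[b], b))
    n1, n2, n3, n4 = (counts[b] for b in ranked)
    coeff = log_multinomial_coeff([n1, n2, n3, n4])
    e = error_rate
    # Homozygote: reads other than the most common nucleotide are errors.
    log_hom = coeff + n1 * math.log(1 - 3 * e / 4) + (n2 + n3 + n4) * math.log(e / 4)
    # Heterozygote: the two most common nucleotides are both true alleles.
    log_het = coeff + (n1 + n2) * math.log(0.5 - e / 4) + (n3 + n4) * math.log(e / 4)
    likelihood_ratio = 2 * abs(log_hom - log_het)
    if likelihood_ratio < threshold:
        return "unknown", ()
    if log_hom > log_het:
        return "homozygous", (ranked[0],)
    return "heterozygous", tuple(sorted(ranked[:2]))
```

With these assumed settings, a site covered by 19 reads of A and one read of G is called homozygous (the single G is treated as an error), whereas ten reads of each is called heterozygous.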
OBJECTIVES
• Data compilation and preprocessing of the ddRAD sequence data for three traits in
buffalo.
• Identification and annotation of genetic variants.
CHAPTER-II
REVIEW OF LITERATURE
Since the early 1800s, breed development was based on phenotype selection on coat color and
polled phenotypes, and included the imposition of severe bottlenecks followed by breed
expansion via artificial insemination. During the last 50 years, animal breeding based on
quantitative genetics has resulted in a remarkable progress in improving production traits for
milk and meat (Andersson and Georges 2004). Therefore, selection (natural and human-
imposed) and nonselective forces (the demographic events and introgression) drove changes
within the cattle genome. Their combined effects have created exceptional phenotypic diversity
and genetic adaptation to local environment across the globe within the modern cattle breeds. It
is generally accepted that there are four mechanisms of evolutionary change: Mutation, genetic
drift, gene flow or migration (demographic history), and selection. However, only selection is
locus specific, while the first three forces work uniformly across the whole genome. Selection
can be divided into three modes: Positive, purifying (or negative selection, eliminating a
deleterious mutation), and balancing selection (including heterozygote advantage and
frequency-dependent selection). Positive selection is a mode of natural selection that drives the
increase in prevalence of advantageous alleles due to their favorable effects on fitness (Biswas
and Akey 2006; Kelley and Swanson 2008; Oleksyk et al. 2010). Genetic hitchhiking refers to
changes in the frequency of an allele because of linkage with a positively selected or neutral
allele at another locus. The availability of genomic data has spurred many approaches for
mapping positive selection, mainly based on reduced local variability, deviations in the marker
frequency, increased linkage disequilibrium (LD), and extended haplotype structure. Methods
such as CLR, CMS, FST, EHH, iHS, and hapFLK (Tajima 1989; Fay and Wu 2000;
Sabeti et al. 2002; Nielsen et al. 2005; Voight et al. 2006; Grossman et al. 2010; Fariello et al.
2013) have been widely used in human, mouse, rat, and domesticated animals like dogs, cattle,
sheep, pigs, horses, and chickens (Waterston et al. 2002; Gibbs et al. 2004; Rubin et al. 2010,
2012; Kijas et al. 2012; Petersen et al. 2013). One method (di) was recently developed to
identify genomic regions indicative of selection with a high degree of genetic differentiation
between dog breeds (Akey et al. 2010). Distinct from FST, which measures the fraction of total
genetic variation between two populations, the di value is defined as a function of unbiased
genetic variation between two populations, the pi value is defined as a function of unbiased
estimates of all pairwise FST between one breed and the remaining breeds within a population.
It is suited for detecting selection specific to a particular breed, or subset of breeds, and
isolating the direction of change. It was utilized to track lineage-specific signatures of selection
in the dog and horse genomes, revealing its power to detect selection acting on both newly
arisen and preexisting variations (Akey et al. 2010; Petersen et al. 2013). Selection mapping is a
powerful approach, together with genome-wide association studies, to detect candidate genes
associated with quantitative traits. Selection mapping in cattle has previously been investigated
using lower-density markers such as the BovineSNP50 array (Flori et al. 2009; Hayes et al. 2009;
Qanbari et al. 2010, 2011; Stella et al. 2010; Rothammer et al. 2013). Only recently have similar
studies been reported based on higher-density markers such as the BovineHD array (Porto-Neto et al.
2013; Utsunomiya et al. 2013; Kemper et al. 2014; Perez et al. 2014). More recently,
sequence-based signatures were reported in Fleckvieh (Qanbari et al. 2014). However, these studies
focused on a limited number of breeds with specific traits. Therefore, it is possible that many breed-specific
selection signatures remain undetected due to the lack of comparison across breeds. There are a few
targeted studies of the haplotype pattern and evolution of selected gene families such as
Toll-like receptors in cattle (Seabury et al. 2010). However, to our knowledge, no systematic effort
has been reported to investigate the haplotype pattern and evolution of positively selected
genes in the cattle genome. In this study, we investigated diverse genomic selection using
high-density single nucleotide polymorphism (SNP) data of five distinct cattle breeds, including
Holstein (HOL), Angus (ANG), Charolais (CHL), Brahman (BRM), and N’Dama (NDA).
HOL, ANG, and CHL are taurine breeds from Europe. HOLs represent the highest-production
dairy animals, originally from the Netherlands and northern Germany. Their black-and-white
color was due to artificial selection by the breeders. ANG cattle, first developed in Scotland,
are used in beef production. They are naturally polled (do not have horns) and solid black or
red in color. CHL is a dual-purpose breed (both milk and beef) that originated in France and is
known for its large body size, bone structure, and white to cream coat. NDA is an indigenous
local taurine breed from West Africa. With a small size and fawn coat, NDA is well known for
its trypanotolerance and shows superior resistance to ticks and other parasites
(http://www.ansi.okstate.edu/breeds/cattle/, last accessed December 2, 2014). BRM is a
composite of several zebu breeds imported from India (Guzerat, Kankrej, Gir, and others), and
was first bred in America in the 1880s for beef production with a minor taurine contribution
(Decker et al. 2014). The BRM is known for its gray coat, heat tolerance, and disease
resistance. We performed a genome-wide scan with the BovineHD SNP genotypes to map
selection signatures among these five diverse cattle breeds.
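The di statistic described above can be sketched in a few lines. The code assumes the standardized-sum form attributed to Akey et al. (2010): for each SNP, the FST between the focal breed i and each other breed j is centred on the genome-wide mean for that breed pair, divided by its standard deviation, and summed over all j. The function name, the use of the population standard deviation, and the FST values in the usage note are assumptions made for illustration.

```python
from statistics import mean, pstdev


def d_i(fst_by_pair):
    """Compute d_i per SNP for one focal breed.

    fst_by_pair maps the name of each other breed to a list of per-SNP
    FST values against the focal breed (same SNP order in every list).
    Returns one d_i value per SNP.
    """
    n_snps = len(next(iter(fst_by_pair.values())))
    totals = [0.0] * n_snps
    for fst in fst_by_pair.values():
        mu = mean(fst)
        # Assumption: population standard deviation across all SNPs.
        sd = pstdev(fst)
        if sd == 0:
            continue  # no variation for this pair, so it adds nothing
        for k, value in enumerate(fst):
            totals[k] += (value - mu) / sd
    return totals
```

For example, with hypothetical values `d_i({"ANG": [0.1, 0.1, 0.4], "CHL": [0.0, 0.2, 0.4]})`, the third SNP receives the largest di because it is strongly differentiated from both other breeds.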
2.1 Genetic Variant
A genetic variant is an alteration in the most common DNA nucleotide sequence. The term variant can be used to
describe an alteration that may be benign, pathogenic or of unknown significance. The term variant
is increasingly being used in place of the term mutation.
• Mutations: changes at the level of DNA in which one or more base pairs have undergone a change; the change
could be random or due to a factor in the environment. Major deletions, insertions, and genetic
rearrangements can affect several genes or large areas of a chromosome at once.
• Polymorphisms: differences in individual DNA which are not mutations. Single-nucleotide polymorphisms
(SNPs) are the most common, occurring about once every 1,000 bases.
• Copy number variations: some DNA repeats itself (e.g. AAGAAGAAGAAG) and there can be variation
in the number of repeats.
Genetic variation means that biological systems, both individuals and populations, are different over
space. Each gene pool includes various alleles of genes. The
variation occurs both within and among populations, supported by individual carriers of the
variant genes. Genetic variation is brought about, fundamentally,
by mutation, which is a permanent change in the chemical
structure of chromosomes. Genetic recombination also produces changes within
alleles.
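Variation in repeat number, as in the AAG example above, can be illustrated by counting consecutive copies of a repeat unit in two alleles. This is a minimal sketch; the function name and both sequences are hypothetical.

```python
def longest_tandem_run(sequence, unit):
    """Return the largest number of consecutive copies of unit in sequence."""
    if not unit:
        raise ValueError("unit must be a non-empty string")
    best = 0
    step = len(unit)
    for start in range(len(sequence)):
        copies = 0
        pos = start
        # Walk forward one repeat unit at a time while the unit keeps matching.
        while sequence.startswith(unit, pos):
            copies += 1
            pos += step
        best = max(best, copies)
    return best


# Two hypothetical alleles that differ only in the number of AAG repeats.
allele_a = "TTAAGAAGAAGAAGCC"
allele_b = "TTAAGAAGCC"
```

Calling `longest_tandem_run(allele_a, "AAG")` and `longest_tandem_run(allele_b, "AAG")` gives the repeat counts that distinguish the two alleles.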
2.1.1 Among individuals within a population
Genetic variation among individuals within a population can be identified at a variety of levels.
It is possible to identify genetic variation from observations
of phenotypic variation in either quantitative traits (traits
that vary continuously and are coded for by many genes, e.g., leg length in dogs) or discrete
traits (traits that fall into discrete categories and are coded for by one or a few genes, e.g.,
white, pink, red petal color in certain flowers).
Genetic variation can also be identified by examining variation at the level
of enzymes using the process of protein electrophoresis. Polymorphic genes have
more than one allele at each locus. Half of the genes that code for enzymes in insects and
plants may be polymorphic, whereas polymorphisms are less common in vertebrates.
Ultimately, genetic variation is caused by variation in the order of bases in the nucleotides in
genes. New technology now allows scientists to sequence DNA directly, which has identified
even more genetic variation than was previously detected by protein electrophoresis.
Examination of DNA has shown genetic variation in both coding regions and the non-coding
intron regions of genes. Genetic variation will result in phenotypic variation if variation in the
order of nucleotides in the DNA sequence results in a
difference in the order of amino acids in proteins coded
by that DNA sequence, and if the resultant differences in amino acid
sequence influence the shape, and thus the
function, of the enzyme.
2.1.2 Between populations
Geographic variation means genetic differences in populations from different locations. This is
caused by natural selection or genetic drift.
2.1.3 Measurement
Genetic variation within a population is commonly measured as the percentage of gene
loci that are polymorphic or the percentage of gene loci in individuals that are heterozygous.
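The two measures above can be computed directly from genotype data. The following is a minimal sketch for diploid genotypes; the function names, the data layout (a dictionary of loci, each holding one allele pair per individual), and the sample values are assumptions made for illustration.

```python
def proportion_polymorphic(genotypes_by_locus):
    """Fraction of loci at which more than one allele is observed."""
    polymorphic = 0
    for genotypes in genotypes_by_locus.values():
        alleles = {allele for genotype in genotypes for allele in genotype}
        if len(alleles) > 1:
            polymorphic += 1
    return polymorphic / len(genotypes_by_locus)


def observed_heterozygosity(genotypes_by_locus):
    """Mean over loci of the fraction of individuals that are heterozygous."""
    per_locus = []
    for genotypes in genotypes_by_locus.values():
        heterozygotes = sum(1 for a, b in genotypes if a != b)
        per_locus.append(heterozygotes / len(genotypes))
    return sum(per_locus) / len(per_locus)


# Hypothetical genotypes for four individuals at three loci.
sample = {
    "locus_1": [("A", "A"), ("A", "G"), ("G", "G"), ("A", "G")],
    "locus_2": [("C", "C"), ("C", "C"), ("C", "C"), ("C", "C")],
    "locus_3": [("T", "T"), ("T", "C"), ("T", "T"), ("T", "T")],
}
```

In this sample two of the three loci are polymorphic, and the heterozygous fractions per locus are 0.5, 0 and 0.25.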
2.1.4 Sources
Random mutations are the ultimate source of genetic
variation. Mutations are likely to be rare, and most mutations are neutral or deleterious, but in
some instances the new alleles can be favored by natural
selection. Polyploidy is an example of chromosomal
mutation. Polyploidy is a condition wherein organisms have three or more sets of
chromosomes (3n or more). Crossing over and random segregation
during meiosis can result in the production of
new alleles or new combinations of alleles. Furthermore,
random fertilization also contributes to variation. Variation and recombination can be facilitated
by transposable genetic elements, endogenous
retroviruses, LINEs, SINEs, etc. For a given
genome of a multicellular organism, genetic variation may be acquired in somatic cells or
inherited through the germline.
2.1.5 Forms
Genetic variation can be divided into different forms according to the size and type of genomic
variation underpinning genetic change. Small-scale sequence variation includes base-pair
substitutions and indels. Large-scale structural variation can be either copy number
variation (loss or gain) or chromosomal rearrangement (translocation, inversion, or segmental
acquired uniparental disomy). Numerical variation in whole chromosomes or genomes can be
either polyploidy or aneuploidy.
2.1.6 Maintenance in populations
A variety of factors maintain genetic variation in populations. Potentially harmful recessive
alleles can be hidden from selection in the heterozygous individuals in populations
of diploid organisms (recessive alleles are only expressed in
the less common homozygous individuals). Natural
selection can also maintain genetic variation in balanced polymorphisms. Balanced
polymorphisms may occur when heterozygotes are favored or when selection is frequency
dependent. The only source of genetic variation in asexual organisms is mutation. Thus, if
replication of the genetic material were perfect, there would be no genetic variation and
thus no evolution. The importance of genetic variation is seen when the environment changes.
In such cases, if genetic variation is not present, the prevalent genotypes might not be
suited to the changed environment and the species might die out. If there is genetic
variation, natural selection can act on it and bring about
adaptation to the new environment. In scientific terms, variation is the degree to which progeny differ
from their parents. These are the differences found in the morphological, physiological, cytological
and behavioural traits of individuals belonging to the same species, race and family. They
appear in offspring and siblings due to
• reshuffling of genes by chance separation of chromosomes
• crossing over
• chance combination of chromosomes during meiosis and fertilisation
• mutations
• effect of environment.
Water buffalo milk presents physicochemical features different from those of other ruminant
species, such as a higher content of fatty acids and proteins. The physical and chemical
parameters of swamp and river type water buffalo milk differ. Water buffalo milk contains
higher levels of total solids, crude protein, fat, calcium, and phosphorus, and a slightly higher
content of lactose compared with cow milk. The high level of total solids makes water
buffalo milk ideal for processing into value-added dairy products such as cheese. The
conjugated linoleic acid (CLA) content in milk ranged from 4.4 mg/g fat in September to 7.6
mg/g fat in June. Seasons and genetics may play a role in the variation of CLA level and changes
in the gross composition of water buffalo milk. Water buffalo milk is processed into a large
variety of dairy products:
• Cream churns much faster at higher fat levels and gives higher overrun than cow cream.
• Butter from water buffalo cream displays more stability than that from cow cream.
• Ghee from water buffalo milk has a different texture with a bigger grain size than ghee from cow milk.
• Heat-concentrated milk products in the Indian subcontinent include paneer, khoa, rabri, kheer and basundi.
• Fermented milk products include dahi, yogurt, and chakka.
Water buffalo meat, sometimes called "carabeef", is often passed off as beef in certain regions,
and is also a major source of export revenue for India. In many Asian regions, buffalo meat is
less preferred due to its toughness; however, recipes have evolved (rendang, for example)
where the slow cooking process and spices not only make the meat palatable, but also preserve
it, an important factor in hot climates where refrigeration is not always available. Their hides
provide tough and useful leather, often used for shoes. The bones and horns are often made
into jewellery, especially earrings.
Horns are used for the embouchure of musical instruments, such as ney and kaval.
2.2 Next Generation Sequencing
In the past, two general strategies have been widely used for whole genome sequencing: BAC
by BAC sequencing and shotgun sequencing. Both strategies employ the Sanger method,
which is relatively costly, time consuming, and labor intensive [31]. Therefore, the high
demand for low-cost sequencing has led to the development of high-throughput sequencing
technologies, called next-generation sequencing. As recently reviewed by Jiang et al. [32],
three such next-generation sequencing technologies have been commercialized:
Roche/454 Life Sciences (http://www.454.com), Illumina/Solexa (http://www.Illumina.com) and
Applied Biosystems/SOLiD (http://solid.appliedbiosystems.com). These new-generation
sequencing methods no longer use the Sanger method for sequencing. Instead, the 454
technology is based on pyrosequencing and emulsion PCR; the Solexa technology utilizes a
sequencing-by-synthesis approach for sequencing DNA molecules attached to
a flow cell surface; and the SOLiD (supported oligonucleotide ligation and detection) technology is a
short-read sequencing method based on ligation. These next-generation
sequencing methods can produce a large amount of sequence in a relatively short time.
Sanger sequencing was developed by Frederick Sanger and colleagues in 1977 and was widely
used for about 25 years. Nowadays it has mostly been replaced by next-generation sequencing. Sanger
sequencing is quite slow and can sequence only a few thousand nucleotides in a week. The
next-generation sequencing method is fast, easy to operate and cost-effective. It can sequence about
200 billion nucleotides in a week, which is expected to rise to 600 billion in the next few years. It is
like comparing the data stored on a floppy disk (Sanger sequencing) to a 2 TB hard drive
(next-generation sequencing). The level of advancement that genome sequencing has undergone in the last
few years is so vast that now we can sequence the entire genome of any individual within a
couple of hours.
Nucleic acid sequencing is a method for determining the exact order of nucleotides present in a
given DNA or RNA molecule. In the past decade, the use of nucleic acid sequencing has
increased exponentially as the ability to sequence has become accessible to research and
clinical labs all over the world. The first major foray into DNA sequencing was the Human
Genome Project, a $3 billion, 13-year-long endeavor, completed in 2003. The Human Genome
Project was accomplished with first-generation sequencing, known as Sanger sequencing.
Sanger sequencing (the chain-termination method), developed in 1977 by Frederick Sanger and colleagues, was
considered the gold standard for nucleic acid sequencing for the subsequent two and a half
decades (Sanger et al., 1977). Since completion of the first human genome sequence, demand
for cheaper and faster sequencing methods has increased greatly. This demand has driven the
development of second-generation sequencing methods, or next-generation sequencing (NGS).
NGS platforms perform massively parallel sequencing, during which millions of fragments of
DNA from a single sample are sequenced in unison. Massively parallel sequencing technology
facilitates high-throughput sequencing, which allows an entire genome to be sequenced in less
than one day. In the past decade, several NGS platforms have been developed that provide low-
cost, high-throughput sequencing. Here we highlight two of the most commonly used
platforms in research and clinical labs today: the Life Technologies Ion Torrent Personal
Genome Machine (PGM) and the Illumina MiSeq. The creation of these and other NGS
platforms has made sequencing accessible to more labs, rapidly increasing the amount of
research and clinical diagnostics being performed with nucleic acid sequencing.
The growing power and falling cost of next-generation sequencing (NGS) technology have
sparked an enormous range of applications. Gradually, sequencing is becoming the standard
technology to apply, certainly at the first step where the main questions are “what is
involved?” and “what is the basis?”. It should be realized that for many applications
sequencing would always have been the method of choice, yet it was at first science fiction,
then technically unthinkable, and later possible but far too costly. We perform genome-wide
association studies (GWAS) using SNP arrays simply because we cannot afford to perform
whole-genome sequencing in tens of thousands of individuals. This is changing rapidly and
sequencing will
become our molecular microscope, the tool to get a first look. Although replication,
transcription, translation, methylation and nuclear DNA folding are completely different
processes, they can all be studied using sequencing. An important advantage of sequence data
is its quality, robustness and low noise. It should be noted that a successful NGS project
requires expertise on both the wet-lab and the bioinformatics side in order to warrant
high-quality data and data interpretation. The sequence itself is hard evidence of its
correctness. A
sequencing system will not produce “random” sequences and when it does this becomes
evident immediately from QC calls obtained from spike-in controls. Furthermore random
sequences will have no match and can be easily discarded.
2.3 Single Nucleotide Polymorphisms Identification
Single nucleotide polymorphisms (SNPs) are stable, biallelic sequence variants that are
distributed throughout the genome and are present at an appreciable frequency (>1%) in the
human population. Together with other types of polymorphism, such as insertions or deletions,
they account for the genome variation among individuals, and the largest part of this
sequence variation is attributable to SNPs.
There are three main reasons to identify SNPs:
• Some of them may be involved in diseases due to the mutation they cause.
• Even SNPs that do not change protein expression may be close to deleterious and
unknown mutations on the same chromosome. Thus, they might be used as markers for
broad haplotype analysis.
• Their low rate of recurrent mutation makes them stable indicators of human history.
SNPs might be used to reveal the connections between human beings through time.
The principal projects involved in this identification are the SNP Consortium (which has
contributed 95% of the sequenced SNPs) and the International Human Genome Sequencing
Consortium. Single-nucleotide polymorphisms may fall within coding sequences
of genes, non-coding regions of genes, or in the intergenic regions (regions between genes).
SNPs within a coding sequence do not necessarily change the amino acid sequence of
the protein that is produced, due to degeneracy of the genetic code. Association studies can
determine whether a genetic variant is associated with a disease or trait.[6]
A tag SNP is a representative single-nucleotide polymorphism (SNP) in a region of the
genome with high linkage disequilibrium (the non-random association of alleles at two or more
loci). Tag SNPs are useful in whole-genome SNP association studies in which hundreds of
thousands of SNPs across the entire genome are genotyped.
Haplotype mapping: sets of alleles or DNA sequences can be clustered so that a single SNP
can identify many linked SNPs. Linkage disequilibrium (LD), a term used in population
genetics, indicates non-random association of alleles at two or more loci, not necessarily on
the same chromosome. It refers to the phenomenon that SNP alleles or DNA sequences which are
close together in the genome tend to be inherited together. LD is affected by two parameters:
1) the distance between the SNPs (the larger the distance, the lower the LD), and 2) the
recombination rate (the lower the recombination rate, the higher the LD).
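As an illustration of how the non-random association described above can be quantified, the following minimal Python sketch computes two commonly used LD measures, D and r², from counts of the four two-locus haplotypes. The function name and the example counts are hypothetical and are not taken from the buffalo data; D = p(AB) - p(A)p(B) and r² = D² / [p(A)(1 - p(A)) p(B)(1 - p(B))] are the standard definitions.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-locus haplotypes.

    A/a are the alleles at the first SNP, B/b the alleles at the second SNP.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total                 # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total         # frequency of allele A
    p_B = (n_AB + n_aB) / total         # frequency of allele B
    d = p_AB - p_A * p_B                # deviation from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, r_squared


# Hypothetical counts: alleles A and B always occur together (complete LD).
d_complete, r2_complete = linkage_disequilibrium(50, 0, 0, 50)
# Hypothetical counts: alleles combine at random (no LD).
d_none, r2_none = linkage_disequilibrium(25, 25, 25, 25)
```

With alleles that always travel together r² reaches 1, and with randomly combined alleles both D and r² are 0, which matches the qualitative description of LD given above.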
CHAPTER-III
MATERIALS AND METHOD
Materials Used
Trait data for water buffalo (milk yield, lactation period and age at first calving) were
collected, and the reference genome of cattle (GCA_000003055.5_Bos_taurus_UMD_3.1.1_genomic(1).fna)
was downloaded from NCBI
(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/055/GCA_000003055.5_Bos_taurus_UMD_3.1.1/).
For every individual sample, demultiplexed forward and reverse FASTQ files were used in the
analysis (ddRAD sequence data only). A simple naming convention (a single-word locality
code/name and a single-word sample identifier separated by an underscore) must be followed for
every sample; examples are 1_R1_001.fastq and 1_R2_001.fastq. A sample script can take a text
file containing sample names and use process_radtags from Stacks to demultiplex the samples and
put them in the proper naming convention.
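The pairing of forward and reverse files implied by the example names 1_R1_001.fastq and 1_R2_001.fastq can be sketched in Python as follows. This is only an illustration: the regular expression and the function name pair_fastq_files are assumptions based on those two example names, not part of Stacks or of the script used in this work.

```python
import re

# Assumed pattern: <sample>_R1_<chunk>.fastq / <sample>_R2_<chunk>.fastq
FASTQ_NAME = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_(?P<chunk>\d+)\.fastq$")


def pair_fastq_files(filenames):
    """Group file names by sample and return {sample: (R1 file, R2 file)}.

    Files that do not match the pattern, and samples that lack either the
    forward or the reverse file, are left out of the result.
    """
    by_sample = {}
    for name in filenames:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue
        reads = by_sample.setdefault(match.group("sample"), {})
        reads["R" + match.group("read")] = name
    return {
        sample: (reads["R1"], reads["R2"])
        for sample, reads in by_sample.items()
        if "R1" in reads and "R2" in reads
    }


pairs = pair_fastq_files(["1_R1_001.fastq", "1_R2_001.fastq", "2_R1_001.fastq"])
```

Here sample 2 is dropped because its reverse file is missing, which is the kind of check that is useful before trimming and mapping paired-end data.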
3.1 Double Digest Restriction-Site Associated DNA Sequencing
The double digest restriction-site associated DNA sequencing technology (ddRAD-sequence)
is a reduced representation sequencing technology, developed on the basis of next-generation
sequencing, that samples enzyme loci across the genome. ddRAD-sequence has been widely
applied to SNP marker development and genotyping in animals, especially marine animals,
as the original ddRAD protocol was mainly built and tested on animal data. However,
wide application of ddRAD-sequence technology in plant species has not been achieved so far.
Here, we aim to develop an optimized ddRAD library preparation protocol that is accessible
for buffalo without much pre-experiment work or start-up cost. In contrast to single-enzyme
RAD sequencing, ddRAD-sequence uses a two-enzyme double digest followed by precise size
selection that excludes regions flanked by either (a) very close or (b) very distant
restriction enzyme (RE) recognition sites, recovering a library consisting only of fragments
close to the target size.
Table. 1 ddRAD sequence
Name (restriction enzyme)   Sequence (5′ – 3′)
PstI                        TCTTTCCCTACACGACGCTCTTCCGATCTGCA
                            GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
EcoRI                       CTGGAGTTCAGACGTGTGCTCTTCCGATCT
                            AATTAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
HindIII                     TCTTTCCCTACACGACGCTCTTCCGATCT
                            AGCTAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
SalI                        CTGGAGTTCAGACGTGTGCTCTTCCGATC
                            TCGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
MspI                        CTGGAGTTCAGACGTGTGCTCTTCCGATCT
                            CGAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
Indexed primers for PCR
Forward primer              AATGATACGGCGACCACCGAGATCTACACXXXXXXXXACACTCTTTCCCTACACGACGCTCTTCC
Reverse primer              CAAGCAGAAGACGGCATACGAGATXXXXXXXXGTGACTGGAGTTCAGACGTGTGCTCTTC
Genotyping requires thousands of genomes to be compared in a reliable, consistent way.
Restriction site associated DNA sequencing (RAD-Sequence) interrogates a fraction of
the genome across many individuals, which makes it an ideal method for genotyping. By using
restriction enzyme digestion and sequencing the regions adjacent to restriction sites,
researchers can examine the same subset of genomic regions for thousands of individuals and
identify many genetic markers along the genome. Other NGS methods examine a larger portion
of the genome and offer more data, but they are costly and cannot be used to study the
thousands of individuals required for genotyping. RAD-Sequence applications include
genetic marker discovery, local genome assembly, QTL mapping and linkage mapping.
SciGenom uses double digest RAD-Sequence (ddRAD-Sequence), a variation of RAD-
Sequence, for genotyping. Traditional RAD-Sequence uses one restriction enzyme and
random shearing to generate fragments from genomic DNA. However, these steps lose a large
amount of DNA and offer little control over the fragments that are sequenced. For
organisms without a reference genome, a significant portion of the RAD-Sequence data is
discarded due to sequence read errors and the presence of variable sites. ddRAD-
Sequence was designed to address these shortcomings. In ddRAD-Sequence,
genomic DNA is digested with two restriction enzymes, and the resulting fragments
undergo adaptor ligation and precise size selection before sequencing. Only a very
small fraction of the fragments will be sequenced, and these fragments come from the
same genomic regions across individuals. Further, ddRAD requires half as many reads to
achieve high-confidence SNP calling, because the chance of obtaining duplicate reads from
the same restriction site is very low. Due to these modifications,
ddRAD has become a more economical method to genotype thousands of individuals,
and has been used for SNP discovery between two Peromyscus species that have no reference
sequence.
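The double digest and size selection described above can be illustrated with a small in silico sketch in Python. It is a simplification under stated assumptions: the recognition sequences CTGCAG (PstI) and CCGG (MspI) are the standard ones, but the cut position is taken to be the start of each site, only one strand is scanned, and the function name double_digest and the toy sequence are hypothetical.

```python
import re


def double_digest(sequence, site_a, site_b, min_len, max_len):
    """Return (start, end) of fragments flanked by two different enzymes.

    A fragment is kept only if it lies between one site_a and one site_b
    occurrence and its length falls inside the size-selection window.
    """
    cuts = [(m.start(), "A") for m in re.finditer(site_a, sequence)]
    cuts += [(m.start(), "B") for m in re.finditer(site_b, sequence)]
    cuts.sort()
    fragments = []
    # Adjacent cut sites delimit the fragments produced by a complete digest.
    for (pos1, enzyme1), (pos2, enzyme2) in zip(cuts, cuts[1:]):
        length = pos2 - pos1
        if enzyme1 != enzyme2 and min_len <= length <= max_len:
            fragments.append((pos1, pos2))
    return fragments


# Toy sequence: PstI site, 20 bp, MspI site, 100 bp, MspI site, 10 bp, PstI site.
toy = "CTGCAG" + "A" * 20 + "CCGG" + "A" * 100 + "CCGG" + "T" * 10 + "CTGCAG"
selected = double_digest(toy, "CTGCAG", "CCGG", min_len=20, max_len=50)
```

In the toy sequence the MspI-MspI fragment is rejected because both ends come from the same enzyme, and the short MspI-PstI fragment is rejected by the size window, so only the first PstI-MspI fragment is retained.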
3.1.1 Paired-End ddRAD-Sequence
In paired-end ddRAD-sequence data, each pair of R1 and R2 reads is separated by a gap of known
size, +/- approximately 90 bp, so linking these reads as haplotypes would be ideal. The 90 bp
uncertainty is not critically important, and each R1/R2 pair could be concatenated with an
appropriate number of Ns. Each full locus could then be run through the pipeline as a
coherent whole.
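The concatenation idea above can be sketched in Python as follows. This is a hypothetical illustration, not a step of the pipeline used here: the function names and the gap length are assumptions, and the reverse read is reverse-complemented on the assumption that R2 is reported on the opposite strand from R1.

```python
# Translation table for complementing DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(r1, r2, gap):
    """Concatenate R1 and the reverse complement of R2 with `gap` Ns between."""
    return r1 + "N" * gap + reverse_complement(r2)


# Hypothetical short reads and gap, for illustration only.
locus = join_pair("ACGT", "AAAC", gap=3)
```

The padded sequence places both reads in the same orientation, so that variants from the two ends could in principle be read as one haplotype.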
RAD vs ddRAD
RAD: reads lie between the restriction site and a random (sheared) site.
ddRAD: reads lie between the two restriction sites, which gives more flexibility in balancing
coverage of the genome against depth of coverage.
Fig1. Difference between RAD and ddRAD Sequencing
Double-digest restriction site-associated DNA sequencing (ddRAD-Sequence) enables high-
throughput genome-wide genotyping with next-generation sequencing technology.
Although computational in silico prediction of restriction sites from the genome sequence is
recognized as an effective approach for choosing the restriction enzymes to be used, few
reports have evaluated the in silico predictions against actual experimental data. In this
study, we designed and
demonstrated a workflow for in silico and empirical ddRAD-Sequence analysis in Buffalo, as
follows: (i) in silico prediction of optimum restriction enzymes from the reference genome, (ii)
verification of the prediction by actual ddRAD-Sequence data of four restriction enzyme
combinations, (iii) establishment of a computational data processing pipeline for high-
confidence single nucleotide polymorphism (SNP) calling, and (iv) validation of SNP accuracy
by construction of genetic linkage maps. The quality of SNPs based on de novo assembly
reference of the ddRAD-Sequence reads was comparable with that of SNPs obtained using the
published reference genome of Cattle. Comparisons of SNP calls in diverse Buffalo lines
revealed that SNP density in the genome influenced the detectability of SNPs by ddRAD-
Sequence.
The Stacks pipeline uses RAD-Sequence data to create genetic maps and conduct population
analysis. It assembles loci de novo from an individual’s sequence reads or by using a
reference sequence. These loci are catalogued and compared against other individuals’ loci to
create a map of alleles. Stacks can identify thousands of markers and use this information to
study genomic structure and assembly. Stacks can export data to JoinMap, R/qtl and VCF
formats. In addition to Stacks, SciGenom has the ability to use GATK, MUSCLE, MCL and BLAST
in the analysis pipeline.
3.2 FastQC
Modern high throughput sequencers can generate tens of millions of sequences in a single run. Before
analysing this sequence to draw biological conclusions you should always perform some simple quality
control checks to ensure that the raw data looks good and there are no problems or biases in your data
which may affect how you can usefully use it. Most sequencers will generate a QC report as part of
their analysis pipeline, but this is usually only focused on identifying problems which were generated
by the sequencer itself. FastQC aims to provide a QC report which can spot problems which originate
either in the sequencer or in the starting library material. FastQC can be run in one of two modes. It can
either run as a standalone interactive application for the immediate analysis of small numbers of FastQ
files, or it can be run in a non-interactive mode where it would be suitable for integrating into a larger
analysis pipeline for the systematic processing of large numbers of files.
3.2.1 Opening a Sequence file
To open one or more Sequence files interactively simply run the program and select File >
Open. You can then select the files you want to analyse. Newly opened files will immediately
appear in the set of tabs at the top of the screen. Because of the size of these files it can take a
couple of minutes to open them. FastQC operates a queueing system where only one file is
opened at a time, and new files will wait until existing files have been processed.
FastQC supports files in the following formats:
• FastQ (all quality encoding variants)
• Casava FastQ files*
• Colorspace FastQ
• GZip compressed FastQ
• SAM
• BAM
• SAM/BAM Mapped only (normally used for colorspace data)
FastQ files were used as input to FastQC in this study.
3.2.2 Evaluating Results
The analysis in FastQC is performed by a series of analysis modules. The left hand side of the
main interactive display or the top of the HTML report show a summary of the modules which
were run, and a quick evaluation of whether the results of the module seem entirely normal
(green tick), slightly abnormal (orange triangle) or very unusual (red cross).
It is important to stress that although the analysis results appear to give a pass/fail result, these
evaluations must be taken in the context of what you expect from your library. A 'normal'
sample as far as FastQC is concerned is random and diverse. Some experiments may be
expected to produce libraries which are biased in particular ways. You should treat the
summary evaluations therefore as pointers to where you should concentrate your attention and
understand why your library may not look random and diverse. Specific guidance on how to
interpret the output of each module can be found in the modules section of the help.
3.2.3 FastQC Report
Quality checks were performed for the following parameters:
• Basic Statistics: The Basic Statistics module generates some simple composition
statistics for the file analysed. Basic Statistics never raises a warning or an error.
Table. 2 Basic Statistics
Measure                              Value
Filename                             1_R1_001.fastq
File type                            Conventional base calls
Encoding                             Sanger / Illumina 1.9
Total Sequences                      172201
Sequences flagged as poor quality    0
Sequence length                      100
%GC                                  46
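The kind of summary reported in the Basic Statistics table can be reproduced approximately with a few lines of Python. The sketch below is not FastQC's implementation; it assumes a plain, uncompressed FASTQ file with exactly four lines per record, and the function name fastq_basic_stats and the two-record example are hypothetical.

```python
import io


def fastq_basic_stats(handle):
    """Count sequences, sequence length range and %GC in a 4-line FASTQ."""
    total = 0
    gc_bases = 0
    all_bases = 0
    lengths = set()
    for index, line in enumerate(handle):
        if index % 4 == 1:              # the second line of each record is the sequence
            seq = line.strip().upper()
            total += 1
            all_bases += len(seq)
            gc_bases += seq.count("G") + seq.count("C")
            lengths.add(len(seq))
    return {
        "total_sequences": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "percent_gc": round(100 * gc_bases / all_bases) if all_bases else 0,
    }


# Hypothetical two-record FASTQ held in memory instead of a file on disk.
example = io.StringIO("@r1\nGGCC\n+\nIIII\n@r2\nAATT\n+\nIIII\n")
stats = fastq_basic_stats(example)
```

For a real file the same function could be called on an open file handle, for example one opened on 1_R1_001.fastq.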
• Per base sequence quality: This view shows an overview of the range of quality
values across all bases at each position in the FastQ file. The y-axis on the graph shows
the quality scores. The higher the score the better the base call. The background of the
graph divides the y axis into very good quality calls (green), calls of reasonable quality
(orange), and calls of poor quality (red).
Fig2. Per base sequence quality
Fig3. Per tile sequence quality
• Per sequence quality scores: It is often the case that a subset of sequences will have
universally poor quality, often because they are poorly imaged (on the edge of the field
of view etc.); however, these should represent only a small percentage of the total
sequences. A warning is raised if the most frequently observed mean quality is below 27,
which equates to a 0.2% error rate. An error is raised if the most frequently observed
mean quality is below 20, which equates to a 1% error rate.
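The correspondence between the quality thresholds and the error rates quoted above follows from the Phred definition Q = -10 log10(P). A minimal Python sketch is given here; the function names are hypothetical, and the default offset of 33 assumes the Sanger / Illumina 1.9 encoding reported for this data set.

```python
def phred_to_error_probability(quality):
    """Convert a Phred quality score into the probability of a wrong base call."""
    return 10 ** (-quality / 10)


def mean_quality(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 by default)."""
    scores = [ord(char) - offset for char in quality_line]
    return sum(scores) / len(scores)


warning_rate = phred_to_error_probability(27)   # about 0.002, i.e. 0.2%
failure_rate = phred_to_error_probability(20)   # 0.01, i.e. 1%
```

A read whose quality string is all 'I' characters, for instance, has a mean quality of 40 under the Phred+33 encoding.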
Fig 4. Per sequence quality scores
• Per base sequence content: Per Base Sequence Content plots out the proportion of
each base position in a file for which each of the four normal DNA bases has been
called. This module issues a warning if the difference between A and T, or G and C is
greater than 10% in any position. This module will fail if the difference between A and
T, or G and C is greater than 20% in any position.
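The warning and failure rules for this module can be expressed as a short Python sketch. It only mirrors the thresholds stated above and is not FastQC's own code; the function name and the single-base example reads are hypothetical, and percentages are computed over the A, C, G and T calls at each position.

```python
from collections import Counter


def per_base_content_status(reads):
    """Return 'pass', 'warn' or 'fail' using the 10% and 20% thresholds."""
    if not reads:
        return "pass"
    status = "pass"
    for pos in range(max(len(read) for read in reads)):
        counts = Counter(read[pos] for read in reads if pos < len(read))
        called = sum(counts[base] for base in "ACGT")
        if called == 0:
            continue
        pct = {base: 100 * counts[base] / called for base in "ACGT"}
        diff = max(abs(pct["A"] - pct["T"]), abs(pct["G"] - pct["C"]))
        if diff > 20:
            return "fail"               # one failing position fails the module
        if diff > 10:
            status = "warn"
    return status


# Hypothetical single-base reads: A 35%, T 20%, G 25%, C 20% at position 1.
status = per_base_content_status(["A"] * 7 + ["T"] * 4 + ["G"] * 5 + ["C"] * 4)
```

In ddRAD libraries the first bases of every read come from the restriction site, so a failure of this module at the start of the reads is expected rather than alarming.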
Fig5. Per base sequence content
• Per sequence GC content: This module measures the GC content across the whole length
of each sequence in a file and compares it to a modelled normal distribution of GC
content. A warning is raised if the sum of the deviations from the normal distribution
represents more than 15% of the reads. This module will indicate a failure if the sum of
the deviations from the normal distribution represents more than 30% of the reads.
Fig6. Per sequence GC content
• Per base N content: If a sequencer is unable to make a base call with sufficient
confidence then it will normally substitute an N rather than a conventional base call.
This module plots out the percentage of base calls at each position for which an N was
called. This module raises a warning if any position shows an N content of >5%. This
module will raise an error if any position shows an N content of >20%.
Fig7. Per base N content
• Sequence Length Distribution: In many cases this will produce a simple graph
showing a peak only at one size, but for variable length FastQ files this will show the
relative amounts of each different size of sequence fragment. This module will raise a
warning if all sequences are not the same length. This module will raise an error if any
of the sequences have zero length.
Fig8. Sequence length distribution
• Sequence Duplication Levels: This module counts the degree of duplication for every
sequence in the set and creates a plot showing the relative number of sequences with
different degrees of duplication. This module will issue a warning if non-unique
sequences make up more than 20% of the total. This module will issue an error if non-
unique sequences make up more than 50% of the total.
Fig 9. Sequence duplication levels
• Overrepresented sequences: This module lists all of the sequences which make up
more than 0.1% of the total. To conserve memory only sequences which appear in the
first 200,000 sequences are tracked to the end of the file. It is therefore possible that a
sequence which is overrepresented but doesn't appear at the start of the file for some
reason could be missed by this module. This module will issue a warning if any
sequence is found to represent more than 0.1% of the total. This module will issue an
error if any sequence is found to represent more than 1% of the total.
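The percentage used by this module is simply the count of a sequence divided by the total number of sequences. The Python sketch below applies the 0.1% and 1% thresholds stated above; it is a simplified illustration that counts every read and does not reproduce FastQC's memory-saving rule of tracking only sequences seen early in the file. The function name and the example reads are hypothetical.

```python
from collections import Counter


def overrepresented(reads, warn_pct=0.1, fail_pct=1.0):
    """Return (sequence, count, percentage, level) for overrepresented reads."""
    total = len(reads)
    rows = []
    for seq, count in Counter(reads).most_common():
        pct = 100 * count / total
        if pct > warn_pct:
            level = "fail" if pct > fail_pct else "warn"
            rows.append((seq, count, pct, level))
    return rows


# Hypothetical library of 1000 reads: one sequence at 2%, one at 0.5%.
library = ["AAAA"] * 20 + ["CCCC"] * 5 + ["READ%d" % i for i in range(975)]
hits = overrepresented(library)

# Percentage of the most frequent sequence in this data set (1420 of 172201 reads).
top_pct = 100 * 1420 / 172201
```

The last line gives approximately 0.82%, which is above the warning threshold of 0.1% but below the error threshold of 1%.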
Table. 3 Overrepresented sequences
Sequence                                            Count  Percentage           Possible Source
ATAGAGGCCAGCGGTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACT  1420   0.8246177432186805   Illumina Multiplexing PCR Primer 2.01 (100% over 34bp)
ATAGAGGCCATGCCTCTCTAGTTCTTCAAGGGATGACAGGACACTTGTCG  795    0.4616697928583458   No Hit
ATAGAGGCCATGCCAGGCCTCCCTGTCCATCACCAACTCCCGGAGTTCAC  399    0.23170597151003766  No Hit
ATAGAGGCCATGCATTGGAGAAGGAAATGGCAACCCACTCCAGTGTTCTT  331    0.19221723451083328  No Hit
ATAGAGGCCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCC  279    0.16201996504085342  No Hit
ATAGAGGCCATGCTGCGATTCATGGGGTCGCAAAGAGTCGGACACGACTG  206    0.11962764443876632  No Hit
ATAGAGGCCATGCCAGGCCTCCCTGTCCATCACCAACTCCCAGAGTTCAC  203    0.11788549427703672  No Hit
• Adapter Content
Fig10. Adapter content
• Kmer Content: This module counts the enrichment of every 5-mer within the sequence
library. It calculates an expected level at which this k-mer should have been seen based
on the base content of the library as a whole and then uses the actual count to calculate
an observed/expected ratio for that k-mer. In addition to reporting a list of hits it will
draw a graph for the top 6 hits to show the pattern of enrichment of that Kmer across the
length of your reads. This will show if you have a general enrichment, or if there is a
pattern of bias at different points over your read length.
Fig11. Kmer content
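The observed/expected ratio described for the Kmer Content module can be sketched in Python as follows. This is a simplified illustration of the idea, not FastQC's exact procedure: the expected count of a k-mer is taken as the number of k-mer windows multiplied by the product of the library-wide frequencies of its bases, and the function name and the example read are hypothetical.

```python
from collections import Counter


def kmer_obs_exp(reads, k=5):
    """Return {kmer: observed/expected ratio} for all k-mers in the reads."""
    base_counts = Counter("".join(reads))
    total_bases = sum(base_counts.values())
    base_freq = {base: n / total_bases for base, n in base_counts.items()}

    observed = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    n_windows = sum(observed.values())

    ratios = {}
    for kmer, count in observed.items():
        expected = n_windows
        for base in kmer:
            expected *= base_freq[base]   # independence assumption between bases
        ratios[kmer] = count / expected
    return ratios


# Hypothetical read: half A and half C, so each 5-mer is expected 6 * 0.5**5 times.
ratios = kmer_obs_exp(["AAAAACCCCC"], k=5)
```

A ratio well above 1 indicates a k-mer that occurs more often than the base composition alone would predict, as happens for restriction-site remnants at the start of ddRAD reads.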
3.3 Trimming
After FastQC confirmed that the expected number of samples was present in the current
directory, the sequence data were quality trimmed. Trimmomatic v0.32 was used to remove
double-digest RAD adapters and to trim low-quality bases, with quality scores encoded as
Phred+33 or Phred+64. The read mapping and variant calling steps of Stacks account for base
quality, so minimal trimming of the data is needed. Typically, quality trimming only needs to
be performed once. Trimmomatic is a fast, multithreaded command line tool that can be used to
trim and crop Illumina (FASTQ) data as well as to remove adapters. These adapters can pose a
real problem depending on the library preparation and downstream application.
There are two major modes of the program: paired-end mode and single-end mode. The paired-
end mode will maintain correspondence of read pairs and also use the additional information
contained in paired reads to better find adapter or PCR primer fragments introduced by the
library preparation process. Trimmomatic works with FASTQ files (using Phred+33 or Phred+64
quality scores, depending on the Illumina pipeline used). Files compressed using either
'gzip' or 'bzip2' are supported, and are identified by use of '.gz' or '.bz2' file extensions.
Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-
end data. The selection of trimming steps and their associated parameters are supplied on the
command line (java -jar /opt/software/Trimmomatic-0.32/trimmomatic-0.32.jar PE
1_R1_001.fastq 1_R2_001.fastq 1_R1_forward_paired.fastq 1_R1_forward_unpaired.fastq
1_R2_reverse_paired.fastq 1_R2_reverse_unpaired.fastq ILLUMINACLIP:TrueSeq3-PE.fa:2:30:10
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36).
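To make the SLIDINGWINDOW:4:15 and MINLEN:36 settings in the command above more concrete, the following Python sketch shows the general idea of sliding-window trimming. It is a simplified illustration under stated assumptions and not Trimmomatic's own implementation, which may choose a slightly different cut point; the function names and the example quality values are hypothetical.

```python
def sliding_window_cut(qualities, window=4, threshold=15):
    """Return how many leading bases to keep.

    The read is scanned from the 5' end and cut at the start of the first
    window whose mean quality falls below the threshold.
    """
    for start in range(len(qualities) - window + 1):
        window_scores = qualities[start:start + window]
        if sum(window_scores) / window < threshold:
            return start
    return len(qualities)


def survives_minlen(kept_length, min_length=36):
    """Apply the MINLEN rule: drop reads shorter than min_length after trimming."""
    return kept_length >= min_length


# Hypothetical read: ten good bases (Q30) followed by four poor ones (Q2).
keep = sliding_window_cut([30] * 10 + [2] * 4)
kept_read_is_long_enough = survives_minlen(keep)
```

In paired-end mode a read that fails the length rule is dropped and its surviving partner is written to the corresponding unpaired output file.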
3.3.1 Trimmomatic Report
Paired End Mode: For paired-end data, two input files, and 4 output files are specified, 2 for
the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired'
output where a read survived, but the partner read did not.
Fig12. Trimming report
3.4. Indexing
The reference genome must first be indexed with bowtie2-build. From the directory
containing the genome .fna file, run the bowtie2-build command (/opt/software/bowtie2/
bowtie2-build GCA_000003055.5_Bos_taurus_UMD_3.1.1_genomic\(1\).fna output).
This command will create 6 files with a *.bt2 file extension, all sharing the base name
"output". These will then be used by Bowtie 2 to map the reads.
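A quick way to confirm that indexing finished is to check that all six index files exist. The Python sketch below lists the file names that bowtie2-build writes for a standard (small) index with a given base name; the function names are hypothetical, and very large genomes produce files ending in .bt2l instead, which this sketch does not cover.

```python
import os


def bowtie2_index_files(basename):
    """Return the six file names of a standard Bowtie 2 index."""
    suffixes = ["1.bt2", "2.bt2", "3.bt2", "4.bt2", "rev.1.bt2", "rev.2.bt2"]
    return ["%s.%s" % (basename, suffix) for suffix in suffixes]


def missing_index_files(basename, directory="."):
    """Return the index files that are not present in the given directory."""
    return [
        name
        for name in bowtie2_index_files(basename)
        if not os.path.exists(os.path.join(directory, name))
    ]


expected_files = bowtie2_index_files("output")
```

An empty list from missing_index_files("output") would indicate that the index with the base name used above is complete.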
3.5. Sequence Alignment Map
The trimmed paired FASTQ files were mapped to the reference genome with bowtie2-align-s
using the command (/opt/software/bowtie2/bowtie2-align-s -x output -1
1_R1_forward_paired.fastq -2 1_R2_reverse_paired.fastq -S samfile_1.sam), which produces a
SAM alignment file. SAM stands for Sequence Alignment/Map format, and BAM is the
binary version of a SAM file. Sequence Alignment Map (SAM) is a text-based format for
storing biological sequences aligned to a reference sequence developed by Heng Li.[1] It is
widely used for storing data, such as nucleotide sequences, generated by next-generation
sequencing technologies. The format supports short and long reads (up to 128 Mbp) produced
by different sequencing platforms and is used to hold mapped data within the GATK and
across the Broad Institute, the Sanger Centre, and throughout the 1000 Genomes project.
It has been described as a "Sequence Alignment/Map (SAM) format for alignment of nucleotide
sequences (e.g. sequencing reads) to (a) reference sequence(s). May contain base-call and
alignment qualities and other data." [2]
The SAM format consists of a header and an alignment section.[1] The binary representation of
a SAM file is a BAM file, which is a compressed SAM file.[1] SAM files can be analysed and
edited with the software SAMtools.[1] The header section, if present, must precede the
alignment section. Header lines begin with the '@' symbol, which distinguishes them from the
alignment section. Each alignment line has 11 mandatory fields, as well as a variable number
of optional fields. The SAM flag is a little more difficult to decipher: its value is a
bitwise flag, with each binary bit corresponding to a certain property of the read. See the
format specification for more information. For example, if the 0x10 bit is set, then the read
is aligned as the reverse complement (i.e. maps to the - strand). Usually, the process of
removing duplicate reads or removing non-unique alignments is handled by the downstream
analysis program.
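The 11 mandatory fields and the bitwise flag can be illustrated with a short Python sketch. The field names and flag bits follow the SAM format specification; the function names, the short labels given to each bit and the example alignment line are hypothetical.

```python
# Bits of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM flag value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_alignment(line):
    """Split one tab-separated alignment line into mandatory and optional fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(MANDATORY_FIELDS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPTIONAL"] = fields[11:]
    return record


# Hypothetical alignment line with one optional field (NM:i:0).
example = "read1\t16\tchr1\t100\t42\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = parse_alignment(example)
strand_info = decode_flag(record["FLAG"])
```

For the example line the flag value 16 (0x10) decodes to the reverse-strand bit only, in agreement with the description above.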
3.6. Stacks Pipeline
Several molecular approaches have been developed to focus short reads to specific, restriction-
enzyme anchored positions in the genome. Reduced representation techniques such as CRoPS,
RAD-seq, GBS, double-digest RAD-seq, and 2bRAD effectively subsample the genome of
multiple individuals at homologous locations, allowing for single nucleotide polymorphisms
(SNPs) to be identified and typed for tens or hundreds of thousands of markers spread evenly
throughout the genome in large numbers of individuals. This family of reduced representation
genotyping approaches has generically been called genotype-by-sequencing (GBS) or
Restriction-site Associated DNA sequencing (RAD-seq). Stacks is designed to work with any
restriction-enzyme based data, such as GBS, CRoPS, and both single and double digest RAD.
Stacks is designed as a modular pipeline to efficiently curate and assemble large numbers of
short-read sequences from multiple samples. Stacks identifies loci in a set of individuals, either
de novo or aligned to a reference genome (including gapped alignments), and then genotypes
each locus. Stacks incorporates a maximum likelihood statistical model to identify sequence
polymorphisms and distinguish them from sequencing errors. Stacks employs a Catalog to
record all loci identified in a population and matches individuals to that Catalog to determine
which haplotype alleles are present at every locus in each individual.
Stacks is implemented in C++ with wrapper programs written in Perl. The core algorithms are
multithreaded via OpenMP libraries and the software can handle data from hundreds of
individuals, comprising millions of genotypes. Stacks incorporates a MySQL database
component linked to a web front end that allows efficient data visualization, management and
modification.
Stacks proceeds in five major stages:
• First, reads are demultiplexed and cleaned by the process_radtags program.
• The next three stages comprise the main Stacks pipeline: building loci
(ustacks/pstacks), creating the catalog of loci (cstacks), and matching against the
catalog (sstacks).
• In the fifth stage, either the populations or genotypes program is executed, depending
on the type of input data. For the data in this study, the populations program was
executed.
This flow is diagrammed in the following figure.
Fig13. Stacks diagram
The goal in Stacks is to assemble loci in large numbers of individuals in a population or genetic
cross, call SNPs within those loci, and then read haplotypes from them. Therefore Stacks wants
data that is a uniform length, with coverage high enough to confidently call SNPs. Although it
is very useful in other bioinformatic analyses to variably trim raw reads, this creates loci that
have variable coverage, particularly at the 3’ end of the locus. In a population analysis, this
results in SNPs that are called in some individuals but not in others, depending on the amount
of trimming that went into the reads assembled into each locus, and this interferes with SNP
and haplotype calling in large populations.
3.6.1. Protocol Type
Stacks supports all the major restriction-enzyme digest protocols such as RAD-seq, double-
digest RAD-seq, and GBS, among others. For double-digest RAD data that has been paired-
end sequenced, Stacks supports this type of data by treating the loci built from the single-end
and paired-end as two independent loci. In the near future, we will support merging these two
loci into a single haplotype.
3.6.2. Sequencer Type
Stacks is optimized for short-read, Illumina-style sequencing. There is no limit to the length
the sequences can be, although there is a hard-coded limit of 1024bp in the source code now
for efficiency reasons, but this limit could be raised if the technology warranted it. Stacks
can also
be used with data produced by the Ion Torrent platform, but that platform produces reads of
multiple lengths so to use this data with Stacks the reads have to be truncated to a particular
length, discarding those reads below the chosen length. The process_radtags program can
truncate the reads from an Ion Torrent run. Other sequencing technologies could be used in
theory, but often the cost versus the number of reads obtained is prohibitive for building stacks
and calling SNPs.
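The truncation rule described above for variable-length reads (truncate to a chosen length and discard shorter reads) can be sketched in Python as follows. This only illustrates the rule and is not the process_radtags implementation; the function name and the example reads are hypothetical.

```python
def truncate_reads(records, length):
    """Truncate (sequence, quality) pairs to `length`, dropping shorter reads."""
    truncated = []
    for seq, qual in records:
        if len(seq) < length:
            continue                    # too short to reach the uniform length
        truncated.append((seq[:length], qual[:length]))
    return truncated


# Hypothetical reads of length 6, 3 and 4, truncated to a uniform length of 4.
uniform = truncate_reads(
    [("ACGTAC", "IIIIII"), ("ACG", "III"), ("TTGA", "IIII")], length=4
)
```

After this step every retained read has the same length, which is what Stacks needs in order to stack reads at a locus.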
3.6.3. Paired-End Reads
Stacks does not directly support paired-end reads where the paired-end read is not anchored by
a second restriction enzyme. In the case of double-digest RAD, both the single-end and paired-
end reads are anchored by a restriction enzyme and can be assembled as independent loci. In
protocols such as RAD, where the molecules are sheared, the paired-end reads do not stack up
and therefore cannot be used directly. However, they can be used indirectly, for example by
building contigs out of the paired-end reads that can be used to build phylogenetic trees or
to identify orthologous genes, and Stacks includes some tools to help do that.
3.6.4. Run the Pipeline
The simplest way to run the pipeline is to use one of the two wrapper programs provided: if
you do not have a reference genome you will use denovo_map.pl, and if you do have a
reference genome use ref_map.pl. In each case you will specify a list of samples that you
demultiplexed in the first step to the program, along with several command line options that
control the internal algorithms. Since a reference genome was available in this study,
ref_map.pl was used. ref_map.pl expects data that has been aligned to a reference genome, and
accepts either SAM or BAM files.
3.7. ref_map pipeline
The ref_map.pl program will execute the Stacks pipeline by running each of the Stacks
components individually. It is the simplest way to run Stacks and it handles many of the
details, such as sample numbering and loading data to the MySQL database, if desired.
The ref_map.pl program expects data to have been aligned to a reference genome, and can
accept data from any aligner that can produce SAM or BAM formatted files. The program
performs several stages, including:
• Running pstacks on each of the samples specified, assembling loci according to the
alignment positions provided for each read, and calling SNPs in each sample.
• Executing cstacks to create a catalog of all loci across the population (or from just the
parents if processing a genetic map). Loci from different samples are matched up across
the data set according to alignment position.
• Next, sstacks will be executed to match each sample against the catalog. In the case of a
genetic map, the parents and progeny are matched against the catalog.
• In the case of a population analysis, the populations program will be run to generate
population-level summary statistics. If you specified a population map (-O option) it
will be supplied to populations. If you are analyzing a genetic map,
the genotypes program will be executed to generate a set of markers and a set of initial
genotypes for export to a linkage mapping program.
• Computation is now complete. If database interaction is enabled, ref_map.pl will
upload the results of each stage of the analysis: individual loci, the catalog, matches
against the catalog, and genotypes or summary statistics into the database.
• Lastly, if database interaction is enabled, index_radtags.pl will be run to build a
database index to speed up access to the database and enable web-based filtering.
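The catalog and matching stages in the list above can be pictured with a toy Python sketch. It assumes that each locus is identified only by its alignment position (chromosome, base pair, strand) and that loci sharing a position receive the same catalog ID. This is an illustration of the idea, not the cstacks or sstacks code, and the sample names and coordinates below are hypothetical.

```python
from collections import defaultdict


def build_catalog(sample_loci):
    """Assign one catalog ID per distinct alignment position and match samples to it.

    `sample_loci` maps a sample name to a list of (chrom, bp, strand) tuples.
    """
    positions = sorted({pos for loci in sample_loci.values() for pos in loci})
    catalog = {pos: cat_id for cat_id, pos in enumerate(positions, start=1)}
    matches = defaultdict(list)
    for sample, loci in sample_loci.items():
        for pos in loci:
            matches[catalog[pos]].append(sample)  # sample carries this catalog locus
    return catalog, dict(matches)


# Hypothetical loci for two samples; one position is shared between them.
sample_loci = {
    "samfile_1": [("GK000005.2", 102471545, "+"), ("GK000002.2", 36770000, "-")],
    "samfile_8": [("GK000005.2", 102471545, "+")],
}
catalog, matches = build_catalog(sample_loci)
for cat_id, samples in sorted(matches.items()):
    print(cat_id, samples)
```

Matching on position rather than on sequence similarity is what makes the reference-based route simpler than the de novo route: two loci are the same catalog entry exactly when they align to the same place.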
After creating the SAM files, I ran the Stacks pipeline with ref_map.pl using the parameters
-b, -B, -s, and -o, where -b is the batch ID, -B loads the data into the MySQL database, -s
specifies an individual sample, and -o sets the output location. ref_map.pl executes the
pipeline, running pstacks instead of ustacks, taking the aligned reads as assembled loci and
calling SNPs in each locus. ref_map.pl then runs the rest of the pipeline in the same way;
however, the -g option is provided to cstacks and sstacks to cause their matching algorithms to
match on genomic location, not sequence similarity.
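As a sketch of how such a run can be assembled, the Python snippet below builds the ref_map.pl argument list from the four options named above (-b, -B, -s, -o). It only constructs and prints the command and does not execute it. The batch ID 1001 is taken from the output file names reported in this chapter; the database name, the SAM file paths, and the output directory are hypothetical, and the snippet assumes that -B takes the name of the database and that -s is repeated once per sample.

```python
def ref_map_command(batch_id, database, sample_files, out_dir):
    """Build the ref_map.pl argument list (the command is not run here)."""
    cmd = ["ref_map.pl", "-b", str(batch_id), "-B", database, "-o", out_dir]
    for path in sample_files:
        cmd += ["-s", path]  # one -s option per aligned sample file
    return cmd


# Hypothetical database name, sample paths, and output directory.
cmd = ref_map_command(
    batch_id=1001,
    database="buffalo_radtags",
    sample_files=["samfile_1.sam", "samfile_24.sam"],
    out_dir="./stacks_out",
)
print(" ".join(cmd))
```

Keeping the command as a list makes it straightforward to pass to subprocess.run later without shell quoting problems.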
Output:
• Building loci: generates 3 files per sample:
– sample_alleles.tsv
– sample_snps.tsv
– sample_tags.tsv
• Cataloguing of observed SNPs:
– batch_1001.catalog.alleles.tsv
– batch_1001.catalog.snps.tsv
– batch_1001.catalog.tags.tsv
• Verifying individual samples against the catalogue:
– batch_1001.catalog.matches.tsv
Since I had the database and web interface installed (MySQL and the Apache web server),
ref_map.pl could upload the output from the pipeline to the database for viewing in a web
browser.
3.8. The Stacks Web Interface
To visualize data, Stacks uses a web-based interface (written in PHP) that interacts with a
MySQL database server. MySQL provides various functions to store, sort, and export data
from a database. The output from the Stacks pipeline is meant to be loaded into
a MySQL database and viewed online, facilitating data mining and data correction. A database
schema is provided along with a set of PHP files to display the results of the Stacks pipeline,
resulting in an interface like this.
CHAPTER IV
RESULTS AND DISCUSSION
RESULTS
Double-digest RAD data from the buffalo genome for three traits were analysed to identify and
annotate genetic variants with Stacks, the first generally available, widely used pipeline for analysis of
ddRADseq data. The goal in Stacks is to assemble loci in large numbers of individuals in a
population or genetic cross, call SNPs within those loci, and then read haplotypes from them.
Therefore Stacks wants data that is a uniform length, with coverage high enough to confidently
call SNPs. Although it is very useful in other bioinformatic analyses to variably trim raw reads,
this creates loci that have variable coverage, particularly at the 3’ end of the locus. In a
population analysis, this results in SNPs that are called in some individuals but not in others,
depending on the amount of trimming that went into the reads assembled into each locus, and
this interferes with SNP and haplotype calling in large populations.
Table 4. Milk yield trait
Id  Type  Unique Stacks  Polymorphic Loci  SNPs Found  Source
1 Sample 1240 6 12 samfile_1
2 Sample 19789 427 488 samfile_24
3 Sample 25654 525 609 samfile_27
4 Sample 15046 322 364 samfile_32
5 Sample 36542 859 1077 samfile_40
6 Sample 23564 502 633 samfile_42
7 Sample 11437 220 272 samfile_60
8 Sample 61262 1975 2311 samfile_63
9 Sample 42941 1269 1466 samfile_68
10 Sample 23460 439 520 samfile_73
11 Sample 73940 2688 3242 samfile_75
12 Sample 46808 1241 1541 samfile_25
13 Sample 16216 261 307 samfile_28
14 Sample 24744 463 562 samfile_29
15 Sample 59261 1822 2140 samfile_35
16 Sample 53680 1879 2205 samfile_36
17 Sample 52856 1536 1855 samfile_37
18 Sample 31352 771 914 samfile_39
19 Sample 38852 936 1070 samfile_59
20 Sample 1159 12 14 samfile_61
21 Sample 3237 42 52 samfile_62
22 Sample 1624 13 14 samfile_64
23 Sample 55711 1857 2351 samfile_67
24 Sample 39765 1077 1286 samfile_71
25 Sample 23716 423 497 samfile_74
(Milk yield trait samples: the unique stacks, polymorphic loci, and SNPs found in each individual
sample. A total of 25802 SNPs were found across the 25 samples.)
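The total quoted for Table 4 can be checked by summing the "SNPs Found" column. The short Python check below copies the 25 per-sample values from the table in row order; the variable names are only illustrative.

```python
# "SNPs Found" column of Table 4 (milk yield trait), in row order 1 to 25.
milk_yield_snps = [
    12, 488, 609, 364, 1077, 633, 272, 2311, 1466, 520,
    3242, 1541, 307, 562, 2140, 2205, 1855, 914, 1070, 14,
    52, 14, 2351, 1286, 497,
]

n_samples = len(milk_yield_snps)
total_snps = sum(milk_yield_snps)
print(n_samples, total_snps)
```

The same check can be repeated for the other two traits by replacing the list with the corresponding column.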
Table 5. Marker
(The table above reports the total genotypes, genotype frequencies, mean log likelihood, and
genotype map.)
SNPs: (The sequence type is primary. A stack depth of 5x is the number of reads contained
in the locus that matched to the catalog. SNPs are found at particular positions of the sequence:
columns 4, 5, 6, and 7 carry the four nucleotides CATG, which together are called a haplotype.
CATG is the major allele and AGGC is the minor allele. Deleveraged flag: if "1", this stack was
processed by the deleveraging algorithm and was broken down from a larger stack. Blacklisted
flag: if "1", this stack was still confounded despite processing by the deleveraging algorithm.
Lumberjackstack flag: if "1", this stack was set aside due to having an extreme depth of coverage.)
Fig 14. SNPs
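How a haplotype is read from the SNP columns can be illustrated with a small Python sketch. It assumes that columns are counted from 0 along the locus sequence and that a haplotype is simply the concatenation of the nucleotides at the SNP columns. The two 12-base sequences are hypothetical and were chosen only so that columns 4 to 7 spell the CATG and AGGC haplotypes named in the caption.

```python
def haplotype(sequence, snp_columns):
    """Concatenate the nucleotides found at the given (0-based) SNP columns."""
    return "".join(sequence[col] for col in snp_columns)


snp_columns = [4, 5, 6, 7]

# Hypothetical locus sequences carrying the major and minor haplotypes.
major_seq = "AATTCATGCCGA"
minor_seq = "AATTAGGCCCGA"

major_hap = haplotype(major_seq, snp_columns)
minor_hap = haplotype(minor_seq, snp_columns)
print(major_hap, minor_hap)
```

Because the four SNPs sit on the same short locus, they are inherited together, which is why they are reported as one haplotype rather than as four independent markers.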
Haplotypes: (Two samples, samfile_1 and samfile_67, match. Genotype frequency a,b: 2 (100.0%).
Two haplotypes are found, and the alleles are shown for each column. chr: GK000005,
102.47 Mb, +, LnL: -24.085.)
Fig 15. Haplotypes
Table 6. Lactation period trait
Id  Type  Unique Stacks  Polymorphic Loci  SNPs Found  Source
1 Sample 1240 6 12 samfile_1
2 Sample 4789 68 76 samfile_2
3 Sample 9200 145 166 samfile_3
4 Sample 13724 236 278 samfile_4
5 Sample 8096 116 142 samfile_5
6 Sample 6139 90 106 samfile_6
7 Sample 19904 451 527 samfile_7
8 Sample 17745 342 418 samfile_8
9 Sample 6516 99 111 samfile_16
10 Sample 12879 248 282 samfile_65
11 Sample 36858 1033 1213 samfile_70
12 Sample 11922 186 219 samfile_9
13 Sample 2242 14 16 samfile_10
14 Sample 5501 64 92 samfile_11
15 Sample 20139 434 511 samfile_12
16 Sample 2613 23 29 samfile_13
17 Sample 8351 97 116 samfile_14
18 Sample 16160 327 376 samfile_15
19 Sample 39765 1077 1286 samfile_71
20 Sample 73940 2688 3242 samfile_75
(Lactation period trait samples: the unique stacks, polymorphic loci, and SNPs found in each
individual sample. A total of 9218 SNPs were found across the 20 samples.)
SNPs: (Four SNPs are identified in the individual sequence.)
Fig 16. SNPs
Haplotypes: (Two haplotypes are identified. chr: GK000005.2, 102.47 Mb, +, LnL: -16.48. Genotype
frequency aa: 1 (50.0%) and ab: 1 (50.0%). Two samples, samfile_1 and samfile_8, match.)
Fig 17. Haplotypes
Table 7. SNP Summary Statistics
Pop  BP  Column  Allele 1  Allele 2  P  N  Obs Het  Obs Hom  Exp Het  Exp Hom  π  FIS
1. defaultpop  102471549  4  A  C  0.50000  1  1.000  0.000  0.500  0.500  1.000  0
2. defaultpop  102471550  5  A  G  0.50000  1  1.000  0.000  0.500  0.500  1.000  0
3. defaultpop  102471551  6  G  T  0.50000  1  1.000  0.000  0.500  0.500  1.000  0
4. defaultpop  102471552  7  C  G  0.50000  1  1.000  0.000  0.500  0.500  1.000  0
(Pop = population. BP = base pair; if aligned to a reference genome, this is the base pair of the
particular SNP. P = mean frequency of the most frequent allele at each locus in this
population. N = number of individuals sampled at this site. Obs Het = mean observed
heterozygosity in this population. Obs Hom = mean observed homozygosity in this
population. Exp Het = mean expected heterozygosity in this population. Exp Hom = mean
expected homozygosity in this population. π = an estimate of nucleotide diversity. FIS = the
inbreeding coefficient of an individual (I) relative to the subpopulation (S).)
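The values in each row of Table 7 can be reproduced from a single heterozygous individual. The Python sketch below is an illustration under stated assumptions, not the code of the populations program: expected heterozygosity is taken as 1 - Σp², π as 1 - Σ C(nᵢ, 2) / C(n, 2) over the allele counts nᵢ, and FIS as 1 - (Obs Het / π). The function name snp_summary is hypothetical, and the function expects at least one diploid genotype.

```python
from math import comb


def snp_summary(genotypes):
    """Per-site statistics from diploid genotypes such as [("A", "C")]."""
    alleles = [allele for genotype in genotypes for allele in genotype]
    n = len(alleles)  # number of sampled allele copies
    counts = {a: alleles.count(a) for a in set(alleles)}
    p = max(counts.values()) / n  # frequency of the most frequent allele
    obs_het = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    exp_het = 1.0 - sum((c / n) ** 2 for c in counts.values())
    pi = 1.0 - sum(comb(c, 2) for c in counts.values()) / comb(n, 2)
    fis = 1.0 - obs_het / pi if pi > 0 else 0.0
    return {
        "P": p, "N": len(genotypes),
        "Obs Het": obs_het, "Obs Hom": 1.0 - obs_het,
        "Exp Het": exp_het, "Exp Hom": 1.0 - exp_het,
        "pi": pi, "Fis": fis,
    }


# Row 1 of Table 7: one individual, heterozygous A/C at column 4.
stats = snp_summary([("A", "C")])
print(stats)
```

With only one individual sampled at a site, a heterozygote forces P = 0.5, Obs Het = 1, and π = 1, so the four rows of Table 7 are identical apart from their alleles and positions.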
Table 8. Haplotype Summary Statistics
Pop BP N Haplotype Cnt Gene Diversity Haplotype Diversity
1. defaultpop 102471545 4 2 0.500 2.000
(N = number of alleles/haplotypes sampled at this locus. Haplotype Cnt = number of distinct
haplotypes observed at this locus. Gene Diversity = the probability that two haplotypes drawn at
random from the population differ. Haplotype Diversity = a measure controlled by a variety of
processes, including mutation, recombination, marker ascertainment, and demography.)
Table 9. Hapstats Summary Statistics
# Batch ID  Locus ID  Chr  BP  Pop ID  N  Haplotype Cnt  Gene Diversity  Smoothed Gene Diversity  Smoothed Gene Diversity P-value  Haplotype Diversity  Smoothed Haplotype Diversity  Smoothed Haplotype Diversity P-value  Haplotypes
4  137595  GJ058424.1  30072  defaultpop  6  3  0.7333  0  0  1.4  0  0  CCT:2;TAC:1;TAT:3
4  52900  GJ058425.1  107783  defaultpop  4  2  0.6667  0  0  2  0  0  ATA:2;CCT:2
4  80469  GJ058435.1  65466  defaultpop  10  2  0.5333  0  0  1.6  0  0  CCT:4;TAG:6
4  27018  GJ058435.1  65543  defaultpop  8  2  0.5357  0  0  0.5357  0  0  A:5;G:3
4  80470  GJ058439.1  4205  defaultpop  6  2  0.5333  0  0  2.1333  0  0  ACCG:2;GTGT:4
4  52904  GJ058444.1  49176  defaultpop  2  2  1  0  0  1  0  0  C:1;T:1
4  52905  GJ058444.1  49319  defaultpop  2  2  1  0  0  1  0  0  A:1;G:1
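The Gene Diversity column of Table 9 is consistent with the usual sample-size-corrected estimator, n / (n - 1) × (1 - Σpᵢ²), where pᵢ is the frequency of haplotype i among the n sampled haplotypes. The Python sketch below applies that formula to the Haplotypes field of the table. That Stacks uses exactly this estimator is an assumption here, and the function names are hypothetical.

```python
def parse_haplotypes(field):
    """Turn 'CCT:2;TAC:1;TAT:3' into {'CCT': 2, 'TAC': 1, 'TAT': 3}."""
    return {hap: int(count) for hap, count in
            (item.split(":") for item in field.split(";"))}


def gene_diversity(haplotype_counts):
    """Sample-size-corrected gene diversity: n / (n - 1) * (1 - sum of p_i squared)."""
    n = sum(haplotype_counts.values())
    if n < 2:
        return 0.0  # diversity is undefined for fewer than two haplotypes
    sum_sq = sum((count / n) ** 2 for count in haplotype_counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Haplotypes fields taken from the rows of Table 9.
for field in ["CCT:2;TAC:1;TAT:3", "ATA:2;CCT:2", "CCT:4;TAG:6", "A:5;G:3"]:
    print(field, round(gene_diversity(parse_haplotypes(field)), 4))
```

For the first row, n = 6 and Σpᵢ² = 14/36, which gives 1.2 × 22/36 ≈ 0.7333, the value listed in the table.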
Table 10. Sumstats. Summary statistics
(Summary statistics such as heterozygosity, π, and FIS are calculated.)
Table 11. Sumstats_Summary. Summary Statistics
(There are two tables in this file containing the same headings. The first table, labeled
"Variant", reports these values at only the variable sites in each population. The second
table, labeled "All positions", reports these values at all positions, both variable and fixed, in
each population.)
Table 12. Marker
Table 13. Age at first calving trait
Id  Type  Unique Stacks  Polymorphic Loci  SNPs Found  Source
1 Sample 24375 594 723 samfile_17
2 Sample 5579 87 99 samfile_19
3 Sample 17167 345 389 samfile_47
4 Sample 40380 1034 1205 samfile_48
5 Sample 19424 374 441 samfile_49
6 Sample 36098 891 1028 samfile_50
7 Sample 3459 36 42 samfile_51
8 Sample 81944 2994 3633 samfile_52
9 Sample 34541 866 1089 samfile_55
10 Sample 36350 975 1125 samfile_57
11 Sample 22824 450 546 samfile_58
12 Sample 22065 468 563 samfile_18
13 Sample 7403 96 120 samfile_20
14 Sample 15038 271 322 samfile_21
15 Sample 10624 158 207 samfile_22
16 Sample 36343 988 1195 samfile_43
17 Sample 28934 674 795 samfile_44
18 Sample 9812 133 159 samfile_45
19 Sample 28248 798 913 samfile_46
20 Sample 31060 616 738 samfile_53
21 Sample 32280 828 954 samfile_54
22 Sample 48987 1427 1628 samfile_56
(Age at first calving trait samples: the unique stacks, polymorphic loci, and SNPs found in each
individual sample. A total of 17914 SNPs were found across the 22 samples.)
Table 13.1. Marker
(The table above reports the total genotypes, genotype frequencies, mean log likelihood, and
genotype map.)
SNPs: (Four SNPs are identified in the individual sequence.)
Fig 18. SNPs
Haplotypes: (Two haplotypes are identified. chr: GK000002.2, 36.77 Mb, -, LnL: -18.745. Genotype
frequency aa: 1 (50.0%) and ab: 1 (50.0%). Two samples, samfile_17 and samfile_54, match.
Alleles are identified from each SNP column.)
Fig 19. Haplotypes
DISCUSSION
In the present study, we generated 47 Gb of sequence data by targeted ddRAD sequencing. The
data were mapped against the cattle genome assembly with an overall mapping rate of ~98%. The
mapping rate was higher than that reported in an earlier study (23), mainly because of the
experimental design, in which we targeted coding regions, which are conserved compared to other
parts of the genome, followed by detection of SNPs. Stacks builds a Catalog that records all loci
identified in a population and matches individuals to that Catalog to determine which haplotype
alleles are present at every locus in each individual, as seen in the figures above. I calculated
summary statistics such as heterozygosity, π, and FIS with the populations program, both at the
variant positions and at all positions (variant and fixed). The total genotypes, genotype
frequencies, mean log likelihood, and genotype map are shown in the results.
CHAPTER V
CONCLUSION
ddRAD sequencing was developed to focus short reads on specific, restriction-enzyme-anchored
positions in the genome. It combines the power of short-read sequencing technology with
reduced-representation genome coverage to call sequence variation in the progeny of a segregating
mapping population. Our investigation shows that Stacks, which is designed as a modular pipeline
to efficiently curate and assemble large numbers of short-read sequences from multiple samples,
is a useful tool for ddRAD and, by extension, for related techniques such as RADseq and GBS.
Stacks identifies loci in a set of individuals, either de novo or aligned to a reference genome
(including gapped alignments), and then genotypes each locus. Stacks incorporates a mean log
likelihood statistical model to identify polymorphic loci (0.3585%) and distinguish them from
sequencing errors. It employs a Catalog to record all loci identified in a population and matches
individuals to that Catalog to determine which haplotype alleles are present at every locus in
each individual. The total SNPs found for the three important buffalo traits, (1) milk yield,
(2) lactation period, and (3) age at first calving, were 25802, 9218, and 17914, respectively.
For each trait, 1 to 36947 tags were obtained; the chromosome location was GJ057537.1, from
start 0 Mb to end 159 Mb; the type was exon; 1 to 100 alleles were found; matching samples
ranged from 1 to 1000; and LnL ranged from -500 to 0. Stacks can also compute population genetic
measures such as FIS and π within populations and FST between populations.
REFERENCE
• Rudolph, M. C. et al., Metabolic regulation in the lactating mammary gland: a lipid synthesizing machine. Physiol. Genomics, 2007, 28, 323–336.
• Ding, X. et al., A novel single nucleotide polymorphism in exon 7 of LPL gene and its association with carcass traits and visceral fat deposition in yak (Bos grunniens) steers. Mol. Biol. Rep., 2012, 39, 669–673.
• Van Horn, C. G., Caviglia, J. M., Li, L. O., Wang, S., Granger, D. and Coleman, R. A., Characterization of recombinant long-chain rat acyl-CoA synthetase isoforms 3 and 6: identification of a novel variant of isoform 6. Biochemistry, 2005, 44, 1635–1642.
• Mashek, D. G. and Coleman, R. A., Cellular fatty acid uptake: the contribution of metabolism. Curr. Opin. Lipidol., 2006, 17, 274–278.
• Mercade, A. et al., Characterization of the porcine acyl-CoA synthetase long-chain 4 gene and its association with growth and meat quality traits. Anim. Genet., 2006, 37, 219–224.
• Schwehm, J. M., Kristyanne, E. S., Biggers, C. C. and Stites, W. E., Stability effects of increasing the hydrophobicity of solvent-exposed side chains in staphylococcal nuclease. Biochemistry, 1998, 37, 6939–6948.
• Manjithaya, R. R. and Dighe, R. R., The 3′ messenger RNA untranslated region of bovine follicle-stimulating hormone downregulates reporter expression: involvement of AU-rich elements and transfactors. Biol. Reprod., 2004, 71, 1158–1166.
• Rao, Y. S., Wang, Z. F., Chai, X. W., Nie, Q. H. and Zhang, X. Q., Relationship between 5′ UTR length and gene expression pattern in chicken. Genetica, 2013, 141, 311–318.
• Cohen-Zinder, M. et al., Identification of a missense mutation in the bovine ABCG2 gene with a major effect on the QTL on chromosome 6 affecting milk yield and composition in Holstein cattle. Genome Res., 2005, 15, 936–944.
• Robenek, H. et al., Butyrophilin controls milk fat globule secretion. Proc. Natl. Acad. Sci. USA, 2006, 103, 10385–10390.
CURRICULUM VITAE
Anjan Kumar Pradhan
Email id:- [email protected]
Mob No:- 9668790024
OBJECTIVE
To enhance my performance for better output in a disciplined, organized and progressive way, with sincerity, creativity and hard work in the tasks entrusted to me.
ACADEMIC QUALIFICATION
QUALIFICATION  SPECIALISATION  UNIVERSITY/INSTITUTION  YEAR OF PASSING  % OF MARK
M.Sc.  Bioinformatics  Orissa University Of Agriculture and Technology  continuing
B.Sc.  Zoology (Hons)  FakirMohan University  2015  53.83
+2 Science  Science  SubarnaRekha Mahavidyalaya  2012  53.5
10th  GhantuaSahajog High School  2010  71.5
BIOLOGICAL TOOLS AND DATABASES KNOWN
• PyMOL, PSI-PRED, RM2TS, RASPD, ParDOCK, GeneMark, Glimmer, EasyGene, GOR-V, Geno3D, Phyre2, AutoDock, VMD, BLAST, MEGA.
• GenBank, EMBL, PDB, NCBI, NDB, UniProt-KB, SWISS-PROT, Prosite, Pfam, DDBJ.
COMPUTER LANGUAGE KNOWN
• PERL, CGI-Perl, PHP, HTML, C++, JAVA
AREA OF INTEREST
• I have interest in genomics, proteomics, molecular modeling, drug design and cancer biology.
PERSONAL DETAILS
Father’s Name:- Mr. Satyajit Pradhan
Date Of Birth:- 21/01/1995
Nationality:- Indian
Language:- English, Hindi, Odia
Hobbies:- Reading newspaper, current affairs, browsing internet and volleyball
PERMANENT ADDRESS
AT-Bishnupur,PO-Devog, DIST-Balasore, STATE-Odisha, PIN-756023
SEMINAR/WORKSHOP ATTENDED
• Attended workshop on “Supercomputing Facility for Bioinformatics & Computational Biology” during 12th Sep., 2016 to 22nd Sep., 2016 organized by Indian Institute Of Technology, Delhi.
• Attended national seminar on “Microbial Technology Prospects & Application” during 25-26 Dec, 2015 organized by Orissa University Of Agriculture and Technology, Bhubaneswar.
DECLARATION
I hereby declare that all the above mentioned information is true to the best of my knowledge and belief.
Date:
BHUBANESWAR Anjan Kumar Pradhan