Upload
sundeepjaglan
View
217
Download
0
Embed Size (px)
Citation preview
8/9/2019 EST(expressed sequence tags)
1/5
ABSTRACT:
Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensiveroute for discovering new genes, for obtaining data on gene expression and regulation, and for
the construction of the Genome maps. Polymorphism is a DNA sequence variation occurring
when a single nucleotide - A, T, C, or G - in the genome differs between members of a species,or between paired chromosomes in an individual. It would be worthwhile inventorying SNP and
INDEL variation, as it could become the raw material for future high throughput genotyping
technologies. Such information is presently being collected in humans and a division ofGenBank, dbSNP, has been specially devoted to the storage of such data. In crop industry, the
use of SNP and INDEL as marker tools will be challenging because the high ploidy level may
prevent the straightforward application of emerging technologies developed in diploid model
organisms. However, it is hoped that with the rapid evolution of technology and the increasingknowledge of molecular variation patterns in crop and other organisms, specific tools will
emerge which will possibly impact on future breeding programs for better economy.
Introduction:
Expressed Sequence Tags (ESTs) are 200-500 bp sequences that are obtained as part of a
3 or 5 single pass read of individual clones. These clones are derived from cDNA libraries that
may be specific to a tissue and/or development state of an organism, which can be used toidentify expressed genes that are considered rare when exploring the overall expression of an
organisms transcriptome. These rare transcripts may have splice variants and/or alternative
polyadenylation. ESTs are popular primarily because they can be generated in high volume,
using a high-throughput data production method at low cost. Despite serving as a rich source ofsequence information as a survey of a transcriptome, ESTs must be used with an understanding
of their limitations. The principle problems are that EST sequences contain errors, represent only
a portion of a gene product, and are found in vast, highly redundant datasets.
ESTs contain sequencing errors (sequence compression and frame-shift errors) at high
rates (3%) due to the single pass read process they were generated from. These errors do not
follow a normal distribution along the length of the sequence, but rather are biased toward thestart and end of the sequence, leaving EST base pair positions 100 to 300 to be the most accurate
part of the EST.
Each EST in itself represents only a portion of a gene product. Being 200-500 bp meansthat there are potentially several thousand base pairs of the underlying transcript that are
unrepresented in the EST sequence. The attenuation of the sequencing reaction used in
generating the EST leads to a length that is shorter than the cDNA clone it was derived from.
EST single pass reads can be generated from the 5 or 3 end. ESTs can also be generated using
random primers, which may result in ESTs with ambiguous orientation, from different parts ofthe same, non-overlapping RNA.
Databases that contain EST data also have two primary limitations.
1). ESTs, as a whole, are poorly annotated, both in terms of source and sequence
quality, which makes it difficult to determine what gene product a given EST represents.
8/9/2019 EST(expressed sequence tags)
2/5
2). EST databases contain a huge number of sequences, at a high rate of redundancy,
which makes it difficult for a researcher to negotiate and derive concise value from it.
Polymorphisms:
Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensive
route for discovering new genes, for obtaining data on gene expression and regulation, and forthe construction of the Genome maps. A single nucleotide polymorphism, or SNP, pronounced
snip, is a DNA sequence variation occurring when a single nucleotide - A, T, C, or G - in the
genome differs between members of a species, or between paired chromosomes in an individual.Single Nucleotide Polymorphisms may fall within coding sequences of genes, non-coding
regions of genes, or in the inter-genic regions. SNPs within a coding sequence will not
necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of
the genetic code. If DNA sequence in which any change in the base pairs do not result in thechange of polypeptide sequence, then it is termed, synonymous, sometimes called a silent
mutation - if a different polypeptide sequence is produced, they are non-synonymous. SNPs that
are not in protein-coding regions may still have consequences for gene splicing, transcription
factor binding, or the sequence of non-coding RNA. The study of single nucleotidepolymorphisms is also important in crop and livestock breeding programs. Expressed sequence
tags (ESTs) are an important resource for identifying polymorphisms in transcribed regions.
Single nucleotide polymorphisms (SNPs) have been shown to be the most abundant
source of DNA polymorphism in human, animal and plant genomes. SNPs are the most common
type of alleles found within and between varieties of a crop species. Single Nucleotide
polymorphisms (SNPs) possess desirable properties as molecular markers. Biallelism makesthem easy to score in high throughput genotyping assays. Molecular genetic markers developed
from ESTs can be used to examine a group of individuals or populations to estimate various
diversity measures and genetic distances, infer genetic structure and clustering patterns, test for
Hardy-Weinberg equilibrium and multi-locus equilibrium, and to test polymorphic loci forevidence of selective neutrality. They are useful to plant breeders, germplasm managers, and
population geneticists. The use of EST sequence data for the identification of SNPs has many
advantages that can be exploited to facilitate the development of highly dense genetic maps andmarkers assisted breeding programs. SNPs can be used to saturate genetic maps in plants.
Expressed sequence tag (EST) sequencing programs have provided a wealth of
information, identifying novel genes from a broad range of organisms and providing an
indication of gene expression level in particular tissues. EST sequence data may provide the
richest source of biologically useful SNPs due to the relatively high redundancy of gene
sequence, the diversity of genotypes represented within databases, and the fact that each SNPwould be associated with an expressed gene.
Working With EST For Polymorphism Search:
SNP detection perl script AutoSNP version.1.0 is mostly used to find the SNP site
information and transition vs transversion analysis. EST-SNP can be detected by using other
programs or servers such as SEAN, PolyPhred], PolyBayes, TRACE_DIFF, HaploSNPer and
8/9/2019 EST(expressed sequence tags)
3/5
HarvEST, but AutoSNP provides user friendly approach and interpretable result as html file.
SNPs can be classified based on their nucleotide substitution as either transition (GA or CT)or transversion (CG, AT, CA or TG). Indel sites can classified to four groups based onthe nucleotide involved (A/T/C/G). Thus there are ten kinds of SNP/indel, two types of
transition, four types of transversion and four groups of indels, are possible in genomes.
There are several strategies, both experimental and computational for SNP discovery.
Experimental SNP discovery often consists of a number of laborious steps that make this process
complex and expensive. The computational approach makes use of the large sequence datasetspresent in public databases. Over the last few years, a number of pipelines have been developed
that automatically detect SNPs in such databases. One type of pipeline detects SNPs using trace
files or quality files, for example the PHRED/PHRAP/PolyBayes system. The other type of
pipeline uses only EST redundancy in text-based sequence files to detect SNPs; these includeautoSNP and SNiPpER. Both autoSNP and SNiPpER are based on sequence redundancy for the
initial detection of SNPs, and sequencing errors are detected and filtered out by analyzing SNP
patterns.
Genetic maps are usually constructed with DNA markers such as restriction fragmentlength polymorphism (RFLP), amplified fragment lengthpolymorphisms (AFLP) or simple
sequences repeats (SSR) [1, 2]. With the advancement of sequencing technology, geneticvariation at the DNA sequence level can be easily analyzed. In maize, the frequency of
occurrence of single nucleotide polymorphisms (SNP) is quite high, on average 1 per 31 base
pairs (bp) in the 3 untranslated regions (UTR), and 1 every 124 bp in coding regions . Insertion-
deletion polymorphisms (indels) are also very frequent. The SNPs within an amplicon arephysically linked and therefore form a distinct haplotype. The abundance of SNPs makes them
highly useful for placing ESTs or candidate genes onto a genetic map, which has been previously
constructed with other markers. These polymorphisms can either be drawn from the public data,
or discovered by re-sequencing of the locus of interest in the mapping parents. For example, one
would expect to find polymorphisms in the 3-UTR region between the B73 and Mo17 in morethan half of the cases. This was demonstrated by a re-sequencing effort of 502 maize loci across
8 maize inbred lines, including the parents for the IBM population. The loci that aremonomorphic between B73 and Mo17 can be mapped using other populations. Once
polymorphisms are known, a number of SNP analysis methods are available for the scoring of
the polymorphisms associated with each locus in a mapping population.
A simple Polymorphism search
flow diagram.
Single pass sequencing of the 5'and/or 3' ends of randomly selected cDNA
clones, is an effective approach to provide
genetic information of an organism. These
sequences can serve as markers or tags fortranscripts, and have been used in the
development of SNP markers for reference
genetic map and recovery of full-lengthcDNA and genomic sequences. Expressed
8/9/2019 EST(expressed sequence tags)
4/5
sequence tags (ESTs) are also useful for the discovery of novel genes, investigation of genes of
unknown function, comparative genomic study, and recognition of exon/intronboundaries.Currently there are billions of sequences are being uploaded day-by-day, and
majority of these sequences are ESTs which had been deposited at NCBI (dbEST)
http://www.ncbi.nlm.nih.gov/ dbEST/. The lack of sequence information has limited the progress
of gene discovery and characterization, global transcript profiling, probe design for developmentof gene arrays, and generation of molecular markers for.
Single nucleotide polymorphisms (SNPs) are a second class of genetic markers that canbe mined from sequence data and are useful for characterizing allelic variation, genome-wide
mapping, and as a tool for marker-assisted selection. In the field of human genetics, SNPs are a
major focus of efforts to increase the efficiency of mapping and are already being used for
detection and mapping of a variety of diseases. In many crop plants, SNPs are present withsufficient frequency to offer an alternative for genetic mapping and markerassisted selection.
Although SNPs can be identified by sequencing selected DNA fragments, a practical limitation
to this approach follows from the fact that the sequencing error rate is often higher than the
polymorphism rate. The cost of SNP discovery through sequencing amplified fragments istherefore high even with reductions in the cost of sequencing. SNP detecting perl scripts
AutoSNP version 1.0 is used indentify the SNP / Indel polymorphisms, DNA substitution like
Transversion vs Transition and Indel.
In EST database, dbEST extract your required sequences. Normally, CAP3 program is
used to assemble the EST sequence in to contigs. The SNP detection tool AutoSNP version.1.0 is
used to find the candidate SNPs from these libraries. AutoSNP required input as ace or fastaformat. But the perl script edited manually to analyse fasta or ace format. Sequence assembly
program CAP3 is integrated in AUTOSNP to make fasta files in to contigs. The DNA
substitution as transition (Ts) versus transversion (Tv) ratio of all the libraries in genome of
interest is also calculated.
Your software works and gives you the result based on its own integrated algorithms. The
calculated result can be easily understood. The software gives you result as SNPs per 100 or per
kb, transversion(Tv) and transition(Ts) percentage and their ratio. It also provides Indels ratio inthe genome of the given species. The Ts vs Tv ratio is important to understand the pattern of
DNA evolution. It has been repeatedly noted that at low level of genetic divergence, Ts/Tv
appears to be high and at high levels of genetic divergence, Ts/Tv appears to be low. ESTanalysis shows an insight of intrer-specific molecular genetic variations within a species. The
transitions to transversions ratio showed the molecular evolution happening in and is important
for defining phylogenetic trees.
EXAMPLE RESULTS of Ginger Officinate:
8/9/2019 EST(expressed sequence tags)
5/5