EST(expressed sequence tags)

8/9/2019 EST(expressed sequence tags)

1/5

ABSTRACT:

Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensiveroute for discovering new genes, for obtaining data on gene expression and regulation, and for

the construction of the Genome maps. Polymorphism is a DNA sequence variation occurring

when a single nucleotide - A, T, C, or G - in the genome differs between members of a species,or between paired chromosomes in an individual. It would be worthwhile inventorying SNP and

INDEL variation, as it could become the raw material for future high throughput genotyping

technologies. Such information is presently being collected in humans and a division ofGenBank, dbSNP, has been specially devoted to the storage of such data. In crop industry, the

use of SNP and INDEL as marker tools will be challenging because the high ploidy level may

prevent the straightforward application of emerging technologies developed in diploid model

organisms. However, it is hoped that with the rapid evolution of technology and the increasingknowledge of molecular variation patterns in crop and other organisms, specific tools will

emerge which will possibly impact on future breeding programs for better economy.

Introduction:

Expressed Sequence Tags (ESTs) are 200-500 bp sequences that are obtained as part of a

3 or 5 single pass read of individual clones. These clones are derived from cDNA libraries that

may be specific to a tissue and/or development state of an organism, which can be used toidentify expressed genes that are considered rare when exploring the overall expression of an

organisms transcriptome. These rare transcripts may have splice variants and/or alternative

polyadenylation. ESTs are popular primarily because they can be generated in high volume,

using a high-throughput data production method at low cost. Despite serving as a rich source ofsequence information as a survey of a transcriptome, ESTs must be used with an understanding

of their limitations. The principle problems are that EST sequences contain errors, represent only

a portion of a gene product, and are found in vast, highly redundant datasets.

ESTs contain sequencing errors (sequence compression and frame-shift errors) at high

rates (3%) due to the single pass read process they were generated from. These errors do not

follow a normal distribution along the length of the sequence, but rather are biased toward thestart and end of the sequence, leaving EST base pair positions 100 to 300 to be the most accurate

part of the EST.

Each EST in itself represents only a portion of a gene product. Being 200-500 bp meansthat there are potentially several thousand base pairs of the underlying transcript that are

unrepresented in the EST sequence. The attenuation of the sequencing reaction used in

generating the EST leads to a length that is shorter than the cDNA clone it was derived from.

EST single pass reads can be generated from the 5 or 3 end. ESTs can also be generated using

random primers, which may result in ESTs with ambiguous orientation, from different parts ofthe same, non-overlapping RNA.

Databases that contain EST data also have two primary limitations.

1). ESTs, as a whole, are poorly annotated, both in terms of source and sequence

quality, which makes it difficult to determine what gene product a given EST represents.


2/5

2). EST databases contain a huge number of sequences, at a high rate of redundancy,

which makes it difficult for a researcher to negotiate and derive concise value from it.

Polymorphisms:

Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensive

route for discovering new genes, for obtaining data on gene expression and regulation, and forthe construction of the Genome maps. A single nucleotide polymorphism, or SNP, pronounced

snip, is a DNA sequence variation occurring when a single nucleotide - A, T, C, or G - in the

genome differs between members of a species, or between paired chromosomes in an individual.Single Nucleotide Polymorphisms may fall within coding sequences of genes, non-coding

regions of genes, or in the inter-genic regions. SNPs within a coding sequence will not

necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of

the genetic code. If DNA sequence in which any change in the base pairs do not result in thechange of polypeptide sequence, then it is termed, synonymous, sometimes called a silent

mutation - if a different polypeptide sequence is produced, they are non-synonymous. SNPs that

are not in protein-coding regions may still have consequences for gene splicing, transcription

factor binding, or the sequence of non-coding RNA. The study of single nucleotidepolymorphisms is also important in crop and livestock breeding programs. Expressed sequence

tags (ESTs) are an important resource for identifying polymorphisms in transcribed regions.

Single nucleotide polymorphisms (SNPs) have been shown to be the most abundant

source of DNA polymorphism in human, animal and plant genomes. SNPs are the most common

type of alleles found within and between varieties of a crop species. Single Nucleotide

polymorphisms (SNPs) possess desirable properties as molecular markers. Biallelism makesthem easy to score in high throughput genotyping assays. Molecular genetic markers developed

from ESTs can be used to examine a group of individuals or populations to estimate various

diversity measures and genetic distances, infer genetic structure and clustering patterns, test for

Hardy-Weinberg equilibrium and multi-locus equilibrium, and to test polymorphic loci forevidence of selective neutrality. They are useful to plant breeders, germplasm managers, and

population geneticists. The use of EST sequence data for the identification of SNPs has many

advantages that can be exploited to facilitate the development of highly dense genetic maps andmarkers assisted breeding programs. SNPs can be used to saturate genetic maps in plants.

Expressed sequence tag (EST) sequencing programs have provided a wealth of

information, identifying novel genes from a broad range of organisms and providing an

indication of gene expression level in particular tissues. EST sequence data may provide the

richest source of biologically useful SNPs due to the relatively high redundancy of gene

sequence, the diversity of genotypes represented within databases, and the fact that each SNPwould be associated with an expressed gene.

Working With EST For Polymorphism Search:

SNP detection perl script AutoSNP version.1.0 is mostly used to find the SNP site

information and transition vs transversion analysis. EST-SNP can be detected by using other

programs or servers such as SEAN, PolyPhred], PolyBayes, TRACE_DIFF, HaploSNPer and


3/5

HarvEST, but AutoSNP provides user friendly approach and interpretable result as html file.

SNPs can be classified based on their nucleotide substitution as either transition (GA or CT)or transversion (CG, AT, CA or TG). Indel sites can classified to four groups based onthe nucleotide involved (A/T/C/G). Thus there are ten kinds of SNP/indel, two types of

transition, four types of transversion and four groups of indels, are possible in genomes.

There are several strategies, both experimental and computational for SNP discovery.

Experimental SNP discovery often consists of a number of laborious steps that make this process

complex and expensive. The computational approach makes use of the large sequence datasetspresent in public databases. Over the last few years, a number of pipelines have been developed

that automatically detect SNPs in such databases. One type of pipeline detects SNPs using trace

files or quality files, for example the PHRED/PHRAP/PolyBayes system. The other type of

pipeline uses only EST redundancy in text-based sequence files to detect SNPs; these includeautoSNP and SNiPpER. Both autoSNP and SNiPpER are based on sequence redundancy for the

initial detection of SNPs, and sequencing errors are detected and filtered out by analyzing SNP

patterns.

Genetic maps are usually constructed with DNA markers such as restriction fragmentlength polymorphism (RFLP), amplified fragment lengthpolymorphisms (AFLP) or simple

sequences repeats (SSR) [1, 2]. With the advancement of sequencing technology, geneticvariation at the DNA sequence level can be easily analyzed. In maize, the frequency of

occurrence of single nucleotide polymorphisms (SNP) is quite high, on average 1 per 31 base

pairs (bp) in the 3 untranslated regions (UTR), and 1 every 124 bp in coding regions . Insertion-

deletion polymorphisms (indels) are also very frequent. The SNPs within an amplicon arephysically linked and therefore form a distinct haplotype. The abundance of SNPs makes them

highly useful for placing ESTs or candidate genes onto a genetic map, which has been previously

constructed with other markers. These polymorphisms can either be drawn from the public data,

or discovered by re-sequencing of the locus of interest in the mapping parents. For example, one

would expect to find polymorphisms in the 3-UTR region between the B73 and Mo17 in morethan half of the cases. This was demonstrated by a re-sequencing effort of 502 maize loci across

8 maize inbred lines, including the parents for the IBM population. The loci that aremonomorphic between B73 and Mo17 can be mapped using other populations. Once

polymorphisms are known, a number of SNP analysis methods are available for the scoring of

the polymorphisms associated with each locus in a mapping population.

A simple Polymorphism search

flow diagram.

Single pass sequencing of the 5'and/or 3' ends of randomly selected cDNA

clones, is an effective approach to provide

genetic information of an organism. These

sequences can serve as markers or tags fortranscripts, and have been used in the

development of SNP markers for reference

genetic map and recovery of full-lengthcDNA and genomic sequences. Expressed


4/5

sequence tags (ESTs) are also useful for the discovery of novel genes, investigation of genes of

unknown function, comparative genomic study, and recognition of exon/intronboundaries.Currently there are billions of sequences are being uploaded day-by-day, and

majority of these sequences are ESTs which had been deposited at NCBI (dbEST)

http://www.ncbi.nlm.nih.gov/ dbEST/. The lack of sequence information has limited the progress

of gene discovery and characterization, global transcript profiling, probe design for developmentof gene arrays, and generation of molecular markers for.

Single nucleotide polymorphisms (SNPs) are a second class of genetic markers that canbe mined from sequence data and are useful for characterizing allelic variation, genome-wide

mapping, and as a tool for marker-assisted selection. In the field of human genetics, SNPs are a

major focus of efforts to increase the efficiency of mapping and are already being used for

detection and mapping of a variety of diseases. In many crop plants, SNPs are present withsufficient frequency to offer an alternative for genetic mapping and markerassisted selection.

Although SNPs can be identified by sequencing selected DNA fragments, a practical limitation

to this approach follows from the fact that the sequencing error rate is often higher than the

polymorphism rate. The cost of SNP discovery through sequencing amplified fragments istherefore high even with reductions in the cost of sequencing. SNP detecting perl scripts

AutoSNP version 1.0 is used indentify the SNP / Indel polymorphisms, DNA substitution like

Transversion vs Transition and Indel.

In EST database, dbEST extract your required sequences. Normally, CAP3 program is

used to assemble the EST sequence in to contigs. The SNP detection tool AutoSNP version.1.0 is

used to find the candidate SNPs from these libraries. AutoSNP required input as ace or fastaformat. But the perl script edited manually to analyse fasta or ace format. Sequence assembly

program CAP3 is integrated in AUTOSNP to make fasta files in to contigs. The DNA

substitution as transition (Ts) versus transversion (Tv) ratio of all the libraries in genome of

interest is also calculated.

Your software works and gives you the result based on its own integrated algorithms. The

calculated result can be easily understood. The software gives you result as SNPs per 100 or per

kb, transversion(Tv) and transition(Ts) percentage and their ratio. It also provides Indels ratio inthe genome of the given species. The Ts vs Tv ratio is important to understand the pattern of

DNA evolution. It has been repeatedly noted that at low level of genetic divergence, Ts/Tv

appears to be high and at high levels of genetic divergence, Ts/Tv appears to be low. ESTanalysis shows an insight of intrer-specific molecular genetic variations within a species. The

transitions to transversions ratio showed the molecular evolution happening in and is important

for defining phylogenetic trees.

EXAMPLE RESULTS of Ginger Officinate:


5/5

Documents

EST(expressed sequence tags)