EST(expressed sequence tags)

Embed Size (px)

Citation preview

  • 8/9/2019 EST(expressed sequence tags)

    1/5

    ABSTRACT:

    Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensiveroute for discovering new genes, for obtaining data on gene expression and regulation, and for

    the construction of the Genome maps. Polymorphism is a DNA sequence variation occurring

    when a single nucleotide - A, T, C, or G - in the genome differs between members of a species,or between paired chromosomes in an individual. It would be worthwhile inventorying SNP and

    INDEL variation, as it could become the raw material for future high throughput genotyping

    technologies. Such information is presently being collected in humans and a division ofGenBank, dbSNP, has been specially devoted to the storage of such data. In crop industry, the

    use of SNP and INDEL as marker tools will be challenging because the high ploidy level may

    prevent the straightforward application of emerging technologies developed in diploid model

    organisms. However, it is hoped that with the rapid evolution of technology and the increasingknowledge of molecular variation patterns in crop and other organisms, specific tools will

    emerge which will possibly impact on future breeding programs for better economy.

    Introduction:

    Expressed Sequence Tags (ESTs) are 200-500 bp sequences that are obtained as part of a

    3 or 5 single pass read of individual clones. These clones are derived from cDNA libraries that

    may be specific to a tissue and/or development state of an organism, which can be used toidentify expressed genes that are considered rare when exploring the overall expression of an

    organisms transcriptome. These rare transcripts may have splice variants and/or alternative

    polyadenylation. ESTs are popular primarily because they can be generated in high volume,

    using a high-throughput data production method at low cost. Despite serving as a rich source ofsequence information as a survey of a transcriptome, ESTs must be used with an understanding

    of their limitations. The principle problems are that EST sequences contain errors, represent only

    a portion of a gene product, and are found in vast, highly redundant datasets.

    ESTs contain sequencing errors (sequence compression and frame-shift errors) at high

    rates (3%) due to the single pass read process they were generated from. These errors do not

    follow a normal distribution along the length of the sequence, but rather are biased toward thestart and end of the sequence, leaving EST base pair positions 100 to 300 to be the most accurate

    part of the EST.

    Each EST in itself represents only a portion of a gene product. Being 200-500 bp meansthat there are potentially several thousand base pairs of the underlying transcript that are

    unrepresented in the EST sequence. The attenuation of the sequencing reaction used in

    generating the EST leads to a length that is shorter than the cDNA clone it was derived from.

    EST single pass reads can be generated from the 5 or 3 end. ESTs can also be generated using

    random primers, which may result in ESTs with ambiguous orientation, from different parts ofthe same, non-overlapping RNA.

    Databases that contain EST data also have two primary limitations.

    1). ESTs, as a whole, are poorly annotated, both in terms of source and sequence

    quality, which makes it difficult to determine what gene product a given EST represents.

  • 8/9/2019 EST(expressed sequence tags)

    2/5

    2). EST databases contain a huge number of sequences, at a high rate of redundancy,

    which makes it difficult for a researcher to negotiate and derive concise value from it.

    Polymorphisms:

    Expressed Sequence Tags or ESTs provide researchers with a quick and inexpensive

    route for discovering new genes, for obtaining data on gene expression and regulation, and forthe construction of the Genome maps. A single nucleotide polymorphism, or SNP, pronounced

    snip, is a DNA sequence variation occurring when a single nucleotide - A, T, C, or G - in the

    genome differs between members of a species, or between paired chromosomes in an individual.Single Nucleotide Polymorphisms may fall within coding sequences of genes, non-coding

    regions of genes, or in the inter-genic regions. SNPs within a coding sequence will not

    necessarily change the amino acid sequence of the protein that is produced, due to degeneracy of

    the genetic code. If DNA sequence in which any change in the base pairs do not result in thechange of polypeptide sequence, then it is termed, synonymous, sometimes called a silent

    mutation - if a different polypeptide sequence is produced, they are non-synonymous. SNPs that

    are not in protein-coding regions may still have consequences for gene splicing, transcription

    factor binding, or the sequence of non-coding RNA. The study of single nucleotidepolymorphisms is also important in crop and livestock breeding programs. Expressed sequence

    tags (ESTs) are an important resource for identifying polymorphisms in transcribed regions.

    Single nucleotide polymorphisms (SNPs) have been shown to be the most abundant

    source of DNA polymorphism in human, animal and plant genomes. SNPs are the most common

    type of alleles found within and between varieties of a crop species. Single Nucleotide

    polymorphisms (SNPs) possess desirable properties as molecular markers. Biallelism makesthem easy to score in high throughput genotyping assays. Molecular genetic markers developed

    from ESTs can be used to examine a group of individuals or populations to estimate various

    diversity measures and genetic distances, infer genetic structure and clustering patterns, test for

    Hardy-Weinberg equilibrium and multi-locus equilibrium, and to test polymorphic loci forevidence of selective neutrality. They are useful to plant breeders, germplasm managers, and

    population geneticists. The use of EST sequence data for the identification of SNPs has many

    advantages that can be exploited to facilitate the development of highly dense genetic maps andmarkers assisted breeding programs. SNPs can be used to saturate genetic maps in plants.

    Expressed sequence tag (EST) sequencing programs have provided a wealth of

    information, identifying novel genes from a broad range of organisms and providing an

    indication of gene expression level in particular tissues. EST sequence data may provide the

    richest source of biologically useful SNPs due to the relatively high redundancy of gene

    sequence, the diversity of genotypes represented within databases, and the fact that each SNPwould be associated with an expressed gene.

    Working With EST For Polymorphism Search:

    SNP detection perl script AutoSNP version.1.0 is mostly used to find the SNP site

    information and transition vs transversion analysis. EST-SNP can be detected by using other

    programs or servers such as SEAN, PolyPhred], PolyBayes, TRACE_DIFF, HaploSNPer and

  • 8/9/2019 EST(expressed sequence tags)

    3/5

    HarvEST, but AutoSNP provides user friendly approach and interpretable result as html file.

    SNPs can be classified based on their nucleotide substitution as either transition (GA or CT)or transversion (CG, AT, CA or TG). Indel sites can classified to four groups based onthe nucleotide involved (A/T/C/G). Thus there are ten kinds of SNP/indel, two types of

    transition, four types of transversion and four groups of indels, are possible in genomes.

    There are several strategies, both experimental and computational for SNP discovery.

    Experimental SNP discovery often consists of a number of laborious steps that make this process

    complex and expensive. The computational approach makes use of the large sequence datasetspresent in public databases. Over the last few years, a number of pipelines have been developed

    that automatically detect SNPs in such databases. One type of pipeline detects SNPs using trace

    files or quality files, for example the PHRED/PHRAP/PolyBayes system. The other type of

    pipeline uses only EST redundancy in text-based sequence files to detect SNPs; these includeautoSNP and SNiPpER. Both autoSNP and SNiPpER are based on sequence redundancy for the

    initial detection of SNPs, and sequencing errors are detected and filtered out by analyzing SNP

    patterns.

    Genetic maps are usually constructed with DNA markers such as restriction fragmentlength polymorphism (RFLP), amplified fragment lengthpolymorphisms (AFLP) or simple

    sequences repeats (SSR) [1, 2]. With the advancement of sequencing technology, geneticvariation at the DNA sequence level can be easily analyzed. In maize, the frequency of

    occurrence of single nucleotide polymorphisms (SNP) is quite high, on average 1 per 31 base

    pairs (bp) in the 3 untranslated regions (UTR), and 1 every 124 bp in coding regions . Insertion-

    deletion polymorphisms (indels) are also very frequent. The SNPs within an amplicon arephysically linked and therefore form a distinct haplotype. The abundance of SNPs makes them

    highly useful for placing ESTs or candidate genes onto a genetic map, which has been previously

    constructed with other markers. These polymorphisms can either be drawn from the public data,

    or discovered by re-sequencing of the locus of interest in the mapping parents. For example, one

    would expect to find polymorphisms in the 3-UTR region between the B73 and Mo17 in morethan half of the cases. This was demonstrated by a re-sequencing effort of 502 maize loci across

    8 maize inbred lines, including the parents for the IBM population. The loci that aremonomorphic between B73 and Mo17 can be mapped using other populations. Once

    polymorphisms are known, a number of SNP analysis methods are available for the scoring of

    the polymorphisms associated with each locus in a mapping population.

    A simple Polymorphism search

    flow diagram.

    Single pass sequencing of the 5'and/or 3' ends of randomly selected cDNA

    clones, is an effective approach to provide

    genetic information of an organism. These

    sequences can serve as markers or tags fortranscripts, and have been used in the

    development of SNP markers for reference

    genetic map and recovery of full-lengthcDNA and genomic sequences. Expressed

  • 8/9/2019 EST(expressed sequence tags)

    4/5

    sequence tags (ESTs) are also useful for the discovery of novel genes, investigation of genes of

    unknown function, comparative genomic study, and recognition of exon/intronboundaries.Currently there are billions of sequences are being uploaded day-by-day, and

    majority of these sequences are ESTs which had been deposited at NCBI (dbEST)

    http://www.ncbi.nlm.nih.gov/ dbEST/. The lack of sequence information has limited the progress

    of gene discovery and characterization, global transcript profiling, probe design for developmentof gene arrays, and generation of molecular markers for.

    Single nucleotide polymorphisms (SNPs) are a second class of genetic markers that canbe mined from sequence data and are useful for characterizing allelic variation, genome-wide

    mapping, and as a tool for marker-assisted selection. In the field of human genetics, SNPs are a

    major focus of efforts to increase the efficiency of mapping and are already being used for

    detection and mapping of a variety of diseases. In many crop plants, SNPs are present withsufficient frequency to offer an alternative for genetic mapping and markerassisted selection.

    Although SNPs can be identified by sequencing selected DNA fragments, a practical limitation

    to this approach follows from the fact that the sequencing error rate is often higher than the

    polymorphism rate. The cost of SNP discovery through sequencing amplified fragments istherefore high even with reductions in the cost of sequencing. SNP detecting perl scripts

    AutoSNP version 1.0 is used indentify the SNP / Indel polymorphisms, DNA substitution like

    Transversion vs Transition and Indel.

    In EST database, dbEST extract your required sequences. Normally, CAP3 program is

    used to assemble the EST sequence in to contigs. The SNP detection tool AutoSNP version.1.0 is

    used to find the candidate SNPs from these libraries. AutoSNP required input as ace or fastaformat. But the perl script edited manually to analyse fasta or ace format. Sequence assembly

    program CAP3 is integrated in AUTOSNP to make fasta files in to contigs. The DNA

    substitution as transition (Ts) versus transversion (Tv) ratio of all the libraries in genome of

    interest is also calculated.

    Your software works and gives you the result based on its own integrated algorithms. The

    calculated result can be easily understood. The software gives you result as SNPs per 100 or per

    kb, transversion(Tv) and transition(Ts) percentage and their ratio. It also provides Indels ratio inthe genome of the given species. The Ts vs Tv ratio is important to understand the pattern of

    DNA evolution. It has been repeatedly noted that at low level of genetic divergence, Ts/Tv

    appears to be high and at high levels of genetic divergence, Ts/Tv appears to be low. ESTanalysis shows an insight of intrer-specific molecular genetic variations within a species. The

    transitions to transversions ratio showed the molecular evolution happening in and is important

    for defining phylogenetic trees.

    EXAMPLE RESULTS of Ginger Officinate:

  • 8/9/2019 EST(expressed sequence tags)

    5/5