Rhesus genome annotations

Rob Norgren
Department of Genetics, Cell Biology and Anatomy
University of Nebraska Medical Center

Conventional Approach to GeneChip Production

• Sequence millions of ESTs

• Obtain finished genomic sequences

• Cluster redundant ESTs

• Align EST clusters with genomic sequences

• Extract the last 571 bp of sequence from each transcript: the probe selection region (PSR)

• Choose 11 to 16 probes that tile across the PSR
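
The last two steps of this pipeline can be sketched in Python. This is only an illustration, not Affymetrix's actual probe selection procedure: the 25-base probe length and the even spacing of probes are assumptions (the slide gives only the PSR length and the probe count), and the function names are hypothetical.

```python
PSR_LENGTH = 571    # bp taken from the 3' end of each transcript (from the slide)
PROBE_LENGTH = 25   # assumed probe length; not stated in the slide


def extract_psr(transcript: str, psr_length: int = PSR_LENGTH) -> str:
    """Return the last `psr_length` bases of a transcript (all of it if shorter)."""
    return transcript[-psr_length:]


def tile_probes(psr: str, n_probes: int = 11,
                probe_length: int = PROBE_LENGTH) -> list[str]:
    """Pick `n_probes` evenly spaced probes across the PSR.

    Even spacing is an illustrative assumption; a real design would also
    score candidate probes for hybridization quality and uniqueness.
    """
    if len(psr) < probe_length:
        return []
    last_start = len(psr) - probe_length
    if n_probes == 1 or last_start == 0:
        return [psr[:probe_length]]
    step = last_start / (n_probes - 1)
    starts = [round(i * step) for i in range(n_probes)]
    return [psr[s:s + probe_length] for s in starts]
```

With a transcript longer than 571 bp, `tile_probes(extract_psr(transcript), 11)` returns 11 probes running from the start of the PSR to its 3' end.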

Problems with the conventional approaches for a rhesus macaque GeneChip

• Insufficient ESTs to cover most genes

• Little finished genomic sequence (in 2005)

Strategy for targeted amplification of rhesus genes

• Identify the terminal exon and flanking sequence for every human gene

• Design primers and amplify from monkey genomic DNA

• Obtain the rhesus PSR sequences

[Figure: terminal exon with the PSR, the forward (F) and reverse (R) primers, and the poly A site]

PSR: probe selection region; F: forward primer; R: reverse primer

Other sources for rhesus GeneChip PSRs

• Preliminary Baylor genomic sequences. In silico approach: aligned human PSRs with preliminary rhesus genomic sequence.

• ESTs
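
As a rough picture of the in silico approach, the sketch below slides a human PSR along a rhesus sequence and reports the best ungapped window. It is a toy stand-in for a real alignment tool, not the method used for the GeneChip; the function name is hypothetical, and real alignments must handle gaps and much longer sequences.

```python
def best_ungapped_match(query: str, target: str) -> tuple[int, float]:
    """Slide `query` along `target`; return (offset, identity) of the best window."""
    if not query or len(query) > len(target):
        raise ValueError("query must be non-empty and no longer than target")
    best_offset, best_matches = 0, -1
    for offset in range(len(target) - len(query) + 1):
        window = target[offset:offset + len(query)]
        # Count positions where the two sequences carry the same base.
        matches = sum(q == t for q, t in zip(query, window))
        if matches > best_matches:
            best_offset, best_matches = offset, matches
    return best_offset, best_matches / len(query)
```

The window at the returned offset would then serve as the candidate rhesus PSR, provided its identity to the human PSR is high enough to trust.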

Rhesus GeneChip

• Available in March 2005

• Novel design

• Whole genome expression array - 52,024 probesets for 47,000 transcripts

• Probesets include 17,093 well-annotated genes (16 probes/probeset)

• Probesets were designed for 1,099 well-annotated genes not present on the U133+2.0 human GeneChip.

Rhesus Genome

• Draft published in Science on April 13, 2007

• “The rhesus macaque genome assembly is a draft DNA sequence, and it contains many gaps.”

What does a “draft” rhesus genome mean?

• 26,907 protein coding genes for the human

• 24,038 protein coding genes for rhesus macaques

• Sounds good, but is misleading.

• 19,450 well-annotated protein coding genes for humans

• 8,744 well-annotated protein coding genes for rhesus macaques

• What does “well annotated” mean?

• No “hypothetical” genes

• Only genes with “good” gene symbols. No “Locs”.
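
The two criteria above can be written as a simple filter, and the counts can be turned into fractions. This is a minimal sketch: the `symbol` and `description` fields are hypothetical, “Locs” is taken to mean placeholder symbols of the form LOC followed by digits, and the exact rules used to produce the counts on this slide may differ.

```python
import re

# Placeholder symbols of the form LOC followed by digits (e.g. "LOC123456").
LOC_SYMBOL = re.compile(r"^LOC\d+$", re.IGNORECASE)


def is_well_annotated(symbol: str, description: str) -> bool:
    """Apply the two criteria from the slide to one gene record."""
    if not symbol or LOC_SYMBOL.match(symbol):
        return False
    return "hypothetical" not in description.lower()


def well_annotated_fraction(well_annotated: int, total: int) -> float:
    """Share of protein coding genes that count as well annotated."""
    return well_annotated / total


# Counts taken from the slide.
human = well_annotated_fraction(19_450, 26_907)
rhesus = well_annotated_fraction(8_744, 24_038)
print(f"human: {human:.1%}, rhesus: {rhesus:.1%}")
```

By these counts, roughly 72% of human protein coding genes are well annotated, against roughly 36% for rhesus macaques, which is why the similar raw gene totals are misleading.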

Problems with GeneChip annotations

• Affymetrix relies on NCBI annotations; hence, many probesets are not annotated with “real” gene symbols

• Stopgap solution: http://www.unmc.edu/rheusgenechip

• Permanent solution requires full and complete annotation of the rhesus genome at NCBI.

What can go wrong at the genome sequencing center?

• Large gaps

• Small gaps

• Misassemblies

• Sequencing errors

What can go wrong with ab initio annotations?

• Incorrect assignment of pseudogene status

• Failure to identify genes

• Incorrect gene models (some exons right, some wrong)

• Incomplete gene models

Consequences of non-annotated genes

• A large number of databases depend on NCBI for their annotations. Example: Affymetrix GeneChips

• Errors and omissions are propagated to dependent databases

• Users are frustrated when they see “Locs” instead of a proper gene symbol

• Users can BLAST each probeset consensus sequence, or ask their bioinformatics personnel to establish gene identity, but this wastes time and effort.

How to correct annotations

• Annotations must be acceptable to NCBI; if they are not, corrections will not propagate to dependent databases.

• Some gene annotations can be corrected by manual inspection.

• Some gene annotations can be corrected by human ortholog-based gene models rather than ab initio approaches.

• Some gene annotations can only be corrected by additional sequencing.

• And some gene annotations require a trip to Hell...

Defensins - the gene family from Hell

• Large family of genes

• Orthologs poorly conserved - positive selection?

• Will require focused sequencing and annotation

• May require publication before NCBI annotates most of the rhesus defensins

Acknowledgements

• Jeff Kittrell

• Joel Goodsell

• Audrey Gomel

• NCRR/NIH
