Codon Usage Poster v1

8/6/2019 Codon Usage Poster v1


Fd d od po: oom oo o doy o q w poy o mo

To determine whether the observed relation was persistent in the data, a second round of calculations was carried out, this time limiting the number of sampled members per group to the number of members in the smallest one (logically, the group with the largest number of variations). The base group from which the distances were calculated was limited too. The trend is shown in Figure 2.

POPOV I.1, NENOV A.2, PETROV P.3, VASSILEV D.1*

1 - AgroBioInstitute, 2 - Dynomica Ltd, 3 - Sofia University, FMI

* Corresponding author: e-mail [email protected]

Objective

The major goal of the study is to determine the association between codon usage and the inherent mutability of the sequence. The search for a model that can classify sequences into groups based on their codon content, and associate them with the other sequences in the group, is also within the scope of the work.

Data set

For the purpose of our research, five groups of coding sequences (mRNAs) were selected from the latest available version of the EMBL Nucleotide Sequence database. The groups were built based on the data for genetic variations from Swiss-Prot, which includes only missense changes. This data was entered into a MySQL database and the following two tables were generated:

• Gene – showing the number of different variations and diseases connected to the specific gene name.
• Disease – showing the number of different genes and variations connected to the respective disease name.
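The two summary tables amount to simple aggregations over the variation records. A minimal sketch, assuming each record is a hypothetical (gene, variation id, disease) tuple; the field layout, the function name `build_summary_tables` and the sample rows are illustrative only and are not taken from Swiss-Prot or from the MySQL schema used in the study:

```python
from collections import defaultdict

# Hypothetical variation records: (gene name, variation id, disease name or None).
records = [
    ("GENE_A", "VAR_1", "Disease X"),
    ("GENE_A", "VAR_2", "Disease X"),
    ("GENE_A", "VAR_3", "Disease Y"),
    ("GENE_B", "VAR_4", "Disease X"),
    ("GENE_B", "VAR_5", None),  # variation with no associated disease
]


def build_summary_tables(rows):
    """Return (gene_table, disease_table) as dicts of count tuples."""
    gene_vars, gene_dis = defaultdict(set), defaultdict(set)
    dis_genes, dis_vars = defaultdict(set), defaultdict(set)
    for gene, variation, disease in rows:
        gene_vars[gene].add(variation)
        if disease is not None:
            gene_dis[gene].add(disease)
            dis_genes[disease].add(gene)
            dis_vars[disease].add(variation)
    # Gene table: gene -> (number of variations, number of diseases)
    gene_table = {g: (len(gene_vars[g]), len(gene_dis[g])) for g in gene_vars}
    # Disease table: disease -> (number of genes, number of variations)
    disease_table = {d: (len(dis_genes[d]), len(dis_vars[d])) for d in dis_genes}
    return gene_table, disease_table


gene_table, disease_table = build_summary_tables(records)
```

Sets are used so that a variation or disease listed more than once for the same gene is counted only once.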

These tables were used to select genes with a certain number of variations into the following five groups:

• Null group – genes that were not present in the variations dataset, meaning that there were no registered/annotated variations for them. This of course is not a group of genes “with no mutations”, but it is still another reference for the other four groups.
• Genes with less than 6 variations.
• Genes with a number of variations between 6 and 15.
• Genes with more than 15 variations.
• Every gene with recorded variations.
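The grouping rule can be sketched as a small binning function. This is an illustration under two assumptions: that "between 6 and 15" includes both 6 and 15 (as the neighbouring "<6" and ">15" groups suggest), and that a gene absent from the variations dataset is represented by a count of 0. The function names are hypothetical.

```python
def variation_group(n_variations):
    """Map a gene's variation count to the group label used in the tables."""
    if n_variations == 0:
        return "Null"      # no registered variations (assumed encoding)
    if n_variations < 6:
        return "<6"
    if n_variations <= 15:
        return "6 to 15"   # assumed to include both 6 and 15
    return ">15"


def assign_groups(variation_counts):
    """Split genes into the five groups; 'With' holds every gene with variations."""
    groups = {"Null": [], "<6": [], "6 to 15": [], ">15": [], "With": []}
    for gene, n in variation_counts.items():
        groups[variation_group(n)].append(gene)
        if n > 0:
            groups["With"].append(gene)
    return groups


# Hypothetical gene names and counts, for illustration only.
example = assign_groups({"g0": 0, "g1": 5, "g2": 6, "g3": 15, "g4": 16})
```

Note that the "With" group overlaps the three count-based groups by construction, which matches the sequence totals in Table 1 (6909 + 1124 + 424 = 8457).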

These groups were selected in order to show that there is a correlation between the number of variations occurring in the sequence and its codon usage. This is why groups were selected based on the total number of variations, and not based on what diseases they were connected to. To determine the connection between genes linked to a disease and codon usage, a sixth group was selected. It encompassed all the genes with variations that were connected to a certain kind of cancer, or were found in a cancer sample (and so could be connected to cancer). The number of sequences in all the groups can be found in Table 1.

Group                   Sequences
Null                    18742
With                    8457
<6                      6909
6 to 15                 1124
>15                     424
Cancer related genes    868

Methods and algorithms

For each of the above groups the EMBOSS application CUSP was used to calculate synonymous codon fractions and the relative frequency of every codon per 1000 codons in the sequence. CUSP works by simply counting the codons in the sequence and calculating statistics for each of them based on amino acid and total number of occurrences. This data was generated for each sequence and for the group as a whole. The results for the whole groups were used to measure the difference between them. When calculating the distance, the aim was to preserve the difference imposed by the frequency of every codon. This is why we used Euclidean distance in 64D space as a distance function. The frequency and fraction numbers for each codon were taken as coordinates in this 64D space, and the distance between the points so defined, representing the groups, and a reference was calculated (Table 2). The CUSP output for the whole dataset was used as a reference point for the distance calculation. A graphical representation of the distances can be seen in Figure 1.
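The calculation described above can be sketched in a few lines. This is not the CUSP implementation, only an illustration of the same kind of statistics: codon counts, frequency per 1000 codons, synonymous fraction within each amino acid, and Euclidean distance between two 64-coordinate points. It assumes the standard genetic code, in-frame sequences, and per-1000 frequencies as the coordinates; how the frequency and fraction values were combined in the study is not specified. All function names and the toy sequences are hypothetical.

```python
import math
from itertools import product

BASES = "TCAG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]  # 64 codons, TTT..GGG
# Standard genetic code in the same TCAG order ('*' marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = dict(zip(CODONS, AMINO))


def codon_counts(seq):
    """Count in-frame codons of one coding sequence."""
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
    return counts


def pooled_counts(seqs):
    """Sum codon counts over all sequences of a group."""
    total = dict.fromkeys(CODONS, 0)
    for seq in seqs:
        for codon, n in codon_counts(seq).items():
            total[codon] += n
    return total


def per_thousand(counts):
    """Frequency of every codon per 1000 codons, in CODONS order."""
    total = sum(counts.values())
    return [1000.0 * counts[c] / total if total else 0.0 for c in CODONS]


def synonymous_fractions(counts):
    """Share of each codon among the codons for the same amino acid."""
    aa_totals = {}
    for codon, n in counts.items():
        aa = CODON_TO_AA[codon]
        aa_totals[aa] = aa_totals.get(aa, 0) + n
    return [counts[c] / aa_totals[CODON_TO_AA[c]]
            if aa_totals[CODON_TO_AA[c]] else 0.0 for c in CODONS]


def euclidean(p, q):
    """Euclidean distance between two points with the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Toy sequences, for illustration only.
group_seqs = ["ATGGCCGCCTAA"]
reference_seqs = ["ATGGCTGCTTAA"]
group_point = per_thousand(pooled_counts(group_seqs))
reference_point = per_thousand(pooled_counts(reference_seqs))
distance = euclidean(group_point, reference_point)
```

In the toy example the two sequences encode the same peptide but use different alanine codons (GCC versus GCT), so the distance is driven by synonymous codon choice alone.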

Group                   Distance
Null                    2.1971714
With                    2.4203601
<6                      2.4840020
6 to 15                 3.5784836
>15                     8.5619555
Cancer related genes    7.3694174

Background

Mo d o d o d oo. Mo om o mo od, p o o o w d. how, o o m po ow o q moy d o, d o m o d. W oo w q d o d, d q o (odo, od omo, .) dmd y y odo mo df q op. T xmd o mp o o dm odo qy d o o wp yoymo odo d o od ommo mo d. T m pop o q d o omp o

o q. T m do w wo op o q, o xm pop o op d poy m o y o q.

Table 1. Number of sequences in each group.

Table 2. Results from the distance calculations between the gene groups and the reference (whole dataset). Null - genes without any variation data in Swiss-Prot; With - all genes with recorded variations; <6 / 6 to 15 / >15 - groups with the respective number of variations recorded.

Figure 1. Graphical representation of the distance results for all the groups.

Figure 2. Graphical representation of the distance results for the limited groups.

[Figures 1 and 2: distance (vertical axis) by group (horizontal axis: NONE, With, <6, 6 to 15, >15, cancer).]


A second test of the results was done using five randomly picked groups of 400 sequences (close to the number of sequences in the smallest group). The CUSP results for these groups, compared to a reference group of another 400 randomly picked sequences, are represented in Figure 3.

Figure 3. Distance results for the random groups.
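The random control can be sketched as follows. The sketch assumes the reference and the five groups are drawn without overlap from one pool of sequence identifiers; the poster does not state whether the groups were disjoint. The function name, the `seed` parameter and the identifiers are hypothetical.

```python
import random


def random_groups(sequence_ids, n_groups=5, group_size=400, seed=0):
    """Draw a reference group and n_groups test groups, all of equal size."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    needed = (n_groups + 1) * group_size
    pool = list(sequence_ids)
    if len(pool) < needed:
        raise ValueError("not enough sequences for non-overlapping groups")
    picked = rng.sample(pool, needed)  # sampling without replacement
    reference = picked[:group_size]
    groups = [picked[(i + 1) * group_size:(i + 2) * group_size]
              for i in range(n_groups)]
    return reference, groups


# Hypothetical identifiers, for illustration only.
ids = [f"seq{i}" for i in range(3000)]
reference, groups = random_groups(ids)
```

Each random group would then be compared to the reference with the same distance function used for the variation-based groups.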


