
Codon Usage Poster v1


Fd d od po: oom oo o doy o q w poy o mo

To determine whether the observed relation persisted in the data, a second round of calculations was carried out, this time limiting the number of sampled members per group to the number of members in the smallest one (logically, the group with the largest number of variations). The base group from which the distances were calculated was limited too. The trend is shown in Figure 2.
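The size-limiting step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `downsample_groups`, the dictionary layout, and the fixed random seed are assumptions introduced for the example.

```python
import random


def downsample_groups(groups, seed=0):
    """Randomly reduce every group to the size of the smallest group.

    `groups` maps a group label to a list of sequence identifiers.
    Both the layout and the seed are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    smallest = min(len(members) for members in groups.values())
    # Sample without replacement so no sequence is counted twice.
    return {
        label: rng.sample(members, smallest)
        for label, members in groups.items()
    }
```

Sampling without replacement keeps each limited group a true subset of the original one, so the smallest group passes through unchanged apart from ordering.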

POPOV I.1, NENOV A.2, PETROV P.3, VASSILEV D.1*

1 - AgroBioInstitute, 2 - Dynomica Ltd, 3 - Sofia University, FMI

* Corresponding author: e-mail [email protected]

Objective

The major goal of the study is to determine the association between codon usage and the inherent mutability of the sequence. The search for a model that can be used to classify sequences into groups based on their codon content, and to associate them with the other sequences in the group, is also within the scope of the work.

Data set

For the purpose of our research five groups of coding sequences (mRNAs) were selected from the latest available version of the EMBL Nucleotide Sequence database. The groups were built based on the data for genetic variations from Swiss-Prot, which includes only missense changes. This data was entered into a MySQL database and the following two tables were generated:

• Gene – showing the number of different variations and diseases connected to the specific gene name.
• Disease – showing the number of different genes and variations connected to the respective disease name.

These tables were used to select genes with a certain number of variations in the following five groups:

• Null group – genes that were not present in the variations dataset, meaning that there were no registered/annotated variations for them. This of course is not a group of genes “with no mutations”, but it is still another reference for the other four groups.
• Genes with fewer than 6 variations.
• Genes with between 6 and 15 variations.
• Genes with more than 15 variations.
• Every gene with recorded variations.
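The grouping above can be sketched in a few lines. This is only an illustration: the function names, the input dictionary of per-gene variation counts, and the choice to treat the 6 to 15 range as inclusive at both ends are assumptions, since the poster does not state how the boundaries were handled.

```python
from collections import defaultdict


def variation_group(n_variations):
    """Return the group label for a gene with the given variation count.

    Assumption: "between 6 and 15" includes both 6 and 15.
    """
    if n_variations == 0:
        return "Null"
    if n_variations < 6:
        return "<6"
    if n_variations <= 15:
        return "6 to 15"
    return ">15"


def build_groups(variation_counts):
    """Split genes into the five groups; `variation_counts` maps gene -> count."""
    groups = defaultdict(list)
    for gene, count in variation_counts.items():
        groups[variation_group(count)].append(gene)
        if count > 0:
            # "With" is the union of all genes that have recorded variations.
            groups["With"].append(gene)
    return dict(groups)
```

With this layout the "With" group is always the sum of the three counted groups, which matches the sequence totals reported in Table 1 (6909 + 1124 + 424 = 8457).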

These groups were selected in order to show that there is a correlation between the number of variations occurring in a sequence and its codon usage. This is why the groups were selected based on the total number of variations, and not on the diseases they were connected to. To determine the connection between disease-linked genes and codon usage, a sixth group was selected. It encompassed all the genes with variations that were connected to a certain kind of cancer, or were found in a cancer sample (and so could be connected to cancer). The number of sequences in all the groups can be found in Table 1.

Group                  Sequences
Null                   18742
With                   8457
<6                     6909
6 to 15                1124
>15                    424
Cancer related genes   868

Methods and algorithms

For each of the above groups the EMBOSS application CUSP was used to calculate synonymous codon fractions and the relative frequency of every codon per 1000 codons in the sequence. CUSP works by simply counting the codons in the sequence and calculating statistics for each of them based on amino acid and total number of occurrences. This data was generated for each sequence and for the group as a whole. The results for the whole groups were used to measure the difference between them. When calculating the distance the aim was to preserve the difference imposed by the frequency of every codon. This is why we used Euclidean distance in 64D space as a distance function. The frequency and fraction numbers for each codon were taken as coordinates in this 64D space, and the distance between the points so defined, representing the groups, and a reference was calculated (Table 2). The CUSP output for the whole dataset was used as the reference point for the distance calculation. A graphical representation of the distances can be seen in Figure 1.
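The counting and distance steps can be illustrated with a short sketch. It does not call CUSP or reproduce its output format; it is a plain re-implementation of the idea, and it assumes that the per-1000 codon frequencies alone serve as the 64 coordinates (the poster does not specify how frequency and fraction values were combined or scaled). The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt

# The 64 codons in a fixed order, so that every group maps to the same axes.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]
CODON_SET = set(CODONS)


def codon_frequencies(sequences):
    """Return codon frequencies per 1000 codons, pooled over all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")  # accept mRNA or cDNA spelling
        for i in range(0, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon in CODON_SET:  # skip codons with ambiguous bases
                counts[codon] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(CODONS)
    return [1000.0 * counts[c] / total for c in CODONS]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length coordinate vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

A group's distance to the reference would then be `euclidean_distance(codon_frequencies(group), codon_frequencies(whole_dataset))`. The absolute values depend on the coordinate scaling, so this sketch is not expected to reproduce the numbers in Table 2.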

Group                  Distance
Null                   2.1971714
With                   2.4203601
<6                     2.4840020
6 to 15                3.5784836
>15                    8.5619555
Cancer related genes   7.3694174

Background

Mo d o d o d oo. Mo om o mo od, p o o o w d. how, o o m po ow o q moy d o, d o m o d. W oo w q d o d, d q o (odo, od omo, .) dmd y y odo mo df q op. T xmd o mp o o dm odo qy d o o wp yoymo odo d o od ommo mo d. T m pop o q d o omp o

o q. T m do w wo op o q, o xm pop o op d poy m o y o q.

Table 1. Number of sequences in each group.

Table 2. Results from the distance calculations between the gene groups and the reference (whole dataset). Null - genes without any variation data in Swiss-Prot; With - all genes with recorded variations; <6 / 6 to 15 / >15 - groups with the respective number of variations recorded.

Figure 1. Graphical representation of the distance results for all the groups. Axes: group (NONE, With, <6, 6 to 15, >15, cancer) against distance.

Figure 2. Graphical representation of the distance results for the limited groups. Axes: group (NONE, With, <6, 6 to 15, >15, cancer) against distance.

A second test of the results was done using five randomly picked groups of 400 sequences (close to the number of sequences in the smallest group). The CUSP results for these groups, compared to a reference group of another randomly picked 400 sequences, are represented in Figure 3.
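One way to draw such random groups is sketched below. The function name, the seed, and the decision to make the five groups and the reference mutually disjoint are assumptions made for the example; the poster only states that the groups and the reference were picked at random.

```python
import random


def random_control_groups(pool, n_groups=5, group_size=400, seed=0):
    """Draw `n_groups` random groups plus one random reference group.

    Assumption: all groups are disjoint, so the pool must hold at least
    (n_groups + 1) * group_size sequences.
    """
    needed = (n_groups + 1) * group_size
    if len(pool) < needed:
        raise ValueError("pool too small for disjoint random groups")
    rng = random.Random(seed)
    picked = rng.sample(pool, needed)
    # Cut the shuffled sample into equal chunks; the last one is the reference.
    chunks = [
        picked[i * group_size:(i + 1) * group_size]
        for i in range(n_groups + 1)
    ]
    return chunks[:-1], chunks[-1]
```

If random groups sit at similar distances from a random reference, the spread seen between the variation-based groups is less likely to be an artefact of group size alone.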

Figure 3. Distance results for the random groups. Axes: group (labelled NONE, With, <6, 6 to 15, >15, cancer) against distance.