
Chlamydia pneumoniae



Methods

This analysis examines the DNA sequence content of the organism in order to detect laterally transferred genomic islands.

Microbial genomes show homogeneity in their sequence composition. Portions of a genome which have been laterally transferred show sequence characteristics representative of the source organism. Therefore, analysis of the organism's sequence characteristics allows detection of lateral transfer events occurring between two organisms.

The three base composition analyses we perform are G+C content, genomic signature, and codon bias. These three parameters are calculated for each coding sequence. A coding region with parameter values indicating substantial deviation from the genome is labeled as a transfer event.

The equations for genomic signature and codon bias were established by Samuel Karlin et al. [1,2].



G+C Content

G+C content is calculated as the percentage of guanine and cytosine nucleotides in each of the coding regions.
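
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build this database: the function name `gc_content` is hypothetical, and ignoring ambiguous bases such as N is an assumption that the text does not state.

```python
def gc_content(seq: str) -> float:
    """Return the percentage of G and C among the A/C/G/T bases of seq."""
    seq = seq.upper()
    # Assumption: ambiguous symbols (N, R, Y, ...) are left out of the total.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


print(gc_content("ATGCGCTA"))  # 50.0
```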



Genomic Signature

Genomic signature, which examines dinucleotide bias, is calculated for each coding region. The genomic signature consists of the array of dinucleotide relative abundance values $\rho^*_{XY}$, determined with the following equation:

$$\rho^*_{XY} = \frac{f^*_{XY}}{f^*_X \, f^*_Y},$$

where $f^*_X$ and $f^*_Y$ denote the frequencies of nucleotides X and Y, respectively, and $f^*_{XY}$ is the frequency of the dinucleotide XY. The genomic signature difference between two sequences f and g is calculated as follows:

$$\delta^*(f, g) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(f) - \rho^*_{XY}(g) \right|,$$

where the sum extends over all dinucleotides.
In our calculations, $\rho^*_{XY}(f)$ sequentially represents each coding region in the input file and $\rho^*_{XY}(g)$ represents the average value for the genome.
The genomic signature value is multiplied by 1000 and rounded to an integer when displayed in the data table.
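
A minimal Python sketch of the genomic signature calculation follows. It makes several assumptions that are not taken from this page: the names `signature`, `signature_difference`, and `displayed` are hypothetical; frequencies are counted on the given strand only (Karlin's starred frequencies are conventionally pooled with the reverse complement, which is omitted here for brevity); the difference is averaged over the 16 dinucleotides, following Karlin's published definition; and a relative abundance is set to 0 when one of its bases is absent.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 dinucleotides


def signature(seq: str) -> dict:
    """Dinucleotide relative abundances rho_XY = f_XY / (f_X * f_Y)."""
    seq = seq.upper()
    mono = Counter(base for base in seq if base in BASES)
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_pairs = sum(pairs[d] for d in DINUCS)  # pairs with ambiguous bases are skipped
    if n_pairs == 0:
        raise ValueError("sequence has no unambiguous dinucleotides")
    rho = {}
    for d in DINUCS:
        fx, fy = mono[d[0]] / n_mono, mono[d[1]] / n_mono
        # Assumption: rho is 0.0 when either base does not occur at all.
        rho[d] = (pairs[d] / n_pairs) / (fx * fy) if fx and fy else 0.0
    return rho


def signature_difference(rho_f: dict, rho_g: dict) -> float:
    """Mean absolute difference of the two signatures over all dinucleotides."""
    return sum(abs(rho_f[d] - rho_g[d]) for d in DINUCS) / len(DINUCS)


def displayed(value: float) -> int:
    """Scale by 1000 and round, as done for the data table."""
    return round(1000 * value)


print(displayed(signature_difference(signature("AACC"), signature("CCAA"))))  # 167
```

In practice `rho_g` would be the genome-wide average signature, and `signature_difference` would be called once per coding region.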



Codon Bias

The codon bias of each coding region is calculated. Codon bias represents the selectivity of an organism for particular codons when encoding amino acids. Codon usage, that is, the number of times each codon occurs in a sequence, is tallied first in order to determine the codon bias for the sequence.

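The average codon frequency g(x,y,z) for the codon nucleotide triplet (x,y,z) is normalized within each amino acid's codon family, so that

$$\sum_{(x,y,z)=a} g(x,y,z) = 1.$$

Here the sum extends over all codons translated to amino acid a.
The codon usage difference of a coding sequence F relative to a reference set G is determined as follows:

$$B(F|G) = \sum_{a} p_a(F) \left[ \sum_{(x,y,z)=a} \left| f(x,y,z) - g(x,y,z) \right| \right].$$

In our calculations, f(x,y,z) sequentially represents each coding region in the input file and g(x,y,z) represents the average for the genome. $p_a(F)$ is the average frequency of amino acid a in the coding sequence F.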

The codon bias value is multiplied by 1000 and rounded to an integer when displayed in the data table.

Note: Codon bias for each coding sequence is calculated twice. The first time, the sequence is compared to the entire genome as explained above. The second time, the sequence is compared to the set of ribosomal proteins from the input data. Thus, in the second calculation, g(x,y,z) represents the average for the set of ribosomal proteins.

Ribosomal proteins tend to show codon usage which varies from the rest of the genome. Thus sequences which have been laterally transferred should have deviant codon bias when compared to the genome and also when compared to the set of ribosomal proteins.

Note: A coding sequence having a length which is not a multiple of 3 is not included in the calculations for codon bias or for mean and standard deviation. The output fields in the data table contain "N/A" for these sequences.
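
The codon bias calculation can be sketched in Python as shown below. This is an illustration under stated assumptions, not the code behind the data table: the function names are hypothetical, the standard genetic code is used, stop codons and codons with ambiguous bases are excluded, the bias is the amino-acid-weighted sum of absolute codon frequency differences from Karlin's published definition, and the reference counts (whole genome or ribosomal proteins) are assumed to be pooled over all genes in the set.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}


def codon_counts(seq: str):
    """Count sense codons; None mirrors the "N/A" shown for bad lengths."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return None
    codons = (seq[i:i + 3] for i in range(0, len(seq), 3))
    # Assumption: stop codons and codons with ambiguous bases are skipped.
    return Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")


def amino_acid_totals(counts) -> Counter:
    """Number of codons observed for each amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return totals


def normalized_frequencies(counts) -> dict:
    """Codon frequencies scaled so each amino acid's codons sum to 1."""
    totals = amino_acid_totals(counts)
    return {c: n / totals[CODON_TABLE[c]] for c, n in counts.items()}


def codon_bias(gene_counts, reference_counts) -> float:
    """Amino-acid-weighted sum of |f(x,y,z) - g(x,y,z)| over sense codons."""
    f = normalized_frequencies(gene_counts)
    g = normalized_frequencies(reference_counts)
    totals = amino_acid_totals(gene_counts)
    n_codons = sum(totals.values())
    bias = 0.0
    for codon, aa in CODON_TABLE.items():
        if aa == "*" or totals[aa] == 0:
            continue  # amino acids absent from the gene carry zero weight
        p_a = totals[aa] / n_codons
        bias += p_a * abs(f.get(codon, 0.0) - g.get(codon, 0.0))
    return bias


gene = codon_counts("GCTGCTGCC")    # Ala codons: GCT twice, GCC once
reference = codon_counts("GCTGCC")  # Ala codons: GCT once, GCC once
print(round(1000 * codon_bias(gene, reference)))  # 333
```

Running `codon_bias` a second time with counts pooled from the ribosomal protein genes as `reference_counts` gives the second comparison described in the note above.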



Mean and Standard Deviation

For each of the above parameters, the mean and standard deviation are calculated. These values are reported at the top of the output table.

For each coding sequence, the number of standard deviations from the mean for each parameter is given. The number of standard deviations is represented in the table as follows:

*       difference from mean is greater than 1.5 standard deviations
**      difference from mean is greater than 2   standard deviations
***     difference from mean is greater than 3   standard deviations
****    difference from mean is greater than 4   standard deviations
*****   difference from mean is greater than 5   standard deviations
******  difference from mean is greater than 6   standard deviations
******* difference from mean is greater than 7   standard deviations

For G+C content, the number of standard deviations from the mean is reported together with whether the value is larger or smaller than the mean, because a deviation in either direction indicates a departure from the genome's characteristics.

However, for genomic signature and codon bias, the number of standard deviations from the mean is reported only for those genes which have values larger than the mean. By definition, these two measures indicate a greater deviation from the genome via larger numbers. Thus, values which are greater than the mean are of interest when searching for deviant genes.
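
The flagging rules above can be sketched as follows. The names `summarize` and `star_flag` are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator is used), and returning the direction as a separate string is only one possible way to represent the larger/smaller distinction for G+C content.

```python
from statistics import mean, stdev

THRESHOLDS = [1.5, 2, 3, 4, 5, 6, 7]  # one star per threshold exceeded


def summarize(values):
    """Mean and standard deviation, skipping "N/A" entries (None here)."""
    usable = [v for v in values if v is not None]
    # Assumption: sample standard deviation; needs at least two usable values.
    return mean(usable), stdev(usable)


def star_flag(value, mu, sigma, two_sided):
    """Return (stars, direction) for one parameter value.

    two_sided=True models G+C content, where both directions are flagged;
    two_sided=False models genomic signature and codon bias, where only
    values above the mean are flagged.
    """
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    z = (value - mu) / sigma
    if not two_sided and z <= 0:
        return "", ""
    stars = "*" * sum(1 for t in THRESHOLDS if abs(z) > t)
    direction = ("above" if z > 0 else "below") if stars else ""
    return stars, direction


print(star_flag(10, 4, 2, two_sided=True))   # ('**', 'above')
print(star_flag(0, 4, 2, two_sided=True))    # ('*', 'below')
print(star_flag(0, 4, 2, two_sided=False))   # ('', '')
```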



References

[1] Karlin S.
Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes.
Trends Microbiol. 2001 Jul; 9(7):335-43.
PMID: 11435108

[2] Karlin S, Campbell AM, Mrazek J.
Comparative DNA analysis across diverse genomes.
Annu Rev Genet. 1998; 32:185-225.
PMID: 9928479


Los Alamos National Laboratory
Operated by the University of California for the National Nuclear Security Administration of the US Department of Energy. Copyright © 2001 UC