
*Chlamydia pneumoniae*

This analysis examines the DNA sequence content of the organism in order to
detect laterally transferred genomic islands.

Microbial genomes show homogeneity in their sequence composition. Portions of a
genome which have been laterally transferred show sequence characteristics
representative of the source organism. Therefore, analysis of the organism's
sequence characteristics allows detection of lateral transfer events occurring
between two organisms.

The three base composition analyses we perform are G+C content, genomic
signature, and codon bias. These three parameters are calculated for each
coding sequence. A coding region with parameter values indicating substantial
deviation from the genome is labeled as a transfer event.

The equations for genomic signature and codon bias were established by Samuel Karlin et al. [1,2].

G+C content is calculated as the percentage of guanine and cytosine nucleotides
in each of the coding regions.
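The G+C calculation can be sketched as follows. This is a minimal illustration, not the code behind this page: the function name `gc_content` is hypothetical, and skipping ambiguous bases such as N is an assumption, since the text does not say how they are handled.

```python
def gc_content(cds: str) -> float:
    """Return the G+C percentage (0-100) of one coding sequence."""
    seq = cds.upper()
    # Assumption: only unambiguous bases count toward the total.
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


if __name__ == "__main__":
    print(gc_content("ATGGCGTAA"))
```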

Genomic signature, which examines dinucleotide bias, is calculated for each
coding region. The genomic signature consists of the array of dinucleotide
relative abundance values $\rho^{*}_{XY}$, determined with the following equation:

$$\rho^{*}_{XY} = \frac{f^{*}_{XY}}{f^{*}_{X} f^{*}_{Y}},$$

where $f^{*}_{X}$ and $f^{*}_{Y}$ denote the frequencies of nucleotides X and Y respectively, and $f^{*}_{XY}$ is the frequency of the dinucleotide XY.
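As a rough sketch of the dinucleotide relative abundance calculation, the following computes one value per dinucleotide as the dinucleotide frequency divided by the product of the two nucleotide frequencies. The function name is hypothetical. The sketch counts on the given strand only; whether the original calculation also includes the reverse complement is not stated here, so treat that as an assumption. Dinucleotides whose component bases are absent are reported as 0.0, which is also a choice made for this sketch.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"


def dinucleotide_relative_abundance(seq: str) -> dict:
    """Return the relative abundance f_XY / (f_X * f_Y) for all 16 dinucleotides."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in BASES)
    # Overlapping dinucleotides; pairs containing an ambiguous base are skipped.
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if seq[i] in BASES and seq[i + 1] in BASES
    )
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_mono == 0 or n_di == 0:
        raise ValueError("sequence too short or fully ambiguous")
    rho = {}
    for x, y in product(BASES, repeat=2):
        fx, fy = mono[x] / n_mono, mono[y] / n_mono
        fxy = di[x + y] / n_di
        rho[x + y] = fxy / (fx * fy) if fx > 0 and fy > 0 else 0.0
    return rho


if __name__ == "__main__":
    print(dinucleotide_relative_abundance("ATGCGCGATATCGCGTAA")["CG"])
```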
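The genomic signature difference between two sequences *f* and *g* is calculated,
following [1,2], as the average absolute dinucleotide relative abundance difference:

$$\delta^{*}(f,g) = \frac{1}{16} \sum_{XY} \left| \rho^{*}_{XY}(f) - \rho^{*}_{XY}(g) \right|,$$

where the sum extends over all dinucleotides.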

In our calculations, $\rho^{*}_{XY}(f)$ sequentially represents each coding region in the
input file and $\rho^{*}_{XY}(g)$ represents the average value for the genome.

The genomic signature value is multiplied by 1000 and rounded to an integer when
displayed in the data table.
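The comparison and display steps might look like the sketch below. It assumes the signature difference is the mean absolute difference over the 16 dinucleotides, as defined in [1,2], and that each signature is supplied as a dictionary keyed by dinucleotide. Both function names are hypothetical.

```python
DINUCLEOTIDES = [x + y for x in "ACGT" for y in "ACGT"]


def signature_difference(rho_f: dict, rho_g: dict) -> float:
    """Mean absolute difference between two 16-value genomic signatures."""
    return sum(abs(rho_f[d] - rho_g[d]) for d in DINUCLEOTIDES) / 16


def displayed_value(delta: float) -> int:
    """Scale by 1000 and round to an integer, as in the data table."""
    # Note: Python's round() sends exact halves to the nearest even integer.
    return round(delta * 1000)


if __name__ == "__main__":
    genome = {d: 1.0 for d in DINUCLEOTIDES}
    gene = dict(genome, CG=0.2)  # a gene depleted in CG relative to the genome
    print(displayed_value(signature_difference(gene, genome)))
```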

The codon bias of each coding region is calculated. Codon bias represents the
preference of an organism for particular codons when encoding amino acids.
Codon usage, the number of times each codon occurs in a sequence, is counted
first and then used to determine the codon bias for the sequence.

The average codon frequency *g(x,y,z)* for the codon nucleotide triplet *(x,y,z)* is
normalized within each amino acid, as in [1,2], so that

$$\sum_{(x,y,z)=a} g(x,y,z) = 1.$$

Here the sum extends over all codons translated to amino acid *a*.

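The codon usage difference of a coding sequence F relative to a reference set G is
determined, following [1,2], as:

$$B(F|G) = \sum_{a} p_{a}(F) \left[ \sum_{(x,y,z)=a} \left| f(x,y,z) - g(x,y,z) \right| \right]$$

In our calculations, *f(x,y,z)* sequentially represents each coding region in the input file and *g(x,y,z)* represents the average for the genome. $p_{a}(F)$ is the average frequency of amino acid *a* in the coding sequence F.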

The codon bias value is multiplied by 1000 and rounded to an integer when
displayed in the data table.
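A minimal sketch of the codon bias calculation follows. It assumes the formulation in [1,2]: codon frequencies are normalized to sum to 1 within each amino acid, and the absolute differences are weighted by the amino acid frequencies of the sequence being tested. The standard genetic code, the exclusion of stop codons, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# Standard genetic code in TCAG order; "*" marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product("TCAG", repeat=3), AMINO)}


def codon_counts(cds: str) -> Counter:
    """Count sense codons in a coding sequence whose length is a multiple of 3."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("length is not a multiple of 3")
    codons = (cds[i:i + 3] for i in range(0, len(cds), 3))
    # Stop codons and codons with ambiguous bases are dropped.
    return Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")


def amino_acid_totals(counts: Counter) -> Counter:
    """Sum codon counts per encoded amino acid."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return totals


def normalized(counts: Counter) -> dict:
    """Codon frequencies that sum to 1 within each amino acid."""
    totals = amino_acid_totals(counts)
    return {c: n / totals[CODON_TABLE[c]] for c, n in counts.items()}


def codon_bias(gene: Counter, reference: Counter) -> float:
    """Codon usage difference of a gene relative to a reference codon count."""
    f, g = normalized(gene), normalized(reference)
    totals = amino_acid_totals(gene)
    n_total = sum(totals.values())
    bias = 0.0
    for aa, n_aa in totals.items():
        p_a = n_aa / n_total  # amino acid frequency in the gene
        inner = sum(
            abs(f.get(c, 0.0) - g.get(c, 0.0))
            for c, a in CODON_TABLE.items()
            if a == aa
        )
        bias += p_a * inner
    return bias


if __name__ == "__main__":
    gene = codon_counts("GCTGCT")
    genome = codon_counts("GCTGCC")
    print(round(codon_bias(gene, genome) * 1000))
```

The same `codon_bias` function could be called a second time with the pooled codon counts of the ribosomal proteins as `reference`.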

Note: Codon bias for each coding sequence is calculated twice. The first
time, the sequence is compared to the entire genome as explained above. The
second time, the sequence is compared to the set of ribosomal proteins from the
input data. Thus, in the second calculation, *g(x,y,z)* represents the average
for the set of ribosomal proteins.

Ribosomal proteins tend to show codon usage which varies from the rest of the
genome. Thus sequences which have been laterally transferred should have
deviant codon bias when compared to the genome and also when compared to the set
of ribosomal proteins.

Note: A coding sequence having a length which is not a multiple of 3 is not
included in the calculations for codon bias or for mean and standard deviation.
The output fields in the data table contain "N/A" for these sequences.

For each of the above parameters, the mean and standard deviation are calculated.
These values are reported at the top of the output table.
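The summary statistics could be computed along these lines. The sketch represents an "N/A" entry as `None` and uses the sample standard deviation; the text does not say whether the sample or population form is used, so that choice and the function name `summarize` are assumptions.

```python
from statistics import mean, stdev


def summarize(values):
    """Return (mean, standard deviation), skipping entries recorded as None."""
    usable = [v for v in values if v is not None]
    if len(usable) < 2:
        raise ValueError("need at least two usable values")
    return mean(usable), stdev(usable)


if __name__ == "__main__":
    # One sequence is "N/A" because its length is not a multiple of 3.
    print(summarize([412, 388, None, 455, 401]))
```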

For each coding sequence, the number of standard deviations from the mean for
each parameter is given. The number of standard deviations is represented in
the table as follows:

| Symbol | Difference from mean |
|---|---|
| `*` | greater than 1.5 standard deviations |
| `**` | greater than 2 standard deviations |
| `***` | greater than 3 standard deviations |
| `****` | greater than 4 standard deviations |
| `*****` | greater than 5 standard deviations |
| `******` | greater than 6 standard deviations |
| `*******` | greater than 7 standard deviations |

For G+C content, the number of standard deviations from the mean is reported
whether the value is larger or smaller than the mean, because a deviation in
either direction indicates a departure from the genome characteristics.

However, for genomic signature and codon bias, the number of standard deviations
from the mean is reported only for those genes which have values larger than the
mean. Both measures are differences from the genome average, so larger values
indicate greater deviation from the genome. Only values above the mean are
therefore of interest when searching for deviant genes.
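The marking rules above can be sketched as a single function. The thresholds come from the list of symbols given earlier; the function name and the `two_sided` flag are hypothetical. The flag would be set for G+C content and left unset for genomic signature and codon bias. How the page displays the direction of a G+C deviation is not reproduced here.

```python
THRESHOLDS = (1.5, 2, 3, 4, 5, 6, 7)


def star_rating(value: float, mean: float, sd: float, two_sided: bool) -> str:
    """Return one asterisk per threshold exceeded by the deviation from the mean."""
    if sd <= 0:
        return ""
    z = (value - mean) / sd
    if not two_sided and z <= 0:
        # Signature and codon bias: only values above the mean are marked.
        return ""
    return "*" * sum(1 for t in THRESHOLDS if abs(z) > t)


if __name__ == "__main__":
    print(star_rating(30.0, 40.0, 4.0, two_sided=True))   # G+C below the mean
    print(star_rating(30.0, 40.0, 4.0, two_sided=False))  # not marked
```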

[1] Karlin S.

Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes.

Trends Microbiol. 2001 Jul; 9(7):335-43.

PMID: 11435108

[2] Karlin S, Campbell AM, Mrazek J.

Comparative DNA analysis across diverse genomes.

Annu Rev Genet. 1998; 32:185-225.

PMID: 9928479

Operated by the University of California
for the National Nuclear Security Administration of the US Department of Energy. Copyright © 2001 UC