U. UREALYTICUM GENOME AND PROTEOME PROPERTIES

COORDINATE SYSTEM

The origin of replication of the 751,719 bp circular genome lies in the region between the dnaA and rpL34 ORFs. The first coordinate is taken from this region, 170 bp upstream of the coding sequence of rpL34, the first gene (UU001). Plus-strand coding sequences tend to lie to the right of UU001 and minus-strand coding sequences to the left. To view the sequence of this putative origin, click on the UU604 gene in the Gene Image Map, then on IGR515.

G + C CONTENT AND CODON USAGE

The average G + C content of the genome is 26%. The codon usage as presented on the Codon Usage Database Web site (http://www.kazusa.or.jp/codon/) (Nakamura, Y., Gojobori, T. and Ikemura, T. (1997) Nucl. Acids Res. 25, 244-245.) is as follows (based on 7701 codons):

fields: [triplet] [frequency: per thousand] ([number])

    UUU 21.6(   166)  UCU 10.3(    79)  UAU 16.4(   126)  UGU  4.2(    32)
    UUC 14.5(   112)  UCC  0.8(     6)  UAC  7.5(    58)  UGC  1.3(    10)
    UUA 54.7(   421)  UCA 33.4(   257)  UAA  5.2(    40)  UGA  5.3(    41)
    UUG  6.9(    53)  UCG  0.8(     6)  UAG  0.5(     4)  UGG  0.6(     5)

    CUU  8.7(    67)  CCU  8.6(    66)  CAU 10.0(    77)  CGU 25.7(   198)
    CUC  0.3(     2)  CCC  0.8(     6)  CAC 10.8(    83)  CGC  1.9(    15)
    CUA 10.8(    83)  CCA 26.7(   206)  CAA 29.5(   227)  CGA  2.6(    20)
    CUG  0.8(     6)  CCG  1.6(    12)  CAG  1.8(    14)  CGG  0.6(     5)

    AUU 46.6(   359)  ACU 27.8(   214)  AAU 29.2(   225)  AGU 17.0(   131)
    AUC 10.8(    83)  ACC  1.3(    10)  AAC 16.4(   126)  AGC  5.5(    42)
    AUA  8.8(    68)  ACA 29.2(   225)  AAA 61.8(   476)  AGA 17.4(   134)
    AUG 25.5(   196)  ACG  1.9(    15)  AAG  8.7(    67)  AGG  0.9(     7)

    GUU 49.2(   379)  GCU 38.4(   296)  GAU 39.7(   306)  GGU 43.1(   332)
    GUC  4.5(    35)  GCC  2.6(    20)  GAC 16.8(   129)  GGC  4.9(    38)
    GUA 29.7(   229)  GCA 22.6(   174)  GAA 63.0(   485)  GGA 33.1(   255)
    GUG  6.9(    53)  GCG  1.7(    13)  GAG  3.6(    28)  GGG  6.2(    48)
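
The relationship between the counts and the per-thousand frequencies in the table can be checked with a short script. The following is a minimal sketch, not part of the original analysis: it parses the table as printed above, recomputes the total number of codons, and derives a frequency per thousand from a raw count. The names parse_codon_table, per_thousand and gc3_fraction are hypothetical helpers introduced here for illustration.

```python
import re

# Codon table copied verbatim from the text: triplet, frequency per thousand, (count).
CODON_TABLE = """
UUU 21.6(   166)  UCU 10.3(    79)  UAU 16.4(   126)  UGU  4.2(    32)
UUC 14.5(   112)  UCC  0.8(     6)  UAC  7.5(    58)  UGC  1.3(    10)
UUA 54.7(   421)  UCA 33.4(   257)  UAA  5.2(    40)  UGA  5.3(    41)
UUG  6.9(    53)  UCG  0.8(     6)  UAG  0.5(     4)  UGG  0.6(     5)
CUU  8.7(    67)  CCU  8.6(    66)  CAU 10.0(    77)  CGU 25.7(   198)
CUC  0.3(     2)  CCC  0.8(     6)  CAC 10.8(    83)  CGC  1.9(    15)
CUA 10.8(    83)  CCA 26.7(   206)  CAA 29.5(   227)  CGA  2.6(    20)
CUG  0.8(     6)  CCG  1.6(    12)  CAG  1.8(    14)  CGG  0.6(     5)
AUU 46.6(   359)  ACU 27.8(   214)  AAU 29.2(   225)  AGU 17.0(   131)
AUC 10.8(    83)  ACC  1.3(    10)  AAC 16.4(   126)  AGC  5.5(    42)
AUA  8.8(    68)  ACA 29.2(   225)  AAA 61.8(   476)  AGA 17.4(   134)
AUG 25.5(   196)  ACG  1.9(    15)  AAG  8.7(    67)  AGG  0.9(     7)
GUU 49.2(   379)  GCU 38.4(   296)  GAU 39.7(   306)  GGU 43.1(   332)
GUC  4.5(    35)  GCC  2.6(    20)  GAC 16.8(   129)  GGC  4.9(    38)
GUA 29.7(   229)  GCA 22.6(   174)  GAA 63.0(   485)  GGA 33.1(   255)
GUG  6.9(    53)  GCG  1.7(    13)  GAG  3.6(    28)  GGG  6.2(    48)
"""

# One table entry: triplet, listed frequency, then the count in parentheses.
ENTRY = re.compile(r"([ACGU]{3})\s+([\d.]+)\(\s*(\d+)\)")


def parse_codon_table(text):
    """Return {triplet: (listed_per_thousand, count)} for every entry in text."""
    return {t: (float(f), int(n)) for t, f, n in ENTRY.findall(text)}


def per_thousand(count, total):
    """Frequency per thousand codons, rounded to one decimal as in the table."""
    return round(1000.0 * count / total, 1)


def gc3_fraction(table):
    """Fraction of codons whose third base is G or C (illustrative statistic)."""
    total = sum(n for _, n in table.values())
    gc3 = sum(n for t, (_, n) in table.items() if t[2] in "GC")
    return gc3 / total


table = parse_codon_table(CODON_TABLE)
total_codons = sum(n for _, n in table.values())

if __name__ == "__main__":
    print("codons parsed:", len(table), "total count:", total_codons)
    print("UUU per thousand:", per_thousand(table["UUU"][1], total_codons))
    print("third-position G+C fraction:", round(gc3_fraction(table), 3))
```

The counts sum to the 7701 codons stated above, which is a quick way to confirm that the table was transcribed without loss.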

GENERAL FEATURES OF THE PROTEOME

Current analyses of this database show 311 identifiable proteins and 149 conserved but uncharacterized proteins; the remaining 154 ORFs are hypothetical. The original assessment of ORFs made use of GeneMark and BLAST. Taking the GenBank-deposited sequences as our starting point, each protein was analyzed for similarities outside the genome (orthologs) using gapped BLAST and PSI-BLAST, similarities within the genome (isologs and paralogs) using gapped BLAST, similarities to COGs (clusters of orthologous groups), and similarities to databases that emphasize domains, i.e. BLOCKS and ProDom (Documentation).

Of the 311 identified proteins, 263 have been assigned an EC number. The average molecular weight (length) of the 614 proteins is 42.0 kDa (1121 aa); the median molecular weight (length) is 16.6 kDa (432 aa). The average and median estimated pI values are 8.64 and 7.85.

[Figure: molecular weight distribution for U. urealyticum proteins; axis labeled "x 1000KDa"]

Of the 614 predicted proteins, 50 appear to be secreted, 149 have transmembrane domains, and 177 have coiled-coil regions, as determined by a suite of structural analysis programs in the SEALS package available from NCBI/NIH. Of the 50 proteins predicted to have a Gram-positive signal peptide, 2 had a best hit to E. coli whereas 4 had a best hit to B. subtilis (by the criteria stated in the following paragraph). Curiously, most of the secreted proteins were not particularly similar to either B. subtilis or E. coli proteins. Nevertheless, signal peptide prediction for U. urealyticum was based solely on the B. subtilis (Gram-positive) model. Of the 614 proteins, 79 have similarity to a PDB database sequence using BLAST2; a hit requires an E value lower (better) than 0.0001.

When U. urealyticum proteins are compared to E. coli and B. subtilis proteins using BLAST2, 3 have a better hit to E. coli proteins whereas 27 have a better hit to B. subtilis proteins, consistent with the finding that Mycoplasmas are Gram-positive. In this analysis, the E value of the best hit to one genome must be at least 100 times smaller (better) than the E value of the best hit to the other genome. Also, only E values of 1e-03 and lower are considered. Note that approximately 430 of the 460 identified and conserved U. urealyticum proteins do not have a best hit to either of these bacteria by these criteria.
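
The best-hit rule in the paragraph above (an E value of 1e-03 or lower, and at least 100 times smaller than the E value for the other genome) can be sketched as a small function. This is an illustrative sketch only, not the code used for the analysis; the function name, its parameters and the example E values are hypothetical. It also assumes that a protein with a qualifying hit in only one genome counts as a best hit to that genome, which the text does not state.

```python
import math


def best_hit_genome(e_values, max_e=1e-3, min_ratio=100.0):
    """Return the genome with a clear best hit, or None if there is none.

    e_values maps a genome name to the E value of the best BLAST hit in
    that genome; use None (or omit the genome) when there is no hit.
    """
    hits = sorted((e, g) for g, e in e_values.items() if e is not None)
    if not hits:
        return None
    best_e, best_genome = hits[0]
    # Only E values of max_e and lower are considered.
    if best_e > max_e:
        return None
    # Assumption: with no hit in any other genome, the runner-up is infinitely bad.
    runner_up = hits[1][0] if len(hits) > 1 else math.inf
    # The best E value must be at least min_ratio times smaller than the runner-up.
    # The strict inequality rejects ties such as two E values reported as 0.0.
    if best_e * min_ratio <= runner_up and best_e < runner_up:
        return best_genome
    return None


if __name__ == "__main__":
    # Hypothetical E values, for illustration only.
    print(best_hit_genome({"E. coli": 1e-20, "B. subtilis": 1e-30}))
    print(best_hit_genome({"E. coli": 1e-10, "B. subtilis": 1e-11}))
```

The same function applies to the other genome pairs compared in this section, since each comparison uses the same two thresholds.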

When U. urealyticum proteins are compared to Mycoplasma proteins, 75 have a best hit to M. genitalium and 57 have a best hit to M. pneumoniae, using the criteria stated above (E value of 1e-03 or better and E value 100 times or more smaller than the next best hit). Note that 189 of the 460 identified and conserved U. urealyticum proteins have hits to both Mycoplasma species with a difference in similarities smaller than 10. The trend of more hits to M. genitalium than to M. pneumoniae is reversed for secreted proteins, with 2 best hits to M. genitalium and 4 to M. pneumoniae (out of 50 U. urealyticum secreted proteins). Fifteen hits are observed to both Mycoplasma species with a difference in similarities smaller than 10.

When U. urealyticum proteins are compared to Chlamydia proteins, 35 have a best hit to C. pneumoniae and 31 have a best hit to C. trachomatis, using the criteria stated above (E value of 1e-03 or better and E value 100 times or more smaller than the next best hit).

When U. urealyticum proteins are compared to B. burgdorferi and T. pallidum proteins, 101 have a best hit to B. burgdorferi and 35 have a best hit to T. pallidum, using the criteria stated above (E value of 1e-03 or better and E value 100 times or more smaller than the next best hit).

PARALOGS

The original concept of paralog contained the notion that functional differences would evolve after duplication -- that paralogs were not orthologs. In the absence of biochemical data, sequence analysis alone cannot establish that paralogous proteins differ in function; hence the word paralog is used in a loose sense herein and in the literature to denote similar proteins that are not thought to be orthologous. When duplicated genes with the same activity are being discussed, we will try to use the term isolog, leaving the term ortholog for proteins in different genomes. When paralog is used in a strict sense, different functional activities are implied.