M. GENITALIUM GENOME AND PROTEOME PROPERTIES

COORDINATE SYSTEM

The origin of replication for the 580,070 bp circular genome is an AT-rich region between the dnaA and dnaN ORFs. The first coordinate, taken from this region, is approximately 1 kb upstream of the dnaN coding sequence, which is the first gene, MG001. Plus-strand coding sequences tend to lie to the right of MG001 and minus-strand coding sequences to the left (Fraser et al., Science 270:397-403, 1995). MG470, the last ORF in the genome by this system, is transcribed from the minus strand and is a predicted SOJ protein coding sequence; MG469 is the dnaA gene. To view the sequence of this putative origin, click on the MG001 gene in the Gene Image Map, then on IGR1.

G + C CONTENT AND CODON USAGE

The average G + C content of the genome is 32%. The codon usage as presented on the CUTG Web site (http://www.kazusa.or.jp/codon/) (Nakamura, Y., Gojobori, T. and Ikemura, T. (1997) Nucl. Acids Res. 25, 244-245.) is as follows (based on 176662 codons):

fields: [triplet] [frequency: per thousand] ([number])

    UUU 52.2(  9217)  UCU 12.5(  2209)  UAU 23.9(  4228)  UGU  6.5(  1140) 
    UUC  8.3(  1462)  UCC  4.1(   721)  UAC  8.3(  1467)  UGC  1.6(   283) 
    UUA 49.7(  8772)  UCA 16.5(  2909)  UAA  2.0(   352)  UGA  6.3(  1109) 
    UUG 14.1(  2493)  UCG  1.1(   199)  UAG  0.7(   129)  UGG  3.4(   604) 

    CUU 19.8(  3502)  CCU 14.8(  2616)  CAU 10.3(  1812)  CGU  7.0(  1242) 
    CUC  5.0(   890)  CCC  3.8(   669)  CAC  5.6(   981)  CGC  3.1(   545) 
    CUA 12.6(  2228)  CCA 11.1(  1959)  CAA 38.3(  6773)  CGA  1.3(   235) 
    CUG  4.4(   772)  CCG  1.0(   170)  CAG  8.9(  1576)  CGG  1.0(   179) 

    AUU 51.0(  9005)  ACU 25.4(  4485)  AAU 45.7(  8063)  AGU 26.0(  4584) 
    AUC 17.9(  3163)  ACC 10.5(  1859)  AAC 29.2(  5158)  AGC  6.7(  1186) 
    AUA 12.5(  2215)  ACA 16.6(  2932)  AAA 69.9( 12343)  AGA 14.2(  2506) 
    AUG 15.2(  2679)  ACG  1.7(   293)  AAG 24.4(  4306)  AGG  4.6(   819) 

    GUU 37.7(  6657)  GCU 27.4(  4845)  GAU 42.4(  7486)  GGU 23.0(  4057) 
    GUC  3.5(   617)  GCC  4.1(   727)  GAC  6.9(  1216)  GGC  5.0(   885) 
    GUA 13.2(  2327)  GCA 21.3(  3765)  GAA 45.3(  8005)  GGA 11.6(  2042) 
    GUG  7.2(  1272)  GCG  2.6(   468)  GAG 11.2(  1985)  GGG  7.0(  1229)
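
The per-thousand column can be recomputed from the counts and the stated total of 176662 codons. The short Python sketch below does this for a handful of codons copied from the table; the function and variable names are illustrative and are not part of the CUTG site.

```python
# Counts for a few codons, copied from the table above.
TOTAL_CODONS = 176662

counts = {
    "UUU": 9217, "UUC": 1462,   # Phe
    "AAA": 12343, "AAG": 4306,  # Lys
    "UGA": 1109, "UGG": 604,    # UGA and UGG as listed in the table
}


def per_thousand(count, total=TOTAL_CODONS):
    """Return the codon frequency per thousand codons, to one decimal."""
    return round(1000.0 * count / total, 1)


frequencies = {codon: per_thousand(n) for codon, n in counts.items()}

# Share of Phe codons ending in U rather than C, as a quick view of the
# A + T bias at the third codon position.
phe_u_share = counts["UUU"] / (counts["UUU"] + counts["UUC"])

for codon, freq in frequencies.items():
    print(f"{codon} {freq:5.1f} ({counts[codon]:6d})")
print(f"Phe codons ending in U: {phe_u_share:.1%}")
```

The recomputed values agree with the table to the one decimal place shown.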

GENERAL FEATURES OF THE PROTEOME

Current analyses of this database show 303 identifiable proteins and 88 conserved but uncharacterized proteins; the remaining ORFs are hypothetical. Although many bacterial sequences have been added to GenBank since 1995, when Fraser and coworkers first analyzed M. genitalium, these numbers have not changed significantly. The original assessment of ORFs made use of GeneMark and BLAST. Taking the GenBank deposited sequences as our starting point, we analyzed each protein for similarities outside the genome (orthologs) using gapped BLAST and Psi-BLAST, for similarities within the genome (isologs and paralogs) using gapped BLAST, and for similarities to COGs (clusters of orthologous groups) and to databases that emphasize domains, namely BLOCKS and ProDom (see Documentation).

Of the 303 identified proteins, 140 have been assigned an EC number. The average molecular weight (length) of the 470 proteins is 41.6 kDa (364 aa); the median molecular weight (length) is 33.7 kDa (294 aa). The average and median estimated pI values are 9.21 and 9.81, respectively (hence M. genitalium is the most basic and C. trachomatis the most acidic of the three bacteria currently under study).
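
As a quick consistency check on these figures, the molecular weights and lengths can be converted to a mass per residue. This is only a sketch using the numbers quoted above; the function name is illustrative, and a ratio of averages (or of medians) is not the same as a true per-protein average.

```python
def mean_residue_mass_da(mass_kda, length_aa):
    """Return the mass per residue in daltons for a protein of this size."""
    return 1000.0 * mass_kda / length_aa


# Figures quoted in the paragraph above.
mean_based = mean_residue_mass_da(41.6, 364)    # average weight / average length
median_based = mean_residue_mass_da(33.7, 294)  # median weight / median length

# Fraction of identified proteins that carry an EC number.
ec_fraction = 140 / 303

print(f"Da per residue (from averages): {mean_based:.1f}")
print(f"Da per residue (from medians):  {median_based:.1f}")
print(f"Identified proteins with an EC number: {ec_fraction:.1%}")
```

Both ratios come out between 114 and 115 Da per residue, close to the rule-of-thumb figure of roughly 110 Da, so the weights and lengths are mutually consistent.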

[Figure: molecular weight distribution for M. genitalium proteins; axis labeled "x 1000 KDa"]

Of the 470 predicted proteins, 32 appear to be secreted, 116 have transmembrane domains, and 128 have coiled-coil regions, as determined by a suite of structural analysis programs in the SEALS package available from NCBI/NIH. Of the 32 proteins predicted to have a signal peptide, 3 had a best hit to E. coli whereas 8 had a best hit to B. subtilis (by the criteria stated in the following paragraph). Curiously, most of the secreted proteins were not particularly similar to either B. subtilis or E. coli proteins. Nevertheless, signal peptide prediction for M. genitalium was based solely on the B. subtilis (gram-positive) model. Of the 470 proteins, 101 have similarity to a PDB database sequence using BLAST2, where a hit requires an E value better (lower) than 0.0001. M. Huynen has shown a higher fraction to be homologous to PDB sequences using Psi-BLAST and filtering (http://www.bork.embl-heidelberg.de/3d/nph-p2-3d).

When M. genitalium proteins are compared to E. coli and B. subtilis proteins using BLAST2, 34 are found to have a better hit to E. coli proteins, whereas 256 have a better hit to B. subtilis proteins, consistent with the finding that mycoplasmas are phylogenetically related to gram-positive bacteria. In this analysis, the E value for the best hit to one genome must be at least 100 times smaller (better) than the E value for the best hit to the second genome. Also, only E values of 1e-03 and lower are considered. Note that approximately 101 of the 391 identified and conserved M. genitalium proteins do not have a best hit to one of these bacteria by these criteria. Best hits for E. coli and B. subtilis fall to 21 and 216 when archaeal and eukaryotic sequences are included in the study, representing 22 and 29 hits, respectively.
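
The best-hit rule described above can be sketched in Python as follows. This is an illustration, not the code used for the analysis: the function name and the labels "A" and "B" are hypothetical, and it assumes that a hit with an E value above 1e-03 is treated as if there were no hit at all, which the text does not state explicitly.

```python
E_VALUE_CUTOFF = 1e-3   # only E values of 1e-03 and lower are considered
MIN_RATIO = 100.0       # winner must be at least 100 times smaller


def preferred_genome(e_a, e_b, cutoff=E_VALUE_CUTOFF, min_ratio=MIN_RATIO):
    """Return "A", "B", or None for one query protein.

    e_a and e_b are the best E values against genomes A and B,
    or None when there was no hit.
    """
    # Assumption: hits above the cutoff are discarded before comparison.
    a = e_a if e_a is not None and e_a <= cutoff else None
    b = e_b if e_b is not None and e_b <= cutoff else None

    if a is None and b is None:
        return None
    if b is None:
        return "A"
    if a is None:
        return "B"
    if a == b:
        return None  # tie, including two E values of 0.0
    if a * min_ratio <= b:
        return "A"
    if b * min_ratio <= a:
        return "B"
    return None  # both hit, but neither is 100 times better


print(preferred_genome(1e-50, 1e-10))  # A is clearly better
print(preferred_genome(1e-10, 1e-11))  # only 10 times apart, so no preference
print(preferred_genome(None, 1e-5))    # only B has a qualifying hit
```

Proteins for which the function returns None correspond to those counted above as having no best hit to either bacterium.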

When M. genitalium proteins are compared to T. pallidum and C. trachomatis proteins, 110 have a best hit to C. trachomatis and 100 have a best hit to T. pallidum, using the criteria stated above (an E value of 1e-03 or better that is also at least 100 times smaller than the next best hit). This trend of no preferential similarity holds for secreted proteins, with 4 best hits to T. pallidum and 5 to C. trachomatis (out of 33 M. genitalium secreted proteins).

PARALOGS

The original concept of paralog contained the notion that functional differences would evolve after duplication, that is, that paralogs were not orthologs. In the absence of biochemical data, sequence analysis alone cannot establish that paralogous relationships are based on different functions; hence the word paralog is used in a loose sense herein and in the literature to denote similar proteins that are not thought to be orthologous. When duplicated genes with the same activity are being discussed, we will try to use the term isolog, leaving the term ortholog for proteins in different genomes. When paralog is used in a strict sense, different functional activities are implied. Of the 470 predicted proteins in M. genitalium, 138 have similarity to one or more other proteins in the organism, using a BLAST2 cutoff E value of 1e-03. Similarities are reported in the Paralog Field of each gene without any distinction between paralogy and isology.
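
The within-genome similarity count can be illustrated with a short Python sketch. It assumes the all-against-all BLAST results are available as (query, subject, E value) tuples, that self-hits are ignored, and that similarity is recorded symmetrically; none of these details is stated in the text, and the gene names and E values below are made-up placeholders rather than real M. genitalium data.

```python
PARALOG_CUTOFF = 1e-3  # BLAST2 cutoff E value quoted in the text


def internal_similarities(hits, cutoff=PARALOG_CUTOFF):
    """Map each protein to the set of other proteins it resembles.

    hits is an iterable of (query, subject, e_value) tuples from a
    comparison of the genome's proteins against themselves.
    """
    similar = {}
    for query, subject, e_value in hits:
        if query == subject:
            continue  # a protein always matches itself
        if e_value <= cutoff:
            # Assumption: record the relationship in both directions.
            similar.setdefault(query, set()).add(subject)
            similar.setdefault(subject, set()).add(query)
    return similar


# Hypothetical hits for illustration only.
example_hits = [
    ("geneA", "geneA", 0.0),    # self-hit, ignored
    ("geneA", "geneB", 1e-20),  # passes the cutoff
    ("geneC", "geneD", 0.5),    # fails the cutoff
]

similar = internal_similarities(example_hits)
print(sorted(similar))  # proteins with at least one similar partner
print(len(similar))     # the analogue of the count of 138 reported above
```

As in the Paralog Field, this makes no distinction between paralogy and isology; it reports sequence similarity only.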