Documentation

 Table of Contents


Introduction

This page describes the structure and content of the database and links to detailed information about the other tools that are used.

The Description of fields does not provide a complete listing of all fields in the database but rather an explanation of the records that are displayed in the web site. If you want detailed information about the database tables, see Database Schema.

The coordinate system used in the databases is based on a single file that contains the complete genome sequence. Numbering begins with 1 at the first base in this file. For coding sequences, the gene start is the coordinate of the first base of the first codon; for tRNA and rRNA genes, the start is the coordinate of the first base of the individual tRNA or rRNA molecule. Preprocessing or cotranslational events are not considered. When a database gene record contains a start coordinate that is greater than the stop coordinate, the nucleotide sequence presented is the reverse complement of the sequence contained in the genome nucleotide file.
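
The coordinate convention described above can be sketched in Python. This is only an illustration and not code from the database itself: the function names and the toy genome string are assumptions made for the example.

```python
# Illustrative sketch; names are hypothetical, not part of the database.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def feature_sequence(genome: str, start: int, stop: int) -> str:
    """Return the sequence for a record with 1-based, inclusive coordinates.

    When start > stop the record is read from the opposite strand, so the
    reverse complement of the genome segment is returned.
    """
    low, high = min(start, stop), max(start, stop)
    segment = genome[low - 1:high]  # 1-based inclusive -> Python slice
    return reverse_complement(segment) if start > stop else segment


genome = "ATGAAACCCGGGTTT"  # toy sequence, not real data
print(feature_sequence(genome, 1, 6))  # ATGAAA
print(feature_sequence(genome, 6, 1))  # TTTCAT
```

The same start and stop values, swapped, address the same bases; only the order of the two coordinates tells which strand is presented.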

The Analytical Tools are grouped by how they are accessed: web-based tools can be run over the internet, and the others are stand-alone programs. The links here open the help pages for the programs. An identical list is displayed in the left frame, and those links open the tools themselves.

Functional Class Assignments is a list of the classes assigned in this database.

The Database Schema has a complete listing of all the fields in each table of the chosen database.

See the Contact Information to send comments, suggestions, corrections, or concerns.

Back to Table of Contents. 


Description of Fields

Gene Record

Gene ID:
This is a unique, locally assigned ID (identifier) for records in the database, which in some cases will agree with IDs reported in GenBank entries. For the most part, the design of the gene ID follows the standard name feature key (e.g. MG001 for M. genitalium).
DNA Molecule Name:
Name of the molecule (i.e. chromosome or plasmid).
GenBank ID:
Unique ID assigned by GenBank when a sequence has been submitted to the database.
BGene ID:
Gene ID that references another database's naming scheme. Not always used.
Definition:
Description of the predicted gene's function.
Gene Name:
Usually a three- or four-character name. Duplicate and triplicate names are common, and not all genes have assigned names.
Gene Start:
The coordinate of the first nucleotide of the first amino acid in the predicted protein.
Gene Stop:
The coordinate of the last nucleotide of the codon preceding the predicted stop codon. The stop codon is not the gene stop.
Gene Length:
The length of the nucleotide coding sequence, from the first base of the start codon to the last base of the codon preceding the stop codon. Calculated as abs(gene stop - gene start) + 1. The stop codon is not included in the length.
Molecular Weight:
The molecular weight of the protein, calculated from the protein sequence. Molecular weights that have been determined experimentally are noted in the comments.
pI:
The pH at which the net charge of the protein is zero, calculated from the protein sequence using the isoelectric command of the GCG package.
Net Charge:
The net charge of the protein in a pH 7.0 environment.  This is calculated from the protein sequence using the isoelectric command of the GCG package.
EC:
Enzyme Commission (EC) numbers refer to enzymatic steps. These numbers are determined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Since they refer to enzymatic steps and not to proteins per se, one enzyme can be assigned more than one EC number.
Functional Class:
Classification of the proposed cellular function. A list of categories used for the bacterial genomes is provided. The format for functional class is a broad class followed by a semicolon and then a more specific class (similar to pathway).
Pathway:
The name of the pathway in which the protein is thought to participate. The format of this field is similar to that of the functional class field. Each pathway field can contain, for each main category, the main category followed by a semicolon and a sub-category followed by a semicolon.
Primary Laboratory Evidence:
References to experimental lab work pertaining to the sequence itself or to orthologs in the same genus.
Secondary Laboratory Evidence:
References to experimental lab work pertaining to sequences highly similar to the sequence but not to organisms in the same genus.
Comment:
A text field where any specific details about the record can be placed.
Blast Summary:
Summary of results from the sequence alignment tool PSI-BLAST.
COGs Summary:
Summary of the results of the COG (Clusters of Orthologous Groups) analysis.  Scope of relatedness is given in the phylogenetic pattern; best hits are the subset of relations that best match this pattern.
Blocks Summary:
Summary of results obtained from Blocks, a sequence analysis tool for finding ungapped segments corresponding to the most highly conserved regions of proteins.
ProDom Summary:
Summary of results obtained from the protein domain database search tool, ProDom.
Paralogs:
A term that has been used equivocally to denote 1) similar sequences that have arisen through duplication prior to diversification; different functions are presupposed; 2) similar sequences in a single organism that in some instances would be better termed isologs. In the absence of biochemical information, paralogs in this database are homologs that are not obviously orthologs.
Pfam Summary:
Summary of results obtained from the protein family database, Pfam.
Structural Feature(s):
Structural features are predicted using PHD, SignalP, SEG, and Coils. These are stored in a custom structure within one field in the database.
PDB Hit:
Results from BLAST hits to  sequences in PDB (Protein Data Bank). Use PDB to search for 3-D macromolecular structures.
Gene Protein Sequence:
The amino acid sequence from the first codon to the codon preceding the stop codon.
Gene Nucleotide Sequence:
The nucleotide sequence from the first base of the start codon to the last base of the codon preceding the stop codon.
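
Two of the field conventions above lend themselves to a short sketch: the Gene Length formula and the "broad class; specific class" layout of the Functional Class field. The function names and the sample values below are hypothetical and chosen only for illustration.

```python
def gene_length(gene_start: int, gene_stop: int) -> int:
    """Length of the coding sequence: abs(gene stop - gene start) + 1.

    The stop codon is excluded because the gene stop already points at the
    last base of the codon preceding it.
    """
    return abs(gene_stop - gene_start) + 1


def split_functional_class(value: str) -> tuple[str, str]:
    """Split 'broad class; specific class' at the first semicolon."""
    broad, _, specific = value.partition(";")
    return broad.strip(), specific.strip()


# Made-up coordinates: the result is the same on either strand.
print(gene_length(100, 399))  # 300
print(gene_length(399, 100))  # 300

# Made-up field value that follows the documented layout.
print(split_functional_class("Broad class; Specific class"))
```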



Intergenic Space Record

IGS ID:
Unique identifier assigned by the database. Always begins with IGR (intergenic region) and is followed by a number. The number is a relative position marker, that is to say that IGR1 is closer to the first base in the genome nucleotide file than is IGR2. IGR numbers do not correspond with gene IDs.
DNA Molecule Name:
Name of the molecule (i.e. chromosome or plasmid).
IGS Start:
The first base of the intergenic region on the plus or top strand.
IGS Stop:
The last base of the intergenic region on the plus or top strand.
Features:
Description of structural features contained in an IGS, such as tRNA and rRNA genes.
Comment:
A text field where any specific detail about the record can be placed.
IGS Nucleotide sequence:
The nucleotide sequence of the intergenic region. The sequence represents the plus or top strand since there is no directionality associated with an intergenic region. Features within an intergenic region may have directionality; this information is stored in another table. For example the directionality of a tRNA molecule would be stored in the tRNA table.



tRNA Record

tRNA ID:
This is the primary key in the tRNA table. An example of a tRNA_ID is tRNA-Arg-4, which identifies the fourth arginine tRNA in the genome. Ordering is by start coordinate, with tRNA-Arg-1 having the smallest start coordinate of any of the arginine tRNAs.
DNA Molecule Name:
Name of the molecule (i.e. chromosome or plasmid).
tRNA Start:
The first base of the predicted mature tRNA molecule.  If the start is greater than the stop, then the gene is on the reverse or bottom strand.
tRNA Stop:
The last base of a predicted tRNA molecule.
IGS ID:
Unique ID that corresponds to one entry in the IGS table. It always begins with IGR (intergenic region) and is followed by a number. The number is a relative position marker, that is to say that IGR1 is closer to the first base in the genome nucleotide file than is IGR2.
Anticodon:
The three letter nucleotide sequence of the tRNA molecule which acts as the anticodon.
tRNA Nucleotide Sequence:
Nucleotide sequence of the tRNA.
Comment:
A text field where any specific detail about the record can be placed.
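
The tRNA ID numbering rule (ordering by start coordinate within each amino acid) can be sketched as follows. The input layout, a list of (amino acid, start) pairs, and the function name are assumptions made for this example; the database's own procedure is not documented here.

```python
from collections import defaultdict


def assign_trna_ids(trnas):
    """Map each (amino_acid, start) pair to an ID such as 'tRNA-Arg-4'.

    Within each amino acid, numbers follow ascending start coordinate, so
    the tRNA with the smallest start coordinate gets number 1.
    """
    counters = defaultdict(int)
    ids = {}
    for amino_acid, start in sorted(trnas, key=lambda item: item[1]):
        counters[amino_acid] += 1
        ids[(amino_acid, start)] = f"tRNA-{amino_acid}-{counters[amino_acid]}"
    return ids


# Made-up coordinates for illustration only.
example = [("Arg", 5000), ("Gly", 100), ("Arg", 1200)]
for key, trna_id in assign_trna_ids(example).items():
    print(key, trna_id)
```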



rRNA Record

rRNA ID:
This is the primary key in the rRNA table. The syntax of the ID is a number (which is the standard weight in terms of sedimentation properties) followed by S followed by rRNA followed by _ followed by a number (this number is used to distinguish rRNA molecules of the same weight). For example, 16SrRNA_1. The last two characters (_1) may be omitted if there is only one rRNA operon in the genome, as is the case in Mycoplasma genitalium.
DNA Molecule Name:
Name of the molecule (i.e. chromosome or plasmid).
rRNA Start:
The first base of the rRNA gene. If the start coordinate is less than the stop coordinate, the gene is encoded on the plus or top strand. The plus or top strand is defined by the primary sequence of the genome as submitted to GenBank.
rRNA Stop:
The last base of the rRNA gene.
IGS ID:
Unique identifier assigned by the database. Always begins with IGR (intergenic region) and is followed by a number. The number is a relative position marker, that is to say that IGR1 is closer to the first base in the genome nucleotide file than is IGR2.
rRNA Nucleotide Sequence:
Nucleotide sequence of the rRNA.
Comment:
A text field where any specific details about the record can be placed.
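
The rRNA ID syntax described above (a number, then S, then rRNA, then an optional _number) can be checked with a regular expression. The pattern below is inferred from the prose; allowing a decimal value such as 5.8 is an assumption, and the function name is hypothetical.

```python
import re

# <number>S + "rRNA" + optional "_<number>"; the decimal part is an assumption.
_RRNA_ID = re.compile(r"^(\d+(?:\.\d+)?)SrRNA(?:_(\d+))?$")


def parse_rrna_id(rrna_id: str):
    """Return (sedimentation_value, copy_number) or None if the ID does not match.

    copy_number is None when the trailing '_<number>' is omitted, as it may
    be when the genome has a single rRNA operon.
    """
    match = _RRNA_ID.match(rrna_id)
    if match is None:
        return None
    copy_number = int(match.group(2)) if match.group(2) else None
    return match.group(1), copy_number


print(parse_rrna_id("16SrRNA_1"))  # ('16', 1)
print(parse_rrna_id("23SrRNA"))    # ('23', None)
```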



Repeat Record

Repeat Name:
Unique identifier assigned and used by the database.
Repeat Type:
A description of the type of repeat (e.g. tandem, inverted, or direct).
DNA Molecule Name:
Name of the molecule (i.e. chromosome or plasmid).
Repeat Unit Coordinates
Start:
Start coordinate for each unit of the repeat.
Stop:
Stop coordinate for each unit of the repeat.
Comment:
A text field where any specific details about the record can be placed.



Analytical Tools

Genome Alignment

Bugspray is an application that creates a graphical representation of an alignment of two genomes. The alignment is based on similar genes. It is often assumed that similarity represents homology. The tool being used to measure similarity is BLAST2 (the cutoff for a best hit is p = 0.0001). The program connects the start of a protein from the query genome with the respective start of the best hit from the subject genome. A green line represents a pair of genes which are found on strands of the same sense, and a red line represents a pair of genes which are found on strands of the opposite sense.
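
The colouring rule can be sketched in a few lines. This is not Bugspray's code: it assumes that strand is inferred from the coordinate convention used elsewhere on this page (start greater than stop means the reverse strand), and the function names are hypothetical.

```python
def strand(start: int, stop: int) -> str:
    """Infer strand from coordinates: start > stop means the reverse strand."""
    return "-" if start > stop else "+"


def line_colour(query_gene, subject_gene) -> str:
    """Return 'green' for genes on strands of the same sense, 'red' otherwise.

    Each gene is a (start, stop) pair; the line itself would connect the
    two start coordinates.
    """
    same_sense = strand(*query_gene) == strand(*subject_gene)
    return "green" if same_sense else "red"


# Made-up coordinates for a query gene and its best hit.
print(line_colour((10, 100), (500, 900)))  # green
print(line_colour((10, 100), (900, 500)))  # red
```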



Local BLAST

Local BLAST Search is a regular BLAST search program performed against our local databases at the Los Alamos National Laboratory (LANL), rather than at the National Center for Biotechnology Information (NCBI). In addition to the nr and nt databases which are downloaded from NCBI monthly, our local databases also include many bacterial and viral databases located at LANL. Local BLAST Search allows BLAST searches against the same genome for paralogs as well as against any specific bacterial or viral database of interest. Click Here for help on general BLAST searching.



PSI-BLAST

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs designed to explore all of the available sequence databases regardless of whether the query is protein or DNA. For literature references please see Goodman L 1997 "More blast for the buck." Genome Research. 7:858-859 and Altschul SF et al., 1997 "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucl. Acids Res. 25:3389-3402.

PSI-BLAST searches are iterated with a position-specific scoring matrix. The program seeks to identify single gapped alignments rather than a collection of ungapped alignments. The matrix used in iteration i+1 is computed from the significant alignments found in iteration i. The success of iterative BLAST searching therefore depends on the quality of the matrix produced in the previous iteration, which in turn depends on the homologous nature of the set of sequences that match the query with an E-value better than some threshold. The sequences used to generate the matrix are weighted according to Henikoff S and Henikoff JG 1994 J. Mol. Biol. 216:813-818, so that sequences in the set that have high similarities are not weighted as much as those from a smaller set of more divergent sequences.
 

Blastpgp arguments and argument values most commonly used.

Argument  Description                                               Value
-v        Number of one-line descriptions to display (default 250)  10
-b        Number of alignments to display (default 250)             10
-m        Alignment view (default 0)                                3
-I        Show GIs in the defline (default F)                       T
-a        Number of processors to use (default 1)                   2
-F        Filter query sequence with SEG (default F)                T
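
As a rough sketch, the argument values in the table above can be assembled into a command line from Python. Only the flags listed in the table are used; the query and database options that a real run would also need are left out because this page does not specify them. The variable and function names are hypothetical.

```python
# Flag/value pairs copied from the table; nothing else is assumed.
BLASTPGP_OPTIONS = {
    "-v": "10",  # one-line descriptions to display
    "-b": "10",  # alignments to display
    "-m": "3",   # alignment view
    "-I": "T",   # show GIs in the defline
    "-a": "2",   # processors to use
    "-F": "T",   # filter query sequence with SEG
}


def blastpgp_arguments(options=BLASTPGP_OPTIONS):
    """Build an argument list suitable for passing to subprocess.run()."""
    args = ["blastpgp"]
    for flag, value in options.items():
        args.extend([flag, value])
    return args


print(" ".join(blastpgp_arguments()))
# blastpgp -v 10 -b 10 -m 3 -I T -a 2 -F T
```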



COGs

COGs stands for Clusters of Orthologous Groups of proteins. The proteins that comprise each COG are assumed to have evolved from an ancestral protein, and are therefore either orthologs or paralogs. COGs were delineated by comparing protein sequences encoded in 21 complete genomes, representing 17 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

Understanding COG analysis involves two basic issues: 1) how the COG database has been built, and 2) how one uses this database for annotation. The database was built by pairwise comparison of the 43,897 proteins in the 21 complete genomes listed in the following table. For each protein, the best hit (BeT) in each of the other genomes was detected. A COG is then defined by a relationship of BeTs. To use the database, an unknown sequence is compared by BLAST against the set of all genomes in the COGs database, looking for the case in which the unknown sequence has BeTs to more than one member of the COG.

A phylogenetic pattern is a series of lowercase letters, uppercase letters, and/or dashes that is a shorthand representation of the presence or absence of proteins from a particular organism in the COG of interest. Each letter in a pattern represents a particular organism, given in the table below, along with the pattern position assigned to that organism. Uppercase letters indicate that at least two orthologs belong to that COG.

Organism Name and Abbreviation
Organism Name Code
Archaeoglobus fulgidus a
Methanococcus jannaschii m
Methanobacterium thermoautotrophicum t
Pyrococcus horikoshii k
Saccharomyces cerevisiae y
Aquifex aeolicus q
Thermotoga maritima v
Synechocystis sp. PCC6803 c
Escherichia coli e
Bacillus subtilis b
Mycobacterium tuberculosis r
Haemophilus influenzae h
Helicobacter pylori 26695 u
Helicobacter pylori J99 j
Mycoplasma genitalium g
Mycoplasma pneumoniae p
Borrelia burgdorferi o
Treponema pallidum l
Chlamydia trachomatis i
Chlamydia pneumoniae n
Rickettsia prowazekii x

The phylogenetic pattern, -----qvcE-------o---x, for example, would indicate that Aquifex aeolicus, Thermotoga maritima, Synechocystis sp. PCC6803, Borrelia burgdorferi and Rickettsia prowazekii have one ortholog which belongs to the COG and Escherichia coli has at least two that belong to the COG.
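
Decoding a phylogenetic pattern is mechanical, so a small sketch may help. It assumes that pattern positions follow the order of the organism table above; the function name and the wording of the returned labels are choices made for this example.

```python
# Organism codes in table order; position in the pattern follows this order.
ORGANISM_CODES = [
    ("a", "Archaeoglobus fulgidus"),
    ("m", "Methanococcus jannaschii"),
    ("t", "Methanobacterium thermoautotrophicum"),
    ("k", "Pyrococcus horikoshii"),
    ("y", "Saccharomyces cerevisiae"),
    ("q", "Aquifex aeolicus"),
    ("v", "Thermotoga maritima"),
    ("c", "Synechocystis sp. PCC6803"),
    ("e", "Escherichia coli"),
    ("b", "Bacillus subtilis"),
    ("r", "Mycobacterium tuberculosis"),
    ("h", "Haemophilus influenzae"),
    ("u", "Helicobacter pylori 26695"),
    ("j", "Helicobacter pylori J99"),
    ("g", "Mycoplasma genitalium"),
    ("p", "Mycoplasma pneumoniae"),
    ("o", "Borrelia burgdorferi"),
    ("l", "Treponema pallidum"),
    ("i", "Chlamydia trachomatis"),
    ("n", "Chlamydia pneumoniae"),
    ("x", "Rickettsia prowazekii"),
]


def decode_pattern(pattern: str) -> dict[str, str]:
    """Map each organism present in the pattern to 'one' or 'at least two'."""
    if len(pattern) != len(ORGANISM_CODES):
        raise ValueError("pattern must have one symbol per organism")
    present = {}
    for symbol, (code, organism) in zip(pattern, ORGANISM_CODES):
        if symbol == "-":
            continue  # dash: no protein from this organism in the COG
        if symbol.lower() != code:
            raise ValueError(f"unexpected symbol {symbol!r} for {organism}")
        present[organism] = "at least two" if symbol.isupper() else "one"
    return present


for organism, count in decode_pattern("-----qvcE-------o---x").items():
    print(f"{organism}: {count}")
```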

COG Functional Class Abbreviations
Information storage and processing
J Translation, ribosomal structure and biogenesis
K Transcription
L DNA Replication, recombination, and repair
Cellular processes
D Cell division and chromosome partitioning
M Cell envelope biogenesis, outer membrane
N Cell motility and secretion
O Posttranslational modification, protein turnover, chaperones
P Inorganic ion transport and metabolism
T Signal transduction mechanisms
Metabolism
C Energy production and conversion
E Amino acid transport and metabolism
F Nucleotide transport and metabolism
G Carbohydrate transport and metabolism
H Coenzyme transport and metabolism
I Lipid metabolism
Poorly characterized proteins
R General function prediction only
S Function unknown

For further information see Tatusov RL, Koonin EV, and Lipman DJ.  1997 "A genomic perspective on protein families."  Science. 278:631-637.



ProDom

ProDom (protein domain database) has been designed as a tool to help analyze domain arrangements of proteins and protein families. It consists of an automatic compilation of homologous domains. Current versions of ProDom are built using a novel procedure based on recursive PSI-BLAST searches (Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W & Lipman DJ, 1997, Nucleic Acids Res., 25:3389-3402; Gouzy J., Corpet F. & Kahn D., 1999, Computers and Chemistry 23:333-340.) Large families are much better processed with this new procedure than with the former DOMAINER program (Sonnhammer, E.L.L. & Kahn, D., 1994, Protein Sci., 3:482-492).



Blocks

Blocks are multiply aligned ungapped segments corresponding to the most highly conserved regions of proteins. A database of these blocks has been built and a query sequence is compared for local similarities within the sequence to a block in the database.  Local and global alignments are scored independently so that they can be used in concert to infer homology.

For more information, see Henikoff S and Henikoff JG 1994 "Protein family classification based on searching a database of blocks." Genomics 19:97-107 and Henikoff S, Henikoff JG, Alford WJ, and Pietrokovski S 1995 "Automated construction and graphical presentation of protein blocks from unaligned sequences." Gene-COMBIS. Gene 163 (1995) GC 17-26.



Pfam

Pfam is a database of multiple alignments of protein domains or conserved protein regions. The alignments represent some evolutionarily conserved structure which has implications for the protein's function. Profile hidden Markov models (profile HMMs) built from the Pfam alignments can be very useful for automatically recognizing that a new protein belongs to an existing protein family, even if the homology is weak. Unlike standard pairwise alignment methods (e.g. BLAST, FASTA), Pfam HMMs deal sensibly with multidomain proteins.

Pfam-Pro is a new procaryotic protein family database from TimeLogic. It consists of hidden Markov models of protein domains or conserved protein regions. The models in Pfam-Pro have been built from Pfam alignments to a number of completed procaryotic genomes. Unlike Pfam, the models in Pfam-Pro are trained exclusively on procaryotes, and may therefore show an increased selectivity on other procaryotes.

For more information see The Pfam protein families database. A. Bateman, E. Birney, R. Durbin, S.R. Eddy, K.L. Howe, and E.L.L. Sonnhammer Nucleic Acids Research, 28:263-266, 2000.



PDB

PDB server is used to predict protein 3-D structure based on homologous sequence searching. It uses a version of NRDB that includes all the PDB entries (excluding the BRK_MOD sequences and sequences only containing 'X's). Sequences are compared to this database with PSI-BLAST (Altschul et al, Nucl. Acids Res., 1997), using an e-value cutoff of 0.001, and a maximum of five iterations.

For more information see H.M.Berman, J.Westbrook, Z.Feng, G.Gilliland, T.N.Bhat, H.Weissig, I.N.Shindyalov, and P.E.Bourne. 2000. The Protein Data Bank. Nucleic Acids Research, 28, 235-242 and M.Huynen, T.Doerks, F.Eisenhaber, C.Orengo, S.Sunyaev, Y.P.Yuan, and P.Bork. 1998. Homology-based fold prediction for Mycoplasma genitalium proteins. J. Mol. Biol. 280, 323-326.



Entrez

Entrez is NCBI's search and retrieval system. With Entrez, one can search DNA and protein sequence databases, complete genomes, 3-D protein structures, population sequences and literature.



SEALS

The SEALS package is designed specifically for large-scale research projects in bioinformatics.  It is based on a friendly command line interface in the UNIX environment.  It is scalable and provides dozens of commands which allow the user to quickly answer complex questions.  While the data presented in the STD database is based on specific analysis tools described below, the SEALS package has been invaluable in the linking of various tools, the parsing of the resulting data, and the retrieval of data from standard databases.

For more information on the SEALS package, see Walker, DR, and Koonin, EV (1997) SEALS: A System for Easy Analysis of Lots of Sequences. Intelligent Systems for Molecular Biology 5:333-339.



SIGNALP


See Nielsen H, Engelbrecht J, Brunak S, and von Heijne G (1997) "Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites." Protein Engineering 10:1-6. For a review of signal prediction methods, see Claros MG, Brunak S, and von Heijne G (1997) "Prediction of N-terminal protein sorting signals." Current Opinion in Structural Biology 7:394-398.

SignalP is an application of neural networks to the problem of identifying protein sorting signals and predicting their cleavage sites. This is possible because these functional units are encoded by linear sequences of amino acids rather than by a 3D structure. Reported performance values are presented below in a table reproduced from the Nielsen et al. reference given in the preceding paragraph.
 
 

Source      Total number of proteins   Cleavage site location (% correct)   Signal peptide discrimination (correlation)*
Eukaryote   1831                       70.2                                 0.97
Gram -      452                        79.3                                 0.88
Gram +      205                        67.9                                 0.96
* The ability of the method to distinguish between signal peptides and the N-terminals of nonsecretory proteins is measured by the correlation coefficient (Matthews 1975 Biochim Biophys Acta 405:442-451).



Psort

PSORT is a computer program for the prediction of protein localization sites in cells. It takes an amino acid sequence and its source origin, e.g., Gram-negative bacteria, as inputs. It then analyzes the input sequence by applying stored rules for various sequence features of known protein sorting signals. Finally, it reports the possibility that the input protein is localized at each candidate site, with additional information. For more help on PSORT, read the Psort Users' Manual.



PHD


See Rost, B (1996) "PHD: predicting one-dimensional protein structure by profile-based neural networks." Methods Enzymol 266:525-539.



COILS


See Lupas, A (1996) "Prediction and analysis of coiled coil structures."   Methods Enzymol 266: 513-525.



SEG


See Wootton, JC, and Federhen, S (1996) "Analysis of compositionally biased regions in sequence databases." Methods Enzymol 266:554-571.



tRNAscan-SE


See Lowe, T.M. & Eddy, S.R. (1997)  "tRNAscan-SE: A program for improved detection of transfer RNA genes in genomic sequence."  Nucl Acids Res 25: 955-964.  tRNAscan-SE identifies tRNA genes in genomic DNA sequences (as well as in RNA sequences).  The program uses a modified, optimized version of tRNAscan v1.3 (Fichant & Burks, J. Mol. Biol. 1991, 220: 659-671), a new implementation of a multistep weight matrix algorithm for identification of eukaryotic tRNA promoter regions (Pavesi et al., Nucl. Acids Res. 1994, 22: 1247-1256), as well as the RNA covariance analysis package Cove v.2.4.2 (Eddy & Durbin, Nucl. Acids Res. 1994, 22: 2079-2088). 



Functional Class Assignments - Bacterial



Database Schema

This database was created using MySQL, a freely distributed SQL (Structured Query Language) database server. Choose the organism from the selection list, and the "List Fields" button will retrieve a list that contains all the fields for all of the tables.




Contact Information

For comments or questions, please contact the Help Desk.

Bioscience Division, B-N1
Los Alamos National Laboratory
TA-43, HRL-1, MS M888
Los Alamos, NM 87545
 

 


Stdgen Bioinformatics
L O S   A L A M O S   N A T I O N A L   L A B O R A T O R Y

Operated by the University of California for the US Department of Energy

Copyright © 1999 UC