IMPLEMENTATION
We use the PEDANT software suite (4) to annotate large numbers of protein sequences with a carefully selected set of established bioinformatics methods. Exhaustive functional characterization of protein sequences includes similarity searches against the entire non-redundant sequence database, detection of motifs and patterns, automatic assignment of genes to functional categories and clusters of orthologous groups (5), similarity-based prediction of enzyme classification, and extraction of keywords and superfamily information. Structural characterization of gene products is based on similarity searches against the Protein Data Bank (PDB) (6), sensitive recognition of structural domains using profile searches, secondary structure prediction, detection of transmembrane regions, and prediction of low-complexity and coiled-coil regions. By design, PEDANT provides protein sequence annotation in its genomic context. The PEDANT genome browser lets the user select functional or structural categories of interest, obtain the list of gene products from a particular organism assigned to a category, and then view detailed information on each protein as an integrated report page. Advanced DNA and protein viewers visualize, respectively, the positions of genes and other genetic elements on the chromosome and the predicted structural and functional information about proteins. The PEDANT annotation can be searched using text queries as well as BLAST (7) and pattern searches.
The PEDANT genome database is produced by systematically applying the automatic annotation pipeline described above to all genomic sequences released into the public domain. The major premises of the PEDANT database are as follows:
Timeliness. The MIPS CPU resources make it possible to process a medium-size prokaryotic genome and make it available online essentially overnight.
Completeness. We seek to process all completely sequenced genomes as well as many incomplete genomes that sequencing centers are making available. In many cases, PEDANT represents the only source of annotation for a given genome.
Standardization. Automatic annotation of sequences follows a clearly defined protocol in terms of the particular set of bioinformatics techniques applied to each sequence and the values of pre-determined recognition thresholds used for individual methods (e.g. BLAST E-values).
Documentation. Since the results of automatic sequence analyses inevitably contain a large number of false assignments, we make available the raw output of each bioinformatics method used. This allows users to make their own judgment on the validity of the functional predictions appearing on each protein's report page.
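The standardization and documentation premises can be illustrated with a minimal Python sketch. It is not PEDANT code: the names `PROTOCOL`, `Hit`, `parse_tabular` and `annotate` are hypothetical, the cutoff of 1e-5 is an arbitrary illustrative value rather than a threshold used by PEDANT, and the input is assumed to be BLAST-style tabular output with 12 tab-separated columns in which the E-value is the 11th column. The sketch applies one fixed, pre-determined cutoff to every hit and returns the unmodified raw output alongside the accepted hits, so that a reader can re-examine borderline assignments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    query: str
    subject: str
    evalue: float
    raw_line: str  # the untouched line from the tool's output


# Hypothetical protocol table: one fixed cutoff per method, shared by all genomes.
# The value is illustrative only and is not PEDANT's actual threshold.
PROTOCOL = {"blast": 1e-5}


def parse_tabular(text):
    """Parse BLAST-style tabular output (assumed 12 columns, E-value in column 11)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[10]), line))
    return hits


def annotate(raw_output, method="blast"):
    """Apply the method's fixed cutoff and keep the raw output for inspection."""
    cutoff = PROTOCOL[method]
    accepted = [h for h in parse_tabular(raw_output) if h.evalue <= cutoff]
    return {
        "method": method,
        "cutoff": cutoff,
        "accepted": accepted,
        "raw": raw_output,  # documentation: the unfiltered output stays available
    }
```

Keeping the cutoff in a single table means every sequence is judged by the same rule, and carrying the raw text through means that hits rejected by the cutoff remain visible to the user.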