PMC:539964

IMGT/GENE-DB: a comprehensive database for human and mouse immunoglobulin and T cell receptor genes

Abstract

IMGT/GENE-DB is the comprehensive IMGT genome database for immunoglobulin (IG) and T cell receptor (TR) genes from human and mouse, and, in development, from other vertebrates. IMGT/GENE-DB is the international reference for the IG and TR gene nomenclature and works in close collaboration with the HUGO Nomenclature Committee, Mouse Genome Database and genome committees for other species. IMGT/GENE-DB allows a search of IG and TR genes by locus, group and subgroup, which are CLASSIFICATION concepts of IMGT-ONTOLOGY. Short cuts allow the retrieval of gene information by gene name or clone name. Direct links with configurable URLs give access to information usable by humans or programs. An IMGT/GENE-DB entry displays accurate gene data related to genome (gene localization), allelic polymorphisms (number of alleles, IMGT reference sequences, functionality, etc.), gene expression (known cDNAs), and proteins and structures (Protein displays, IMGT Colliers de Perles). It provides internal links to the IMGT sequence databases and to the IMGT Repertoire Web resources, and external links to genome and generalist sequence databases. IMGT/GENE-DB manages the IMGT reference directory used by the IMGT tools for IG and TR gene and allele comparison and assignment, and by the IMGT databases for gene data annotation. IMGT/GENE-DB is freely available at http://imgt.cines.fr.

INTRODUCTION

IMGT/GENE-DB, part of IMGT, the international ImMunoGeneTics information system®, http://imgt.cines.fr (1–4), is the comprehensive IMGT genome database, which has been developed to classify the immunoglobulin (IG) and the T cell receptor (TR) genes from vertebrate species, and to standardize and manage the complex IG and TR gene data knowledge (5) (http://www.bioinfo.de/isb/2003/04/0004/).
The molecular genetics of the IG and TR genes is so complex and unique in the genome of vertebrates (6,7) that a specific gene database was required to manage all their characteristics. Indeed, the synthesis of IG and TR chains involves multigene families from four different gene types: variable (V), diversity (D), joining (J) and constant (C), each one with unique characteristics. These genes are organized in hundreds of cassettes, as in fish, or in large clusters from several hundred kilobases to one (or more) megabase(s), as in mouse and human (6,7). IG and TR genes that belong to the same subgroup may be highly similar in their coding sequence but, at the same time, highly polymorphic (e.g. 13 allelic forms have been sequenced for the human IGHV2-70 gene) (6), with alleles displaying different functionalities. The presence of many pseudogenes in the loci, and the frequency of polymorphisms by gene insertion and deletion in these multigene families, add a further level of complexity (6,7). Although most human IG and TR genes were sequenced and characterized independently from, and before the completion of, the Human Genome Project, the classification and the characterization of the IG and TR genes remain a major challenge in the analysis of the genome. Indeed, the annotations of the IG and TR loci, which in human, for instance, represent ∼6 Mb on chromosomes 2, 7, 14 and 22, are not available through classical genome software, owing to the unique IG and TR gene structure (6,7). At the level of gene expression analysis (e.g. cDNAs), data are even more difficult to interpret, as the mechanisms involved in IG and TR synthesis include DNA rearrangements with large DNA deletions of several hundred kilobases, recombinations, nucleotide deletions and insertions at the rearranged junctions and, for IG, somatic hypermutations. Such somatic mechanisms create an extraordinary diversity of 10^12 different IG and TR per individual (6,7).
Thus, most IG and TR expressed sequences, available in IMGT/LIGM-DB (8) (http://www3.oup.co.uk/nar/database/summary/504), the IMGT sequence database, and in IMGT/3Dstructure-DB, the IMGT 3D structure database (9), show significant nucleotide and amino acid differences, respectively, by comparison with the germline (not rearranged) sequences. IMGT/GENE-DB has been implemented to provide an easy and common access to standardized and expertly annotated IG and TR gene and allele data and knowledge. The first task of IMGT was to define a reference sequence for each individual gene and allele (6,7), based on the IMGT ‘gene’ and ‘allele’ concepts. IMGT/GENE-DB has been developed using Java and cgi programs and has been available on the Web since January 2003. IMGT/GENE-DB, which currently contains human and mouse IG and TR genes, is the international reference for the IG and TR gene nomenclature.

IMGT ‘GENE’ AND ‘ALLELE’ CONCEPTS

The IMGT ‘gene’ and ‘allele’ concepts represent the cornerstone of the IMGT-ONTOLOGY ‘CLASSIFICATION’ concept (10) and of the IMGT/GENE-DB implementation. A gene is a DNA sequence that can be potentially transcribed and/or translated (this definition includes the regulatory elements in 5′ and 3′, and the introns, if present). Instances of the ‘gene’ concept are gene names (10). By extension, orphons and pseudogenes are also instances of the ‘gene’ concept (6,7). The IMGT gene names integrate the main CLASSIFICATION concepts of IMGT-ONTOLOGY: the group, the subgroup, the locus and the chromosomal orphon set (10). All IMGT gene names for human IG and TR genes were approved by the Human Genome Organisation (HUGO) Gene Nomenclature Committee (HGNC) (11) in 1999, and entered in the Genome DataBase GDB (Canada) (12), LocusLink and Entrez Gene at NCBI (USA) (13). An allele is a polymorphic variant of a gene, which is characterized by the mutations of its sequence compared to the gene reference sequence designated as allele *01.
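The gene and allele naming convention described above can be illustrated with a small parser. This is a minimal sketch and not IMGT code: the `AlleleName` type and both function names are hypothetical, and taking the first three characters of a gene name as its locus is a simplifying assumption that may not hold for every name in the nomenclature.

```python
from typing import NamedTuple, Optional

# The seven IG and TR loci named in the article.
LOCI = ("IGH", "IGK", "IGL", "TRA", "TRB", "TRG", "TRD")


class AlleleName(NamedTuple):
    gene: str              # gene name, e.g. "IGHV2-70"
    allele: Optional[str]  # allele number, e.g. "01"; None if absent
    locus: Optional[str]   # e.g. "IGH"; None if the prefix is not a known locus


def parse_allele_name(name: str) -> AlleleName:
    """Split 'GENE*NN' into gene and allele parts (assumed layout)."""
    gene, sep, allele = name.strip().partition("*")
    prefix = gene[:3]
    locus = prefix if prefix in LOCI else None  # assumption: locus is the prefix
    return AlleleName(gene, allele if sep else None, locus)


def is_reference_allele(name: str) -> bool:
    """True when the name designates allele *01, the gene reference sequence."""
    return parse_allele_name(name).allele == "01"
```

For instance, `parse_allele_name("IGHV2-70*01")` separates the gene name `IGHV2-70` from the allele number `01`, and `is_reference_allele` reports whether a name points at the reference allele.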
An IMGT gene or allele name is systematically associated with a species. Each allele is characterized by its functionality and by an IMGT reference sequence (10). The allele functionality, part of the IDENTIFICATION concept of IMGT-ONTOLOGY, has three instances: functional (F), open reading frame (ORF) and pseudogene (P) (10). These instances refer to the V, D and J alleles in their ‘germline’ (non-rearranged) configuration (6,7), and to the C alleles (the configuration of the C genes that do not rearrange is ‘undefined’) (10). An IMGT/GENE-DB allele reference sequence is identified by the IMGT/LIGM-DB accession number, the IMGT gene and allele name, the species, the allele functionality, and the gene core (V-REGION, D-REGION, J-REGION and C-REGION) (10). The sequences of the gene core are extracted from the IMGT/LIGM-DB reference sequences. The IMGT/GENE-DB allele reference sequences are provided in FASTA format with a complete header. For a C-REGION encoded by several exons, each exon is provided separately, together with the complete artificially spliced C-REGION.

IMGT/GENE-DB CONTENT

As of July 2004, IMGT/GENE-DB contained 1375 genes and 2204 alleles from human and mouse: 673 IG and TR genes and 1208 alleles from Homo sapiens, and 702 IG and TR genes and 996 alleles from mouse (most entries from Mus musculus, a few entries from Mus cookii, Mus minutoides, Mus pahari, Mus saxicola and Mus spretus) (Tables 1 and 2). This represents the complete set of human IG and TR genes, for all seven loci (the three IG loci: IGH, IGK and IGL; and the four TR loci: TRA, TRB, TRG and TRD) and for the chromosomal orphon sets (6,7). The mouse entries are complete, except for the mouse IGHV group, which still has a provisional IMGT nomenclature but is near completion.
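The five fields that identify an allele reference sequence (accession number, gene and allele name, species, allele functionality, gene core) can be modelled as a small record. The sketch below is illustrative only: the `|` field separator, the handling of the leading `>`, the `ReferenceHeader` type, the `parse_header` function and the accession `X00000` used in the usage note are all assumptions, not details taken from this text.

```python
from dataclasses import dataclass

# The three allele functionality instances named in the article.
FUNCTIONALITIES = {"F", "ORF", "P"}


@dataclass(frozen=True)
class ReferenceHeader:
    accession: str      # IMGT/LIGM-DB accession number
    allele_name: str    # IMGT gene and allele name
    species: str
    functionality: str  # F, ORF or P
    gene_core: str      # e.g. V-REGION; not validated, since exons of a
                        # C-REGION are provided separately


def parse_header(line: str, sep: str = "|") -> ReferenceHeader:
    """Parse a FASTA header line; the separator is an assumed convention."""
    fields = [f.strip() for f in line.strip().lstrip(">").split(sep)]
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    accession, allele_name, species, functionality, gene_core = fields[:5]
    if functionality not in FUNCTIONALITIES:
        raise ValueError(f"unknown functionality: {functionality!r}")
    return ReferenceHeader(accession, allele_name, species,
                           functionality, gene_core)
```

A call such as `parse_header(">X00000|IGHV2-70*01|Homo sapiens|F|V-REGION")` would return a record whose `functionality` is `"F"`; the header string here is a made-up placeholder, not an actual database entry.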
IMGT/GENE-DB QUERY PAGE

The IMGT/GENE-DB Query page comprises three types of search (Figure 1): (i) ‘GENERAL CRITERIA’ allows a search of IG and TR genes, for a given species, by locus or chromosomal orphon set, by gene type, group or subgroup, or by functionality. The user can select genes that have been found rearranged, transcribed or translated. (ii) ‘SHORT CUT’ allows a selection, for a given species, on gene name or clone name. (iii) ‘IMGT/GENE-DB direct links’ gives access to a set of links that retrieve the information related either to one given gene or to the genes of a group, using configurable URLs that can be used by humans or programs.

IMGT/GENE-DB RESULT PAGE

Following a ‘GENERAL CRITERIA’ or a ‘SHORT CUT’ selection, the IMGT/GENE-DB result page (Figure 2) shows, at the top, the user selection, the number of resulting genes and the number of resulting alleles, then the list of resulting genes with, for each gene, the species, IMGT gene name, gene functionality, IMGT gene definition, number of alleles, chromosomal localization and IMGT/LIGM-DB reference sequence(s) for the allele *01. In the ‘Choose your display’ section, the user can select between three types of display: (i) the complete individual IMGT/GENE-DB entries for the genes selected in the list of resulting genes (an IMGT/GENE-DB entry is described in the next section); (ii) the IMGT/GENE-DB allele reference sequences in FASTA format: nucleotide or amino acid sequences, either with gaps according to the IMGT unique numbering (14–16), or without gaps; (iii) the IMGT label sequences in FASTA format, extracted from expertly annotated IMGT/LIGM-DB reference sequences. This third display allows the retrieval of any label sequence (V-EXON, V-HEPTAMER, etc.), of the core regions of out-of-frame pseudogenes, which are not available in the IMGT/GENE-DB allele reference sequences, and of the artificially spliced L-PART1+L-PART2 and L-PART1+V-EXON.
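The ‘GENERAL CRITERIA’ search and the summary shown at the top of the result page can be mimicked on an in-memory list. This is a sketch under stated assumptions, not the IMGT/GENE-DB implementation: the `GeneRecord` fields, the `search` and `summarize` helpers and the sample records are hypothetical. Only the 13 alleles of human IGHV2-70 come from the text; its functionality value and the two placeholder entries are illustrative.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class GeneRecord:
    species: str
    name: str
    locus: str
    gene_type: str      # V, D, J or C
    functionality: str  # F, ORF or P
    n_alleles: int


def search(records: Iterable[GeneRecord], **criteria) -> List[GeneRecord]:
    """Keep the records whose attributes equal every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]


def summarize(genes: List[GeneRecord]) -> Tuple[int, int]:
    """Return (number of resulting genes, number of resulting alleles)."""
    return len(genes), sum(g.n_alleles for g in genes)


# Illustrative data; the placeholder names are not real gene names.
SAMPLE = [
    GeneRecord("Homo sapiens", "IGHV2-70", "IGH", "V", "F", 13),
    GeneRecord("Homo sapiens", "PLACEHOLDER-1", "IGK", "V", "P", 1),
    GeneRecord("Mus musculus", "PLACEHOLDER-2", "IGH", "J", "F", 2),
]
```

Calling `search(SAMPLE, species="Homo sapiens", locus="IGH")` and passing the result to `summarize` yields the two counts that head a result page. Keyword criteria keep the sketch short; a real query form would also need range or multi-value selections.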
For nucleotide sequences, the user can extend the limits in 5′ or 3′ by entering the desired number of nucleotides.

IMGT/GENE-DB ENTRY

An individual IMGT/GENE-DB entry provides a full characterization of a gene and of its alleles: IMGT name and definition, chromosomal localization, number of alleles, IMGT reference alleles and other sequences from the literature (as defined in IMGT Gene tables), and, for each sequence, allele functionality, clone name, accession number and molecule type. The IMGT/GENE-DB entry also gives access (i) to the IMGT/GENE-DB allele reference sequences in FASTA format [nucleotide and amino acid sequences with gaps according to the IMGT unique numbering (14–16), or without gaps], (ii) to the IMGT Repertoire standardized resources (Chromosomal localization, Locus representation, Tables of alleles, Alignments of alleles, IMGT Protein displays, IMGT Colliers de Perles, etc.) via internal links (‘Locus and genes’, ‘Proteins and alleles’, ‘2D and 3D structures’, ‘Probes and RFLP’, ‘Gene regulation and expression’, ‘Genes and clinical entities’ sections), (iii) to the known IMGT/LIGM-DB cDNA sequences of the gene with a direct IMGT/LIGM-DB query, which then allows the choice of the nine different IMGT/LIGM-DB displays including IMGT/V-QUEST results (17,18), (iv) to the IMGT tools for genome analysis (IMGT/GeneSearch, IMGT/GeneView, IMGT/LocusView, IMGT/GeneInfo) (3,5,19), and (v) to external links to the genome databases LocusLink and Entrez Gene at NCBI, GDB, GeneCards (20), OMIM and MGD (21), to the sequence databases EMBL (22)/GenBank (23)/DDBJ (24) and to the nomenclature database HGNC Genenew (11).

CONCLUSION AND PERSPECTIVES

The central management of gene-related data in IMGT/GENE-DB improves the dynamic generation of knowledge resources from data, which are extracted from the IMGT sequence database IMGT/LIGM-DB, from HTML pages in IMGT Repertoire and from the IMGT tools for genome analysis.
Reciprocally, the IMGT/GENE-DB data are used by other IMGT databases (IMGT/PRIMER-DB, IMGT/3Dstructure-DB) and tools (IMGT/V-QUEST, IMGT/JunctionAnalysis, etc.). The dynamic interactions are currently implemented through IMGT-Choreography (29), based on IMGT-ONTOLOGY and using IMGT-ML Web services. All the mouse IG and TR genes from IMGT/GENE-DB with IMGT reference sequences were provided by IMGT to HGNC and MGD in July 2002. IG and TR genes from genomes of other species (chimpanzee, rat, etc.), as well as members of the immunoglobulin superfamily (IgSF) and of the major histocompatibility complex superfamily (MhcSF) (currently described in the IMGT Repertoire ‘RPI’ section, for the related proteins of the immune system), will be added to IMGT/GENE-DB following the exhaustive analysis of the corresponding genes in IMGT.

CITATION

Users of IMGT/GENE-DB are requested to cite this article in their publications and to quote the IMGT® home page URL, http://imgt.cines.fr.
