PMC:539965 / 1272-1274
DG-CST (Disease Gene Conserved Sequence Tags), a database of human–mouse conserved elements associated to disease genes
Abstract
The identification and study of evolutionarily conserved genomic sequences that surround disease-related genes is a valuable tool to gain insight into the functional role of these genes and to better elucidate the pathogenetic mechanisms of disease. We created the DG-CST (Disease Gene Conserved Sequence Tags) database for the identification and detailed annotation of human–mouse conserved genomic sequences that are localized within or in the vicinity of human disease-related genes. CSTs are defined as sequences that show at least 70% identity between human and mouse over a length of at least 100 bp. The database contains CST data for over 1088 genes responsible for monogenic human genetic diseases or involved in the susceptibility to multifactorial/polygenic diseases. DG-CST is accessible via the internet at http://dgcst.ceinge.unina.it/ and may be searched using both simple and complex queries. A graphic browser allows direct visualization of the CSTs and related annotations within the context of the relative gene and its transcripts.
INTRODUCTION
Alignment of DNA sequences from different species provides an effective tool to decode genomic information, based on the assumption that functional sequences tend to evolve at a slower rate than non-functional sequences. The availability of the complete genomic sequences from a variety of species (1–4) makes it possible to carry out these analyses very effectively and to identify not only coding sequences but also non-coding sequences with either regulatory or structural functions (5–8).
A comparative analysis of the human and murine genomes revealed the presence of a surprisingly high number of sequence elements longer than 100 bp and displaying a sequence identity >70% between human and mouse (6). Interestingly, more than half of these conserved sequences do not represent known elements belonging to protein-coding genes and may therefore represent non-coding RNAs, expression control elements or chromosomal structural elements. Such sequences have been previously termed CNG (conserved non-genic sequences) (9,10) or CNS (conserved non-coding sequences) (2). Here, we use the more neutral and descriptive expression ‘conserved sequence tags’ (CST), which is also appropriate for describing exons.
To gain further insight into the biological role of these conserved sequences, we chose to identify and annotate CSTs belonging to a set of human genes involved in the pathogenesis of genetic diseases. These are among the best-studied human genes, as they have been the objects of very detailed structural and functional characterization in the past 15–20 years. Furthermore, novel functional elements within these genes may be targets of as yet unidentified mutations leading to genetic diseases. Information on CSTs related to human disease genes can also be gathered from reference genome sequence databases, e.g. Ensembl (11) and Genome Browsers (12), or from more specialized resources, e.g. Vista Browser (13) and GALA (14). However, these valuable resources are not specifically designed for the study of human disease genes: retrieval of CST data for these genes may turn out to be difficult, and statistical analysis impossible, since CSTs are not explicitly annotated. Therefore, we decided to build DG-CST (Disease Gene CST), a database of human–mouse conserved elements associated with disease genes. To this end, systematic identification of CSTs in human disease genes was carried out, followed by detailed bioinformatic analysis aimed at identifying novel functional elements associated with these genes, either transcribed and possibly coding sequences or non-transcribed sequence elements with a hypothetical role in the control of gene expression. The DG-CST database is available to the scientific community through a Web interface at the address http://dgcst.ceinge.unina.it/. The annotation of CSTs related to disease genes will be valuable for the elucidation of the functional role of these conserved sequences and for a better understanding of the pathogenesis of human genetic disorders.
CONSTRUCTION AND ORGANIZATION OF THE DG-CST
Sequence acquisition and CST identification
A list of human genes involved in either the pathogenesis of monogenic human disorders or in the predisposition to multifactorial diseases was obtained by screening the Genecards (15) and the On-Line Mendelian Inheritance in Man (OMIM) (16) databases. We then searched the human Ensembl database (assembly release NCBI34) to retrieve the human genomic sequences spanning the selected transcripts as well as 250 additional kilobases of flanking sequence on both sides. The extent of the flanking sequence was reduced when known genes were annotated in proximity of the disease gene, but a minimum of 20 kb was taken in all cases. The Ensembl database was also used as the source of the corresponding murine sequences. Orthologous gene annotation was used, when available, to find the mouse counterparts; when more than one orthologous gene was found, sequences were manually selected, on the basis of overall sequence conservation and relationships with other neighboring sequences. Mouse sequence size was defined according to the length of the human sequence.
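The flanking-sequence rule described above can be illustrated with a minimal sketch. The function names and the way the flank is shortened (truncation at the boundary of the nearest annotated neighboring gene) are assumptions made for illustration; the text states only that the flank was reduced near known genes and never fell below 20 kb.

```python
MAX_FLANK = 250_000  # bp of flanking sequence requested on each side
MIN_FLANK = 20_000   # bp of flanking sequence always retained


def flank_length(distance_to_neighbor=None):
    """Flank size for one side of the gene.

    distance_to_neighbor is the gap (bp) to the nearest annotated gene
    on that side, or None when no gene lies within reach.
    Assumption: the flank stops at the neighboring gene.
    """
    if distance_to_neighbor is None:
        return MAX_FLANK
    return max(MIN_FLANK, min(MAX_FLANK, distance_to_neighbor))


def region_window(gene_start, gene_end, upstream_gap=None, downstream_gap=None):
    """Return (start, end) of the genomic window to retrieve, 1-based."""
    start = max(1, gene_start - flank_length(upstream_gap))
    end = gene_end + flank_length(downstream_gap)
    return start, end


# Isolated gene: full 250 kb on both sides.
print(region_window(1_000_000, 1_050_000))
# Neighbor 5 kb upstream (raised to the 20 kb minimum), 100 kb downstream.
print(region_window(1_000_000, 1_050_000, upstream_gap=5_000, downstream_gap=100_000))
```

The sketch does not clamp the window to the chromosome end, which a real retrieval step would also need to handle.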
A total set of 1088 human genomic sequences was compared to the corresponding murine orthologous genomic sequences (the full list is available online). Overall, 193 million bp of human genomic sequences were analyzed, corresponding to 7% of the human genome. Human and mouse genomic sequences, prefiltered to mask all known repeated sequences, were compared using the BLASTZ program (17). Sequences showing at least 70% identity over a region of at least 100 bp were selected and further analyzed to eliminate redundancies, leading to the identification of 66 495 repeat-free, non-overlapping, human and mouse CST pairs. The CSTs were found to correspond to or overlap known human exon sequences in about 32% of cases (n = 21 139), while they were located in either intronic or intergenic regions in the remaining 68% of cases (n = 45 356) (Table 1).
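The selection step can be sketched as follows. The thresholds (70% identity, 100 bp) come from the text; the `Hit` record, the function names and the greedy rule used to remove overlapping hits (longest first, then highest identity) are illustrative assumptions, since the redundancy-elimination procedure is not detailed here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    start: int        # 1-based inclusive start on the human sequence
    end: int          # inclusive end on the human sequence
    identity: float   # percent identity of the human-mouse alignment


MIN_IDENTITY = 70.0  # percent, from the CST definition
MIN_LENGTH = 100     # bp, from the CST definition


def length(hit):
    return hit.end - hit.start + 1


def is_cst(hit):
    """True when an aligned block meets the CST thresholds."""
    return hit.identity >= MIN_IDENTITY and length(hit) >= MIN_LENGTH


def select_csts(hits):
    """Keep qualifying hits and drop any that overlap an already kept one.

    Assumption: longer hits win, ties are broken by higher identity.
    """
    candidates = sorted(
        (h for h in hits if is_cst(h)),
        key=lambda h: (-length(h), -h.identity, h.start),
    )
    kept = []
    for h in candidates:
        if all(h.end < k.start or h.start > k.end for k in kept):
            kept.append(h)
    return sorted(kept, key=lambda h: h.start)


hits = [
    Hit(1, 150, 85.0),    # qualifies
    Hit(100, 220, 90.0),  # qualifies but overlaps the longer first hit
    Hit(300, 380, 95.0),  # too short (81 bp)
    Hit(400, 520, 65.0),  # identity below 70%
    Hit(600, 720, 72.5),  # qualifies
]
print(select_csts(hits))
```

Other tie-breaking rules (for example, merging overlapping blocks instead of discarding one) would be equally plausible readings of "non-overlapping".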
CST annotation
The identified CSTs are collected in the DG-CST database, together with a large number of annotations including:
species;
genomic location, i.e. chromosome, position, relationship with the closest gene and with the selected disease gene (often coincident);
sequence content, i.e. sequence, length, GC percentage;
identity between human and mouse sequences, number of gaps, polarity;
BLAST matches with other CSTs, as well as with other human genomic sequences;
BLAST matches versus non-redundant nucleotide databases;
conservation in other species, as assessed by BLAST analysis versus the drafts of fugu (3), chicken (11), rat (4) and zebrafish (11) genome sequences;
classification of CSTs as ‘intronic’, ‘intergenic’ or ‘exonic’ based on Ensembl gene annotations;
potential of CSTs to represent transcribed/coding elements based on a number of different tests, including determination of maximum ORF size, presence of putative splice sites, exonic splicing enhancers (18), exon predictions based on GENSCAN (19), BLAST matches with expressed sequence tags (ESTs) and non-redundant protein databases, word frequencies, and determination of the coding potential score (c.p.s.) according to the CSTMiner algorithm (20,21), a recently developed software based on pairwise genome comparison;
presence of single nucleotide polymorphisms (SNPs), as reported in Ensembl;
presence of palindromes, tandem repeats and putative RNA secondary structures as predicted by using the ddbRNA software (22);
presence of putative transcription factor (TF) binding sites, as assessed using BID, a newly developed algorithm (A. Ambesi, M. Bansal and D. di Bernardo, unpublished data).
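Two of the simpler sequence-content annotations listed above, GC percentage and maximum ORF size, can be sketched in a few lines. This is an illustration only: the function names are hypothetical, and the ORF definition used here (the longest stop-free run of codons in any of the six reading frames, using the standard stop codons) is an assumption, as the exact definition applied in DG-CST is not given in the text.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_percent(seq):
    """GC percentage computed over unambiguous bases only."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, frame):
    """Length (bp) of the longest stop-free codon run in one frame."""
    best = run = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best


def max_orf_size(seq):
    """Longest stop-free stretch (bp) over all six reading frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return max(longest_orf_in_frame(s, f) for s in strands for f in range(3))


print(gc_percent("GGCCAATT"))        # 50.0
print(max_orf_size("ATGAAATAGCCC"))  # 12 (reverse strand, frame 0)
```

A stricter definition requiring an ATG start codon would give smaller values; the stop-free definition is often preferred for short conserved fragments that may represent internal exons.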
DATABASE SEARCH
The DG-CST database contains all the annotations and is designed to allow easy retrieval of CST information. Searching is supported in a number of different ways. A graphic browser allows direct visualization of the CSTs, within the context of the relative gene and its transcripts. Briefly, CST information can be accessed in the following ways:
By choosing from a list of all analyzed disease genes available on the home page (Figure 1A and B).
By selecting one or more genes either as a quick search option from the home page (Figure 1A, black box) or following the ‘gene’ link. Gene selection may be carried out by gene symbol, disease name and several other criteria, also in combination (Figure 1E).
By querying the database for CSTs selected according to a large number of annotated features, alone or in combination, in the ‘Advanced’ section (Figure 1D). To facilitate the search, reduced feature sets are available where CSTs can be searched by (a) DNA-based features such as presence of tandem repeats, palindromes, SNPs (Figure 1C); (b) RNA-based features such as presence of putative secondary structures, matches with ESTs, GENSCAN predicted exons; (c) protein-coding features, such as exon annotation, coding potential, BLAST matches with proteins; (d) CSTs localized to selected chromosomal regions.
Finally, CSTs can be searched by BLAST sequence analysis from the home page (Figure 1A, red box).
Each CST entry present in DG-CST is assigned a unique identifier (CST ID) that can also be used to quickly find the CST from different sections of the database, including the home page.
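The idea of combining annotated features in a single query, or restricting CSTs to a chromosomal region, can be illustrated with a small in-memory sketch. The record layout, field names, identifiers and coordinates below are all hypothetical; they do not reflect the actual DG-CST schema or its Web interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CstRecord:
    cst_id: str          # hypothetical identifier format
    chromosome: str
    start: int
    end: int
    region_class: str    # 'intronic', 'intergenic' or 'exonic'
    has_snp: bool
    has_est_match: bool


def search(records, **criteria):
    """Return records matching every given field=value criterion (AND)."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


def in_region(records, chromosome, start, end):
    """Return records overlapping the given chromosomal interval."""
    return [
        r for r in records
        if r.chromosome == chromosome and r.start <= end and r.end >= start
    ]


# Made-up example records, for illustration only.
records = [
    CstRecord("CST-A", "7", 1_000, 1_200, "intronic", True, False),
    CstRecord("CST-B", "7", 5_000, 5_150, "exonic", True, True),
    CstRecord("CST-C", "X", 2_000, 2_300, "intergenic", False, True),
]

print([r.cst_id for r in search(records, has_snp=True, region_class="intronic")])
print([r.cst_id for r in in_region(records, "7", 1_100, 6_000)])
```

Filtering a region result through `search` again gives the combined query, which mirrors the way the reduced feature sets and the localization search can be mixed in the ‘Advanced’ section.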
DATA DISPLAY
When searching DG-CST using the previously described ‘DNA-based’, ‘RNA-based’, ‘protein-based’, ‘advanced’ and ‘localization’ features, a list of CST entries that meet the search criteria can be accessed. Individual CSTs may be visualized in a specific page where all annotations available on that particular CST are displayed (Figure 2C). Matching CSTs from other species may be seen and compared in a multi-sequence alignment (Figure 2D). Matches found for each CST in a number of BLAST searches, pre-run against collections of genomic, EST or protein databases, may also be displayed starting from the CST page.
On the other hand, when searching by gene/disease name and/or symbol, it is possible to obtain a list of gene entries that meet the search criteria. Each gene entry, in addition to links to external resources such as LocusLink, ENSEMBL and OMIM, provides a ‘CST list’ link that gives access to the list of all CSTs found by analyzing the selected disease gene region, as shown in Figure 2A. By clicking on each entry, it is possible to access all the data pertaining to a given human CST, as described above.
Graphical representation is accessible through a ‘map’ link, where CSTs and related annotations are shown within the context of the relative gene and its transcripts (Figure 2B). Moving through the genomic region and zooming to various levels of detail are supported. CSTs may be labeled by a color code on the basis of several quantitative parameters such as degree of human–mouse sequence identity, GC content, number of gaps, putative RNA secondary structures, palindromes and tandem repeats. To avoid an exceedingly crowded map, the graphic visualization tool allows the user to display selected CST subsets, such as:
intergenic, intronic or exonic CSTs;
CSTs containing putative TF binding sites;
CSTs with matches to ESTs;
CSTs conserved in additional species, besides human and mouse, such as chicken, fugu and zebrafish.
CSTs conserved in additional species have a higher probability of representing functional elements playing a basic role in vertebrates, as suggested by recent reports (23).
CONCLUSIONS
DG-CST is an annotated collection of conserved sequences related to genes involved in genetic diseases and may represent a valuable resource for investigators interested in studying the molecular mechanisms that underlie genetic diseases. The database will be updated on a regular basis to include information on newly identified human disease genes as well as on new genomic data (e.g. sequences from additional organisms). DG-CST may help in deciphering the spectrum of pathogenetic mutations that determine genetic diseases. Mutations are usually searched for in the coding regions of a gene, but may easily occur in other areas. CSTs provide a vast library of putative novel functional sites, such as previously undescribed exons and/or elements possibly playing a role in regulating the level of gene expression. These sites may be functionally tested as well as screened for mutations in patients, particularly in diseases where the analysis of the known functional elements of the disease gene has so far failed to identify a relevant number of causative mutations (24–27). Several lines of evidence point to the direct involvement of regulatory control elements in the pathogenesis of human disorders, due both to chromosomal rearrangements (28,29) and to point mutations (30–32). However, the recognition of pathogenic mutations in regulatory elements leading to genetic disorders has so far been hampered by our limited knowledge of the structure and function of the elements associated with disease genes. The DG-CST database should be a valuable resource for filling this information gap and for facilitating efforts aimed both at elucidating the function of disease genes and at better understanding the pathogenetic mechanisms of genetic diseases.