HExDB: Human EXon DataBase for Alternative Splicing Pattern Analysis

HExDB is a database for analyzing exon and splicing pattern information in Homo sapiens. It is useful for two specific purposes: 1) designing primers for exon amplification from cDNA and 2) understanding how alternative splicing changes ORFs. HExDB was constructed by integrating data from AltExtron, a database of computationally predicted exons; Ensembl cDNA annotation; and the recently published Affymetrix genome tiling data. Although it may contain false positives, HExDB is a good starting point because of its sensitivity. At present, 2,046,519 exons are stored in HExDB. We found that 16.8% of the exons in the database were constitutive exons and 83.1% were novel gene exons.

Alternative splicing is a versatile mechanism for producing a variety of transcripts and for regulating gene expression in eukaryotes (Matlin et al., 2005). Accurate splicing is critical, since at least 15% of human genetic diseases are caused by splicing errors (Cartegni et al., 2002; Caceres et al., 2002; Faustino et al., 2003; Pagani et al., 2004). Recent microarray data combined with ESTs suggest that 73% of human genes are alternatively spliced (Johnson et al., 2003).

There are many databases of alternatively spliced genes. Broadly, they fall into two classes: those based on experimental data and those based on computational prediction of alternative splicing. The experiment-based databases include ASDB (Dralyuk et al., 2000), AsMamDB (Ji et al., 2001), Xpro (Gopalan et al., 2004) and AEdb (Thanaraj et al., 2004). Their data sources include the MEDLINE bibliography and sequence data from the GenBank (Benson et al., 2004) and SWISS-PROT (Bairoch et al., 2004) databases. The computation-based databases include AltExtron (Clark et al., 2002), Asforms (Brett et al., 2001), ASAP (Lee et al., 2003) and TAP (Kan et al., 2002).
These databases determine splice sites by examining alignments of EST and mRNA sequences. The computational approaches, however, can be error-prone owing to limited gene coverage, and they lack a good confidence measure (Thanaraj et al., 2004). In 2004, the European Bioinformatics Institute (EBI) launched ASD, the Alternative Splicing Database, to combine these two approaches; it is also a computation-based database. Whatever the method, researchers need efficient alternative splicing prediction methods and databases.

For more efficient genome-wide disease research, a more sensitive way to find exons is needed. In other words, despite possible false positives, detecting all possible exons is an important starting point for alternative splicing research and for database construction. Here, we present a new database that aims to find the maximum number of exons. It integrates several exon data sources, including the recently published transcriptional maps of ten human chromosomes (Cheng et al., 2005).

To construct HExDB, AltExtron (Clark et al., 2002) and Ensembl cDNA annotations (Hubbard et al., 2005) were used to gather primary exon data. AltExtron is a computer-generated database from EBI. Ensembl annotation is primarily based on biological literature.

AltExtron contains specific information about individual exons, but it does not contain their exact genomic coordinates. Therefore, we mapped the sequences to genomic coordinates using the BLAST sequence search algorithm (McGinnis and Madden, 2004) and genetic database queries. The AltExtron file format is described on its database homepage, www.ebi.ac.uk/asd/altextron/data/gene_data.html. We used the following three fields of that format: GI, ACC, and AFETS.

Ensembl cDNA annotations contain many references. We filtered out most of the reference information and extracted only the position information and the IDs of the genes to which the exons belong.
The Ensembl annotation is stored in EMBL file format. We used the BioPerl module Bio::SeqIO to parse the annotated files. All source code is accessible from the supplementary material page of our database site.

The National Human Genome Research Institute launched the ENCODE project, the ENCyclopedia Of DNA Elements. As one of its results, Affymetrix Inc. presented a transcriptome map of ten human chromosomes at 5-bp resolution, constructed by a microarray approach (Cheng et al., 2005). Profiles generated by the microarrays were filtered at a threshold to create exon regions. Cheng et al. chose the threshold level for each chip so that 94.8% of the exons in the exon region assignment were already known. Cheng et al. used the term ‘transfrags’ to denote these regions. We followed the same procedure to integrate the transfrag data with the other databases collected. Cheng et al. searched for transfrags on ten human chromosomes for eight kinds of cancer cell lines. Because the threshold-picking algorithm is imperfect, some transfrag boundaries were unclear and some transfrags were redundant. The redundant transfrags were reduced by clustering algorithms.

Introns and the 5-kb regions both upstream and downstream of genes were obtained from GoldenPath, which is based on known gene information for the April 2003 version of the human genome (NCBI v. 33). HExDB exon types were classified according to the position of the exons within intragenic and intergenic regions (Fig. 1).

From the AltExtron database, 29,853 exons belonging to 4,113 genes were integrated into HExDB. The average number of exons per gene was 7.3. The distribution of exons within each gene is shown in Fig. 2(a).

From the Ensembl annotation, 299,646 exons were integrated into HExDB. Most of them were well positioned on chromosomes; Table 1 and Fig. 2(b) show their statistics for each chromosome. However, 1,572 exons did not belong to proper chromosome contigs.
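
The text does not specify which clustering algorithm was used to reduce redundant transfrags. As a minimal sketch only, the Python below assumes the simplest overlap rule: regions on the same chromosome that share at least one base are collapsed into one region. The function name `merge_overlapping`, the tuple layout `(chrom, start, end)` and the inclusive coordinates are illustrative assumptions, not the HExDB implementation.

```python
from collections import defaultdict


def merge_overlapping(regions):
    """Collapse overlapping (chrom, start, end) regions into single regions.

    Assumption for illustration: coordinates are inclusive, so two regions
    overlap when the next start is <= the current end.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        # Sorting by start lets one left-to-right pass find every overlap.
        for start, end in sorted(by_chrom[chrom]):
            if cur_end is not None and start <= cur_end:
                cur_end = max(cur_end, end)  # extend the open region
            else:
                if cur_end is not None:
                    merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged


# Hypothetical transfrag coordinates, made up for this example.
example = [
    ("chr21", 100, 200),
    ("chr21", 150, 260),
    ("chr21", 300, 350),
    ("chr22", 10, 50),
    ("chr22", 50, 80),
]
print(merge_overlapping(example))
# [('chr21', 100, 260), ('chr21', 300, 350), ('chr22', 10, 80)]
```

Sorting first keeps the pass linear after an O(n log n) sort. A stricter or looser rule, such as requiring a minimum overlap or joining regions separated by a small gap, would change the resulting counts.
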
Transcriptome map analysis by Cheng et al. (2005) yielded 1,717,035 exons. However, because of unclear boundaries, redundant entries were assigned to the same exons. To reduce this redundancy, we treated overlapping exons as a single exon, which brought the exon number down to 499,174. The true total exon number is therefore perhaps between the two numbers, and further research on accurately reducing redundancy is necessary to obtain the exact number of human exons. The numbers of transfrags on the ten chromosomes are shown in Table 2.

The total number of exons in HExDB, combining all the above source databases, was 2,046,519. This figure contains some redundancy from the integration of the source databases (see Table 2). However, we suggest that it is about the upper limit of the number of exons in the human genome. If we suppose that the human genome contains around 30,000 genes, the number of exons per gene in HExDB is 68.2, far larger than the averages in AltExtron and Ensembl.

Exon data in HExDB can be accessed through an interactive genome browser, as shown in Fig. 3, or by downloading the MySQL database file (see Fig. 4). The genome browser was built with the custom-track feature of the UCSC genome browser.

The purpose of HExDB is to list all possible exons that can be predicted and annotated by current technology. This can introduce some redundancy. However, it provides an estimate and the largest possible set of exon data. Research on alternative splicing requires a list of all possible exons. The contribution of HExDB, therefore, is to provide biologists with a useful tool for research on alternative splicing as well as the most comprehensive exon information.

From the existing exon data, we found about two million exons in the human genome. This number is by no means definite or accurate and will be adjusted in the future.
However, we predict that it is close to the upper limit of the total number of human exons. To our surprise, there were a great number of unknown exons. Given this high number of exons, the actual number of genes in the genome could be as high as 100,000.
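
The exons-per-gene figures quoted in the text can be checked with simple division. The short Python sketch below uses only numbers stated above; the 30,000-gene count is the text's own supposition, and the helper name `exons_per_gene` is illustrative.

```python
def exons_per_gene(n_exons, n_genes):
    """Average number of exons per gene."""
    return n_exons / n_genes


# AltExtron: 29,853 exons belonging to 4,113 genes.
altextron_avg = exons_per_gene(29_853, 4_113)

# HExDB: 2,046,519 exons, supposing around 30,000 human genes.
hexdb_avg = exons_per_gene(2_046_519, 30_000)

print(f"AltExtron: {altextron_avg:.1f} exons per gene")  # AltExtron: 7.3 exons per gene
print(f"HExDB: {hexdb_avg:.1f} exons per gene")          # HExDB: 68.2 exons per gene
```

No gene count is given for the Ensembl set, so the corresponding average cannot be reproduced from the text alone.
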
