Title
|
T1DBase, a community web-based resource for type 1 diabetes research
|
Abstract
|
T1DBase (http://T1DBase.org) is a public website and database that supports the type 1 diabetes (T1D) research community. The site is currently focused on the molecular genetics and biology of T1D susceptibility and pathogenesis. It includes the following datasets: annotated genome sequence for human, rat and mouse; information on genetically identified T1D susceptibility regions in human, rat and mouse, and genetic linkage and association studies pertaining to T1D; descriptions of NOD mouse congenic strains; the Beta Cell Gene Expression Bank, which reports expression levels of genes in beta cells under various conditions, and annotations of gene function in beta cells; data on gene expression in a variety of tissues and organs; and biological pathways from KEGG and BioCarta. Tools on the site include the GBrowse genome browser, site-wide context-dependent search, Connect-the-Dots for connecting gene and other identifiers from multiple data sources, Cytoscape for visualizing and analyzing biological networks, and the GESTALT workbench for genome annotation. All data are open access and all software is open source.
|
Body
|
INTRODUCTION
T1DBase (http://T1DBase.org) is a public website and database that supports the type 1 diabetes (T1D) research community. T1DBase collects information from public sources and collaborating investigators, integrates this information, and presents it in a form that is useful for, and accessible to, T1D researchers. It is analogous to a model organism database but is focused on a specific disease rather than a specific organism. The site contains multiple semi-independent datasets that are curated independently (in some cases by external collaborators), and then unified using integration software developed for this purpose. Figure 1 shows the homepage. All data are open access and all software is open source.
T1DBase is a merger of two separate projects: one at the Institute for Systems Biology (ISB), which was explicitly funded by the Juvenile Diabetes Research Foundation to create a public resource; the other at the Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory (DIL) of the University of Cambridge, whose mission is to develop tools and methods to integrate genomic and genetic data and improve cost-effectiveness in the search for T1D susceptibility genes (1).
T1D is an autoimmune disease in which the insulin-producing pancreatic beta cells are selectively destroyed. T1D is the second most common form of diabetes with a prevalence of 0.4% in Caucasians (OMIM:222100). The disease is caused by a combination of environmental and genetic factors. While most cases are non-familial, disease risk is dramatically higher (15 times) for siblings of an affected individual; many lines of research confirm that the increased risk is at least partially genetic (2). The HLA region on chromosome 6 confers 40–50% of the genetic susceptibility with lesser contributions from the three other known loci: INS (3), CTLA4 (4) and LYP/PTPN22 (5). It has been estimated that there could be an additional 50 detectable susceptibility loci which are yet to be identified (W. Y. S. Wang, B. J. Barratt, D. G. Clayton and J. A. Todd, submitted for publication).
The major animal models for T1D are the nonobese diabetic (NOD) mouse and the BB rat. In addition, research on beta cell function is carried out on non-diabetic mouse and rat strains.
At present, T1DBase is focused on the molecular genetics and biology of T1D susceptibility and pathogenesis. This represents ∼15% of current T1D research and involves more than 1000 investigators, based on a survey of T1D publications.
DATA CONTENT
Annotated genomes
T1DBase provides annotated genome sequence for human, rat and mouse. Currently 32 data tracks are available. The major sources for this information are the public Ensembl (6,7) and UCSC (8) genome databases, augmented by gene annotations produced internally by scientists at DIL and ISB. The DIL, upon dbSNP submission, also publishes its SNPs and primers through T1DBase.
From the large number of data tracks available on Ensembl and UCSC, we have chosen those that are most relevant to T1D investigators. We are in the process of adding tracks and integrating tools not available on Ensembl and UCSC that are of specific interest to our users. Examples appear later.
We are committed to incorporating annotations from scientists in the community. Such annotations are manually reviewed before being added to the database.
T1D prioritized regions
The system has information on prioritized regions in human, rat and mouse.
For human, there are three different kinds of prioritized regions: genetic linkage regions, regions defined through orthology with susceptibility regions in NOD congenic strains, and candidate gene regions. There are 20 putative linkage regions, nine orthology regions and numerous candidate gene regions. For linkage and orthology regions, we provide a list of genes in the region. For linkage regions, we also provide a bibliography of publications studying the region, and a summary of LOD scores from the various studies. The candidate genes include ones reported in the literature as having a positive association with T1D, ones reported to be associated with other immune-mediated disorders such as asthma or rheumatoid arthritis that are also candidates for T1D, and other genes of interest. The last group includes orthologs of mouse or rat genes associated with T1D, and genes involved in relevant pathways. For genes with literature support, we provide links to the publications.
For mouse, the database contains 27 susceptibility regions defined by the Mouse Genome Database (MGD) (9) with supporting literature. For rat, we have 18 regions and supporting literature from the Rat Genome Database (RGD) (10).
Genetic association studies
We have assembled a dataset of genetic association studies pertaining to T1D in collaboration with the NIH Genetic Association Database (GAD) (11). The dataset covers about 100 genes and 180 publications, including published negative results. Under the collaboration, we carry out literature searches to identify relevant studies and pass the results to GAD for a final quality check and data entry.
NOD strain database
Congenic strains have been created to help identify regions of the NOD genome involved in T1D by introgressing chromosomal regions from resistant strains into the NOD mouse. To visualize the introgressed regions, we have developed a strain database. The strain database has the same data model as the feature database and allows the storage of strain information, such as strain name and its aliases, chromosomes, fine mapping markers and the name of the regions. These data are updated automatically with each update of the NCBI mouse genome build. When an interval is refined, scientists submit the new boundary markers via a webpage and the intervals are recalculated. The intervals are drawn through the Perl GD package. The database and the drawing tools may be of interest to other researchers working with congenic strains. Figure 2 illustrates the way strain information is displayed.
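The recalculation of an interval from its boundary markers can be sketched as follows. This is a minimal Python illustration, not the T1DBase implementation: the function name `interval_from_markers`, the marker names and all positions are hypothetical, and marker positions are assumed to be available as a per-build lookup table.

```python
def interval_from_markers(proximal, distal, positions):
    """Return (chromosome, start, end) spanned by two boundary markers.

    `positions` maps a marker name to (chromosome, base position) in one
    genome build; this layout is an assumption made for illustration.
    """
    chrom_a, pos_a = positions[proximal]
    chrom_b, pos_b = positions[distal]
    if chrom_a != chrom_b:
        raise ValueError("boundary markers lie on different chromosomes")
    return chrom_a, min(pos_a, pos_b), max(pos_a, pos_b)


# Hypothetical marker positions in two genome builds (made-up values).
build_old = {"marker_A": ("3", 1_000_000), "marker_B": ("3", 2_500_000)}
build_new = {"marker_A": ("3", 1_020_000), "marker_B": ("3", 2_530_000)}

old_interval = interval_from_markers("marker_A", "marker_B", build_old)
new_interval = interval_from_markers("marker_A", "marker_B", build_new)
```

Because the interval is stored as a pair of markers rather than as fixed coordinates, the same call yields updated boundaries whenever the position table is refreshed for a new build.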
Beta Cell Gene Expression Bank
The Beta Cell Gene Expression Bank is a dataset curated by Decio L. Eizirik and colleagues at the Laboratory for Experimental Medicine at the Free University of Brussels (ULB). There are two main components. One, called the Fast Track, reports the expression level of genes in beta cells under basal conditions and under conditions thought to induce beta cell dysfunction and death in T1D; these data come from a series of microarray experiments conducted in the Eizirik laboratory (12–16). The second component, called the Annotated Track, consists of manual annotation of gene function carried out by beta cell experts; priority is given to genes whose expression is changed when dysfunction is induced. The annotation includes information on the gene's function, its localization, disease association (with special focus on T1D and other autoimmune diseases), other interacting proteins and the phenotype after gene disruption in knockout/transgenic models. Key original references and reviews are provided.
The Fast Track currently has data for about 4500 genes from 30 Affymetrix microarray experiments. The Annotated Track contains more than 300 genes at present and is growing at a rate of 40–60 new genes per month.
Gene x tissue expression
This dataset indicates whether a gene is expressed in a limited set of T1D-relevant tissue types, namely bone marrow, lymph nodes, pancreas, spleen and thymus, based on an analysis of UniGene ESTs. This dataset is being replaced by a more comprehensive resource that combines microarray data from the Beta Cell Gene Expression Bank, to characterize expression in beta cells, with the GNF SymAtlas (17), to characterize expression elsewhere.
Pathways
Genes are linked to pathways in the KEGG (18) and BioCarta (http://www.biocarta.com) databases. The BioCarta pathways are searched using the Cancer Genome Anatomy Project's (CGAP) Pathway Searcher (http://cgap.nci.nih.gov/Pathways/Pathway_Searcher). For KEGG pathways, it is possible to display a table of the genes involved which indicates whether the gene is located within a T1D candidate region.
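Flagging whether a pathway gene lies within a T1D candidate region amounts to an interval-overlap test. The sketch below shows one way to express it in Python; the names `Interval`, `overlaps` and `pathway_table`, the gene labels and the coordinates are all illustrative assumptions rather than part of the T1DBase code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a, b):
    """True if two intervals share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def pathway_table(genes, regions):
    """Return (gene name, in-candidate-region flag) rows for a pathway."""
    return [
        (name, any(overlaps(loc, region) for region in regions))
        for name, loc in genes.items()
    ]


# Made-up coordinates for illustration only.
regions = [Interval("6", 29_000_000, 34_000_000)]
genes = {
    "GENE_A": Interval("6", 31_000_000, 31_010_000),
    "GENE_B": Interval("11", 2_100_000, 2_200_000),
}
table = pathway_table(genes, regions)
```

Here `GENE_A` is flagged because it falls inside the single hypothetical region, while `GENE_B` is on a different chromosome and is not.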
Links between datasets
When a user accesses a gene from any dataset on the website, a gene page is displayed that provides links to all T1DBase datasets that contain the gene. From this page, the user can also get to GBrowse and most other tools that can manipulate the gene. In addition, the gene page includes links to the following external resources: LocusLink, UniGene, HomoloGene, OMIM, GeneCards and EPConDB. We are in the process of developing similar links within the major tools on the site, so that tools can use information from any dataset to modify how data are visualized or processed.
TOOLS
Generic Genome Browser (GBrowse)
GBrowse (19) is used to visualize genetic and genomic data (Figure 3). Genomic data are extracted from the Ensembl and UCSC genome databases. The Ensembl database is downloaded after each Ensembl release, and the Ensembl API is used to extract the genome features of interest. These are converted into General Feature Format (GFF) and loaded into the GBrowse database. From UCSC, certain data types, notably the UCSC mRNA and EST homologies, are downloaded, converted into GFF and loaded into the GBrowse database. Currently 32 data tracks are available. Efforts are underway to integrate statistical tools such as selection of tag SNPs (20) and display of D′/r² plots for an interval of interest.
An alternative approach to integrating the Ensembl and UCSC data would be to use the Distributed Annotation System (DAS) (21). However, the current specification of DAS only allows a limited glyph set, and does not, for instance, allow graphs to be represented.
We make extensive use of the plugin capability provided by GBrowse. A plugin is used to visualize the UCSC dataset of regulatory potential scores (22). This is a very large dataset, which we prefer not to store in our main GBrowse database. Instead, we import it into a separate database and use a plugin to connect GBrowse to the data. Similar plugins are used to visualize Fugu net scores and repeat density plots. We expect to add more plugins as we integrate additional data tracks that do not fit the built-in GBrowse model.
Another plugin facilitates genome annotation. The plugin uses BLAT (23) to align an mRNA sequence to the genome and converts the result into a GFF file. The user can then upload the file and view the annotation in GBrowse. To add the annotation to the permanent database, the user can email the GFF file to T1DBase; the file is then manually verified and loaded into the database.
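Converting a BLAT alignment into GFF can be sketched as below. This is an illustrative Python version, not the plugin itself: it assumes BLAT's tab-separated PSL output for an untranslated nucleotide alignment (21 columns, single-character strand), and the function name `psl_to_gff`, the `source` and `feature` labels, and the layout of GFF column 9 are assumptions. The example alignment is made up.

```python
def psl_to_gff(psl_line, source="BLAT", feature="exon"):
    """Turn one PSL alignment line into one GFF line per aligned block."""
    f = psl_line.rstrip("\n").split("\t")
    if len(f) != 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    strand, q_name, t_name = f[8], f[9], f[13]
    block_sizes = [int(x) for x in f[18].rstrip(",").split(",")]
    t_starts = [int(x) for x in f[20].rstrip(",").split(",")]
    rows = []
    for size, start in zip(block_sizes, t_starts):
        # PSL coordinates are 0-based, half-open; GFF is 1-based, inclusive.
        rows.append("\t".join([
            t_name, source, feature,
            str(start + 1), str(start + size),
            ".", strand, ".",
            f"mRNA {q_name}",  # column 9 layout is an assumption
        ]))
    return rows


# Made-up alignment: two blocks of 100 and 50 bases on chr6.
example_psl = "\t".join([
    "150", "0", "0", "0", "0", "0", "1", "400", "+",
    "mRNA_X", "150", "0", "150",
    "chr6", "170000000", "1000", "1550",
    "2", "100,50,", "0,100,", "1000,1500,",
])
gff_rows = psl_to_gff(example_psl)
```

Each aligned block becomes its own GFF line, so a spliced mRNA appears as a series of exon-like features grouped by the value in column 9.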
We also use plugins to allow users to export selected data tracks to a file.
The T1DBase GBrowse provides the T1D research community with a rich genomic data environment by integrating the UCSC and Ensembl genomes and user contributed data.
Search
T1DBase offers a site-wide search capability that works across the multiple datasets present on the site. A technical subtlety is that different kinds of data require different search strategies which the software carries out behind the scenes. Genes are an important special case: the software can search for genes based on a variety of identifiers, including gene names, symbols, LocusLink IDs and UniGene IDs.
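One way to choose a search strategy for gene identifiers is to classify the query string before searching. The Python sketch below is only an illustration of that idea and makes no claim about how the T1DBase search is implemented: the function name `classify_gene_query` and the returned labels are hypothetical, and the rules assume that LocusLink IDs are plain integers and that UniGene cluster IDs for human, mouse and rat take the form `Hs.`, `Mm.` or `Rn.` followed by a number.

```python
import re

# UniGene cluster IDs for human, mouse and rat, e.g. "Hs.12345".
UNIGENE_ID = re.compile(r"^(Hs|Mm|Rn)\.\d+$")


def classify_gene_query(query):
    """Guess which kind of gene identifier a search string contains."""
    q = query.strip()
    if UNIGENE_ID.match(q):
        return "unigene_id"
    if q.isdigit():
        return "locuslink_id"
    # Anything else is treated as a gene symbol or free-text name.
    return "symbol_or_name"
```

For example, `classify_gene_query("Hs.12345")` is routed to a UniGene lookup, a purely numeric string to a LocusLink lookup, and a string such as `"INS"` to a symbol or name search.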
The search system is built on the open source Plucene package, a Perl port of the widely used Lucene package (24) (http://www.onjava.com).
Connect-the-Dots
Connect-the-Dots connects identifiers for genes and other entities based on information extracted from multiple data sources. It provides methods for parsing data sources to extract identifiers and connections among identifiers, and loading this information into an internal database. Users can query the database to connect identifiers from any number of sources by following paths composed of the parsed connections. For example, to find literature citations about genes of interest on an Affymetrix chip, a query can connect Affymetrix probeset identifiers to LocusLink identifiers using information from Affymetrix's annotation files and connect the LocusLink identifiers to PubMed identifiers using information in NCBI's LocusLink files. Longer and more complex paths are also possible. Queries are expressed in a special-purpose query language and are translated into SQL by the software.
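The path-following idea can be illustrated with a small in-memory sketch in Python. It is a stand-in for the concept only: the real system uses its own query language translated into SQL, whereas the class `IdGraph`, its methods and every identifier value below are hypothetical.

```python
from collections import defaultdict


class IdGraph:
    """Undirected graph of (id_type, value) identifiers."""

    def __init__(self):
        self._edges = defaultdict(set)

    def connect(self, a, b):
        """Record a connection parsed from some data source."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def follow(self, start_ids, path):
        """Follow connections through the identifier types listed in `path`."""
        frontier = set(start_ids)
        for id_type in path:
            frontier = {
                neighbour
                for node in frontier
                for neighbour in self._edges.get(node, ())
                if neighbour[0] == id_type
            }
        return frontier


# Made-up identifiers standing in for parsed annotation files.
g = IdGraph()
g.connect(("affy", "probe_1"), ("locuslink", "1001"))
g.connect(("affy", "probe_2"), ("locuslink", "1002"))
g.connect(("locuslink", "1001"), ("pubmed", "111"))
g.connect(("locuslink", "1001"), ("pubmed", "222"))

citations = g.follow([("affy", "probe_1")], ["locuslink", "pubmed"])
```

The query starts from an Affymetrix probeset, steps to the connected LocusLink identifiers and then to their PubMed identifiers, mirroring the two-step example in the text; longer paths simply add more identifier types to the list.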
The system can be used interactively over the Web, or as a batch resource to create specialized translation tables for specific purposes. Many of the translation tables used internally by T1DBase are constructed in this manner.
The current Connect-the-Dots database has information from LocusLink, UniGene (human, mouse and rat), OMIM, IPI, UniProt, HomoloGene, DoTS, several Affymetrix chips, and human and mouse PancChips (pancreas/islet-specific microarrays). The database contains 20 million unique identifiers and 42 million connections extracted from 2 million data source entries.
Cytoscape
Cytoscape (25) is a tool for visualizing and analyzing biological networks, defined broadly to include any collection of interacting bio-molecules. A common use of the software is to display networks of protein–protein and protein–DNA interactions, but it can also be used to display gene networks. A key feature is that Cytoscape can analyze networks in combination with gene expression data, e.g. to discover sub-networks with correlated expression, and annotation data such as Gene Ontology, e.g. to associate sub-networks with biological functions.
Cytoscape can be launched directly from T1DBase, although at present this only works on two demonstration networks. Work is underway to connect Cytoscape to human protein interaction data from HPRD (26), microarray gene expression data from the Beta Cell Gene Expression Bank and other sources and annotations suggesting association with T1D susceptibility.
GESTALT
GESTALT (27) is a workbench for genome annotation that combines automated and manual analysis with an emphasis on rich graphical display of the analysis results. GESTALT can execute a variety of external analysis programs (e.g. for gene recognition) as well as internal analyses (e.g. for compositional complexity analysis). The results are stored in an internal database and can later be retrieved and displayed.
GESTALT analyses have been carried out on most T1D human candidate regions, and the results can be inspected on T1DBase. Several new genes were found through this analysis. For operational reasons, users are not allowed to run their own GESTALT analyses on our website, but can do so on the public GESTALT server at http://db.systemsbiology.net/gestalt/.
IMPLEMENTATION ISSUES
Remapping of features
Local features—meaning annotations that are not in Ensembl or UCSC—are stored in a feature database. The feature database was intended to be a Bio::DB::GFF-shaped database, as used by GBrowse; however, user accountability was required over database inserts, edits and deletes, so various modifications and additions were introduced. The variable GFF field 9 was replaced with a defined set of attributes for each feature type. For each feature, the NCBI build number is linked to the feature's coordinates and these are stored together with the sequence. The database is checked on a daily basis for unmapped features, and the sequences for these features are extracted and mapped onto the genome using BLAT. This storage of sequence also allows for easy remapping after an update of the genome build.
When Ensembl or UCSC issue new releases, we reimport their data and rebuild our GBrowse database from scratch. We then extract local features from the feature database, remap these onto the genome using BLAT and add the remapped features to the GBrowse database.
The remapping process could be made faster through comparison of the new and old genome releases. For genomic regions that are not changed, it can be assumed that all the features contained within the region still have the same coordinates and need not be remapped. However, remapping is currently not a rate-limiting step, and we have not yet attempted this optimization.
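The proposed optimization can be sketched as a containment test. The Python example below is a minimal illustration of an optimization that, as stated above, has not been implemented; it assumes that a comparison of the two releases has already produced a list of regions whose sequence and coordinates are unchanged. The function names, feature names and coordinates are hypothetical.

```python
def needs_remap(feature, unchanged_regions):
    """True unless the feature lies wholly inside an unchanged region."""
    chrom, start, end = feature
    return not any(
        chrom == r_chrom and r_start <= start and end <= r_end
        for r_chrom, r_start, r_end in unchanged_regions
    )


def split_features(features, unchanged_regions):
    """Partition feature names into (keep coordinates, remap with BLAT)."""
    keep, remap = [], []
    for name, location in features.items():
        if needs_remap(location, unchanged_regions):
            remap.append(name)
        else:
            keep.append(name)
    return keep, remap


# Made-up regions and features, as (chromosome, start, end).
unchanged = [("6", 1, 5_000_000)]
features = {
    "feat_inside": ("6", 100_000, 101_000),
    "feat_straddling": ("6", 4_999_500, 5_000_500),
    "feat_elsewhere": ("11", 2_000, 3_000),
}
keep, remap = split_features(features, unchanged)
```

Only features that straddle a region boundary or fall outside every unchanged region would be passed to BLAT; the rest keep their stored coordinates.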
Website implementation
The website is implemented in Perl and runs on Linux with the Apache web server. Most of the website uses MySQL as the underlying database engine; the exception is Connect-the-Dots, which uses PostgreSQL due to the complex queries involved. Essentially all web pages are generated by CGI scripts.
We use common Perl modules (CGI, Apache::Session, Template, DBI) for basic web and database functionality, and developed a page template module on top of these to ensure a common look-and-feel. The page template generates the basic look of each page—top banner, side navigation bar and footer material—and handles processing needs such as session tracking, user logins, page titles, error logging and database connections.
ACCESSING THE WEBSITE
Website navigation conforms to standard web paradigms and should be intuitive to web-literate users. A navigation bar on the left-hand side of each page provides links to the different areas of the site, and a search box in the upper right corner allows the user to quickly search for a feature of interest. Each page has tabs that link to closely related pages, such as a link to the Beta Cell Gene Expression Bank from a gene information page. Pages provide links to past pages to make it easier for people to back up or branch their navigation, and we assign meaningful titles to each page so that navigation aids built into most browsers (back and forward buttons and histories) can be used sensibly.
The database is open access and all the data are available for download. The entire database dump can be downloaded, as can all Cytoscape and GESTALT data files. In addition, a few datasets can be downloaded in more succinct form, including definitions of the T1D candidate regions, and summaries of the genes found in these regions. We are working to make more of our content available in convenient formats.
DISCUSSION
Disease researchers require access to a wide variety of data and software tools, ideally integrated in one place. T1DBase is an attempt at such an integration, designed with the needs of scientists working on T1D in mind. Providing a comprehensive set of resources in one place accelerates research by reducing the time scientists have to spend searching the web. The integration of data and tools on T1DBase naturally leads the scientist from initial findings to related information.
T1DBase has been designed to be expanded easily; we expect to add more datasets and tools as the project proceeds. An important near-term goal is to add information on protein–protein interactions observed in islets and beta cells, as this is a major area of T1D research. The datasets are reasonably independent of each other and can be curated and managed separately.
While T1DBase is focused squarely on a single disease, the conceptual design should be applicable for many other diseases. We believe that our software is readily adapted for new systems, and welcome the opportunity to work with other disease communities interested in making this happen.
|
Section
|
INTRODUCTION
T1DBase (http://T1DBase.org) is a public website and database that supports the type 1 diabetes (T1D) research community. T1DBase collects information from public sources and collaborating investigators, integrates this information, and presents it in a form that is useful for, and accessible to, T1D researchers. It is analogous to a model organism database but is focused on a specific disease rather than a specific organism. The site contains multiple semi-independent datasets that are curated independently (in some cases by external collaborators), and then unified using integration software developed for this purpose. Figure 1 shows the homepage. All data are open access and all software is open source.
T1DBase is a merger of two separate projects: one at the Institute for Systems Biology (ISB), which was explicitly funded by the Juvenile Diabetes Research Foundation to create a public resource; the other at the Juvenile Diabetes Research Foundation/Wellcome Trust Diabetes and Inflammation Laboratory (DIL) of the University of Cambridge, whose mission is to develop tools and methods to integrate genomic and genetic data and improve cost-effectiveness in the search for T1D susceptibility genes (1).
T1D is an autoimmune disease in which the insulin-producing pancreatic beta cells are selectively destroyed. T1D is the second most common form of diabetes with a prevalence of 0.4% in Caucasians (OMIM:222100). The disease is caused by a combination of environmental and genetic factors. While most cases are non-familial, disease risk is dramatically higher (15 times) for siblings of an affected individual; many lines of research confirm that the increased risk is at least partially genetic (2). The HLA region on chromosome 6 confers 40–50% of the genetic susceptibility with lesser contributions from the three other known loci: INS (3), CTLA4 (4) and LYP/PTPN22 (5). It has been estimated that there could be an additional 50 detectable susceptibility loci which are yet to be identified (W. Y. S. Wang, B. J. Barratt, D. G. Clayton and J. A. Todd, submitted for publication).
The major animal models for T1D are the nonobese diabetic (NOD) mouse, and the BB rat. In addition, research on beta cell function is also carried out on non-diabetic mouse and rat strains.
At present, T1DBase is focused on the molecular genetics and biology of T1D susceptibility and pathogenesis. This represents ∼15% of current T1D research and involves more than 1000 investigators, based on a survey of T1D publications.
|
Title
|
INTRODUCTION
|
Section
|
DATA CONTENT
Annotated genomes
T1DBase provides annotated genome sequence for human, rat and mouse. Currently 32 data tracks are available. The major sources for this information are the public Ensembl (6,7) and UCSC (8) genome databases, augmented by gene annotations produced internally by scientists at DIL and ISB. The DIL, upon dbSNP submission, also publishes its SNPs and primers through T1DBase.
From the large number of data tracks available on Ensembl and UCSC, we have chosen those that are most relevant to T1D investigators. We are in the process of adding tracks and integrating tools not available on Ensembl and UCSC that are of specific interest to our users. Examples appear later.
We are committed to incorporating annotations from scientists in the community. Such annotations are manually reviewed before being added to the database.
T1D prioritized regions
The system has information on prioritized regions in human, rat and mouse.
For human, there are three different kinds of prioritized regions: genetic linkage regions, regions defined through orthology with susceptibility regions in NOD congenic strains and candidate gene regions. There are 20 putative linkage regions, nine orthology regions and numerous candidate gene regions. For linkage and orthology regions, we provide a list of genes in the region. For linkage regions, we also provide a bibliography of publications studying the region, and a summary of LOD scores from the various studies. The candidate genes include ones reported in the literature as having a positive association with T1D, ones reported to be associated with other immune-mediated disorders such as asthma or rheumatoid arthritis that are also candidates for T1D and other genes of interest. The latter includes orthologs of mouse or rat genes associated with T1D, and genes involved in relevant pathways. For genes with literature support, we provide links to the publications.
For mouse, the database contains 27 susceptibility regions defined by the Mouse Genome Database (MGD) (9) with supporting literature. For rat, we have 18 regions and supporting literature from the Rat Genome Database (RGD) (10).
Genetic association studies
We have assembled a dataset of genetic association studies pertaining to T1D in collaboration with the NIH Genetic Association Database (GAD) (11). The dataset covers about 100 genes and 180 publications, including published negative results. Under the collaboration, we carry out literature searches to identify relevant studies and pass the results to GAD for a final quality check and data entry.
NOD strain database
Congenic strains have been created to help identify regions of the NOD genome involved in T1D by introgressing chromosomal regions from resistant strains into the NOD mouse. To visualize the introgressed regions, we have developed a strain database. The strain database has the same data model as the feature database and allows the storage of strain information, such as strain name and its aliases, chromosomes, fine mapping markers and the name of the regions. These data are updated automatically with each update of the NCBI mouse genome build. When an interval is refined, scientists submit the new boundary markers via a webpage and the intervals are recalculated. The intervals are drawn through the Perl GD package. The database and the drawing tools may be of interest to other researchers working with congenic strains. Figure 2 illustrates the way strain information is displayed.
Beta Cell Gene Expression Bank
The Beta Cell Gene Expression Bank is a dataset curated by Decio L. Eizirik and colleagues at the Laboratory for Experimental Medicine at the Free University of Brussels (ULB). There are two main components. One, called the Fast Track, reports the expression level of genes in beta cells under basal conditions and under conditions thought to induce beta cell dysfunction and death in T1D; these data come from a series of microarray experiments conducted in the Eizirik laboratory (12–16). The second component, called the Annotated Track, consists of manual annotation of gene function carried out by beta cell experts; priority is given to genes whose expression is changed when dysfunction is induced. The annotation includes information on the gene's function, its localization, disease association (with special focus on T1D and other autoimmune diseases), other interacting proteins and the phenotype after gene disruption in knockout/transgenic models. Key original references and reviews are provided.
The Fast Track currently has data for about 4500 genes from 30 Affymetrix microarray experiments. The Annotated Track contains more than 300 genes at present and is growing at a rate of 40–60 new genes per month.
Gene x tissue expression
This dataset indicates whether a gene is expressed in a limited set of T1D-relevant tissue types, namely, blood marrow, lymph nodes, pancreas, spleen and thymus based on an analysis of UniGene ESTs. This dataset is being replaced by a more comprehensive resource that combines microarray data from the Beta Cell Gene Expression Bank to characterize expression in beta cells and the GNF SymAtlas (17) to characterize expression elsewhere.
Pathways
Genes are linked to pathways in the KEGG (18) and BioCarta (http://www.biocarta.com) databases. The BioCarta pathways are searched using the Cancer Genome Anatomy Project's (CGAP) Pathway Searcher (http://cgap.nci.nih.gov/Pathways/Pathway_Searcher). For KEGG pathways, it is possible to display a table of the genes involved which indicates whether the gene is located within a T1D candidate region.
Links between datasets
When a user accesses a gene from any dataset on the website, a gene page is displayed that provides links to all T1DBase datasets that contain the gene. From this page, the user can also get to GBrowse and most other tools that can manipulate the gene. In addition, the gene page includes links to the following external resources: LocusLink, UniGene, HomoloGene, OMIM, GeneCards and EPConDB. We are in the process of developing similar links within the major tools on the site, so that tools can use information from any dataset to modify how data are visualized or processed.
|
Title
|
DATA CONTENT
|
Section
|
Annotated genomes
T1DBase provides annotated genome sequence for human, rat and mouse. Currently 32 data tracks are available. The major sources for this information are the public Ensembl (6,7) and UCSC (8) genome databases, augmented by gene annotations produced internally by scientists at DIL and ISB. The DIL, upon dbSNP submission, also publishes its SNPs and primers through T1DBase.
From the large number of data tracks available on Ensembl and UCSC, we have chosen those that are most relevant to T1D investigators. We are in the process of adding tracks and integrating tools not available on Ensembl and UCSC that are of specific interest to our users. Examples appear later.
We are committed to incorporating annotations from scientists in the community. Such annotations are manually reviewed before being added to the database.
|
Title
|
Annotated genomes
|
Section
|
T1D prioritized regions
The system has information on prioritized regions in human, rat and mouse.
For human, there are three different kinds of prioritized regions: genetic linkage regions, regions defined through orthology with susceptibility regions in NOD congenic strains and candidate gene regions. There are 20 putative linkage regions, nine orthology regions and numerous candidate gene regions. For linkage and orthology regions, we provide a list of genes in the region. For linkage regions, we also provide a bibliography of publications studying the region, and a summary of LOD scores from the various studies. The candidate genes include ones reported in the literature as having a positive association with T1D, ones reported to be associated with other immune-mediated disorders such as asthma or rheumatoid arthritis that are also candidates for T1D and other genes of interest. The latter includes orthologs of mouse or rat genes associated with T1D, and genes involved in relevant pathways. For genes with literature support, we provide links to the publications.
For mouse, the database contains 27 susceptibility regions defined by the Mouse Genome Database (MGD) (9) with supporting literature. For rat, we have 18 regions and supporting literature from the Rat Genome Database (RGD) (10).
|
Title
|
T1D prioritized regions
|
Section
|
Genetic association studies
We have assembled a dataset of genetic association studies pertaining to T1D in collaboration with the NIH Genetic Association Database (GAD) (11). The dataset covers about 100 genes and 180 publications, including published negative results. Under the collaboration, we carry out literature searches to identify relevant studies and pass the results to GAD for a final quality check and data entry.
|
Title
|
Genetic association studies
|
Section
|
NOD strain database
Congenic strains have been created to help identify regions of the NOD genome involved in T1D by introgressing chromosomal regions from resistant strains into the NOD mouse. To visualize the introgressed regions, we have developed a strain database. The strain database has the same data model as the feature database and allows the storage of strain information, such as strain name and its aliases, chromosomes, fine mapping markers and the name of the regions. These data are updated automatically with each update of the NCBI mouse genome build. When an interval is refined, scientists submit the new boundary markers via a webpage and the intervals are recalculated. The intervals are drawn through the Perl GD package. The database and the drawing tools may be of interest to other researchers working with congenic strains. Figure 2 illustrates the way strain information is displayed.
|
Title
|
NOD strain database
|
Section
|
Beta Cell Gene Expression Bank
The Beta Cell Gene Expression Bank is a dataset curated by Decio L. Eizirik and colleagues at the Laboratory for Experimental Medicine at the Free University of Brussels (ULB). There are two main components. One, called the Fast Track, reports the expression level of genes in beta cells under basal conditions and under conditions thought to induce beta cell dysfunction and death in T1D; these data come from a series of microarray experiments conducted in the Eizirik laboratory (12–16). The second component, called the Annotated Track, consists of manual annotation of gene function carried out by beta cell experts; priority is given to genes whose expression is changed when dysfunction is induced. The annotation includes information on the gene's function, its localization, disease association (with special focus on T1D and other autoimmune diseases), other interacting proteins and the phenotype after gene disruption in knockout/transgenic models. Key original references and reviews are provided.
The Fast Track currently has data for about 4500 genes from 30 Affymetrix microarray experiments. The Annotated Track contains more than 300 genes at present and is growing at a rate of 40–60 new genes per month.
|
Title
|
Beta Cell Gene Expression Bank
|
Section
|
Gene x tissue expression
This dataset indicates whether a gene is expressed in a limited set of T1D-relevant tissue types, namely, bone marrow, lymph nodes, pancreas, spleen and thymus, based on an analysis of UniGene ESTs. This dataset is being replaced by a more comprehensive resource that combines microarray data from the Beta Cell Gene Expression Bank to characterize expression in beta cells and the GNF SymAtlas (17) to characterize expression elsewhere.
|
Title
|
Gene x tissue expression
|
Section
|
Pathways
Genes are linked to pathways in the KEGG (18) and BioCarta (http://www.biocarta.com) databases. The BioCarta pathways are searched using the Cancer Genome Anatomy Project's (CGAP) Pathway Searcher (http://cgap.nci.nih.gov/Pathways/Pathway_Searcher). For KEGG pathways, it is possible to display a table of the genes involved which indicates whether the gene is located within a T1D candidate region.
|
Title
|
Pathways
|
Section
|
Links between datasets
When a user accesses a gene from any dataset on the website, a gene page is displayed that provides links to all T1DBase datasets that contain the gene. From this page, the user can also get to GBrowse and most other tools that can manipulate the gene. In addition, the gene page includes links to the following external resources: LocusLink, UniGene, HomoloGene, OMIM, GeneCards and EPConDB. We are in the process of developing similar links within the major tools on the site, so that tools can use information from any dataset to modify how data are visualized or processed.
|
Title
|
Links between datasets
|
Section
|
TOOLS
Generic Genome Browser (GBrowse)
GBrowse (19) is used to visualize genetic and genomic data (Figure 3). Genomic data are extracted from the Ensembl and UCSC genome databases. The Ensembl database is downloaded after each Ensembl release, and the Ensembl API is used to extract the genome features of interest. These are converted into General Feature Format (GFF) and loaded into the GBrowse database. From UCSC, certain data types, notably the UCSC mRNA and EST homologies, are downloaded, converted into GFF and loaded into the GBrowse database. Currently, 32 data tracks are available. Efforts are underway to integrate statistical tools such as selection of tag SNPs (20) and display of D′/R² plots for an interval of interest.
An alternative approach to integrating the Ensembl and UCSC data would be to use the Distributed Annotation System (DAS) (21). However, the current DAS specification allows only a limited glyph set and does not, for instance, allow graphs to be represented.
We make extensive use of the plugin capability provided by GBrowse. A plugin is used to visualize the UCSC dataset of regulatory potential scores (22). This is a very large dataset, which we prefer not to store in our main GBrowse database. Instead, we import it into a separate database and use a plugin to connect GBrowse to the data. Similar plugins are used to visualize Fugu net scores and repeat density plots. We expect to add more plugins as we integrate additional data tracks that do not fit the built-in GBrowse model.
Another plugin facilitates genome annotation. The plugin uses BLAT (23) to align an mRNA sequence to the genome and converts the result into a GFF file. The user can then upload the file and view the annotation in GBrowse. To add the annotation to the permanent database, the user can email the GFF file to T1DBase; the file is then manually verified and loaded into the database.
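The conversion from a BLAT alignment to GFF can be sketched as follows. This is an illustration rather than the plugin itself: it assumes BLAT's tab-separated PSL output for a nucleotide alignment (21 columns, 0-based half-open target coordinates) and emits one nine-column GFF line per aligned block. The function name, the `match_part` feature type, the `mRNA <name>` group field and the example alignment are assumptions chosen for the example.

```python
def psl_to_gff(psl_line, source="BLAT", feature_type="match_part"):
    """Convert one PSL alignment line into GFF lines, one per aligned block."""
    fields = psl_line.rstrip("\n").split("\t")
    if len(fields) != 21:
        raise ValueError(f"expected 21 PSL columns, got {len(fields)}")
    strand = fields[8]
    query_name = fields[9]
    target_name = fields[13]
    # blockSizes and tStarts are comma-separated lists with a trailing comma.
    block_sizes = [int(x) for x in fields[18].split(",") if x]
    target_starts = [int(x) for x in fields[20].split(",") if x]

    gff_lines = []
    for size, start in zip(block_sizes, target_starts):
        gff_lines.append("\t".join([
            target_name, source, feature_type,
            str(start + 1),      # PSL is 0-based; GFF is 1-based
            str(start + size),   # PSL end is exclusive; GFF end is inclusive
            ".", strand, ".",
            f"mRNA {query_name}",
        ]))
    return gff_lines


# Hypothetical two-block alignment of a 150-base mRNA to chr1.
example_psl = "\t".join([
    "150", "0", "0", "0", "0", "0", "1", "900", "+",
    "mRNA1", "150", "0", "150",
    "chr1", "100000", "1000", "2050",
    "2", "100,50,", "0,100,", "1000,2000,",
])

for line in psl_to_gff(example_psl):
    print(line)
```

Emitting one line per block keeps the exon-like structure of a spliced alignment visible in the browser; a real converter would also need to decide how to score and group alignments.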
We also use plugins to allow users to export selected data tracks to a file.
The T1DBase GBrowse provides the T1D research community with a rich genomic data environment by integrating the UCSC and Ensembl genomes and user contributed data.
Search
T1DBase offers a site-wide search capability that works across the multiple datasets present on the site. A technical subtlety is that different kinds of data require different search strategies, which the software carries out behind the scenes. Genes are an important special case: the software can search for genes based on a variety of identifiers, including gene names, symbols, LocusLink IDs and UniGene IDs.
The search system is built on the open source Plucene package, a Perl port of the widely used Lucene package (24) (http://www.onjava.com).
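The idea of choosing a search strategy from the form of the query can be illustrated with a small sketch. It does not use Plucene or reflect the actual T1DBase index; it assumes only that UniGene cluster IDs look like `Hs.1234`, `Mm.1234` or `Rn.1234` and that LocusLink IDs are plain integers. The gene records and function names are made up for the example.

```python
import re

UNIGENE_ID = re.compile(r"^(Hs|Mm|Rn)\.\d+$")
NUMERIC_ID = re.compile(r"^\d+$")

# Made-up gene records, for illustration only.
GENES = [
    {"symbol": "GENE1", "name": "example gene one",
     "locuslink": "1001", "unigene": "Hs.1001"},
    {"symbol": "GENE2", "name": "example gene two",
     "locuslink": "1002", "unigene": "Mm.2002"},
]


def choose_strategy(query):
    """Pick a search strategy from the shape of the query string."""
    q = query.strip()
    if UNIGENE_ID.match(q):
        return "unigene_id"
    if NUMERIC_ID.match(q):
        return "locuslink_id"
    if " " not in q:
        return "gene_symbol"
    return "full_text"


def search_genes(query, genes=GENES):
    """Return the gene records matching the query under the chosen strategy."""
    q = query.strip()
    strategy = choose_strategy(q)
    if strategy == "unigene_id":
        return [g for g in genes if g["unigene"] == q]
    if strategy == "locuslink_id":
        return [g for g in genes if g["locuslink"] == q]
    if strategy == "gene_symbol":
        hits = [g for g in genes if g["symbol"].lower() == q.lower()]
        if hits:
            return hits
    # Fall back to requiring every query word to appear in the gene name.
    words = q.lower().split()
    return [g for g in genes
            if all(w in g["name"].lower().split() for w in words)]


print([g["symbol"] for g in search_genes("Hs.1001")])
print([g["symbol"] for g in search_genes("gene two")])
```

Exact matching on identifiers avoids the false hits that tokenized text search would produce for short numeric IDs, while free-text queries still fall through to a word-based match.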
Connect-the-Dots
Connect-the-Dots connects identifiers for genes and other entities based on information extracted from multiple data sources. It provides methods for parsing data sources to extract identifiers and connections among identifiers, and loading this information into an internal database. Users can query the database to connect identifiers from any number of sources by following paths composed of the parsed connections. For example, to find literature citations about genes of interest on an Affymetrix chip, a query can connect Affymetrix probeset identifiers to LocusLink identifiers using information from Affymetrix's annotation files and connect the LocusLink identifiers to PubMed identifiers using information in NCBI's LocusLink files. Longer and more complex paths are also possible. Queries are expressed in a special-purpose query language and are translated into SQL by the software.
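The path-following idea can be sketched with an in-memory store. This is only an illustration: the real system keeps connections in a relational database and translates its own query language into SQL, whereas the `ConnectionStore` class, its methods and every identifier below are hypothetical.

```python
from collections import defaultdict


class ConnectionStore:
    """Toy store of typed connections between identifiers."""

    def __init__(self):
        # (from_type, from_id, to_type) -> set of connected to_ids
        self._edges = defaultdict(set)

    def add(self, from_type, from_id, to_type, to_id):
        self._edges[(from_type, from_id, to_type)].add(to_id)

    def follow(self, start_ids, path):
        """Follow a path of identifier types, e.g. affymetrix -> locuslink -> pubmed."""
        current = set(start_ids)
        for from_type, to_type in zip(path, path[1:]):
            reached = set()
            for identifier in current:
                reached |= self._edges.get((from_type, identifier, to_type), set())
            current = reached
        return current


store = ConnectionStore()
# Hypothetical connections, as if parsed from chip annotation and LocusLink files.
store.add("affymetrix", "probe_1", "locuslink", "locus_7")
store.add("affymetrix", "probe_2", "locuslink", "locus_8")
store.add("locuslink", "locus_7", "pubmed", "pmid_10")
store.add("locuslink", "locus_7", "pubmed", "pmid_11")
store.add("locuslink", "locus_8", "pubmed", "pmid_12")

print(sorted(store.follow(["probe_1"], ["affymetrix", "locuslink", "pubmed"])))
```

Each step of the path corresponds to one join between connection tables in a relational implementation, which is why longer paths translate naturally into multi-way SQL joins.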
The system can be used interactively over the Web, or as a batch resource to create specialized translation tables for specific purposes. Many of the translation tables used internally by T1DBase are constructed in this manner.
The current Connect-the-Dots database has information from LocusLink, UniGene (human, mouse and rat), OMIM, IPI, UniProt, HomoloGene, DoTS, several Affymetrix chips, and human and mouse PancChips (pancreas/islet-specific microarrays). The database contains 20 million unique identifiers and 42 million connections extracted from 2 million data source entries.
Cytoscape
Cytoscape (25) is a tool for visualizing and analyzing biological networks, defined broadly to include any collection of interacting bio-molecules. A common use of the software is to display networks of protein–protein and protein–DNA interactions, but it can also be used to display gene networks. A key feature is that Cytoscape can analyze networks in combination with gene expression data, e.g. to discover sub-networks with correlated expression, and annotation data such as Gene Ontology, e.g. to associate sub-networks with biological functions.
Cytoscape can be launched directly from T1DBase, although at present this only works on two demonstration networks. Work is underway to connect Cytoscape to human protein interaction data from HPRD (26), microarray gene expression data from the Beta Cell Gene Expression Bank and other sources, and annotations suggesting association with T1D susceptibility.
GESTALT
GESTALT (27) is a workbench for genome annotation that combines automated and manual analysis with an emphasis on rich graphical display of the analysis results. GESTALT can execute a variety of external analysis programs (e.g. for gene recognition) as well as internal analyses (e.g. for compositional complexity analysis). The results are stored in an internal database and can later be retrieved and displayed.
GESTALT analyses have been carried out on most T1D human candidate regions, and the results can be inspected on T1DBase. Several new genes were found through this analysis. For operational reasons, users are not allowed to run their own GESTALT analyses on our website, but can do so on the public GESTALT server at http://db.systemsbiology.net/gestalt/.
|
Title
|
TOOLS
|
Section
|
IMPLEMENTATION ISSUES
Remapping of features
Local features—meaning annotations that are not in Ensembl or UCSC—are stored in a feature database. The feature database was intended to be a Bio::DB::GFF-shaped database, as used by GBrowse; however, user accountability was required over database inserts, edits and deletes, so various modifications and additions were introduced. The variable GFF field 9 was replaced with a defined set of attributes for each feature type. For each feature, the NCBI build number is linked to the feature's coordinates and these are stored together with the sequence. The database is checked on a daily basis for unmapped features, and the sequences for these features are extracted and mapped onto the genome using BLAT. This storage of sequence also allows for easy remapping after an update of the genome build.
When Ensembl or UCSC issue new releases, we reimport their data and rebuild our GBrowse database from scratch. We then extract local features from the feature database, remap these onto the genome using BLAT and add the remapped features to the GBrowse database.
The remapping process could be made faster through comparison of the new and old genome releases. For genomic regions that are not changed, it can be assumed that all the features contained within the region still have the same coordinates and need not be remapped. However, remapping is currently not a rate-limiting step, and we have not yet attempted this optimization.
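The optimization described above, which has not been implemented, could look roughly like the following sketch. It assumes that a comparison of the old and new releases yields a list of regions whose sequence and coordinates are both unchanged; the function name, the tuple layouts and the example data are hypothetical.

```python
def split_for_remapping(features, unchanged_regions):
    """Separate features that can keep their coordinates from those needing BLAT.

    features: iterable of (name, chromosome, start, end)
    unchanged_regions: iterable of (chromosome, start, end), assumed identical
        in sequence and coordinates between the old and new genome releases
    """
    keep, remap = [], []
    for name, chrom, start, end in features:
        inside = any(
            chrom == r_chrom and r_start <= start and end <= r_end
            for r_chrom, r_start, r_end in unchanged_regions
        )
        (keep if inside else remap).append(name)
    return keep, remap


# Hypothetical data, for illustration only.
unchanged = [("1", 1, 1_000_000)]
features = [
    ("featA", "1", 5_000, 6_000),          # fully inside an unchanged region
    ("featB", "1", 999_500, 1_000_500),    # straddles the region boundary
    ("featC", "2", 100, 200),              # chromosome with no unchanged region
]

keep, remap = split_for_remapping(features, unchanged)
print(keep, remap)
```

Only features wholly contained in an unchanged region are kept; anything that touches a changed region is remapped, which errs on the side of extra BLAT runs rather than stale coordinates.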
Website implementation
The website is implemented in Perl and runs on Linux with the Apache web server. Most of the website uses MySQL as the underlying database engine; the exception is Connect-the-Dots, which uses PostgreSQL due to the complex queries involved. Essentially all web pages are generated by CGI scripts.
We use common Perl modules (CGI, Apache::Session, Template, DBI) for basic web and database functionality, and developed a page template module on top of these to ensure a common look-and-feel. The page template generates the basic look of each page—top banner, side navigation bar and footer material—and handles processing needs such as session tracking, user logins, page titles, error logging and database connections.
|
Title
|
IMPLEMENTATION ISSUES
|
Section
|
ACCESSING THE WEBSITE
Website navigation conforms to standard web paradigms and should be intuitive to web-literate users. A navigation bar on the left-hand side of each page provides links to the different areas of the site, and a search box in the upper right corner allows the user to quickly search for a feature of interest. Each page has tabs that link to closely related pages, such as a link to the Beta Cell Gene Expression Bank from a gene information page. Pages provide links to past pages to make it easier for people to back up or branch their navigation, and we assign meaningful titles to each page so that navigation aids built into most browsers (back and forward buttons and histories) can be used sensibly.
The database is open access and all the data are available for download. The entire database dump can be downloaded, as can all Cytoscape and GESTALT data files. In addition, a few datasets can be downloaded in more succinct form, including definitions of the T1D candidate regions, and summaries of the genes found in these regions. We are working to make more of our content available in convenient formats.
|
Title
|
ACCESSING THE WEBSITE
|
Section
|
DISCUSSION
Disease researchers require access to a wide variety of data and software tools. T1DBase is an attempt to integrate such resources, designed with the needs of scientists working on T1D in mind. Providing a comprehensive set of resources in one place is intended to accelerate research by reducing the time scientists have to spend searching the web. The integration of data and tools on T1DBase naturally leads the scientist from initial findings to related information.
T1DBase has been designed to be expanded easily; we expect to add more datasets and tools as the project proceeds. An important near-term goal is to add information on protein–protein interactions observed in islets and beta cells, as this is a major area of T1D research. The datasets are reasonably independent of each other and can be curated and managed separately.
While T1DBase is focused squarely on a single disease, the conceptual design should be applicable to many other diseases. We believe that our software can be readily adapted to new systems, and we welcome the opportunity to work with other disease communities interested in making this happen.
|
Title
|
DISCUSSION
|