PMC:3245051
MetaBase—the wiki-database of biological databases
Abstract
Biology is generating more data than ever. As a result, there is an ever increasing number of publicly available databases that analyse, integrate and summarize the available data, providing an invaluable resource for the biological community. As this trend continues, there is a pressing need to organize, catalogue and rate these resources, so that the information they contain can be most effectively exploited. MetaBase (MB) (http://MetaDatabase.Org) is a community-curated database containing more than 2000 commonly used biological databases. Each entry is structured using templates and can carry various user comments and annotations. Entries can be searched, listed, browsed or queried. The database was created using the same MediaWiki technology that powers Wikipedia, allowing users to contribute on many different levels. The initial release of MB was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue. Since then, approximately 100 databases have been manually collected from the literature, and users have added information for over 240 databases. MB is synchronized annually with the static Molecular Biology Database Collection provided by NAR. To date, there have been 19 significant contributors to the project; each one is listed as an author here to highlight the community aspect of the project.
INTRODUCTION
When discussing biological databases, there are simply too many different resources to comprehensively cover the topic in a short introduction. There are well-established data warehouses that act as community repositories for data of a single type such as GenBank (1), PDB (2) and ArrayExpress (3). There are organism-specific databases, combining many different types of data under a unifying, genomic framework such as TAIR (4), FlyBase (5) and WormBase (6). There are databases of derived data, collecting and systematizing the body of knowledge from the scientific literature such as GTEx (http://www.ncbi.nlm.nih.gov/gtex/GTEX2/gtex.cgi), TRANSFAC (7), Brenda (8) and ChEMBL (9). There are competing databases that cover specific kinds of -omics information, collecting data from different experiments within a common biological theme such as DIP (10), HPID (11) and IntAct (12). There are classification databases (13,14), databases of terminology (15,16), databases of protein families (17,18) and databases built around diseases (19) or taxonomic groups (20). This list barely scratches the surface, but gives a flavour of the number, types and diversity of biological databases.
As the type and volume of biological data continue to increase, so do the type and number of databases that analyse, integrate and summarize the available data. For example, querying PubMed (21), the database of biomedical publications, shows that the number of unique publications with the word ‘database’ in the title increased from just 2 in 1980 to 91 in 1990 and 469 in 2000. Since 1990, there has been an exponential increase in the number of database publications per year, reaching over 1000 per year between 2008 and 2010 (Figure 1). If this trend continues, the number of database publications per year will double to nearly 2000 by 2015.
Figure 1. The growth in the number of database publications per year. Each bar shows the number of research articles with the keyword ‘database’ appearing in the article title in the given year. The count only covers articles indexed in PubMed. The increase shows an exponential trend that will produce nearly 2000 database publications per year by 2015.
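The per-year counts behind Figure 1 could be approximated with a query like the one sketched below. This is a minimal illustration, not the procedure used for the figure: it assumes the public NCBI E-utilities ESearch service, the `[Title]` and `[pdat]` field tags, and the hypothetical helper names `build_query_url`, `parse_count` and `fetch_count`. Counts returned today may differ from those quoted above, because PubMed indexing changes over time.

```python
import json
import urllib.parse
import urllib.request

# NCBI E-utilities search endpoint (an assumption; not named in the article).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query_url(year: int, word: str = "database") -> str:
    """Return an ESearch URL that counts PubMed titles containing `word` in `year`."""
    params = {
        "db": "pubmed",
        "term": f"{word}[Title] AND {year}[pdat]",
        "rettype": "count",  # request only the hit count, not the list of IDs
        "retmode": "json",
    }
    return ESEARCH + "?" + urllib.parse.urlencode(params)


def parse_count(payload: str) -> int:
    """Extract the hit count from an ESearch JSON response body."""
    return int(json.loads(payload)["esearchresult"]["count"])


def fetch_count(year: int) -> int:
    """Query PubMed over the network for the count in a single year."""
    with urllib.request.urlopen(build_query_url(year)) as response:
        return parse_count(response.read().decode("utf-8"))
```

Calling `fetch_count(year)` for each year of interest would give one bar of the chart per call. URL construction and response parsing are kept separate from the network call so that they can be checked offline.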
Biological databases have proven crucially important for basic research. However, the current growth in the number of available databases creates several problems. Researchers seeking the most up-to-date and comprehensive information in their domain may struggle to identify the definitive sources of reliable data from among the many resources available. At first encounter, it is difficult to judge the strengths, weaknesses or status of a resource without peer guidance. For these reasons, the proliferation of resources may, ironically, lead to an increase in redundancy, as new resources are created to cope with the perceived problems or omissions of existing databases. This process is exacerbated by a lack of public forums where researchers can engage database creators to discuss databases and suggest improvements.
These issues have created an unfortunate situation in which many resources are maintained for only a short time before being abandoned. This short ‘half-life’ is analogous to ‘link rot’ (22), and it creates a vicious cycle in which the publication of database resources is devalued (23). To address these problems, we have created MetaBase (MB), a wiki-based database of biological databases.
DATABASE DESCRIPTION
MB is a community-curated database of all the biological databases available on the Internet. The aim of the project is to make it easy for researchers to quickly find relevant information about useful databases. Entries can be searched, queried or browsed by category, and users can contribute, update and maintain the data in many different ways. Each database in MB is described in a semi-structured way using forms and templates. Entries carry data for various fields and allow a free-text description of the resource. In detail, data for each database include a brief description, a URL, a contact email, links to associated literature and various categorization tags. In addition, entries can carry various user comments and annotations.
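As a rough illustration of the semi-structured record described above, the sketch below models one entry and renders it as a MediaWiki template call with named fields. The class name `DatabaseEntry`, the field names and the template name `Database` are all assumptions made for this example; the actual names used in MB are not given here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DatabaseEntry:
    """One MB-style record; the field names are illustrative, not MB's own."""
    name: str
    description: str
    url: str
    contact_email: str
    literature: List[str] = field(default_factory=list)  # e.g. PubMed IDs
    categories: List[str] = field(default_factory=list)  # categorization tags

    def to_wikitext(self, template: str = "Database") -> str:
        """Render the entry as a template call with one named field per line."""
        fields = [
            ("description", self.description),
            ("url", self.url),
            ("contact", self.contact_email),
            ("literature", ", ".join(self.literature)),
            ("categories", ", ".join(self.categories)),
        ]
        lines = ["{{" + template]
        lines += [f"|{key}={value}" for key, value in fields]
        lines.append("}}")
        return "\n".join(lines)


# Placeholder values only; this is not a real MB entry.
entry = DatabaseEntry(
    name="ExampleDB",
    description="A placeholder entry.",
    url="http://example.org",
    contact_email="curator@example.org",
    literature=["12345"],
    categories=["Protein", "Plant"],
)
print(entry.to_wikitext())
```

The sketch does not escape pipe characters inside values, which a real template call would require. The entry name is left out of the template body on the assumption that it serves as the wiki page title.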
MB has been implemented using MediaWiki (MW), the same software that powers Wikipedia, probably the best known user-contributed resource in the world (http://wikipedia.org). The MediaWiki system allows users to contribute to the project on many different levels, ranging from authors and editors to curators and site designers. Within the MW system, we created one wiki-page per database entry. The information about each database is structured by using a template with named fields. The template stores data for each database internally using the Semantic MediaWiki extension (http://semantic-mediawiki.org), allowing data to be queried within the wiki directly, by additional extensions or via the semantic web. In particular, we use the Semantic Forms extension (http://www.mediawiki.org/wiki/SF) to allow users to create or edit entries and the Semantic Drilldown extension (http://www.mediawiki.org/wiki/SD) to allow users to explore the database. User comments are collected as free text, just like in Wikipedia.
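Because the template fields are stored as semantic properties, entries can in principle be retrieved programmatically. The sketch below builds a request for the `action=ask` API module that current Semantic MediaWiki releases provide; whether the MB installation exposes this module is an assumption. The endpoint URL, the category `Database` and the property names `URL` and `Contact` are hypothetical placeholders, not MB's actual configuration.

```python
import urllib.parse
from typing import Iterable


def build_ask_url(api_url: str, condition: str,
                  printouts: Iterable[str], limit: int = 10) -> str:
    """Build a Semantic MediaWiki `action=ask` request URL."""
    # An ask query is a condition followed by |?Property printouts and options.
    parts = [condition] + [f"?{name}" for name in printouts] + [f"limit={limit}"]
    params = {"action": "ask", "query": "|".join(parts), "format": "json"}
    return api_url + "?" + urllib.parse.urlencode(params)


# Hypothetical endpoint, category and property names, for illustration only.
url = build_ask_url(
    "http://example.org/w/api.php",
    "[[Category:Database]]",
    ["URL", "Contact"],
)
print(url)
```

The same condition and printout syntax is used by inline `#ask` queries inside wiki pages, so a query developed one way can usually be reused the other way.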
FEATURES
The MW platform provides a robust base from which to build an online resource. By using MW, many powerful features are provided ‘for free’. The use of MW to support Wikipedia demonstrates the scalability and security of the system, guaranteeing developer support and providing a degree of familiarity to users. Out of the box, MW provides searching, editing, versioning, history and discussion features, as well as user account management and user-email functions. MW includes a powerful extension framework for easily adding functionality.
One criticism of MW is that it provides largely unstructured information, which is not suitable for advanced searching or reporting. To address this, we employ Semantic MediaWiki and Semantic Forms to create a wiki-database system suitable for maintaining a user-contributed database of information.
DATABASE CONTENTS
Currently, there are 1795 entries in MB, each describing a different biological database. The initial release was derived from the content of the 2007 Nucleic Acids Research (NAR) Database Issue (24). Specifically, each database page was ‘seeded’ with text from the Molecular Biology Database Collection provided by NAR (25). Subsequent releases of the collection have been imported into MB on a semi-regular basis. Since the initial release, over 100 user-contributed resources have been added, in addition to 100 resources that were manually collected from the literature. Most of the latter were taken from database publications in BMC Bioinformatics and BMC Biology. To date, there have been 19 significant contributors to the project, each of whom has been listed as an author on this publication. This step was taken to highlight the community aspect of the MB project. The homepage has been visited approximately 100 000 times. The project has 80 registered users in total, and there have been approximately 15 000 edits. We hope that with ongoing improvements and increased publicity, usage will continue to grow, helping to establish MB as a powerful community reference resource.
FUTURE DIRECTIONS
In the future, we hope to use MB to foster communication between database developers and user communities, acting as a common portal for the biological database community. To achieve this goal, we will automatically register each database's contact email address as a user account and add the database's discussion page to that user's ‘watch list’. Comments will then automatically alert the contact, providing them with the opportunity to reply. We hope to add user rating functionality and usage statistics to each resource. This will be done with a combination of existing MediaWiki extensions, links to social networking sites and automatic queries to collect the number of citations for each resource. We expect that MB could be used as a source of genuine metadata for data integration projects, and we plan to incorporate ontologies such as EDAM (26,27) and the Biomedical Resource Ontology (28), and to develop links with similar projects such as BioCatalogue (29) and BioDBCore (30).
Finally, we aim to improve the content of MB through an aggressive marketing strategy, contacting the relevant mailing lists, forums and news groups, as well as exploiting the collection of contact email addresses, thereby encouraging the community to contribute to the maintenance of this important resource.
RELATED WORK
MB is by no means unique. There are many related resources, falling into two broad categories: ‘BioWikis’ and ‘databases of biological databases’.
First, there are several other ‘BioWiki’ projects. Like MB, these projects use the tremendously successful MediaWiki software platform to provide user-contributed content to the biological community. For a comprehensive list of important and interesting BioWiki projects, see the BioWiki database on Bioinformatics.Org (http://bioinformatics.org/wiki/BioWiki). The most successful collection of user-contributed content is Wikipedia (http://www.wikipedia.org/). The success of Wikipedia is intimately related to the success of the MediaWiki software platform, which has led to a proliferation of wikis, including several BioWiki projects. Nevertheless, Wikipedia itself remains a very important resource for biologists (e.g. http://en.wikipedia.org/wiki/Wikipedia:MCB). Wikipedia maintains a sizeable list of biological databases (http://en.wikipedia.org/wiki/List_of_biological_databases), and many of the databases in MB also have articles in Wikipedia.
Second, there are several ‘databases of biological databases’, which aim to provide a list of all the most important biological databases and data resources available on the Internet. Several prominent biological database collections and related projects are listed in Table 1 (see also http://metadatabase.org/wiki/Help:Related).
Table 1. Projects with a similar scope to MB. These projects aim to list the most important biological databases and data resources available on the Internet. For a version of this table that you can edit, see http://metadatabase.org/wiki/Help:Related
DISCUSSION
Biological databases have proven crucially important for basic research. However, exponential growth in the volume of biological data has led to several problems. MB is an international, community-based database that aims to list all the commonly used biological databases in the world. Here, we have created a new scientific wiki that addresses some of the issues described earlier. The first version of the system was based on a static database of biological databases that was imported into a wiki system for community annotation. Although similar to several other ‘lists of resources’, MB is unique in being the only truly user-editable list of databases. The NAR Molecular Biology Database Collection is a curated database with strict criteria for inclusion. It covers only a relatively small number of the available molecular biology databases (M. Galperin, personal communication). In contrast, we hope MB, with its liberal wiki-based inclusion policy, might be useful as a wider, more general list with quicker updates.
FUNDING
Industrial Strategic Technology Development Program (10040231), “Bioinformatics platform development for next generation bioinformation analysis”, funded by the Ministry of Knowledge Economy (MKE, Korea). Funding for open access charge: Genome Research Foundation's internal Biowiki funds.
Conflict of interest statement. None declared.