PMC:540086

NMPdb: Database of Nuclear Matrix Proteins

Abstract

The nuclear matrix (NM) is a structure resulting from the aggregation of proteins and RNA in the nucleus of eukaryotic cells; it is the ‘sticky bit’ that remains after aggressive DNase digestion and salt extraction protocols. Owing to the important role of the NM in DNA replication, DNA transcription and RNA splicing, the expression pattern of NM proteins has become an important early indicator for numerous cancers/tumors. Recent descriptions of the NM structure distinguish between a network-like ‘internal nuclear matrix’ (INM) and a ‘nuclear shell’ that connects the INM to the inner and outer nuclear membranes. A cautious NM preparation protocol reveals a coat of proteins on top of the INM; these proteins are usually referred to as the ‘nuclear matrix-associated proteins’. Here, we describe a new database (NMPdb at http://www.rostlab.org/db/NMPdb/) that currently contains details of 398 NM proteins. We collected these data through a semi-automated analysis of over 3000 scientific articles in PubMed. We could match these 398 proteins to 302 protein sequences in UniProt or GenBank. NMPdb stores these links along with the following annotations: organism, cell type, PubMed identifier, sequence-based predictions of structural and functional features and, for some entries, the explicit sequence segment that is responsible for localization (nuclear matrix targeting signal).

INTRODUCTION

In the early 1960s, researchers began to describe an important nuclear structure in eukaryotic cells that differed from the already well-known DNA/histone-based chromatin (1). This structure, referred to as the ‘nuclear matrix’ (NM), can be separated from the rest of the nucleus by applying DNase I digestion followed by salt extraction (2). Many functional aspects of the NM have been described; these include DNA replication (3), DNA transcription (4) and DNA repair (5,6).
The existence of the NM as an ‘independent’ subnuclear structure is not a proven reality but a widely accepted hypothesis that has profoundly influenced the literature: PubMed alone retrieves over 3000 articles when queried with the terms ‘nuclear matrix’ or ‘nuclear scaffold’. The NM might still be an artifact of the preparation methods rather than a real in vivo structure (7–9). However, the main facts that argue in favor of the existence of this controversial part of the nucleus are its observation in non-eluted nuclei through electron spectroscopic imaging (10), the existence of protocols to isolate the NM at physiological salt concentrations through electroelution of chromatin (11), the fact that chromatin loops (S/MAR-DNA sequences) bind to a non-chromatin network and, finally, the description of functional units that stay in their original place even after chromatin and soluble proteins have been removed from the nucleus (12).

Two main structural elements form the NM (13): the ‘internal nuclear matrix’ (INM) and the ‘nuclear shell’ (or ‘nuclear lamina’). The INM is an aggregate of proteins, mainly the intermediate-filament lamins, NuMa (13) and hnRNP proteins (13,14). The nuclear shell links the INM to the nuclear membranes and/or nuclear envelope. Several non-INM proteins can be separated along with the INM through more careful preparation protocols (15,16). These proteins are usually referred to as ‘associated with the nuclear matrix’.

The protein composition of nuclear matrices in different organisms and cell types was discovered mainly by 2D gel electrophoresis, a method that separates proteins based on their isoelectric points (first dimension) and molecular weight (second dimension). Nuclear matrices, once separated from the chromatin and the soluble compartments of the nucleus, contain very different proteins in tumor cells than in non-tumor cells (17,18).
In cancer research, these differences provide early indications for different types of tumors. Collecting and analyzing data about NM proteins may help to understand the relationship between those proteins and cancer and to discover NM-associated proteins that have not yet been linked to the NM. The vast majority of proteins that have actually been associated experimentally with the NM are not annotated as such in public databases. Thus, we have built and are maintaining NMPdb, a database of proteins that are associated with the nuclear matrix.

DATABASE

Nuclear matrix proteins collected from the literature

First, we downloaded over 3000 abstracts from PubMed that resulted from queries with the terms ‘nuclear matrix/matrices’ and ‘nuclear scaffold’. Then we wrote a simple Perl script that color-highlighted three types of phrases in the text (through HTML tagging): (i) ‘nuclear matrix’ terms, (ii) UniProt protein names and (iii) verbs describing binding processes such as ‘to bind’, ‘to associate’ or ‘to interact’ (Figure 1). Each abstract was followed by HTML elements that enabled the quick interactive subclassification of each protein into one of the following classes: (i) part of the internal nuclear matrix (INM), (ii) ‘tightly’ associated with the INM (ASC), (iii) affinity toward the INM changes depending on protein modification, cell type and/or current stage of the cell cycle (MIX) and (iv) part of the nuclear shell/nuclear lamina (NUS). At this point, we also removed abstracts that contained the search words but did not promise to add information to our database. Finally, we collected the names of the organisms and the cell types in which the interaction with the NM was observed.

Content

Currently, NMPdb contains over 3000 links to PubMed articles corresponding to about 400 unique proteins; for about 300 of these proteins we could verify the links to their sequences through either UniProt (19) or GenBank (20).
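The highlighting step described under ‘Nuclear matrix proteins collected from the literature’ can be illustrated with a short sketch. The original tool was a Perl script whose source is not given here; the Python version below only shows the general idea of tagging three phrase types with HTML. The term lists, the colors and the function name `highlight` are assumptions made for illustration, and a real run would use the full list of UniProt protein names.

```python
import html
import re

# Hypothetical term lists (assumptions); the original script used UniProt protein names.
NM_TERMS = ["nuclear matrix", "nuclear matrices", "nuclear scaffold"]
PROTEIN_NAMES = ["lamin A", "NuMA", "hnRNP U"]  # illustrative only
BINDING_VERBS = r"\b(?:bind(?:s|ing)?|bound|associat(?:e|es|ed|ing)|interact(?:s|ed|ing)?)\b"

# One background color per phrase type (arbitrary choices).
COLORS = {"nm": "#ffd54f", "protein": "#81d4fa", "verb": "#a5d6a7"}


def _alternation(terms):
    # Longest terms first so that longer phrases win over shorter overlapping ones.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


PATTERN = re.compile(
    rf"(?P<nm>\b(?:{_alternation(NM_TERMS)})\b)"
    rf"|(?P<protein>\b(?:{_alternation(PROTEIN_NAMES)})\b)"
    rf"|(?P<verb>{BINDING_VERBS})",
    re.IGNORECASE,
)


def highlight(abstract: str) -> str:
    """Return the abstract as HTML with the three phrase types color-highlighted."""
    escaped = html.escape(abstract)  # escape first so the added tags stay intact

    def wrap(match):
        kind = match.lastgroup  # 'nm', 'protein' or 'verb'
        return f'<span style="background:{COLORS[kind]}">{match.group(0)}</span>'

    return PATTERN.sub(wrap, escaped)


if __name__ == "__main__":
    print(highlight("Lamin A binds to the nuclear matrix."))
```

A curator would then read the highlighted abstract and assign the class (INM, ASC, MIX or NUS) by hand; the sketch covers only the tagging, not the interactive classification form.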
Only 62 of the proteins had significant sequence similarity to a protein with a known high-resolution 3D structure deposited in the PDB (21). Only 101 of the 300 proteins were very different in their sequences [HSSP values below 0 (22)], and about half had rather high levels of sequence similarity to at least one other protein in our set (HSSP value >10). Of the 400 proteins, 42 were classified as INM, 198 as ASC and 130 as MIX; very few (currently 13) were classified as NUS. Most proteins (301) are mammalian (predominantly human, rat and mouse); 29 are viral proteins (e.g. HIV, Papilloma/HPV, Epstein–Barr/EBV). Since such viral proteins are typically involved in the transcription of host DNA, it is not surprising that they are an abundant part of the nuclear matrix in infected cells. Other organisms prominent in NMPdb are Gallus gallus (chicken, with 16 proteins), Drosophila melanogaster (fruit fly, with 14 proteins), Saccharomyces cerevisiae (yeast, with 13 proteins) and Caenorhabditis elegans (worm, with 6 proteins).

Format and fields

NMPdb is stored in an EMBL-like flat file format. Each NM protein is represented by one entry. All entries in the database contain the following fields: (i) origin (organism and cell types), (ii) type of nuclear matrix interaction/involvement (INM, ASC, MIX or NUS), (iii) molecular mass and known or calculated pI for locating the protein on a 2D gel and (iv) reference (PubMed IDs of articles describing the interaction). For some entries we provide additional links to other databases, give the actual protein sequence and collect sequence-based predictions. Although links to UniProt implicitly link NMPdb to a variety of other databases, we also provide explicit links to OMIM (23), SWISS-2DPAGE (24) and S/MARt DB (25), which contains the DNA sequences that the respective protein binds to.
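To make the notion of an EMBL-like flat file with the four mandatory fields more concrete, the following Python sketch parses one entry. The actual NMPdb line codes are not specified in this text, so the two-letter codes used here (ID, OS, CT, NM, MW, PI, RX), the sample values and the function name `parse_entries` are all hypothetical; only the general EMBL conventions (a line code followed by the value, entries terminated by `//`) are assumed.

```python
from collections import defaultdict

# Made-up sample entry with hypothetical line codes; values are placeholders, not real data.
SAMPLE = """\
ID   EXAMPLE_PROTEIN
OS   Homo sapiens
CT   HeLa
NM   ASC
MW   69000
PI   5.6
RX   PubMed=0000000
//
"""


def parse_entries(text: str) -> list:
    """Split an EMBL-like flat file into entries mapping line code -> list of values."""
    entries = []
    current = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("//"):  # EMBL-style end-of-entry marker
            if current:
                entries.append(dict(current))
                current = defaultdict(list)
            continue
        code, _, value = line.partition(" ")
        current[code].append(value.strip())  # repeated codes accumulate
    if current:  # tolerate a missing final terminator
        entries.append(dict(current))
    return entries


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["ID"][0], entry["NM"][0], entry["OS"][0])
```

Keeping every field as a list reflects that a protein may be reported in several cell types or in several articles, so the same line code can occur more than once per entry.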
We provide the following information for all proteins for which we have sequences: (i) the structural domain-like organization according to CHOP (26,27), (ii) predictions of secondary structure, solvent accessibility and membrane helices through PROFphd [B. Rost, manuscript submitted; (28,29)], (iii) coiled-coil regions through COILS (30) and (iv) disordered regions through NORSp (31,32). Where possible, entries are also cross-linked to PEP, a database with predictions for entire proteomes (33) that also contains sequence alignments. For 53 sequences in the database, we found specific information about which part of the sequence is responsible and necessary for NM binding. These regions, usually referred to as nuclear matrix targeting signals (NMTS), are also deposited in NMPdb if available.

Access

NMPdb can be accessed from http://www.rostlab.org/db/NMPdb/, a search-engine interface that allows querying by different database fields and linking queries through ‘AND’, ‘OR’ and ‘AND-NOT’. The complete NMPdb database can be downloaded via FTP. The content of the database, the meaning of the fields and the search interface are described in separate help pages.

Updates

NMPdb annotates many times more proteins as nuclear matrix-associated (∼400) than other public databases such as UniProt (∼80 NM proteins), the ‘nuclear protein database’ (34) (27 NM proteins) or S/MARt DB (25) (80 NM proteins). We currently update NMPdb manually once a week and hope to maintain at least monthly updates for the years to come.
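The linking of field queries through ‘AND’, ‘OR’ and ‘AND-NOT’ described under Access corresponds to set intersection, union and difference over the matching entries. The Python sketch below shows that correspondence; it is not the implementation behind the NMPdb web interface, and the entry records, field names and the functions `select` and `combine` are assumptions made for illustration.

```python
# Made-up entries; field names and values are placeholders, not NMPdb data.
ENTRIES = [
    {"id": "P1", "organism": "Homo sapiens", "class": "INM"},
    {"id": "P2", "organism": "Homo sapiens", "class": "ASC"},
    {"id": "P3", "organism": "Gallus gallus", "class": "ASC"},
]


def select(entries, field, value):
    """Return the set of entry indices whose field equals value."""
    return {i for i, entry in enumerate(entries) if entry.get(field) == value}


def combine(left, operator, right):
    """Link two query results with 'AND', 'OR' or 'AND-NOT'."""
    if operator == "AND":
        return left & right   # in both results
    if operator == "OR":
        return left | right   # in either result
    if operator == "AND-NOT":
        return left - right   # in the left result but not in the right
    raise ValueError(f"unknown operator: {operator}")


if __name__ == "__main__":
    human = select(ENTRIES, "organism", "Homo sapiens")
    asc = select(ENTRIES, "class", "ASC")
    hits = combine(human, "AND-NOT", asc)
    print([ENTRIES[i]["id"] for i in sorted(hits)])
```

Working on index sets keeps each field query independent, so longer chains of linked queries reduce to repeated calls of `combine`.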
