PMC:4331676 / 10460-11802
Annotations
{"target":"https://pubannotation.org/docs/sourcedb/PMC/sourceid/4331676","sourcedb":"PMC","sourceid":"4331676","source_url":"https://www.ncbi.nlm.nih.gov/pmc/4331676","text":"Dataset\nTo construct a high quality benchmark dataset, only experimentally confirmed data were collected. The benchmark dataset S can be formulated as\n(1) S = S + ∪ S -\nwhere the subset S+ contains 525 DNA-binding proteins, the subset S- consists of 550 non DNA-binding proteins and the symbol ∪ represents the \"union\" in the set theory. The benchmark dataset was obtained according to the following procedure. (1) Extract DNA-binding protein sequences from Protein Data Bank (PDB) released at December 2013 by searching the mmCIF keyword of 'DNA binding protein' through the advanced search interface. (2) Remove the sequences with length of less than 50 amino acid residues and character of 'X'. (3) Utilize PISCES to cutoff those sequences that have \u003e= 25% pairwise sequence identity to any other in the same subset. Thus the subset S+ consisting 525 sequences is obtained. (4) Randomly extract some non DNA-binding proteins from Protein Data Bank, then utilize PISCES to cutoff those sequence that have \u003e= 25% pairwise sequence identity to any other in the same subset and remove all the sequences with less than 50 amino acids or with character of 'X'. Thus the subset S- containing 550 sequences is obtained. A complete list of all the PDB codes and sequence for the benchmark dataset can be found in Supporting Information S1.","divisions":[{"label":"title","span":{"begin":0,"end":7}},{"label":"p","span":{"begin":8,"end":150}},{"label":"p","span":{"begin":151,"end":177}},{"label":"label","span":{"begin":151,"end":154}}],"tracks":[]}
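The sequence-level filtering in steps (2) and (4) of the record's text, and the union S = S+ ∪ S-, can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the dictionary layout (identifier mapped to sequence), the 1/0 labels, and the toy identifiers are all assumptions introduced here. The PISCES redundancy step (at most 25% pairwise identity) is run with an external tool and is not reproduced.

```python
def passes_basic_filter(sequence, min_length=50):
    """Return True if the sequence has at least min_length residues and no 'X'."""
    seq = sequence.upper()
    return len(seq) >= min_length and "X" not in seq


def filter_subset(records, min_length=50):
    """Keep only the entries of {identifier: sequence} that pass the basic filter."""
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if passes_basic_filter(seq, min_length)
    }


def build_benchmark(positive, negative):
    """Form S as the union of S+ and S-, labelling entries 1 (binding) or 0 (non-binding).

    Assumes the two subsets share no identifiers; on a clash the negative
    entry would overwrite the positive one.
    """
    benchmark = {seq_id: (seq, 1) for seq_id, seq in positive.items()}
    benchmark.update({seq_id: (seq, 0) for seq_id, seq in negative.items()})
    return benchmark


if __name__ == "__main__":
    # Toy placeholder data, not taken from the benchmark.
    raw_positive = {"pos_1": "A" * 60, "pos_2": "A" * 40, "pos_3": "A" * 55 + "X"}
    raw_negative = {"neg_1": "G" * 50}
    s_plus = filter_subset(raw_positive)
    s_minus = filter_subset(raw_negative)
    print(sorted(build_benchmark(s_plus, s_minus)))
```

In the toy data, `pos_2` is dropped for being shorter than 50 residues and `pos_3` for containing 'X', so only `pos_1` and `neg_1` remain in S.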