Additional file 1: Benchmark dataset S. It contains 1075 protein sequences, divided into a subset of 525 DNA-binding proteins (positive samples) and a subset of 550 non-DNA-binding proteins (negative samples). Both the PDB (Protein Data Bank) accession identifier and the sequence are given for each protein.