3 Methods

The overall approach comprised the following steps.

i. Automated gathering of information from the World Wide Web regarding hemagglutinins, neuraminidases, and sugar binding proteins, particularly but not solely those of viruses, using “autosurfing”, natural language processing, and knowledge extraction techniques [4]. Note that it is specifically non-covalent sialic acid glycan binding sites that are being explored, not, for example, the asparagine, serine, or threonine sites to which glycans are covalently linked. This knowledge gathering approach, as described in Ref. [4], is not essential for reproducing the present work or for carrying out comparable studies, but it does greatly accelerate research and preparation of the scientific paper, allowing fast responses to a new epidemic [3,4,5].

ii. Attempted discovery of continuous short sequences of amino acid residues (potential “sequence motifs”) with patterns and amino acid content common to SARS-CoV-2 spike protein amino acid sequences and to hemagglutinins and neuraminidases, particularly those of influenza viruses. Also, more generally, in preparative work, comparison of the spike protein sequence with sialic acid glycan binding proteins and other sugar binding proteins, or domains of them. Protein sequences, or parts of them, used as input for any part of the study were obtained from GenBank at https://www.ncbi.nlm.nih.gov/genbank/. The standard bioinformatics method used for detecting, in large protein sequence databases, any amino acid residue sequences similar to those of an input sequence was primarily BLASTp at https://blast.ncbi.nlm.nih.gov/Blast.cgi. The standard tool for a more formal and typically multi-sequence alignment was Clustal Omega at https://www.ebi.ac.uk/Tools/msa/clustalo/. These tools can be accessed automatically by the present author's methods [4], but again that is not essential for reproducibility of the present work.

iii.
Examination of patterns in potential or known short subsequences in small proteins or domains known to have a function involving non-covalent sialic acid binding and, in the absence of any clear patterns, study of the amino acid content of the subsequences. This established a preliminary sialic acid glycan binding score (SABS) for the twenty naturally occurring amino acid residues. However, the short subsequences identified were to be considered as “signals” or “fingerprints” for sialic acid glycan binding domains as a whole. That is, direct contact with sialic acid was not necessarily at (or solely at) the specific short subsequence, for reasons discussed in Results Section 4. This, plus a sequence rather than three dimensional structure perspective, and a specific focus on binding sialic acid glycans rather than sugars in general, resulted in scores that differ substantially from those of another major method of predicting sugar binding regions of proteins, also discussed later below.

iv. Development of the above as an algorithm, SABR-P, for identifying potential small proteins or domains of proteins that non-covalently bind sialic acid glycans, by predictions on a test data set of protein sequences. As noted above, the subsequences predicted are taken to indicate the glycan binding domain as a whole, not necessarily the sialic acid sites per se, although they may be. This approach involved noting true positive and true negative predictions, and false positive and false negative predictions, so as to optimize sensitivity and specificity. This was done specifically in regard to non-covalent binding of sialic acid glycans. In other words, it was done so as to distinguish sialic acid glycan binding domains not only from those domains known not to bind sugars but also from those that bind sugars and glycans that do not contain sialic acid.

v.
Examination of the three dimensional structures of the regions of the SARS-CoV-2 spike protein predicted as binding sialic acid glycans, in order to propose and locate a sialic acid binding function of SARS-CoV-2 (possibly but not necessarily associated with some kind of enzymic activity).

Results and discussion in the present paper use the same amino acid codes as the above tools and data bank, i.e. the IUPAC (International Union of Pure and Applied Chemistry) one letter amino acid codes, given in Table 1 below. For completeness, conservative replacements are given in column 3 of Table 1. They relate largely to substitutions that can usefully be made in the design of synthetic peptides [4,5]. This is an application which is not specifically discussed in the present paper but which could be a basis for the design of synthetic vaccines and preventative or therapeutic agents [4,5], in this case targeted at the sialic acid glycan binding site of a virus. As discussed in the sequence of steps for the methodology above, the sialic acid glycan binding motifs are taken to be indicators of the sialic acid binding domain and not necessarily of the target sites per se, but they may be, and often are, potential target sites. The list of conservative replacements also remains useful when detecting and comparing related sequences, for considering which substitutions maintain similar amino acid properties.

Table 1. One letter amino acid codes and sialic acid site binding region measures discussed in the text.

| One letter code | Amino acid | Conservative replacements | Preliminary sialic acid binding amino acid score SABS (see Results) | SABR-P prediction method refined parameters (see Results) |
|---|---|---|---|---|
| A | alanine | A, E, S, T | 1 | 1 |
| C | cysteine/cystine | S, T, V | 1 | 1 |
| D | aspartic acid | E | 1 | 1 |
| E | glutamic acid | A, D | 0 | 0 |
| F | phenylalanine | M, W, Y | 1 | 2 |
| G | glycine | N, P | 1 | 1 |
| H | histidine | K, R | 1 | 2 |
| I | isoleucine | L, V | 0 | 0 |
| K | lysine | H, R | 0 | 0 |
| L | leucine | I, V | 0 | 0 |
| M | methionine | F, W, Y | 0 | 0 |
| N | asparagine | G, D, Q | 1 | 1 |
| P | proline | G | 0 | 0 |
| Q | glutamine | N, E | 0 | 0 |
| R | arginine | H, K | 0 | 0 |
| S | serine | A, T | 1 | 1 |
| T | threonine | A, I, S | 1 | 1 |
| V | valine | A, I, L | 0 | 0 |
| W | tryptophan | F, M, Y | 2 | 4 |
| Y | tyrosine | F, M, W | 1 | 2 |

Since the conservative replacements may, for both of the above reasons, be useful in deeper consideration of many results in the present paper, some comment may help researchers less familiar with bioinformatics. See Ref. [4] for a further account. Note that the work of deciding what is a conservative replacement is done automatically by the standard bioinformatics tools used. The replacements in Table 1 are consistent with the conservative replacement rules implied by the tables of weights implemented automatically in BLASTp and Clustal Omega mentioned above, which are discussed at those sites. However, because the original intent was application to peptide design, Table 1 has a degree of asymmetry based on the author's experience in peptide design [4]: one is going from a natural protein state to a less natural one without evolution making compensatory changes in the rest of the protein or system. For example, empirical studies show that serine (S) can be replaced by alanine (A) or threonine (T), but it is frequently important that threonine be replaced by isoleucine (I) in order to retain the stability of a β-pleated sheet in which it occurs. Strictly speaking, these are just fairly crude rules-of-thumb: the best replacements depend on more specific circumstances and on detailed conformational and binding calculations.
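
The asymmetry described above can be made concrete by treating column 3 of Table 1 as a directed lookup. The following is only an illustrative sketch: the names `CONSERVATIVE`, `is_listed_replacement`, and `one_way_pairs` are hypothetical, the entries are copied as printed in Table 1, and a yes/no lookup of this kind is far cruder than the tables of weights used by BLASTp and Clustal Omega.

```python
# Column 3 of Table 1, copied as printed: original residue -> listed replacements.
# Names here are hypothetical and chosen only for this illustration.
_TABLE_1_COLUMN_3 = {
    "A": "AEST", "C": "STV", "D": "E", "E": "AD", "F": "MWY",
    "G": "NP", "H": "KR", "I": "LV", "K": "HR", "L": "IV",
    "M": "FWY", "N": "GDQ", "P": "G", "Q": "NE", "R": "HK",
    "S": "AT", "T": "AIS", "V": "AIL", "W": "FMY", "Y": "FMW",
}
CONSERVATIVE = {res: set(reps) for res, reps in _TABLE_1_COLUMN_3.items()}


def is_listed_replacement(original, replacement):
    """Return True if Table 1 lists `replacement` for `original`.

    The lookup is directional: T -> I being listed says nothing about I -> T.
    """
    return replacement.upper() in CONSERVATIVE.get(original.upper(), set())


def one_way_pairs():
    """Return sorted (original, replacement) pairs listed in one direction only."""
    return sorted(
        (a, b)
        for a, reps in CONSERVATIVE.items()
        for b in reps
        if a != b and a not in CONSERVATIVE[b]
    )


if __name__ == "__main__":
    print(is_listed_replacement("T", "I"))  # True: threonine -> isoleucine is listed
    print(is_listed_replacement("I", "T"))  # False: the reverse is not listed
    print(one_way_pairs())
```

Keeping the lookup directional, rather than symmetrizing it, preserves the peptide design intent of the table; a symmetric version would be closer to how substitution scores are used in sequence comparison.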
The assignments in Table 1 are not seen as controversial because, apart from the asymmetry, they relate to the “interchangeability” or “alternative” rules for amino acid residues proposed by many authors, which are intended as universal, i.e. intended to apply to all proteins. This is because they relate to the similarity of amino acid residues in terms of physicochemical, conformational, and biological properties across many sequences that are at least universal to, say, vertebrates. However, they are historically more directly empirically based on well-known studies of the probabilities of amino acid differences found by comparing amino acid residue sequences amongst related proteins, within each of a wide range of sets of different proteins, such that one is comparing sequences of hemoglobins, or of lysozymes, or of cytochromes C, and so on.

As is to some extent customary in the field, three letter codes (such as GLY for glycine) are used for the amino acids in the molecular graphics figures; these codes are fairly self-evident, at least in the direction of deducing the full name of the amino acid being represented. Use was also made of data and the associated graphics tools in the Protein Data Bank (PDB) at https://www.rcsb.org/ and in Japan at https://pdbj.org/, which was used for Fig. 1.

Energy calculations by the author's own KRUNCH and by a commercial Sculpt protein modeling package were used in Ref. [4] but were not required for the present study, with the exception that some calculations by these tools were used to obtain earlier unpublished results on sugar binding to amino acid residue sidechains. This provided a check on the preliminary sialic acid binding capability shown in column 4 of Table 1, used initially in the present paper. KRUNCH is a molecular mechanics modeling package that essentially functions like many standard molecular modeling packages.
There is the arguable exception that it gives much more attention than usual to novel algorithms for navigating through multiple energy minima and discovering new conformers, but that capability did not appear to be important in the present study. For the much greater part, however, these binding assessments were based on the amino acid residues observed by the author in sequences involved in sugar binding sites in proteins (found by visual examination of binding sites of entries in the PDB) and on similar qualitative observations by other authors. That is, they also reflect rather general opinion as to which amino acids are involved in sugar binding; in its most general formulation this intuitively comprises aromatic residues, and hydrogen bonding residues that interact with the sugar hydroxyl groups.

At the outset, as a starting point only, the preliminary sialic acid binding amino acid scores (SABS) in column 4 of Table 1 are really qualitative assignments, using 0 for not often present in sialic acid glycan binding sites and 1 for often present. However, tryptophan was assigned a double score of 2, reflecting its larger size and double ring. How reliable these assignments are in regard to sialic acid glycan binding is what is assessed on a more objective basis by the prediction method developed in this paper, including a degree of recalibration. The marginally modified parameters are also shown in the last column of Table 1 for convenience of comparison. While as a methodological strategy it was tempting to start from an alternative, more objective and established approach discussed in Results Section 4, or at least to use it as a starting point or as an important “gold standard” for comparison, that approach has substantially different aims.
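
To illustrate how per-residue scores of this kind might be applied to a sequence, the following minimal sketch sums the Table 1 values over a sliding window and reports the highest-scoring subsequence. It is an assumption-laden illustration, not the SABR-P algorithm itself: the window length, the use of a plain sum, the treatment of unknown residue codes as 0, the function names, and the toy input sequence are all hypothetical choices made here.

```python
# Per-residue values copied from Table 1.
# SABS: column 4 (preliminary scores); REFINED: column 5 (refined parameters).
SABS = {
    "A": 1, "C": 1, "D": 1, "E": 0, "F": 1, "G": 1, "H": 1, "I": 0, "K": 0, "L": 0,
    "M": 0, "N": 1, "P": 0, "Q": 0, "R": 0, "S": 1, "T": 1, "V": 0, "W": 2, "Y": 1,
}
REFINED = {
    "A": 1, "C": 1, "D": 1, "E": 0, "F": 2, "G": 1, "H": 2, "I": 0, "K": 0, "L": 0,
    "M": 0, "N": 1, "P": 0, "Q": 0, "R": 0, "S": 1, "T": 1, "V": 0, "W": 4, "Y": 2,
}


def window_scores(sequence, scores, window):
    """Sum per-residue scores over every window of the given length.

    Residue codes missing from `scores` count as 0 (an assumption of this sketch).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    per_residue = [scores.get(res, 0) for res in sequence.upper()]
    return [
        sum(per_residue[i:i + window])
        for i in range(len(per_residue) - window + 1)
    ]


def best_window(sequence, scores, window):
    """Return (start index, subsequence, total) of the first highest-scoring window."""
    totals = window_scores(sequence, scores, window)
    if not totals:
        return None  # sequence shorter than the window
    start = max(range(len(totals)), key=totals.__getitem__)
    return start, sequence[start:start + window], totals[start]


if __name__ == "__main__":
    toy = "KLEWYSTH"  # hypothetical toy sequence, not taken from any protein
    print(window_scores(toy, SABS, 4))    # [2, 3, 4, 5, 4]
    print(best_window(toy, REFINED, 4))   # (3, 'WYST', 8)
```

Running the same sequence through both dictionaries shows the effect of the recalibration: the aromatic residues and histidine carry more weight under the refined parameters, so windows rich in them stand out more strongly.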