1 Introduction and review
1.1 Background
As far as is known to history, no coronavirus [1] has disturbed humanity as much as the pandemic 2019 novel coronavirus [2], now known as SARS-CoV-2. In quick response to the determination of the final version of the RNA sequence of the Wuhan seafood market isolate, the present author examined functional sites of SARS-CoV-2 that are highly conserved across the coronaviruses [3,4], and which are thus less likely to exhibit escape mutations that can quickly undo the good work of the developers of vaccines and therapeutic agents. So far, the published papers have concerned the spike glycoprotein [3–5]. Exploration of known and newly found proteolytic cleavage sites in the spike glycoprotein of SARS-CoV-2 that are well conserved is a popular area of inquiry for SARS researchers, because such sites can interact with human host airway proteases that could be the target for protease inhibitors as potential drugs (e.g. Ref. [6]). The difficulty is that inhibiting the action of human proteins could have undesirable effects on the host [6], so parallel work on other kinds of functional sites in the coronavirus proteins is of great importance. The present paper explores another potential functional site in the spike protein, but this one is different because the site is not a proteolytic cleavage site, and it is not well conserved except, it is argued, for a characteristic composition of particular amino acid residues. Expressed another way, there can exist certain subsequences of a protein sequence that are well conserved, but only with respect to some pattern or property that is less obvious than the order of amino acids. Finding them (or, as is more correctly stated, predicting them) may therefore require a more subtle and, in the present case, novel bioinformatics tool, compared with the standard bioinformatics tools which were essential in the preceding papers [3–5].
Comparisons with other proteins as described below suggest that the subsequence of interest in this paper could have a crucial function, and a high degree of conservation is, even by itself, also a clue to a role important to the virus [5]. Hence such a site may represent a potential therapeutic target, perhaps as well as representing a synthetic vaccine target. However, until very recently, that crucial function did not even seem to be possessed by SARS-CoV and SARS-CoV-2, and the details have yet to be elucidated.
1.2 Binding sialic acid glycans - a traditional picture from the influenza virus
The particular virus function that is considered in the present paper is non-covalent binding to the sialic acid glycans, i.e. oligosaccharides or polysaccharides that contain sialic acid residues. They are sometimes called sialylated glycans. Interest in this binding arose as follows. It seems unlikely (although of course possible) that functions important for many different kinds of virus are of little importance to others, especially if they have a common lifestyle such as infection of the respiratory system or alimentary tract, typically reflected by common symptoms. If such functions are absent, it raises the question of how the virus copes. Though glycan binding of SARS-CoV and SARS-CoV-2 seems absent, diminished, or relatively neglected in the literature (see Section 1.5), many coronaviruses such as human coronavirus OC43 and bovine coronavirus appear to recognize sialic acid as a receptor. However, most biology students are more familiar with the hemagglutinin and neuraminidase of influenza, the H and N in, for example, H1N1 (the numbers such as 1 being based on immunological typing of these proteins), which bind to glycans (sugar chains, oligosaccharides or polysaccharides) at cell surfaces, notably those chemically bound to membrane proteins of host cells, which are hence called glycoproteins.
The surfaces of many animal and all vertebrate cells are dressed with a dense and complex array of glycans primarily containing sialic acids, attached to proteins and lipids at the cell surface. Such glycans also occur to a lesser extent in other organisms, ranging from fungi to yeasts and bacteria, and they are present at the surface of many viruses derived from animal hosts. Glycans can contain several kinds of sugar, including notably sialic acid, glucose, mannose, fucose, N-acetylglucosamine, and N-acetylgalactosamine. The standard emotive picture is that the influenza hemagglutinin binds the cell surface glycan molecules to first locate the lung cell surface, and that the neuraminidase has a later role, to enable many thousands (perhaps hundreds of thousands) of “baby viruses”, i.e. the newly formed virions, to cut their way out of the protective layer of glycans when emerging from the cell. More correctly stated, when the replicated viruses bud from the host cells, they remain attached to the host-cell surface by binding between hemagglutinin and the “tips” of the glycan chains, and the neuraminidase is used to sever that attachment by breaking certain links between the component sugar residues (see below). Recent work has supported this long-standing picture for influenza viruses, but it also answers affirmatively the question that must have arisen in many students' minds, i.e. whether the neuraminidase is also important for the virus to cut its way into the cell in the first place [7]. Any such description of entry does not, however, quite fit in with the above “more correctly stated” model for final release of the virion progeny, because it is not obvious why the incoming infecting virus should bind to the cell surface and then be made to disengage. Nonetheless, many viruses appear to need and do have an enzyme to achieve similar results, even if that enzyme is not of neuraminidase type and dissociates the virus from the cell in other ways: e.g. see Ref.
[8] and discussion below. It does seem reasonable that all such similar results must provide assistance in the mobility of virus particles through the respiratory tract mucus, but a fuller picture should perhaps include the notion of “decoy” glycan molecules [9] as discussed below.
1.3 The great diversity of sialic acid glycans
The seeming challenge for research into non-covalent binding to glycans is that they are diverse molecules, and that remains true even among the sialic acid glycans. At first consideration, this would seem to suggest that the prediction of glycan binding sites will be difficult. Critical features that distinguish glycans of interest in the present paper are the terminal sugar residues on the oligosaccharide chains at the distal ends (i.e. the “tips”), where sialic acids are important, and the amino acid residue attachment site (N- or O-) to the membrane protein at the base. The N (nitrogen) protein attachment point is asparagine and the O (oxygen) protein attachment point is usually serine or threonine, but sometimes tyrosine, or occasionally other amino acids which are hydroxylated as a post-translational modification. Covalent connections of that kind are those to the host cell surface proteins but, as indicated above, it is the non-covalent binding of a virus to them that is of interest here. Covalent binding is controlled by host enzymes and so would appear, again at first consideration, to be more specific. Enveloped viruses such as SARS-CoV-2 also have their own bound sialic acid glycans (the spike protein is usually referred to as the spike glycoprotein), but these are of less direct interest here although they can clearly influence binding of a virus to various receptors. Despite their diversity, and perhaps because of it (i.e.
because that diversity implies more information content), sialic acid glycans of host cells are key molecular recognition features not only for entry of viruses such as influenza, but also in embryonic development, neurodevelopment, reprogramming, and oncogenesis. Correctly speaking, even sialic acid itself is diverse. It is a generic term for a family of derivatives of the nine-carbon sugar neuraminic acid. The sialic acid family includes some 43 derivatives of neuraminic acid, but these acids rarely appear free in nature. Members include N-acetylneuraminic acid, 2-keto-3-deoxy-d-glycero-d-galacto-nonulosonic acid, 5,7-diamino-3,5,7,9-tetra-deoxy-d-glycero-d-galacto-nonulosonic acid, and 5,7-diamino-3,5,7,9-tetra-deoxy-l-glycero-l-manno-nonulosonic acid. If the term “sialic acid” is used unqualified, it usually refers to the representative member of this group, N-acetylneuraminic acid. The variability of glycans is not random but reflects their modes of synthesis. In eukaryotes generally, a typical N-linked glycan has an initial core that consists of 14 residues (3 glucose, 9 mannose, and 2 N-acetylglucosamine). This initial 14-sugar core unit is assembled in the cytoplasm and endoplasmic reticulum, and other sugars may be added later. The preassembled glycan is usually transferred by the glycosyltransferase oligosaccharyltransferase to a nascent peptide chain within the lumen of the endoplasmic reticulum. In contrast, O-linked glycans are assembled one sugar at a time from the outset on proteins in the Golgi apparatus. There are some specific features of medical interest as relevant to the human host of viruses (but by no means unique to humans). The lung epithelial glycans are typical in having sialic acids as the distal residues, and it is these that the influenza neuraminidase cleaves away. Most soluble secreted proteins are also similarly decorated with such glycans.
That includes the proteins that make up saliva and mucus in the airway, which are in general important for viral infection. N-glycans, O-glycans and glycosphingolipid glycans are all found in human lungs, and they include large and complex-type N-glycans with linear poly-N-acetyllactosamine [-3Galβ1–4GlcNAcβ1-]n extensions, which are predominantly terminated in α2,3-linked sialic acid. In contrast, the smaller N-glycans lack poly-N-acetyllactosamine but are enriched in α2,6-linked sialic acids. There are also large glycosphingolipid glycans, which also consist of poly-N-acetyllactosamine, usually terminating in α2,3-linked sialic acid. While it is commonly maintained that viruses such as influenza virus bind to the sialylated glycans, and this is assumed in the present paper, some care is required, because there are also non-sialylated glycans in human lungs on which viral binding could occur.
1.4 Most coronaviruses have “receptor destroying activity”
While it is binding of viruses to sialic acid glycans that is of interest here, it should be kept in mind that it is often associated with a catalytic activity in which the sialic acid glycan is the substrate. In influenza these two aspects are particularly distinct in being on separate surface proteins, the hemagglutinin and neuraminidase respectively, but that separation is not true of all viruses. Not surprisingly, as noted above, most coronaviruses of the Coronaviridae family also have capabilities to bind to sialic acid glycans, but they also have the ability to cleave the glycans, which is often described by authors as a “receptor destroying activity”. Like influenza C viruses, purified bovine coronavirus preparations have an esterase activity which inactivates O-acetylsialic acid-containing receptors on erythrocytes; diisopropyl fluorophosphate completely inhibited this receptor-destroying activity, suggesting that the viral enzyme is in this case a serine esterase [8].
This is believed to facilitate the spread of virus infection by removing receptor determinants from the surface of infected cells (see discussion below) and to prevent the formation of virus aggregates. Another coronavirus, porcine transmissible gastroenteritis virus (TGEV), recognizes N-glycolylneuraminic acid. It does not depend on the sialic acid binding activity for infection of cultured cells, but interaction with sialoglycoconjugates appears to help the virus to pass through the sialic acid-rich mucus layer in the epithelium of the small intestine. Hemagglutinin-esterases are a family of viral envelope glycoproteins that mediate reversible attachment to O-acetylated sialic acids. These too are said to be receptor-destroying, but the enzymic reaction is in this case not a cleavage in the sense used concerning neuraminidase, but rather a change in molecular recognition by removal of the acetyl group from the C9 position of the above acetylated neuraminic acid residues. The other and probably major reason for researchers thinking of these actions as “receptor destroying” is that the picture is a little more complex than just allowing entry and exit from the host cell. Because many viruses attain host cell specificity by being selective for particular types of sialic acid, these may occur as decoys to the virus on off-target host cells and on free molecules in the extracellular environment. To prevent irreversible binding to these decoys, many viruses including many coronaviruses have receptor-destroying enzymes that are therefore interesting targets for antiviral intervention, exemplified by the influenza A virus neuraminidase [9].
1.5 Where are these important functions in SARS-CoV and SARS-CoV-2?
Even though such glycan binding domains and enzymes such as neuraminidases are found in many coronaviruses, there seem to be no such enzymes in SARS-CoV and SARS-CoV-2.
Viruses of the lower respiratory tract, such as influenza virus, respiratory syncytial virus, and SARS-related coronaviruses, are generally considered as having key differences that require different therapeutics [10], even though relatively little is lost in considering already approved drugs for one of such viruses against the others (e.g. Ref. [11]). Typically, the apparent absence of glycan binding and enzymic sites in SARS-CoV and SARS-CoV-2 has been dismissed as due to the fact that the virus enters via ACE2, i.e. angiotensin converting enzyme type 2 (e.g. see Ref. [5] for discussion), not via a glycoprotein. This does not, however, escape from the intuitively important need for preliminary binding, cell entry and exit through the glycan layer, and probably the decoy-related function discussed above. There appears to be growing evidence of significant lectin-binding capability. Lectins are carbohydrate-binding proteins that are highly specific for sugar groups of other molecules. Activation of C-type lectin receptors and other similar receptors contributes to the pro-inflammatory response to many coronavirus infections. There are also studies over several years that locate glycan binding and even related catalytic activities in the spike glycoprotein. It has been noted that the E3 protein of bovine coronavirus is a receptor-destroying enzyme with acetylesterase activity [8], and the 3D structure of coronavirus hemagglutinin-esterase offered insight into coronavirus and influenza virus evolution, with implications for drug and antibody discovery [9]. The location of any sialic acid glycan binding region of SARS-CoV-2 is, a priori, unclear, although intuitively (a) it would likely be associated with the cap or knob at the outer end of the spike protein, or (b) it would at least not involve exactly the same domain as is required for other important functions.
Although throughout the coronaviruses various external proteins and domains can recognize either protein or sugar receptors or both, the majority of such studies like those above implicate the S1 region in their spike glycoproteins, but as discussed in the present paper, there are other potential sugar binding sites that are still within the spike protein. Overall, the SARS-CoV-2 spike glycoprotein has 1273 amino acid residues, and until early 2020 understanding of its structure was heavily based on the SARS-CoV spike glycoprotein (1255 amino acids), with 20–27% amino acid residue similarity among non-SARS coronaviruses. Most of the spike protein appears to be involved in the specific stages of cell entry. The spike glycoprotein of SARS-CoV and SARS-CoV-2 is translated as a large polypeptide that is later cleaved into S1 and S2 subunits. After binding to the main receptor, which is held to be primarily ACE2, the host proteases activate the virus by cleaving first at the S1/S2 boundary (i.e. the S1/S2 site) and then within S2, i.e. at the S2’ site. The spikes of similar coronaviruses have long been considered as being in two main states: (i) the pre-fusion form (the form of the mature virion) and (ii) the post-fusion form (the form after membrane fusion has been completed). More detailed studies have split the latter into a pre-hairpin intermediate state and a post-fusion hairpin state. Somewhat as in all viral Class I fusion proteins, the S2 protein contains two heptad repeat regions (HRs), of which one (HR2) is located close to the transmembrane anchor. Membrane fusion occurs when there is a conformational change in the HRs to form a fusion core. The HRs of the protein fold into a coiled-coil structure, known as the “fusogenic state”. As virus and target cell membranes fuse, the coiled-coil regions (the heptad repeats) become a trimer-of-hairpins structure.
The S2’ cleavage site appears particularly important in being well conserved [3–5], and proteolysis by cathepsin appears sufficient to expose the fusion peptide of S2 and activate fusion within the host cell endosome. In general, S2’ is now considered to be the key viral fusion peptide which is unmasked following S2 cleavage. Subsequently, S1 dissociates from S2, allowing S2 to transition to the post-fusion structure. The following locations in the sequences of amino acid residues apply specifically to SARS-CoV-2. These vary somewhat with author, and the following are used here. The signal peptide (SP) comprises residues 1–19. The extracellular part of the spike glycoprotein includes the N-terminal domain (S1-NTD), comprising residues 20–286, which is of particular interest here, and the host cell receptor binding domain (RBD), comprising residues 319–541. At the carboxyl terminus (C-terminus) are the transmembrane region (TM), comprising residues 1214–1236, and, on the inside of the lipid membrane, the cytoplasmic tail (CT), comprising residues 1237–1273. In summary the key regions are as follows.
Region    Residues
SP        1–19
S1-NTD    20–286
RBD       319–541
S2        686–1213
TM        1214–1236
CT        1237–1273
Fig. 1 shows the external part 20–1213 of the spike glycoprotein of SARS-CoV-2 in the closed state prior to ACE2 binding, with the S1-NTD domain (the “ears”, dark blue) of interest here, RBD (at tip; subdomains light blue, blue-green), and S2 (subdomains orange, green, yellow). The orange-white, green-white and yellow-white helical structures are the α-helices of the trimer that form the neck associated with S2, and the red-white helical structures are the start of the transmembrane α-helices TM.
Fig. 1 Spike protein of SARS-CoV-2 PDB entry 6VVX, showing S1-NTD domain (dark blue). See text in regard to the significance of the other colors.
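The residue ranges summarized in this section can be written as a small lookup table, which makes it easy to ask which region a given spike residue falls in. The following Python sketch is only illustrative: the names `SPIKE_REGIONS` and `region_of` are hypothetical, and the boundaries are simply those adopted in the text above (which, as noted, vary somewhat with author). Residues that fall between the listed regions are reported as unassigned rather than guessed.

```python
from typing import Optional

# Residue ranges (1-based, inclusive) as adopted in the text for the
# SARS-CoV-2 spike glycoprotein. Other authors use slightly different
# boundaries, so treat these as one convention, not a definitive annotation.
SPIKE_LENGTH = 1273
SPIKE_REGIONS = [
    ("SP", 1, 19),        # signal peptide
    ("S1-NTD", 20, 286),  # N-terminal domain of S1
    ("RBD", 319, 541),    # receptor binding domain
    ("S2", 686, 1213),    # S2 (external part)
    ("TM", 1214, 1236),   # transmembrane region
    ("CT", 1237, 1273),   # cytoplasmic tail
]


def region_of(residue: int) -> Optional[str]:
    """Return the name of the listed region containing `residue`.

    Returns None for residues that lie between the listed regions
    (e.g. 287-318), and raises ValueError for positions outside 1..1273.
    """
    if not 1 <= residue <= SPIKE_LENGTH:
        raise ValueError(f"residue {residue} is outside 1..{SPIKE_LENGTH}")
    for name, start, end in SPIKE_REGIONS:
        if start <= residue <= end:
            return name
    return None


if __name__ == "__main__":
    for pos in (1, 20, 286, 300, 500, 700, 1220, 1273):
        print(pos, region_of(pos))
```

Returning `None` for the gaps keeps the table faithful to the text, which does not assign residues 287–318 or 542–685 to any named region.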
1.6 The function of S1-NTD
In at least some coronaviruses, S1-NTD is known to be involved in binding host proteins or glycans, but coronaviruses show great diversity in their binding, which presumably underlies their ability to jump between very different host species. Compared with the current reasonably detailed knowledge of the remarkable mechanism of cell entry involving ACE2 and changes to the spike protein on cleavage, the specific function of S1-NTD of SARS-CoV-2 has not been elucidated (at least, not by the time of writing in April 2020). As noted above, S1 in SARS-CoV-2 is now well known to have a region which is the receptor binding domain for human ACE2 but also, significantly for what follows in the text below, SARS-like coronaviruses can bind CLEC4M/DC-SIGNR C-type lectin domains on host cells. See Ref. [12] for a review of the diverse receptor recognition mechanisms of coronaviruses up to 2015, which represented the body of understanding until the COVID-19 pandemic. Bovine coronavirus is an example of a coronavirus for which it seems clear that S1-NTD has an established glycan-binding function. Although the structure of a sugar-bound bovine CoV S1-NTD was not available, some conclusions could be reached by researchers using structure-guided mutagenesis and comparisons with different coronaviruses. As well as evidence in 2008 linking hemagglutinin-esterase to the S1 domain of at least some coronaviruses [9], Zhang and Yap [13] had reported in 2004 a rational 3D model for the S1 domain of the SARS-CoV spike protein by fold recognition and molecular modeling techniques, and there they noted a suggestive structural similarity between the S1 protein and influenza virus neuraminidase [14]. This opened up the possibility for those authors that existing anti-influenza virus inhibitors and anti-neuraminidase antibodies could be used as a starting point for designing anti-SARS drugs, vaccines and antibodies [14].
Based on such observations and discussion so far, it is therefore reasonable to propose that S1-NTD could be important in the binding of certain alternative host cell surface receptors, or perhaps of receptors which aid in targeting the virus to ACE2, and so might provide a helpful therapeutic target (as well as a candidate antigenic site for synthetic vaccine design). Nonetheless, such functions if present in SARS-CoV-2 could, a priori, reside in other domains at the virus surface. The challenge for research here is that the substantial knowledge concerning such matters in well-studied coronaviruses is not readily transferable. Variously throughout the coronaviruses, S1-NTD, CTD and S2 regions can recognize either protein or sugar receptors or both, and very similar coronavirus spike protein domains within the same genus may recognize different host cell receptors, while many very different coronaviruses may recognize the same host cell receptor. The studies mentioned above also suggested that at least some coronavirus S1-NTDs are evolutionarily related to human galectins, the term typically used for the lectins, as carbohydrate-binding proteins, that are specifically involved in inflammation, immune responses, cell migration, autophagy and signaling; however, the viral domains derived from them have diverged, with specificities for different sugar receptors [12]. Further review is given throughout Results Section 4 where appropriate. Another challenge discussed in Results Section 4 is that key regions for which an experimental 3D structure would help resolve the matter are disordered, and hence invisible, in currently available spike protein structures.