Results

We identified a number of pathogens that commonly cause infections in the European population (Supplementary Tables 2–5). The list included 32 viruses, 11 fungi, 26 bacteria and 2 parasites. We obtained all protein sequences for these pathogens from NCBI, compared them to predicted HLA-I and HLA-II binding epitopes in SARS-CoV-2, and ranked the pathogens by a relevance score (see Methods; Fig. 1a) derived from short exact sequence matches of length k ("k-mers"). We limited the analysis to the five most common HLA alleles in the European population as reported in the Allele Frequency Net Database36 (Supplementary Tables 6–7).

Figure 1 Analysis approach. (a) k-mers for k ∈ {6, 7, 8} were extracted from the proteins of relevant human pathogens and compared to epitopes predicted in the SARS-CoV-2 proteins. Epitopes were predicted with netMHCpan and netMHCIIpan. Pathogens were ranked based on exact k-mer hits to the epitopes. (b) Principle of edit distance determination of epitopes. The SARS-CoV-2 epitope MKFSDRPFMLH has an edit distance of 1 to the putative pathogen epitope MKFSDRPFML_ because of the missing histidine at the C-terminus. The distance between MKFSDRPFMLH and MKFSDAPFMLHR is 2 because of the R6A exchange and the additional arginine at the C-terminus. The K2L, D5I, and F8S exchanges give rise to an edit distance of 3 between MKFSDRPFMLH and MLFSIRPSMLH.

We observed different relevance scores for the same pathogen depending on the k-mer length (Supplementary Figs. 1 and 2). For k = 6, all viruses were in the upper half of the relevance-score ranking, whereas for k = 8 the viruses were in the lower half. This was independent of the HLA class of the epitopes (Supplementary Figs. 1 and 2).

Figure 2 Some viruses and bacteria have peptides (k-mers) matching SARS-CoV-2 epitopes. The top 10 pathogen relevance scores were averaged over k = 6, 7, 8 amino acids for (a) HLA-I, and (b) HLA-II.
Pathogen relevance scores for each pathogen and k are presented in Supplementary Figs. 1 and 2.

Given the overall similarity between coronaviruses, the endemic coronaviruses are expected to have the highest relevance scores. Given the compact genome size and highly optimized proteins37 of viruses, it seems likely that only short stretches will have an exact match, even when using a reduced alphabet. Conversely, more complex organisms have larger genomes, so the probability of matching longer stretches increases. We therefore focus on pathogens with short (k = 6) matches. Within the top 10 ranking pathogens for HLA-I binding SARS-CoV-2 epitopes based on k = 6, we find the two fungi Candida tropicalis and Cryptococcus neoformans, and the parasite Trichomonas vaginalis. Apart from the endemic coronaviruses HKU1, OC43, 229E, and NL63, we find the double-stranded DNA virus Human alphaherpesvirus 3 (varicella-zoster virus, VZV), the double-stranded RNA virus Rotavirus A (RV), and the single-stranded negative-sense RNA virus Influenza B (Fig. 2a). When scoring the relevance based on matches to predicted HLA-II SARS-CoV-2 epitopes, we find a similar result, the only difference being the appearance of Human gammaherpesvirus 4 (Epstein-Barr virus, EBV) in place of Trichomonas vaginalis (Fig. 2b).

Viral genes can be expressed at different times during an infection38. However, during the multiplication phase of the virus, the viral products are expressed in excess. To avoid selecting pathogens based on highly similar, but rarely or poorly expressed proteins, we therefore focus on the viruses in the following analysis.

Using netMHCpan and netMHCIIpan we predicted the epitopes in the reference sequences for the coronaviruses OC43, HKU1, 229E, and NL63, as well as Influenza B, EBV, RV, and VZV. To assess the similarity to predicted SARS-CoV-2 epitopes, we calculated the edit distance from each SARS-CoV-2 epitope to each of the predicted epitopes in the selected pathogens (Fig. 1b).
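The epitope comparison can be illustrated with a minimal Python sketch. It implements the standard Levenshtein recurrence and then, for each query epitope, records the smallest distance to any epitope of one pathogen. The function names `edit_distance` and `closest_match_counts` are hypothetical, and the tallying step is an assumption about how closest matches might be counted; it is not taken from the authors' pipeline, which additionally performs the comparison per HLA.

```python
from collections import Counter


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-residue
    insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, residue_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string costs i deletions
        for j, residue_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (residue_a != residue_b)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def closest_match_counts(query_epitopes, pathogen_epitopes, max_distance=3):
    """Tally query epitopes by their smallest edit distance to any pathogen
    epitope (assumed counting scheme; the names are illustrative only)."""
    counts = Counter()
    for query in query_epitopes:
        best = min(
            (edit_distance(query, candidate) for candidate in pathogen_epitopes),
            default=None,  # no pathogen epitopes: nothing to match
        )
        if best is not None and best <= max_distance:
            counts[best] += 1
    return counts


if __name__ == "__main__":
    # The three example pairs from Fig. 1b.
    reference = "MKFSDRPFMLH"
    for candidate in ("MKFSDRPFML", "MKFSDAPFMLHR", "MLFSIRPSMLH"):
        print(candidate, edit_distance(reference, candidate))
```

Run on the three example pairs from Fig. 1b, the sketch gives distances of 1, 2, and 3, in line with the figure. Because insertions and deletions are allowed, epitopes of different lengths can be compared directly.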
The edit distance, also known as the Levenshtein distance, counts the additions, deletions, and substitutions of amino acids needed to transform one amino acid sequence into another. We opted for this distance metric to allow for differences in epitope length. The edit distance was calculated per analyzed HLA.

We found that the beta-coronaviruses OC43 and HKU1 had the highest number of epitopes identical to the predicted SARS-CoV-2 epitopes. For HLA-I bound epitopes we found 211 and 195 identical epitopes in HKU1 and OC43, respectively (Fig. 3a). For HLA-II bound epitopes we found 493 and 464 identical epitopes in OC43 and HKU1, respectively (Fig. 3b). When the similarity threshold was relaxed to an edit distance of 1 or 2 we found a similar pattern (Fig. 3). Interestingly, if we accept an edit distance of 3 we find the highest number of similar SARS-CoV-2 HLA-I epitopes in VZV, followed by OC43 and HKU1, with 1292, 1189, and 1163 epitopes, respectively (Fig. 3a). This was not reflected in HLA-II bound epitopes. The strong occurrence of SARS-CoV-2-similar VZV epitopes was mainly driven by epitopes on HLA-B and HLA-C, and to a minor degree on HLA-A (Supplementary Figs. 3–5). In agreement with previous reports39, we found a large number of identical or similar epitopes between SARS-CoV-1 and SARS-CoV-2 (Supplementary Fig. 6).

Figure 3 OC43 and HKU1 epitopes can be presented on many HLAs. Epitopes were predicted using netMHCpan and netMHCIIpan in selected pathogens. The similarity between each SARS-CoV-2 epitope and pathogen epitope was calculated using the edit distance, and the number of shortest matches was enumerated. (a) Total number of HLA-I epitopes with an edit distance between 0 and 3. (b) Total number of HLA-II epitopes with an edit distance between 0 and 3. The pathogens are ordered per plot from highest to lowest, while the fill color is preserved.

We next asked which coronavirus proteins might be the most likely inducers of cross-reactivity.
Arguably, the epitopes most likely to elicit a cross-reactive response are those that are found in many coronaviruses and are presented by many HLAs. The latter constraint is important, since the current studies demonstrating cross-reactivity do not distinguish HLAs6,16,18–23. We first enumerated the SARS-CoV-2 epitopes for each of the viral proteins (Supplementary Fig. 7). Given that the OC43 and HKU1 coronavirus strains appear to be the most likely pathogens to create SARS-CoV-2 reactive T cells, we focused the analysis on these strains. We found that only epitopes from the SARS-CoV-2 polyprotein pp1ab have identical amino acid sequences to epitopes identified in both OC43 and HKU1. Interestingly, the epitopes from the S-protein differ at three or more positions from the amino acid sequences of the predicted OC43 and HKU1 epitopes. Since not only the number of epitopes but also the probability of an epitope being presented matters, we also enumerated the epitope-presenting HLAs (Supplementary Fig. 8). The highest possible number of HLAs is 15 for HLA-I and 80 for HLA-II (50 DPA1-DPB1 combinations, 25 DQA1-DQB1 combinations, and 5 DRB1). Again, we found the highest number of HLAs and the highest similarity in epitopes from the SARS-CoV-2 polyprotein pp1ab.

The polyprotein pp1ab is 7096 amino acids long, and 15 nonstructural proteins are created through autoproteolytic cleavage12. Comparison of the pp1ab amino acid sequences from SARS-CoV-2, HKU1, and OC43 revealed that the RNA-dependent RNA polymerase (RdRp), the helicase (Hel), the 3′–5′ exoribonuclease (ExoN), and the 2′-O-ribose methyltransferase generally have the highest similarity (Fig. 4a; upper panel). It is also in these regions that the near-identical epitopes, as determined by an edit distance of 1 or less, are found (Fig. 4a; lower panel). An experimental study evaluated a set of 117 epitopes from HKU1 and OC43 for their cross-reactive potential24.
Two epitopes from HKU1 and OC43 were identified as capable of raising a T cell response, and were nearly identical to two SARS-CoV-2 epitopes, marked with ‘M’ in Fig. 4a. The S-protein epitopes previously found to expand public T cell clonotypes17 were not found to be shared with HKU1 or OC43.

Figure 4 SARS-CoV-2 epitopes from conserved regions are nearly identical to OC43 and HKU1 epitopes. (a) Upper panel: Protein sequences for the polyprotein pp1ab from SARS-CoV-2, OC43, and HKU1 were aligned and the similarity between OC43 and HKU1 amino acids to SARS-CoV-2 was calculated. The individual proteins are marked above the similarity graph. Lower panel: The number of HLA alleles that present SARS-CoV-2 pp1ab epitopes with an edit distance of 1 or less to epitopes predicted in both OC43 and HKU1. Two epitopes previously identified as cross-reactive24 are marked by ‘M’. (b) Upper panel: Protein sequences for the S-protein from SARS-CoV-2, OC43, and HKU1 were aligned and the similarity between OC43 and HKU1 amino acids to SARS-CoV-2 was calculated. The individual domains are marked above the similarity graph, where RBD denotes the receptor binding domain. Lower panel: The number of HLA alleles presenting SARS-CoV-2 S-protein epitopes with an edit distance of 3 or less to epitopes predicted in both OC43 and HKU1. (c) Upper panel: Protein sequences for the M-protein from SARS-CoV-2, OC43, and HKU1 were aligned and the similarity between OC43 and HKU1 amino acids to SARS-CoV-2 was calculated. The individual regions are marked above the similarity graph: ‘VS’ indicates the portion of the M-protein on the virion surface, ‘Tr’ the transmembrane region, and ‘IV’ the intravirion portion. Lower panel: The number of HLA alleles presenting SARS-CoV-2 M-protein epitopes with an edit distance of 3 or less to epitopes predicted in both OC43 and HKU1.
(d) Upper panel: Protein sequences for the N-protein from SARS-CoV-2, OC43, and HKU1 were aligned and the similarity between OC43 and HKU1 amino acids to SARS-CoV-2 was calculated. The individual domains of the N-protein are marked above the similarity graph. Lower panel: The number of HLA alleles presenting SARS-CoV-2 N-protein epitopes with an edit distance of 3 or less to epitopes predicted in both OC43 and HKU1. The height and color of the similarity graphs designate similarity such that white bars indicate no similarity, light blue bars with half height indicate 50% similarity and dark blue bars with full height indicate 100% similarity. The width of the bar in the lower panels indicates the length of the epitopes. Darker regions indicate overlapping epitopes.

T cell immunity to the structural proteins has received substantial attention6,16,18–24. The S2 portion of the S-protein, which constitutes the stalk of the host-interacting receptor12, shares the highest similarity between SARS-CoV-2, HKU1, and OC43 (Fig. 4b; upper panel). Interestingly, we found some similar HLA-I epitopes (edit distance of 3) but only one HLA-II epitope (Fig. 4b; lower panel). The majority of the HLA-I epitopes and the single HLA-II epitope fall within the relatively conserved portion of S2. The similarity between SARS-CoV-2, HKU1, and OC43 for the M- and N-proteins is most prominent around position 110 in both proteins (Fig. 4c,d; upper panel). This corresponds to the N-terminus of the long intravirion tail of the M-protein, and a small part of the RNA-binding domain of the N-protein. Similar to the observation for the S-protein, we find some similar HLA-I epitopes (edit distance of 3) but HLA-II epitopes at only a single position (Fig. 4c,d; lower panel). These epitopes also appear in the conserved regions.
Collectively, these data indicate that near identical epitopes derived from endemic coronaviruses cannot explain the observed cross-reactivity to structural SARS-CoV-2 proteins.