RESULTS
To explore the potential for a given HLA allele to produce an antiviral response, we assessed the HLA binding affinity of all possible 8-mers to 12-mers from the SARS-CoV-2 proteome (n = 48,395 unique peptides). We then removed from further consideration 16,138 peptides that were not predicted to enter the MHC class I antigen processing pathway via proteasomal cleavage. For the remaining 32,257 peptides, we repeated binding affinity predictions for a total of 145 different HLA types, and we show here the SARS-CoV-2-specific distribution of per-allele proteome presentation (predicted binding affinity threshold of <500 nM) (Fig. 1; see also Table S1 in the supplemental material). Importantly, we note that the putative capacity for SARS-CoV-2 antigen presentation is unrelated to the HLA allelic frequency in the population (Fig. 1). We identify HLA-B*46:01 as the HLA allele with the fewest predicted binding peptides for SARS-CoV-2. We performed the same analyses for the closely related SARS-CoV proteome (see Fig. S1 in the supplemental material) and similarly note that HLA-B*46:01 was predicted to present the fewest SARS-CoV peptides, in keeping with previous clinical data associating this allele with severe disease (49).
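The peptide enumeration and per-allele thresholding described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names and the input dictionary of predicted affinities are hypothetical, and the affinity values are assumed to come from an external binding predictor. The sketch also assumes that the <500 nM count includes the <50 nM binders.

```python
from collections import defaultdict


def enumerate_kmers(protein_seqs, k_min=8, k_max=12):
    """Return the set of unique peptides of length k_min..k_max (inclusive)."""
    peptides = set()
    for seq in protein_seqs:
        for k in range(k_min, k_max + 1):
            for i in range(len(seq) - k + 1):
                peptides.add(seq[i:i + k])
    return peptides


def count_binders(affinities, loose_nm=500.0, tight_nm=50.0):
    """Count predicted binders per allele.

    affinities: dict mapping (allele, peptide) -> predicted affinity in nM
    (hypothetical input; produced by a separate prediction tool).
    Assumption: 'below_500' includes the peptides counted in 'below_50'.
    """
    counts = defaultdict(lambda: {"below_50": 0, "below_500": 0})
    for (allele, _peptide), nm in affinities.items():
        entry = counts[allele]  # ensures alleles with zero binders still appear
        if nm < tight_nm:
            entry["below_50"] += 1
        if nm < loose_nm:
            entry["below_500"] += 1
    return dict(counts)


if __name__ == "__main__":
    toy_peptides = enumerate_kmers(["ABCDEFGHIJ"])  # toy sequence, not viral
    print(len(toy_peptides))
    toy_affinities = {("A", "p1"): 30.0, ("A", "p2"): 400.0, ("B", "p1"): 900.0}
    print(count_binders(toy_affinities))
```

Sorting the resulting dictionary by its `below_500` values in descending order would give an allele ordering analogous to that shown in Fig. 1.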
FIG 1 Distribution of HLA allelic presentation of 8- to 12-mers from the SARS-CoV-2 proteome. At right, the number of peptides (see Table S1) that putatively bind to each of 145 HLA alleles is shown as a series of horizontal bars, with dark and light shading indicating the number of tightly (<50 nM) and loosely (<500 nM) binding peptides, respectively, and with green, orange, and purple representing HLA-A, -B, and -C alleles, respectively. Alleles are sorted in descending order based on the number of peptides that they bind (<500 nM). The corresponding estimated allelic frequency in the global population is also shown (left), with the length of each horizontal bar indicating absolute frequency in the population.
To assess the potential for cross-protective immunity conferred by prior exposure to common human coronaviruses (i.e., HKU1, OC43, NL63, and 229E), we next sought to characterize the conservation of the SARS-CoV-2 proteome across diverse coronavirus subgenera to identify highly conserved linear epitopes. After aligning reference proteome sequence data for 5 essential viral components (ORF1ab, S, E, M, and N proteins) across 34 distinct alpha- and betacoronaviruses, including all known human coronaviruses, we identified 48 highly conserved amino acid sequence spans (see Data File S1 in the supplemental material). Acknowledging the challenges to inferring cross-protective immunity among closely related peptides, we confined our attention exclusively to identical peptide matches. Among the conserved sequences, 44 SARS-CoV-2 sequences would each be anticipated to generate at least one 8- to 12-mer linear peptide epitope also present within at least one other common human coronavirus (Fig. 2; see also Table S2). In total, 564 such 8- to 12-mer peptides were found to share 100% identity with corresponding OC43, HKU1, NL63, and 229E sequences (467, 460, 179, and 157 peptides, respectively) (Table S3).
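The restriction to identical peptide matches can be illustrated with a short sketch. Note that this substitutes a plain set intersection of 8- to 12-mers for the alignment-based identification of conserved spans described above, and that the function names and the flanking residues in the toy sequences are made up for illustration; only the conserved peptide PRWYFYYLGTGP is taken from Fig. 2.

```python
def kmers(seq, k_min=8, k_max=12):
    """Return all unique substrings of seq with length k_min..k_max."""
    return {
        seq[i:i + k]
        for k in range(k_min, k_max + 1)
        for i in range(len(seq) - k + 1)
    }


def shared_peptides(reference_seq, other_seqs):
    """Map each virus name to the peptides it shares identically with the reference.

    other_seqs: dict mapping a (hypothetical) virus label -> protein sequence.
    """
    ref = kmers(reference_seq)
    return {name: ref & kmers(seq) for name, seq in other_seqs.items()}


if __name__ == "__main__":
    # Toy sequences: the flanking residues are invented; only the core
    # 12-mer PRWYFYYLGTGP corresponds to a peptide named in Fig. 2.
    reference = "AAAPRWYFYYLGTGPKKK"
    others = {"toy_virus": "GGGPRWYFYYLGTGPTTT"}
    shared = shared_peptides(reference, others)
    print(sorted(shared["toy_virus"], key=len))
```

Because a single conserved span yields several overlapping 8- to 12-mers, the number of shared peptides can greatly exceed the number of conserved spans.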
FIG 2 Amino acid sequence conservation of four linear peptide example sequences from three human coronavirus proteins. Protein sequence alignments are shown for nucleocapsid (N), membrane (M), and ORF1ab polyprotein (helicase) across all five known human betacoronaviruses (SARS-CoV-2, SARS-CoV, HKU1, OC43, and MERS-CoV) and two known human alphacoronaviruses (229E and NL63). Each row in the three depicted sequence alignments corresponds to the protein sequence from the indicated coronavirus, with the starting coordinate of the viral protein sequence shown at left and position coordinates of the overall alignment displayed above. Blue shading indicates the extent of sequence identity, with the darkest blue shading indicating a 100% match for that amino acid across all sequences. The four red-highlighted sequences correspond to highly conserved peptides ≥8 amino acids in length (PRWYFYYLGTGP, WSFNPETN, QPPGTGKSH, and VYTACSHAAVDALCEKA; see Table S2).
For the subset of these potentially cross-protective peptides that are anticipated to be generated via the MHC class I antigen processing pathway, we performed binding affinity predictions across 145 different HLA-A, -B, and -C alleles (see Data File S3). As described above, we demonstrated the SARS-CoV-2-specific distribution of per-allele presentation for these conserved peptides. We found that alleles HLA-A*02:02, HLA-B*15:03, and HLA-C*12:03 were the top presenters of conserved peptides. Conversely, we note that 56 different HLA alleles were not predicted to bind any of the conserved SARS-CoV-2 peptides at the <500 nM threshold, suggesting a concomitant lack of potential for cross-protective immunity from other human coronaviruses. We note, in particular, that HLA-B*46:01 was among these alleles. We note also that the putative capacity for conserved peptide presentation is unrelated to the HLA allelic frequency in the population (Fig. 3).
Moreover, we see no appreciable global correlation between conservation of the SARS-CoV-2 proteome and its predicted MHC binding affinity, suggesting a lack of selective pressure for or against the capacity to present coronavirus epitopes (P = 0.27 [Fisher’s exact test]; see Fig. S2).
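For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2×2 contingency table can be computed directly from the hypergeometric distribution, as in the sketch below. The layout of the table (for example, conserved versus nonconserved peptides against binders versus nonbinders) is an assumption made for illustration; the construction of the table actually tested is not specified here, and the counts in the example are arbitrary.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-7))


if __name__ == "__main__":
    # Arbitrary illustrative counts (hypothetical, not study data).
    print(fisher_exact_two_sided(3, 1, 1, 3))
```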
FIG 3 Distribution of HLA allelic presentations of highly conserved human coronavirus peptides with potential to elicit cross-protective immunity to COVID-19. At right, the number of conserved peptides (see Table S3) that putatively bind to a subset of 89 HLA alleles is shown as a series of horizontal bars, with dark and light shading indicating the number of tightly (<50 nM) and loosely (<500 nM) binding peptides, respectively, and with green, orange, and purple representing HLA-A, -B, and -C alleles, respectively. Alleles are sorted in descending order based on the number of peptides they are anticipated to present (binding affinity, <500 nM). The corresponding allelic frequency in the global population is also shown (left), with the length of each horizontal bar indicating absolute frequency in the population.
We were further interested in whether certain regions of the SARS-CoV-2 proteome showed differential presentation by the MHC class I pathway. Accordingly, we surveyed the distribution of antigen presentation capacity across the entire proteome, highlighting its most conserved peptide sequences (Fig. 4). Across the entire proteome, HLA-A and HLA-C alleles exhibited the largest and smallest relative capacities, respectively, to present SARS-CoV-2 antigens. However, each of the three major class I genes exhibited very similar patterns of peptide presentation across the proteome (Fig. S3). We additionally note that peptide presentation appears to be independent of the estimated time of peptide production during the viral life cycle, with indistinguishable levels of peptide presentation of both early and late SARS-CoV-2 peptides (Fig. S4).
FIG 4 Distribution of allelic presentation of conserved 8- to 12-mers across the entire SARS-CoV-2 proteome for all HLA alleles and individually for HLA-A, HLA-B, and HLA-C (first, second, third, and fourth plots from top, respectively) with dark and light shading indicating the number of tightly (<50 nM) and loosely (<500 nM) binding peptides, respectively. Positions are derived from a concatenation of coding sequences (CDSs) as indicated in the bottom panel. Tightly binding peptides are confined to ORF1ab. The sequence begins with only the last 12 amino acids of ORF1a because all but the last four amino acids of ORF1a are contained in ORF1ab, and we considered binding peptides up to 12 amino acids (AA) in length.
Given the global nature of the current COVID-19 pandemic, we sought to describe population-level distributions of the HLA alleles most (and least) capable of generating a repertoire of SARS-CoV-2 epitopes in support of a T-cell-based immune response. While we present global maps of individual HLA allele frequencies for the full set of 145 different alleles studied here (Data File S2), we specifically highlight the global distributions of the three best-presenting (A*02:02, B*15:03, and C*12:03) and three of the worst-presenting (A*25:01, B*46:01, and C*01:02) HLA-A, -B, and -C alleles (Fig. 5). Note that all allelic frequencies are aggregated by country but that they implicitly reflect the distribution of HLA data available on the Allele Frequency Net Database (52).
FIG 5 Global HLA allele frequency distribution heat maps for six HLA-A, -B, and -C alleles. The leftmost panels show the global allele frequency distributions by country for three representative alleles (HLA-A*02:02, HLA-B*15:03, and HLA-C*12:03) with the predicted capacities to present the greatest repertoire of epitopes from the SARS-CoV-2 proteome (21.1%, 19.1%, and 7.9% of presentable epitopes, respectively). The rightmost panels show the global allele frequency distributions by country for three representative alleles (HLA-A*25:01, HLA-B*46:01, and HLA-C*01:02) with the lowest predicted levels of epitope presentation from the SARS-CoV-2 proteome (0.2%, 0%, and 0% of presentable epitopes, respectively). Heat map coloring corresponds to the individual HLA allele frequency within each country, ranging from lowest (white/yellow) to highest (red) frequency as indicated in the legend below each map.
Finally, we acknowledge that nearly all individuals have two HLA-A/B/C haplotypes comprising as few as three but as many as six distinct alleles, potentially buffering against the lack of presentation from a single poorly presenting allele. We sought to determine whether allele-specific variability in SARS-CoV-2 presentation extends to full HLA haplotypes and to whole individual HLA genotypes. For six representative alleles with the highest (HLA-A*02:02, HLA-B*15:03, and HLA-C*12:03) and lowest (HLA-A*25:01, HLA-B*46:01, and HLA-C*01:02) predicted capacity for SARS-CoV-2 epitope presentation, these differences remain significant at the haplotype level, albeit with wide variability in presentation among different haplotypes (Fig. 6). Haplotype-level data for all 145 alleles are included in Fig. S5 and Data File S2. We then identified 3,382 individuals with full HLA genotype data and noted wide variability in their capacity to present peptides from the SARS-CoV-2 proteome, albeit with a small minority of individuals at either extreme (Fig. S6).
FIG 6 Distributions of SARS-CoV-2 peptide presentation across HLA haplotypes. The leftmost panels show the distributions of SARS-CoV-2 peptide presentation capacity for haplotypes containing one of three representative HLA alleles (HLA-A*02:02, HLA-B*15:03, and HLA-C*12:03) with the greatest predicted repertoire of epitopes from the SARS-CoV-2 proteome. The rightmost panels show the distributions of SARS-CoV-2 peptide presentation capacity for haplotypes containing one of three representative alleles (HLA-A*25:01, HLA-B*46:01, and HLA-C*01:02) with the lowest predicted levels of epitope presentation from the SARS-CoV-2 proteome. Black and gray bars represent full and partial haplotypes, respectively. Blue and red dashed lines represent the percentages of presented SARS-CoV-2 peptides for the indicated allele itself (blue) and its global population frequency weighted average presentation across its observed haplotypes (red).
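One way to express presentation capacity at the haplotype or genotype level, together with a frequency-weighted average across haplotypes such as that drawn in red in Fig. 6, is sketched below. This is an illustration under stated assumptions rather than the procedure used in the study: it assumes that a haplotype or genotype presents the union of the peptides bound by its constituent alleles, and the function names, allele labels, and frequencies are hypothetical.

```python
def presentation_fraction(alleles, binders_by_allele, n_peptides):
    """Fraction of peptides presented by at least one allele in the set.

    Assumption: presentation by a haplotype/genotype is the union of the
    peptides bound (<500 nM) by each constituent allele.
    """
    presented = set()
    for allele in set(alleles):  # homozygous alleles are counted once
        presented |= binders_by_allele.get(allele, set())
    return len(presented) / n_peptides


def weighted_mean_presentation(haplotypes, binders_by_allele, n_peptides):
    """Frequency-weighted mean presentation across haplotypes.

    haplotypes: list of (tuple_of_alleles, population_frequency) pairs.
    """
    total_freq = sum(freq for _, freq in haplotypes)
    weighted = sum(
        freq * presentation_fraction(alleles, binders_by_allele, n_peptides)
        for alleles, freq in haplotypes
    )
    return weighted / total_freq


if __name__ == "__main__":
    # Toy data: allele labels, peptide labels, and frequencies are invented.
    binders = {"A": {"p1", "p2"}, "B": {"p2", "p3"}, "C": set()}
    haps = [(("A", "B"), 0.02), (("A", "C"), 0.06)]
    print(presentation_fraction(("A", "B", "C"), binders, 10))
    print(weighted_mean_presentation(haps, binders, 10))
```

Under the union assumption, a poorly presenting allele lowers a haplotype's capacity only to the extent that its partner alleles fail to cover the same peptides, which is the buffering effect discussed above.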