3. Results

3.1. Open Reading Frames and Sequence Isolates for CoV-2-Cons Sequence Creation

To create the CoV-2 consensus sequence, nucleotide sequences from 1731 SARS-CoV-2 genomes were aligned and a full-genome nucleotide consensus was generated. Twenty-three open reading frames (ORF) were then located in the alignment using the NC_045512.2 and the Finkel et al. [46] coordinates and translated to amino acids. Of the 23 ORF, 12 were canonical ORF as annotated in NC_045512.2 and 11 were in alternative reading frames described by Finkel et al. [46] (Table 1). In addition, the membrane glycoprotein (M) is completely embedded inside an extended ORF (exORFM) without any frameshifts and was not used for separate OLP set design.

3.2. Overlapping Peptides (OLP) Sets Design

To balance the number of peptides needed to cover the whole SARS-CoV-2 proteome, the cost of peptide synthesis and the ability to detect T cell responses with high sensitivity, three OLP sets were designed (Table 2). Shorter peptides (15-mers) with a longer sequence overlap between adjacent OLP (11 amino acids) offer high-resolution detection of responses and lower the risk of missing longer epitopes located in the OLP overlap. The trade-off is a higher number of peptides to synthesize and screen, in this case a set of 2821 OLP. When the overlap between OLP is reduced from 11 to 10 amino acids, the sensitivity of OLP testing is maintained, but some longer epitopes located in the overlap of two OLP may be missed. With this caveat in mind, an OLP set of 15-mers overlapping by 10 residues reduced the number of peptides needed by about 560 OLP (2262 OLP required in total). Similarly, longer peptides (18-mers) significantly reduce the number of OLP to be synthesized but tend to reduce in vitro sensitivity [55]. This approach, with an 11-amino-acid overlap, reduced the number of OLP needed to 1561.
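The relationship between peptide length, overlap and the number of OLP can be illustrated with a minimal sketch. It assumes a plain sliding-window tiling in which one final peptide is anchored at the C-terminus whenever the regular grid stops short of the end; this end-handling rule, the function names `make_olps` and `olp_count`, and the toy sequence are illustrative assumptions and may differ from the procedure actually used to build the sets in Table 2.

```python
from math import ceil


def make_olps(protein: str, length: int = 15, overlap: int = 11) -> list[str]:
    """Tile a protein sequence into overlapping peptides (OLP)."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and shorter than the peptide length")
    if not protein:
        return []
    if len(protein) <= length:
        # A short ORF is covered by a single peptide.
        return [protein]
    step = length - overlap  # offset between the starts of adjacent OLP
    starts = list(range(0, len(protein) - length + 1, step))
    # Assumption: if the regular grid stops short of the C-terminus, add one
    # final peptide anchored at the end so that every residue is covered.
    if starts[-1] + length < len(protein):
        starts.append(len(protein) - length)
    return [protein[s:s + length] for s in starts]


def olp_count(n_residues: int, length: int = 15, overlap: int = 11) -> int:
    """Closed-form number of OLP produced by make_olps for a protein of n_residues."""
    if n_residues == 0:
        return 0
    if n_residues <= length:
        return 1
    return ceil((n_residues - length) / (length - overlap)) + 1


if __name__ == "__main__":
    # Arbitrary 30-residue toy sequence, not a SARS-CoV-2 protein.
    toy = "ACDEFGHIKLMNPQRSTVWY" + "ACDEFGHIKL"
    for length, overlap in [(15, 11), (15, 10), (18, 11)]:
        peptides = make_olps(toy, length, overlap)
        print(f"{length}-mers overlapping by {overlap}: {len(peptides)} OLP")
```

Because the step between adjacent peptides is the length minus the overlap, moving from an 11- to a 10-residue overlap widens the step from 4 to 5 residues, which is why the set shrinks by roughly one fifth. Per-ORF effects (short ORF, end peptides, variant OLP) mean that the sketch is not expected to reproduce the exact totals reported above.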
The final choice of a specific design may also be driven by the assay system used for screening, an a priori focus on fewer or more viral proteins, and the cells and funding available to test immunogenicity. The three full OLP sets with their entropies are included in Table S1. Of note, the 15–11 OLP sequences were searched for homologies in the human genome to predict molecular mimicry events related to autoimmune processes. A blastp search (>8 consecutive identical amino acids per OLP) of the whole set against the human genome yielded no hits.

3.3. CoV-2-Cons Variability Analysis by Entropy Scores across the Full Genome

Mismatches between the sequence of in vitro antigen sets and the autologous virus in an infected individual can lead to missed responses. This has been described for highly variable pathogens such as HCV and HIV, for which a direct relationship between sequence entropy and the frequency of detected responses was shown [56,57]. Even though the reported variability of SARS-CoV-2 is substantially lower than that of HIV and HCV, sequence entropy was calculated at the amino acid level and as the mean OLP entropy in order to identify positions and OLP that may escape detection in T cell screening assays. By positional Shannon entropy, amino acid positions were generally highly conserved, although specific, more variable positions linked to specific amino acid variants were identified (Figure S1). The ORF1ab protein, including three of the most variable positions, is shown in Figure 1. In the CoV-2-cons 15–11 OLP set, mean OLP normalized entropies were low overall (range: 0.947–0.758) and comparable between OLP covering the canonical ORF (range: 0.947–0.879) and OLP matching the alternative frameshift ORF (range: 0.932–0.758).

3.4. Variant OLP Sequences to Cover CoV-2 Sequence Diversity

Based on the SARS-CoV-2 alignment used to design the consensus, only nine amino acid positions in the entire SARS-CoV-2 genome showed two amino acids each present in at least 25% of the sequences (Figure 2). Three of them were located in ORF1ab: one in the RNA polymerase and two in the helicase sub-proteins. None of them were located close enough to each other to affect the same OLP. Still, synthesizing only a single consensus peptide could miss T cell responses in individuals exposed to a virus carrying the subdominant sequence variant. To avoid missing such responses, a small number of additional OLP containing each of the variants were generated, creating an additional set of 31 variant OLP in the 15–11 OLP set (Table 2).

3.5. Conserved Protein Sequences Matching Other Coronavirus Family Members and Identification of Pan-Coronavirus Sequences

In addition to variable positions, we also evaluated the presence of protein regions conserved among coronavirus species, as these may support the design of immunogen sequences for pan-coronavirus vaccines. A total of 26 regions, ranging from 8 to 23 amino acids in length, were identified as being conserved in at least one of the three sequence alignments (Table 3). Fifteen fragments were identified in the pan-coronavirus alignment, 17 in the beta-coronavirus alignment and 12 in the human coronavirus alignment. Seven of them were detected in all three alignments. To identify potential T cell epitopes in these conserved regions, we searched the IEDB for described T cell epitopes similar (>90% sequence identity) to the conserved peptides present in the CoV-2 consensus sequence. Interestingly, the majority of the conserved regions contained several matches, most of which were described epitopes derived from SARS-CoV. In total, 125 similar epitopes were identified, from all but two of the conserved regions (Table 3).
The similar epitopes were derived from the following organisms: SARS-CoV (71), Human coronavirus 229E (1), Alphacoronavirus 1 (1), unknown origin (3) and Homo sapiens (47). Interestingly, 24 out of 26 fragments contained described SARS-CoV T cell epitopes, indicating that these regions are immunogenic in humans and reinforcing the idea that some degree of cross-reactivity among coronaviruses can be expected [11,58]. In addition, the majority of the human epitopes (40 of 47) clustered around a single region conserved in the beta-coronavirus alignment (QGPPGTGKSH). Several conserved peptides that could potentially contain epitopes cross-reactive among different coronavirus species have thus been identified. These conserved peptides can provide valuable information for understanding whether the immune response to SARS-CoV-2 is affected by previous infection with other coronaviruses and for pan-coronavirus vaccine design (Figure S2).
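As an illustration of how positional Shannon entropy (Section 3.3) relates to the conserved regions discussed here, the following minimal sketch computes the entropy of each alignment column and reports runs of fully conserved columns. Several points are assumptions made for illustration only: conservation is taken to mean complete identity across all aligned sequences (entropy of zero, no gaps), the minimum run length of 8 simply mirrors the shortest region reported above, entropy is given in bits without the normalization used for the values in Section 3.3, and the function names and the toy alignment (made-up flanks around the QGPPGTGKSH motif) are hypothetical.

```python
from collections import Counter
from math import log2


def column_entropy(column: str) -> float:
    """Shannon entropy in bits of one alignment column (gaps count as a symbol)."""
    counts = Counter(column)
    total = len(column)
    # H = sum over symbols of p * log2(1 / p), with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


def conserved_runs(alignment: list[str], min_len: int = 8) -> list[tuple[int, str]]:
    """Return (0-based start, sequence) for runs of >= min_len conserved columns.

    Assumption: a column is conserved when all sequences carry the same
    residue (entropy 0) and that residue is not a gap.
    """
    if not alignment:
        return []
    width = len(alignment[0])
    if any(len(seq) != width for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    runs = []
    start = None
    # Iterate one position past the end so a run reaching the last column is closed.
    for i in range(width + 1):
        conserved = False
        if i < width:
            column = "".join(seq[i] for seq in alignment)
            conserved = column_entropy(column) == 0.0 and column[0] != "-"
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, alignment[0][start:i]))
            start = None
    return runs


if __name__ == "__main__":
    # Toy alignment: the flanking residues are invented for this example.
    toy_alignment = [
        "MAQGPPGTGKSHLT",
        "MSQGPPGTGKSHFA",
        "LAQGPPGTGKSHLA",
    ]
    for start, fragment in conserved_runs(toy_alignment, min_len=8):
        print(start, fragment)
```

A stricter or looser definition of conservation (for example, allowing a small non-zero entropy per column) would only require changing the `conserved` test inside the loop.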