PMC:7307149 / 21550-27712 JSON TXT


LitCovid-PD-FMA-UBERON

LitCovid-PD-UBERON

{"project":"LitCovid-PD-UBERON","denotations":[{"id":"T2","span":{"begin":5150,"end":5155},"obj":"Body_part"}],"attributes":[{"id":"A2","pred":"uberon_id","subj":"T2","obj":"http://purl.obolibrary.org/obo/UBERON_0002544"}],"text":"MATERIALS AND METHODS\n\nSequence retrieval and alignments.\nFull polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequences were obtained for each of 34 distinct but representative alpha and betacoronaviruses from broad genus and subgenus distributions, including all known human coronaviruses (i.e., SARS-CoV, SARS-CoV-2, MERS-CoV, HKU1, OC43, NL63, and 229E). FASTA-formatted protein sequence data (the full accession number list is available in Table S5 in the supplemental material) were retrieved from the National Center of Biotechnology Information (NCBI) (67). For each of the protein classes (i.e., ORF1ab, S, M, E, and N), all 34 coronavirus sequences were aligned using the Clustal Omega v1.2.4 multisequence aligner tool employing the following parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations 0, maximum guide tree iterations [-1], and maximum HMM iterations [-1] (68). For the purposes of estimating time of viral peptide production, we classified ORF1a and ORF1b peptides as “early” whereas all other peptides produced by subgenomic mRNAs were classified as “late” (69, 70).\n\nConserved peptide assessment.\nAligned sequences were imported into Jalview v. 
2.1.1 (71) with automated generation of the following alignment annotations: (i) sequence consensus, calculated as the percentage of the modal residue per column; (ii) sequence conservation (0 to 11), measured as a numerical index reflecting conservation of amino acid physicochemical properties in the alignment; (iii) alignment quality (0 to 1), measured as a normalized sum of BLOSUM62 ratios for all residues at each position; and (iv) occupancy, calculated as the number of aligned residues (not including gaps) for each position. In all cases, sequence conservation was assessed for each of the following three groups: only human-infecting coronavirus sequences (n = 7), all betacoronavirus sequences (n = 16), and all alpha- and betacoronavirus sequences combined (n = 34). Aligned SARS-CoV-2 sequences and all annotations were manually exported for subsequent analysis. Conserved human coronavirus peptides were defined as those with a length of ≥8 consecutive amino acids, each showing agreement with SARS-CoV-2 sequences and ≥4 other human coronavirus sequences with the consensus sequence (Table S2). For each of these conserved peptides, we also assessed the component number of 8- to 12-mers sharing identical amino acid sequence between SARS-CoV-2 and each of the four other common human coronaviruses (i.e., OC43, HKU1, NL63, and 229E) (Table S3). For all peptides, human, beta, and combined conservation scores were obtained using a custom R v.3.6.2 script representing mean sequence conservation (minus gap penalties where relevant) (see https://github.com/pdxgx/covid19).\n\nPeptide-MHC class I binding affinity predictions.\nFASTA-formatted input protein sequences from the entire SARS-CoV-2 and SARS-CoV proteomes were obtained from the NCBI RefSeq database (67) under accession numbers NC_045512.2 and NC_004718.3. We kmerized each of these sequences into 8- to 12-mers to assess MHC class I-peptide binding affinity across the entire proteome. 
MHC class I binding affinity predictions were performed using 145 different HLA alleles for which global allele frequency data were available as described previously (72) (see Table S5) with netMHCpan v4.0 (73) using the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides 8 to 12 amino acids in length (Table S1). Binding affinity was not predicted for peptides containing the character ‘|’ in their sequences. Additional MHC class I binding affinity predictions were performed on all 66 MHCflurry-supported alleles (–list-supported-alleles; Table S6) using both MHCnuggets 2.3.2 (74) and MHCflurry 1.4.3 (75) (see Tables S7, S8, and S9 and Fig. S7 to S10 in the supplemental material). We further cross-referenced these lists of peptides with existing experimentally validated SARS-CoV epitopes present in the Immune Epitope Database (Table S4) (76). We then performed consensus binding affinity predictions for the 66 supported alleles shared by all three tools by taking the union set of alleles and filtering for peptide-allele pairs matching the union set of alleles. For the SARS-CoV-specific and SARS-CoV-2-specific distributions of per-allele proteome presentation, we exclude all peptide-allele pairs with \u003e500 nM predicted binding. In all cases, we used the netchop v3.0 (77) “C-term” model with a cleavage threshold of 0.1 to further remove any peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage (of the peptide’s C terminus).\n\nGlobal HLA allele and haplotype frequencies.\nHLA-A, -B, and -C allele and haplotype frequency data were obtained from the Allele Frequency Net Database (52) for 805 distinct populations pertaining to 101 different countries and 2,628 distinct major/minor (4-digit) alleles, corresponding to 20,478 distinct haplotypes (https://github.com/pdxgx/covid19). 
We also identified full HLA genotype data for 3,382 individuals whose HLA types were confined to the 145 HLA alleles studied here. Population allele and haplotype frequency data were aggregated by country as a mean of all constituent population allele or haplotype frequencies weighted by sample size of the population but not accounting for the representative ethnic demographic size of the population. Global allele frequency maps were generated using the rworldmap v1.3-6 package (78), with total global allele and haplotype frequency estimates calculated as the mean of per-country allele and haplotype frequencies, weighted by each country’s population in 2005.\n\nData availability.\nSource code is available at https://github.com/pdxgx/covid19 under the Massachusetts Institute of Technology (MIT) license. Data File S4 can be found at https://github.com/pdxgx/covid19/blob/master/supporting_data/Appendix_4.zip."}
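The embedded methods text says each protein sequence was "kmerized" into 8- to 12-mers and that peptides containing the character '|' were not scored. The following is a minimal Python sketch of that step, not the authors' code (theirs is in the linked repository). The function name `kmerize` and its parameters are hypothetical, and it assumes '|' acts as a separator between concatenated sequences.

```python
def kmerize(sequence, k_min=8, k_max=12, skip_char="|"):
    """Return every contiguous peptide of length k_min..k_max.

    Peptides containing skip_char are dropped. Treating '|' as a
    sequence separator is an assumption made for illustration.
    """
    peptides = []
    for k in range(k_min, k_max + 1):
        # A sequence shorter than k yields an empty range here.
        for start in range(len(sequence) - k + 1):
            peptide = sequence[start:start + k]
            if skip_char in peptide:
                continue
            peptides.append(peptide)
    return peptides
```

A sequence of length n with no separator yields n - k + 1 peptides for each k from 8 to 12, so the peptide count grows roughly fivefold relative to the protein length.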

LitCovid-PD-MONDO

{"project":"LitCovid-PD-MONDO","denotations":[{"id":"T94","span":{"begin":368,"end":376},"obj":"Disease"},{"id":"T95","span":{"begin":378,"end":386},"obj":"Disease"},{"id":"T96","span":{"begin":2178,"end":2186},"obj":"Disease"},{"id":"T97","span":{"begin":2399,"end":2407},"obj":"Disease"},{"id":"T98","span":{"begin":2640,"end":2648},"obj":"Disease"},{"id":"T99","span":{"begin":3086,"end":3094},"obj":"Disease"},{"id":"T100","span":{"begin":3101,"end":3109},"obj":"Disease"},{"id":"T101","span":{"begin":4172,"end":4180},"obj":"Disease"},{"id":"T102","span":{"begin":4475,"end":4483},"obj":"Disease"},{"id":"T103","span":{"begin":4497,"end":4505},"obj":"Disease"}],"attributes":[{"id":"A94","pred":"mondo_id","subj":"T94","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A95","pred":"mondo_id","subj":"T95","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A96","pred":"mondo_id","subj":"T96","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A97","pred":"mondo_id","subj":"T97","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A98","pred":"mondo_id","subj":"T98","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A99","pred":"mondo_id","subj":"T99","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A100","pred":"mondo_id","subj":"T100","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A101","pred":"mondo_id","subj":"T101","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A102","pred":"mondo_id","subj":"T102","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"},{"id":"A103","pred":"mondo_id","subj":"T103","obj":"http://purl.obolibrary.org/obo/MONDO_0005091"}],"text":"MATERIALS AND METHODS\n\nSequence retrieval and alignments.\nFull polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequences were obtained for each of 34 distinct but representative alpha and betacoronaviruses from broad genus and subgenus distributions, 
including all known human coronaviruses (i.e., SARS-CoV, SARS-CoV-2, MERS-CoV, HKU1, OC43, NL63, and 229E). FASTA-formatted protein sequence data (the full accession number list is available in Table S5 in the supplemental material) were retrieved from the National Center of Biotechnology Information (NCBI) (67). For each of the protein classes (i.e., ORF1ab, S, M, E, and N), all 34 coronavirus sequences were aligned using the Clustal Omega v1.2.4 multisequence aligner tool employing the following parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations 0, maximum guide tree iterations [-1], and maximum HMM iterations [-1] (68). For the purposes of estimating time of viral peptide production, we classified ORF1a and ORF1b peptides as “early” whereas all other peptides produced by subgenomic mRNAs were classified as “late” (69, 70).\n\nConserved peptide assessment.\nAligned sequences were imported into Jalview v. 2.1.1 (71) with automated generation of the following alignment annotations: (i) sequence consensus, calculated as the percentage of the modal residue per column; (ii) sequence conservation (0 to 11), measured as a numerical index reflecting conservation of amino acid physicochemical properties in the alignment; (iii) alignment quality (0 to 1), measured as a normalized sum of BLOSUM62 ratios for all residues at each position; and (iv) occupancy, calculated as the number of aligned residues (not including gaps) for each position. In all cases, sequence conservation was assessed for each of the following three groups: only human-infecting coronavirus sequences (n = 7), all betacoronavirus sequences (n = 16), and all alpha- and betacoronavirus sequences combined (n = 34). Aligned SARS-CoV-2 sequences and all annotations were manually exported for subsequent analysis. 
Conserved human coronavirus peptides were defined as those with a length of ≥8 consecutive amino acids, each showing agreement with SARS-CoV-2 sequences and ≥4 other human coronavirus sequences with the consensus sequence (Table S2). For each of these conserved peptides, we also assessed the component number of 8- to 12-mers sharing identical amino acid sequence between SARS-CoV-2 and each of the four other common human coronaviruses (i.e., OC43, HKU1, NL63, and 229E) (Table S3). For all peptides, human, beta, and combined conservation scores were obtained using a custom R v.3.6.2 script representing mean sequence conservation (minus gap penalties where relevant) (see https://github.com/pdxgx/covid19).\n\nPeptide-MHC class I binding affinity predictions.\nFASTA-formatted input protein sequences from the entire SARS-CoV-2 and SARS-CoV proteomes were obtained from the NCBI RefSeq database (67) under accession numbers NC_045512.2 and NC_004718.3. We kmerized each of these sequences into 8- to 12-mers to assess MHC class I-peptide binding affinity across the entire proteome. MHC class I binding affinity predictions were performed using 145 different HLA alleles for which global allele frequency data were available as described previously (72) (see Table S5) with netMHCpan v4.0 (73) using the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides 8 to 12 amino acids in length (Table S1). Binding affinity was not predicted for peptides containing the character ‘|’ in their sequences. Additional MHC class I binding affinity predictions were performed on all 66 MHCflurry-supported alleles (–list-supported-alleles; Table S6) using both MHCnuggets 2.3.2 (74) and MHCflurry 1.4.3 (75) (see Tables S7, S8, and S9 and Fig. S7 to S10 in the supplemental material). We further cross-referenced these lists of peptides with existing experimentally validated SARS-CoV epitopes present in the Immune Epitope Database (Table S4) (76). 
We then performed consensus binding affinity predictions for the 66 supported alleles shared by all three tools by taking the union set of alleles and filtering for peptide-allele pairs matching the union set of alleles. For the SARS-CoV-specific and SARS-CoV-2-specific distributions of per-allele proteome presentation, we exclude all peptide-allele pairs with \u003e500 nM predicted binding. In all cases, we used the netchop v3.0 (77) “C-term” model with a cleavage threshold of 0.1 to further remove any peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage (of the peptide’s C terminus).\n\nGlobal HLA allele and haplotype frequencies.\nHLA-A, -B, and -C allele and haplotype frequency data were obtained from the Allele Frequency Net Database (52) for 805 distinct populations pertaining to 101 different countries and 2,628 distinct major/minor (4-digit) alleles, corresponding to 20,478 distinct haplotypes (https://github.com/pdxgx/covid19). We also identified full HLA genotype data for 3,382 individuals whose HLA types were confined to the 145 HLA alleles studied here. Population allele and haplotype frequency data were aggregated by country as a mean of all constituent population allele or haplotype frequencies weighted by sample size of the population but not accounting for the representative ethnic demographic size of the population. Global allele frequency maps were generated using the rworldmap v1.3-6 package (78), with total global allele and haplotype frequency estimates calculated as the mean of per-country allele and haplotype frequencies, weighted by each country’s population in 2005.\n\nData availability.\nSource code is available at https://github.com/pdxgx/covid19 under the Massachusetts Institute of Technology (MIT) license. Data File S4 can be found at https://github.com/pdxgx/covid19/blob/master/supporting_data/Appendix_4.zip."}
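The aggregation described in the embedded text has two levels: population frequencies are averaged per country, weighted by sample size, and the per-country values are then averaged globally, weighted by each country's population. The sketch below illustrates that arithmetic in Python under stated assumptions. The function names, the record layout `(country, sample_size, frequency)`, and the numbers in the usage example are all hypothetical and are not taken from the source data.

```python
from collections import defaultdict


def country_frequencies(records):
    """Sample-size-weighted mean frequency per country.

    records: iterable of (country, sample_size, frequency) tuples,
    one per surveyed population (hypothetical layout).
    """
    weighted_sum = defaultdict(float)
    total_samples = defaultdict(float)
    for country, sample_size, frequency in records:
        weighted_sum[country] += sample_size * frequency
        total_samples[country] += sample_size
    return {
        country: weighted_sum[country] / total_samples[country]
        for country in weighted_sum
        if total_samples[country] > 0
    }


def global_frequency(per_country, populations):
    """Mean of per-country frequencies, weighted by country population."""
    total_population = sum(populations[c] for c in per_country)
    weighted = sum(per_country[c] * populations[c] for c in per_country)
    return weighted / total_population


# Illustrative numbers only.
records = [("A", 100, 0.10), ("A", 300, 0.30), ("B", 50, 0.20)]
per_country = country_frequencies(records)
estimate = global_frequency(per_country, {"A": 1_000_000, "B": 3_000_000})
```

As the text notes, the per-country weights are sample sizes rather than the demographic share of each sampled group, so a heavily sampled minority population can dominate its country's estimate.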

LitCovid-PD-CLO

LitCovid-PD-CHEBI

LitCovid-PubTator

LitCovid-PD-GO-BP

{"project":"LitCovid-PD-GO-BP","denotations":[{"id":"T33","span":{"begin":2988,"end":2991},"obj":"http://purl.obolibrary.org/obo/GO_0046776"},{"id":"T34","span":{"begin":3287,"end":3290},"obj":"http://purl.obolibrary.org/obo/GO_0046776"},{"id":"T35","span":{"begin":3352,"end":3355},"obj":"http://purl.obolibrary.org/obo/GO_0046776"},{"id":"T36","span":{"begin":3816,"end":3819},"obj":"http://purl.obolibrary.org/obo/GO_0046776"},{"id":"T37","span":{"begin":4804,"end":4807},"obj":"http://purl.obolibrary.org/obo/GO_0046776"},{"id":"T38","span":{"begin":4816,"end":4834},"obj":"http://purl.obolibrary.org/obo/GO_0019882"}],"text":"MATERIALS AND METHODS\n\nSequence retrieval and alignments.\nFull polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequences were obtained for each of 34 distinct but representative alpha and betacoronaviruses from broad genus and subgenus distributions, including all known human coronaviruses (i.e., SARS-CoV, SARS-CoV-2, MERS-CoV, HKU1, OC43, NL63, and 229E). FASTA-formatted protein sequence data (the full accession number list is available in Table S5 in the supplemental material) were retrieved from the National Center of Biotechnology Information (NCBI) (67). For each of the protein classes (i.e., ORF1ab, S, M, E, and N), all 34 coronavirus sequences were aligned using the Clustal Omega v1.2.4 multisequence aligner tool employing the following parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations 0, maximum guide tree iterations [-1], and maximum HMM iterations [-1] (68). 
For the purposes of estimating time of viral peptide production, we classified ORF1a and ORF1b peptides as “early” whereas all other peptides produced by subgenomic mRNAs were classified as “late” (69, 70).\n\nConserved peptide assessment.\nAligned sequences were imported into Jalview v. 2.1.1 (71) with automated generation of the following alignment annotations: (i) sequence consensus, calculated as the percentage of the modal residue per column; (ii) sequence conservation (0 to 11), measured as a numerical index reflecting conservation of amino acid physicochemical properties in the alignment; (iii) alignment quality (0 to 1), measured as a normalized sum of BLOSUM62 ratios for all residues at each position; and (iv) occupancy, calculated as the number of aligned residues (not including gaps) for each position. In all cases, sequence conservation was assessed for each of the following three groups: only human-infecting coronavirus sequences (n = 7), all betacoronavirus sequences (n = 16), and all alpha- and betacoronavirus sequences combined (n = 34). Aligned SARS-CoV-2 sequences and all annotations were manually exported for subsequent analysis. Conserved human coronavirus peptides were defined as those with a length of ≥8 consecutive amino acids, each showing agreement with SARS-CoV-2 sequences and ≥4 other human coronavirus sequences with the consensus sequence (Table S2). For each of these conserved peptides, we also assessed the component number of 8- to 12-mers sharing identical amino acid sequence between SARS-CoV-2 and each of the four other common human coronaviruses (i.e., OC43, HKU1, NL63, and 229E) (Table S3). 
For all peptides, human, beta, and combined conservation scores were obtained using a custom R v.3.6.2 script representing mean sequence conservation (minus gap penalties where relevant) (see https://github.com/pdxgx/covid19).\n\nPeptide-MHC class I binding affinity predictions.\nFASTA-formatted input protein sequences from the entire SARS-CoV-2 and SARS-CoV proteomes were obtained from the NCBI RefSeq database (67) under accession numbers NC_045512.2 and NC_004718.3. We kmerized each of these sequences into 8- to 12-mers to assess MHC class I-peptide binding affinity across the entire proteome. MHC class I binding affinity predictions were performed using 145 different HLA alleles for which global allele frequency data were available as described previously (72) (see Table S5) with netMHCpan v4.0 (73) using the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides 8 to 12 amino acids in length (Table S1). Binding affinity was not predicted for peptides containing the character ‘|’ in their sequences. Additional MHC class I binding affinity predictions were performed on all 66 MHCflurry-supported alleles (–list-supported-alleles; Table S6) using both MHCnuggets 2.3.2 (74) and MHCflurry 1.4.3 (75) (see Tables S7, S8, and S9 and Fig. S7 to S10 in the supplemental material). We further cross-referenced these lists of peptides with existing experimentally validated SARS-CoV epitopes present in the Immune Epitope Database (Table S4) (76). We then performed consensus binding affinity predictions for the 66 supported alleles shared by all three tools by taking the union set of alleles and filtering for peptide-allele pairs matching the union set of alleles. For the SARS-CoV-specific and SARS-CoV-2-specific distributions of per-allele proteome presentation, we exclude all peptide-allele pairs with \u003e500 nM predicted binding. 
In all cases, we used the netchop v3.0 (77) “C-term” model with a cleavage threshold of 0.1 to further remove any peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage (of the peptide’s C terminus).\n\nGlobal HLA allele and haplotype frequencies.\nHLA-A, -B, and -C allele and haplotype frequency data were obtained from the Allele Frequency Net Database (52) for 805 distinct populations pertaining to 101 different countries and 2,628 distinct major/minor (4-digit) alleles, corresponding to 20,478 distinct haplotypes (https://github.com/pdxgx/covid19). We also identified full HLA genotype data for 3,382 individuals whose HLA types were confined to the 145 HLA alleles studied here. Population allele and haplotype frequency data were aggregated by country as a mean of all constituent population allele or haplotype frequencies weighted by sample size of the population but not accounting for the representative ethnic demographic size of the population. Global allele frequency maps were generated using the rworldmap v1.3-6 package (78), with total global allele and haplotype frequency estimates calculated as the mean of per-country allele and haplotype frequencies, weighted by each country’s population in 2005.\n\nData availability.\nSource code is available at https://github.com/pdxgx/covid19 under the Massachusetts Institute of Technology (MIT) license. Data File S4 can be found at https://github.com/pdxgx/covid19/blob/master/supporting_data/Appendix_4.zip."}

LitCovid-PD-GlycoEpitope

{"project":"LitCovid-PD-GlycoEpitope","denotations":[{"id":"T4","span":{"begin":4814,"end":4823},"obj":"GlycoEpitope"}],"attributes":[{"id":"A4","pred":"glyco_epitope_db_id","subj":"T4","obj":"http://www.glycoepitope.jp/epitopes/EP0138"}],"text":"MATERIALS AND METHODS\n\nSequence retrieval and alignments.\nFull polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequences were obtained for each of 34 distinct but representative alpha and betacoronaviruses from broad genus and subgenus distributions, including all known human coronaviruses (i.e., SARS-CoV, SARS-CoV-2, MERS-CoV, HKU1, OC43, NL63, and 229E). FASTA-formatted protein sequence data (the full accession number list is available in Table S5 in the supplemental material) were retrieved from the National Center of Biotechnology Information (NCBI) (67). For each of the protein classes (i.e., ORF1ab, S, M, E, and N), all 34 coronavirus sequences were aligned using the Clustal Omega v1.2.4 multisequence aligner tool employing the following parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations 0, maximum guide tree iterations [-1], and maximum HMM iterations [-1] (68). For the purposes of estimating time of viral peptide production, we classified ORF1a and ORF1b peptides as “early” whereas all other peptides produced by subgenomic mRNAs were classified as “late” (69, 70).\n\nConserved peptide assessment.\nAligned sequences were imported into Jalview v. 
2.1.1 (71) with automated generation of the following alignment annotations: (i) sequence consensus, calculated as the percentage of the modal residue per column; (ii) sequence conservation (0 to 11), measured as a numerical index reflecting conservation of amino acid physicochemical properties in the alignment; (iii) alignment quality (0 to 1), measured as a normalized sum of BLOSUM62 ratios for all residues at each position; and (iv) occupancy, calculated as the number of aligned residues (not including gaps) for each position. In all cases, sequence conservation was assessed for each of the following three groups: only human-infecting coronavirus sequences (n = 7), all betacoronavirus sequences (n = 16), and all alpha- and betacoronavirus sequences combined (n = 34). Aligned SARS-CoV-2 sequences and all annotations were manually exported for subsequent analysis. Conserved human coronavirus peptides were defined as those with a length of ≥8 consecutive amino acids, each showing agreement with SARS-CoV-2 sequences and ≥4 other human coronavirus sequences with the consensus sequence (Table S2). For each of these conserved peptides, we also assessed the component number of 8- to 12-mers sharing identical amino acid sequence between SARS-CoV-2 and each of the four other common human coronaviruses (i.e., OC43, HKU1, NL63, and 229E) (Table S3). For all peptides, human, beta, and combined conservation scores were obtained using a custom R v.3.6.2 script representing mean sequence conservation (minus gap penalties where relevant) (see https://github.com/pdxgx/covid19).\n\nPeptide-MHC class I binding affinity predictions.\nFASTA-formatted input protein sequences from the entire SARS-CoV-2 and SARS-CoV proteomes were obtained from the NCBI RefSeq database (67) under accession numbers NC_045512.2 and NC_004718.3. We kmerized each of these sequences into 8- to 12-mers to assess MHC class I-peptide binding affinity across the entire proteome. 
MHC class I binding affinity predictions were performed using 145 different HLA alleles for which global allele frequency data were available as described previously (72) (see Table S5) with netMHCpan v4.0 (73) using the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides 8 to 12 amino acids in length (Table S1). Binding affinity was not predicted for peptides containing the character ‘|’ in their sequences. Additional MHC class I binding affinity predictions were performed on all 66 MHCflurry-supported alleles (–list-supported-alleles; Table S6) using both MHCnuggets 2.3.2 (74) and MHCflurry 1.4.3 (75) (see Tables S7, S8, and S9 and Fig. S7 to S10 in the supplemental material). We further cross-referenced these lists of peptides with existing experimentally validated SARS-CoV epitopes present in the Immune Epitope Database (Table S4) (76). We then performed consensus binding affinity predictions for the 66 supported alleles shared by all three tools by taking the union set of alleles and filtering for peptide-allele pairs matching the union set of alleles. For the SARS-CoV-specific and SARS-CoV-2-specific distributions of per-allele proteome presentation, we exclude all peptide-allele pairs with \u003e500 nM predicted binding. In all cases, we used the netchop v3.0 (77) “C-term” model with a cleavage threshold of 0.1 to further remove any peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage (of the peptide’s C terminus).\n\nGlobal HLA allele and haplotype frequencies.\nHLA-A, -B, and -C allele and haplotype frequency data were obtained from the Allele Frequency Net Database (52) for 805 distinct populations pertaining to 101 different countries and 2,628 distinct major/minor (4-digit) alleles, corresponding to 20,478 distinct haplotypes (https://github.com/pdxgx/covid19). 
We also identified full HLA genotype data for 3,382 individuals whose HLA types were confined to the 145 HLA alleles studied here. Population allele and haplotype frequency data were aggregated by country as a mean of all constituent population allele or haplotype frequencies weighted by sample size of the population but not accounting for the representative ethnic demographic size of the population. Global allele frequency maps were generated using the rworldmap v1.3-6 package (78), with total global allele and haplotype frequency estimates calculated as the mean of per-country allele and haplotype frequencies, weighted by each country’s population in 2005.\n\nData availability.\nSource code is available at https://github.com/pdxgx/covid19 under the Massachusetts Institute of Technology (MIT) license. Data File S4 can be found at https://github.com/pdxgx/covid19/blob/master/supporting_data/Appendix_4.zip."}

LitCovid-sentences

2_test

{"project":"2_test","denotations":[{"id":"32303592-26553804-65501646","span":{"begin":631,"end":633},"obj":"26553804"},{"id":"32303592-21988835-65501647","span":{"begin":1098,"end":1100},"obj":"21988835"},{"id":"32303592-15680415-65501648","span":{"begin":1301,"end":1303},"obj":"15680415"},{"id":"32303592-19151095-65501649","span":{"begin":1396,"end":1398},"obj":"19151095"},{"id":"32303592-26553804-65501650","span":{"begin":3165,"end":3167},"obj":"26553804"},{"id":"32303592-29653567-65501651","span":{"begin":3519,"end":3521},"obj":"29653567"},{"id":"32303592-28978689-65501652","span":{"begin":3559,"end":3561},"obj":"28978689"},{"id":"32303592-31871119-65501653","span":{"begin":3975,"end":3977},"obj":"31871119"},{"id":"32303592-29960884-65501654","span":{"begin":4000,"end":4002},"obj":"29960884"},{"id":"32303592-30357391-65501655","span":{"begin":4241,"end":4243},"obj":"30357391"},{"id":"32303592-15744535-65501656","span":{"begin":4676,"end":4678},"obj":"15744535"},{"id":"32303592-25414323-65501657","span":{"begin":5045,"end":5047},"obj":"25414323"}],"text":"MATERIALS AND METHODS\n\nSequence retrieval and alignments.\nFull polyprotein 1ab (ORF1ab), spike (S) protein, membrane (M) protein, envelope (E) protein, and nucleocapsid (N) protein sequences were obtained for each of 34 distinct but representative alpha and betacoronaviruses from broad genus and subgenus distributions, including all known human coronaviruses (i.e., SARS-CoV, SARS-CoV-2, MERS-CoV, HKU1, OC43, NL63, and 229E). FASTA-formatted protein sequence data (the full accession number list is available in Table S5 in the supplemental material) were retrieved from the National Center of Biotechnology Information (NCBI) (67). 
For each of the protein classes (i.e., ORF1ab, S, M, E, and N), all 34 coronavirus sequences were aligned using the Clustal Omega v1.2.4 multisequence aligner tool employing the following parameters: sequence type [Protein], output alignment format [clustal_num], dealign [false], mBed-like clustering guide-tree [true], mBed-like clustering iteration [true], number of combined iterations 0, maximum guide tree iterations [-1], and maximum HMM iterations [-1] (68). For the purposes of estimating time of viral peptide production, we classified ORF1a and ORF1b peptides as “early” whereas all other peptides produced by subgenomic mRNAs were classified as “late” (69, 70).\n\nConserved peptide assessment.\nAligned sequences were imported into Jalview v. 2.1.1 (71) with automated generation of the following alignment annotations: (i) sequence consensus, calculated as the percentage of the modal residue per column; (ii) sequence conservation (0 to 11), measured as a numerical index reflecting conservation of amino acid physicochemical properties in the alignment; (iii) alignment quality (0 to 1), measured as a normalized sum of BLOSUM62 ratios for all residues at each position; and (iv) occupancy, calculated as the number of aligned residues (not including gaps) for each position. In all cases, sequence conservation was assessed for each of the following three groups: only human-infecting coronavirus sequences (n = 7), all betacoronavirus sequences (n = 16), and all alpha- and betacoronavirus sequences combined (n = 34). Aligned SARS-CoV-2 sequences and all annotations were manually exported for subsequent analysis. Conserved human coronavirus peptides were defined as those with a length of ≥8 consecutive amino acids, each showing agreement with SARS-CoV-2 sequences and ≥4 other human coronavirus sequences with the consensus sequence (Table S2). 
For each of these conserved peptides, we also assessed the component number of 8- to 12-mers sharing identical amino acid sequence between SARS-CoV-2 and each of the four other common human coronaviruses (i.e., OC43, HKU1, NL63, and 229E) (Table S3). For all peptides, human, beta, and combined conservation scores were obtained using a custom R v.3.6.2 script representing mean sequence conservation (minus gap penalties where relevant) (see https://github.com/pdxgx/covid19).\n\nPeptide-MHC class I binding affinity predictions.\nFASTA-formatted input protein sequences from the entire SARS-CoV-2 and SARS-CoV proteomes were obtained from the NCBI RefSeq database (67) under accession numbers NC_045512.2 and NC_004718.3. We kmerized each of these sequences into 8- to 12-mers to assess MHC class I-peptide binding affinity across the entire proteome. MHC class I binding affinity predictions were performed using 145 different HLA alleles for which global allele frequency data were available as described previously (72) (see Table S5) with netMHCpan v4.0 (73) using the ‘-BA’ option to include binding affinity predictions and the ‘-l’ option to specify peptides 8 to 12 amino acids in length (Table S1). Binding affinity was not predicted for peptides containing the character ‘|’ in their sequences. Additional MHC class I binding affinity predictions were performed on all 66 MHCflurry-supported alleles (–list-supported-alleles; Table S6) using both MHCnuggets 2.3.2 (74) and MHCflurry 1.4.3 (75) (see Tables S7, S8, and S9 and Fig. S7 to S10 in the supplemental material). We further cross-referenced these lists of peptides with existing experimentally validated SARS-CoV epitopes present in the Immune Epitope Database (Table S4) (76). We then performed consensus binding affinity predictions for the 66 supported alleles shared by all three tools by taking the union set of alleles and filtering for peptide-allele pairs matching the union set of alleles. 
For the SARS-CoV-specific and SARS-CoV-2-specific distributions of per-allele proteome presentation, we exclude all peptide-allele pairs with \u003e500 nM predicted binding. In all cases, we used the netchop v3.0 (77) “C-term” model with a cleavage threshold of 0.1 to further remove any peptides that were not predicted to undergo canonical MHC class I antigen processing via proteasomal cleavage (of the peptide’s C terminus).\n\nGlobal HLA allele and haplotype frequencies.\nHLA-A, -B, and -C allele and haplotype frequency data were obtained from the Allele Frequency Net Database (52) for 805 distinct populations pertaining to 101 different countries and 2,628 distinct major/minor (4-digit) alleles, corresponding to 20,478 distinct haplotypes (https://github.com/pdxgx/covid19). We also identified full HLA genotype data for 3,382 individuals whose HLA types were confined to the 145 HLA alleles studied here. Population allele and haplotype frequency data were aggregated by country as a mean of all constituent population allele or haplotype frequencies weighted by sample size of the population but not accounting for the representative ethnic demographic size of the population. Global allele frequency maps were generated using the rworldmap v1.3-6 package (78), with total global allele and haplotype frequency estimates calculated as the mean of per-country allele and haplotype frequencies, weighted by each country’s population in 2005.\n\nData availability.\nSource code is available at https://github.com/pdxgx/covid19 under the Massachusetts Institute of Technology (MIT) license. Data File S4 can be found at https://github.com/pdxgx/covid19/blob/master/supporting_data/Appendix_4.zip."}
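Each JSON object above pairs a "text" field with "denotations" whose "span" gives "begin" and "end" offsets into that text. A minimal Python sketch for recovering the annotated strings follows. It assumes the offsets are zero-based character positions with an exclusive end, which is consistent with the span lengths shown here but is not stated on the page. The function name `annotated_strings` and the toy document are hypothetical.

```python
import json


def annotated_strings(annotation_json):
    """Return (id, obj, covered text) for each denotation in one project."""
    doc = json.loads(annotation_json)
    text = doc["text"]
    results = []
    for denotation in doc.get("denotations", []):
        span = denotation["span"]
        # Assumed: half-open [begin, end) character offsets.
        covered = text[span["begin"]:span["end"]]
        results.append((denotation["id"], denotation["obj"], covered))
    return results


# Toy document in the same shape as the objects above.
toy = json.dumps({
    "project": "example",
    "denotations": [
        {"id": "T1", "span": {"begin": 0, "end": 8}, "obj": "Disease"}
    ],
    "text": "SARS-CoV epitopes",
})
rows = annotated_strings(toy)
```

Because the offsets index into the exact "text" string, any change to that string, even a one-character typo fix, would shift every later span.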
