
CORD-19:2c006d09b6fccc527bf5ee3de0f165b018c39e73

Potential Neutralizing Antibodies Discovered for Novel Corona Virus Using Machine Learning

Abstract

Fast, hard-to-trace viral mutations can take the lives of thousands of people before the immune system can produce an inhibitory antibody. The recent outbreak of the novel coronavirus has infected and killed thousands of people worldwide. Rapid methods for finding peptides or antibody sequences that can inhibit the viral epitopes of COVID-19 will save thousands of lives. In this paper, we devised a machine learning (ML) model to predict possible inhibitory synthetic antibodies for the coronavirus. We collected 1933 virus-antibody sequences and their clinical patient neutralization response and trained an ML model to predict the antibody response. Using graph featurization with a variety of ML methods, we screened thousands of hypothetical antibody sequences and found 8 stable antibodies that potentially inhibit COVID-19. We combined bioinformatics, structural biology, and molecular dynamics (MD) simulations to verify the stability of the candidate antibodies that can inhibit the coronavirus.

Keywords: Coronavirus, COVID-19, Machine Learning, Antibody Engineering, Bioinformatics

The biomolecular process for recognition and neutralization of viral particles proceeds through viral antigen presentation and the recruitment of appropriate B cells to synthesize neutralizing antibodies [1]. Theoretically, this process allows the immune system to stop any viral invasion, but the response is slow and often requires days, even weeks, before an adequate immune response is achieved [2, 3]. This poses a challenging question: can the process of antibody discovery be accelerated to counter highly infective viral diseases? With the rapid expansion of available biological data, such as DNA/protein sequences and structures [4], it is now possible to model and predict complex biological phenomena through machine learning (ML) approaches.
Given sufficient training data, ML can be used to learn a mapping between a viral epitope and the effectiveness of its complementary antibody [5]. Once such a mapping is learnt, it can be used to predict a potentially neutralizing antibody for a given viral sequence [6]. ML can essentially learn the complex antigen-antibody interactions faster than the human immune system. The resulting synthetic inhibitory antibodies act as a bridge that can overcome the latency between viral infection and the human immune response. This bridge can potentially save many lives, especially during an outbreak or pandemic. One such instance is the spread of coronavirus disease (COVID-19) [7]. With its high infectivity and mortality rate, COVID-19 has become a global scare [8, 9]. To compound the problem, there are no proven therapeutics to aid suffering patients [2, 8, 10-14]. The only viable treatment at the moment is symptomatic, and there is a desperate need to develop therapeutics to counter COVID-19.

Recently, the proteomic sequences of the 'WH-Human 1' coronavirus became available through metagenomic RNA sequencing of a patient in Wuhan [4, 15]. WH-Human 1 is 89.1% similar to a group of SARS-like coronaviruses [4]. With this sequence available, it is possible to find potential inhibitory antibodies by scanning thousands of antibody sequences and discovering the neutralizing ones [16-18]. However, this requires very expensive and time-consuming experimentation, which makes it hard to discover inhibitory responses to the coronavirus in a timely manner. In addition, computational and physics-based models require the bound crystal structure of the antibody-antigen complex; however, only a few of these structures have become available [19-22]. In the case of COVID-19, the bound antigen-antibody crystal structure is not available to date [23, 24].
Given this challenge and the fact that ML models require a large amount of data, the ML approach should rely on the sequences of the antibody-antigen pair rather than on crystal structures [25]. In this paper, we have collected a dataset comprising antibody-antigen sequences of a variety of viruses, including HIV, Influenza, Dengue, SARS, Ebola, Hepatitis, etc., with their patient clinical/biochemical IC50 data. Using this dataset (which we call VirusNet), we trained and benchmarked different shallow and deep ML models and selected the best-performing model. Based on the 2006 SARS neutralizing antibody scaffold [26], we created thousands of potential antibody candidates by mutation and screened them with our best-performing ML model. Finally, molecular dynamics (MD) simulations were performed on the neutralizing candidates to check their structural stability. We predict 8 structures that were stable over the course of the simulation and are potential neutralizing antibodies for COVID-19. In addition, we interpreted the ML method to understand which alterations in the sequence of the antibody's binding region would most effectively counter the viral mutation(s) and restore the ability of the antibody to bind to the virus [27]. This information is critical for antibody design and engineering, and for reducing the number of combinatorial mutations that must be explored to find a neutralizing antibody.

The majority of the data in the training set is composed of HIV antibody-antigen complexes (1887 samples). Most of the samples for the HIV training set were obtained from the Compile, Analyze and Tally NAb Panels (CATNAP) database from the Los Alamos National Laboratory (LANL) [28, 29]. From CATNAP, data was collected for the monoclonal antibodies 2F5, 4E10 and 10E8, which bind to GP41 [30-32]. Using CATNAP's functionality for identifying epitope alignment, we selected the FASTA sequence of the antigen corresponding to the site of alignment in the antibody.
To make the dataset more diverse and train a more robust ML model, we included more of the available antibody-antigen sequences and their neutralization potential. To do this, we compiled the sequences of Influenza, Dengue, Ebola, SARS, Hepatitis, etc. [26, 33-86] by searching for the keywords "virus, antibody" on the RCSB server [87] and selected the neutralizing complexes by reading their corresponding publications. Furthermore, for each neutralizing complex, the contact residues at the interface of antibody and antigen were selected. To select the antigen contact sequences, all amino acids within 5 Å of the corresponding antibody were chosen (see Supporting Information). To select the antibody contact sequences, all amino acids within 5 Å of the antigen were chosen. In total, 102 sequences of antibody-antigen complexes were mined and added to the 1831 samples, resulting in a total of 1933 training samples.

For an effective representation of the molecular structure of amino acids, the individual atoms of the amino acids of antibody and antigen were treated as an undirected graph, where the atoms are nodes and the bonds are edges [88]. It has been shown that a graph representation transfers the chemistry and topology of a molecular structure better than Extended Connectivity Fingerprints (ECFP) [88, 89]. We construct these molecular graphs using RDKit [90]. Embeddings are generated to encode relevant features of the molecular graph [91, 92]. These embeddings encode information such as the type of atom, the valency of an atom, hybridization state, aromaticity, etc. First, each antibody and antigen were encoded into separate embeddings and then concatenated into a single embedding for the entire antibody-antigen complex. We then apply mean pooling over the features of this concatenated embedding to ensure dimensional consistency across the training data.
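
The concatenate-then-pool step can be illustrated with a minimal sketch. It is not the authors' implementation: the atom description `(element, degree, is_aromatic)`, the `ELEMENTS` vocabulary and both function names are hypothetical, and the sketch assumes that concatenation stacks the per-atom feature rows of antibody and antigen before averaging over atoms. In practice the per-atom features would come from RDKit rather than hand-written tuples.

```python
from typing import List, Sequence, Tuple

# Hypothetical, reduced atom description: (element, degree, is_aromatic).
Atom = Tuple[str, int, bool]
ELEMENTS = ("C", "N", "O", "S")  # assumed vocabulary for the one-hot part


def atom_features(atom: Atom) -> List[float]:
    """Encode one atom as a one-hot element vector plus degree and aromaticity."""
    element, degree, aromatic = atom
    one_hot = [1.0 if element == e else 0.0 for e in ELEMENTS]
    return one_hot + [float(degree), 1.0 if aromatic else 0.0]


def complex_embedding(antibody: Sequence[Atom], antigen: Sequence[Atom]) -> List[float]:
    """Stack per-atom features of both partners, then mean-pool over atoms."""
    rows = [atom_features(a) for a in list(antibody) + list(antigen)]
    if not rows:
        raise ValueError("at least one atom is required")
    # Averaging each feature column gives a vector whose length does not
    # depend on how many atoms the complex contains.
    return [sum(column) / len(rows) for column in zip(*rows)]


antibody = [("C", 4, False), ("N", 3, False)]
antigen = [("O", 1, False), ("C", 3, True)]
vector = complex_embedding(antibody, antigen)
print(len(vector))  # 6 features regardless of how many atoms were pooled
```

Because the pooled vector has a fixed length, complexes with different numbers of contact atoms can be fed to the same classifier.
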
The pooled information is then passed to classifier algorithms such as XGBoost [93], Random Forest [94], Multilayer Perceptron, Support Vector Machine (SVM) [95] and Logistic Regression, which then predict whether the antibody is capable of neutralizing the virus.

In order to find potential antibody candidates for COVID-19, 2589 different mutant strains of antibody sequences were generated based on the sequence of SARS neutralizing antibodies. The reason we selected these antibodies as initial scaffolds is that the genome of COVID-19 [4] is 79.8% identical to the "Tor2" isolate of SARS (Accession number: AY274119) [96] [...]

bioRxiv preprint, doi: https://doi.org/10.1101/2020.03.14.992156. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. It is made available under a CC-BY-NC-ND 4.0 International license.

[...] the amino acids in the binding region of the antibody (see Supporting Information for COVID-19 antigen and antibody interactions). To find the binding region of these antibodies for sequence generation, all amino acids within 5 Å of their respective antigen were chosen. To assess the biological feasibility of these mutant sequences, we scored each mutation using the BLOSUM62 matrix [97].

To assess the stability of the proposed antibody structures, we performed molecular dynamics (MD) simulations of each antibody structure in a solvated environment [98]. The simulation of the solvated antibody was carried out using GROMACS-5.1.4 [99-101], and topologies for each antibody were generated according to the GROMOS 54a7 [102] force field. The protein was centered in a box extending 1 nanometer from the surface of the protein. This box was then solvated with SPC216 model water, pre-equilibrated at 300 K. The antibody system in general carried a net positive charge, which was neutralized by adding counter ions. Energy minimization was carried out using the steepest descent algorithm while restraining the peptide backbone, to remove steric clashes between atoms and subsequently optimize the solvent molecule geometry.
The stopping criteria for this minimization were forces below 100.0 kJ/mol/nm or a number of steps exceeding 50,000. This minimized structure was then sent to two rounds of equilibration at 300 K. The first round used an NVT ensemble for 50 picoseconds with a 2-femtosecond time step. A leap-frog integrator was used with the Verlet cutoff scheme, and the neighbor list was updated every 10 steps. All the ensembles were under periodic boundary conditions, and bond constraints were applied with the LINCS algorithm [103]; under this scheme the long-range electrostatic interactions were computed with the Particle Mesh Ewald (PME) algorithm [104]. The Berendsen thermostat was used for temperature coupling, and pressure coupling was done using the Parrinello-Rahman barostat [105, 106]. The last round of equilibration, in the NPT ensemble, ensures that the simulated system is at physiological temperature and pressure. The system volume was free to change in the NPT ensemble but did not change significantly during the course of the simulation. Following the rounds of equilibration, the production run for the system was carried out in the NPT ensemble without constraints for a total of 15 nanoseconds, under otherwise identical simulation parameters.

The flowchart of COVID-19 antibody discovery using ML has four major steps ( [...] test is as follows: Influenza (84.61%), Dengue (100%), Ebola (75%), Hepatitis (75%), SARS (100%). The out-of-class results demonstrate that our model is capable of generalizing its predictions to a completely novel virus epitope. Since COVID-19 is a completely new virus, we conclude that our model's prediction performance will be accurate.
The fact that our model's prediction accuracy is 100% in the SARS out-of-class test demonstrates its capability to effectively predict antibodies for COVID-19, which is from the SARS family.

In order to be more comprehensive, we created co-mutations out of 5 stable point mutations (C3, C7, C14, C17, C18; see Table S1 in Supporting Information for the list of all 18 candidates). This resulted in 5 new structures (Co1, Co2, Co3, Co4, Co5 in Table S2) that were screened using XGBoost for neutralization. Among all 5 co-mutations, only Co5 did not neutralize. To check the stability of the 4 neutralizing co-mutations, MD simulations were performed, and Co1, Co2 and Co4 were found to be stable (Figure 4b). The list of the final 8 stable mutations and co-mutations is tabulated in Table 1, and the PDB structures are available as Supporting Information.

We have developed a machine learning model for high-throughput screening of synthetic antibodies to discover antibodies that potentially inhibit COVID-19. Our approach can be widely applied to other viruses for which only the sequences of viral coat protein-antibody pairs can be obtained. The ML models were trained on 14 different virus types and achieved over 90% five-fold test accuracy. The out-of-class prediction accuracy is 100% for SARS and 84.61% for Influenza, demonstrating the power of our model for predicting the neutralization of antibodies for novel viruses like COVID-19. Using this model, the neutralization of thousands of hypothetical antibodies was predicted, and 18 antibodies were found to be highly efficient in neutralizing COVID-19. Using MD simulations, the stability of the predicted antibodies was checked, and 8 stable antibodies were found that can neutralize COVID-19.
In addition, the interpretation of the ML model revealed that mutating to Methionine and Tyrosine is highly efficient in enhancing the affinity of antibodies to COVID-19.

[...] Jayan for her support and Junhan Li for his help in collecting the data.

The RMSD and contact distance plots for all the trajectories versus time, the structure of the virus-antibody complex and the residues at the contact region, native contacts in the antigen-antibody complex, the interaction of the COVID-19 epitope with the 2GHW antibody, tables for all neutralizing point mutations, co-mutations and their neutralization potentials, the structures of stable antibodies in PDB format, and the IC50 data interpretation are available online. The VirusNet dataset will be available upon request.

Figure 1. Designing antibodies or peptide sequences that can inhibit the COVID-19 virus requires high-throughput experimentation on vastly mutated sequences of potential inhibitors. Screening thousands of available strains of antibodies is prohibitively expensive and not feasible due to the lack of available structures. However, machine learning models can enable rapid and inexpensive exploration of the vast sequence space on a computer in a fraction of a second. We collected 1933 virus-antibody sequences with clinical patient IC50 data. Graph featurization of antibody-antigen sequences creates a unique molecular representation.
Using the graph representation, we benchmarked and used a variety of shallow and deep learning models and selected XGBoost because of its superior performance and interpretability. We trained our model using a dataset of 1,933 diverse virus epitopes and their antibodies. To generate the hypothetical antibody library, we mutated the 2006 SARS scaffold antibody (PDB: 2GHW) and generated thousands of possible candidates. Using the ML model, we classified these sequences and selected the top 18 sequences predicted to neutralize COVID-19 with high confidence. We used MD simulations to check the stability of the 18 sequences and rank them based on their stability.

Figure 3. a) The test accuracy with five-fold cross-validation for XGBoost, Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM) and Deep Learning (Multilayer Perceptron). XGBoost has the highest accuracy (90.75%). b) Out-of-training-class test accuracy for Influenza, Dengue, Ebola, Hepatitis, and SARS. To perform this test, for example for Influenza, all the Influenza virus-antibody sequences were removed from the training set, the resulting model was tested on all samples of Influenza, and the accuracy is reported here. c) BLOSUM-validated mutations, non-neutralizing and neutralizing antibody sequences. To achieve more confidence, we set the threshold of prediction probability to 0.9895 in XGBoost and found 18 neutralizing antibody sequences (the green points). d) Interpretability of the ML model: to understand which mutations play the key roles in neutralization, XGBoost feature importance was used to rank the atomic-level features.
By connecting the atomic features with each of the 20 amino acids, M was found to be the most important amino acid in neutralization, followed by F, Y, and W.
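
The out-of-training-class test described in panel b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the `Sample` layout, the function names and the toy data are hypothetical, and `fit_majority` is a deliberately trivial placeholder standing in for the XGBoost classifier used in the paper.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical sample layout: (virus class, feature vector, neutralizing label).
Sample = Tuple[str, Sequence[float], int]
Model = Callable[[Sequence[float]], int]


def fit_majority(train: List[Sample]) -> Model:
    """Placeholder model: always predicts the most common training label."""
    majority = Counter(label for _, _, label in train).most_common(1)[0][0]
    return lambda features: majority


def out_of_class_accuracy(samples: List[Sample], held_out: str,
                          fit: Callable[[List[Sample]], Model] = fit_majority) -> float:
    """Train without one virus class, then test only on that class."""
    train = [s for s in samples if s[0] != held_out]
    test = [s for s in samples if s[0] == held_out]
    if not train or not test:
        raise ValueError("both the training and the held-out set must be non-empty")
    model = fit(train)
    correct = sum(model(x) == y for _, x, y in test)
    return correct / len(test)


# Toy data, invented purely to exercise the function.
samples = [
    ("HIV", [0.1], 1), ("HIV", [0.2], 1), ("HIV", [0.9], 0),
    ("Influenza", [0.3], 1), ("Influenza", [0.8], 0),
]
print(out_of_class_accuracy(samples, "Influenza"))  # 0.5
```

Passing a different `fit` function swaps in a real classifier without changing the hold-out logic, which keeps every sample of the held-out virus out of training.
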
