Synthetic lethals in HIV: ways to avoid drug resistance
Background: RNA viruses rapidly accumulate genetic variation, which can give rise to synthetic lethal (SL) and deleterious (SD) mutations. Synthetic lethal mutations (non-lethal when alone but lethal when combined in one genome) have been studied to develop cancer therapies. This principle can also be used against fast-evolving RNA-viruses. Indeed, targeting protein sites involved in SD + SL interactions with a drug would render any mutation of such sites, lethal.
Results: Here, we set up a strategy to detect intragenic pairs of SL and SD at the surface of the protein to predict less escapable drug target sites. For this, we detected SD + SL, studying HIV protease (PR) and reverse transcriptase (RT) sequence alignments from two groups of VIH + individuals: treated with drugs (T) or not (NT). Using a series of statistical approaches, we were able to propose bona fide SD + SL couples. When focusing on spatially close co-variant SD + SL couples at the surface of the protein, we found 5 SD + SL groups (2 in the protease and 3 in the reverse transcriptase), which could be good candidates to form pockets to accommodate potential drugs.
Conclusions: Thus, designing drugs targeting these specific SD + SL groups would not allow the virus to mutate any residue involved in such groups without losing an essential function. Moreover, we also show that the selection pressure induced by the treatment leads to the appearance of new mutations, which change the mutational landscape of the protein. This drives the existence of differential SD + SL couples between the drug-treated and non-treated groups. Thus, new anti-viral drugs should be designed differently to target such groups.
Reviewers: This article was reviewed by Neil Greenspan Csaba Pal and István Simon.
Classical SL genetic interactions involve non-lethal mutations (carried by two or more genes) whose combination leads to cell death. They have been extensively used to study gene-product interactions in the secretion pathway of yeast  and bacteria  . Then, they were used to develop anti-cancer therapies     by pinpointing a gene (say, X) whose inactivation forms a pair of SL with a mutated cancer-causing gene. In this context, the drug will target gene X and not the gene responsible for the disease. The synthetic lethality relationship appears when the product of gene X is rendered non-functional by the action of the drug. Thus, the existence of both nonfunctional proteins provokes a lethal phenotype and leads to cancer cell death. The effect of the drug on normal cells, would not change their phenotype, and thus should not induce any secondary effects. Based on this paradigm, we describe a slightly different concept to uncover new druggable targets in RNA viruses using an intragenic SLbased strategy  . Indeed, RNA-viruses can escape drugs  and vaccines  , due to mutation of the targets against which such therapeutic molecules are developed. To circumvent this problem, pocket-binding drugs targeting viral fundamental functions should be pinpointed, so that the virus cannot mutate without losing the relevant essential function (Figure four in  ). Invariant residues fulfill this condition but they are rare in the proteins of RNAviruses. It is this notion, of "invariance", that we extend to a group of residues. Intragenic SL and synthetic deleterious (SD) can be exploited for this purpose. For simplicity, we call SDL the ensemble of SD + SL. A group of amino acids, spatially close (say, less than 10 Å between two residues) and located at the protein surface, can provide a suitable therapeutic target. These residues should be either invariant or being members of the same SL group. Due to these two features, essentiality for protein function and invariance, these targets are unique in that they might minimize or even prevent viral escape to treatment.
Various studies have been performed to describe pairwise and higher-order site correlations within RNA-virus proteins         employing various approaches such as information theory, non synonymous versus synonymous mutations, Bayesian networks, etc. Using generalized kernel ridge regression and maximum entropy models, others [17, 18] have described a general and interesting concept, which is the fitness landscape. Unfortunately, their goal was not to make the difference between compensatory mutations (CM) and SL pairs. Moreover, they were not interested in pointing to potential druggable sites, which is one of our main aims here. Further works were specifically developed on the viral RNA SL but they simulate them rather than detecting them [19, 20] .
In a preliminary work taking the HIV protease as a model, [7, 21, 22] , we described positions involved in SDL couples. The method used yielded results comparable to those obtained by other teams working on the same subject    . However, the sole knowledge of the amino acid (AA) positions is just part of the molecular picture and knowing the exact nature of the AAs involved in the SDL couples is just as important. Moreover, a SDL couple is not expected to exist alone, but rather within the context of a mutational network involving other couples of SDL and CM. Finally, to uncover functional covariation we must exclude background linkage disequilibrium (BLD). In sum, from a sequence alignment and a three-dimensional structure, we developed a strategy involving statistical tests, phylogeny, 3D structure and binding sites for constructing an in silico tool that predicts potential therapeutic targets. This tool has been tested on two HIV proteins, the protease (PR) and the reverse transcriptase (RT) and allowed us to describe five targets consisting of SL and invariant positions that should greatly minimize the emergence of drug resistance.
Steps to predict drug targets in silico
To define protein regions as potential druggable targets avoiding therapeutic escape, we have focused on SDL couples (Figure four in  ) and invariant positions located in their vicinities. To do so, seven steps are necessary. We need first to identify pairs of interdependent sites. They were defined by examining the variant positions (those having accumulated more than 0.3% of mutations). Specifically, these variant positions were tested in pairs using statistical tests, described in the Material and Methods (MM) section, commonly used to define dependencies between positions. Couples responding positively to 3 of the 4 tests were taken as interdependent pairs. A couple of residues may co/anti-vary for two main reasons: they can be either an interdependent couple (CM or SDL) or be derived from BLD. Only SDLs qualify for druggable targets not allowing therapeutic escape. SDL couples were defined as those having a number of observed pairs of mutated residues smaller than the number of expected pairs. Thus, we have defined a dissimilarity coefficient ξ, which is negative for SDL couples, and positive for CM couples (see Material and Methods). We filtered the results of this exploration to keep only the pairs located on the protein surface as it is the most accessible location for known therapeutic targets (accessibility threshold greater than 25%, using the ASA software  and implemented by Allan et al.  based on the 3D PR structure PDB ID:1HSG  and 3D RT structure PDB ID:1DLO  ). Next, we had to prove that SDL couples did not derive from a common ancestor (i.e. exclude BLD). Position couples underwent a further test: for all codons underlying these two positions, we computed the number of synonymous (S) and non synonymous (A) mutations. If the number of non-synonymous mutation pairs (A-A) was twice as much the number of synonymous mutation pairs (S-S) we considered that this pair of codons undergoes a positive selective pressure. Such pairs were therefore assumed not to derive from a common ancestor, in other words, not derived from BDL ( Figure 1E for PR and 2 in brown for RT). Keeping in mind the idea of suggesting druggable targets, we retained only SDL couples that were close in space (at less than 10 Å on the 3D structures). Finally, "invariant" positions (<0.3% of mutations at the relevant position relative to the ancestral sequence), although infrequent, can also be taken into account in the design of inescapable drug targets. Thus, all invariant positions being at less than 10 Å from SDL positions were also kept. The last step was to determine the drugability of a group of residues. As a first approach, we used the Q-SiteFinder software to list most important binding sites of a protein from its 3D structure. Figure 1E for PR and Figure 2 in the brown area for RT, represent the BLD, the weaker it is, the bigger is the chance for a pair of interdependent residues to come from a common ancestor. The pairs successful for 3 statistical tests and not derived from a common ancestor are represented on a heatmap for PR ( Figure 1 ) and a Venn Diagram for RT ( Figure 2 ). We compared our results with those of Rhee et al.  . Of the 49 interdependent pairs  and used to detect gametic disequilibrium. This result is given if Θ is >1.5 or <0.5  . The D' coefficient has a value between −1 and 1. D: the correlated coefficient r 2 . The result is given if Θ is >1.5 or <0.5  . The r 2 -correlated coefficient has a value between 0 and 1. they describe for patients under anti-RT treatments, only 5 are close in space and on the surface of a RT. These five couples are positive for our interdependence tests but two of them where rejected by our BLD test. Indeed, this test was not performed by Rhee et al. Concerning the PR, out of 49 interdependent couples described by Rhee et al.  only one is close in space and at the PR surface. We found this positive one with our algorithm. Our previous results  and those of three other groups    , were also confirmed by this new strategy, excepted those coming from BLD. Finally, to distinguish between SDL and CM, we determine the dissimilarity coefficient ξ for each pair of residues of each couple (Additional file 1: Table S1 represents this result for PR and Additional file 2: Table S2 for RT). All intermediate results, from the validity of statistical tests to the SDL determination are displayed in Table 1 . These results show that half of the interdependent couples come from a common ancestor. For the other half, only 50% involved SDL couples. To identify groups of positions that will become our future targets, the invariant positions located within 10 Å of a SDL couple were determined. The number of SDL is 10 times higher in the RT-T groups than in the other three groups. This result comes from the fact that the RT is 5 times longer than the PR, and because the sequences of the treated groups contain more mutations. Next, SDL couples and the invariant positions in their vicinity were gathered to form a graph. The subgraph positions of these graphs (in Table 2 ) represent the potential future targets. PR-NT and PR-T graphs ( Figure 3 and 4) contain two subgraphs, the RT-NT ( Figure 5 ) graph has three and RT-T ( Figure 6 ) graph, seven. Note that, subgraphs containing only two positions were excluded because they cannot form a realistic binding site.
Are these targets really druggable?
The groups of positions composing these subgraphs are predicted to be good targets to avoid resistance. However, to be of therapeutic interest, these targets should also be a good binding sites, i.e. pocket-shaped and composed of atoms that a small molecule can bind. As a first approximation, we tested this possibility by using the Q-siteFinder program  . From a three-dimensional PR structure chosen from the Protein data bank, Q-siteFinder determined 10 protein regions, which could form a binding pocket. We then kept the positions in the intersection between our subgraph results and Q-SiteFinder binding pockets. Table 2 lists the AA groups that fulfill the 7 conditions described at the beginning of this section. These groups therefore are candidate therapeutic targets forming predicted good binding sites with low potential to generate drug-resistance. We have highlighted two of these groups on the PR structure ( Figure 7A ). The first one, containing positions 12, 14, 19 (T1 in blue on Figure 7A , numbered in Table 2 ), has a site volume of 103 Å 3 and is common for patients treated and untreated patients. The second one containing positions 40, 42, 61 (T2 in red on Figure 7A , numbered in Table 2 ) with a site volume of 82 Å 3 , can only be used for untreated patients. Interestingly, studies of Bonhoeffer's  group on fitness landscape, described the same regions and defined them as characterized by strong epistasis. These regions have previously been described as being important for protein function  . The two best-scoring targets defined by the Q-SiteFinder software, correspond to the active site of the PR. The majority of drugs (not to say all) against this protein bind to its active site but, unfortunately, resistance against all these molecules have appeared. Besides, we did not find SDL in those areas. We have highlighted three of these groups on the RT structure ( Figure 7B ). The first one, containing the positions 13, 14, 15, 86, 17 (T3 in blue on Figure 7B , numbered in Table 2 ), has a site volume of 243 Å 3 and is common for treated and untreated patients. Of note, the position 86 disappears from the treated group. This target is localized in the RT fingers. The second one, localized in the thumb and containing positions 259, 262, 263, 266 (T4 in red on Figure 7B , numbered in Table 2 ) with a site volume of 375 Å 3 , only appears in the untreated set. The last one, involving positions 63, 64, 65, 66, 67, 70, 72 (T5 in yellow on Figure 7B , numbered in Table 2 ) with a site volume of 252 Å 3 and localized in the RT palm, is relevant for the treated set only. Interestingly, the second and third targets are involved in the DNA binding process.
Thus far, these results do not tell us anything about the nature of the AAs involved in these couples. Indeed, a given position can be involved in both CM and SDL relationships (concerning two different AA) with other positions and these relationships are interdependent. For this reason, we also compiled a list of the specific AAs involved in all the SDL and the CM couples, because they influence the general mutational landscape of the protein. All AA couples located at these positions and their dissimilarity coefficients ξ are listed in the Additional file 1: Table S1 for PR and Additional file 2: Table S2 for RT. In these tables, it appears very clearly that SDL and CM couples are not necessarily the same between treated and untreated patient sets. That is, couples can covary in one set and not in the other one (e. g. 45-46, 61-72, 63-72 for PR). Regarding the RT, the number of SL couples for the untreated set is much smaller than the number of couples in the treated set (Table 1) , which means that many couples are not common to both groups. Obviously, the potential druggable targets themselves are not the same in both groups of patients (Figure 3-6) . Thus, keeping this in mind, a potential drug can be able to block RT or PR in naïve patients, in treated patients or both. How to interpret the fact that the targets we describe lie outside the active sites? The residues constituting the protein active site are generally responsible for the chemical reaction allowing the enzymatic activity of the protein. However, the active site is not the only essential part of a protein as this function is carried by its threedimensional structure. Protease studies  show that its very flexible structure allows the flaps to open in order to accommodate its substrate. It is obvious that opening the flaps is an essential function for the enzymatic activity. It is therefore quite possible to block an essential function without docking a drug directly in its active site. The best examples are the existence of the non-nucleosidic reverse transcriptase inhibitors.
We would like to develop a software able to generate a table of interdependent residues and to sort out the best AA groups to uncover inescapable drug targets. Such a strategy can be applied to any protein, especially those from RNA viruses such as flu  , coronavirus  , hepatitis C virus  , provided that enough mutated sequences are available in the databases. These best interdependent AA groups could then be tested to assess whether their 3D arrangements form a druggable pocket at the protein surface. Q-SiteFinder allows a first approximation for pocket detection that will be enriched with studies that consider the flexible nature of the proteins, to discover the most suitable pockets. This technique allows the description of potential targets, which must be biologically validated, to prove they carry essential functions.
Viral fitness is one of the major aspects of the therapeutic escape along with variation and interdependence. Drugs increase the selection pressure and then alter the general mutational landscape of the target viral protein.
Indeed, several positions are mutated in the treated set, which generate/maintain viral drug resistance. These new mutations can have a drastic impact on the fitness of the virus, and several other positions could also mutate to maintain/increase the fitness of these newly mutated viruses. It could be interesting to create a sequence database, where each sequence would be associated with a viral fitness measure  , such as its average copy number in the blood. With this information in hand and based on the quasi-species theory principles  , it would be easy to test if the existence of SDL groups in a sequence can be correlated with a low fitness (i.e. a low copy number). Thus, we could show that to escape a drug, a virus will have to make mutations within SDL groups and to pay the price for, by decreasing its replication potential.
The choice of SDL and invariant positions as unique components of effective druggable targets has the ultimate aim of reducing or even eliminating drug-resistance.
Our results describe two new potential targets on PR and 3 on RT. We offer an unusual strategy, since these targets are not necessarily the same for the treated and untreated patients. The drug-induced selection pressures reveal new mutations that most often, reduce the fitness of the mutated organism. Variants that possess mutations enabling them to acquire better fitness, will now be selected. These two successive waves of mutations change the general equilibrium between CM and SDL in the two patient sets, leading to different drug development strategies. In the near future, it can be important to administer different molecules to naive (never treated) patients and to treated patients.
Sometimes a single mutation allows viruses to escape treatment. If this mutation appears on a SL position, no function will be lost. That is why in the description of our target we include the invariant positions, which mutated, prevent protein function. However, if this first mutation appears alone, we reach the limit of our strategy and resistance can develop. Our target will be unusable as it will be the equivalent of the targets described in the past. However, drug docking on targets consisting of invariant residues and SL pairs, is the best way to block viral resistance.
Wet biology can only describe an existing situation where residues appear to mutate concomitantly to induce resistance against a PI. Conversely wet biology cannot assess a situation where two residues are required to mutate together to induce resistance (but entailing the loss of an essential function). Indeed, this situation never appears. Here, we have focused on the kind of couples constituted by SDL and not by CM to describe new potential protein pockets that could be bound by potential drugs. If we were able to do so, HIV virus could not escape treatment without loosing an essential function. Additional file 1: Table S1 summarizes these interdependence relationships (i.e. a look-up table describing the exact AAs forming CM or SDL).
The method described in this manuscript is applied to HIV but can be used on any sequence dataset. In fact the only limitation is the total number of mutations per position. Indeed, in order to study the ability of two positions to mutate simultaneously or not, it is necessary to prove that each of these positions is variable. RNA viruses mutate approximately 100 times faster than most other organisms. This ability allows these species to be prime candidates for our method. However, since the number of sequenced genomes being constantly increased, it is almost certain that in the near future, this method will also be used to find new drugs against bacteria for which antibiotic resistance are becoming a major problem of public health.
Most drugs have been developed based on their ability to bind active sites. They can therefore bind the active sites of similar proteins and thus generate possible side effects. Our technique allows to target regions outside of the active sites, which might help define drugs with fewer side effects.
As already said, it will be necessary to experimentally validate these bioinformatic predictions. For this, it is important to prove that the targets are essential for protein function. This question could be addressed by studying how the mutation of the residues composing the targets will affect viral activity. Small molecules binding the target at the selected positions can be found using virtual high throughput screening of large chemical libraries. Potential leads emerging from these hits may be refined by structure-activity studies. Finally, inhibition of viral activity in the presence of these molecules should validate the quality of the inhibitor.
24656 PR sequences and 23052 RT sequences of HIV-1 subtype B, from non-treated patients were downloaded the 7 th of May 2013, from the Stanford University HIV drug resistance database  (http://hivdb.stanford.edu/). 10585 sequences, from patients treated with 1 to 9 PI were downloaded as well and 9784 RT sequences from patient with 1-7 NRTI and/or 1-4 NNRTI. The sequences of these two protein sets are full length i.e. containing the 99 positions of the PR, 560 positions for the RT.
In order to define the accessibility of the AAs to an external ligand (i.e. a potential drug), we computed the surface accessible to the solvent, using the ASA software  available at RPBS  , based on the 3D PR structure PDB ID:1HSG  and 3D RT structure PDB ID:1DLO  . All AAs having an accessibility threshold greater than 25% are considered "accessible".
Previous protein alignments were recoded to focus the mutated AA status relative to a reference sequence. Each AA was compared to the AA from the ancestral sequence in the same position. It is recoded in 1 if it is equivalent to the ancestral sequence, 0 otherwise and N if it is not defined. Only positions lying on the surface of the protein and variants (ie with more than 0.3% of mutated positions) have been taken into account.
A couple were defined as interdependent if 3 of the following 4 statistical tests.
1. The Fisher exact test of covariance coded in R was used to examine each variant position pairs of PR and RT. To overcome the bias caused by the large number of tests performed, the p-values were re-adjusted using a FDR method in R. After this adjustment, only p-values > 0.05 were retained. The pairs corresponding to these p-values are black on the heatmap of Figure 1 and numbered in the black area on figure 2 for RT. 2. The D' test measures the linkage disequilibrium [39, 13, 40] which is the non-random association calculation of two alleles at two loci. This D' test has been computed for all pairs of positions variants and accessible (using as input the recoding alignments according to Wang data's  ). The pairs corresponding to these p-values are the "non red" on the heatmap of Figure 1 and numbered in the red area on Figure 2 for RT. 3. r 2  is an index derived from the correlation index D Lewontin [39, 13, 40] . Using recoding alignments, this test r 2 has been computed in Perl according to (13, 32, 33) for all pairs of positions variants and accessible. The pairs corresponding to these p-values are black on the heatmap of Figure 1 and numbered in the green area on Figure 2 for RT. 4. This last test is a χ 2 ij that takes into account the true nature of AA and not just the fact that it is mutated or not. It is thus calculated from the protein alignment (not recoded) of the method according Noirvirt  . In these conditions, only couples expected more than 5 times were kept. Given a p-value of 0.05 in the sense of  , we calculated that 6% of the couples of positions that are detected using the random shuffling method are due to multiplicity (i. e. FDR) for the three sets. The pairs corresponding to these p-values are black on the heatmap of Figure 1 and numbered in the blue area on Figure 2 for RT.
Using DNA sequences, couples of non synonymous (A-A) and couples of synonymous mutations (S-S) were determined. A D' coefficient were then computed from these data as explained in [13, 14] .
When a couple was determined as interdependent, one can compute a signed dissimilarity coefficient ξ which is negative when the number of expected AA couple were superior of the number of observed couples (SDL pairs), otherwise it is a compensatory pairs (CM).
Furthermore this coefficient is here conventionally signed as follows:
If Nobs A,i,B,j ≥ Nex A,i,B,j then ξ A,i,B,j = + χ 2
Where "A" is a specific AA at position "i", "B" is a specific AA at position "j" and χ 2 A,i,B,j is computed as in  .
Reviewer's report 1 Based on amino acid sequence alignments from either treated or non-treated individuals, they identified amino acids that appear to be accessible and lethal or deleterious when simultaneously mutated (synthetic lethal, SL, or synthetic deleterious, SD, residues). The authors also identify apparently invariant PR and RT amino acids that are therefore assumed to be critical for molecular function. The central hypothesis being pursued is that drugs able to bind to such SL/SD pairs that are in sufficient proximity to one another, plus one or more amino acids identified as invariant, on the molecular surface would serve as relatively non-mutable target sites for inhibitory drugs. Success in their objective would be of obvious value in the efforts to minimize the spread of HIV and the management of infection in individuals already carrying HIV. In the present manuscript, the authors also demonstrate that exposure to treatment modifies the PR and RT mutational landscapes.
1. Given that the contents of the present manuscript have employed methods already described in a previous article (Brouillet et
The reviewers' comments of our first article enabled us to significantly change the method used. Indeed, our previous method does not solve three important points:
-Discrimination of pairs of residues functionally interdependent of those that are due to a common ancestor. To answer to reviewer 3 of the previous article, we used a 'D' Lewontin derivative test. This new test is used to compare the rates of synonymous and non synonymous mutations for pairs of positions. -Statistical studies were based on a single test. Three other tests were implemented (D' , r 2 , fisher). -The nature of the amino acids was not taken into account and only the Boolean result "mutated/ non-mutated" was calculated. New statistical tests now allow to define the exact nature of AAs forming interdependent couples.
Applications All findings concerning RT are new results since the first paper concerned only the PR that has served as a control in this new study. Two tables (RT, PR) describe the major pairs of mutations and the nature of the associated amino acids for the 4 sequence sets (Additional file 1: Tables S1 and Additional file 2: Table S2 ). These tables are essential for drug designers, chemists and chemoinformaticians.
Regarding the biological validation of these results: this is beyond the scope of the present paper but we are currently setting up a partnership with a HIV virology laboratory that will define the adequate experimental protocol and apply it.
Reviewer's comment 2. I am not confident that 100% of "invariant" residues are in fact critical for function. For example, a putatively invariant tryptophan residue at the start of the second framework region within all immunoglobulin heavy and light chain variable domains sequenced prior to the study by J. Sharon [J Immunol. 1988 Apr 15;140(8):2666-9] was found not to be critical for function. For an antibody of known antigen specificity, Sharon mutated the Trp to Ala by site-directed mutagenesis without apparent effect on antibody reactivity for antigen.
Reviewer 2: Csaba Pal, Biological Research Center, Hungary.
The main objective of the paper is to identify intragenic pairs of residues that show synthetic lethal interactions in HIV proteins. The authors use this information to uncover drug target sites that could potentially mitigate the evolution of resistance. The manuscript is well written and the presentation of ideas goes straight to the point. The strategy followed by the authors is, to my knowledge, innovative and a valid approach to try to overcome drug resistance during HIV therapy.
In fact, such approach, due to its target specificity and efficiency, would also be beneficial to the development of therapeutic approaches with less toxic side effects to the therapy of new-borns, infants and young children, which, together with multi-drug resistance, is an important problem to be solved in HIV therapy. The idea of creating a software tool for the identification of inescapable drug targets is very important, and could help medicinal chemists to focus their research on compounds that bind to the predicted target sites.
The authors should discuss the benefits and future perspectives of the work more deeply in the paper. For example, the possibility to apply such methodology to other target proteins should also briefly be discussed in the manuscript. The authors should also discuss in vitro/ in vivo validation of the reported results, including possible limitations of such studies.
It is reasonable that some drug, which fits to the PR-NT and RT-NT cases can not be used for PR-T and RT-T cases, but it is not clear why they are not usable the other way around.