Identifying CNVs under Selection Using a Bayesian Method

Identifying selection

Identifying candidate CNVs under natural selection using locus-specific estimates of FST is of great interest, in view of our increased knowledge of the relationships among human populations from genome-wide patterns of variation. The logic for identifying selection is straightforward. The pattern of genetic differentiation at a neutral locus is completely determined by the demographic history, the migration rates among the populations, and the mutation rates at the loci. It is reasonable to assume that all autosomal loci have experienced the same demographic history and migration rates among the populations, and that the observed population structure can be largely explained by random drift at neutral loci. However, because individuals from different populations often differ genetically at a few key sites in their genome, loci showing unusually large amounts of differentiation may indicate regions of the genome that have been subject to positive selection, whereas loci showing unusually small amounts of differentiation may indicate regions that have been subject to stabilizing (balancing) selection [15, 22-24]. Thus, the outlier method makes it possible to detect divergence at some loci of the genome due to selection.

Bayesian method

Balding and Nichols proposed the use of population- and locus-specific estimates in the context of a migration-drift equilibrium model [24]. They modeled the allele frequencies of biallelic markers using a beta distribution with expectation p and variance p(1-p)/(1+θ), so that FST = 1/(1+θ). They then used a likelihood-based approach to estimate population- and locus-specific FST. This formulation was later extended to multiallelic loci [19]. Falush et al. [25] considered a nonequilibrium fission model in which subpopulations evolve in isolation after splitting from an ancestral population. We will not distinguish between the migration-drift equilibrium model and the nonequilibrium fission model, since both lead to the same multinomial-Dirichlet distribution (or beta-binomial for biallelic loci) for the subpopulation allele frequencies [26]. The main difference between the two models resides in the interpretation given to FST: under the migration-drift equilibrium model, FST measures how divergent each subpopulation is from the total population, while under the fission model, it measures the degree of genetic differentiation between each descendant population and the ancestral population.

We consider a collection of J subpopulations and a set of I loci. Let Ki be the number of alleles at the ith locus. The extent of differentiation between subpopulation j and the ancestral population at locus i is measured by $F_{ST}^{ij}$. Let pi = { pik } denote the allele frequencies of the ancestral population at locus i, where pik is the frequency of allele k at locus i ($\sum_k p_{ik} = 1$). We use p = { pi } to denote the entire set of allele frequencies of the ancestral population and $\tilde{p}_{ij} = \{\tilde{p}_{ijk}\}$ to denote the current allele frequencies at locus i for subpopulation j. Under the model and the definitions above, the allele frequencies at locus i in subpopulation j follow a Dirichlet distribution with parameters θij pi, a Bayesian prior distribution,

$\tilde{p}_{ij} \sim \mathrm{Dir}(\theta_{ij} p_{i1}, \ldots, \theta_{ij} p_{iK_i}),$  (3)

where θij = 1/$F_{ST}^{ij}$ - 1. The extent of differentiation at locus i between subpopulation j and the ancestral population is measured by $F_{ST}^{ij}$ and is the result of its demographic history.
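As an illustration of the prior in Eq. (3), the following Python sketch (not part of BayeScan; the function name and numerical values are ours and purely illustrative) draws subpopulation allele frequencies from Dir(θij pi) for a given ancestral frequency vector and a given FST, using θij = 1/FST - 1. Larger FST values produce more widely scattered subpopulation frequencies.

# A minimal sketch (not BayeScan code) illustrating Eq. (3): given ancestral
# allele frequencies p_i and a population/locus-specific F_ST, current
# subpopulation frequencies are drawn from Dir(theta_ij * p_i),
# with theta_ij = 1/F_ST - 1.  All names and values are illustrative.
import numpy as np

rng = np.random.default_rng(seed=1)

def draw_subpop_freqs(p_ancestral, fst):
    """Draw current subpopulation allele frequencies from the Dirichlet prior."""
    p_ancestral = np.asarray(p_ancestral, dtype=float)
    theta = 1.0 / fst - 1.0                  # theta_ij = 1/F_ST^ij - 1
    return rng.dirichlet(theta * p_ancestral)

# Example: a biallelic locus with ancestral frequencies (0.7, 0.3).
# Stronger drift/divergence (larger F_ST) scatters the frequencies more.
for fst in (0.01, 0.10, 0.50):
    freqs = [draw_subpop_freqs([0.7, 0.3], fst)[0] for _ in range(5)]
    print(f"F_ST = {fst:.2f}: allele-1 frequencies in 5 subpopulations:",
          np.round(freqs, 3))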
The full prior distribution across loci and populations is given by

$P(\tilde{p} \mid p, \theta) = \prod_{i=1}^{I} \prod_{j=1}^{J} \mathrm{Dir}(\tilde{p}_{ij}; \theta_{ij} p_{i1}, \ldots, \theta_{ij} p_{iK_i}).$  (4)

We need a hierarchical model for locus- and population-specific effects to identify candidate loci under natural selection. Population- and locus-specific estimates of $F_{ST}^{ij}$ obtained by the method of moments (or ANOVA) are likely to be inaccurate, especially for loci with a small number of different alleles. As an alternative, Beaumont and Balding [22] proposed a logistic regression model to decompose the population- and locus-specific coefficients $F_{ST}^{ij}$ (bounded between 0 and 1) into a population-specific component, βj, shared by all loci, and a locus-specific component, αi, shared by all populations [22]:

$\log\!\left(\frac{F_{ST}^{ij}}{1 - F_{ST}^{ij}}\right) = \alpha_i + \beta_j.$  (5)

The locus-specific effects capture mutation and some forms of selection, and the population-specific effects capture migration rates, population sizes, and population-specific mating patterns [22]. The advantage of this formulation is that instead of estimating I·J locus- and population-specific coefficients (as in the method of moments or ANOVA), we only have to estimate the I parameters αi and the J parameters βj. With the estimates of αi and βj, $F_{ST}^{ij}$ (and equivalently θij = exp(-(αi + βj))) can be estimated. Above all, the estimate of αi is what we use to detect outlier loci as selection candidates. Departure from neutrality at a given locus is assumed when αi is significantly different from 0 at that locus. A positive value of αi suggests diversifying selection, whereas a negative value suggests balancing selection. The posterior probability that a locus is subject to selection, P(αi ≠ 0), is estimated directly from the Markov chain Monte Carlo (MCMC) method (see below).

Likelihood for allele counts

The data for codominant markers consist of allele counts obtained from samples of size nij. We use aijk to denote the number of copies of allele k observed at locus i in the sample from subpopulation j. Thus, nij = Σk aijk. The full dataset can be presented as a matrix N = { aij }, where aij = { aij1, aij2, ..., aijKi } is the vector of allele counts at locus i for subpopulation j. The observed allele counts, aij, can be considered as sampled from the true allele frequencies and therefore can be described by the multinomial distribution:

$P(a_{ij} \mid \tilde{p}_{ij}) = \frac{n_{ij}!}{\prod_{k} a_{ijk}!} \prod_{k=1}^{K_i} \tilde{p}_{ijk}^{\,a_{ijk}}.$  (6)

We adopt the multinomial-Dirichlet distribution as the marginal distribution of aij, the allele frequency counts at a locus within a subpopulation; the reasons for its adoption are explained in Foll and Gaggiotti [26]:

$P(a_{ij} \mid p_i, \theta_{ij}) = \frac{n_{ij}!}{\prod_{k} a_{ijk}!}\, \frac{\Gamma(\theta_{ij})}{\Gamma(n_{ij} + \theta_{ij})} \prod_{k=1}^{K_i} \frac{\Gamma(a_{ijk} + \theta_{ij} p_{ik})}{\Gamma(\theta_{ij} p_{ik})}.$

The joint likelihood for allele counts is obtained by multiplying across all loci and populations:

$P(N \mid p, \theta) = \prod_{i=1}^{I} \prod_{j=1}^{J} P(a_{ij} \mid p_i, \theta_{ij}).$  (7)

Since the allele frequencies in the ancestral population pi = { pik } are unknown, they are estimated by introducing a noninformative Dirichlet prior, pi ~ Dir(1, …, 1), into the Bayesian model [27]. The full Bayesian model is then given by the posterior distribution

$P(\alpha, \beta, p \mid N) \propto P(N \mid p, \alpha, \beta)\, \pi(\alpha)\, \pi(\beta)\, \pi(p).$  (8)

Following Beaumont and Balding [22], Gaussian prior distributions are used for the population effects βj and for the locus effects αi. Their means and variances are chosen to improve convergence and so that each $F_{ST}^{ij}$ has non-negligible prior density over almost the whole interval from 0 to 1.
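To make the two building blocks above concrete, the following Python sketch (illustrative code under our own naming, not BayeScan's implementation) evaluates the logistic decomposition of Eq. (5) and the multinomial-Dirichlet marginal likelihood of a single allele-count vector, using the relation θij = exp(-(αi + βj)) described above.

# A minimal sketch (assumptions, not BayeScan code) of:
# (i) the logistic decomposition of F_ST^ij into a locus effect alpha_i and
#     a population effect beta_j (Eq. 5), and
# (ii) the multinomial-Dirichlet marginal likelihood of the allele counts
#     a_ij given ancestral frequencies p_i and theta_ij (Eq. 7, one term).
import numpy as np
from scipy.special import gammaln

def fst_from_effects(alpha_i, beta_j):
    """F_ST^ij = 1 / (1 + exp(-(alpha_i + beta_j)))."""
    return 1.0 / (1.0 + np.exp(-(alpha_i + beta_j)))

def log_multinomial_dirichlet(a_ij, p_i, theta_ij):
    """Log marginal likelihood of the allele counts at one locus/population."""
    a_ij, p_i = np.asarray(a_ij, float), np.asarray(p_i, float)
    n_ij = a_ij.sum()
    return (gammaln(n_ij + 1) - gammaln(a_ij + 1).sum()        # multinomial coefficient
            + gammaln(theta_ij) - gammaln(n_ij + theta_ij)     # Dirichlet normalisation
            + (gammaln(a_ij + theta_ij * p_i)
               - gammaln(theta_ij * p_i)).sum())

# Example: counts (12, 8) at a biallelic locus, ancestral frequencies (0.6, 0.4),
# locus effect alpha = 0 (neutral) and population effect beta = -2.
alpha, beta = 0.0, -2.0
fst = fst_from_effects(alpha, beta)
theta = np.exp(-(alpha + beta))               # equivalently 1/fst - 1
print(f"F_ST = {fst:.3f}, log-likelihood = "
      f"{log_multinomial_dirichlet([12, 8], [0.6, 0.4], theta):.3f}")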
Identifying CNVs subject to selection

As described above, the effect of selection is parameterized by αi in the logistic regression. Thus, two models are considered: one that includes both the effect of αi (i.e., αi ≠ 0) and that of βj (selection model, M2), and another that does not include the effect of selection (i.e., αi = 0) and only includes the effect of βj (neutral model, M1).

We can then decide in a Bayesian framework whether or not there is selection at all. This is done by estimating the posterior probabilities of the two models, neutral (M1) and selection (M2), for each locus on the basis of a dataset N = { aij }. We use a reversible jump MCMC algorithm [28] to estimate the posterior probability of each of these models, and this is done separately for each locus i. We can then obtain the posterior probability that a locus is subject to selection - that is, P(αi ≠ 0). This probability is estimated directly from the output of the MCMC by simply counting the number of times αi is included in the model (see Foll and Gaggiotti [27] for a more detailed explanation).

In BayeScan, the selection decision (model choice), with the specific goal of detecting outlier loci in mind, is actually made by using posterior odds (PO). For this, we first need to describe Bayes factors. For the two models, neutral (M1) and selection (M2), the Bayes factor (BF) for model M2 is given by BF = P(N | M2)/P(N | M1), where N = { aij } is the dataset and P indicates probability. The BF provides a degree of evidence in favor of one model versus the other. In the context of multiple testing - that is, testing a large number of loci simultaneously - we also need to incorporate our skepticism about the chance that each locus is under selection. This is done by setting the prior odds for the neutral model, P(M1)/P(M2). We make selection decisions by using the PO, defined as PO = P(M2 | N)/P(M1 | N) = BF × P(M2)/P(M1). The PO are simply the ratio of the posterior probabilities and indicate how much more likely the model with selection (M2) is compared to the neutral model (M1). Values of log10(PO) of 0.5-1.0, 1.0-1.5, 1.5-2.0, and 2-∞ are, respectively, considered substantial, strong, very strong, and decisive evidence for selection [14]. A big advantage of posterior probabilities is that they directly allow control of the false discovery rate (FDR), defined as the expected proportion of false positives among outlier loci. Controlling the FDR has much greater power than controlling the family-wise error rate using, for example, a Bonferroni correction [29]. In BayeScan, by first setting a target FDR, the PO threshold achieving this FDR is determined, and loci with PO equal to or above this threshold are reported as outliers in the output produced by the R scripts provided along with BayeScan.
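The FDR-based decision rule can be illustrated with a short Python sketch. This is not BayeScan's own R script; the posterior probabilities below are made up, and the thresholding procedure shown is simply one standard way of converting per-locus posterior probabilities P(αi ≠ 0) into an outlier list at a target FDR (call the highest-probability loci first and stop when the expected proportion of false positives exceeds the target).

# A minimal sketch (illustrative, not BayeScan's R script) of choosing a
# posterior-odds threshold that achieves a target false discovery rate.
# post_prob[i] stands for the MCMC estimate of P(alpha_i != 0) at locus i;
# all numerical values below are hypothetical.
import numpy as np

def outliers_at_fdr(post_prob, target_fdr=0.05):
    """Return indices of loci declared outliers at the requested FDR, with log10(PO)."""
    post_prob = np.asarray(post_prob, float)
    order = np.argsort(post_prob)[::-1]          # most likely outliers first
    # Expected FDR when calling the top m loci: mean of (1 - posterior probability).
    cum_fdr = np.cumsum(1.0 - post_prob[order]) / np.arange(1, len(order) + 1)
    n_called = np.sum(cum_fdr <= target_fdr)     # largest m with expected FDR <= target
    selected = order[:n_called]
    # Posterior odds PO = P(M2|N) / P(M1|N); log10(PO) is the scale discussed above.
    po = post_prob[selected] / (1.0 - post_prob[selected])
    return selected, np.log10(po)

# Toy posterior probabilities for 8 loci (hypothetical values).
probs = [0.99, 0.97, 0.90, 0.60, 0.40, 0.20, 0.10, 0.05]
idx, log10_po = outliers_at_fdr(probs, target_fdr=0.05)
print("outlier loci:", idx, "log10(PO):", np.round(log10_po, 2))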