Method-of-moments (or ANOVA) estimates of FST

F-statistics

Wright [12] introduced F-statistics (FST, FIT, and FIS) as a tool for describing the partitioning of genetic diversity within and among populations; they are directly related to the rates of evolutionary processes such as migration, mutation, and drift. F-statistics can be defined in several ways: in terms of variances of allele frequencies, correlations between random gametes, or probabilities that two randomly chosen gametes carry different alleles. The three statistics differ in their reference level: the subscript IS refers to 'individuals within subpopulations,' ST to 'subpopulations within the total population,' and IT to 'individuals within the total population.'

Following the work of Cockerham [16], F-statistics are defined in terms of variance components; that is, the total variation in the genetic data is broken down into three components: (a) between subpopulations within the total population (we sometimes say 'between populations'); (b) between individuals within subpopulations; and (c) between gametes within individuals. FST, FIT, and FIS are defined as the expectations under the model of a/(a + b + c), (a + b)/(a + b + c), and b/(b + c), respectively, and are estimated by the corresponding sample values [17, 18].

When Weir and Cockerham [18] presented these definitions, they assumed a model consisting of an ancestral population from which subpopulations have descended in isolation under the same evolutionary processes. Under that model it is meaningful to have a single measure of population structure, namely a global FST that is an average over subpopulations. However, in identifying candidate loci under natural selection, evidence for locus-specific selection is of interest, and thus the estimators of locus-specific FST will be described in the next section.
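As a small illustration of the variance-component definitions, the following Python sketch maps three components a, b, and c to FST, FIT, and FIS. The function name and the numerical values are hypothetical and chosen only for illustration; in practice the components would be estimated from data.

```python
def f_statistics_from_components(a, b, c):
    """Return FST, FIT and FIS from three variance components.

    a: between subpopulations within the total population
    b: between individuals within subpopulations
    c: between gametes within individuals
    (Hypothetical helper; the components are assumed to be given.)
    """
    total = a + b + c
    within = b + c
    if total == 0 or within == 0:
        raise ValueError("variance components give a zero denominator")
    return {
        "FST": a / total,
        "FIT": (a + b) / total,
        "FIS": b / within,
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from any data set.
    f = f_statistics_from_components(a=0.1, b=0.2, c=0.7)
    print(f)
    # The three ratios satisfy (1 - FIT) = (1 - FIS)(1 - FST) by construction.
    print(abs((1 - f["FIT"]) - (1 - f["FIS"]) * (1 - f["FST"])) < 1e-12)
```

Because the three ratios share the same components, the familiar identity (1 - FIT) = (1 - FIS)(1 - FST) holds for any a, b, and c with non-zero denominators.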
Readers are often confused by several terms that appear in population genetics. Wright [12] interpreted FST as a measure of the progress of a subpopulation towards fixation of one allele at each locus in the absence of mutation, and hence called it a 'fixation index.' FST is also interpreted as a measure of shared ancestry within the subpopulations, relative to that in the total population, and is thus called the 'coancestry coefficient' [19]. Therefore, a small value of FST means that allele frequencies are similar across subpopulations; a large value means that allele frequencies differ among subpopulations. On the other hand, FIS (or FIT) is defined as the correlation between the two gametes that form a zygote, relative to the subpopulation (or the total population), and is thus called the 'inbreeding coefficient' [19].

Estimating FST by ANOVA methods

The estimators of F-statistics proposed by Weir [17] and Weir and Cockerham [18] are based on an analysis of variance (ANOVA) of allele frequencies and are equivalently called method-of-moments estimates. The weighted ANOVA estimates of FST, FIT, and FIS may be expressed in terms of the mean squares for gametes (MSG), individuals (MSI), and populations (MSP; we sometimes say 'between subpopulations'), where the mean squares are estimated from an ANOVA model. In estimating FST for our analysis of CNV data, we need to consider unbalanced samples (i.e., populations of unequal sample size). However, as the formulas for that case are messy, we present here those for balanced samples; formulas for unbalanced samples can be found in Appendix A of Rousset [20]. The definition of F-statistics used here is

$$F_{IS} = \frac{Q_1 - Q_2}{1 - Q_2}, \qquad F_{ST} = \frac{Q_2 - Q_3}{1 - Q_3}, \qquad F_{IT} = \frac{Q_1 - Q_3}{1 - Q_3},$$

where the Q values are probabilities of identity in state: Q1 between genes (gametes) within individuals, Q2 between genes in different individuals within populations, and Q3 between genes in different populations.
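To make the identity-in-state quantities concrete, the sketch below counts identical pairs of genes in a toy diploid sample and plugs the resulting frequencies into F-statistics. It assumes the commonly used identity-in-state form FIS = (Q1 - Q2)/(1 - Q2), FST = (Q2 - Q3)/(1 - Q3), FIT = (Q1 - Q3)/(1 - Q3), and it assumes balanced samples. The function names and the genotypes are hypothetical; this is not the GENEPOP implementation.

```python
from itertools import combinations


def identity_frequencies(pops):
    """Observed frequencies of identical gene pairs (Q1, Q2, Q3).

    pops: list of populations; each population is a list of diploid
    genotypes given as (allele, allele) tuples. Pairs are pooled over
    populations, which equals a simple average for balanced samples.
    """
    # Q1: the two genes carried by the same individual
    individuals = [g for pop in pops for g in pop]
    q1 = sum(a == b for a, b in individuals) / len(individuals)

    # Q2: genes in different individuals of the same population
    same = total = 0
    for pop in pops:
        for g1, g2 in combinations(pop, 2):
            for x in g1:
                for y in g2:
                    same += (x == y)
                    total += 1
    q2 = same / total

    # Q3: genes taken from two different populations
    same = total = 0
    for p1, p2 in combinations(pops, 2):
        for g1 in p1:
            for g2 in p2:
                for x in g1:
                    for y in g2:
                        same += (x == y)
                        total += 1
    q3 = same / total
    return q1, q2, q3


def f_from_q(q1, q2, q3):
    """F-statistics from identity frequencies (assumed Q-based form)."""
    return {
        "FIS": (q1 - q2) / (1 - q2),
        "FST": (q2 - q3) / (1 - q3),
        "FIT": (q1 - q3) / (1 - q3),
    }


if __name__ == "__main__":
    # Hypothetical toy data: two populations of two individuals each.
    pops = [[("A", "A"), ("A", "B")], [("B", "B"), ("A", "B")]]
    q = identity_frequencies(pops)
    print(q, f_from_q(*q))
```

The brute-force pair counting is quadratic in the number of genes and is meant only to mirror the definitions; for real data one would work from allele counts instead.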
The estimates are expressed in terms of the observed frequencies $\hat{Q}$ of identical pairs of genes in the sample, with the following relationships:

$$MSG = 1 - \hat{Q}_1, \qquad MSI = 1 + \hat{Q}_1 - 2\hat{Q}_2,$$

and

$$MSP = 1 + \hat{Q}_1 + 2(n - 1)\hat{Q}_2 - 2n\hat{Q}_3,$$

where n is the sample size of each population. Then, the single-locus estimator is given by

$$\hat{F}_{ST} = \frac{MSP - MSI}{MSP + (n_c - 1)\,MSI + n_c\,MSG}, \tag{1}$$

which is found in Weir (1997: 178) [17]; nc is defined below. For balanced samples (nc = n), substituting the relationships above into (1) gives $(\hat{Q}_2 - \hat{Q}_3)/(1 - \hat{Q}_3)$.

If one needs the multilocus estimator of FST, it is usual to compute it as a sum of locus-specific numerators over a sum of locus-specific denominators (see Weir [17] and Weir and Cockerham [18]). This applies, for example, when SNPs are mapped to genes and a weighted-average FST over all SNPs is estimated for each gene [18]. For a set of I loci, the multilocus ANOVA estimator is

$$\hat{F}_{ST} = \frac{\sum_{i=1}^{I} \left( MSP_i - MSI_i \right)}{\sum_{i=1}^{I} \left[ MSP_i + (n_c - 1)\,MSI_i + n_c\,MSG_i \right]}, \tag{2}$$

for nc = (S1 - S2/S1)/(n - 1), where S1 is the total sample size and S2 is the sum of squared sample sizes of populations [21]. In this expression, n denotes the number of sampled populations rather than the per-population sample size, so that nc reduces to the common sample size when samples are balanced. For convenience, we denote the estimator by FST.

Rousset [21] explained that the multilocus estimators of Weir [17] and Weir and Cockerham [18] differ slightly, and that both also differ slightly from the one proposed by Rousset [21], which assigns more weight to larger samples. In this paper, the GENEPOP software (version 3.4) (http://wbiomed.curtin.edu.au/genepop/) of Rousset was used for the calculation of FST. To distinguish them from the method-of-moments estimates of Weir [17] and Weir and Cockerham [18], we will call the estimates obtained from GENEPOP 'ANOVA estimates.'

The estimated values of FST can be negative when levels of differentiation are close to zero and/or sample sizes are small, indicating no population differentiation at these loci [18]. One can assign a value of zero to negative FST estimates.
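The estimation steps described above can be sketched in Python as follows. The sketch assumes that the single-locus estimator has the mean-square form (MSP - MSI)/(MSP + (nc - 1) MSI + nc MSG), that the multilocus estimate is a ratio of summed numerators to summed denominators with a common nc, and that the number of sampled populations appears in the denominator of nc. All function names and mean-square values are hypothetical; this is an illustration, not the weighting used by GENEPOP.

```python
def n_c(sample_sizes):
    """n_c = (S1 - S2/S1) / (number of populations - 1)."""
    s1 = sum(sample_sizes)                   # total sample size
    s2 = sum(n * n for n in sample_sizes)    # sum of squared sample sizes
    return (s1 - s2 / s1) / (len(sample_sizes) - 1)


def fst_terms(msp, msi, msg, nc):
    """Numerator and denominator of the single-locus estimator."""
    return msp - msi, msp + (nc - 1) * msi + nc * msg


def fst_single_locus(msp, msi, msg, nc):
    num, den = fst_terms(msp, msi, msg, nc)
    return num / den


def fst_multilocus(mean_squares, nc):
    """Ratio of summed numerators to summed denominators.

    mean_squares: iterable of (MSP, MSI, MSG) tuples, one per locus.
    """
    terms = [fst_terms(msp, msi, msg, nc) for msp, msi, msg in mean_squares]
    return sum(t[0] for t in terms) / sum(t[1] for t in terms)


def truncate_at_zero(fst):
    """Assign zero to negative estimates."""
    return max(0.0, fst)


if __name__ == "__main__":
    nc = n_c([2, 2])  # balanced toy design: n_c equals the sample size
    # Hypothetical (MSP, MSI, MSG) values for two loci.
    loci = [(1.0, 0.5, 0.5), (0.4, 0.5, 0.5)]
    for ms in loci:
        est = fst_single_locus(*ms, nc)
        print(est, truncate_at_zero(est))
    print(fst_multilocus(loci, nc))
```

Summing numerators and denominators separately, rather than averaging per-locus ratios, keeps a locus with a near-zero denominator from dominating the multilocus value; truncation at zero is applied here only to the reported single-locus estimates.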