Material and Methods DNA Samples and Genotyping The Kalash samples were collected from three valleys in the Hindu Kush mountain ranges in northwest Pakistan (Figure 1A). In accordance with the Declaration of Helsinki, the samples were collected after informed consent was obtained, and the study was approved by all relevant institutional ethics committees. Lymphoblastoid cell lines were established from all collected blood samples, and some (n = 25) were deposited with CEPH; these latter samples form part of the South and Central Asian collection of the HGDP-CEPH cell-line panel. We used 10 of these and an additional 13 samples that are not in the collection for our analysis. All of these unrelated (n = 23) Kalash males were genotyped on the Illumina HumanOmni2.5M-8 BeadChip with 200 ng of DNA (26 ng/μl) prepared from these lymphoblastoid cell lines.2 Genotyping calls and quality control (QC) were performed with GenoSNP9 by the Sanger Institute’s core genotyping facility. Genotypes were called only for samples passing Sequenom genetic fingerprinting and gender concordance. These were run through the standard QC pipeline. All 23 samples passed a call-rate threshold of 95% and were used in the downstream analysis. Genotyping quality was assessed by comparison of 178,072 SNPs that overlapped the Illumina 650,000 K SNP chip.5 Ten of the Kalash samples analyzed in this study had also been genotyped on this platform, and the sample genotype concordance was 99.999%. Comparative data were obtained from 35 populations representing Africa, Europe, Caucasus, and West, Central, East, and South Asia (Table S1).5,10–12 DNA Sequencing High-coverage (30×) 100-bp paired-end sequencing of one of the genotyped male samples was carried out on an Illumina HiSeq 2000 with 5 μg of lymphoblastoid cell line DNA, standard library preparation, and analysis pipelines developed for the 1000 Genomes Project.13 The sequenced reads were mapped to the human GRCh37 reference sequence (UCSC Human Genome Browser hg19) used by the project (human_g1k_v37.fasta.gz). Variant annotations were performed with the R package NCBI2R and Ensembl’s Variant Effect Predictor.14 There was high concordance (99.9%) between the variant calls from the high-coverage data and the same sample’s SNP-chip genotypes. Data Analysis The data were merged with reference-population data from African and non-African sources covering Eurasia (Table S1).5,10–12,15,16 The merged dataset was pruned for the removal of variants in high (r2 > 0.4) linkage disequilibrium (LD) and individuals with high identity by descent (IBD) (PLINK IBD score > 0.6) from the analysis. The high IBD threshold was chosen to account for the increased inbreeding levels introduced by the strong genetic drift experienced by the Kalash population. Principal-component analysis (PCA) was performed with EIGENSOFT v.5.01, and ancestry analysis was performed with ADMIXTURE v.1.22.17 Spatial kriging was used to quantify the spatial genetic heterogeneity by interpolating ancestry values (obtained from ADMIXTURE analysis) for each cluster; the principal-component eigenvectors were used as coordinates for each individual.18 Genetic structure and gene flow between populations was investigated via three different approaches: ALDER,19 three-population (f3) statistics,20 and TreeMix.21 We applied pairwise sequentially Markovian coalescent (PSMC) analysis to draw inferences about the long-term effective population sizes and times of divergence from ten high-coverage genomes, including the Kalash.22 The nine other genomes were sequenced by Complete Genomics and included three unrelated African populations (Yoruba in Ibadan, Nigeria [YRI]; Luhya in Webuye, Kenya [LWK]; Maasi in Kinyawa, Kenya [MKK]) and six non-African genomes from East Asia (Han Chinese in Beijing, China [CHB]; Japanese in Tokyo, Japan [JPT]), Europe (Utah residents with Northern and Western European ancestry from the CEPH collection [CEU]; Toscani in Italy [TSI]), South Asia (Gujarati Indian in Houston, Texas [GIH]) and America (Mexican Ancestry in Los Angeles, California [MXL]). Phasing was carried out with SHAPEIT223 and the 1000 Genomes Project reference panel.22 We also estimated effective population size and time of divergence between the Kalash and other populations by analyzing LD patterns in SNP-chip data with the NeON R package.24 We assessed the genetic relatedness of ancient genomes to modern populations by computing outgroup f3 statistics.20 In the absence of admixture with the outgroup, the expected value of f3 (outgroup; A, B) is a function of the shared genetic history of A and B. We used the YRI as an outgroup to non-African populations and computed f3 statistic (YRI; ancient, X) to investigate the shared history of an ancient genome and a set of 32 worldwide populations, including the Kalash (X), and f3 statistic (YRI; Kalash, Y) to investigate the Kalash and a set of 32 worldwide populations and ancient genomes (Y). Ancient genomes used in the analysis included the Mal’ta boy (MA-1), a Paleolithic Siberian hunter-gatherer;25 La Braña 1, a Mesolithic European hunter-gatherer from Iberia;26 and Ötzi, the Tyrolean Iceman and a representative European Neolithic farmer.27 BAM files of the ancient genomes were downloaded from the respective references and managed with Picard v.1.112, and the genotypes were called with the Genome Analysis Toolkit v.3.3.28 Data on DNA polymorphisms were stored in VCF files and managed with VCFtools.29 Selection Selection in the Kalash populations was estimated as described by Yi et al.30 In brief, we obtained estimates of the population time of divergence (T) from FST values and corrected for the effective population size of the population considered (Nep), as shown inT=log(1−FST)log(1−12×Nep). For each marker, we calculated T between the Kalash and CHB (TKC), between the Kalash and Balochi (TKB)—who were considered representative of the other Pakistani populations—and between the CHB and Balochi (TCB). The population branch statistic (PBS), which defines the length of the branch leading to the Kalash since the split from East Asians, is equal toPBS=TKC+TKB−TCB2. Nep estimates were obtained from the linkage data (2,471 [95% confidence interval (CI) = 2,319–2,603]) with the NeON R package. The variants within the 99th percentile of the distribution of our PBS values were annotated, and a list of genes associated with these variants was used for Ingenuity Pathway Analysis (IPA). Simulation We modeled the effect of drift and selection on specific variants by using the simuPOP library.31 We used the effective population size estimated from this study in the simulations and obtained the initial allele frequency for each marker from the observed data. We recorded the allele frequency every 50 generations for 500 generations (which roughly corresponds to 12,500 or 14,000 years ago if we assume a generation time of 25 or 28 years, respectively). Each scenario was replicated 1,000 times.