Subjects and Methods Human Subjects All participants were drawn from the customer base of 23andMe, Inc., a consumer personal genetics company. This data set has been described in detail previously.34,35 Participants provided informed consent and participated in the research online, under a protocol approved by the external AAHRPP-accredited IRB, Ethical & Independent Review Services (E&I Review). Genotyping Participants were genotyped as described previously.36 In short, DNA extraction and genotyping were performed on saliva samples by National Genetics Institute (NGI), a CLIA-licensed clinical laboratory and a subsidiary of Laboratory Corporation of America. Samples have been genotyped on one of four genotyping platforms. The V1 and V2 platforms were variants of the Illumina HumanHap550+ BeadChip, including about 25,000 custom SNPs selected by 23andMe, with a total of about 560,000 SNPs. The V3 platform was based on the Illumina OmniExpress+ BeadChip, with custom content to improve the overlap with our V2 array, with a total of about 950,000 SNPs. The V4 platform in current use is a fully custom array, including a lower redundancy subset of V2 and V3 SNPs with additional coverage of lower-frequency coding variation and about 570,000 SNPs. Samples that failed to reach 98.5% call rate were reanalyzed. Individuals whose analyses failed repeatedly were recontacted by 23andMe customer service to provide additional samples, as is done for all 23andMe customers. Customer genetic data have been previously utilized in association studies and studies of genetic relationships.34–43 Research Cohorts 23andMe customers were invited to fill out web-based questionnaires, including questions on ancestry and ethnicity, on state of birth, and current zip code of residence. They were also invited to allow their genetic data and survey responses to be used for research. Only data of customers who signed IRB-approved consent documents were included in our study. 
Survey introductions are explicit about their applications in research. For example, the ethnicity survey introduction text states that the survey responses will be used in ancestry-related research (Table S1 available online). Self-Reported Ancestry It is important to note that ancestry, ethnicity, identity, and race are complex labels that result both from visible traits, such as skin color, and from cultural, economic, geographical, and social factors.23,44 As a result, the precise terminology and labels used for describing self-identity can affect survey results, and care in choice of labels should be utilized. However, we chose to maximize our available self-reported ethnicity sample size by combining information from questions asking for customer self-reported ancestry. We used two survey questions, with different nomenclature, to gauge responses about identity, which here we view as “the subjective articulation of group membership and affinity.”45 The first question is modeled after the US census nomenclature and is a multiquestion survey that allows for choice of “Hispanic” or “Not Hispanic,” and participants were asked “Which of these US Census categories describe your racial identity? Please check all that apply” from the following list of ethnicities: “White,” “Black,” “American Indian,” “Asian,” “Native Hawaiian,” “Other,” “Not sure,” and “Other racial identity.” For inclusion into our European American cohort, individuals had to select “Not Hispanic” and “White,” but not any other identity. For inclusion into our Latino cohort, individuals had to select “Hispanic,” with no other restrictions. For inclusion into our African American cohort, individuals had to select “Not Hispanic” and “Black” and no other identity. 
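The inclusion rules for the census-style question above can be written out as a small decision function. This is a minimal sketch only: the function name, the boolean `hispanic` flag, and the representation of the checked categories as a set of strings are assumptions made for illustration, not the survey pipeline itself.

```python
from typing import Optional, Set


def assign_census_cohort(hispanic: bool, races: Set[str]) -> Optional[str]:
    """Map answers to the census-style question to a cohort, or None if excluded.

    `hispanic` is True for "Hispanic" and False for "Not Hispanic";
    `races` holds every racial identity the participant checked.
    (Both argument shapes are hypothetical.)
    """
    if hispanic:
        # "Hispanic" qualifies for the Latino cohort with no other restrictions.
        return "Latino"
    if races == {"White"}:
        # "Not Hispanic" and "White", but not any other identity.
        return "European American"
    if races == {"Black"}:
        # "Not Hispanic" and "Black" and no other identity.
        return "African American"
    # Multiple identities, or any other single identity, are not assigned here.
    return None


# Toy usage
print(assign_census_cohort(False, {"White"}))
print(assign_census_cohort(False, {"White", "Asian"}))
```

Comparing the full set of selections, rather than testing membership, is what enforces the "no other identity" condition.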
The second question on identity is a single-choice question, where respondents were asked to choose “What best describes your ancestry/ethnicity?” from “African,” “African American,” “Central Asian,” “Declined,” “East Asian,” “European,” “Latino,” “Mideast,” “Multiple ancestries,” “Native American,” “Not sure,” “Other,” “Pacific Islander,” “South Asian,” and “Southeast Asian.” Because individuals could select only one response, we included individuals who selected “European” in our European American cohort, those who selected “African American” in our African American cohort, and those who selected “Latino” in our Latino cohort. Some African American participants included in this study were recruited through 23andMe’s Roots into the Future project (accessed October 2013), which aimed to increase understanding of how DNA plays a role in health and wellness, especially for diseases more common in the African American community. Individuals who self-identified as African American, black, or African were recruited through 23andMe’s current membership, at events, and via other recruitment channels. In the present work, we do not include individuals who self-report as having multiple identities, because they represent only a small fraction of individuals in our data set. Low rates of reporting as multiracial or multiethnic are in line with previous studies; an analysis of the 2000 US Census shows that 95 percent of blacks and 97 percent of whites acknowledge only a single identity.45 Future studies including multiracial individuals might further illuminate patterns of genetic ancestry and the complex relationship with self-identity. Differences among states, where different proportions of people self-report as mixed race, might explain some regional differences in genetic ancestry. 
However, we note that, first, proportionally fewer people identify as mixed race than as a single identity, and second, it remains important to establish regional differences in genetic ancestry of self-reported groups even if these differences are driven, to some degree, by regional changes in self-reported identity. More work is needed to determine to what extent regional differences are a result of how people today report their ancestry. Lastly, when available, we excluded individuals who answered “No” to a question whether they are living in the US. In total, our final sets included 5,269 African Americans, 8,663 Latinos, and 148,789 European Americans. Notes on Terminology and Selection of Populations Throughout the manuscript, the term “Native American ancestry” refers to estimates of genetic ancestry from indigenous Americans found across North, Central, and South America, and we distinguish this term from present-day Native Americans living in the US. We use the term “Native American” to refer to indigenous peoples of the Americas, acknowledging that some people may prefer other terms such as “American Indian.” Our estimates of African ancestry specifically aim to infer ancestry of sub-Saharan Africa and do not include ancestry from North Africa. We note that the term “Latino” has many meanings in different contexts, and in our case, we use it to refer to individuals living in the US who self-report as either “Latino” or “Hispanic.” Our work represents a snapshot in time of genetic ancestry and identity, and future work is needed to inform the dynamic changes and forces that shape social interactions. We note that our cohorts are likely to have ancestry from many African populations, but because of current reference sample availability, our resolution of West African ancestries is outside the scope of our study. 
Likewise, our estimates of Native American ancestry arise from a summary over many distinct subpopulations, but we are limited in scope because of insufficient sample sizes from subpopulations, so we currently use individuals from Central and South America together as a reference set (see Durand et al.33 for a list of populations and sample sizes). Validation of Self-Reported Identity Survey Results To verify that our self-reported ethnicities were reliable, we examined the consistency of ethnicity survey responses when individuals completed both ancestry and ethnicity surveys. Because the structure of the two surveys is different and multiple selections were allowed in one survey but not the other, we examined the replication rate of the primary ethnicity from the single-choice ethnicity survey in the multiple-selection survey. In addition to structural differences, the survey content used very different nomenclature, and therefore we believe our estimated error rates to be overestimates of the true error rate, because it is likely that some individuals choose to identify with one label but not the other (e.g., “African American” but not “black”). Discrepancies in the question nomenclatures are likely to increase the error rate. Furthermore, because the two surveys could be completed at different times, either before or after obtaining personal ancestry results, it is possible that viewing genetic ancestry results might have led to a change in self-reported ancestry. Such a change would be tallied as an error in our estimates, but instead reflects a true change in perceived self-identity over time. Overall, we expect that our survey data represent highly reliable ancestry information, with errors affecting fewer than 1% of survey responses. Geographic Location Collection Self-reported state-of-birth survey data were available for 47,473 customers of 23andMe. 
However, because overlap of these customers with our cohorts was poor, we also chose to include data from a question on current zip code of residence. This provided an additional 34,351 zip codes of current residence. In cases where both the zip code of residence and state of birth were available, we used state-of-birth information. To obtain state information from zip codes, we translated zip codes to their state locations via an online zip code database (accessed October 2013). In total, we had 50,697 individuals with available location information. About one third of each of our cohorts had location information: 1,970 African Americans, 2,944 Latinos, and 45,783 European Americans were used in our geographic analyses. Ancestry Analyses Ancestry Composition We apply Ancestry Composition, a three-step pipeline that efficiently and accurately identifies the ancestral origin of chromosomal segments in admixed individuals, which is described in Durand et al.33 We apply the method to genotype data that have been phased via a reimplementation of Beagle.46 Ancestry Composition applies a string kernel support vector machines classifier to assign ancestry labels to short local phased genomic regions, which are processed via an autoregressive pair hidden Markov model to simultaneously correct phasing errors and produce reconciled local ancestry estimates and confidence scores based on the initial assignment. Lastly, these confidence estimates are recalibrated by isotonic regression models. This results in both precision and recall estimates that are greater than 0.90 across many populations, and on a continental level, have rates of 0.982–0.994 for precision and recall rates of 0.935–0.993, depending on populations (see Table 1 from Durand et al.33). 
We note that here, and throughout the manuscript, African ancestry corresponds to sub-Saharan African ancestry (including West African, East African, Central, and South African populations, but excluding North African populations from the reference set). For more details on our ancestry estimation method, see Durand et al.33 Aggregating Local Ancestry Information 23andMe’s Ancestry Composition method provides estimates of ancestry proportions for several worldwide populations at each window of the genome. To estimate genome-wide ancestry proportions of European, African, and Native American ancestry, we aggregate over populations to estimate the total likelihood of each population, and with a majority threshold of 0.51, if any window has a majority of a continental ancestry, we include it in the calculation of genome-wide ancestry, which is estimated as the number of windows passing the threshold for each ancestry over the total number of windows. Some windows might not pass our threshold for any population, so they remain unassigned, making it possible for estimates for all ancestries to not sum to 100%, resulting in population averages that likewise might not sum to 100%. We allow for this unspecified ancestry to reduce the error rates of our assignments, so, in some sense, our estimates might be viewed as lower bounds on ancestry, and it is possible that individuals carry more ancestry than estimated. In practice, we typically assign nearly all windows, with an average of about 1%–2% unassigned ancestry, so we do not expect it to affect our results, with the exception of Native American ancestry, which we discuss below. Generating the Distribution of Ancestry Tracts We generate ancestry segments as defined as continuous blocks of ancestry, estimating the best guess of ancestry at each window to define segments of each ancestry. 
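The two steps described so far, thresholded aggregation into genome-wide proportions and best-guess calling of ancestry segments, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the Ancestry Composition implementation: each window is represented as a hypothetical dictionary of continental-level probabilities (already summed over subpopulations), the windows are assumed to lie in order along a single haplotype of one chromosome, and a window is counted when its probability is at least 0.51.

```python
from typing import Dict, List, Tuple

MAJORITY_THRESHOLD = 0.51  # majority threshold stated in the text

Window = Dict[str, float]  # hypothetical: continental ancestry -> probability


def genome_wide_proportions(windows: List[Window]) -> Dict[str, float]:
    """Windows passing the threshold for each ancestry, over all windows.

    Windows in which no ancestry reaches the threshold stay unassigned,
    so the returned proportions may sum to less than 1.
    """
    counts: Dict[str, int] = {}
    for probs in windows:
        for ancestry, p in probs.items():
            if p >= MAJORITY_THRESHOLD:  # assumption: threshold is inclusive
                counts[ancestry] = counts.get(ancestry, 0) + 1
    return {ancestry: n / len(windows) for ancestry, n in counts.items()}


def best_guess_segments(windows: List[Window]) -> List[Tuple[str, int, int]]:
    """Merge consecutive windows with the same most likely ancestry.

    Returns (ancestry, first_window_index, last_window_index) tuples.
    """
    segments: List[Tuple[str, int, int]] = []
    for i, probs in enumerate(windows):
        label = max(probs, key=probs.get)  # most likely ancestry, no threshold
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], i)  # extend current segment
        else:
            segments.append((label, i, i))  # start a new segment
    return segments


# Toy usage with made-up probabilities
toy = [
    {"European": 0.90, "African": 0.10, "Native American": 0.00},
    {"European": 0.50, "African": 0.45, "Native American": 0.05},  # unassigned
    {"European": 0.20, "African": 0.80, "Native American": 0.00},
    {"European": 0.10, "African": 0.90, "Native American": 0.00},
]
print(genome_wide_proportions(toy))
print(best_guess_segments(toy))
```

In the toy data the second window contributes to no genome-wide proportion, yet it is still absorbed into the neighboring European segment, which illustrates why best-guess calling produces fewer ancestry breaks than thresholded assignment.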
Assigning the most likely ancestry at each window results in fewer spurious ancestry breaks and allows for a smaller upward bias in admixture dates, because breaks in ancestry segments push estimates of dates further back in time. We measure segment lengths by using genetic distances, by mapping segment start and end physical positions to the HapMap genetic map. Admixture Dating To estimate the time frame of admixture events, we test a simple two-event, three-population admixture model via TRACTS.47 We use a grid-search optimization to find four optimal parameters for the times of two admixture events and the proportions of admixture. We are limited to simple admixture models because the grid search is computationally intensive; we resorted to it because we were unable to obtain likelihood convergence with any of the built-in optimizers. The model tested is as follows: two populations admix t1 generations ago, with proportion frac1 and 1 − frac1, respectively. A third population later mixes in t2 generations ago, with proportion frac2. Both our ancestry segments and prior results supported a model with an earlier date of Native American admixture.25,47 We estimated likelihoods over a plausible grid of admixture times and fractions for African Americans, Latinos, and European Americans to estimate dates of initial Native American and European admixture and subsequent African admixture. These dates are estimated as the best fit for a pulse admixture event: because they represent an average over more continuous or multiple migrations, initial admixture is likely to have begun earlier. Lower Estimates of African Ancestry in 23andMe African Americans Unlike previous estimates of the mean proportion of African ancestry, which typically have ranged from 77% to 93% African ancestry,2–4,48–62 our estimates, depending on exclusions, are 73% or 75%. There are several possible explanations for our low mean African ancestry. 
If our Ancestry Composition estimates are downward biased, then the African Americans might have levels of African ancestry consistent with other studies, and our results are simply underestimates. However, our Ancestry Composition estimates are extremely well calibrated for African Americans from the 1000 Genomes Project and their consensus estimates, and we see no evidence of a downward bias (see Figure 5 from Durand et al.33). The mean ancestry proportion of 23andMe self-reported African Americans is about 73%. A small fraction, about 2%, of African Americans carry less than 2% African ancestry, which is far less than typically seen in most African Americans (Figure S18A available online). Further investigation reveals that the majority of these individuals (88%) have predominantly European ancestry, and others carry East Asian, South Asian, and Southeast Asian ancestry, roughly in proportion to the frequencies found in the 23andMe database overall. Given the large number of non-African American individuals in the 23andMe database, even an exceedingly low survey error rate of 0.02% could be sufficient to account for the number of outlier individuals we detect. Hence, we posit that these individuals represent survey errors rather than true self-reported African Americans. Exclusion of these 108 self-reported African Americans with less than 2% African ancestry from mean ancestry calculations results in a moderate rise, to 74.8%, of the mean proportion of African ancestry in African Americans. To quantify differences in African ancestry driving mean state differences, we examined the distributions of estimates of African ancestry in African Americans from the District of Columbia (D.C.) and Georgia, which, among states with at least 50 individuals, had the lowest and highest mean African ancestry proportions, respectively (Figure S1E). We find a qualitative shift in the two distributions of African ancestry, with D.C. 
showing a reduced mode, higher variance, and a heavier lower tail of African ancestry, corresponding to more African Americans with below-average ancestry than Georgia. Qualitative differences in the distributions of African ancestry proportions in African Americans from states with higher and lower mean ancestry appear to be driven by both a shift in the mode of the distribution as well as a heavier left tail reflecting more individuals with a minority of African ancestry (Figure S1). We posit that differences among states could be due to differences in admixture, differences in self-identity, or differences in patterns of assortative mating, whereby individuals with similar ancestry might preferentially mate. For example, greater levels of admixture with Europeans would both shift the mode and result in more African American individuals who have a minority of African ancestry. Alternatively, a shift toward African American self-identity for individuals with a majority of European ancestry (possibly because of changes in cultural or social forces) would likewise result in lower estimates of mean African ancestry. Lastly, assortative mating would work to maintain or increase the variance in ancestry proportions, though assortative mating alone could not shift the mean proportion of African ancestry in a population. Sex Bias in Ancestry Contributions Sex bias in ancestry contributions, often assessed through ancestry of mtDNA and Y chromosome haplogroups, is also manifested in unequal estimates of ancestry proportions on the X chromosome, which has an inheritance pattern that differs between males and females. The X chromosome more closely follows female ancestry contributions because males contribute half as many X chromosomes. Comparing ancestry on the X chromosome to the autosomal ancestry allows us to infer whether that ancestry historically entered via males (lower X ancestry) or by females (higher X ancestry). 
Under equal ancestral contributions from both males and females, the X chromosome should show the same levels of admixture as the genome-wide estimates. To look for evidence of unequal male and female ancestry contributions in our cohorts, we examined ancestry on the X chromosome (NRY region), which follows a different pattern of inheritance from the autosomes. In particular, estimates of ancestry on the X chromosome have been shown to have higher African ancestry in African Americans.9 We calculate ancestry on the X chromosome as the estimate of ancestry on just windows on the X, and we compare to genome-wide estimates (which do themselves include the X chromosome). It should be noted that these calculations differ among males and females, because the X chromosome is diploid in females and thus has twice as many windows in calculation of genome-wide mean proportions. However, our results still allow a peek into sex bias because the overall contribution of the X chromosome to the genome-wide estimates is small. We note that because our ancestry estimation method conservatively assigns Native American ancestry, we expect that much of the remaining unassigned ancestry might be due to Native American ancestry assigned as broadly East Asian/Native American, which is not included in these values (see Figure 5 in Durand et al.33). To infer estimates of male and female contributions from each ancestral population, we estimated the male and female fractions of ancestry that total the genome-wide estimates and minimize the mean square error of the X chromosome ancestry estimates. We assume that overall male and female contributions are each 50% ($\sum_{pop} f_{pop,male} = 0.5$ and $\sum_{pop} f_{pop,female} = 0.5$). We assume that the total contribution from males and females of a population gives rise to the autosomal ancestry fraction ($f_{pop,male} + f_{pop,female} = auto_{pop}$). 
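As a hedged illustration of the constrained fit set up above, the sketch below searches a grid of male fractions and derives the female fractions from the autosomal constraint. It uses the X chromosome weighting of Lind et al. (one X per male, two per female), so the predicted X ancestry is $(f_{pop,male} + 2 \cdot f_{pop,female}) / 1.5$. The function names, the grid resolution, and the renormalization of inputs so that they sum to 1 are assumptions made for this example, and the input values are synthetic.

```python
from itertools import product
from typing import Dict, Tuple

POPS = ("African", "Native American", "European")
Fractions = Dict[str, float]


def predicted_x(f_male: float, f_female: float) -> float:
    """Predicted X ancestry with males carrying one X and females two."""
    return (f_male + 2.0 * f_female) / 1.5


def normalize(values: Fractions) -> Fractions:
    """Rescale so the three ancestries sum to 1 (assumption of this sketch)."""
    total = sum(values[p] for p in POPS)
    return {p: values[p] / total for p in POPS}


def fit_contributions(auto: Fractions, x_obs: Fractions,
                      steps: int = 200) -> Tuple[float, Fractions, Fractions]:
    """Grid search for male/female fractions minimizing the MSE on the X."""
    auto, x_obs = normalize(auto), normalize(x_obs)
    best = None
    for i, j in product(range(steps + 1), repeat=2):
        male = {
            "African": auto["African"] * i / steps,
            "Native American": auto["Native American"] * j / steps,
        }
        # Male fractions must sum to 0.5, which fixes the European term.
        male["European"] = 0.5 - male["African"] - male["Native American"]
        if not 0.0 <= male["European"] <= auto["European"]:
            continue
        # Male plus female contribution equals the autosomal fraction.
        female = {p: auto[p] - male[p] for p in POPS}
        mse = sum((predicted_x(male[p], female[p]) - x_obs[p]) ** 2
                  for p in POPS) / len(POPS)
        if best is None or mse < best[0]:
            best = (mse, male, female)
    if best is None:
        raise ValueError("no feasible grid point")
    return best


# Synthetic check: build X estimates from known fractions, then recover them.
true_male = {"African": 0.30, "Native American": 0.00, "European": 0.20}
true_female = {"African": 0.45, "Native American": 0.01, "European": 0.04}
auto = {p: true_male[p] + true_female[p] for p in POPS}
x_obs = {p: predicted_x(true_male[p], true_female[p]) for p in POPS}
mse, male, female = fit_contributions(auto, x_obs)
print(male, female, mse)
```

Because the female fractions follow from the male fractions, only two parameters are free for three populations, which keeps the grid small.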
We then compute, via a grid search, the predicted X chromosome estimates from $f_{pop,male}$ and $f_{pop,female}$ for each $pop \in \{African, Native American, European\}$, which are calculated, as in Lind et al.,6 as $\hat{X}_{pop} = \frac{f_{pop,male} + 2 \cdot f_{pop,female}}{0.5 \cdot 1 + 0.5 \cdot 2} = \frac{f_{pop,male} + 2 \cdot f_{pop,female}}{1.5}$. We choose the parameters of male and female contributions that minimize the mean squared error between the X ancestry estimates and the predicted $\hat{X}_{pop}$. These are the estimates of male and female ancestry fractions under a single simplistic population mixture event that best fit our observed X chromosome ancestry estimates. Population Size Correlations From the 2010 Census Brief “The Black Population” available online, we calculated the correlation between the number of reported African Americans living in a state and our sample of African Americans from that state. The correlation is strong, with p value of $9.5 \times 10^{-14}$, suggesting that our low sample sizes from states in the US Mountain West are expected from estimates of population sizes. African ancestry in European Americans most frequently occurs in individuals from states with high proportions of African Americans and is rare in states with few African Americans. This observation led us to look at the correlation between population size (as a percent of state population using self-reported ethnicity from the 2010 US Census) and state mean levels of ancestry. To examine the interaction between proportions of minorities and ancestry, we used the 2010 US Census demographic survey by state. We compare the state population proportion to the mean estimated admixture proportion of individuals from that state, fitting linear regressions, and generating figures with geom_smooth(method = "lm", formula = y ~ x) from the ggplot2 package in R. We find that African ancestry in European Americans is strongly correlated with the population proportion of African Americans in each state. 
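The state-level comparison above amounts to an ordinary least-squares line of mean ancestry on state population proportion, which is the fit that `lm` performs. The sketch below is a plain-Python stand-in for that fit, not the R call used for the figures; the function name is hypothetical and the input values are made up purely to show the call.

```python
from math import sqrt
from typing import Sequence, Tuple


def least_squares_line(x: Sequence[float],
                       y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, Pearson r) for the regression of y on x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences with >= 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Made-up values: percent of state population (x) and
# mean estimated ancestry proportion in that state (y).
state_percent = [1.0, 5.0, 12.0, 20.0, 30.0]
mean_ancestry = [0.001, 0.002, 0.004, 0.005, 0.008]
slope, intercept, r = least_squares_line(state_percent, mean_ancestry)
print(slope, intercept, r)
```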
We find that the higher the state proportion of African Americans, the more African ancestry is found in European Americans from that state, reflecting the complex interaction of genetic ancestry, historical admixture, culture, and self-identified ancestry. Logistic Regression Modeling of Self-Identity We examine the probabilistic relationship between self-identity and genetically inferred ancestry. To explore the interaction between genetic ancestry and self-reported identity, we estimated the proportion of individuals that identify as African American and European American, partitioned by levels of African ancestry. Jointly considering the cohorts of European Americans and African Americans, we examined the relationship between an individual’s genome-wide African ancestry proportion and whether they self-report as European American or African American. We note a strong dependence on the amount of African ancestry, with individuals carrying less than 20% African ancestry identifying largely as European American, and those with greater than 50% reporting as African American. To test the significance of this relationship, we fit a logistic regression model, using Python’s statsmodels package, predicting self-reported ancestry by using proportion African ancestry, sex, age, intercept, and interaction variables. Validation of Non-European Ancestry in African Americans and European Americans Although our Ancestry Composition estimates are well calibrated and have been shown to accurately estimate African, European, and Native American ancestry in tests of precision and recall,33 we were concerned that low levels of non-European ancestry in European Americans that we detected might represent an artifact of Ancestry Composition. Hence, we pursued several lines of investigation to provide evidence that estimates of African and Native American ancestry in European Americans are robust and not artifacts. 
Comparison with 1000 Genomes Project Consensus Estimates Comparisons of our estimates with those published by the 1000 Genomes Consortium show the high consistency across populations and individuals. We compare estimates across Americans of African Ancestry in SW USA (ASW), Colombians from Medellin, Colombia (CLM), Mexican Ancestry from Los Angeles USA (MXL), and Puerto Ricans from Puerto Rico (PUR). We note that our estimates of Native American ancestry are conservative. Indeed, when our Ancestry Composition assignment probabilities do not pass over the confidence threshold, including signals of Native American ancestry together with general East Asian/Native American ancestry (but not East Asian) recapitulates estimates from the 1000 Genomes Project consensus estimates. Five individuals from the ASW population from the 1000 Genomes Project have poor consistency in their estimates. These individuals have a large amount of Native American ancestry that was not modeled by the 1000 Genomes Project estimates. That these particular individuals were sampled in Oklahoma, and carry significant Native American ancestry, is supported by our own high estimates of Native American ancestry in 23andMe self-reported African Americans from Oklahoma. Estimates of African and Native American Ancestry in Europeans We looked at whether all individuals who are expected to carry solely European ancestry also have similar rates of detection of non-European ancestry. To this end, we generated a cohort of 15,289 customers of 23andMe who reported that all four of their grandparents were born in the same European country. The use of four-grandparent birth-country has been utilized as a proxy for assessing ancestry.27,63 We then examined Ancestry Composition results for these individuals and calculated at what rate we detected at least 1% African and at least 1% Native American ancestry. 
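The detection rate described above is the share of individuals whose estimated proportion of a given ancestry reaches 1%. A minimal sketch, assuming the per-individual estimates are available as a plain list of proportions (the function name and the toy values are hypothetical):

```python
from typing import Sequence


def detection_rate(proportions: Sequence[float], threshold: float = 0.01) -> float:
    """Fraction of individuals with at least `threshold` of an ancestry."""
    if not proportions:
        raise ValueError("no individuals supplied")
    detected = sum(1 for p in proportions if p >= threshold)
    return detected / len(proportions)


# Toy usage: made-up African ancestry proportions for four individuals
toy_african = [0.0, 0.005, 0.01, 0.02]
print(detection_rate(toy_african))
```

Applying the same function to the four-grandparent European cohort and to European Americans would give directly comparable rates, since both use the same 1% cutoff.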
Independent Validation of African Ancestry in European Americans via f4 Statistics We used f4 statistics from the ADMIXTOOLS software package to confirm the presence of African ancestry.64 We used the f4 ratio test, designed to estimate the proportion of admixture from a related ancestral population, to compare admixture in European Americans versus reference European individuals. We tested whether European Americans with estimated African ancestry showed any admixture from Africans by using our cohorts of individuals with estimated African ancestry and reference populations from the 1000 Genomes Project data set. Admixture would be expected to result in estimates of α significantly different from 1. Detection of Native American mtDNA in European Americans and African Americans The mitochondrial DNA (mtDNA) haplogroups A2, B2, B4b, C1b, C1c, C1d, and D1 are most prevalently found in the Americas and are likely to be Native-American-specific haplogroups because they are rarely found outside of the Americas. We assessed the fraction of individuals that carry these haplogroups to validate the likelihood of Native American ancestry in European Americans and African Americans and show that these haplogroups are virtually absent in European controls. Because mtDNA haplogroups are assigned by classification with SNPs that segregate on these lineages, these orthogonal results provide an independent line of support for our estimated Native American ancestry in European Americans and African Americans. 
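The carrier fraction described above can be sketched as a count over per-individual haplogroup labels. The clade list is taken from the text; everything else is an assumption for illustration, in particular the rule that a label belongs to a clade when it equals the clade name or extends it with the alternating letter/digit naming pattern (so "A2a1" falls under A2, while a hypothetical "D10" does not fall under D1).

```python
from typing import Iterable

# Haplogroups named in the text as likely Native-American-specific.
NATIVE_AMERICAN_CLADES = ("A2", "B2", "B4b", "C1b", "C1c", "C1d", "D1")


def in_clade(haplogroup: str, clade: str) -> bool:
    """True if `haplogroup` is `clade` or a sub-lineage name built on it."""
    if not haplogroup.startswith(clade):
        return False
    rest = haplogroup[len(clade):]
    if rest == "":
        return True
    # Assumed naming rule: sublineage names alternate letters and digits,
    # so the next character must switch class relative to the clade's last one.
    if clade[-1].isdigit():
        return rest[0].isalpha()
    return rest[0].isdigit()


def carrier_fraction(haplogroups: Iterable[str]) -> float:
    """Share of individuals assigned to any of the listed clades."""
    labels = list(haplogroups)
    if not labels:
        raise ValueError("no individuals supplied")
    carriers = sum(
        1 for h in labels
        if any(in_clade(h, c) for c in NATIVE_AMERICAN_CLADES)
    )
    return carriers / len(labels)


# Toy usage with made-up assignments
print(carrier_fraction(["H1", "A2a1", "U5b", "B4b1"]))
```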
Distribution of Ancestry Segment Start Positions Regions of the genome that have structural variation or show strong linkage disequilibrium (LD) have been shown both to confound admixture mapping and to influence the detection of population substructure in studies using Principal Components Analysis (PCA).27,63,65 If such regions were to drive artifacts of spurious ancestry, we would expect that segments of local ancestry would probably occur around these regions, rather than in a uniform distribution across the genome. To this end, we examined the starting positions of all African and Native American ancestry segments in European Americans and Native American ancestry in African Americans. Comparison with ADMIXTURE Genome-wide Estimates We applied ADMIXTURE,66 a model-based estimation of ancestry proportions, to estimate proportions of European, Native American, East Asian, sub-Saharan African, Middle Eastern, and Oceanian ancestry. We use the supervised algorithm for K = 6, with 9,694 reference individuals representing the six aforementioned populations. We ran ADMIXTURE on 269,229 autosomal markers after pruning SNPs to have $r^2 < 0.5$, via PLINK.67 To reduce computation time, we examined consistency of methods on the African Americans whom we estimated to have at least 1% Native American ancestry, European Americans estimated to have at least 1% Native American ancestry, and European Americans estimated to have at least 1% African ancestry.
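To make the pruning criterion concrete, the sketch below computes $r^2$ between two SNPs as the squared Pearson correlation of their genotype counts (0, 1, or 2 copies of an allele) and keeps a SNP only if its $r^2$ with every recently kept SNP is below 0.5. This is a simplified greedy stand-in written for illustration; it is not PLINK's windowed procedure, and the function names, the window size, and the toy genotypes are assumptions.

```python
from typing import List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """Squared Pearson correlation between two genotype-count vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    if saa == 0 or sbb == 0:
        return 0.0  # a monomorphic SNP carries no correlation signal
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return (sab * sab) / (saa * sbb)


def greedy_ld_prune(genotypes: List[Sequence[int]],
                    max_r2: float = 0.5, window: int = 50) -> List[int]:
    """Return indices of SNPs kept after greedy pairwise pruning.

    `genotypes[i]` holds the genotype counts of SNP i across individuals,
    with SNPs ordered along the chromosome. Each SNP is compared only with
    the last `window` kept SNPs (window size is an arbitrary choice here).
    """
    kept: List[int] = []
    for idx, snp in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(snp, genotypes[k]) < max_r2 for k in recent):
            kept.append(idx)
    return kept


# Toy usage: SNP 1 duplicates SNP 0 and is pruned; SNP 2 is uncorrelated.
toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [0, 0, 0, 2, 2, 2],
]
print(greedy_ld_prune(toy))
```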