Subjects and Methods

Subjects

We recruited participants through the Childhood Overgrowth (COG) Study, which began recruitment in 2005 and was approved by the London Multicenter Ethics Committee (05/MRE02/17). Informed consent was obtained from all participants and/or parents, as appropriate. Individuals were eligible for this study if they had height and/or head circumference at least two standard deviations above the mean (≥+2 SD, UK90 growth data)13 at some point in childhood, together with intellectual disability. We have termed this condition OGID (overgrowth + intellectual disability). Overgrowth phenotypes that are not associated with intellectual disability, such as Beckwith-Wiedemann syndrome (MIM: 130650) or Marfan syndrome (MIM: 154700), were not included. Regional or asymmetric overgrowth phenotypes (e.g., hemihypertrophy) in the absence of increased height or head circumference were also not included.

In total, 710 individuals with OGID were included. Of these, 97% (693) were recruited to the study from clinical genetics departments. For 323 individuals, samples from both parents were also available and included. Of the probands, 205 had both height and head circumference (occipitofrontal circumference, OFC) ≥+2 SD, termed “head+height” in Table S1; 138 had height ≥+2 SD with OFC <+2 SD, termed “height only”; and 109 had OFC ≥+2 SD with height <+2 SD, termed “head only.” The remaining 258 individuals were recruited to the study because they had overgrowth, but measurements for both height and head circumference were not provided. The overgrowth category is termed “unspecified” for these case subjects in Table S1. Intellectual disability was classified by the referring clinician as severe (77 case subjects), moderate (228 case subjects), or mild (229 case subjects). The referrer did not state the severity of the intellectual disability for 176 individuals (termed “unspecified” in Table S1).
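The four overgrowth categories described above follow a simple decision rule, which can be sketched in Python as below. This is an illustrative sketch only, not the study's own code: the function name `overgrowth_category` and the label "not eligible" are hypothetical, and treating any missing measurement as "unspecified" is an assumption about how incomplete records were handled.

```python
from typing import Optional

# Eligibility cut-off stated in the text: at least +2 SD (UK90 growth data).
THRESHOLD_SD = 2.0


def overgrowth_category(height_sd: Optional[float], ofc_sd: Optional[float]) -> str:
    """Return the overgrowth category label for one proband.

    Inputs are SD scores for height and head circumference (OFC).
    None means the measurement was not provided (assumption: any
    missing value places the proband in the "unspecified" group).
    """
    if height_sd is None or ofc_sd is None:
        return "unspecified"
    tall = height_sd >= THRESHOLD_SD
    large_head = ofc_sd >= THRESHOLD_SD
    if tall and large_head:
        return "head+height"
    if tall:
        return "height only"
    if large_head:
        return "head only"
    # Both measurements present and below +2 SD: outside the inclusion criteria.
    return "not eligible"


# The four reported group sizes should account for the whole cohort.
group_sizes = {"head+height": 205, "height only": 138, "head only": 109, "unspecified": 258}
cohort_total = sum(group_sizes.values())
```

The group sizes reported in the text sum to the 710 probands in the cohort, which `cohort_total` checks.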
Control Data

We used the Exome Aggregation Consortium (ExAC) data v.3, accessed on 13 November 2015 (excluding the TCGA samples),14 and the ICR1000 UK exome series15 as reference data. We generated and analyzed the ICR1000 UK exome series data using the same sequencing and analysis pipeline described for the OGID samples.

Targeted Gene Analyses

We previously reported mutations in NSD1, EZH2, DNMT3A (MIM: 602769), and PPP2R5D (MIM: 601646) in 198 case subjects. The relevant references are in Table S1. Intragenic mutations in these genes were detected with Sanger sequencing. NSD1 is unusual among the 14 OGID genes included in this study in being prone to deletion by a 2 Mb 5q35 microdeletion, mediated by flanking low-copy repeats.16 We used multiplex ligation-dependent probe amplification (MLPA) to identify 5q35 microdeletions encompassing NSD1.17 NSD1 MLPA can also detect exon CNVs, which account for ∼5% of NSD1 mutations.17 Microdeletions and exon CNVs in the other genes were not sought, but are unlikely to be a major contributor because the surrounding sequence architecture and/or mechanism of pathogenicity make it much less likely that such events will cause OGID.

Exome Sequencing

We performed exome sequencing in all probands in whom no mutation had been identified by targeted gene analyses, and in parental samples where available. We used the Nextera Rapid Capture Exome Kit (Illumina) and prepared libraries from 50 ng genomic DNA using the Nextera DNA Sample Preparation Kit (Illumina). The captured libraries were PCR amplified using the supplied paired-end PCR primers. On average, 33 million reads mapped to the pulldown and 86% of targeted bases had ≥15× coverage. For 57 samples, exome sequencing was performed before the Nextera Exome Kit was available, using the TruSeq Exome Enrichment Kit, which also includes the 14 genes involved in OGID. When converting our exome pipeline from TruSeq to Nextera, we undertook in-house evaluation and validation to ensure that the performance was equivalent.
Sequencing was performed on an Illumina HiSeq 2000 or HiSeq 2500 (high output mode) using v3 chemistry and generating 2 × 101 bp reads.

Variant Calling

We used the OpEx v1.0 pipeline to perform variant calling.18 We converted raw data to FASTQs using CASAVA v.1.8.2 with default settings. The OpEx v1.0 pipeline uses Stampy19 to map to the human reference genome, Picard to flag duplicates, Platypus20 to call variants, and CAVA21 to provide consistent annotation of variants with the HGVS-compliant CSN (Clinical Sequencing Notation) standard v1.0.21 The transcript information used for variant annotation of the 14 relevant genes is given in Table 1.

Variant Prioritization and Validation

We excluded variants with MAF > 0.5% in either the Exome Aggregation Consortium (ExAC) data or the ICR1000 UK exome series. For the de novo analyses, we identified and validated any high-quality (as defined by OpEx18) variant in the child that was not present in either parent. We evaluated and validated all rare variants identified in the 14 genes. We confirmed all small variants in Table S1 that were called in exomes via Sanger sequencing of M13-tagged PCR products generated from genomic DNA. We performed PCR using the QIAGEN Multiplex PCR Kit according to the manufacturer’s instructions. We sequenced PCR products using M13 sequencing primers, the BigDye Terminator Cycle Sequencing Kit, and an ABI 3730 Genetic Analyzer (Applied Biosystems). We analyzed sequences using Mutation Surveyor software v.3.20 (SoftGenetics), and two individuals independently verified the outputs by manual inspection.

Pathogenic Mutation Determination

For the 13 genes other than HIST1H1E (MIM: 142220), we considered a variant to be pathogenic if it fulfilled one or more of the following criteria.
(1) It was a de novo mutation in a gene for which such de novo mutations were already proven to cause OGID.
(2) The inheritance was unknown, because parental samples were unavailable, but it had been previously identified as a pathogenic de novo mutation in OGID.
(3) It was a protein-truncating variant ([PTV] frameshifting indels, stop-gain, or essential splice-site variants) in a gene in which truncating mutations have been proven to be pathogenic.
(4) There was clear evidence from the literature that it was pathogenic.

The evidence for HIST1H1E mutations being pathogenic is provided in the Results.

HIST1H1E Statistical Analyses

We used the methods described in the DDD study22 to calculate the probability of identifying four de novo frameshift mutations in HIST1H1E using the gene-specific mutation rates from Samocha et al.23 To obtain the expected number of frameshift mutations, we multiplied the frameshift mutation rate in HIST1H1E (4.18 × 10⁻⁷) by twice the number of case subjects in this study (710), accounting for the two copies of the gene in each individual. We then calculated the probability of observing four or more de novo frameshift mutations in HIST1H1E, given this expected number, via the ppois function in R. We modeled the significance of mutation clustering in HIST1H1E under a binomial distribution in which the probability of observing a mutation in a 12 bp region, which comprises 1.8% of the coding sequence, was 0.018.

Protein Net Charge Calculation

We obtained the wild-type HIST1H1E cDNA sequence (frame 1) from Ensembl (ENST00000304218.5). We generated the HIST1H1E cDNA sequences edited with OGID mutations (frame 2). We used the variant c.430delG to generate the other possible alternative reading frame in HIST1H1E (frame 3). We translated the cDNA sequences using the Translate Tool at ExPASy. We calculated the net charge of the carboxy-terminal domain, from p.Lys110 onward, at neutral pH using the Peptide Property Calculator at the Innovagen website.
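The two HIST1H1E probability calculations described above can be illustrated with a short Python sketch using only the standard library. It is not the authors' code, which used R: the function names are hypothetical, and the Poisson tail is summed directly rather than obtained from `ppois`. In R, the corresponding upper tail would be `ppois(3, lambda, lower.tail = FALSE)`. The number of mutations entering the binomial clustering test is not stated in this section, so it is left as function arguments.

```python
import math


def poisson_upper_tail(k: int, lam: float, terms: int = 60) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly over the tail.

    Direct summation avoids the precision loss of computing 1 - CDF
    when the tail is tiny. A fixed number of terms is adequate only
    when lam is small, as it is here.
    """
    total = 0.0
    for i in range(k, k + terms):
        total += math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
    return total


def binomial_upper_tail(m: int, n: int, p: float) -> float:
    """P(X >= m) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m, n + 1))


# Values taken from the text.
FRAMESHIFT_RATE = 4.18e-7   # HIST1H1E frameshift mutation rate
N_CASES = 710               # case subjects in the study
REGION_FRACTION = 0.018     # 12 bp region as a fraction of the coding sequence

# Expected number of de novo frameshift mutations: rate x 2 x cases.
expected = FRAMESHIFT_RATE * 2 * N_CASES

# Probability of observing four or more de novo frameshift mutations.
p_four_or_more = poisson_upper_tail(4, expected)
```

For the clustering test, `binomial_upper_tail(m, n, REGION_FRACTION)` gives the probability that at least `m` of `n` mutations fall in the 12 bp region by chance.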
Functional Network Analyses

We performed functional enrichment analysis using g:Profiler (v.r1665_e85_eg32).24 We used the 14 genes in Table 1 as our query set. We looked for enrichment among Gene Ontology molecular function terms and KEGG pathway gene sets, requiring the size of the functional category to be between 5 and 500 genes and using the Benjamini-Hochberg false discovery rate as the significance threshold. The FDR q values presented are the Benjamini-Hochberg critical values.

Phenotypic Analyses

We tested for a significant difference in the diagnostic yields between different phenotypic groups using the prop.test function in R. We calculated the significance of association between an individual having macrocephaly and their mutation status (either a mutation in a PI3K/AKT pathway gene or a mutation in an epigenetic regulation gene) using Fisher’s exact test, which we implemented with the fisher.test function in R. In the same way, we calculated the significance of association between mutation status and an individual having macrocephaly in the absence of increased height, and between mutation status and an individual having increased height in the absence of macrocephaly. We tested for a significant difference in the proportion of individuals with mild intellectual disability between those with a mutation in a PI3K/AKT pathway OGID gene and those with a mutation in an epigenetic regulation OGID gene using the prop.test function in R.

Height GWAS Gene and Cancer Driver Gene Comparisons

We obtained the list of 611 genes located in regions associated with human height through GWASs from Table S1 of Wood et al.25 We obtained a list of 260 somatically mutated cancer genes from Table S2 of Lawrence et al.26 and the somatic mutations from the tumor portal website.
We calculated the probability of seeing the observed overlap of the OGID gene set with the GWAS gene set under a hypergeometric probability distribution, assuming a total hypothetical size of 20,000 protein-coding genes in the exome, using the phyper function in R. We calculated the probability of seeing the observed overlap of the OGID gene set with the cancer driver gene set in the same way.
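The hypergeometric overlap test described above can be sketched in Python as follows, again as an illustration rather than the authors' R code. The function name `overlap_upper_tail` is hypothetical. The gene-set sizes and the 20,000-gene universe come from the text, but the observed overlaps are not stated in this section, so the overlap `x` is left as an argument. In R, the same upper tail would be `phyper(x - 1, m, n, k, lower.tail = FALSE)`, with `m` the reference set size, `n` the number of remaining genes, and `k` the query set size.

```python
import math

# Sizes taken from the text.
UNIVERSE = 20_000      # assumed total number of protein-coding genes
N_OGID = 14            # OGID gene set (query)
N_HEIGHT_GWAS = 611    # height GWAS gene set
N_CANCER = 260         # cancer driver gene set


def overlap_upper_tail(x: int, query: int, reference: int, universe: int) -> float:
    """P(overlap >= x) under the hypergeometric distribution.

    Models drawing `query` genes at random, without replacement, from
    `universe` genes of which `reference` belong to the comparison set.
    Exact integer arithmetic is used until the final division.
    """
    max_overlap = min(query, reference)
    favourable = sum(
        math.comb(reference, i) * math.comb(universe - reference, query - i)
        for i in range(x, max_overlap + 1)
    )
    return favourable / math.comb(universe, query)


def height_gwas_p(x: int) -> float:
    """Chance probability of at least x OGID genes in the height GWAS set."""
    return overlap_upper_tail(x, N_OGID, N_HEIGHT_GWAS, UNIVERSE)


def cancer_driver_p(x: int) -> float:
    """Chance probability of at least x OGID genes in the cancer driver set."""
    return overlap_upper_tail(x, N_OGID, N_CANCER, UNIVERSE)
```

Passing the observed overlap count to `height_gwas_p` or `cancer_driver_p` returns the corresponding one-sided probability.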