Results

The Deafness Variation Database

The DVD classifies and interprets variants in 152 genes and microRNAs implicated in genetic hearing loss. The included genes are associated with a variety of hearing loss-related phenotypes including NSHL, NSHL mimics, and common forms of syndromic hearing loss (Table S1). A total of 876,139 genetic variants in these genes were extracted from dbSNP, ExAC, 1000 Genomes, ESP, ClinVar, HGMD, and our internal manual curation database.8, 23, 24 All variants were annotated for MAF (from large-scale population databases), variant effect (intronic, UTR, splice-site, missense, nonsense, synonymous, in-frame indels, frameshift indels, start loss, stop loss), deleteriousness predictions (dbNSFP), and classification (ClinVar, HGMD, our internal manual curation) (Figure 1). All available data were used to classify variants computationally, with supplemental expert manual curation as detailed in the Material and Methods (Figure 1).

We integrated predictions from six algorithms, two assessing conservation (PhyloP and GERP++) and four evaluating deleteriousness (SIFT, PolyPhen-2, MutationTaster, and LRT), to calculate a composite pathogenicity score (PS) and annotate variants with MAF < 0.5%. Variants with MAFs above this threshold were automatically classified as benign, with the exception of known common founder mutations (Figure 1C, Table S2).16 To validate the PS, we plotted all variants classified as pathogenic by MAF and PS (Figure S1) and found that of 3,591 pathogenic variants with predictions from at least five pathogenicity prediction tools, 95.4% have a composite PS ≥ 60%. The calculated sensitivity, specificity, PPV, and NPV were 0.95, 0.51, 0.74, and 0.88, respectively. We used this threshold for variant classification, labeling variants with a MAF < 0.5% and a PS ≤ 40%, based on at least five pathogenicity predictions, as LB.

In aggregate, DVD v.8.1 reports 876,139 variants from 152 genes and microRNAs.
Of these variants, 7,502 (0.85%) are classified as P, 671 (0.077%) are LP, 15,287 (1.74%) are LB, 156,970 (17.9%) are B, and 695,709 (79.4%) are VUSs (Figure 2A). To restrict the assessment to variants that are medically relevant for deafness, we considered only the 97,007 variants within coding and splice-site regions (exons as defined by RefSeq and Ensembl coding transcripts, ±20 bp from exon boundaries, 3′ and 5′ UTRs, and any deep intronic variant classified as P or LP), as these regions are routinely screened in clinical diagnostic settings. For the purpose of this analysis, we also treated any variant that is P or LP for a phenotype other than deafness as a VUS. For example, 20 P/LP variants have been reported in MET (MIM: 164860), but only one has been linked to hearing loss. Of the 97,007 variants we considered, 6.2% were P (6,045), 0.5% were LP (445), 14.2% were LB (13,823), 4.8% were B (4,628), and 74.3% were VUSs (72,066) (Figure 2B).

Figure 2. Variant Classification by the DVD
(A) Fractions of different classification categories for variants in the whole DVD.
(B) A slightly different picture emerges when only clinically relevant regions and deafness-associated variants are considered (variants associated only with phenotypes unrelated to deafness are excluded).
(C) Comparative overview of DVD versus ClinVar. 7,056 classifications from ClinVar were identified within our specified gene regions (each variant in ClinVar with multiple submissions for pathogenicity is represented by its most pathogenic submission). Of this number, 6,039 ClinVar classifications agreed with the corresponding DVD classification, whereas there was disagreement for 1,017 variants.
(D) Comparative overview of DVD versus HGMD. 7,845 classifications from HGMD were identified within our specified gene regions. Of this number, 7,458 classifications agreed with the corresponding DVD classification and discrepancies were found for 387 variants.
(E) There were 72 major categorical changes between ClinVar and DVD that resulted in medically significant differences (53 up-classifications and 19 down-classifications).
(F) 244 medically significant reclassifications were found when DVD was compared to HGMD (2 up-classifications and 242 down-classifications).
(G) Of the 20% of genes carrying the greatest numbers of medically significant changes, 6 are implicated in Usher syndrome.
For (C) through (F), the horizontal arrows show discordant calls, with the number of discordant classifications shown within each arrow; totals are listed to the right of the colored columns.

Computational and Expert Manual Curation Led to Medically Significant Changes in Pathogenicity

To assess differences in variant interpretation between the DVD and both ClinVar and HGMD, we compared the number of downgraded (from more severe to more benign) and upgraded (from more benign to more severe) classifications (Figures 2C–2F). Of the variants listed in the DVD, 7,056 are found in ClinVar (filtered to represent each variant by only its most pathogenic classification). Of these variants, 175 are unique to ClinVar (Figure 1B). There was classification agreement for 6,039 (85.6%) variants. Of the 1,017 (14.4%) discordant calls, classification discrepancies of one degree were most common (715 of 1,017 changes), with the DVD being more likely to downgrade a ClinVar classification (772 downgrades versus 245 upgrades) (Figure 2C). Major classification changes for deafness-related variants that resulted in medically significant differences (variants that were upgraded to or downgraded from P/LP) were identified for 72 variants. Of these, 53 were up-classifications by the DVD to P/LP and 19 were down-classifications from P/LP (Figure 2E).

A total of 7,845 DVD variants are found in HGMD. DVD and HGMD classifications were concordant in 7,458 (95%) cases.
Of the 387 (5%) discordant calls, classification discrepancies of three degrees were most common (132 of the 387 changes), with DVD downgrades of HGMD calls more common than upgrades (312 downgrades versus 75 upgrades) (Figure 2D). There were 244 major classification changes that resulted in medically significant differences, with all except two representing downgrades by the DVD from an HGMD call of P/LP (Figure 2F).

Following computational and manual curation, variants in 101 genes were reclassified in the DVD. These reclassifications included major categorical changes representing medically significant changes (P/LP versus VUS/LB/B) for 300 variants in 52 genes (Table S3). Of the 20% of genes carrying the greatest number of medically significant differences, six are associated with the diagnosis of Usher syndrome (Figure 2G, Table S4). For both ClinVar and HGMD, the same five genes carry the greatest number of major categorical changes (USH2A [MIM: 608400], SLC26A4 [MIM: 605646], GJB2, MYO7A [MIM: 276903], CDH23 [MIM: 605516]) (Figure 2G, Table S4). The remaining frequently impacted genes are WFS1 (MIM: 606201) (DFNA6/14/38 [MIM: 600965] and Wolfram syndrome), USH1C (MIM: 605242) (DFNB18A [MIM: 602092] and USH1C [MIM: 276904]), ADGRV1 (MIM: 602851) (USH2C [MIM: 605472]), and MYO15A (MIM: 602666) (DFNB3 [MIM: 600316]).

Most Genetic Variants in Deafness-Associated Genes Are Missense and Rare

Having built a comprehensive resource that collates and annotates all variants in hearing loss genes and provides a clinical interpretation, we sought to explore the genomic and mutational landscape of deafness-associated genes. Our first objective was to evaluate the distribution of variants with respect to their MAF and type. Of all variants in the DVD, novel, ultra-rare (0% < MAF ≤ 0.05%), and rare (0.05% < MAF < 0.5%) variants represented 36%, 11%, and 35%, respectively (Figure 3A).
When only clinically relevant variants within coding and splice-site regions were considered, the general tendency did not change. Variants with MAF < 0.5% remained the most prevalent (96%), although the distribution within this set changed, with ultra-rare variants (0% < MAF ≤ 0.05%) now representing the major category (59%) (Figure 3B). The finding that variants with a MAF < 0.5% (the threshold above which a variant is too common to be deafness causing16) account for 96% of all the variants falling within coding and splice-site regions implies that only 4% of variants can be excluded as deafness causing on the basis of MAF filtering.

Figure 3. Distribution of Variants by Location, MAF, and Type
(A and B) Distribution by MAF for all variants in the DVD, including intronic variants (A), and for variants in gene coding regions only (B). Most coding variants (96%) in deafness-associated genes are novel or rare (MAF < 0.5%).
(C) Distribution of variants by gene location.
(D) Coding variant breakdown by type, showing that missense variants constitute the major set of all coding variants. Abbreviations: FS, frameshift; SS, splice-site; inF, in-frame.

Of all variants within deafness-associated genes, ∼12% were located in the coding regions and canonical splice sites (Figure 3C). Missense variants represent the major set of all coding variants at 62%. The second most common type is synonymous variants (28%), followed by indels (4% frameshift and 2% in-frame), nonsense (2%), canonical splice-site (2%), and start/stop loss (<1%) (Figure 3D).

Disparity in Gene Variation Rates

As expected, the number of variants per gene correlated with gene size, with larger genes carrying higher numbers of variants (Figures 4A and S2A, Table S5). The greatest variant load was found in PCDH15 (MIM: 605514), USH2A, ADGRV1, and CDH23, but when the number of variants was normalized for gene size, different trends emerged (Figures S2B and S2C, Table S5).
ACTG1 (MIM: 102560) had the highest variant rate at 41% (4 of 10 bases carry reported variants), with most genes (85%) having a variation rate below 10%. When we restricted the analysis to coding and splice-site regions, there was again a correlation between the number of variants and the size of the coding regions, with USH2A, ADGRV1, and CDH23 carrying the highest number of variants (Table S5). Normalizing to the size of the coding region, however, gave strikingly different results: GJB2 carried the greatest variation at ∼69% (nearly 7 of 10 bases carry reported variants) and five other genes had variation rates higher than 30%: WFS1 (53%), KCNQ1 (MIM: 607542) (44%), ACTG1 (39%), SLC26A4 (37%), and KCNE1 (MIM: 176261) (36%). The average variation rate was ∼22% (Figures 4B and S3A).

Figure 4. Variation Rate for Deafness-Associated Genes
(A) Total number of variants per gene.
(B) Normalized number of coding variants based on the size of the coding and splice regions.
(C) Normalized number of deafness-associated variants (P+LP) based on the total number of coding variants.
Only genes with ≥14 reported deafness-associated variants are included in this figure; the remaining genes are shown in Figures S2 and S3.

To determine whether gene-specific variation rates correlated with tolerance or intolerance to variation, we focused on the 6,490 variants classified as P and LP for deafness and normalized to the total number of coding variants. We found that ∼69% of coding variants in GJB2 are disease causing (P and LP variants), meaning that for any new variant identified in the coding sequence of GJB2, there is a ∼69% chance that it is pathogenic. Both COL4A5 (MIM: 303630) (55.3%) and SLC26A4 (47.2%) also had high (P+LP)/(Coding Variant) ratios (Figures 4C and S3B, Table S5).
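The two normalizations used in this section (variants per base of coding sequence, and P+LP variants per coding variant) can be illustrated with a minimal Python sketch. The gene names and counts below are hypothetical placeholders, not values from Table S5, and the sketch assumes each reported variant occupies a distinct base, which the text does not state.

```python
from dataclasses import dataclass


@dataclass
class GeneCounts:
    """Hypothetical per-gene tallies; field names are illustrative only."""
    name: str
    coding_bases: int      # size of coding + splice-site regions (bp)
    coding_variants: int   # reported variants in those regions
    p_lp_variants: int     # variants classified P or LP for deafness


def variation_rate(g: GeneCounts) -> float:
    """Coding variants normalized to the size of the coding region."""
    return g.coding_variants / g.coding_bases


def p_lp_fraction(g: GeneCounts) -> float:
    """(P+LP) / (coding variants); 0.0 when no variants are reported."""
    return g.p_lp_variants / g.coding_variants if g.coding_variants else 0.0


# Placeholder data: a small, densely annotated gene and a large, sparse one.
genes = [
    GeneCounts("GENE_SMALL", coding_bases=1_000, coding_variants=690, p_lp_variants=476),
    GeneCounts("GENE_LARGE", coding_bases=20_000, coding_variants=2_000, p_lp_variants=100),
]

# Raw counts favor the large gene; the normalized rate favors the small one.
ranked_by_count = sorted(genes, key=lambda g: g.coding_variants, reverse=True)
ranked_by_rate = sorted(genes, key=variation_rate, reverse=True)

for g in genes:
    print(f"{g.name}: rate={variation_rate(g):.1%}, P+LP fraction={p_lp_fraction(g):.1%}")
```

The ranking flips between raw counts and normalized rates, which is the pattern the text reports when moving from total variant load to size-normalized variation.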
Variants Are Differentially Distributed across Classifications

To characterize the molecular profile of variants within different classification categories, we focused on variants in coding and splice regions and grouped them by type (nonsense, splice-site, frameshift indels, start loss, stop loss, in-frame indels, missense, UTRs, intronic, synonymous) across variant classifications (P, LP, VUS, LB, B). Overall, missense variants were most prevalent in all categories (Figure 5A). For P variants, loss-of-function (LoF) and non-LoF variants were equally represented (∼50% each). Of LoF variants, frameshift indels were most common (47.8%), followed by nonsense and splice-site at 27.65% and 22.27%, respectively. LP variants showed a slightly different profile, consisting mostly of missense variants at ∼70%. As expected, P variants are enriched in LoF variants and B variants are depleted of them. VUSs are enriched for missense (53.5%) and synonymous (34.1%) variants, with LoF variants representing only ∼7%. We found an enrichment of LoF variants in the P (50%) and LP (16.9%) categories, whereas they represent only 7% of VUSs and a negligible proportion of LB (0.03%) and B (0.47%) variants (Figure 5A). Missense variants are most common in the LB classification at ∼95% and represent ∼70% of the LP variants and ∼50% of P variants.

Figure 5. Genomic Landscape of Deafness-Associated Genes
(A) Variant architecture by classification category shows a strikingly distinct distribution of variant types across the five classifications.
(B) Distribution of LoF, missense, and synonymous variants is different across genes.
(C) Most LoF variants are P/LP and some genes are highly enriched in this type of variant.
(D) The contribution of missense variants to the mutational pool of hearing loss is variable across genes. However, in most genes, the majority of missense variants are VUSs.
(E) The mutational spectrum is gene specific. Splice-site indicates variants in canonical splice sites.
Only genes with ≥14 reported deafness-associated variants are included in this figure; the remaining genes are shown in Figures S4 and S5.

Diverse Mutational Spectrum across Deafness-Associated Genes

We next examined the distribution of LoF, missense, and synonymous variants by gene and observed disparity among genes: some are depleted of LoF variants (such as SIX1 [MIM: 601205]), whereas others are enriched in synonymous (such as ACTG1) or missense (such as ADGRV1) variants (Figures 5B and S4A). Of all LoF variants, the fraction contributing to the P/LP pool differs across genes, showing that for some genes a LoF variant is most likely to be pathogenic (SOX10 [MIM: 602229], TCOF1 [MIM: 606847], COL2A1 [MIM: 120140], COL4A5 [MIM: 303630], EYA1 [MIM: 601653], GATA3 [MIM: 131320], POU3F4 [MIM: 300039]), whereas for others it is not (e.g., ACTG1, AIFM1 [MIM: 300169]) (Figures 5C and S4B). A similar disparity is also observed for missense variants: for some genes more than half of all missense variants are P/LP (GJB2, KCNQ1, PRPS1 [MIM: 311850]), and for others this contribution is marginal (e.g., TRIOBP [MIM: 609761], ADGRV1) (Figures 5D and S4C). Interestingly, for some genes such as BSND (MIM: 606412), TCOF1, and TRIOBP, approximately half of all missense variants are classified as B/LB, implying that a missense variant in those genes is more likely to be non-disease causing.

We also noted wide variation across genes in the fractional contribution of missense versus LoF variants to the P/LP category (Figures 5E and S4D). Some genes have exclusively missense mutations (ACTG1, PRPS1, COCH [MIM: 603196], and AIFM1), while other genes were enriched in LoF mutations (TCOF1, LOXHD1 [MIM: 613072], ADGRV1, EYA1, and PCDH15). A more detailed analysis of the different types of mutations within the LoF group revealed greater variability in the fractions of nonsense, splice-site, and frameshift indels across genes (Figures 5E and S4D).
For example, the majority of LoF mutations in LOXHD1 are nonsense, whereas for COL11A1 (MIM: 120280) they are splice-site variants.

MAF Thresholds for Disease-Causing Variants Are Gene Specific

Gene-specific MAF thresholds for P+LP variants ranged from 0% to 7.34%. GJB2, MYO15A, OTOF (MIM: 603681), PEX6 (MIM: 601498), and CLRN1 (MIM: 606397) had the highest MAFs at 7.34%, 2.45%, 0.79%, 0.71%, and 0.69%, respectively (Table S6). However, these maximum MAFs are misleading and do not provide an accurate MAF for the majority of disease-causing variants associated with these genes. For example, while the maximum MAF for any pathogenic variant reported in GJB2 (GenBank: NM_004004.5) is 7.3% for c.109G>A (p.Val37Ile), the median MAF for all mutations in GJB2 is, surprisingly, 0, reflecting the huge number of ultra-rare P+LP variants in this gene (Figures 6A, 6B, and S5, Table S6). Similar results were found for SLC26A4, USH2A, and WFS1. These discrepancies also reflect founder effects, as some mutations occur solely in a single population or ethnicity and account for a large portion of that population’s hearing loss (Table S2). These critical exceptions emphasize the importance of expert curation and review of variants that exceed the 0.5% MAF cut-off.

Figure 6. MAF Thresholds for Deafness-Associated Variants Are Gene and Type Specific
(A) Plot of MAFs of all P/LP variants in each deafness-associated gene.
(B) Maximum MAF is gene specific and there is a clear distinction between LoF and missense variants.
(C) Overall, missense variants exhibit the highest MAFs when compared to all other variants.
Only genes with ≥14 reported deafness-associated variants are included in this figure; the remaining genes are shown in Figure S5.

MAF Thresholds for Disease-Causing Variants Are Type Specific

To determine whether MAFs for P+LP variants were mutation-type dependent, we subdivided all variants by effect and plotted them against their MAF.
Although the median MAF is 0 for all variant types, synonymous and UTR variants had the highest mean MAF (0.023% and 0.027%, respectively), followed by missense (0.017%), nonsense (0.009%), splice-site (0.0047%), in-frame indels (0.0036%), and frameshift indels (0.0028%) (Figure 6C, Table S7). These results closely mirror the gene-level results and demonstrate that, regardless of type and gene, disease-causing mutations are ultra-rare and consist largely of novel/private variants.

Kafeen and the DVD Are Configurable, Customizable, and Open-Access Resources

The DVD is freely available. It is widely used by the scientific and clinical communities worldwide, with ∼3,000 users and 13,000 sessions over the past 12 months (Figure S6). The Kafeen bioinformatic pipeline, upon which the DVD was built, is configurable, adaptable, and extensible, allowing incorporation of additional variant and annotation sources as well as deleteriousness prediction tools. It also allows for customizable MAF thresholds to classify variants. Consequently, its use is not limited to deafness, and it could be implemented for a variety of other genetic disorders.
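To make the idea of customizable thresholds concrete, the following Python sketch shows one way the computational rules stated in these Results (a 0.5% MAF cut-off, an LB call at PS ≤ 40% with at least five predictions, and exceptions for common pathogenic variants) could be parameterized. It is not Kafeen's actual code or interface: the `Thresholds` class, the `classify` function, the `"REVIEW"` label, and the treatment of a MAF exactly at the cut-off are all assumptions made for illustration. Variants that match neither rule are left as `"VUS"` here, whereas the DVD also draws on prior classifications and manual curation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical configuration; defaults follow the values in the text."""
    maf_benign_pct: float = 0.5    # MAF (%) at or above which a variant is called B
    ps_lb_max_pct: float = 40.0    # PS (%) at or below which a variant is called LB
    min_predictions: int = 5       # minimum number of prediction tools required
    exceptions: frozenset = frozenset()  # common variants routed to expert review


def classify(variant_id: str, maf_pct: float, ps_pct: Optional[float],
             n_predictions: int, cfg: Thresholds = Thresholds()) -> str:
    """Return 'B', 'LB', 'VUS', or 'REVIEW' (an illustrative label)."""
    if maf_pct >= cfg.maf_benign_pct:
        # Assumption: a MAF exactly at the cut-off is treated as too common.
        if variant_id in cfg.exceptions:
            return "REVIEW"  # e.g. founder mutations need manual curation
        return "B"
    if (ps_pct is not None and n_predictions >= cfg.min_predictions
            and ps_pct <= cfg.ps_lb_max_pct):
        return "LB"
    return "VUS"  # no computational call; left to other evidence


# Default deafness-style thresholds.
print(classify("var1", maf_pct=1.2, ps_pct=80.0, n_predictions=6))
print(classify("var2", maf_pct=0.01, ps_pct=30.0, n_predictions=5))

# A common pathogenic variant configured as an exception (MAF taken from the text).
cfg = Thresholds(exceptions=frozenset({"GJB2:c.109G>A"}))
print(classify("GJB2:c.109G>A", maf_pct=7.3, ps_pct=None, n_predictions=0, cfg=cfg))

# A different disorder might use a looser MAF cut-off.
print(classify("var3", maf_pct=0.8, ps_pct=30.0, n_predictions=6,
               cfg=Thresholds(maf_benign_pct=1.0)))
```

Keeping the thresholds in a single immutable configuration object means that adapting the rules to another disorder changes data rather than logic, which is the kind of flexibility the text attributes to the pipeline.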