> top > docs > PMC:4157143 > spans > 5764-19702

PMC:4157143 / 5764-19702 JSON TXT

Annnotations TAB JSON ListView MergeView

2_test

{"project":"2_test","denotations":[{"id":"25192044-19620212-2055624","span":{"begin":893,"end":895},"obj":"19620212"},{"id":"25192044-22955987-2055625","span":{"begin":1797,"end":1799},"obj":"22955987"},{"id":"25192044-22343431-2055626","span":{"begin":2232,"end":2234},"obj":"22343431"},{"id":"25192044-19505943-2055627","span":{"begin":2465,"end":2467},"obj":"19505943"},{"id":"25192044-21811411-2055628","span":{"begin":2670,"end":2672},"obj":"21811411"},{"id":"25192044-20220756-2055628","span":{"begin":2670,"end":2672},"obj":"20220756"},{"id":"25192044-22785314-2055629","span":{"begin":3578,"end":3580},"obj":"22785314"},{"id":"25192044-20529905-2055630","span":{"begin":4110,"end":4112},"obj":"20529905"},{"id":"25192044-23103880-2055631","span":{"begin":8845,"end":8847},"obj":"23103880"},{"id":"25192044-20220756-2055632","span":{"begin":8865,"end":8867},"obj":"20220756"},{"id":"25192044-20562413-2055633","span":{"begin":12015,"end":12017},"obj":"20562413"},{"id":"25192044-22955989-2055634","span":{"begin":12175,"end":12177},"obj":"22955989"},{"id":"25192044-15965027-2055635","span":{"begin":12218,"end":12220},"obj":"15965027"},{"id":"25192044-22064851-2055636","span":{"begin":12350,"end":12352},"obj":"22064851"},{"id":"25192044-23128226-2055637","span":{"begin":12426,"end":12428},"obj":"23128226"},{"id":"25192044-12175810-2055638","span":{"begin":12749,"end":12751},"obj":"12175810"},{"id":"25192044-22086963-2055639","span":{"begin":12799,"end":12801},"obj":"22086963"},{"id":"25192044-24092820-2055640","span":{"begin":13157,"end":13159},"obj":"24092820"},{"id":"25192044-16381927-2055641","span":{"begin":13285,"end":13287},"obj":"16381927"},{"id":"25192044-17135203-2055642","span":{"begin":13323,"end":13325},"obj":"17135203"},{"id":"25192044-22159132-2055643","span":{"begin":13363,"end":13365},"obj":"22159132"},{"id":"25192044-19850723-2055644","span":{"begin":13377,"end":13379},"obj":"19850723"},{"id":"25192044-20576703-2055645","span":{"begin":13416,"end":13418},"obj":"20576703"}],"text":"Material and Methods\n\nCell Culture and RNA Sequencing\nEpstein-Barr-virus-transformed peripheral blood B lymphocytes (catalog no. XC01463) from families from the CEU population (Utah residents with ancestry from northern and western Europe from the CEPH collection) were purchased from the Coriell Institute and grown in RPMI 1640 supplemented with 10% fetal calf serum and penicillin and streptomycin in humidified 5% CO2 at a concentration of ∼1 × 106 cells/ml. Total RNA was isolated with Trizol. RNA quality was assessed with the Agilent Bioanalyzer 2100, and RNA integrity numbers above 9 were used for cDNA production. One microgram of total RNA was used for isolating polyA-purified mRNA and subsequently used for cDNA-library construction with the Illumina TruSeq RNA Preparation Kit. Strand specificity was performed with 2'-deoxyuridine 5'-triphosphate during second-strand synthesis.27 All samples were indexed with Illumina adapters and sequenced with an Illumina HiScanSQ. We subsequently sequenced each cDNA library on an Illumina HiSeq to obtain 30 million 75 bp paired-end reads per individual. We performed RNA sequencing (RNA-seq) for all 17 individuals (all three generations); however, for eQTL association, we only used the 11 children, and for ASE analysis, we used the two parents and 11 children. All RNA-seq data for all 17 individuals are freely available at the Gene Expression Omnibus under accession number GSE56961.\n\nQuantification of Gene Expression, Splicing, and ASE\nWe used Tophat and Cufflinks to obtain gene-expression levels from RNA-seq. We used Tophat to map RNA reads to the human reference genome (UCSC Genome Browser, hg19) and Cufflinks to quantify transcript-expression levels. Gene-expression levels were the sum of transcript-expression levels. Gencode28 v.12 was used as the input annotation for Cufflinks. We calculated transcript ratios to quantify alternative splicing patterns. Gene-expression and transcript-ratio data for Geuvadis samples were downloaded from the Geuvadis website; we used quantified gene-level reads per kilobase per million both before (for assessing effect sizes) and after (for eQTL mapping) normalization via probabilistic estimation of expression residuals.29 We assessed ASE by counting RNA read depth at heterozygous sites. We performed multiple quality-control steps to reduce known technical artifacts (see Figure S2). We obtained read counts at each heterozygous site by using SAMtools30 mpileup and our own ASE pipeline based on a binomial test modified for reference-mapping bias with a filter for observing at least five reads for each allele and a minimum read depth of 20× per site.21,31 To assess the quality of ASE estimates, we compared ASE correlation between double-IBD (identical-by-descent) siblings, half-IBD siblings, and non-IBD siblings. Indeed, we observed an expected increase in correlation between degree of IBD and allelic ratio measured across all sites (Figures S28–S30).\n\nWhole-Genome Sequencing Data\nWhole-genome sequencing data for the family were downloaded from the Complete Genomics website. Family members were originally sequenced to an average genome-wide coverage of 80×. We used variants called by the Complete Genomics Analysis Pipeline (v.2.0.0). We performed an additional filtering step testing for Mendelian inconsistency to obtain a high-confidence set of variants, and we eventually retained 5,546,682 out of the original 6,181,281 SNPs. We further compared our selected variants to those assessed by long-fragment read (LFR) technology (N50s 400–1,500 kb).32 LFR has a claimed error rate of 1 in 10 Mb. Our comparison showed that variant concordance between the 80× shotgun-sequencing approach and LFR technology was 99.91% to 99.95% (Table S2). In addition, the same family was also sequenced to 50× by Illumina Platinum Genomes, and the genotyping concordance with Complete Genomics was found to be 99.62% to 99.83% (Table S4).\n\nHaplotyping and Verification by Long-Fragment Sequencing\nWe inferred recombination positions and haplotypes of the family by using our software tool Ped-IBD.33 Haplotype blocks are defined by recombination positions. We identified a total of 813 recombination positions over 22 chromosomes. Haplotype blocks range in size from 0.02 to 12 Mb (90% interval) and have a median length of 1.65 Mb. We further confirmed haplotyping results with molecular haplotypes from the LFR technology in three individuals (NA12877, NA12885, and NA12886; Table S3). The comparison showed that phasing was 99.84% to 99.92% concordant between inferred and molecular haplotypes.\n\nLinkage Mapping of cis-eQTLs in the Family\nWe used linear regression to evaluate correlation of gene-expression levels within local haplotype blocks. We measured effect size by using the regression slope, β, and the coefficient of determination, R2. The linear model we used considers additive effects of two haplotype blocks. More specifically, for each block, the two parental haplotypes of each child are encoded with two covariates, p and m. The maternal haplotype mi of child i, for example, is either 0 or 1, depending on which of the two possible maternal alleles is present. Then, an expression trait is regressed as the summation of effects of two parental haplotypes, Ti ∼ μ + βjpi + βkmi, where Ti is the trait of individual i, the effects of two parental alleles k and j are expressed by βj and βk, and μ is the intercept. Each sibling has two choices of parental haplotypes on each side—p,m∈{0,1}—to yield four total combinations. Gene expression Ti uses log2(FPKM [fragments per kilobase per million] values). For splicing quantifications, we used relative transcript abundances, which we calculated by dividing the FPKM of each isoform by the FPKM of the whole gene (see Table S5). For cis-eQTLs, we only tested the local haplotypes containing the genes, which is sufficient for including most cis-eQTL signals (Figures S3–S5). Furthermore, we confirmed gene-expression levels and eQTL effect sizes with existing microarray data on the same family (Figures S6 and S7).\n\nComparison of cis-eQTL Effect Sizes between Population and Family\nTo compare cis-eQTL effect sizes, β, between the population and family, we sought to first correct for the overestimation of effect sizes (such discoveries exhibit characteristic regression to the mean). To address this in the population eQTLs, we divided the European-descended Geuvadis samples (n = 373) in half and partitioned them into discovery (n = 180) and replication (n = 193) panels. Within the discovery panel, we identified the strongest cis-associated variant per gene (by p value and within the same interval tested in the family). This allowed us to use the replication panel to more accurately measure the effect size of each cis-eQTL variant. However, to account for the difference in sample sizes between the replication panel (n = 193) and the family (n = 11), we further sought to estimate how much variance in effect-size measurements (β) could be obtained from sampling 11 people in the population at random. In this way we controlled for chance observations of larger effect sizes for some genes in the family. To achieve this, we repeatedly subsampled (100 times) 11 individuals from the replication panel while maintaining the exact same genotypes of the best associated variant between the subsample and the family. Figure S8 illustrates this subsample scheme. Effect sizes were then measured with the same regression formula, Ti ∼ μ + βjp + βkm, for both the family and the subsample; note that two regressors, p and m, match segregating patterns of both the haplotypes of the family and the best SNP of the population subsample. We note that estimation of β in the population was highly correlated independently of the use of a one- or two-regressor model (Figure S9). This allowed us to create a distribution of measured effect sizes that would be expected from randomly measuring the same number of individuals and genotypes in both the family and the population. Using this approach, we identified empirical p values representing how often measured effect sizes in the family were greater than that of the best associated SNP in the population. We also repeated this analysis by using fit (R2) given that we observed differences in the distribution of raw β values between the family and population and also observed higher variance in gene expression in Geuvadis overall (see Figure S17).\nWe analyzed several features that could result in over- or underestimation of effect-size measurements between the family and population (see Figures S17–S19). First, because effect-size measurements can be influenced by differences in quantification pipelines, we repeated the experiment by using different quantification approaches (Tophat + Cufflinks and GEM34 + Flux Capacitor;31 Figures S13 and S14). Second, effect sizes in the population could potentially be underestimated if the best associated SNP in the discovery panel is not causal given that subsequent effect-size measurements, in the replication panel, might not accurately measure the largest effect. To address this, we examined different discovery-panel sizes (Table S6 and Figure S15) and different criteria (Figure S16) for selecting the best SNP from the population. In addition, we observed through permutation that levels of noise in measurements of effect size (β) were different between the family and the population (Figure S17). To better gauge confidence intervals (CIs) of family effect sizes, we estimated the degree of inflation through permutation and adjusted effect-size CIs by scaling. These adjusted CIs were only applied to comparisons of β values and are denoted by CIadjusted (see Figure S17–S19). For the main manuscript, we report only unadjusted CIs. Furthermore, without using subsampling or permutation, we also directly compared effect sizes with Welch’s t test by applying analytic estimation of SEs of β. As a correctness check of the subsampling method, we compared and verified that analytic p values by Welch’s t test and empirical p values by subsampling were concordant (Figure S19).\nWe applied the same subsampling method to identify large-effect splicing quantitative trait loci (sQTLs) and ASE. To compare ASE between the family and population, we focused on a subset of genes that had substantial data for the measurement and comparison of allelic ratios (n = 1,777 genes). For a gene to be included, allelic ratios at a single site had to be measurable for at least five siblings and at least 30 population samples. We tested each gene once and excluded genes that were not tested for eQTLs, such as pseudogenes or genes within high-complexity regions (human leukocyte antigen and immunoglobulin loci). For a site to be considered measurable, it needed to be covered by a minimum of 20 reads with at least five reads for each allele. We then took the maximum allelic ratio in the family and compared it with the maximum allelic ratio found in 1,000 subsamples of the Geuvadis; each subsample was matched to the number of heterozygous individuals found in the family for that site. This approach generated an empirical p value that we used to assess whether an ASE effect in the family was greater than that in the population. To account for ASE biases caused by differing read depths between the family and population, we downsampled (hypergeometric) Geuvadis reads by a factor of 1.97—we calculated this scaling factor by measuring the average level of read-depth differences between Geuvadis and family samples at those selected heterozygous sites for each gene. To exclude the possibility that large-effect ASE was due to technical artifacts such as mapping biases or sequencing errors, we also looked at ASE for the second-largest-effect siblings and IBD siblings (Figure S25).\n\nVariant Annotation\nWe obtained annotations (missense, synonymous, regulatory, and splice region) by using the Variant Effect Predictor tool,35 which queries annotation from the Ensembl website. ENCODE transcription factor (TF) binding and DNase I hypersensitivity peaks were obtained from RegulomeDB.24 Conservation scores obtained from PhyloP36 (phyloP100way) software were downloaded from the UCSC Genome Browser. Motif-disrupting sites were downloaded from HaploReg (v.2).37 Variant allele frequency was based on phase 1 of the 1000 Genomes Project38 as calculated across European populations.\n\nConservation and Network Annotation\nWe examined the conservation of family eQTL genes between humans and chimpanzees (Pan Troglodytes) by using the dN/dS ratios; dN measures the rate of amino acid substitutions, and dS measures the background rate of neutral DNA subsitutions.39 The dN and dS values were obtained from BioMart40 (Ensembl v.70), and the dN/dS ratios were computed. dN/dS is negatively correlated with the conservation status of a gene, so higher dN/dS ratios indicate lower conservation of a gene. We also compared centrality of eQTL genes by using the protein-protein interaction (PPI) network as another indication of the biological importance of the affected genes.19 We computed connectivity of family and population eQTL genes in the PPI network. The PPI network was integrated from BioGRID,41 the Molecular Interaction database,42 the Human Protein Reference Database,43 and IntAct,44 all data obtained from the GeneMANIA45 data repository (downloaded on January 4, 2012).\n\nRare-Variant Enrichment Analyses\nTo control for site discovery and genotyping differences between the population (1000 Genomes Project) and family (Complete Genomics) genomes, we performed enrichment analyses only for variants in the family genomes. Using these data, we calculated enrichment of rare variants at large-effect-size genes by dividing the proportion of large-effect-size genes with a rare variant by the proportion of all tested genes with a rare variant."}