Material and Methods Cell Lines Lymphoblastoid cell lines (LCLs, n = 68) derived from unrelated Yoruba individuals in Ibadan, Nigeria, were obtained from Coriell Institute for Medical Research. They have been genotyped at more than 3.1 million SNPs22 and had their RNA quantified by expression arrays,23,24 exon arrays,25 and RNA-seq.26 LCLs were cultured in RPMI 1640 (Mediatech)/20 mM l-glutamine (Mediatech) plus 20% FBS (HyClone Laboratories) for the initial passage and then passaged every 48 hr with LCL medium containing 15% FBS. Cell suspensions were transferred to 25 cm2 flasks and incubated at 37°C in a humidified 5% CO2 atmosphere. Cell lines were maintained at a concentration of 3.5–4.0 × 105 cells/ml and harvested after the fourth passage, if viability was ≥85%. Protein Isolation Three pellets from each of 68 YRI LCLs were independently thawed, cultured, split, pelleted, and stored at −80°C. Pellets from each independent freeze-thaw were resuspended in 1.0 ml of 1.5% SDS lysis buffer (240 mM Tris-acetate, 1.5% w/v SDS, 0.5% w/v glycerol, 5 mM EDTA) containing 50 mM DTT, protease (1 μg/ml aprotinin, 1 μg/ml leupeptin, 1 μg/ml pepstatin), and phosphatase inhibitors (1 mM sodium orthovanadate, 10 mM β-glycerophosphate). To ensure complete protein denaturation, samples were boiled for 10 min, sonicated for 10 min (alternating 30 s on, 30 s off) with a Bioruptor (Diagenode), and concentrated to 5–10 μg/μl in a 96-well microconcentration device with a 10 kDa molecular weight cutoff (Millipore). Antibody Screening The first antibody set comprised 296 previously validated antibodies directed against 200 unique cell signaling proteins.21,27,28 The second antibody set represented a completely uncharacterized set of 4,070 antibodies directed against 1,848 unique TFs. Three biological replicates for each of 11–12 individuals were pooled together into 6 pools for screening of these 4,366 rabbit polyclonal antibodies at a 1:1,000 dilution. Printing, gel fabrication, horizontal semidry electrophoresis, transfer, blotting, and scanning were performed as in Ciaccio et al.,21 permitting 96 antibodies to be screened over six pooled lysates per MWA. Antibodies were probed in the 800 channel using goat anti-rabbit IR800-conjugated secondary antibodies (1:5,000) (Invitrogen). A validated mouse monoclonal antibody to β-actin (1:1,500) (Cell Signaling 3700) was included on each blot as a printing control and was measured using a goat anti-mouse Alexa Fluor 680-conjugated secondary antibody (1:7,500) (Invitrogen) in the 700 channel. Fluorescence was quantified using LI-COR Odyssey software (v.3.0) by drawing features around the appropriately sized bands for each sample with a fluorescent protein marker (LI-COR 928-40000) acting as a standard for molecular weight. The raw integrated intensities of each feature were background subtracted using the median of the three pixels surrounding the feature as an estimate of local background, the maximum number of pixels permitted by the LI-COR Odyssey image analysis software to be used for local background estimation. These corrected integrated intensities were used to calculate the average background-corrected integrated intensities of replicate spots. Antibodies that displayed a single predominant band of the predicted size of the targeted protein isoform of interest that accounted for >75% of the entire signal of the lane with a signal-to-noise ratio ≥3 were selected for subsequent population-level quantification by RPPAs; antibodies that displayed at least one band of the predicted size of the targeted protein isoform of interest with a signal-to-noise ratio ≥3 but also additional bands were selected for subsequent population-level quantification by MWAs. Antibodies that passed this screen are listed in Table S2 available online. RPPA Protein Level Quantification Four technical replicates of each of three biological replicates of all 68 individuals were spotted using a noncontact piezoelectric microarrayer (GeSiM Nanoplotter 2.1E) onto nitrocellulose membranes (BioRad). Serial dilutions of each of the six pooled lysates used for the original antibody screen and lysates from an A431 cervical carcinoma cell line (which was used as a positive control for antibodies) were also printed for each array to ensure linearity of antibody signal. Features with a background-subtracted integrated intensity <0 or signal-to-noise ratio <3 (z test p > 0.05) were identified in each array and excluded from further analysis. The distributions of background-corrected integrated intensities for all features on each array were first log2-quantile normalized using the limma29 package in R to correct for overall antibody hybridization efficiency differences in the signal. The relative level of a given protein for a sample was then quantified using the linear model yjp ∼μjp + λj + e (Equation 1), where μjp is the log-quantile-normalized, background-corrected integrated intensity of sample j on array p, and λj is the effect due to sample j across all arrays in a print (due to differing amounts of total protein spotted on the array for each sample), estimated by medianj (μjp). Notably, we performed an extension of median loading normalization30 and did not normalize to housekeeping proteins such as β-actin or α-tubulin as is typically performed with traditional western blotting to correct for sample load, because interindividual β-actin mRNA levels varied by two orders of magnitude in the RNA-seq data. Odyssey output text files were parsed in Python and quantified and normalized in R. MWA Protein Level Quantification Three technical replicates of each of the three biological replicates of all 68 individuals were spotted as above onto polyacrylamide gels. Gel fabrication, horizontal semidry electrophoresis, transfer, and scanning were performed as in Ciaccio et al.21 with the exception of separating each blot into four quadrants rather than using a 96-well gasket. This method allowed for 68 samples to be quantified with a single antibody in triplicate from each of four quadrants. Feature extraction and data normalization were performed the same as with RPPAs. For antibodies that produced multiple significant bands (signal-to-noise ratio > 3), all isoforms were quantified and their relative molecular weights recorded. The level of a given protein for an individual was quantified using the above linear model (Equation 1) with the addition of a batch term (β) to correct for global intensity distribution differences across multiple MWAs for the same antibody. We averaged measurements across replicates within platforms for the same antibody across the entire population. For replicates across platforms, we selected the platform that yielded the highest median background-corrected integrated intensity. To reduce the inflated effect of technical noise because of low antibody signals and provide more accurate interindividual protein level measurements, antibodies in the bottom deciles of median background-corrected integrated intensities or in the top deciles of technical coefficients of variation for either platform were flagged and eliminated from further analyses. Quality Metrics of the Protein Measurements To correct for differences in the total amount of protein deposited for each sample for each array, we estimated a sample load effect by regressing out the median protein measurement for each sample. This measurement was highly correlated with the first principal component of the protein data, as the overall concentration of each sample was directly related to the amount of each protein (R2 > 0.95). To assess the quality of our protein data, we plotted the coefficients of variation for each antibody quantified versus the median signal-to-noise ratios and background-corrected integrated intensities (Figure S1). Similar to the effect observed with expression microarrays, we observed relatively high technical variation for antibodies of low fluorescence signal and a trend toward decreased variation as fluorescence signal began to exceed the noise floor of our proteomic platform. Therefore, we removed protein measurements in the bottom quartiles of signal-to-noise ratio and background-corrected integrated intensity and the top quartile of coefficient of variation. The application of these filters reduced the effect of technical variation on our later inferences. We observed a median σ of 0.47 between interindividual protein levels quantified by RPPAs and MWAs from seven antibodies quantified across both platforms (example shown in Figure S2). Comparatively, the median σ between expression arrays and RNA-seq for any given transcript across the same population of YRI individuals has been shown to be approximately 0.12.26 To validate that the antibodies generated against epitopes within each protein were targeting the protein of interest, we performed two analyses. First, for the 57 pairs of antibodies directed at different epitopes for the same protein that passed our screen, we tested for correlated measurements between interindividual protein levels measured by both antibodies. We observed that 44 of the 57 had correlated measurements (ρ > 0). Discordance between multiple antibodies to the same protein could be because of technical variation or differential isoform levels, because each epitope is directed to a unique region of each protein. Second, approximately 50 amino acids surrounding known antibody epitopes (because exact epitope information was proprietary) were remapped to the human genome (UCSC Genome Browser build hg18) using BLAT31 and epitopes that contained at least one nonsynonymous SNP from dbSNP (release 132)32 or matched multiple regions in the genome with at least 95% identity were flagged but retained, because the proprietary nature of epitope disclosure prevented us from knowing which ∼5–8 amino acid fragment of the 50 amino acids was used to create the antibody. Gene Expression Data Expression array data for 53 individuals included in our study from Illumina’s human whole-genome expression arrays (WG-6 v.1) from Stranger et al.24 were downloaded from Gene Expression Omnibus (GSE6536). Probes were remapped to the human genome (UCSC Genome Browser build hg18) using BLAT31 and probes that mapped to a single location with less than 100% sequence identity or mapped to multiple locations with up to two mismatches were discarded. We then excluded probes that contained at least one SNP in dbSNP (release 132)32 or our imputed common SNP genotypes for our cohort or overlapped copy-number variants in the YRI population.33 Exon array data for 52 individuals overlapping our study from the Affymetrix GeneChip Human Exon 1.0 ST Array platform from Zhang et al.34 were downloaded from Gene Expression Omnibus (GSE9703). Probes were mapped to UCSC Genome Browser build hg18 and probes containing at least one SNP were removed from probe set signal intensity files. Gene-level expression of transcript clusters was summarized with RMA35 and averaged within unique Ensembl gene annotations. RNA-sequencing data were obtained for all individuals in our study from Pickrell et al.26 Gene expression values were calculated as the number of GC-corrected reads mapping to a gene in an individual, divided by the length of the gene in kilobases and the number of mapped reads across all lanes for that individual in millions (RPKM). Cellular Covariates and Hidden Confounders We quantified the EBV copy number in all LCLs. EBV copy number was assessed with a Taqman Gene Expression Assay (Pa03453399_s1). Intrinsic growth rates for each cell line from Im et al.36 and baseline ATP and mitochondrial DNA levels from Choy et al.37 were also included in the analyses. To identify potential additional unobserved confounders, we applied surrogate variable analysis (SVA) to the matrix of 68 × 3 protein level measurements after including the effects of known nongenetic confounders to identify 16 additional significant surrogate variables. Covariate Modeling For each protein measurement, we constructed a linear mixed effects model y ∼p + E + M + A + G + S + P + T|I + SVi..n + e, in which p is the array- and sample-load-normalized integrated intensity for all biological replicates in the population, E is the fixed effect of individual EBV copy number, M is the fixed effect of individual mitochondrial DNA copy number, A is the fixed effect of individual baseline ATP levels, G is the fixed effect of individual intrinsic growth rate, S is the fixed effect of individual sex, P is the fixed effect of individual phase, T|I is the random thaw effect per individual, SVi..n are the effects of a matrix of 16 significant surrogate variables, and e is the residual error. The model was fitted to each protein by residual maximum likelihood using the lmer function in the R package lme4 (v 0.999999-0). Fixed effect p values for covariates were estimated using the pamer.fnc function in the LMERConvenienceFunctions package (v.1.6.8.3). The significances of covariate effects were assessed by estimating false discovery rates using Storey’s q value method. Genotype Data HapMap genotypes were obtained from the 1000 Genomes June 2011 phase I low-pass whole-genome SNP genotype release and transformed to UCSC Genome Browser (hg18) coordinates. Missing values were imputed by BIMBAM (v.1.0) using the default parameters to derive mean imputed genotypes. SNVs with MAF < 0.05 and SNVs with significant deviation from Hardy-Weinberg equilibrium (Fischer’s exact test, p < 0.001) were excluded, reducing the set to 9,345,571 SNPs and indels for association analyses. Association Analyses For each protein and transcript trait, interindividual levels were inverse normal transformed and tested for association with all markers genome-wide. Association testing was performed by linear regression implemented in Python and R using custom scripts. For each trait, we selected the most significantly associated SNV within each recombination window, defined by splitting the genome into 25,307 blocks flanked by >10 cM/Mb recombination rates estimated from HapMap.22 All SNV-protein associations with p < 10−4 for proteins with more than one biological replicate were validated with the linear mixed-effects model y ∼p + G + T|I + e with a fixed genotype variable, G. Significant Associations We performed genome-wide permutations to assess the significance of the association results. We permuted the 468 protein values for each biological replicate for all individuals, performed genome-wide association on the permuted and normalized phenotypes, and repeated this procedure for three replicates, selecting each time the best signal per phenotype. Permuted SNV-protein associations with p < 10−4 were tested with a linear mixed-effects model as above. False discovery rate (FDR) was calculated as the fraction of significant hits in the permuted versus the observed data at a given p value threshold. FDR was calculated separately for cis and trans pQTLs. Results are presented at FDRs of 5% and 20%, meaning that an estimated 5% and 20% of the pQTLs correspond to false positives, respectively. We chose to perform association analyses on protein and mRNA measurements without covariate and SV correction because correction for known covariates, SVs, or both did not improve RNA-protein correlations (p > 0.05, Wilcoxon rank sum test). We observed fewer cis and trans pQTLs at an FDR < 0.20 after correction (suggesting that we might be discarding some fraction of genetic variation associated with protein level variation, as has been previously demonstrated in methods to optimize cis eQTL discovery by iteratively removing PCs to maximize the number of eQTLs discovered26,38) and to be consistent with all previous pQTL studies to date10–12,14,19,20 to allow more direct comparison of our results. The association analyses and FDR calculations were performed for all autosomal surrogate variables (n = 16), protein values, and genes in the mRNA expression data sets. Enrichment of Specific Types of SNVs in pQTLs and eQTLs The annotation of all SNVs was performed using SeattleSNP Annotation 129. For each unique annotation (“coding-synonymous,” “intergenic,” “intron,” “missense,” “near-gene-3,” “near-gene-5,” “nonsense,” “splice-3,” “splice-5,” “utr-3,” and “utr-5”), we used a Fisher exact test to test the null hypothesis that the fraction of that annotation type in either recombination-block-filtered eQTLs or pQTLs for overlapping gene models at p < 10−4 was equal to the fraction in all annotated SNVs. Genome-wide Association Study Results and Enrichment Analyses All SNPs published by 02/01/2012 were downloaded from the catalog of GWASs maintained by the NHGRI and filtered for 5,570 common variants (MAF > 5%) in the YRI samples examined. For overlap with eQTLs and pQTLs, we considered all SNPs in linkage disequilibrium (LD) (R2 > 80%) with the complex-trait-associated SNPs and filtered for common variants (MAF > 5%) in the YRI samples examined. To determine the enrichment for SNPs associated with each complex trait to be eQTLs or pQTLs, we focused on only the 7,222 primary-trait-associated SNPs before LD imputation to correct for LD-driven inflation of enrichment results. We then generated 1,000 randomized SNP sets each of size 7,222 and matched on MAF distribution by proportions in discrete 5% MAF bins. For each set, we determined the number of eQTLs and pQTLs at p < 10−4 for traits with at least three observed expression QTL overlaps and derived an empirical p value by comparing the proportion of random simulations in which the number of random overlaps exceeded the observed overlap. siRNA Knockdown LCLs were seeded at a density of 550,000 cells/ml 24 hr before nucleofection. Amaxa’s Cell Line 96-well Nucleofector Kit SF (Lonza) was used to perform the transfection. Cells were centrifuged at 90 × g for 10 min at room temperature and resuspended at a concentration of 1,000,000 cells in 20 μl of SF/supplement solution (included in SF Kit Lonza Catalog #V4SC2096) and 2 μM final concentration of AllStars negative Control siRNA labeled with Alexa Fluor 488 (QIAGEN) or a pool of siRNA (QIAGEN) (Table S1). The cells were nucleofected using Amaxa’s DN-100 program. Cells were allowed to rest for 10 min before the addition of prewarmed (in 37°C water bath for a minimum of 20 min) RPMI media and then another 5 min after the addition of warm RPMI media. Cells were then plated for protein harvest. Cells were harvested at 24 and 48 hr postnucleofection for protein measurement by MWAs. Protein levels were quantitated as above with three technical replicates per individual per time point and normalized within an individual across time points to the relative β-actin protein levels. Percentage knockdown was then calculated by dividing the relative targeted protein levels in the targeted siRNA sample by those in the scrambled siRNA control sample for each time point. A knockdown was declared significant if protein levels were reduced after knockdown by greater than two times the percentage standard error (p < 0.05).