Material and Methods CommonMind Consortium Data We used pre-QC’ed genotype and expression data from the CommonMind Consortium, and detailed information on quality control, data adjustment, and normalization procedures can be found in Fromer et al.16 Briefly, samples were genotyped at 958,178 markers using the Illumina Infinium HumanOmniExpressExome array and markers were removed on the basis of having no alternate alleles, having a genotyping call rate ≤ 0.98, or having a Hardy-Weinberg p value < 5 × 10−5. After QC, 668 individuals genotyped at 767,368 markers were used for imputation. Phasing was performed on each chromosome using ShapeIt v2.r790,17 and variants were imputed in 5 Mb segments with Impute v2.3.118 using the 1000 Genomes Phase 1 integrated reference panel,19 excluding singleton variants. After phasing and imputation, then filtering out variants with INFO < 0.8 or MAF < 0.05, the number of markers included in the analysis totaled approximately 6.4 million. Gene expression was assayed via RNA-seq using 100 base pair paired end reads and was mapped to human Ensembl gene reference (v.70) using TopHat v.2.0.9 and Bowtie v.2.1.0. After discarding genes with less than 1 CPM (counts per million) in at least 50% of the samples, RNA-seq data for a total of 16,423 Ensembl genes was considered for analysis. The expression data was voom-adjusted for both known covariates (RIN, library batch, institution, diagnosis, post-mortem interval, and sex) and 20 surrogate variables identified via surrogate variable analysis (SVA).20 After the removal of samples that did not pass RNA sample QC (including but not limited to: having RIN < 5.5, having less than 50 million total reads or more than 5% of reads aligning to rRNA, having any discordance between genotyping and RNA-seq data, and having RNA outlier status or evidence for contamination) and retaining only genetically identified European-ancestry individuals, a total of 467 samples was used for downstream analyses. These 467 individuals comprised 209 SCZ-affected case subjects, 52 AFF (bipolar, major depressive disorder, or mood disorder, unspecified)-affected case subjects, and 206 control subjects. eQTL Identification An overview of our workflow can be found in Figure S1. First, to identify primary and conditional cis-eQTL, we a conducted forward stepwise conditional analysis implemented in MatrixEQTL21 using genotype data at 6.4 million markers and RNA-seq data for 16,423 genes. FDR was initially assessed using the Benjamini-Hochberg algorithm across all cis-eQTL tests within each chromosome. FDR was not re-assessed at each conditional step; instead, a fixed p value threshold was used as the inclusion criteria in the stepwise model selection. For each gene with at least one cis-eQTL (gene ± 1 Mb) association at a 5% false discovery rate (FDR), the most significant SNP was added as a covariate in order to identify additional independent associations (considered significant if the p value achieved was less than that corresponding to the initial 5% FDR for primary eQTL). This procedure was repeated iteratively until no further eQTL met the p value threshold criteria. We used a linear regression model, adjusting for diagnosis and five ancestry covariates inferred by GemTools. Following eQTL identification, only autosomal eQTL were retained for downstream analyses. Replication in Independent Datasets Replication was performed in the HBCC microarray cohort (dbGaP: phs000979, see Web Resources) and in the ROSMAP22 RNA-seq cohort by fitting the stepwise regression models identified in the CMC data. For cases in which a marker was unavailable in the replication cohort, all models including that marker (i.e., for that eQTL and higher-order eQTL conditional on it, for a given gene) were omitted from replication. Data from the HBCC cohort was QC’ed and normalized as described in Fromer et al.16 DLPFC tissue was profiled on the Illumina HumanHT-12_V4 BeadChip and normalized in an analogous manner to the CMC data. Genotypes were obtained using the HumanHap650Yv3 or Human1MDuov3 chips and imputed using the 1000 Genomes Phase 1 reference panel. Replication of the eQTL models was performed on 279 genetically inferred European-ancestry samples (76 control subjects, 72 SCZ-affected subjects, 43 BP-affected subjects, 88 MDD-affected subjects), adjusting for diagnosis and five ancestry components. ROSMAP data were obtained from the AMP-AD Knowledge Portal (see Web Resources). Quantile normalized FPKM expression values were adjusted for age of death, RIN, PMI, and 31 hidden confounders from SVA, conditional on diagnosis. Only genes with FPKM > 0 in more than 50 samples were retained. QC’ed genotypes were also obtained from the AMP-AD Knowledge Portal and imputed to the Haplotype Reference Consortium (v.1.1)23 reference panel via the Michigan Imputation Server.24 Only markers with imputation quality score R2 ≥ 0.7 were considered in the replication analysis. GemTools was used to infer ancestry components as was done for the CMC data above. After QC, 494 samples were used for eQTL replication in a linear regression model that also adjusted for diagnosis (Alzheimer disease, mild cognitive impairment, no cognitive impairment, and other) and four ancestry components. Modeling Number of eQTL per Gene on Genomic Features We considered three genomic features (gene length, number of LD blocks in the cis-region, and genic constraint score) for our modeling analyses. Gene lengths were calculated using Ensembl gene locations. We obtained LD blocks from the LDetect Bitbucket site to tally the number of blocks overlapping each gene’s cis-region (gene ± 1 Mb). We obtained loss-of-function-based genic constraint scores from the Exome Aggregation Consortium (ExAC). A negative binomial generalized linear regression model was used to model the number of eQTL per gene based on the above variables; results were qualitatively the same using linear regression of Box-Cox transformed eQTL numbers. Backward-forward stepwise regression using the full model with interaction terms for these three variables was used to determine the relationship between genomic attributes and eQTL number. These analyses were implemented in R. cis-heritability of gene expression was estimated using the same CMC data that were used for eQTL detection, including all markers in the cis-region and implemented in GCTA.25 SNP-heritability estimates were then added to the modeling procedure described above. Tissue, cell type, and developmental time point specificity were measured using the expression specificity metric Tau.26, 27 Tissue specificity for each gene was calculated using publicly available expression data for 53 tissues from the GTEx project28 (release V6p). Expression for each tissue was summarized as the log2 of the median expression plus one, and then used to calculate tissue specificity Tau. Cell type specificity for each gene was computed using publicly available single-cell RNA-sequencing expression data29 generated from human cortex and hippocampus tissues. Raw expression counts for 285 cells comprising six major cell types of the brain were obtained from GEO (GSE67835) and counts data were library normalized to CPM. Expression for each cell type was summarized as the log2 of the mean expression plus one, and then used to compute cell type specificity Tau. Developmental time point specificity for each gene was calculated using publicly available DLPFC expression data for 27 time points, clustered into eight biologically relevant groups, from the BrainSpan atlas (see Web Resources). Eight developmental periods30 were defined as follows: early prenatal (8–12 pcw), early mid-prenatal (13–17 pcw), late mid-prenatal (19–24 pcw), late prenatal (25–37 pcw), infancy (4 months–1 year), childhood (2–11 years), adolescence (13–19 years), and adulthood (21+ years). Expression for each time point was summarized as the log2 of the median expression plus one and then used to calculate developmental period specificity Tau. Each Tau was added to the above model for eQTL number individually, as well as all together. Enrichment Analyses We divided eQTL into separate subgroups by stepwise conditional order (first, second, and greater than second) and created sets of matched SNPs drawn from the SNPsnap31 database for each subgroup, matching on minor allele frequency, gene density (number of genes within 1 Mb of the SNP), distance from SNP to TSS of the nearest gene, and LD (number of LD-partners within r2 ≥ 0.8). For each subgroup of eQTL, we performed a logistic regression of status as eQTL or matched SNP on overlap with functional annotation, including the four SNP matching parameters as covariates. Enrichment was taken as the regression coefficient estimate, interpretable as the log-odds ratio for being an eQTL given a functional annotation. Functional annotations tested included: brain promoters and enhancers (union of all brain region TssA and Enh+EnhG intervals, respectively, from the NIH Roadmap Epigenomics Project32 ChromHMM33 core 15-state model), brain-specific promoters and enhancers (the union of all brain region TssA and Enh+EnhG intervals, excluding those present in seven other non-brain tissues/cell types: primary T helper cells from peripheral blood, osteoblast primary cells, HUES64 cells, adipose nuclei, liver, NHLF lung fibroblast primary cells, and NHEK-epidermal keratinocyte primary cells), and pre-frontal cortex (PFC) neuronal (NeuN+) and non-neuronal (NeuN−) nucleus H3K4me3 and H3K27ac ChIP-seq marks from the CMC. For each data source, active promoter and enhancer (or H3K4me3 and H3K27ac) annotations were tested for enrichment jointly. This analysis was repeated but restricting to matched SNPs located within 1 Mb of any of the 16,423 genes that were tested for eQTL, in order to determine whether the enrichment estimates were inflated due to the proximity of our primary and conditional eQTL to brain-expressed genes, which may be more likely to occur near active regulatory regions in the brain. In addition, to ensure that any enrichment patterns observed were not due to varying effect size among primary and conditional eQTL, the enrichment analyses were also carried out taking into account the variance in expression explained by each eQTL. Variance explained (R2) was estimated using the variancePartition34 R package, and eQTL were stratified into three R2 bins: bin 1, 1 × 10−2 ≤ R2 ≤ 1.75 × 10−2; bin 2, 1.75 × 10−2 ≤ R2 ≤ 2.25 × 10−2; and bin 3, 2.25 × 10−2 ≤ R2 ≤ 3 × 10−2. Logistic regression of status as eQTL or matched SNP was then carried out separately for each R2 bin, within each eQTL order. Conditional eQTL Analyses In order to isolate each conditionally independent cis-eQTL association, we carried out a series of “all-but-one” conditional analyses, implemented within MatrixEQTL,21 for each gene possessing more than one independent eQTL. As these conditional eQTL signals were to be used to test for co-localization with the SCZ GWAS signals, we limited these analyses to those genes (346 in total) with eQTL overlapping GWAS loci. For each of these genes, we conducted an all-but-one analysis for each independent eQTL by regressing the given gene’s expression data on the dosage data, including all of the other independent eQTL for that gene as covariates in addition to diagnosis and five ancestry components. For example, three conditional analyses would be conducted for a gene with three independent eQTL: one analysis conditioning on the secondary and tertiary eQTL, one analysis conditioning on the primary and tertiary, and one analysis conditioning on the primary and secondary. In this manner we generated summary statistics for each independent eQTL in isolation, conditional on all of the other independent eQTL for that gene. Co-localization Analyses For our co-localization analyses, we used summary statistics and genomic intervals from the 2014 Psychiatric Genomics Consortium (PGC) SCZ GWAS.35 We included 217 loci at a p value threshold of 1 × 10−6 (excluding the MHC locus), defined these loci by their LD r2 ≥ 0.6 with the lead SNP, and then merged overlapping loci. GWAS and eQTL signatures were qualitatively compared using p value-p value (P-P) plots, rendered in R, and LocusZoom36 plots. Multiple methods that aim to identify GWAS-eQTL co-localized loci are currently available.37, 38, 39, 40, 41, 42 We chose to further develop coloc39 for our co-localization analyses for several reasons: (1) it uses data from all SNPs within a locus; (2) it avoids the computational burden or approximate results of Bayesian inferential methods for causal variants,41, 42 which rely on reference panel estimates of linkage disequilibrium (LD); and (3) and it has been widely used43, 44, 45 including in direct comparisons of GWAS-eQTL co-localization methods.42, 46 We tested for co-localization using an updated version of coloc39 R functions, which we name coloc2 (see Web Resources), and incorporated several improvements to the method. First, coloc2 pre-processes data by aligning eQTL and GWAS summary statistics for each eQTL cis-region. Second, the coloc2 model optionally incorporates changes implemented in gwas-pw.43 Briefly, we implemented likelihood estimation of mixture proportions of five hypotheses (H0, no association; H1, GWAS association only; H2, eQTL association only; H3, both but not co-localized; and H4, both and co-localized) from genome-wide data. Coloc2 uses these proportions as priors (or optionally, coloc default or user-specified priors) in the empirical Bayesian calculation of the posterior probability of co-localization for each locus (eQTL cis-region). Coloc2 averages per-SNP Wakefield asymptotic Bayes factors (WABF)47 across three different values for the WABF prior variance term, 0.01, 0.1, and 0.5, and provides options for specifying phenotypic variance, estimating it from case-control proportions or estimating it from the data.