Material and Methods

Altrans Method for Relative Quantification of Splicing Events
Altrans is a method for the relative quantification of splicing events. It is written in C++ and requires a BAM alignment file15 from an RNA-seq experiment and an annotation file in GTF format containing exon locations. The BAM file is read using the BamTools API.16 Altrans utilizes paired end reads, where one mate maps to one exon and the other mate to a different exon, and/or split reads spanning exon-exon junctions to count “links” between two exons. For reads aligning to multiple locations in the genome with the same mapping quality, only the primary alignment, i.e., the one reported in the BAM file, is considered and alternative alignments that are reported as tags in the BAM file are ignored. The first exon in a link is referred to as the “primary exon.” The algorithm is as follows:1. Group overlapping exons from annotation into exon groups. Because we are quantifying splicing events and not individual transcripts, transcript level information is ignored and exons with identical coordinates belonging to multiple transcripts are treated as one unique exon (Figure 1A).
2. In order to assign reads to overlapping exons, identify unique portion(s) of each exon in an exon group. Exons with immediate unique portions, where there is no other overlapping exon and where a read can be unambiguously assigned to the exon (E2u,1 in Figure 1A), are called “level 1 exons.” For exons with no unique positions, remove the level 1 exons from the exon group to determine regions that identify the remaining exons uniquely, where again there is no other overlapping exon after the removal of the level 1 exons (E2u,2 in Figure 1A), and increment the level of these exons. In the rare cases where an exon shares its start position with one exon and its end position with another, causing it to have no unique portion, then this exon is removed from the analysis in order to be able to assign unique portions to the remaining exons in the same exon group. In cases where a larger exon overlaps and fully contains two smaller exons, the insert size distribution is used to probabilistically assign links between the two smaller exons (please refer to the Altrans manual for a more detailed annotation of these rare cases). Iterate through this process of removing exons that have unique regions until all exons in a group have unique portions (Figure 1A).
3. Use these unique portions to assign mate pairs or split reads to links (Figure 1A). Links assigned to unique portions that exist only after the removal of overlapping exons are putative assignments and “deconvolution” of these is handled in the next step.
4. Because all the exons have ambiguous unique portions and share regions with other exon(s), reads aligning here might belong to multiple exons. In order to unambiguously quantify links between these exons, we calculate “link coverage” for all pairs of exons in a given window size. The default method divides the link counts with the probability of observing such a link given the insert size distribution, which is empirically determined from pairs aligning to long exons (Figure 1B). The second method involves calculating the number of degrees of freedom linking two exons given the empirically determined most frequent insert size and read length (Figure 1C). This is an alternative model to calculating link coverage, but we recommend using the default model unless you have a very tight distribution of insert sizes. The link coverage metric ensures that the link counts are normalized for the specific insert size distribution of the experiment, which has direct effect on the observed link counts. Hence, link coverage allows us to quantify an exon link from the unique portions only, i.e., this value should be equivalent to the one we would calculate if we were able to measure the whole exon. The coverage between level 1 exons can be calculated directly using the unique portions, whereas links between higher-level exons are calculated by iteratively subtracting coverage of all the other lower level links from the coverage of these links (Figure 1E).
5. Calculate the quantitative metric, F value, for one exon link as the coverage of the link over the sum of the coverages of all the links that the primary exon makes (Figures 1D and 1E). Using this fraction rather than link counts or coverage ensures that the metric is independent of global effects on gene expression.
6. Repeat step 5 in both 5′-to-3′ (forward) and 3′-to-5′ (reverse) directions to capture splice acceptor and donor effects, respectively (Figure 1E).
Please refer to the Altrans manual where the method is annotated in more detail and examples of how to run Altrans are provided. The program also allows the user to calculate an F value from all the links that a primary exon makes regardless of the direction. Along with the F values, the raw link counts are also outputted, which allow filtering of results eliminating low count links. These raw counts can also be normalized and subsequently reread by the program to calculate the F values. Memory usage and speed heavily depend on the complexity of the annotation and the number of reads in the alignment file. For a sample alignment with 50 million reads and an annotation with 539,748 unique exons, Altrans ran for 20 min and consumed 784 MB of RAM on a single 2.2 GHz core under Linux.

Conversion of Transcript Quantifications to Link Quantifications
We convert the transcript quantifications generated with Cufflinks to relative link quantifications. This is achieved by assigning the same quantification to all linked exons of a transcript based on the measured quantification of the said transcript. We then apply the same method of relative link quantification used in the Altrans algorithm, specifically steps 5 and 6 in the previous section, to calculate the F value for all the links a primary exon makes.

Simulation Analysis
In order to benchmark the link quantifications generated by Altrans, we conducted a simulation analysis using the Flux Simulator software.17 We simulated an RNA-sequencing experiment with 50 million reads with the GENCODE v.12 annotation18 reflecting cases where we have a perfect annotation describing all the observed transcripts in the data. Additionally, we introduced novel transcripts, made up of existing exons of a gene, into the annotation. This was achieved by creating novel combinations of exons of a gene while checking for compatibility of the randomly selected exons (non-overlapping, order matches the genomic order, and where a UTR or first or last exon is not an internal exon) and keeping the distribution of number of exons of the random transcripts similar to that of the known transcripts. We then simulated 5 cases with 50 million reads where the novel transcripts accounted for 5%, 10%, 25%, 50%, and 75% of all transcripts, reflecting cases where the annotation is not perfect. Altrans, Cufflinks,11 and MISO13 were run on these 6 simulated datasets using the standard GENCODE v.12 annotation. In each simulation, the “correct” quantification of a transcript is taken as the RNA molecule count that the Flux Capacitor used to simulate reads for a given transcript. We have converted these “correct” transcript quantifications and the measured transcript quantifications of Cufflinks and MISO to exon link quantifications, as described in the previous section, and correlated the simulated expected link quantifications with the measured link quantifications for the three programs in the six simulation scenarios, using links where there were overlapping reads or links that were quantified in both the simulation and the given program. We measured the concordance between the simulated and measured quantifications via Spearman’s correlation. The estimates of novel splicing in a dataset are done through counting the number of uniquely mapping split reads. We then take junctions that are represented by at least eight split reads and check whether this junction is present in the annotation.

cis-Alternative Splicing QTL Discovery by Each Method in the Geuvadis Dataset
The RNA-seq reads were aligned to the human reference genome (GRCh37) using the GEM aligner19 and alignments were filtered for properly paired and uniquely mapping reads (mapping quality greater than or equal to 150). Genotypes originated from 1000 Genomes phase 1 data, which is based on 1,092 individuals with 5× whole-genome sequencing data, 80× exome sequencing data, and high-quality genotyping. The genotype data were filtered for variants with MAF < 5% and HWE p < 1 × 10−6 for each population separately and were corrected for population stratification using the first three and two eigenvectors for Europeans and Africans, respectively.7 The Altrans link counts were normalized using the first 15 principal components calculated from these link counts. We first looked at all pairwise links between exon groups considering the union of all exons in the exon group as one entity and filter so that we keep only pairs of exon groups that have 15 links in 80% of the samples. Then we count the links between exons of the initial exon group and exons of the terminal exon group and keep only links where the exon in the initial exon group made at least ten links with any of the exons in the terminal exon group in at least 30% of the samples. Cufflinks quantifications were run using the annotation with the –GTF option. In the case of Cufflinks, the transcript quantifications were converted to link quantifications and we assessed links originating from the same genes where there were Altrans quantifications. The cis-window for asQTL discovery was 1 Mb flanking the transcription start site of each gene. The associations were run with the FastQTL package.20 The observed nominal p values were calculated by correlating the genotype and link quantifications, which were Gaussian transformed. Subsequently, we ran permutations for each link separately to assign empirical p values to each link. The permutation scheme involved permuting all links of a given gene together 1,000 times and in each permutation iteration, we record the most significant p value from an association between any variant in the cis- window and any link of a given gene, thereby accounting for the dependencies among the link quantifications of a gene, allowing us to find significant asQTL genes. From this distribution of null p values we use an approximation using the beta distribution to estimate the extremes of the null p value distribution, and using this we calculate an adjusted p value. These adjusted p values are then corrected for multiple testing using the qvalue R package.21

Classification of Splicing Events
The alternative splicing events were classified into ten categories: alternative 3′ splice site, alternative 3′ UTR, alternative 5′ splice site, alternative 5′ UTR, alternative first exon, alternative last exon, mutually exclusive exon, skipped exon, tandem 3′ UTR, and tandem 5′ UTR. For more information on these events, refer to Wang et al.5 We then classify each primary exon into these classes based on all of the observed links of the primary exon. This means that a primary exon can be involved in multiple splicing events. From these classifications, we then calculate the proportion of each splicing class in the pool of significant primary exons. This method of classification was chosen because each link quantification is dependent on the quantification of all the other links that a primary exon makes.

Functional Enrichment of asQTLs
To compare the asQTL variants to a null distribution of similar variants without splicing association, we sampled genetic variants in the same cis-window of 1 Mb surrounding the transcription start site (TSS) and matched them to alternative splicing variants with respect to relative distance to TSS (within 5 kb) and minor allele frequency (within 2%). The variant effect predictor (VEP)22 tool from Ensembl was modified to produce custom tags that were STOP_GAINED, SPLICE_DONOR, SPLICE_ACCEPTOR, and FRAME_SHIFT. This modified version of VEP was applied to the imputed genotypes using the GENCODE v.1218 annotation. To this we added information of overlap with chromatin states23 and the Ensembl regulatory build,24 which constituted our functional annotation. The enrichment for a given category was calculated as the proportion between number regulatory associations in a given category and all regulatory variants over the same proportion in the null distribution of variants. The p value for this enrichment is calculated with the Fisher exact test.