Single-cell data processing pipeline
For each study, a counts matrix was downloaded from a public data repository such as the Gene Expression Omnibus (GEO) or the Broad Institute Single Cell Portal (Supplementary file 1). Note that these data were not re-processed from the raw sequencing output, so alignment and quantification of gene expression were likely performed using different tools for different studies. In some cases, multiple complementary datasets were generated from a single publication; for these, we created separate entries in the Single Cell platform.
Table 1.  Results of evaluation.
Performance of approximately 2100 disease-gene pairs.
Assoc score↓                   Cohen’s d (+)   Mann-W U norm. (-)   Logistic log loss (-)   Logistic Brier score (-)
Cosine (w2v)                   1.31            0.197                0.51                    0.168
Raw PMI                        2.07            0.0953               0.374                   0.116
Raw PMI -log(pctile)           2.15            0.0947               0.355                   0.111
Exp PMI                        2.17            0.0897               0.356                   0.109
Exp PMI -log(pctile)           2.21            0.0903               0.341                   0.105
Raw Local Score                2.35            0.0828               0.312                   0.0947
Raw Local Score -log(pctile)   2.28            0.0832               0.317                   0.0963
Exp Local Score                2.34            0.0812               *0.301                  *0.0915
Exp Local Score -log(pctile)   *2.36           *0.0811              0.308                   0.093
log(coocc)                     2.24            0.097                0.348                   0.105
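As a rough illustration of the first two metric columns, the sketch below computes Cohen's d and a normalized Mann-Whitney U for two groups of scores. The exact definitions behind the table are not given in this section, so two assumptions are made here: Cohen's d uses the pooled sample variance, and the normalized U is the fraction of (positive, random) pairs in which the random pair scores higher, which would explain why lower is better. The function names and the toy scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(pos, rand):
    """Standardized mean difference, using the pooled sample variance (assumed)."""
    n1, n2 = len(pos), len(rand)
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(rand)) / (n1 + n2 - 2)
    return (mean(pos) - mean(rand)) / sqrt(pooled)


def mann_whitney_u_norm(pos, rand):
    """U / (n1 * n2): fraction of (positive, random) pairs in which the
    random pair scores higher; ties count one half (assumed normalization)."""
    u = 0.0
    for p in pos:
        for r in rand:
            if r > p:
                u += 1.0
            elif r == p:
                u += 0.5
    return u / (len(pos) * len(rand))


# Hypothetical association scores for positive and random disease-gene pairs.
pos_scores = [2.0, 3.5, 4.0, 5.5]
rand_scores = [1.0, 2.5, 3.0, 1.5]

d = cohens_d(pos_scores, rand_scores)
u_norm = mann_whitney_u_norm(pos_scores, rand_scores)

# Apply a positive linear rescaling (3x - 7) to every score and recompute.
pos_scaled = [3 * s - 7 for s in pos_scores]
rand_scaled = [3 * s - 7 for s in rand_scores]
d_scaled = cohens_d(pos_scaled, rand_scaled)
u_norm_scaled = mann_whitney_u_norm(pos_scaled, rand_scaled)
```

Under these definitions, both metrics are unchanged by the rescaling: the normalized U depends only on the ordering of the scores, and in Cohen's d the scale factor cancels between the numerator and the pooled standard deviation.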
Interpretation of the above table.
Each row corresponds to an association score, and each column corresponds to one of the evaluation metrics. A (+) in the column header means that the higher the metric value, the better the association score in that row separates the positive and random pairs; a (-) means that a lower value is better. Note that all the metrics are invariant to linear rescalings, and that the Mann-Whitney U score is nonparametric.

While counts matrices were generated using different technologies (e.g. Drop-Seq, 10x Genomics) and different alignment/pre-processing pipelines, all counts matrices were scaled such that each cell contains a total of 10,000 scaled counts (i.e. the sum of expression values across all genes equals 10,000 in each individual cell). All data were uniformly processed using the Seurat v3 package (Butler et al., 2018). In short, this pipeline involves the following steps. First, we identify 2000 variable genes across the given dataset and then perform linear dimensionality reduction by principal component analysis (PCA). Using the set of principal components that contribute >80% of the variance across the dataset, we then (i) perform graph-based clustering (Louvain clustering) to identify groups of cells with similar expression profiles, (ii) compute UMAP and tSNE coordinates for each individual cell (used for data visualization), and (iii) annotate cell clusters.

Note that the three human pancreatic datasets (GSE81076, GSE85241, GSE86469) were integrated in a shared multi-dimensional space using canonical correlation analysis (CCA) and the integration method in the Seurat v3 package (Butler et al., 2018). Cell clustering and computation of dimensionality reduction coordinates were performed on this integrated dataset.
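The per-cell scaling and the choice of principal components described above can be sketched in a few lines of plain Python. This is an illustration only, not the Seurat implementation: the function names are hypothetical, cells with zero total counts are left as zeros by assumption, and the >80% criterion is read here as cumulative explained variance, which the text does not state explicitly.

```python
def scale_to_total(cell_counts, target=10_000.0):
    """Rescale one cell's gene counts so that they sum to `target`."""
    total = sum(cell_counts)
    if total == 0:
        # Assumption: an empty cell stays all-zero rather than raising an error.
        return [0.0 for _ in cell_counts]
    return [c * target / total for c in cell_counts]


def n_components_for_variance(variance_ratios, threshold=0.80):
    """Smallest number of leading PCs whose cumulative explained-variance
    ratio exceeds `threshold` (assumed reading of the >80% criterion)."""
    cumulative = 0.0
    for k, ratio in enumerate(variance_ratios, start=1):
        cumulative += ratio
        if cumulative > threshold:
            return k
    return len(variance_ratios)


# Hypothetical raw counts for one cell across four genes.
raw_cell = [3, 0, 1, 6]
scaled_cell = scale_to_total(raw_cell)

# Hypothetical explained-variance ratios for five principal components.
ratios = [0.5, 0.25, 0.125, 0.0625, 0.0625]
n_pcs = n_components_for_variance(ratios)
```

Scaling every cell to the same total makes expression values comparable across cells with different sequencing depth, which matters here because the matrices come from different technologies and pre-processing pipelines.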