Figure 1.  Knowledge synthesis and the nferX Single Cell resource.
(A) Knowledge synthesis: capturing associations between concepts from over 100 million documents. Schematic shows the workflow for generating literature-derived associations between phrases. The local score and global score are defined, and the types of literature-derived associations are shown for combinations of high and low local and global scores. (B) Datasets enabling the knowledge synthesis-powered scRNA-seq analysis platform (https://academia.nferx.com/). Single-cell RNA-seq data was obtained from publicly available human and mouse single-cell RNA-seq datasets. Bulk RNA-seq data was obtained from Gene Expression Omnibus (GEO) and the Genotype-Tissue Expression (GTEx) project portal. Protein-level expression of coronavirus receptors was assessed using a collection of immunohistochemistry (IHC) images and tissue proteomics datasets from the Human Protein Atlas and the Human Proteome Map. Literature-derived association scores were obtained from over 100 million biomedical documents. (C) Selected tissues and cell types identified by one or more modalities as expressing ACE2, the putative receptor of the SARS-CoV-2 spike protein. Image template: https://www.proteomicsdb.org/.
Figure 1—figure supplement 1.  Validation of metrics used to assess literature-derived associations.
(A)-(B) 1-d logistic model predicting true vs random concept pair associations from (A) the cosine score and (B) the Exponential Local Score. The extent of separation between the true associations (green) and the random associations (red) indicates the extent to which each score captures known concept associations. (C) Normalized histograms of co-occurrence counts (on a logarithmic scale) for high-cosine vs low-cosine token pairs. (D) Distribution of cosines between gene-disease token vector pairs vs null distributions of cosines between pairs of random 300-d vectors. (E) Null cosine distribution between two random vectors as the dimension of the vectors varies.
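The dimension dependence described in panel (E) can be illustrated with a minimal sketch. It is not the procedure used to generate the figure: it assumes that the random vectors have independent standard normal components, and the function names `cosine` and `null_cosines`, the number of pairs, and the seed are hypothetical choices made for illustration. Under that assumption the null cosines are centered at 0 with a standard deviation of 1/sqrt(d), so the null distribution narrows as the dimension d grows.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def null_cosines(dim, n_pairs=2000, seed=0):
    """Sample cosines between pairs of random vectors of dimension `dim`.

    Assumption (not from the source): each component is drawn
    independently from a standard normal distribution.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n_pairs):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        samples.append(cosine(u, v))
    return samples


if __name__ == "__main__":
    # Compare the empirical spread with the theoretical value 1/sqrt(dim).
    for dim in (10, 50, 300):
        sample = null_cosines(dim)
        print(dim, statistics.pstdev(sample), 1.0 / math.sqrt(dim))
```

For 300-d vectors the theoretical spread is about 0.058, so under this assumption a cosine well above that value between a gene-disease vector pair would be unlikely to arise from random vectors alone.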