Introduction
Since December 2019, the SARS-CoV-2 virus has been rapidly spreading across the globe. The associated disease (COVID-19) has been declared a pandemic by the WHO, with over 5 million confirmed cases and over 300,000 deaths globally as of May 23, 2020 (Johns Hopkins Coronavirus Resource Center, 2020). The constellation of symptoms, ranging from acute respiratory distress syndrome (ARDS) to gastrointestinal issues, is similar to that observed in the 2002 severe acute respiratory syndrome (SARS) epidemic and the 2012 Middle East respiratory syndrome (MERS) outbreak. SARS, MERS, and COVID-19 are all caused by coronaviruses (CoVs), which derive their name from the crown-like spike proteins protruding from the viral envelope. Coronavirus infection is driven by the attachment of the viral spike protein to specific human cell-surface receptors: ACE2 for SARS-CoV-2 and SARS-CoV (Zhou et al., 2020a; Li et al., 2003; Hofmann et al., 2005), DPP4 for MERS-CoV (Raj et al., 2013), and ANPEP for specific α-coronaviruses (Yeager et al., 1992). In addition to these receptors, the protease activity of TMPRSS2 has also been implicated in viral entry (Hoffmann et al., 2020; Gierer et al., 2013).
In a recent clinical study of COVID-19 patients from China, 48% of the 191 infected patients studied had comorbidities such as hypertension and diabetes (Zhou et al., 2020b). Epidemiological and clinical investigations on COVID-19 patients have also suggested fecal viral shedding and gastrointestinal infection (Xu et al., 2020a; Gu et al., 2020; Xiao et al., 2020). In the case of the earlier SARS epidemic, multiple organ damage involving lung, kidney, and heart was reported (Yang et al., 2010). The mechanisms by which various comorbidities impact the clinical course of infections and the reasons for the observed multi-organ phenotypes are still not well understood. Thus, there is an urgent need to conduct a comprehensive pan-tissue profiling of ACE2, the putative human receptor for SARS-CoV-2.
Deep profiling of ACE2 expression in the human body demands a platform that synthesizes biomedical insights encompassing multiple scales, modalities, and pathologies described across the scientific literature and various omics silos. With the exponential growth of scientific (e.g. PubMed, preprints, grants), translational (e.g. clinicaltrials.gov), and other (e.g. patents) biomedical knowledge bases, a fundamental requirement is to recognize nuanced scientific phraseology and measure the strength of association between all possible pairs of such phrases. Such a holistic map of associations will provide insights into the knowledge harbored in the world’s biomedical literature.
While unsupervised machine learning has been advanced to study the semantic relationships between word embeddings (Mikolov et al., 2013a; LeCun et al., 2015) and applied to the materials science corpus (Tshitoyan et al., 2019), this has not been scaled up to extract the ‘global context’ of conceptual associations from the entirety of publicly available unstructured biomedical text. Additionally, a principled way of accounting for the distances between phrases captured from the ever-growing scientific literature has not been comprehensively researched to quantify the strength of ‘local context’ between pairs of biological concepts. Given the propensity for irreproducible or erroneous scientific research (Nature Editorial, 2016), any local or global signals extracted from this unstructured knowledge need to be seamlessly triangulated with deep biological insights emergent from various omics data silos.
The nferX software is a cloud-based platform that enables users to dynamically query the universe of possible conceptual associations from over 100 million biomedical documents, including the COVID-19 Open Research Dataset recently announced by the White House (The White House, 2020; Figure 1). An unsupervised neural network is used to recognize and preserve complex biomedical phraseology as 300 million searchable tokens, beyond the simpler words that have generally been explored previously using higher-dimensional word embeddings (Mikolov et al., 2013a). Our local context score is derived from the pointwise mutual information content between pairs of these tokens and can be retrieved dynamically. Our global context score is derived using word2vec (Mikolov et al., 2013a), as the cosine similarity between 180 million word vectors projected in a 300-dimensional space (Figure 1A, Figure 1—figure supplement 1).
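To make the two scores concrete, the following minimal sketch computes a plain pointwise mutual information value from co-occurrence counts and a cosine similarity between two vectors. It is an illustration only and not the nferX implementation: the function names, the window-based counting scheme, and the toy numbers are all assumptions, and the platform's actual local score is described only as being derived from pointwise mutual information.

```python
import math


def pmi(pair_count, count_a, count_b, n_windows):
    """Pointwise mutual information (in bits) between tokens a and b.

    Probabilities are estimated as counts divided by the number of text
    windows scanned (a hypothetical counting scheme for this sketch).
    """
    if pair_count == 0:
        return float("-inf")  # the two tokens never co-occur
    p_ab = pair_count / n_windows
    p_a = count_a / n_windows
    p_b = count_b / n_windows
    return math.log2(p_ab / (p_a * p_b))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy counts: the pair co-occurs 25 times more often than chance predicts.
    print(pmi(pair_count=50, count_a=100, count_b=200, n_windows=10_000))
    # Toy 3-d vectors standing in for 300-d embeddings.
    print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.1]))
```

In this sketch, a PMI near zero means the two tokens co-occur about as often as expected by chance, while a cosine near 1 means their embedding vectors point in nearly the same direction.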
Figure 1.  Knowledge synthesis and the nferX Single Cell resource.
(A) Knowledge synthesis: capturing associations between concepts from over 100 million documents. Schematic shows the workflow for generating literature-derived associations between phrases. Local score and global score are defined and the types of literature-derived associations are shown for combinations of high and low local and global scores. (B) Datasets enabling knowledge synthesis-powered scRNAseq analysis platform (https://academia.nferx.com/). Single-cell RNA-seq data was obtained from publicly available human and mouse single-cell RNA-seq datasets. Bulk RNA-seq data was obtained from Gene Expression Omnibus (GEO) and the Genotype-Tissue Expression (GTEx) project portal. Protein-level expression of coronavirus receptors was assessed using a collection of immunohistochemistry (IHC) images and tissue proteomics datasets from the Human Protein Atlas and the Human Proteome Map. Literature-derived association scores are obtained from over 100 million biomedical documents. (C) Highlighting selected tissues and cell types identified by one or more modalities to express ACE2, the putative receptor of SARS-CoV-2 spike protein. Image template: https://www.proteomicsdb.org/.
Figure 1—figure supplement 1.  Validation of metrics used to assess literature-derived associations.
(A)-(B) 1-d logistic model predicting true vs random concept pair associations from (A) cosine score and (B) Exponential Local Score. Extent of separation between the green true associations and red random associations indicates extent to which score captures known concept associations. (C) Normalized histograms of co-occurrence counts (on logarithmic scale) for high-cosine vs low-cosine token pairs. (D) Distribution of cosines between gene-disease token vector pairs vs null distributions of cosines between pairs of random 300-d vectors. (E) Null cosine distribution between two random vectors, as the dimension of the vectors varies.
In order to assess the veracity of these conceptual associations derived from biomedical literature, it is absolutely essential to enable triangulation with structured data sources including gene and protein expression datasets. To address this need and empower the scientific community, we built a Single Cell RNA-seq (scRNAseq) resource (https://academia.nferx.com/) which harnesses these local and global score metrics to enable seamless integration of literature-derived associations with the analysis of transcriptomes from over 2.2 million individual cells from over 50 human and mouse tissue-types (Figure 1B). Here, we use this first-in-class resource to conduct a comprehensive expression profiling of ACE2 across host tissues and cell types and discuss how the observed expression patterns correlate with the pathogenicity and viral transmission shaping the ongoing COVID-19 pandemic (Figure 1C).
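As an aside on the null model mentioned in the figure supplement (cosines between pairs of random vectors as the dimension varies), the short simulation below sketches why a given cosine value is far more informative in 300 dimensions than in a handful. It assumes independent standard Gaussian components, which is one common choice of null and not necessarily the one used by the authors; the function names and sample sizes are hypothetical.

```python
import math
import random
import statistics


def random_cosine(dim, rng):
    """Cosine between two independent Gaussian random vectors of size dim."""
    u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def null_cosine_spread(dim, n_pairs=2000, seed=0):
    """Standard deviation of the simulated null cosine distribution."""
    rng = random.Random(seed)  # seeded for reproducibility
    cosines = [random_cosine(dim, rng) for _ in range(n_pairs)]
    return statistics.pstdev(cosines)


if __name__ == "__main__":
    for dim in (3, 30, 300):
        # Compare the simulated spread with the 1/sqrt(dim) reference value.
        print(dim, null_cosine_spread(dim), 1 / math.sqrt(dim))
```

For isotropic random vectors the null cosine has mean zero and variance 1/d, so the spread shrinks roughly as 1/sqrt(d); under this assumed null, random pairs in a 300-dimensional space rarely produce cosines far from zero.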