Id |
Subject |
Object |
Predicate |
Lexical cue |
T308 |
0-21 |
Sentence |
denotes |
Materials and methods |
T309 |
23-97 |
Sentence |
denotes |
Unstructured biomedical knowledge synthesis and triangulation capabilities |
T310 |
98-281 |
Sentence |
denotes |
In order to capture biomedical literature-based associations, the nferX platform defines two scores: a ‘local score’ and a ‘global score’, as described previously (Park et al., 2020). |
T311 |
282-618 |
Sentence |
denotes |
Briefly, the local score is obtained by applying a traditional natural language processing technique that captures the strength of association between two concepts in a selected corpus of biomedical literature, based on the frequency of their co-occurrence normalized by the frequency of each individual concept throughout the corpus.
T312 |
619-809 |
Sentence |
denotes |
A higher local score between Concept X and Concept Y indicates that these concepts are mentioned in close proximity to each other more frequently than would be expected by chance.
T313 |
810-957 |
Sentence |
denotes |
The global score, on the other hand, is based on the neural network renaissance that has recently taken place in Natural Language Processing (NLP). |
T314 |
958-1088 |
Sentence |
denotes |
To compute global scores, all tokens (e.g. words and phrases) are projected into a high-dimensional vector space of word embeddings.
T315 |
1089-1188 |
Sentence |
denotes |
These vectors serve to represent the ‘neighborhood’ of concepts which occur around a given concept. |
T316 |
1189-1414 |
Sentence |
denotes |
The cosine similarity between any two vectors measures the similarity of these neighborhoods and is the basis for our global score metric, where concepts which are more similar in this vector space have a higher global score. |
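As an illustrative sketch of this global-score idea, the cosine similarity between embedding vectors can be computed as below (toy 4-dimensional vectors stand in for the learned 300-dimensional word embeddings; the token names and values are made up):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d vectors standing in for learned word embeddings (values made up).
emb = {
    "ace2": np.array([0.9, 0.1, 0.3, 0.0]),
    "tmprss2": np.array([0.8, 0.2, 0.4, 0.1]),
    "kidney": np.array([0.1, 0.9, 0.0, 0.3]),
}

# Tokens with similar 'neighborhoods' get similar vectors, hence a
# higher cosine similarity, hence a higher global score.
sim_gene = cosine_similarity(emb["ace2"], emb["tmprss2"])
sim_other = cosine_similarity(emb["ace2"], emb["kidney"])
```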
T317 |
1415-1655 |
Sentence |
denotes |
While the global scores in this work are computed in the embedding space of a word2vec model, they can also be computed in the embedding space of any deep learning model, including recent transformer-based models like BERT (Devlin et al., 2019).
T318 |
1656-1817 |
Sentence |
denotes |
These may offer benefits complementary to word2vec embeddings, since transformer embeddings are context-sensitive, producing different vectors for different sentence contexts.
T319 |
1818-2056 |
Sentence |
denotes |
However, despite the context-sensitive nature of BERT embeddings, a global score computation for a phrase may still be of value, given that the score is computed across sentence embeddings capturing the context-sensitive nature of those phrases.
T320 |
2057-2283 |
Sentence |
denotes |
From a visualization perspective, the local score and global score (‘Signals’) are represented in the platform using bubbles where bubble size corresponds to the local score and color intensity corresponds to the global score. |
T321 |
2284-2409 |
Sentence |
denotes |
This allows users to rapidly determine the strength of association between any two concepts throughout biomedical literature. |
T322 |
2410-2568 |
Sentence |
denotes |
We consider concepts which show both high local and global scores to be ‘concordant’ and have found that these typically recapitulate well-known associations. |
T323 |
2569-2722 |
Sentence |
denotes |
One key aspect of the nferX platform is that it allows the user to query associated concepts for a virtually unbounded number of possible query concepts. |
T324 |
2723-2765 |
Sentence |
denotes |
This is achieved by means of two features: |
T325 |
2766-3018 |
Sentence |
denotes |
Firstly, the nferX platform allows the user to compose queries using the logical AND, OR, and NOT operators to combine any number of biomedical concepts in a query, each combination amounting to a gross or nuanced composite biomedical concept.
T326 |
3019-3521 |
Sentence |
denotes |
Secondly, since logical combinations yield a virtually unbounded number of biomedical concepts that can be queried, the nferX platform implements a completely dynamic method of computing local scores on the fly, using novel high-performance parallel and distributed algorithms that scan hundreds of millions of documents in real time to locate text fragments related to the user query and count co-occurring biomedical concepts, from which strength-of-association scores and their significances are computed.
T327 |
3522-3870 |
Sentence |
denotes |
The platform further leverages statistical inference to calculate ‘enrichments’ based on structured data, thus enabling real-time triangulation of signals from the unstructured biomedical knowledge graph with various other structured databases (e.g. curated ontologies, RNA-sequencing datasets, human genetic associations, protein-protein interactions).
T328 |
3871-4042 |
Sentence |
denotes |
This facilitates unbiased hypothesis-free learning and faster pattern recognition, and it allows users to more holistically determine the veracity of concept associations. |
T329 |
4043-4249 |
Sentence |
denotes |
Finally, the platform allows the user to identify and further examine the documents and textual fragments from which the knowledge synthesis signals are derived using the Documents and Signals applications. |
T330 |
4251-4269 |
Sentence |
denotes |
Association scores |
T331 |
4270-4602 |
Sentence |
denotes |
Having a method that automatically consumes a corpus and computes a numeric score capturing the strength of association between any pair of entities is clearly beneficial: given any entity, its association scores with all other entities can then be sorted to produce a ranked list of associated entities.
T332 |
4603-4727 |
Sentence |
denotes |
The number of times two entities mutually co-occur in ‘small’ vicinities of a corpus is the basis of all association scores. |
T333 |
4728-4911 |
Sentence |
denotes |
One popular traditional measure for association strength between tokens in text is pointwise mutual information, or PMI (Evert, 2005), which we consider in several association scores. |
T334 |
4913-4936 |
Sentence |
denotes |
Measures of association |
T335 |
4937-5067 |
Sentence |
denotes |
Formally, an association score is some real-valued function S(q, t) where q is a query token/entity and t is another token/entity. |
T336 |
5068-5151 |
Sentence |
denotes |
One important notion, the ‘vicinity’ of q, we formally denote as the Context of q:
T337 |
5152-5231 |
Sentence |
denotes |
The context of q consists of those corpus segments deemed to be ‘near’ or ‘local’ to q.
T338 |
5232-5513 |
Sentence |
denotes |
For single-token queries (where q is a single entity and not a logical combination of entities), q’s context consists of all corpus segments that are ‘windows’ formed by taking words within a distance of w words (usually a tunable parameter) from an occurrence of q in the corpus.
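The window construction described here can be sketched as follows (a simplified stand-in that works on a flat token list, merges overlapping windows, and excludes occurrences of q itself; the corpus text is illustrative):

```python
def context_of(tokens: list, q: str, w: int) -> list:
    """Tokens within w positions of any occurrence of q (occurrences
    of q themselves excluded); overlapping windows are merged."""
    keep = set()
    for i, tok in enumerate(tokens):
        if tok == q:
            keep.update(j for j in range(max(0, i - w), min(len(tokens), i + w + 1))
                        if tokens[j] != q)
    return [tokens[j] for j in sorted(keep)]

corpus = "egfr mutation drives lung cancer while egfr testing guides therapy".split()
ctx = context_of(corpus, "egfr", 2)
```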
T339 |
5514-5706 |
Sentence |
denotes |
The dynamic adjacency engine generalizes this notion of context in a natural way to logical queries: the context for a logical q can be generalized as a certain set of fixed-length fragments. |
T340 |
5708-5722 |
Sentence |
denotes |
Co-occurrences |
T341 |
5723-5786 |
Sentence |
denotes |
This is just the number of times t appears in the context of q. |
T342 |
5788-5803 |
Sentence |
denotes |
Traditional PMI |
T343 |
5804-5831 |
Sentence |
denotes |
This is log(p(t | q)/p(t)). |
T344 |
5832-6088 |
Sentence |
denotes |
Here p(t | q) is the number of times t occurs in the context of q (i.e. co-occurrences of t and q) divided by the total length of all q contexts in the corpus, whereas p(t) is the number of occurrences of t in the entire corpus divided by the corpus length.
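These definitions can be combined into a toy PMI computation (overlapping windows are merged into a single context set, and the corpus is illustrative):

```python
import math

def pmi(tokens: list, q: str, t: str, w: int) -> float:
    """Traditional PMI = log(p(t|q) / p(t)); p(t|q) is estimated over
    the merged context windows of q, p(t) over the whole corpus."""
    n = len(tokens)
    ctx = set()
    for i, tok in enumerate(tokens):
        if tok == q:
            ctx.update(j for j in range(max(0, i - w), min(n, i + w + 1))
                       if tokens[j] != q)
    coocc = sum(1 for j in ctx if tokens[j] == t)
    p_t_given_q = coocc / len(ctx)       # co-occurrences / total context length
    p_t = tokens.count(t) / n            # corpus frequency of t
    return math.log(p_t_given_q / p_t)

corpus = "ace2 receptor binds virus ace2 receptor kidney tissue sample data".split()
```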
T345 |
6090-6116 |
Sentence |
denotes |
Word2vec cosine similarity |
T346 |
6117-6260 |
Sentence |
denotes |
The popular word2vec algorithm (Mikolov et al., 2013b) generates a vector (we use a 300-dimensional representation) for each token in a corpus.
T347 |
6261-6348 |
Sentence |
denotes |
The purpose of these vectors is usually to be used as features in downstream NLP tasks. |
T348 |
6349-6390 |
Sentence |
denotes |
But they can also be used to measure similarity.
T349 |
6391-6556 |
Sentence |
denotes |
The original paper validates the vectors by testing them on word similarity tasks: the association score is the cosine between the vector for q and the vector for t. |
T350 |
6557-6599 |
Sentence |
denotes |
This score only applies to single-token q. |
T351 |
6601-6630 |
Sentence |
denotes |
Exponential mask PMI (ExpPMI) |
T352 |
6631-6668 |
Sentence |
denotes |
This is our first new proposed score. |
T353 |
6669-6751 |
Sentence |
denotes |
PMI treats every position in a binary way: it’s either in the context of q or not. |
T354 |
6752-6902 |
Sentence |
denotes |
With a window size of say 50, a token which appears three words from a query q and a token which appears 45 words from a query q are treated the same. |
T355 |
6903-7075 |
Sentence |
denotes |
We thought it might be useful to consider a measure which distinguishes positions in the context based on the number of words away that position is from an occurrence of q. |
T356 |
7076-7161 |
Sentence |
denotes |
We did this by weighting the positions in the context by some weight between 0 and 1. |
T357 |
7162-7299 |
Sentence |
denotes |
Our weighting is based on an exponential decay (which has some nice properties especially when we extend to the case of logical queries). |
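A minimal sketch of such exponential-decay position weighting (the decay constant 0.9 is an arbitrary illustrative choice, not the platform's actual parameter):

```python
def exp_weights(tokens: list, q: str, w: int, decay: float = 0.9) -> list:
    """Weight each corpus position by decay**d, where d is its distance
    to the nearest occurrence of q; positions outside the window of
    width w, or holding q itself, get weight 0."""
    occ = [i for i, tok in enumerate(tokens) if tok == q]
    weights = []
    for j, tok in enumerate(tokens):
        d = min((abs(j - i) for i in occ), default=w + 1)
        weights.append(decay ** d if tok != q and d <= w else 0.0)
    return weights
```

A weighted co-occurrence count for a token t is then just the sum of these weights at the positions where t occurs, rather than a flat count.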
T358 |
7301-7312 |
Sentence |
denotes |
Local score |
T359 |
7313-7348 |
Sentence |
denotes |
This is another new proposed score. |
T360 |
7349-7462 |
Sentence |
denotes |
We find that PMI and ExpPMI can vary a lot for small samples (i.e. small numbers of co-occurrences, occurrences). |
T361 |
7463-7602 |
Sentence |
denotes |
The Local Score is log(coocc) * sigmoid(PMI - 0.5), constructed to correct for this; we found that this formula works well empirically.
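The formula can be written directly (a sketch; the coocc and PMI inputs would come from the co-occurrence machinery described above):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def local_score(coocc: float, pmi: float) -> float:
    """Local Score = log(coocc) * sigmoid(PMI - 0.5). The log term
    damps high-PMI pairs backed by only a handful of co-occurrences."""
    return math.log(coocc) * sigmoid(pmi - 0.5)
```

The ExpLocalScore variant described below simply substitutes weighted_coocc and ExpPMI for the raw count and PMI in the same formula.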
T362 |
7604-7648 |
Sentence |
denotes |
Exponential mask local score (ExpLocalScore) |
T363 |
7649-7761 |
Sentence |
denotes |
We apply both modifications together: the exponential mask local score is log(weighted_coocc) * sigmoid(ExpPMI - 0.5).
T364 |
7762-7839 |
Sentence |
denotes |
Here weighted_coocc is the sum of the weights of the corpus positions at which t occurs.
T365 |
7841-7892 |
Sentence |
denotes |
Evaluation of literature-derived association scores |
T366 |
7893-7974 |
Sentence |
denotes |
We need a notion of ground truth to evaluate the quality of association measures. |
T367 |
7975-8095 |
Sentence |
denotes |
We use sets of known pairs of related entities versus a ‘control’ group of random pairs of entities of the same classes. |
T368 |
8096-8139 |
Sentence |
denotes |
We use a few different sets of known pairs: |
T369 |
8140-8200 |
Sentence |
denotes |
Disease-Gene relationships based on OMIM (Park et al., 2020) |
T370 |
8201-8234 |
Sentence |
denotes |
Drug-Gene relationships (Table 1) |
T371 |
8235-8281 |
Sentence |
denotes |
Drug-Disease relationships based on FDA labels |
T372 |
8282-8318 |
Sentence |
denotes |
Drugs and their on-label indications |
T373 |
8319-8358 |
Sentence |
denotes |
Drugs and their on-label adverse events |
T374 |
8359-8395 |
Sentence |
denotes |
Logical queries for ambiguous tokens |
T375 |
8396-8525 |
Sentence |
denotes |
One demonstration of the use of the logical query system is to disambiguate a token by conjoining it with a disambiguating token. |
T376 |
8526-8699 |
Sentence |
denotes |
An example makes this clearer: the token ‘egfr’ can refer to the gene entity epidermal growth factor receptor, but also the test measure entity estimated glomerular filtration rate.
T377 |
8700-8819 |
Sentence |
denotes |
A query ‘egfr AND kidney’ should return results related to the latter meaning, while ‘egfr AND lung_cancer’ the former. |
T378 |
8820-8917 |
Sentence |
denotes |
In particular, an unambiguous referent to the right entity should be highly related to the query. |
T379 |
8918-9083 |
Sentence |
denotes |
So example known pairs in this data are (‘egfr AND kidney’, ‘estimated_glomerular_filtration_rate’) and (‘egfr AND lung_cancer’, ‘epidermal_growth_factor_receptor’). |
T380 |
9084-9188 |
Sentence |
denotes |
We used an internal set of ~200–300 such (‘A AND B’, ‘C’) pairs (originally built up for other reasons). |
T381 |
9189-9194 |
Sentence |
denotes |
Note: |
T382 |
9195-9521 |
Sentence |
denotes |
One key drawback of the word2vec vector cosine similarity (Park et al., 2020; Mikolov et al., 2013b) method is its inability to get scores for logical queries as described above, because the method (Mikolov et al., 2013b) does not address the question of how to get vectors for queries that are logical combinations of tokens. |
T383 |
9523-9541 |
Sentence |
denotes |
Evaluation metrics |
T384 |
9542-9706 |
Sentence |
denotes |
Given a scoring method and a particular set of positive/control pairs, we get two sets of scores: one set for the positive pairs and one set for the negative pairs. |
T385 |
9707-9717 |
Sentence |
denotes |
Cohen’s d: |
T386 |
9718-9822 |
Sentence |
denotes |
We compute Cohen’s d, a standard statistical measure of the distance between two samples (Cohen’s D, 2016).
T387 |
9823-10010 |
Sentence |
denotes |
Mann-Whitney U (normalized): The Mann-Whitney U is a nonparametric measure of distribution distance: it counts the number of transposed pairs (Contributors to Wikimedia projects, 2004).
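As a sketch of the first metric, Cohen's d between the positive-pair and control-pair score samples (a pooled standard deviation is assumed here, one common convention; the exact pooling is not spelled out in the text):

```python
import statistics

def cohens_d(pos: list, neg: list) -> float:
    """Cohen's d between two score samples, using a pooled sample
    standard deviation (assumed convention)."""
    m1, m2 = statistics.fmean(pos), statistics.fmean(neg)
    v1, v2 = statistics.variance(pos), statistics.variance(neg)
    n1, n2 = len(pos), len(neg)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd
```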
T388 |
10012-10058 |
Sentence |
denotes |
Metrics based on training a 1-d logistic model |
T389 |
10059-10171 |
Sentence |
denotes |
In this test, we are discriminating between two classes (true association/non-association) based on one feature. |
T390 |
10172-10283 |
Sentence |
denotes |
We have two metrics based on fitting a 1-feature logistic curve to the data (Figure 1—figure supplement 1A–B).
T391 |
10284-10296 |
Sentence |
denotes |
Brier score: |
T392 |
10297-10538 |
Sentence |
denotes |
The Brier score is the average squared error of the logistic curve above: that is, for each labeled point, we square the vertical distance to the logistic curve, and average over all labeled points (Contributors to Wikimedia projects, 2005). |
T393 |
10539-10567 |
Sentence |
denotes |
Log loss (dansbecker, 2018): |
T394 |
10568-10667 |
Sentence |
denotes |
The logistic log loss is the average -log [model probability of true label] for each labeled point. |
T395 |
10668-10724 |
Sentence |
denotes |
If the model is perfect at the point, it incurs no loss. |
T396 |
10725-10770 |
Sentence |
denotes |
If it predicts 0.5, it incurs -log[0.5] loss. |
T397 |
10771-10933 |
Sentence |
denotes |
If it predicts ‘yes’ with certainty when the answer is ‘no’, it incurs infinite loss (a logistic function never touches 0 or 1, so this won’t happen in our case).
T398 |
10934-10953 |
Sentence |
denotes |
Neg log percentile: |
T399 |
10954-11040 |
Sentence |
denotes |
For most of the scoring rules, we also include a -log(percentile) version of the rule. |
T400 |
11041-11113 |
Sentence |
denotes |
This is constructed as follows, for query q, token t, and score S(q, t): |
T401 |
11114-11168 |
Sentence |
denotes |
Compute the scores S(q, t’) for q with every token t’. |
T402 |
11169-11215 |
Sentence |
denotes |
Let R be the number of these that are nonzero. |
T403 |
11216-11270 |
Sentence |
denotes |
Take the rank r of S(q, t) among all nonzero S(q, t’). |
T404 |
11271-11340 |
Sentence |
denotes |
The neg log percentile score nlS(q, t) associated with S is -log(r/R).
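The steps above can be sketched as follows (assuming rank 1 is the highest score, so strongly associated tokens get large nlS values; ties here simply take the best rank):

```python
import math

def neg_log_percentile(scores: dict, t: str) -> float:
    """-log(r/R): R = number of nonzero scores S(q, t'), r = rank of
    S(q, t) among them (rank 1 = highest; ties take the best rank)."""
    nonzero = sorted((s for s in scores.values() if s != 0), reverse=True)
    R = len(nonzero)
    r = 1 + nonzero.index(scores[t])
    return -math.log(r / R)

# S(q, t') for one query q over a toy vocabulary.
scores = {"a": 5.0, "b": 2.0, "c": 0.0, "d": 1.0}
```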
T405 |
11341-11355 |
Sentence |
denotes |
We do this to: |
T406 |
11356-11394 |
Sentence |
denotes |
control for differences across queries |
T407 |
11395-11504 |
Sentence |
denotes |
control for differences in the shapes of the distributions that different association scoring functions take. |
T408 |
11505-11576 |
Sentence |
denotes |
This procedure maps all the S(q, t’) to an Exponential(1) distribution. |
T409 |
11577-11718 |
Sentence |
denotes |
We chose Exponential(1) because it is simple, intuitively reasonable and many of the scores naturally seemed to be approximately exponential. |
T410 |
11720-11804 |
Sentence |
denotes |
High-dimensional word embeddings for determining the significant global associations |
T411 |
11805-12214 |
Sentence |
denotes |
Figure 1—figure supplement 1C illustrates two histograms generated from a random set of vectors (in the vector space generated by the Neural Network) where one distribution represents all vector pairs whose cosine similarity is less than 0.32 (deemed ‘not strong associations’) and the other distribution represents all vector pairs whose cosine similarity is greater than 0.32 (deemed ‘strong associations’). |
T412 |
12215-12375 |
Sentence |
denotes |
This shows how common it is to find word vector pairs that have very good cosine similarity values yet never co-occur even once in the corpus.
T413 |
12376-12588 |
Sentence |
denotes |
The ‘cosine similarity >= 0.32’ bar at the zero value suggests that roughly 11% of vector pairs whose cosine similarity was greater than 0.32 (‘strong associations’) never occurred together even once in a document.
T414 |
12589-12937 |
Sentence |
denotes |
It is also clear from the figure that, although more of the mass of the ‘cosine similarity >= 0.32’ distribution is skewed to the right as expected (more co-occurrences and hence, unsurprisingly, larger cosine similarity values), there is a long tail of the ‘cosine similarity < 0.32’ distribution (very high co-occurrences but small cosine similarity).
T415 |
12938-13157 |
Sentence |
denotes |
The long tail is a direct consequence of negative sampling—where vectors corresponding to common words that co-occur quite often with significant words in a sliding window are moved away from vectors of the other words. |
T416 |
13159-13252 |
Sentence |
denotes |
What does the word2vec neural network do from the perspective of Genes-Diseases associations? |
T417 |
13253-13571 |
Sentence |
denotes |
One way to view the word2vec ‘black box’ operation from a Genes/Diseases perspective (cosine of <Gene, Disease> for all Genes and Diseases) is as a Transfer Function which changed the input probability distribution (pre-training randomly assigned word vectors for Genes and Diseases) to a new probability distribution. |
T418 |
13572-13780 |
Sentence |
denotes |
The ‘null hypothesis’ (which is well preserved in practice by the way word2vec initially assigns random values to vectors) is the ‘green colored’ Cosine Distribution (Figure 1—figure supplement 1D).
T419 |
13781-14034 |
Sentence |
denotes |
Once word2vec training is over, the final word vectors are placed in specific positions in the 300-dimensional space so as to present the ‘blue colored’ Empirical distribution (the actual cosine similarity between <Gene, Disease> pairs that we observe). |
T420 |
14035-14219 |
Sentence |
denotes |
The ‘orange curve’ is the 2-Gamma mixture (the parametric distribution that captures the ‘empirical distribution’ with just eight parameters: two alphas, two betas, two ts, and two phis).
T421 |
14220-14252 |
Sentence |
denotes |
Observations from this analysis: |
T422 |
14253-14361 |
Sentence |
denotes |
Note that the ‘symmetrical’ cosine distribution after training becomes ‘asymmetrical’ with a longer right tail.
T423 |
14362-14465 |
Sentence |
denotes |
The asymmetry is the reason why Gamma distribution worked better than say, Gaussian, for the curve fit. |
T424 |
14466-14945 |
Sentence |
denotes |
The mean of the distribution shifts to the right after training, as one would expect: during training the vectors are predominantly ‘brought together’ by parallelogram addition, explaining the rightward shift. (Negative sampling causes movement in the opposite direction, but it disproportionately affects the ultra-high-frequency words, which also get ‘more’ positively sampled; hence the 3-gamma with a bump near 0.6 occurs for ultra-high-frequency words.)
T425 |
14946-15032 |
Sentence |
denotes |
The most interesting associations, by definition, are in the tail of the distribution. |
T426 |
15034-15179 |
Sentence |
denotes |
What does varying the number of dimensions in the word2vec space do to the underlying cosine similarity distributions in a large textual corpus?
T427 |
15180-15395 |
Sentence |
denotes |
Figure 1—figure supplement 1E illustrates a cosine similarity probability density function (PDF) graph to visually describe the implementation of the word2vec-like Vector Space Model in various N-dimensional spaces. |
T428 |
15396-15756 |
Sentence |
denotes |
As described in the Materials and methods section, the system is a Semantic Bio-Knowledge Graph of nodes representing the words/phrases chosen to be represented as vectors and edge weights determined by measures of Semantic Association Strength (e.g. the cosine similarity between a pair of word embeddings represented as vectors in a large dimensional space). |
T429 |
15757-15874 |
Sentence |
denotes |
The cosine similarity ranges from 0 (representing no semantic association) to 1 (representing strongest association). |
T430 |
15875-15982 |
Sentence |
denotes |
This metric of association can reflect the contextual similarity of the entities in the Biomedical Corpora. |
T431 |
15983-16092 |
Sentence |
denotes |
The typical dimensionality used by our neural network for generating the Global Scores is n = 300 dimensions. |
T432 |
16093-16307 |
Sentence |
denotes |
This is because, as can be seen in the graph, the distribution is highly peaked with most of the mass centered around 0; that is, a randomly chosen pair of vectors is typically orthogonal or close to orthogonal.
T433 |
16308-16453 |
Sentence |
denotes |
Furthermore, at 300 dimensions, the distributions all have sufficiently long tails containing the most interesting (salient) biomedical associations.
T434 |
16455-16492 |
Sentence |
denotes |
Single-cell RNA-seq analysis platform |
T435 |
16493-16618 |
Sentence |
denotes |
The objective of the single cell platform is to enable dynamic visualization and analysis of single-cell RNA-sequencing data. |
T436 |
16619-16948 |
Sentence |
denotes |
Currently, there are over 30 scRNAseq studies available for analysis in the Single Cell app, including studies from human donors/patients covering tissues such as adipose tissue, blood, bone marrow, colon, esophagus, liver, lung, kidney, ovary, nasal epithelium, pancreas, placenta, prostate, retina, small intestine, and spleen. |
T437 |
16949-17101 |
Sentence |
denotes |
Because no pan-tissue reference dataset yet exists for humans, we have manually selected individual studies to maximally cover the set of human tissues. |
T438 |
17102-17269 |
Sentence |
denotes |
In some cases, these studies contain cells from both healthy donors and patients affected by a specified pathology such as ulcerative colitis (colon) or asthma (lung). |
T439 |
17270-17621 |
Sentence |
denotes |
There are also a number of murine scRNAseq studies covering tissues including adipose tissue, airway epithelium, blood, bone marrow, brain, breast, colon, heart, kidney, liver, lung, ovary, pancreas, placenta, prostate, skeletal muscle, skin, spleen, stomach, small intestine, testis, thymus, tongue, trachea, urinary bladder, uterus, and vasculature. |
T440 |
17622-17721 |
Sentence |
denotes |
Note that two of these murine studies (Tabula Muris and Mouse Cell Atlas) include ~20 tissues each. |
T441 |
17723-17759 |
Sentence |
denotes |
Single-cell data processing pipeline |
T442 |
17760-17944 |
Sentence |
denotes |
For each study, a counts matrix was downloaded from a public data repository such as the Gene Expression Omnibus (GEO) or the Broad Institute Single Cell Portal (Supplementary file 1). |
T443 |
17945-18154 |
Sentence |
denotes |
Note that this data has not been re-processed from the raw sequencing output, and so it is likely that alignment and quantification of gene expression was performed using different tools for different studies. |
T444 |
18155-18248 |
Sentence |
denotes |
In some cases, multiple complementary datasets have been generated from a single publication. |
T445 |
18249-18328 |
Sentence |
denotes |
In these cases, we have generated separate entries in the Single Cell platform. |
T446 |
18329-18337 |
Sentence |
denotes |
Table 1. |
T447 |
18339-18361 |
Sentence |
denotes |
Results of evaluation. |
T448 |
18362-18415 |
Sentence |
denotes |
Performance of approximately 2100 disease-gene pairs. |
T449 |
18416-18512 |
Sentence |
denotes |
Assoc score↓ Cohen’s d (+) Mann-W U norm. (-) Logistic log loss (-) Logistic Brier score (-) |
T450 |
18513-18551 |
Sentence |
denotes |
Cosine (w2v) 1.31 0.197 0.51 0.168 |
T451 |
18552-18587 |
Sentence |
denotes |
Raw PMI 2.07 0.0953 0.374 0.116 |
T452 |
18588-18636 |
Sentence |
denotes |
Raw PMI -log(pctile) 2.15 0.0947 0.355 0.111 |
T453 |
18637-18672 |
Sentence |
denotes |
Exp PMI 2.17 0.0897 0.356 0.109 |
T454 |
18673-18721 |
Sentence |
denotes |
Exp PMI -log(pctile) 2.21 0.0903 0.341 0.105 |
T455 |
18722-18766 |
Sentence |
denotes |
Raw Local Score 2.35 0.0828 0.312 0.0947 |
T456 |
18767-18824 |
Sentence |
denotes |
Raw Local Score -log(pctile) 2.28 0.0832 0.317 0.0963 |
T457 |
18825-18873 |
Sentence |
denotes |
Exp Local Score 2.34 0.0812 *0.301 *0.0915 |
T458 |
18874-18934 |
Sentence |
denotes |
Exp Local Score -log(pctile) *2.36 *0.0811 0.308 0.093 |
T459 |
18935-18972 |
Sentence |
denotes |
log(coocc) 2.24 0.097 0.348 0.105 |
T460 |
18973-19007 |
Sentence |
denotes |
Interpretation of the above table. |
T461 |
19008-19118 |
Sentence |
denotes |
Each row corresponds to an association score whereas each column corresponds to one of the evaluation metrics. |
T462 |
19119-19264 |
Sentence |
denotes |
A (+) in the column means a higher evaluation metric value, the better the association score in that row separates the positive and random pairs. |
T463 |
19265-19313 |
Sentence |
denotes |
A (-) means a lower evaluation metric is better. |
T464 |
19314-19415 |
Sentence |
denotes |
Note all the metrics are immune to linear rescalings; also the Mann-Whitney U score is nonparametric. |
T465 |
19416-19754 |
Sentence |
denotes |
While counts matrices have been generated using different technologies (e.g. Drop-Seq, 10x Genomics, etc.) and different alignment/pre-processing pipelines, all counts matrices were scaled such that each cell contains a total of 10,000 scaled counts (i.e. the sum of expression values for all genes equals 10,000 in each individual cell). |
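The per-cell scaling described here can be sketched as follows (a minimal numpy version; the gene and cell counts are toy values):

```python
import numpy as np

def scale_cp10k(counts: np.ndarray) -> np.ndarray:
    """Rescale a genes-x-cells counts matrix so every cell (column)
    sums to 10,000 scaled counts."""
    return counts / counts.sum(axis=0, keepdims=True) * 10_000

counts = np.array([[10., 0.],   # toy matrix: 3 genes x 2 cells
                   [30., 5.],
                   [60., 5.]])
cp10k = scale_cp10k(counts)
```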
T466 |
19755-19839 |
Sentence |
denotes |
All data were uniformly processed using the Seurat v3 package (Butler et al., 2018). |
T467 |
19840-19893 |
Sentence |
denotes |
In short, this pipeline involves the following steps. |
T468 |
19894-20045 |
Sentence |
denotes |
First, we identify 2000 variable genes across the given dataset and then perform linear dimensionality reduction by principal component analysis (PCA). |
T469 |
20046-20409 |
Sentence |
denotes |
Using the set of principal components which contribute >80% of variance across the dataset, we then do the following: (i) perform graph-based clustering to identify groups of cells with similar expression profiles (Louvain clustering), (ii) compute UMAP and tSNE coordinates for each individual cell (used for data visualization) and (iii) annotate cell clusters. |
T470 |
20410-20665 |
Sentence |
denotes |
Note that the three human pancreatic datasets (GSE81076, GSE85241, GSE86469) were integrated together in a shared multi-dimensional space using CCA (Canonical Correlation Analysis) and the integration method in the Seurat v3 package (Butler et al., 2018). |
T471 |
20666-20780 |
Sentence |
denotes |
Cell clustering and computation of dimensionality reduction coordinates were performed on this integrated dataset. |
T472 |
20782-20805 |
Sentence |
denotes |
Cell cluster annotation |
T473 |
20806-21031 |
Sentence |
denotes |
In cases where publicly deposited counts matrices are accompanied by author-assigned annotations for individual cells or clusters, we have retained these cell annotations for display in the platform and accompanying analyses. |
T474 |
21032-21416 |
Sentence |
denotes |
For any study which was not accompanied by a metadata file containing cluster annotations, we have manually labeled clusters based on sets of canonical ‘cluster-defining genes.’ In these cases, we have attempted to leverage annotations and descriptions of gene expression patterns described by study authors in the manuscript text and figures corresponding to the data being analyzed. |
T475 |
21418-21468 |
Sentence |
denotes |
Metrics to summarize cluster-level gene expression |
T476 |
21469-21535 |
Sentence |
denotes |
The platform allows users to query any gene in any selected study. |
T477 |
21536-21683 |
Sentence |
denotes |
The corresponding data is displayed in commonly employed formats including a series of violin plots and as a set of dimensionality reduction plots. |
T478 |
21684-21835 |
Sentence |
denotes |
Expression is summarized by listing the percent of cells expressing Gene G in each annotated cluster and the mean expression of Gene G in each cluster. |
T479 |
21836-22070 |
Sentence |
denotes |
To measure the specificity of Gene G expression to each Cluster C, we compute a Cohen’s D value which assesses the effect size between the mean expression of Gene G in cluster C and the mean expression of Gene G in all other clusters. |
T480 |
22071-22283 |
Sentence |
denotes |
Specifically, the Cohen’s D formula is given as follows: (MeanC − MeanA) / sqrt(StDevC^2 + StDevA^2), where C represents the cluster of interest and A represents the complement of C (i.e. all other cell clusters).
T481 |
22284-22461 |
Sentence |
denotes |
Note that this is functionally similar to the computation of paired fold change values and p-values between clusters which is frequently used to identify cluster-defining genes. |
T482 |
22463-22490 |
Sentence |
denotes |
Gene-gene cosine similarity |
T483 |
22491-22641 |
Sentence |
denotes |
Within the platform, we support the run-time computation of cosine similarity (i.e. 1 - cosine distance) between the queried gene and all other genes. |
T484 |
22642-22768 |
Sentence |
denotes |
This provides a measure of expression similarity across cells and can be used to identify co-regulated and co-expressed genes. |
T485 |
22769-22868 |
Sentence |
denotes |
Specifically, to perform this computation, we construct a ‘gene expression vector’ for each gene G. |
T486 |
22869-23000 |
Sentence |
denotes |
This corresponds to the set of CP10K values for gene G in each individual cell from the selected populations in the selected study. |
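A minimal sketch of this run-time computation, treating each gene's row of CP10K values across the selected cells as its expression vector:

```python
import numpy as np

def gene_gene_cosine(expr: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between genes, where expr is a
    genes-x-cells matrix of CP10K values (one row per gene)."""
    norms = np.linalg.norm(expr, axis=1, keepdims=True)
    unit = expr / np.clip(norms, 1e-12, None)   # guard against all-zero genes
    return unit @ unit.T

expr = np.array([[1., 2., 3.],   # toy matrix: 3 genes x 3 cells
                 [2., 4., 6.],   # perfectly co-expressed with gene 0
                 [3., 0., 0.]])
sim = gene_gene_cosine(expr)
```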
T487 |
23002-23071 |
Sentence |
denotes |
Profiling expression of coronavirus receptors in single-cell datasets |
T488 |
23072-23163 |
Sentence |
denotes |
For each single-cell dataset, we examined the expression of ACE2, TMPRSS2, ANPEP, and DPP4. |
T489 |
23164-23318 |
Sentence |
denotes |
We generally considered a cell population to potentially express a gene if at least 5% of cells from that cluster showed non-zero expression of this gene. |
T490 |
23319-23549 |
Sentence |
denotes |
For each dataset, we show a figure which includes a UMAP dimensionality reduction plot colored by annotated cell type along with identical plots colored by the expression level of each coronavirus receptor in all individual cells. |
T491 |
23550-23805 |
Sentence |
denotes |
In some cases, we also show violin plots from the platform which automatically integrate literature-derived insights to highlight whether there exist textual associations between the queried gene and the tissue/cell types identified in the selected study. |
T492 |
23807-23858 |
Sentence |
denotes |
FDA Adverse Event Reporting System (FAERS) analysis |
T493 |
23859-24075 |
Sentence |
denotes |
The FAERS application of the nferX platform supports viewing adverse event profiles of all marketed products through multiple lenses - Count, Proportional Reporting Ratio (PRR), and an nferX Adverse Event (AE) Score. |
T494 |
24076-24184 |
Sentence |
denotes |
AE Score = ln(count) * 1/(1 + e^(−(PRR − 1.5))). Count is the raw number of reports between a drug and an adverse event.
T495 |
24185-24373 |
Sentence |
denotes |
The proportional reporting ratio (PRR) is a simple way to get a measure of how common an adverse event for a particular drug is compared to how common the event is in the overall database. |
T496 |
24374-24732 |
Sentence |
denotes |
A PRR >1 for a drug-event combination indicates that a greater proportion of the reports for the drug are for the event than the proportion of events in the rest of the database, while a PRR of 2 for a drug event combination indicates that the proportion of reports for the drug-event combination is twice the proportion of the event in the overall database. |
T497 |
24733-24787 |
Sentence |
denotes |
The PRR is computed as follows: PRR = (m/n) / ((M − m)/(N − n))
T498 |
24788-24829 |
Sentence |
denotes |
m = number of reports with drug and event |
T499 |
24830-24861 |
Sentence |
denotes |
n = number of reports with drug |
T500 |
24862-24906 |
Sentence |
denotes |
M = number of reports with event in database |
T501 |
24907-24940 |
Sentence |
denotes |
N = number of reports in database |
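Putting the PRR and AE score formulas together (a direct transcription of the definitions above):

```python
import math

def prr(m: int, n: int, M: int, N: int) -> float:
    """PRR = (m/n) / ((M - m)/(N - n)), with m, n, M, N as defined above."""
    return (m / n) / ((M - m) / (N - n))

def ae_score(count: float, prr_value: float) -> float:
    """nferX AE Score = ln(count) * sigmoid(PRR - 1.5)."""
    return math.log(count) * (1.0 / (1.0 + math.exp(-(prr_value - 1.5))))
```

For example, a drug with 100 reports, 20 of which mention an event that appears in 120 of 10,100 total reports, gives PRR = (20/100) / (100/10,000) = 20.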
T502 |
24941-25016 |
Sentence |
denotes |
Count of an event with a query drug is a good first measure of association. |
T503 |
25017-25217 |
Sentence |
denotes |
But it has the problem that generally common events will often show up at the top, where we are often more interested in events that are differentially associated with the query drug over other drugs. |
T504 |
25218-25304 |
Sentence |
denotes |
An issue with PRR is that it is noisy when the total number of event reports is small. |
T505 |
25305-25701 |
Sentence |
denotes |
If there are three reports of some oddly specific event and one occurs with the query drug, that event will likely have a very high PRR, but it may not be the event we are most interested in for the drug (in FAERS such rare events are often not even proper adverse events). We want events that occur often and are also differentially associated with the drug: a balance between count and PRR.
T506 |
25702-25769 |
Sentence |
denotes |
The AE score tries to strike this balance in an all-in-one measure. |
T507 |
25770-25943 |
Sentence |
denotes |
It up-weights events that occur often for the query drug (this is the ln(count) term), and that are differentially associated with the query drug (this is the sigmoid term). |
T508 |
25944-25998 |
Sentence |
denotes |
The sigmoid(PRR-1.5) term ranges smoothly from 0 to 1. |
T509 |
25999-26030 |
Sentence |
denotes |
It's equal to 0.5 at PRR = 1.5. |
T510 |
26031-26140 |
Sentence |
denotes |
When PRR = 6, sigmoid(PRR-1.5)=0.99; so PRR values >= 6 are all treated roughly equivalently by the AE score. |
T511 |
26141-26361 |
Sentence |
denotes |
Thus, extremely high PRRs due to small counts will not swing the AE score much beyond PRR = 6, and the ln(count) term will down-weight those small-count cases, so that they do not show up at the top of the AE score list. |
T512 |
26362-26548 |
Sentence |
denotes |
A nice property of AE score is that, for a given query drug, the AE scores of the events with that drug turn out to roughly follow an exponential distribution, particularly at the tails. |
T513 |
26549-26623 |
Sentence |
denotes |
We can then fit exponential distributions to the scores, and analyze them. |
T514 |
26624-26833 |
Sentence |
denotes |
A benefit of the exponential fit is that we can make more robust claims about how significant a certain score is for a query drug, even if the empirical data is sparse/noisy at the tails for a particular drug. |