Id |
Subject |
Object |
Predicate |
Lexical cue |
T308 |
0-21 |
Sentence |
denotes |
Materials and methods |
T309 |
23-97 |
Sentence |
denotes |
Unstructured biomedical knowledge synthesis and triangulation capabilities |
T310 |
98-281 |
Sentence |
denotes |
In order to capture biomedical literature-based associations, the nferX platform defines two scores: a ‘local score’ and a ‘global score’, as described previously (Park et al., 2020). |
T311 |
282-618 |
Sentence |
denotes |
Briefly, the local score is obtained by applying a traditional natural language processing technique that captures the strength of association between two concepts in a selected corpus of biomedical literature, based on the frequency of their co-occurrence normalized by the frequency of each individual concept throughout the corpus.
T312 |
619-809 |
Sentence |
denotes |
A higher local score between Concept X and Concept Y indicates that these concepts are mentioned in close proximity to each other more frequently than would be expected by chance.
T313 |
810-957 |
Sentence |
denotes |
The global score, on the other hand, is based on the neural network renaissance that has recently taken place in Natural Language Processing (NLP). |
T314 |
958-1088 |
Sentence |
denotes |
To compute global scores, all tokens (e.g. words and phrases) are projected into a high-dimensional vector space of word embeddings.
T315 |
1089-1188 |
Sentence |
denotes |
These vectors serve to represent the ‘neighborhood’ of concepts which occur around a given concept. |
T316 |
1189-1414 |
Sentence |
denotes |
The cosine similarity between any two vectors measures the similarity of these neighborhoods and is the basis for our global score metric, where concepts which are more similar in this vector space have a higher global score. |
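As an illustrative sketch of this global-score idea, the cosine similarity between embedding vectors can be computed as below (toy 4-dimensional vectors stand in for the learned 300-dimensional word embeddings; the token names and values are made up):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-d vectors standing in for learned word embeddings (values made up).
emb = {
    "ace2": np.array([0.9, 0.1, 0.3, 0.0]),
    "tmprss2": np.array([0.8, 0.2, 0.4, 0.1]),
    "kidney": np.array([0.1, 0.9, 0.0, 0.3]),
}

# Tokens with similar 'neighborhoods' get similar vectors, hence a
# higher cosine similarity, hence a higher global score.
sim_gene = cosine_similarity(emb["ace2"], emb["tmprss2"])
sim_other = cosine_similarity(emb["ace2"], emb["kidney"])
```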
T317 |
1415-1655 |
Sentence |
denotes |
While the global scores in this work are computed in the embedding space of a word2vec model, they can also be computed in the embedding space of any deep learning model, including recent transformer-based models like BERT (Devlin et al., 2019).
T318 |
1656-1817 |
Sentence |
denotes |
These may offer benefits complementary to word2vec embeddings, since transformer embeddings are context-sensitive, producing different vectors for different sentence contexts.
T319 |
1818-2056 |
Sentence |
denotes |
However, despite the context-sensitive nature of BERT embeddings, a global score computation for a phrase may still be of value, given that the score is computed across sentence embeddings capturing the context-sensitive nature of those phrases.
T320 |
2057-2283 |
Sentence |
denotes |
From a visualization perspective, the local score and global score (‘Signals’) are represented in the platform using bubbles where bubble size corresponds to the local score and color intensity corresponds to the global score. |
T321 |
2284-2409 |
Sentence |
denotes |
This allows users to rapidly determine the strength of association between any two concepts throughout biomedical literature. |
T322 |
2410-2568 |
Sentence |
denotes |
We consider concepts which show both high local and global scores to be ‘concordant’ and have found that these typically recapitulate well-known associations. |
T323 |
2569-2722 |
Sentence |
denotes |
One key aspect of the nferX platform is that it allows the user to query associated concepts for a virtually unbounded number of possible query concepts. |
T324 |
2723-2765 |
Sentence |
denotes |
This is achieved by means of two features: |
T325 |
2766-3018 |
Sentence |
denotes |
Firstly, the nferX platform allows the user to compose queries using the logical AND, OR, and NOT operators to combine any number of biomedical concepts in a query, each combination amounting to a gross or nuanced composite biomedical concept.
T326 |
3019-3521 |
Sentence |
denotes |
Secondly, since logical combinations yield a virtually unbounded number of biomedical concepts that can be queried, the nferX platform implements a completely dynamic method of computing local scores on the fly, using novel high-performance parallel and distributed algorithms that scan hundreds of millions of documents in real time to locate text fragments related to the user query and count co-occurring biomedical concepts, from which strength-of-association scores and their significances are computed.
T327 |
3522-3870 |
Sentence |
denotes |
The platform further leverages statistical inference to calculate ‘enrichments’ based on structured data, thus enabling real-time triangulation of signals from the unstructured biomedical knowledge graph with various other structured databases (e.g. curated ontologies, RNA-sequencing datasets, human genetic associations, protein-protein interactions).
T328 |
3871-4042 |
Sentence |
denotes |
This facilitates unbiased hypothesis-free learning and faster pattern recognition, and it allows users to more holistically determine the veracity of concept associations. |
T329 |
4043-4249 |
Sentence |
denotes |
Finally, the platform allows the user to identify and further examine the documents and textual fragments from which the knowledge synthesis signals are derived using the Documents and Signals applications. |
T330 |
4251-4269 |
Sentence |
denotes |
Association scores |
T331 |
4270-4602 |
Sentence |
denotes |
Having a method that automatically consumes a corpus and computes a numeric score capturing the strength of association between any pair of entities is clearly beneficial: given any entity, its association scores with all other entities can then be sorted to produce a ranked list of associated entities.
T332 |
4603-4727 |
Sentence |
denotes |
The number of times two entities mutually co-occur in ‘small’ vicinities of a corpus is the basis of all association scores. |
T333 |
4728-4911 |
Sentence |
denotes |
One popular traditional measure for association strength between tokens in text is pointwise mutual information, or PMI (Evert, 2005), which we consider in several association scores. |
T334 |
4913-4936 |
Sentence |
denotes |
Measures of association |
T335 |
4937-5067 |
Sentence |
denotes |
Formally, an association score is some real-valued function S(q, t) where q is a query token/entity and t is another token/entity. |
T336 |
5068-5151 |
Sentence |
denotes |
One important notion, the ‘vicinity’ of q, we formally denote as the Context of q:
T337 |
5152-5231 |
Sentence |
denotes |
The context of q consists of those corpus segments deemed to be ‘near’ or ‘local’ to q.
T338 |
5232-5513 |
Sentence |
denotes |
For single-token queries (where q is a single entity and not a logical combination of entities), q’s context consists of all corpus segments that are ‘windows’ formed by taking words within a distance of w words (usually a tunable parameter) from an occurrence of q in the corpus.
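The window construction described here can be sketched as follows (a simplified stand-in that works on a flat token list, merges overlapping windows, and excludes occurrences of q itself; the corpus text is illustrative):

```python
def context_of(tokens: list, q: str, w: int) -> list:
    """Tokens within w positions of any occurrence of q (occurrences
    of q themselves excluded); overlapping windows are merged."""
    keep = set()
    for i, tok in enumerate(tokens):
        if tok == q:
            keep.update(j for j in range(max(0, i - w), min(len(tokens), i + w + 1))
                        if tokens[j] != q)
    return [tokens[j] for j in sorted(keep)]

corpus = "egfr mutation drives lung cancer while egfr testing guides therapy".split()
ctx = context_of(corpus, "egfr", 2)
```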
T339 |
5514-5706 |
Sentence |
denotes |
The dynamic adjacency engine generalizes this notion of context in a natural way to logical queries: the context for a logical q can be generalized as a certain set of fixed-length fragments. |
T340 |
5708-5722 |
Sentence |
denotes |
Co-occurrences |
T341 |
5723-5786 |
Sentence |
denotes |
This is just the number of times t appears in the context of q. |
T342 |
5788-5803 |
Sentence |
denotes |
Traditional PMI |
T343 |
5804-5831 |
Sentence |
denotes |
This is log(p(t | q)/p(t)). |
T344 |
5832-6088 |
Sentence |
denotes |
Here p(t | q) is the number of times t occurs in the context of q (i.e. co-occurrences of t and q) divided by the total length of all q contexts in the corpus, whereas p(t) is the number of occurrences of t in the entire corpus divided by the corpus length.
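These definitions can be combined into a toy PMI computation (overlapping windows are merged into a single context set, and the corpus is illustrative):

```python
import math

def pmi(tokens: list, q: str, t: str, w: int) -> float:
    """Traditional PMI = log(p(t|q) / p(t)); p(t|q) is estimated over
    the merged context windows of q, p(t) over the whole corpus."""
    n = len(tokens)
    ctx = set()
    for i, tok in enumerate(tokens):
        if tok == q:
            ctx.update(j for j in range(max(0, i - w), min(n, i + w + 1))
                       if tokens[j] != q)
    coocc = sum(1 for j in ctx if tokens[j] == t)
    p_t_given_q = coocc / len(ctx)       # co-occurrences / total context length
    p_t = tokens.count(t) / n            # corpus frequency of t
    return math.log(p_t_given_q / p_t)

corpus = "ace2 receptor binds virus ace2 receptor kidney tissue sample data".split()
```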
T345 |
6090-6116 |
Sentence |
denotes |
Word2vec cosine similarity |
T346 |
6117-6260 |
Sentence |
denotes |
The popular word2vec algorithm (Mikolov et al., 2013b) generates a vector (we use a 300-dimensional representation) for each token in a corpus.
T347 |
6261-6348 |
Sentence |
denotes |
The purpose of these vectors is usually to be used as features in downstream NLP tasks. |
T348 |
6349-6390 |
Sentence |
denotes |
But they can also be used to measure similarity.
T349 |
6391-6556 |
Sentence |
denotes |
The original paper validates the vectors by testing them on word similarity tasks: the association score is the cosine between the vector for q and the vector for t. |
T350 |
6557-6599 |
Sentence |
denotes |
This score only applies to single-token q. |
T351 |
6601-6630 |
Sentence |
denotes |
Exponential mask PMI (ExpPMI) |
T352 |
6631-6668 |
Sentence |
denotes |
This is our first new proposed score. |
T353 |
6669-6751 |
Sentence |
denotes |
PMI treats every position in a binary way: it’s either in the context of q or not. |
T354 |
6752-6902 |
Sentence |
denotes |
With a window size of say 50, a token which appears three words from a query q and a token which appears 45 words from a query q are treated the same. |
T355 |
6903-7075 |
Sentence |
denotes |
We thought it might be useful to consider a measure which distinguishes positions in the context based on the number of words away that position is from an occurrence of q. |
T356 |
7076-7161 |
Sentence |
denotes |
We did this by weighting the positions in the context by some weight between 0 and 1. |
T357 |
7162-7299 |
Sentence |
denotes |
Our weighting is based on an exponential decay (which has some nice properties especially when we extend to the case of logical queries). |
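A minimal sketch of such exponential-decay position weighting (the decay constant 0.9 is an arbitrary illustrative choice, not the platform's actual parameter):

```python
def exp_weights(tokens: list, q: str, w: int, decay: float = 0.9) -> list:
    """Weight each corpus position by decay**d, where d is its distance
    to the nearest occurrence of q; positions outside the window of
    width w, or holding q itself, get weight 0."""
    occ = [i for i, tok in enumerate(tokens) if tok == q]
    weights = []
    for j, tok in enumerate(tokens):
        d = min((abs(j - i) for i in occ), default=w + 1)
        weights.append(decay ** d if tok != q and d <= w else 0.0)
    return weights
```

A weighted co-occurrence count for a token t is then just the sum of these weights at the positions where t occurs, rather than a flat count.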
T358 |
7301-7312 |
Sentence |
denotes |
Local score |
T359 |
7313-7348 |
Sentence |
denotes |
This is another new proposed score. |
T360 |
7349-7462 |
Sentence |
denotes |
We find that PMI and ExpPMI can vary a lot for small samples (i.e. small numbers of co-occurrences, occurrences). |
T361 |
7463-7602 |
Sentence |
denotes |
The Local Score is log(coocc) * sigmoid(PMI - 0.5), constructed to correct for this; we found that this formula works well empirically.
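The formula can be written directly (a sketch; the coocc and PMI inputs would come from the co-occurrence machinery described above):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def local_score(coocc: float, pmi: float) -> float:
    """Local Score = log(coocc) * sigmoid(PMI - 0.5). The log term
    damps high-PMI pairs backed by only a handful of co-occurrences."""
    return math.log(coocc) * sigmoid(pmi - 0.5)
```

The ExpLocalScore variant described below simply substitutes weighted_coocc and ExpPMI for the raw count and PMI in the same formula.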
T362 |
7604-7648 |
Sentence |
denotes |
Exponential mask local score (ExpLocalScore) |
T363 |
7649-7761 |
Sentence |
denotes |
We apply both modifications together: the exponential mask local score is log(weighted_coocc) * sigmoid(ExpPMI - 0.5).
T364 |
7762-7839 |
Sentence |
denotes |
Here weighted_coocc is the sum of the weights of the corpus positions at which t occurs.
T365 |
7841-7892 |
Sentence |
denotes |
Evaluation of literature-derived association scores |
T366 |
7893-7974 |
Sentence |
denotes |
We need a notion of ground truth to evaluate the quality of association measures. |
T367 |
7975-8095 |
Sentence |
denotes |
We use sets of known pairs of related entities versus a ‘control’ group of random pairs of entities of the same classes. |
T368 |
8096-8139 |
Sentence |
denotes |
We use a few different sets of known pairs: |
T369 |
8140-8200 |
Sentence |
denotes |
Disease-Gene relationships based on OMIM (Park et al., 2020) |
T370 |
8201-8234 |
Sentence |
denotes |
Drug-Gene relationships (Table 1) |
T371 |
8235-8281 |
Sentence |
denotes |
Drug-Disease relationships based on FDA labels |
T372 |
8282-8318 |
Sentence |
denotes |
Drugs and their on-label indications |
T373 |
8319-8358 |
Sentence |
denotes |
Drugs and their on-label adverse events |
T374 |
8359-8395 |
Sentence |
denotes |
Logical queries for ambiguous tokens |
T375 |
8396-8525 |
Sentence |
denotes |
One demonstration of the use of the logical query system is to disambiguate a token by conjoining it with a disambiguating token. |
T376 |
8526-8699 |
Sentence |
denotes |
An example makes this clearer: the token ‘egfr’ can refer to the gene entity epidermal growth factor receptor, but also the test measure entity estimated glomerular filtration rate.
T377 |
8700-8819 |
Sentence |
denotes |
A query ‘egfr AND kidney’ should return results related to the latter meaning, while ‘egfr AND lung_cancer’ the former. |
T378 |
8820-8917 |
Sentence |
denotes |
In particular, an unambiguous referent to the right entity should be highly related to the query. |
T379 |
8918-9083 |
Sentence |
denotes |
So example known pairs in this data are (‘egfr AND kidney’, ‘estimated_glomerular_filtration_rate’) and (‘egfr AND lung_cancer’, ‘epidermal_growth_factor_receptor’). |
T380 |
9084-9188 |
Sentence |
denotes |
We used an internal set of ~200–300 such (‘A AND B’, ‘C’) pairs (originally built up for other reasons). |
T381 |
9189-9194 |
Sentence |
denotes |
Note: |
T382 |
9195-9521 |
Sentence |
denotes |
One key drawback of the word2vec vector cosine similarity (Park et al., 2020; Mikolov et al., 2013b) method is its inability to get scores for logical queries as described above, because the method (Mikolov et al., 2013b) does not address the question of how to get vectors for queries that are logical combinations of tokens. |
T383 |
9523-9541 |
Sentence |
denotes |
Evaluation metrics |
T384 |
9542-9706 |
Sentence |
denotes |
Given a scoring method and a particular set of positive/control pairs, we get two sets of scores: one set for the positive pairs and one set for the negative pairs. |
T385 |
9707-9717 |
Sentence |
denotes |
Cohen’s d: |
T386 |
9718-9822 |
Sentence |
denotes |
We compute Cohen’s d, a standard statistical measure of the distance between two samples (Cohen’s D, 2016).
T387 |
9823-10010 |
Sentence |
denotes |
Mann-Whitney U (normalized): The Mann-Whitney U is a nonparametric measure of distribution distance: it counts the number of transposed pairs (Contributors to Wikimedia projects, 2004).
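As a sketch of the first metric, Cohen's d between the positive-pair and control-pair score samples (a pooled standard deviation is assumed here, one common convention; the exact pooling is not spelled out in the text):

```python
import statistics

def cohens_d(pos: list, neg: list) -> float:
    """Cohen's d between two score samples, using a pooled sample
    standard deviation (assumed convention)."""
    m1, m2 = statistics.fmean(pos), statistics.fmean(neg)
    v1, v2 = statistics.variance(pos), statistics.variance(neg)
    n1, n2 = len(pos), len(neg)
    pooled_sd = (((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)) ** 0.5
    return (m1 - m2) / pooled_sd
```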
T388 |
10012-10058 |
Sentence |
denotes |
Metrics based on training a 1-d logistic model |
T389 |
10059-10171 |
Sentence |
denotes |
In this test, we are discriminating between two classes (true association/non-association) based on one feature. |
T390 |
10172-10283 |
Sentence |
denotes |
We have two metrics based on fitting a 1-feature logistic curve to the data (Figure 1—figure supplement 1A–B).
T391 |
10284-10296 |
Sentence |
denotes |
Brier score: |
T392 |
10297-10538 |
Sentence |
denotes |
The Brier score is the average squared error of the logistic curve above: that is, for each labeled point, we square the vertical distance to the logistic curve, and average over all labeled points (Contributors to Wikimedia projects, 2005). |
T393 |
10539-10567 |
Sentence |
denotes |
Log loss (dansbecker, 2018): |
T394 |
10568-10667 |
Sentence |
denotes |
The logistic log loss is the average -log [model probability of true label] for each labeled point. |
T395 |
10668-10724 |
Sentence |
denotes |
If the model is perfect at the point, it incurs no loss. |
T396 |
10725-10770 |
Sentence |
denotes |
If it predicts 0.5, it incurs -log[0.5] loss. |
T397 |
10771-10933 |
Sentence |
denotes |
If it predicts ‘yes’ with certainty when the answer is ‘no’, it incurs infinite loss (a logistic function never touches 0 or 1, so this won’t happen in our case).
T398 |
10934-10953 |
Sentence |
denotes |
Neg log percentile: |
T399 |
10954-11040 |
Sentence |
denotes |
For most of the scoring rules, we also include a -log(percentile) version of the rule. |
T400 |
11041-11113 |
Sentence |
denotes |
This is constructed as follows, for query q, token t, and score S(q, t): |
T401 |
11114-11168 |
Sentence |
denotes |
Compute the scores S(q, t’) for q with every token t’. |
T402 |
11169-11215 |
Sentence |
denotes |
Let R be the number of these that are nonzero. |
T403 |
11216-11270 |
Sentence |
denotes |
Take the rank r of S(q, t) among all nonzero S(q, t’). |
T404 |
11271-11340 |
Sentence |
denotes |
The neg log percentile score nlS(q, t) associated with S is -log(r/R).
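The steps above can be sketched as follows (assuming rank 1 is the highest score, so strongly associated tokens get large nlS values; ties here simply take the best rank):

```python
import math

def neg_log_percentile(scores: dict, t: str) -> float:
    """-log(r/R): R = number of nonzero scores S(q, t'), r = rank of
    S(q, t) among them (rank 1 = highest; ties take the best rank)."""
    nonzero = sorted((s for s in scores.values() if s != 0), reverse=True)
    R = len(nonzero)
    r = 1 + nonzero.index(scores[t])
    return -math.log(r / R)

# S(q, t') for one query q over a toy vocabulary.
scores = {"a": 5.0, "b": 2.0, "c": 0.0, "d": 1.0}
```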
T405 |
11341-11355 |
Sentence |
denotes |
We do this to: |
T406 |
11356-11394 |
Sentence |
denotes |
control for differences across queries |
T407 |
11395-11504 |
Sentence |
denotes |
control for differences in the shapes of the distributions that different association scoring functions take. |
T408 |
11505-11576 |
Sentence |
denotes |
This procedure maps all the S(q, t’) to an Exponential(1) distribution. |
T409 |
11577-11718 |
Sentence |
denotes |
We chose Exponential(1) because it is simple, intuitively reasonable and many of the scores naturally seemed to be approximately exponential. |
T410 |
11720-11804 |
Sentence |
denotes |
High-dimensional word embeddings for determining the significant global associations |
T411 |
11805-12214 |
Sentence |
denotes |
Figure 1—figure supplement 1C illustrates two histograms generated from a random set of vectors (in the vector space generated by the Neural Network) where one distribution represents all vector pairs whose cosine similarity is less than 0.32 (deemed ‘not strong associations’) and the other distribution represents all vector pairs whose cosine similarity is greater than 0.32 (deemed ‘strong associations’). |
T412 |
12215-12375 |
Sentence |
denotes |
This shows how common it is to find word vector pairs that have very good cosine similarity values yet never co-occur even once in the corpus.
T413 |
12376-12588 |
Sentence |
denotes |
The ‘cosine similarity >= 0.32’ bar at the zero value suggests that roughly 11% of vector pairs whose cosine similarity was greater than 0.32 (‘strong associations’) never occurred together even once in a document.
T414 |
12589-12937 |
Sentence |
denotes |
It is also clear from the figure that, although more of the mass of the ‘cosine similarity >= 0.32’ distribution is skewed to the right as expected (more co-occurrences and hence, unsurprisingly, larger cosine similarity values), there is a long tail of the ‘cosine similarity < 0.32’ distribution (very high co-occurrences but small cosine similarity).
T415 |
12938-13157 |
Sentence |
denotes |
The long tail is a direct consequence of negative sampling—where vectors corresponding to common words that co-occur quite often with significant words in a sliding window are moved away from vectors of the other words. |
T416 |
13159-13252 |
Sentence |
denotes |
What does the word2vec neural network do from the perspective of Genes-Diseases associations? |
T417 |
13253-13571 |
Sentence |
denotes |
One way to view the word2vec ‘black box’ operation from a Genes/Diseases perspective (cosine of <Gene, Disease> for all Genes and Diseases) is as a Transfer Function which changed the input probability distribution (pre-training randomly assigned word vectors for Genes and Diseases) to a new probability distribution. |
T418 |
13572-13780 |
Sentence |
denotes |
The ‘null hypothesis’ (which is well preserved in practice by the way word2vec initially assigns random values to vectors) is the ‘green colored’ Cosine Distribution (Figure 1—figure supplement 1D).
T419 |
13781-14034 |
Sentence |
denotes |
Once word2vec training is over, the final word vectors are placed in specific positions in the 300-dimensional space so as to present the ‘blue colored’ Empirical distribution (the actual cosine similarity between <Gene, Disease> pairs that we observe). |
T420 |
14035-14219 |
Sentence |
denotes |
The ‘orange curve’ is the 2-Gamma mixture (the parametric distribution that captures the ‘empirical distribution’ with just eight parameters: two alphas, two betas, two ts, and two phis).
T421 |
14220-14252 |
Sentence |
denotes |
Observations from this analysis: |
T422 |
14253-14361 |
Sentence |
denotes |
Note that the ‘symmetrical’ cosine distribution after training becomes ‘asymmetrical’ with a longer right tail.
T423 |
14362-14465 |
Sentence |
denotes |
The asymmetry is the reason why Gamma distribution worked better than say, Gaussian, for the curve fit. |
T424 |
14466-14945 |
Sentence |
denotes |
The mean of the distribution shifts to the right after training, as one would expect: during training the vectors are predominantly ‘brought together’ by parallelogram addition, explaining the rightward shift. (Negative sampling causes movement in the opposite direction, but it disproportionately affects the ultra-high-frequency words, which also get ‘more’ positively sampled; hence the 3-gamma with a bump near 0.6 occurs for ultra-high-frequency words.)
T425 |
14946-15032 |
Sentence |
denotes |
The most interesting associations, by definition, are in the tail of the distribution. |
T426 |
15034-15179 |
Sentence |
denotes |
What does varying the number of dimensions in the word2vec space do to the underlying cosine similarity distributions in a large textual corpus?
T427 |
15180-15395 |
Sentence |
denotes |
Figure 1—figure supplement 1E illustrates a cosine similarity probability density function (PDF) graph to visually describe the implementation of the word2vec-like Vector Space Model in various N-dimensional spaces. |
T428 |
15396-15756 |
Sentence |
denotes |
As described in the Materials and methods section, the system is a Semantic Bio-Knowledge Graph of nodes representing the words/phrases chosen to be represented as vectors and edge weights determined by measures of Semantic Association Strength (e.g. the cosine similarity between a pair of word embeddings represented as vectors in a large dimensional space). |
T429 |
15757-15874 |
Sentence |
denotes |
The cosine similarity ranges from 0 (representing no semantic association) to 1 (representing strongest association). |
T430 |
15875-15982 |
Sentence |
denotes |
This metric of association can reflect the contextual similarity of the entities in the Biomedical Corpora. |
T431 |
15983-16092 |
Sentence |
denotes |
The typical dimensionality used by our neural network for generating the Global Scores is n = 300 dimensions. |
T432 |
16093-16307 |
Sentence |
denotes |
This is because, as can be seen in the graph, the distribution is highly peaked with most of the mass centered around 0; that is, a randomly chosen pair of vectors is typically orthogonal or close to orthogonal.
T433 |
16308-16453 |
Sentence |
denotes |
Furthermore, at 300 dimensions, the distributions all have sufficiently long tails containing the most interesting (salient) biomedical associations.
T434 |
16455-16492 |
Sentence |
denotes |
Single-cell RNA-seq analysis platform |
T435 |
16493-16618 |
Sentence |
denotes |
The objective of the single cell platform is to enable dynamic visualization and analysis of single-cell RNA-sequencing data. |
T436 |
16619-16948 |
Sentence |
denotes |
Currently, there are over 30 scRNAseq studies available for analysis in the Single Cell app, including studies from human donors/patients covering tissues such as adipose tissue, blood, bone marrow, colon, esophagus, liver, lung, kidney, ovary, nasal epithelium, pancreas, placenta, prostate, retina, small intestine, and spleen. |
T437 |
16949-17101 |
Sentence |
denotes |
Because no pan-tissue reference dataset yet exists for humans, we have manually selected individual studies to maximally cover the set of human tissues. |
T438 |
17102-17269 |
Sentence |
denotes |
In some cases, these studies contain cells from both healthy donors and patients affected by a specified pathology such as ulcerative colitis (colon) or asthma (lung). |
T439 |
17270-17621 |
Sentence |
denotes |
There are also a number of murine scRNAseq studies covering tissues including adipose tissue, airway epithelium, blood, bone marrow, brain, breast, colon, heart, kidney, liver, lung, ovary, pancreas, placenta, prostate, skeletal muscle, skin, spleen, stomach, small intestine, testis, thymus, tongue, trachea, urinary bladder, uterus, and vasculature. |
T440 |
17622-17721 |
Sentence |
denotes |
Note that two of these murine studies (Tabula Muris and Mouse Cell Atlas) include ~20 tissues each. |
T441 |
17723-17759 |
Sentence |
denotes |
Single-cell data processing pipeline |
T442 |
17760-17944 |
Sentence |
denotes |
For each study, a counts matrix was downloaded from a public data repository such as the Gene Expression Omnibus (GEO) or the Broad Institute Single Cell Portal (Supplementary file 1). |
T443 |
17945-18154 |
Sentence |
denotes |
Note that this data has not been re-processed from the raw sequencing output, and so it is likely that alignment and quantification of gene expression was performed using different tools for different studies. |
T444 |
18155-18248 |
Sentence |
denotes |
In some cases, multiple complementary datasets have been generated from a single publication. |
T445 |
18249-18328 |
Sentence |
denotes |
In these cases, we have generated separate entries in the Single Cell platform. |
T446 |
18329-18337 |
Sentence |
denotes |
Table 1. |
T447 |
18339-18361 |
Sentence |
denotes |
Results of evaluation. |
T448 |
18362-18415 |
Sentence |
denotes |
Performance of approximately 2100 disease-gene pairs. |
T449 |
18416-18512 |
Sentence |
denotes |
Assoc score↓ Cohen’s d (+) Mann-W U norm. (-) Logistic log loss (-) Logistic Brier score (-) |
T450 |
18513-18551 |
Sentence |
denotes |
Cosine (w2v) 1.31 0.197 0.51 0.168 |
T451 |
18552-18587 |
Sentence |
denotes |
Raw PMI 2.07 0.0953 0.374 0.116 |
T452 |
18588-18636 |
Sentence |
denotes |
Raw PMI -log(pctile) 2.15 0.0947 0.355 0.111 |
T453 |
18637-18672 |
Sentence |
denotes |
Exp PMI 2.17 0.0897 0.356 0.109 |
T454 |
18673-18721 |
Sentence |
denotes |
Exp PMI -log(pctile) 2.21 0.0903 0.341 0.105 |
T455 |
18722-18766 |
Sentence |
denotes |
Raw Local Score 2.35 0.0828 0.312 0.0947 |
T456 |
18767-18824 |
Sentence |
denotes |
Raw Local Score -log(pctile) 2.28 0.0832 0.317 0.0963 |
T457 |
18825-18873 |
Sentence |
denotes |
Exp Local Score 2.34 0.0812 *0.301 *0.0915 |
T458 |
18874-18934 |
Sentence |
denotes |
Exp Local Score -log(pctile) *2.36 *0.0811 0.308 0.093 |
T459 |
18935-18972 |
Sentence |
denotes |
log(coocc) 2.24 0.097 0.348 0.105 |
T460 |
18973-19007 |
Sentence |
denotes |
Interpretation of the above table. |
T461 |
19008-19118 |
Sentence |
denotes |
Each row corresponds to an association score whereas each column corresponds to one of the evaluation metrics. |
T462 |
19119-19264 |
Sentence |
denotes |
A (+) in the column means a higher evaluation metric value, the better the association score in that row separates the positive and random pairs. |
T463 |
19265-19313 |
Sentence |
denotes |
A (-) means a lower evaluation metric is better. |
T464 |
19314-19415 |
Sentence |
denotes |
Note all the metrics are immune to linear rescalings; also the Mann-Whitney U score is nonparametric. |
T465 |
19416-19754 |
Sentence |
denotes |
While counts matrices have been generated using different technologies (e.g. Drop-Seq, 10x Genomics, etc.) and different alignment/pre-processing pipelines, all counts matrices were scaled such that each cell contains a total of 10,000 scaled counts (i.e. the sum of expression values for all genes equals 10,000 in each individual cell). |
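The per-cell scaling described here can be sketched as follows (a minimal numpy version; the gene and cell counts are toy values):

```python
import numpy as np

def scale_cp10k(counts: np.ndarray) -> np.ndarray:
    """Rescale a genes-x-cells counts matrix so every cell (column)
    sums to 10,000 scaled counts."""
    return counts / counts.sum(axis=0, keepdims=True) * 10_000

counts = np.array([[10., 0.],   # toy matrix: 3 genes x 2 cells
                   [30., 5.],
                   [60., 5.]])
cp10k = scale_cp10k(counts)
```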
T466 |
19755-19839 |
Sentence |
denotes |
All data were uniformly processed using the Seurat v3 package (Butler et al., 2018). |
T467 |
19840-19893 |
Sentence |
denotes |
In short, this pipeline involves the following steps. |
T468 |
19894-20045 |
Sentence |
denotes |
First, we identify 2000 variable genes across the given dataset and then perform linear dimensionality reduction by principal component analysis (PCA). |
T469 |
20046-20409 |
Sentence |
denotes |
Using the set of principal components which contribute >80% of variance across the dataset, we then do the following: (i) perform graph-based clustering to identify groups of cells with similar expression profiles (Louvain clustering), (ii) compute UMAP and tSNE coordinates for each individual cell (used for data visualization) and (iii) annotate cell clusters. |
T470 |
20410-20665 |
Sentence |
denotes |
Note that the three human pancreatic datasets (GSE81076, GSE85241, GSE86469) were integrated together in a shared multi-dimensional space using CCA (Canonical Correlation Analysis) and the integration method in the Seurat v3 package (Butler et al., 2018). |
T471 |
20666-20780 |
Sentence |
denotes |
Cell clustering and computation of dimensionality reduction coordinates were performed on this integrated dataset. |
T472 |
20782-20805 |
Sentence |
denotes |
Cell cluster annotation |
T473 |
20806-21031 |
Sentence |
denotes |
In cases where publicly deposited counts matrices are accompanied by author-assigned annotations for individual cells or clusters, we have retained these cell annotations for display in the platform and accompanying analyses. |
T474 |
21032-21416 |
Sentence |
denotes |
For any study which was not accompanied by a metadata file containing cluster annotations, we have manually labeled clusters based on sets of canonical ‘cluster-defining genes.’ In these cases, we have attempted to leverage annotations and descriptions of gene expression patterns described by study authors in the manuscript text and figures corresponding to the data being analyzed. |
T475 |
21418-21468 |
Sentence |
denotes |
Metrics to summarize cluster-level gene expression |
T476 |
21469-21535 |
Sentence |
denotes |
The platform allows users to query any gene in any selected study. |
T477 |
21536-21683 |
Sentence |
denotes |
The corresponding data is displayed in commonly employed formats including a series of violin plots and as a set of dimensionality reduction plots. |
T478 |
21684-21835 |
Sentence |
denotes |
Expression is summarized by listing the percent of cells expressing Gene G in each annotated cluster and the mean expression of Gene G in each cluster. |
T479 |
21836-22070 |
Sentence |
denotes |
To measure the specificity of Gene G expression to each Cluster C, we compute a Cohen’s D value which assesses the effect size between the mean expression of Gene G in cluster C and the mean expression of Gene G in all other clusters. |
T480 |
22071-22283 |
Sentence |
denotes |
Specifically, the Cohen’s D formula is given as follows: (MeanC − MeanA) / sqrt(StDevC^2 + StDevA^2), where C represents the cluster of interest and A represents the complement of C (i.e. all other cell clusters).
T481 |
22284-22461 |
Sentence |
denotes |
Note that this is functionally similar to the computation of paired fold change values and p-values between clusters which is frequently used to identify cluster-defining genes. |
T482 |
22463-22490 |
Sentence |
denotes |
Gene-gene cosine similarity |
T483 |
22491-22641 |
Sentence |
denotes |
Within the platform, we support the run-time computation of cosine similarity (i.e. 1 - cosine distance) between the queried gene and all other genes. |
T484 |
22642-22768 |
Sentence |
denotes |
This provides a measure of expression similarity across cells and can be used to identify co-regulated and co-expressed genes. |
T485 |
22769-22868 |
Sentence |
denotes |
Specifically, to perform this computation, we construct a ‘gene expression vector’ for each gene G. |
T486 |
22869-23000 |
Sentence |
denotes |
This corresponds to the set of CP10K values for gene G in each individual cell from the selected populations in the selected study. |
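A minimal sketch of this run-time computation, treating each gene's row of CP10K values across the selected cells as its expression vector:

```python
import numpy as np

def gene_gene_cosine(expr: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between genes, where expr is a
    genes-x-cells matrix of CP10K values (one row per gene)."""
    norms = np.linalg.norm(expr, axis=1, keepdims=True)
    unit = expr / np.clip(norms, 1e-12, None)   # guard against all-zero genes
    return unit @ unit.T

expr = np.array([[1., 2., 3.],   # toy matrix: 3 genes x 3 cells
                 [2., 4., 6.],   # perfectly co-expressed with gene 0
                 [3., 0., 0.]])
sim = gene_gene_cosine(expr)
```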
T487 |
23002-23071 |
Sentence |
denotes |
Profiling expression of coronavirus receptors in single-cell datasets |
T488 |
23072-23163 |
Sentence |
denotes |
For each single-cell dataset, we examined the expression of ACE2, TMPRSS2, ANPEP, and DPP4. |
T489 |
23164-23318 |
Sentence |
denotes |
We generally considered a cell population to potentially express a gene if at least 5% of cells from that cluster showed non-zero expression of this gene. |
T490 |
23319-23549 |
Sentence |
denotes |
For each dataset, we show a figure which includes a UMAP dimensionality reduction plot colored by annotated cell type along with identical plots colored by the expression level of each coronavirus receptor in all individual cells. |
T491 |
23550-23805 |
Sentence |
denotes |
In some cases, we also show violin plots from the platform which automatically integrate literature-derived insights to highlight whether there exist textual associations between the queried gene and the tissue/cell types identified in the selected study. |
T492 |
23807-23858 |
Sentence |
denotes |
FDA Adverse Event Reporting System (FAERS) analysis |
T493 |
23859-24075 |
Sentence |
denotes |
The FAERS application of the nferX platform supports viewing adverse event profiles of all marketed products through multiple lenses - Count, Proportional Reporting Ratio (PRR), and an nferX Adverse Event (AE) Score. |
T494 |
24076-24184 |
Sentence |
denotes |
AE Score = ln(count) * 1/(1 + e^(−(PRR − 1.5))). Count is the raw number of reports between a drug and an adverse event.
T495 |
24185-24373 |
Sentence |
denotes |
The proportional reporting ratio (PRR) is a simple way to get a measure of how common an adverse event for a particular drug is compared to how common the event is in the overall database. |
T496 |
24374-24732 |
Sentence |
denotes |
A PRR >1 for a drug-event combination indicates that a greater proportion of the reports for the drug are for the event than the proportion of events in the rest of the database, while a PRR of 2 for a drug event combination indicates that the proportion of reports for the drug-event combination is twice the proportion of the event in the overall database. |
T497 |
24733-24787 |
Sentence |
denotes |
The PRR is computed as follows: PRR = (m/n) / ((M − m)/(N − n))
T498 |
24788-24829 |
Sentence |
denotes |
m = number of reports with drug and event |
T499 |
24830-24861 |
Sentence |
denotes |
n = number of reports with drug |
T500 |
24862-24906 |
Sentence |
denotes |
M = number of reports with event in database |
T501 |
24907-24940 |
Sentence |
denotes |
N = number of reports in database |
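Putting the PRR and AE score formulas together (a direct transcription of the definitions above):

```python
import math

def prr(m: int, n: int, M: int, N: int) -> float:
    """PRR = (m/n) / ((M - m)/(N - n)), with m, n, M, N as defined above."""
    return (m / n) / ((M - m) / (N - n))

def ae_score(count: float, prr_value: float) -> float:
    """nferX AE Score = ln(count) * sigmoid(PRR - 1.5)."""
    return math.log(count) * (1.0 / (1.0 + math.exp(-(prr_value - 1.5))))
```

For example, a drug with 100 reports, 20 of which mention an event that appears in 120 of 10,100 total reports, gives PRR = (20/100) / (100/10,000) = 20.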
T502 |
24941-25016 |
Sentence |
denotes |
Count of an event with a query drug is a good first measure of association. |
T503 |
25017-25217 |
Sentence |
denotes |
But it has the problem that generally common events will often show up at the top, where we are often more interested in events that are differentially associated with the query drug over other drugs. |
T504 |
25218-25304 |
Sentence |
denotes |
An issue with PRR is that it is noisy when the total number of event reports is small. |
T505 |
25305-25701 |
Sentence |
denotes |
If there are three reports of some oddly specific event and one occurs with the query drug, that event will likely have a very high PRR, but it may not be the event we are most interested in for the drug (in FAERS such rare events are often not even proper adverse events). We want events that occur often and are also differentially associated with the drug: a balance between count and PRR.
T506 |
25702-25769 |
Sentence |
denotes |
The AE score tries to strike this balance in an all-in-one measure. |
T507 |
25770-25943 |
Sentence |
denotes |
It up-weights events that occur often for the query drug (this is the ln(count) term), and that are differentially associated with the query drug (this is the sigmoid term). |
T508 |
25944-25998 |
Sentence |
denotes |
The sigmoid(PRR-1.5) term ranges smoothly from 0 to 1. |
T509 |
25999-26030 |
Sentence |
denotes |
It's equal to 0.5 at PRR = 1.5. |
T510 |
26031-26140 |
Sentence |
denotes |
When PRR = 6, sigmoid(PRR-1.5)=0.99; so PRR values >= 6 are all treated roughly equivalently by the AE score. |
T511 |
26141-26361 |
Sentence |
denotes |
Thus, extremely high PRRs due to small counts will not swing the AE score much beyond PRR = 6, and the ln(count) term will down-weight those small-count cases, so that they do not show up at the top of the AE score list. |
T512 |
26362-26548 |
Sentence |
denotes |
A nice property of AE score is that, for a given query drug, the AE scores of the events with that drug turn out to roughly follow an exponential distribution, particularly at the tails. |
T513 |
26549-26623 |
Sentence |
denotes |
We can then fit exponential distributions to the scores, and analyze them. |
T514 |
26624-26833 |
Sentence |
denotes |
A benefit of the exponential fit is that we can make more robust claims about how significant a certain score is for a query drug, even if the empirical data is sparse/noisy at the tails for a particular drug. |