Id |
Subject |
Object |
Predicate |
Lexical cue |
T309 |
0-74 |
Sentence |
denotes |
Unstructured biomedical knowledge synthesis and triangulation capabilities |
T310 |
75-258 |
Sentence |
denotes |
In order to capture biomedical literature-based associations, the nferX platform defines two scores: a ‘local score’ and a ‘global score’, as described previously (Park et al., 2020). |
T311 |
259-595 |
Sentence |
denotes |
Briefly, the local score is obtained from applying a traditional natural language processing technique which captures the strength of association between two concepts in a selected corpus of biomedical literature based on the frequency of their co-occurrence normalized by the frequency of each individual concept throughout the corpus. |
T312 |
596-786 |
Sentence |
denotes |
A higher local score between Concept X and Concept Y indicates that these concepts are mentioned in close proximity to each other more frequently than would be expected by chance.
T313 |
787-934 |
Sentence |
denotes |
The global score, on the other hand, is based on the neural network methods that have recently come to prominence in Natural Language Processing (NLP).
T314 |
935-1065 |
Sentence |
denotes |
To compute global scores, all tokens (e.g. words and phrases) are projected into a high-dimensional vector space of word embeddings.
T315 |
1066-1165 |
Sentence |
denotes |
These vectors serve to represent the ‘neighborhood’ of concepts which occur around a given concept. |
T316 |
1166-1391 |
Sentence |
denotes |
The cosine similarity between any two vectors measures the similarity of these neighborhoods and is the basis for our global score metric, where concepts which are more similar in this vector space have a higher global score. |
T317 |
1392-1632 |
Sentence |
denotes |
While the global scores in this work are computed in the embedding space of a word2vec model, they can also be computed in the embedding space of any deep learning model, including recent transformer-based models like BERT (Devlin et al., 2019).
T318 |
1633-1794 |
Sentence |
denotes |
These may have complementary benefits to word2vec embeddings since the embeddings are context-sensitive, with different vectors for different sentence contexts.
T319 |
1795-2033 |
Sentence |
denotes |
However, despite the context-sensitive nature of BERT embeddings, a global score computation for a phrase may still be of value, given that the score is computed across sentence embeddings capturing the context-sensitive nature of those phrases.
T320 |
2034-2260 |
Sentence |
denotes |
From a visualization perspective, the local score and global score (‘Signals’) are represented in the platform using bubbles where bubble size corresponds to the local score and color intensity corresponds to the global score. |
T321 |
2261-2386 |
Sentence |
denotes |
This allows users to rapidly determine the strength of association between any two concepts throughout biomedical literature. |
T322 |
2387-2545 |
Sentence |
denotes |
We consider concepts which show both high local and global scores to be ‘concordant’ and have found that these typically recapitulate well-known associations. |
T323 |
2546-2699 |
Sentence |
denotes |
One key aspect of the nferX platform is that it allows the user to query associated concepts for a virtually unbounded number of possible query concepts. |
T324 |
2700-2742 |
Sentence |
denotes |
This is achieved by means of two features: |
T325 |
2743-2995 |
Sentence |
denotes |
Firstly, the nferX platform allows the user to compose queries that combine any number of biomedical concepts using the logical AND, OR and NOT operators, each combination amounting to a coarse or nuanced composite biomedical concept.
T326 |
2996-3498 |
Sentence |
denotes |
Secondly, since logical combinations yield a virtually unbounded number of biomedical concepts that can be queried, the nferX platform implements a completely dynamic method of computing local scores on the fly by using novel high-performance parallel and distributed algorithms that, in real time, scan hundreds of millions of documents to quickly locate text fragments related to the user query and count co-occurring biomedical concepts for computing strength-of-association scores and their significance.
T327 |
3499-3847 |
Sentence |
denotes |
The platform further leverages statistical inference to calculate ‘enrichments’ based on structured data, thus enabling real-time triangulation of signals from the unstructured biomedical knowledge graph with various other structured databases (e.g. curated ontologies, RNA-sequencing datasets, human genetic associations, protein-protein interactions).
T328 |
3848-4019 |
Sentence |
denotes |
This facilitates unbiased hypothesis-free learning and faster pattern recognition, and it allows users to more holistically determine the veracity of concept associations. |
T329 |
4020-4226 |
Sentence |
denotes |
Finally, the platform allows the user to identify and further examine the documents and textual fragments from which the knowledge synthesis signals are derived using the Documents and Signals applications. |
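
The two scores described above can be illustrated with a minimal sketch. It is not the nferX implementation: the text does not give the exact local score formula, so the sketch assumes a pointwise mutual information (PMI) style normalization, treats each document as a set of concepts, and uses hypothetical function names (`local_scores`, `cosine_similarity`) and toy data. The global score is sketched as plain cosine similarity between two embedding vectors, which the text names as the basis of that metric.

```python
import math
from collections import Counter
from itertools import combinations


def local_scores(documents):
    """PMI-style association score for every co-occurring concept pair.

    Assumption: `documents` is an iterable of concept collections, and
    co-occurrence means "mentioned in the same document". The platform's
    actual proximity window and normalization are not specified in the text.
    """
    documents = [set(doc) for doc in documents]
    n_docs = len(documents)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in documents:
        concept_counts.update(concepts)
        pair_counts.update(combinations(sorted(concepts), 2))

    scores = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n_docs
        p_x = concept_counts[x] / n_docs
        p_y = concept_counts[y] / n_docs
        # Positive values: the pair co-occurs more often than chance predicts.
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Convention chosen here for zero vectors.
    return dot / (norm_u * norm_v)


# Toy corpus: each document is the set of concepts it mentions.
toy_documents = [
    {"ACE2", "SARS-CoV-2"},
    {"ACE2", "SARS-CoV-2"},
    {"ACE2", "hypertension"},
    {"insulin", "diabetes"},
]
toy_scores = local_scores(toy_documents)
```

In this sketch, pair keys are alphabetically sorted tuples, so a lookup such as `toy_scores[("ACE2", "SARS-CoV-2")]` returns the local score for that pair, and `cosine_similarity` can be applied to any two vectors taken from a word2vec or other embedding model.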