High-dimensional word embeddings for determining the significant global associations

Figure 1—figure supplement 1C shows two histograms generated from a random set of vectors in the vector space produced by the neural network: one distribution comprises all vector pairs whose cosine similarity is less than 0.32 (deemed 'not strong associations') and the other comprises all vector pairs whose cosine similarity is greater than 0.32 (deemed 'strong associations'). The comparison shows how common it is to find word vector pairs with very high cosine similarity values that nevertheless do not co-occur even once in the corpus. The bar at zero in the 'cosine similarity >= 0.32' distribution indicates that roughly 11% of vector pairs whose cosine similarity was greater than 0.32 ('strong associations') never occurred together in a single document. The figure also shows that, although more of the mass of the 'cosine similarity >= 0.32' distribution is skewed to the right, as expected (more co-occurrences and hence, unsurprisingly, larger cosine similarity values), the 'cosine similarity < 0.32' distribution has a long tail (very high co-occurrence counts but small cosine similarity). This long tail is a direct consequence of negative sampling, whereby vectors corresponding to common words that co-occur very often with significant words in a sliding window are moved away from the vectors of those other words.

What does the word2vec neural network do from the perspective of Gene-Disease associations?

One way to view the word2vec 'black box' operation from a Genes/Diseases perspective (the cosine similarity between the vectors of all Genes and Diseases) is as a transfer function that converts the input probability distribution (the pre-training, randomly assigned word vectors for Genes and Diseases) into a new probability distribution. The 'null hypothesis' (which appears to be well preserved in practice in the way word2vec initially assigns random values to the vectors) is the green Cosine distribution (Figure 1—figure supplement 1D). Once word2vec training is over, the final word vectors occupy specific positions in the 300-dimensional space and produce the blue Empirical distribution (the actual cosine similarities between Gene-Disease pairs that we observe). The orange curve is the 2-Gamma mixture, the parametric distribution that captures the empirical distribution with just eight parameters (two alphas, two betas, two ts and two phis).

Several observations follow from this analysis. The 'symmetrical' cosine distribution becomes asymmetrical after training, with a longer right tail; this asymmetry is why a Gamma distribution fits the curve better than, say, a Gaussian. The mean of the distribution shifts to the right after training, as one would expect: during training the vectors are predominantly 'brought together' by parallelogram addition, which explains the rightward shift (negative sampling causes movement in the opposite direction, but it disproportionately affects the 'ultra-high frequency' words, which also get 'more' positive sampling; hence the 3-Gamma fit with a bump near 0.6 for ultra-high-frequency words). The most interesting associations, by definition, lie in the tail of the distribution.
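For concreteness, the eight parameters listed above (two shape parameters alpha, two rate parameters beta, two location shifts t and two mixture weights phi) are consistent with a location-shifted two-component Gamma mixture of the form below; this is an illustrative parameterization, not necessarily the exact functional form used in the fit:

\[
f(x) \;=\; \sum_{i=1}^{2} \phi_i \, \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)} \,(x - t_i)^{\alpha_i - 1}\, e^{-\beta_i (x - t_i)}, \qquad x > t_i ,
\]

where each component contributes a shape \(\alpha_i\), a rate \(\beta_i\), a shift \(t_i\) and a weight \(\phi_i\).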
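To illustrate how the comparison behind Figure 1—figure supplement 1C can be set up, the sketch below splits term pairs at a cosine similarity threshold of 0.32 and, for each pair, counts the number of documents in which both terms occur. It is a minimal sketch, assuming a hypothetical `embeddings` mapping (term to vector) and a hypothetical `documents` list (one set of terms per document); it is not the actual corpus-processing pipeline used in this work.

```python
import numpy as np
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two 1-D numpy vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def split_pairs_by_similarity(embeddings, documents, threshold=0.32):
    """Group term pairs by cosine similarity and record document co-occurrence counts.

    embeddings: dict mapping term -> 1-D numpy vector (e.g. 300-dimensional)
    documents:  list of sets of terms, one set per document
    """
    counts = {"strong": [], "not_strong": []}  # co-occurrence counts per group
    for a, b in combinations(embeddings, 2):
        sim = cosine(embeddings[a], embeddings[b])
        n_docs = sum(1 for doc in documents if a in doc and b in doc)
        counts["strong" if sim >= threshold else "not_strong"].append(n_docs)
    strong = np.array(counts["strong"])
    if strong.size:
        # Fraction of 'strong' pairs that never co-occur in any document
        print(f"strong pairs with zero co-occurrence: {np.mean(strong == 0):.1%}")
    return counts
```

Histogramming the two returned lists of co-occurrence counts would reproduce the kind of side-by-side comparison described above.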
What does varying the number of dimensions in the word2vec space do to the underlying cosine similarity distributions in a large textual corpus?

Figure 1—figure supplement 1E shows a cosine similarity probability density function (PDF) graph that visually describes the implementation of the word2vec-like Vector Space Model in various N-dimensional spaces. As described in the Materials and methods section, the system is a Semantic Bio-Knowledge Graph whose nodes represent the words/phrases chosen to be represented as vectors and whose edge weights are determined by measures of Semantic Association Strength (e.g. the cosine similarity between a pair of word embeddings represented as vectors in a high-dimensional space). The cosine similarity ranges from 0 (no semantic association) to 1 (strongest association), and this metric reflects the contextual similarity of the entities in the biomedical corpora. The typical dimensionality used by our neural network for generating the Global Scores is n = 300. This is because, as the graph shows, the distribution at this dimensionality is highly peaked, with most of the mass centered around 0; that is, a randomly chosen pair of vectors is typically orthogonal or close to orthogonal. Furthermore, at 300 dimensions and above, the distributions all have sufficiently long tails containing the most interesting (salient) biomedical associations.
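The near-orthogonality of random vector pairs in high dimensions can be checked with a small simulation. The sketch below is a minimal illustration under the assumption of isotropic random vectors (independent Gaussian components, whose directions are uniform on the unit sphere), not part of our pipeline; it shows the cosine similarity concentrating around 0 as the dimensionality grows, with a standard deviation of roughly 1/sqrt(n).

```python
import numpy as np

def random_cosine_similarities(n_dims, n_pairs=100_000, seed=0):
    """Cosine similarities between independent random Gaussian vector pairs.

    Serves as a null model: with i.i.d. standard-normal components, the
    vector directions are uniform on the unit sphere in n_dims dimensions.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_pairs, n_dims))
    b = rng.standard_normal((n_pairs, n_dims))
    return np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

for n_dims in (50, 100, 300, 1000):
    cos = random_cosine_similarities(n_dims)
    # Higher dimensionality -> similarities concentrate more tightly around 0
    print(f"n = {n_dims:4d}   mean = {cos.mean():+.4f}   std = {cos.std():.4f}")
```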