What does the word2vec neural network do from the perspective of Gene–Disease associations? One way to view the word2vec 'black box' from a Gene/Disease perspective (the cosine similarity for all Gene–Disease vector pairs) is as a transfer function that transforms the input probability distribution (the cosine similarities of the randomly assigned, pre-training word vectors for Genes and Diseases) into a new probability distribution. The 'null hypothesis' (which is well preserved in practice by the way word2vec assigns random initial values to the vectors) is the green Cosine distribution (Figure 1—figure supplement 1D). Once word2vec training is over, the final word vectors occupy specific positions in the 300-dimensional space, producing the blue Empirical distribution (the actual cosine similarities between Gene–Disease pairs that we observe). The orange curve is the 2-Gamma mixture: the parametric distribution that captures the empirical distribution with just eight parameters (two αs, two βs, two ts, and two φs).

Observations from this analysis:

- The symmetrical cosine distribution becomes asymmetrical after training, with a longer right tail. This asymmetry is why the Gamma distribution worked better than, say, a Gaussian, for the curve fit.
- The mean of the distribution shifts to the right after training, as one would expect: during training the vectors are predominantly 'brought together' by parallelogram addition, which explains the rightward shift. Negative sampling causes movement in the opposite direction, but it disproportionately affects the ultra-high-frequency words, which are positively sampled more often; this is why a 3-Gamma mixture with a bump near 0.6 appears for the ultra-high-frequency words.
- The most interesting associations are, by definition, in the tail of the distribution.
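The two distributions above can be sketched numerically. For randomly initialized vectors in 300 dimensions, pairwise cosine similarity is symmetric around 0 with standard deviation of roughly 1/√300 (the green null curve); a right-skewed post-training distribution can then be fit with a two-component Gamma mixture by maximum likelihood. This is a minimal illustration, not the paper's actual pipeline: the data are synthetic, and the parameterization (shape α, scale, location t, mixing weight) is an assumption.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Null model: cosine similarity of random 300-d unit vectors is
# symmetric about 0 with std close to 1/sqrt(300) ~ 0.058.
d = 300
X = rng.standard_normal((2000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
null_cos = np.sum(X[:1000] * X[1000:], axis=1)   # 1000 disjoint random pairs
print(f"null mean={null_cos.mean():.3f}  std={null_cos.std():.3f}")

# Synthetic right-skewed 'post-training' cosines (a stand-in for the
# empirical blue curve): a bulk component plus a heavier right tail.
cos = np.concatenate([
    rng.gamma(shape=2.0, scale=0.05, size=5000) - 0.02,
    rng.gamma(shape=3.0, scale=0.10, size=1000) + 0.05,
])

def neg_log_lik(params):
    """Negative log-likelihood of a 2-Gamma mixture: shape, scale and
    location for each component, plus one mixing weight."""
    a1, s1, t1, a2, s2, t2, w = params
    pdf = (w * stats.gamma.pdf(cos, a1, loc=t1, scale=s1)
           + (1 - w) * stats.gamma.pdf(cos, a2, loc=t2, scale=s2))
    return -np.sum(np.log(pdf + 1e-300))   # guard against log(0)

x0 = np.array([2.0, 0.05, -0.05, 3.0, 0.10, -0.05, 0.8])
res = optimize.minimize(
    neg_log_lik, x0, method="L-BFGS-B",
    # each location must stay left of the data; weight stays in (0, 1)
    bounds=[(0.1, 50), (1e-3, 1.0), (-0.5, cos.min())] * 2 + [(0.01, 0.99)],
)
print("fitted 2-Gamma mixture parameters:", np.round(res.x, 3))
```

The longer right tail of the mixed sample is what makes the Gamma family a better fit than a Gaussian here; a symmetric model cannot capture the skew.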