Note that the ‘symmetrical’ cosine distribution becomes asymmetrical after training, with a longer ‘right tail’. This asymmetry is why a Gamma distribution worked better than, say, a Gaussian for the curve fit. The mean of the distribution shifts to the right after training, as one would expect: during training the vectors are predominantly ‘brought together’ by parallelogram addition, which explains the rightward shift. (Negative sampling causes a movement in the opposite direction, but it disproportionately affects the ‘ultra-high frequency’ words, which also get positively sampled more often; hence the 3-gamma fit with a bump near 0.6 for the ultra-high-frequency words.)
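As a minimal sketch of the Gamma-vs-Gaussian comparison described above: the snippet below fits a shifted Gamma and a Gaussian to a sample of pairwise cosine similarities and compares log-likelihoods. The embedding matrix here is a random placeholder (the original analysis uses trained vectors), and the use of `scipy.stats` fits is an illustrative assumption, not the exact fitting procedure used for the plots.

```python
import numpy as np
from scipy import stats

# Placeholder for trained embeddings: in practice, load your trained
# word-vector matrix of shape (vocab_size, dim). Random vectors are used
# here only so the sketch runs end to end.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(2000, 100))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows

# Pairwise cosine similarities (upper triangle, diagonal excluded).
cos = vecs @ vecs.T
sims = cos[np.triu_indices_from(cos, k=1)]

# Fit a shifted Gamma (loc allows support below zero) and a Gaussian to the
# same sample; a longer right tail favors the Gamma fit.
a, loc, scale = stats.gamma.fit(sims, loc=sims.min() - 0.01)
mu, sigma = stats.norm.fit(sims)

ll_gamma = stats.gamma.logpdf(sims, a, loc=loc, scale=scale).sum()
ll_norm = stats.norm.logpdf(sims, loc=mu, scale=sigma).sum()
print(f"Gamma log-likelihood:    {ll_gamma:.1f}")
print(f"Gaussian log-likelihood: {ll_norm:.1f}")
```

On random (untrained) vectors the two fits come out close, since the cosine distribution is nearly symmetric; the gap in favor of the Gamma fit is expected to appear once the right tail develops after training.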