Metrics based on training a 1-d logistic model

In this test, we are discriminating between two classes (true association vs. non-association) based on one feature. We have two metrics based on fitting a 1-feature logistic curve to the data (Figure 1—figure supplement 1A–B).

Brier score: The Brier score is the average squared error of the logistic curve above: for each labeled point, we square the vertical distance to the logistic curve and average over all labeled points (Contributors to Wikimedia projects, 2005).

Log loss (dansbecker, 2018): The logistic log loss is the average of -log[model probability of the true label] over all labeled points. If the model predicts the true label with certainty at a point, it incurs no loss there. If it predicts 0.5, it incurs a loss of -log(0.5). If it predicts ‘yes’ with certainty when the answer is ‘no’, it incurs infinite loss (a logistic function never touches 0 or 1, so this does not happen in our case).

Neg log percentile: For most of the scoring rules, we also include a -log(percentile) version of the rule. For a query q, a token t, and a score S(q, t), it is constructed as follows:

- Compute the scores S(q, t’) for q with every token t’.
- Let R be the number of these scores that are nonzero.
- Take the rank r of S(q, t) among all nonzero S(q, t’).
- The neg log percentile score nlS(q, t) associated with S is -log(r/R).

We do this to:

- control for differences across queries, and
- control for differences in the shapes of the distributions that different association scoring functions take.

This procedure maps all the S(q, t’) to an Exponential(1) distribution. We chose Exponential(1) because it is simple and intuitively reasonable, and many of the scores already appeared to be approximately exponential.
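As an illustration, the three metrics above can be sketched in pure Python. This is a minimal sketch, not the paper's implementation: the logistic fit via plain gradient descent, the learning-rate and step-count values, and the rank convention (rank 1 = largest score) are our assumptions here.

```python
import math

def fit_logistic_1d(xs, ys, lr=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(a*x + b) to binary labels ys by
    gradient descent on the log loss (an assumed fitting method)."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a * x + b)))
            ga += (p - y) * x   # gradient of log loss w.r.t. the slope a
            gb += (p - y)       # gradient w.r.t. the intercept b
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def brier_score(ps, ys):
    """Average squared vertical distance from each label to the curve."""
    return sum((p - y) ** 2 for p, y in zip(ps, ys)) / len(ys)

def log_loss(ps, ys):
    """Average of -log[model probability of the true label]."""
    return -sum(math.log(p if y == 1 else 1.0 - p)
                for p, y in zip(ps, ys)) / len(ys)

def neg_log_percentile(all_scores, s):
    """-log(r/R): R = number of nonzero scores for this query,
    r = rank of s among them (rank 1 = largest; assumed convention)."""
    nonzero = [v for v in all_scores if v != 0]
    R = len(nonzero)
    r = 1 + sum(1 for v in nonzero if v > s)
    return -math.log(r / R)
```

For example, fitting to six labeled points and scoring the fit: `a, b = fit_logistic_1d([0.1, 0.3, 0.4, 0.6, 0.7, 0.9], [0, 0, 0, 1, 1, 1])`, then evaluating `brier_score` and `log_loss` on the fitted probabilities gives values well below the predict-0.5 baseline (0.25 and log 2, respectively). Note that a perfectly confident wrong prediction would make `log_loss` infinite, which is why the bounded sigmoid output matters.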