Data mining analyses using random forests

The Random Forest procedure [18-20], a data mining approach for variable selection and model building, was used to perform a preliminary screening of variables for each BOPH (DEH11, EPI8, HDL and WBC). Random Forest was implemented in the R statistical package [21]. Variables known to be trivially related to the target BOPH were excluded from the initial and all subsequent Random Forest runs. In the initial Random Forest analyses, 10,000 trees were generated, and variable importance and cross-validated R-squared statistics were produced. The variable importance measure ranks all variables in each data set by their ability to predict the target BOPH. At the end of this process, 30 predictor sets, together with their cross-validated R-squared statistics, were retained. From this list, a large, a medium, and a small predictor set were chosen as input to the Multivariate Adaptive Regression Splines (MARS) procedure [22]. In each case, the large predictor set contained 30 variables, while the medium and small predictor sets were chosen at natural cut points in the sequence of cross-validated R-squared values.
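The text does not specify how the "natural cut points" were identified, and they may well have been chosen by inspection. As an illustration only, the sketch below shows one way such a choice could be formalised. It assumes (this is not stated in the text) that the 30 predictor sets are nested, with set k holding the k top-ranked variables, and it takes as cut points the two set sizes at which the cross-validated R-squared rises most sharply. The function name `choose_predictor_sets` and the R-squared values are hypothetical; the values are invented for the example and are not results from this study.

```python
def choose_predictor_sets(cv_r2):
    """Pick small, medium and large predictor-set sizes.

    cv_r2[i] is assumed to be the cross-validated R-squared of the nested
    predictor set containing the i + 1 top-ranked variables.
    This jump-based rule is an illustrative assumption, not the
    procedure documented in the paper.
    """
    large = len(cv_r2)
    if large < 4:
        raise ValueError("need at least 4 nested predictor sets")

    # Gain in R-squared obtained by growing the set from size k - 1 to size k.
    # Only sizes strictly below the large set are candidate cut points.
    gains = {k: cv_r2[k - 1] - cv_r2[k - 2] for k in range(2, large)}

    # The two sizes with the largest gains; ties go to the smaller size.
    ranked = sorted(gains, key=lambda k: (-gains[k], k))
    small, medium = sorted(ranked[:2])
    return {"small": small, "medium": medium, "large": large}


# Invented R-squared sequence for set sizes 1..30 (illustration only).
example_cv_r2 = [
    0.10, 0.12, 0.13, 0.14, 0.25, 0.26, 0.27, 0.27, 0.28, 0.28,
    0.29, 0.37, 0.38, 0.38, 0.38, 0.39, 0.39, 0.39, 0.39, 0.39,
    0.40, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40, 0.40,
]

sizes = choose_predictor_sets(example_cv_r2)
print(sizes)
```

With these invented values the sharpest rises occur when the fifth and twelfth variables are added, so the sketch selects set sizes 5, 12 and 30. A threshold on the distance from the maximum R-squared would be an equally defensible rule; the choice of criterion is a modelling decision that the text leaves open.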