In the most extreme scenario of simulation, i.e., Model 3, when there were no main G or E effects, the interaction effect was large (βGE1 = 2), and the unobserved exposure was rare (prob(E1 = 1) = 0.05), the distribution test was observed to be more powerful than the JLS-Fisher test (0.916 versus 0.406) (row 1 of Table S6). This is because the resulting phenotype distributions across genotypes differed in shape and their differences were not well captured by only mean and variance parameters. (The JLS-Fisher test could, in theory, be extended to include the skewness parameter or higher moments of such a distribution. The gain in power for this particular setting, however, would be associated with power loss for other approximately normally distributed traits.) In this scenario, when the phenotype deviated from normality, the LRT method appeared to be most powerful (power = 0.996). However, further analysis using permutation estimation of p values showed that power of the LRT under asymptotic analysis was greatly inflated, whereas the proposed JLS methods were robust (Table S7).