Simulations of group analysis with different testing methods

Because the spatial extent of FMRI data analysis is controlled independently through the false positive rate or family-wise error, the simulations here were performed at a single voxel to examine and compare false positive control and power among the testing methods. Simulated data were generated with the following parameters, imitating a typical FMRI group analysis under six scenarios (top row in Figure 1): a) one group of subjects with a small undershoot at the end of the HDR curve; b) one group of subjects with a moderate undershoot at the end; c) two homoscedastic groups (same variance between groups) with an equal number of subjects in each and a similar HDR profile but a factor of 2 difference in amplitude; d) two homoscedastic groups with an equal number of subjects in each and HDRs of the same amplitude but with a 2 s difference in peak location; e) two heteroscedastic groups (different variance between groups) with an equal number of subjects in each and a similar HDR profile but a factor of 2 difference in amplitude; and f) two heteroscedastic groups with an equal number of subjects in each and HDRs of the same amplitude but with a 2 s difference in peak location.

The HDRs are assumed to be estimated through 7 basis functions (e.g., TENT in AFNI) at the individual level, and the associated 7 effect components {βi, i = 1, 2, …, 7} at the TR grids are assumed to follow a multivariate Gaussian distribution with a first-order autoregressive AR(1) structure for their variance-covariance matrix

$$\Sigma = \sigma^2 \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^6 \\ \rho & 1 & \rho & \cdots & \rho^5 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^6 & \rho^5 & \rho^4 & \cdots & 1 \end{bmatrix}.$$

This simple Σ structure was chosen to keep the number of simulations manageable while at the same time providing a reasonable structure similar to the one adopted for the Gaussian prior in Marrelec et al. (2003), which guarantees the smoothness of the HDR.
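The AR(1) covariance structure and the sampling of the 7 effect components can be sketched as follows. This is a minimal illustration in Python with NumPy, not the code used for the simulations reported here; the function names `ar1_covariance` and `simulate_group` and the values in `mean_hdr` are hypothetical and chosen only for demonstration.

```python
import numpy as np

def ar1_covariance(n_components=7, rho=0.3, sigma=1.0):
    """Return sigma^2 * rho^|i-j|, the AR(1) variance-covariance matrix."""
    idx = np.arange(n_components)
    lags = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    return sigma**2 * rho**lags

def simulate_group(mean_hdr, n_subjects, rho=0.3, sigma=1.0, rng=None):
    """Draw one group's effect components: one row per subject."""
    rng = np.random.default_rng() if rng is None else rng
    cov = ar1_covariance(len(mean_hdr), rho, sigma)
    return rng.multivariate_normal(mean_hdr, cov, size=n_subjects)

# Hypothetical HDR sampled at 7 TR grid points (illustrative values only).
mean_hdr = np.array([0.0, 0.6, 1.0, 0.7, 0.2, -0.1, 0.0])

cov = ar1_covariance(7, rho=0.3, sigma=1.8)
betas = simulate_group(mean_hdr, n_subjects=12, rho=0.3, sigma=1.8,
                       rng=np.random.default_rng(0))
print(betas.shape)  # (subjects, components)
```

Building Σ from the lag matrix |i − j| keeps the construction independent of the number of basis functions, so the same sketch applies if a different number of components is used.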
To explore the impact of sample size, the number of subjects in each group was simulated at n = 9, 12, 15, 18, 21, 24, 27, 30 with ρ = 0.3 for each of the six scenarios. The standard error σ varied across the scenarios (shown in Figure 1) to obtain comparable power for each n. A total of 5000 datasets were simulated, each of which was analyzed through 3dMVM with two explanatory variables: Group (between-subjects factor with 2 levels) and Component (within-subject factor with 7 levels associated with the 7 basis functions). False positive rate (FPR) and power were assessed by counting the datasets whose F- or t-statistic surpassed the threshold corresponding to the nominal significance level of 0.05. Similarly, a one- or two-sample t-test, as appropriate to the scenario, was performed on the AUC and L2D values.

Figure 1 Simulation parameters and results. The six rows correspond to the scenarios in which the presumed HDRs (first column) with a poststimulus undershoot were generated by the convolution program waver in AFNI and sampled at TR = 2 s (shown with vertical dotted lines): (1) one group with a small (1a, σ = 1.8) and a moderate (1b, σ = 1.8) undershoot; (2) two homoscedastic groups with the same HDR shape but different amplitudes (2a, σ = 0.5) and with the same peak amplitude but a difference of two seconds in peak location (2b, σ = 0.3); (3) two heteroscedastic groups with the same HDR shape but different amplitudes (3a, σ = 0.3) and with the same peak amplitude but a difference of two seconds in peak location (3b, σ = 0.3). FPR and power are shown in the second and third columns for a varying number of subjects in each group at a temporal correlation coefficient ρ of 0.3 under six testing approaches: XUV, LME, MVT, XMV, AUC, and L2D. The curves for FPR and power were fitted to the simulation results (plotting symbols) through LOESS smoothing with second-order local polynomials.
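The counting procedure for FPR and power described above can be illustrated for the simplest case, a one-sample t-test on AUC with one group. The sketch below is an assumption-laden stand-in rather than the 3dMVM analysis itself: the function `rejection_rate`, the HDR values in `mean_hdr`, and the seed are hypothetical, and the critical values are the standard two-sided 5% Student's t table entries for n − 1 degrees of freedom.

```python
import numpy as np

# Two-sided 5% critical values of Student's t for df = n - 1 (table values),
# listed for the sample sizes used in the simulations.
T_CRIT = {9: 2.306, 12: 2.201, 15: 2.145, 18: 2.110,
          21: 2.086, 24: 2.069, 27: 2.056, 30: 2.045}

def rejection_rate(mean_hdr, n_subjects, rho=0.3, sigma=1.8,
                   n_datasets=5000, seed=0):
    """Fraction of simulated datasets whose AUC t-statistic exceeds the threshold."""
    rng = np.random.default_rng(seed)
    k = len(mean_hdr)
    idx = np.arange(k)
    cov = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)
    # Shape: (datasets, subjects, components)
    betas = rng.multivariate_normal(mean_hdr, cov, size=(n_datasets, n_subjects))
    auc = betas.sum(axis=2)  # AUC: sum over the components, per subject
    t = auc.mean(axis=1) / (auc.std(axis=1, ddof=1) / np.sqrt(n_subjects))
    return float(np.mean(np.abs(t) > T_CRIT[n_subjects]))

# Hypothetical HDR profile (illustrative values only).
mean_hdr = np.array([0.0, 0.6, 1.0, 0.7, 0.2, -0.1, 0.0])

fpr = rejection_rate(np.zeros(7), n_subjects=12)   # null: no effect
power = rejection_rate(mean_hdr, n_subjects=12)    # alternative: effect present
print(f"FPR = {fpr:.3f}, power = {power:.3f}")
```

Under the null, the rejection rate estimates the FPR and should sit near the nominal 0.05; under a nonzero HDR, the same count estimates power.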
Among the six scenarios, all the testing methods showed proper control of FPR except for L2D with one group of subjects. L2D exhibits high power but at the cost of poor FPR control. This is in part due to the reduction of the effect estimates to a positive value regardless of the signs of the individual components in ESM. It is possible to reduce this problem in ASM when the sign of the principal kernel is assigned to the resulting L2D measure, as shown in (7) and (8). In addition, L2D achieved the lowest power with two groups of subjects. AUC simply sums over all the components, significantly misrepresenting the effects when the undershoot becomes moderate. This is reflected in the results: reasonable power is achieved when the undershoot is small, and lower power is obtained when the undershoot is moderate. With two groups, AUC performed well in power when the two groups had the same HDR shape, but behaved as poorly as L2D when the two groups had different HDR shapes. As expected, AUC is sensitive only to peak amplitude differences and is insensitive to shape subtleties. Except for L2D and AUC, the methods tend to converge in power when the sample size is large enough (e.g., 30 or more). With one group, LME outperformed all other candidates. XUV had a balanced performance in power across all the scenarios, consistently surpassing XMV. Lastly, MVT was slightly more powerful than XUV with two groups when their HDRs were of the same shape and the number of subjects was large (e.g., 20 or more per group). In summary, our simulations show that LME is preferred when there is only one group of subjects with no other explanatory variables present. Under other circumstances, XUV is the preferred choice, especially with the typical sample size of most studies, while MVT, AUC, and XMV may provide some auxiliary detection power.
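The positivity issue with L2D in the one-group case can be made concrete with a small numerical sketch. It assumes, for illustration only, that L2D is the Euclidean norm of the 7 effect components and that it is tested against zero with a one-sample t-test; the exact setup behind the reported results may differ, and all variable names and settings below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
k, n_subjects, n_datasets = 7, 12, 2000
rho, sigma = 0.3, 1.8
t_crit = 2.201  # two-sided 5% critical value of Student's t, 11 df (table value)

idx = np.arange(k)
cov = sigma**2 * rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)

# Null data: every effect component has mean zero.
betas = rng.multivariate_normal(np.zeros(k), cov, size=(n_datasets, n_subjects))

auc = betas.sum(axis=2)                   # signed summary
l2d = np.sqrt((betas**2).sum(axis=2))     # assumed L2D: always non-negative

def reject_fraction(x):
    """Share of datasets whose one-sample |t| exceeds the critical value."""
    t = x.mean(axis=1) / (x.std(axis=1, ddof=1) / np.sqrt(x.shape[1]))
    return float(np.mean(np.abs(t) > t_crit))

fpr_auc = reject_fraction(auc)
fpr_l2d = reject_fraction(l2d)
print(f"AUC rejection rate under the null: {fpr_auc:.3f}")
print(f"L2D rejection rate under the null: {fpr_l2d:.3f}")
```

Because the norm discards the signs of the components, its mean is positive even when no effect is present, so a test against zero rejects far more often than the nominal level, whereas the signed AUC summary does not.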