Introduction

When a region in the brain is activated, oxygen and glucose demands lead to blood vessel dilation, followed by increased blood flow to the tissue (neurons and astrocytes) under stress. The onset of neuronal activity triggers a sequence of physiological events in the blood vessels of the surrounding area, typically characterized by changes in cerebral blood flow as well as fluctuations in the concentrations of deoxyhemoglobin and oxyhemoglobin. The blood oxygenation level dependent (BOLD) signal from FMRI scanning mainly captures the concentration changes of deoxyhemoglobin; that is, the BOLD signal is a surrogate and signature of neuronal activation plus various sources of noise (e.g., physiological and random fluctuations). As an indirect measure of neuronal activity, the shape of the BOLD response may hold crucial information about brain function. However, the cascade of events from neural activation to measurable MRI signal is complex and nonlinear under certain regimes (Friston et al., 1998b; Birn et al., 2001; Logothetis and Wandell, 2004; Logothetis, 2008; Magri et al., 2012): even though the BOLD response is often interpreted simply as a change in neuronal processing, the same neuronal activity may evoke different hemodynamic response (HDR) shapes across trials, regions, conditions/tasks, subjects, and groups. For example, neurophysiological confounds such as changes in neurovascular coupling or energy consumption could lead to different BOLD response features, potentially explaining the HDR variability in magnitude and shape across brain regions, cognitive conditions, and populations (e.g., children with autism vs. controls, Reynell and Harris, 2013). Regardless, meaningful interpretation as well as detection power in FMRI data analysis may depend on accurate modeling of the BOLD response at both the individual subject and group levels (e.g., Buxton et al., 2004; Handwerker et al., 2004; Stephen et al., 2007; Barbé et al., 2012; Badillo et al., 2013).

In an experimentally manipulated setting, the subject typically performs tasks or is placed under certain conditions in an event-related design, with each trial lasting 2 s or less, and the HDR to each trial can be mathematically characterized by an impulse response function (IRF) that corresponds to a stimulus with a theoretically instantaneous duration and unit intensity. The voxel-wise EPI signal is then modeled through time series regression with explanatory variables (or regressors) of interest, each of which is constructed through the convolution of the stimulus timing with the IRF. In a block design, each task or condition lasts longer than 2 s. As each block can be approximately treated as a sequence of events separated by the scanning repetition time (TR), the theoretical HDR is usually hypothesized as the integral or linear summation of the consecutive IRFs, or the convolution of the IRF over the stimulus duration.

We typically approximate the HDR from experimental data with idealized mathematical functions (usually called HDR functions or HRFs), assuming linearity and time-invariance (or stationarity) (Marrelec et al., 2003), and consider three common approaches to modeling the average HDR across trials. The first presumes an IRF with a fixed shape (e.g., the gamma variate waveform in AFNI, Cohen, 1997; the canonical IRF in SPM, FSL, and NIPY, Friston et al., 1998a).
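For instance, the regressor for such a fixed-shape IRF can be built by convolving the stimulus timing with the presumed IRF, as in the minimal sketch below (Python/NumPy; the gamma-variate parameters, onset times, and run length are illustrative assumptions, not any package's implementation or defaults):

```python
import numpy as np

TR = 1.25                          # repetition time (s), assumed
n_vols = 300                       # number of EPI volumes in a run, assumed
t = np.arange(0, 20, TR)           # time grid for the IRF (first ~20 s)

# Illustrative gamma-variate IRF (peaks near 5 s), scaled to unit peak
irf = (t ** 8.6) * np.exp(-t / 0.547)
irf /= irf.max()

# Stimulus train: 1 at the volume closest to each trial onset, 0 elsewhere
onsets_sec = np.array([10.0, 40.0, 72.5, 110.0])     # hypothetical onsets
stim = np.zeros(n_vols)
stim[np.round(onsets_sec / TR).astype(int)] = 1.0

# Fixed-shape regressor: convolution of the stimulus train with the IRF,
# truncated to the run length
regressor = np.convolve(stim, irf)[:n_vols]

# For a block design, the same logic applies with the stimulus train set to 1
# at every TR within each block, so the IRF is effectively summed over the block.
```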
With this model-based or fixed-shape method (FSM), the regression coefficient or β associated with each condition in the individual subject analysis reflects the major HDR magnitude (e.g., percent signal change).

The second approach makes no assumption about the IRF's shape and estimates it with a set of basis functions. The number of basis functions varies depending on the kernel set and the duration over which the response is modeled. A common implementation of this estimated-shape method (ESM) uses a set of equally-spaced TENT (piecewise linear) functions or linear splines, and each of the resulting regression coefficients represents an estimate of the response amplitude at a specific time after stimulus onset. Regardless of the kernel set, however, ESM generates the same number of regressors as the number of basis functions (e.g., m) per condition or task, resulting in m regression coefficients that need to be considered simultaneously at the group level. In addition to the aforementioned TENT basis set, options for ESM at the voxel level include cubic splines, Legendre polynomials, sines, or user-defined functions in AFNI; finite impulse response (FIR) functions in SPM, FSL, and NIPY; the inverse logit (Lindquist et al., 2009); and high-order B-splines (Degras and Lindquist, 2014). In addition, the Python package PyHRF offers an ESM at the parcel level through the joint detection-estimation framework (Vincent et al., 2014). One significant advantage of adopting basis functions such as TENTs or cubic splines is the flexibility of creating regressors through piecewise interpolation when the stimulus onset times are not aligned with the TR grid (e.g., when the acquisition time is shorter than the TR so that "silent trials" can be presented as a control condition for speech or other auditory stimuli).

The third approach lies between the two extremes of FSM and ESM, and uses a set of two or three basis functions (Friston et al., 1998b). In this adjusted-shape method (ASM), the first basis (the canonical IRF) captures the major HDR shape; the second basis, the time derivative of the canonical IRF, provides some flexibility in modeling the delay or time-to-peak; and the third basis, the dispersion curve (the derivative with respect to the dispersion parameter of the canonical IRF), allows the peak duration to vary.

With one parameter per condition, FSM is the most efficient and statistically powerful of the three, provided the presumed shape is reasonably close to the ground truth, and its group analysis strategies have been developed to reasonable maturity: the β values at the individual level are typically brought to the group level using Student's t-tests, permutation tests (Nichols and Holmes, 2002; Dehaene-Lambertz et al., 2006; Mériaux et al., 2006; Winkler et al., 2014), AN(C)OVA, the general linear model (GLM) (Poline and Brett, 2012), multivariate modeling (MVM) (Chen et al., 2014), linear mixed-effects (LME) modeling (Bernal-Rusiel et al., 2013; Chen et al., 2013), or mixed-effects multilevel analysis (Worsley et al., 2002; Woolrich et al., 2004; Chen et al., 2012), with the assumption that each effect estimate is equally reliable across all subjects. However, deviations of the HDR from the presumed shape result in biased estimates of the amplitude, in addition to a failure to capture differences in shape such as during the undershoot or recovery phase.
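Returning to ESM, the sketch below illustrates how the regressors for a set of equally spaced tent functions can be built by piecewise-linear interpolation, which works even when trial onsets fall between TR grid points (a minimal illustration in Python/NumPy with an assumed tent half-width of one TR and hypothetical onsets; not the exact implementation of any package):

```python
import numpy as np

TR = 1.25
n_vols = 300
m = 10                                   # number of tent basis functions, assumed
centers = np.arange(m) * TR              # tent centers at 0, 1.25, ..., 11.25 s
scan_times = np.arange(n_vols) * TR

def tent(t, c, h=TR):
    """Piecewise-linear 'tent' centered at c with half-width h."""
    return np.maximum(0.0, 1.0 - np.abs(t - c) / h)

# Hypothetical onsets that do not fall on the TR grid
onsets_sec = np.array([10.3, 40.9, 72.6])

# One regressor per tent: sum the k-th tent, time-locked to each trial onset
X = np.zeros((n_vols, m))
for onset in onsets_sec:
    lag = scan_times - onset             # time since this trial's onset
    for k, c in enumerate(centers):
        X[:, k] += np.where(lag >= 0, tent(lag, c), 0.0)

# Each column of X yields one regression coefficient: the estimated response
# amplitude at roughly k * TR seconds after stimulus onset.
```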
ESM is the most flexible of the three methods in terms of providing an accurate characterization of the BOLD response, and it can achieve higher activation detection power in individual subjects. In addition, the estimated HDR curve, with its unique signature shape, offers much stronger support for the existence of activation than a single scaling factor or β value from FSM or ASM. Compared with FSM, ASM also yields a less biased estimate of the response amplitude for the principal kernel and accounts for more variance; however, the common practice of carrying only the principal kernel's coefficient to the group level does not allow the detection of shape differences between conditions or groups when they exist.

Using ESM (and, to a lesser degree, ASM) poses several difficulties. First, a larger number of kernel coefficients must be estimated: ESM requires m times more regressors than FSM in the individual subject analysis, which translates to more data points and scanning time to reach similar statistical power in individual subjects. Second, there is a risk of over-fitting when confounding effects such as head motion and physiological noise are stimulus-locked and not fully accounted for. Lastly, the most challenging step lies at the group level: how can those m effect estimates be handled simultaneously, and how should the results be summarized and interpreted?

To avoid the complexity involved in the multiple effect estimates from ESM or ASM, the popular approach at the group level is dimensional reduction, condensing the shape information in the multiple values into one number. For ESM, one method is to sum over all or a subset of the effect estimates (e.g., ignoring a few time points at the beginning and the end) to obtain the area under the curve (AUC) (e.g., Beauchamp et al., 2003; Greene et al., 2007; McGregor et al., 2013). As the BOLD response curve can be characterized by parameters such as amplitude (or height), delay (or time-to-peak), and duration (or FWHM), another dimensional reduction proposal is to perform the group analysis on such a parameter derived from the estimated HDR (Lindquist et al., 2009; Degras and Lindquist, 2014). With two or three effect estimates per condition from ASM, the popular approach at the group level focuses on the β value of the canonical IRF while ignoring the shape-adjustment parameters (i.e., the function of these other parameters is to absorb minor shape fluctuations that would otherwise be treated as "noise"). One alternative is to estimate the HDR height as the Euclidean or L2-norm distance (L2D) of the two or three effect estimates (Calhoun et al., 2004; Lindquist et al., 2009; Steffener et al., 2010). Essentially, these dimensional reduction methods transform the effect estimates from a k-dimensional space ℝ^k to the one-dimensional ℝ^1. As information loss is unavoidable in the process, statistical power in activation identification suffers. This raises the question of whether an alternative approach to significance testing might better exploit the information in the HDR shape at the group level.
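The two dimensional-reduction strategies mentioned above can be sketched as follows (a minimal illustration with hypothetical effect estimates; the TR scaling of the AUC and the sign convention for the L2-norm height are assumptions consistent with common practice, not a prescription):

```python
import numpy as np

TR = 1.25

# ESM: m per-time-point coefficients (percent signal change) for one condition
beta_esm = np.array([0.02, 0.15, 0.42, 0.55, 0.48, 0.30, 0.12, 0.01, -0.05, -0.03])

# Area under the curve: sum over a subset of estimates (here dropping the first
# and last time points); the TR scaling is optional, a plain sum is also common
auc = beta_esm[1:-1].sum() * TR

# ASM: coefficients for the canonical IRF, its time derivative, and its
# dispersion derivative
beta_asm = np.array([0.50, 0.10, -0.04])

# L2-norm (L2D) estimate of the response height, with the sign taken from the
# canonical-IRF coefficient
height = np.sign(beta_asm[0]) * np.linalg.norm(beta_asm)
```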
A motivational example

To demonstrate and compare various modeling approaches at the group level, we adopt the same experimental data used in our previous paper (Chen et al., 2014), with a typical group design that accounts for a confounding effect: varying age across subjects. Briefly, the experiment involved one between-subjects factor, group (two levels: 21 children and 29 adults), and one within-subject factor, condition (two levels: congruent and incongruent). Stimuli were large letters (either "H" or "S") composed of smaller letters ("H" or "S"). For half of the stimuli the large letter and the component letters were congruent (e.g., "H" composed of "H"s), and for half they were incongruent (e.g., "H" composed of "S"s). Acquisition parameters for the whole-brain BOLD data on a 3.0 T scanner were: voxel size of 3.75 × 3.75 × 5.0 mm³, 24 contiguously interleaved axial slices, and TR of 1250 ms (TE = 25 ms, FOV = 240 mm, flip angle = 35°). Six runs of EPI data were acquired from each subject; each run lasted 380 s and contained 304 data points. The task followed an event-related design with 96 trials per run, with three runs of congruent stimuli interleaved with three runs of incongruent stimuli (order counterbalanced across subjects). Subjects used a two-button box to identify the large letter during global runs and the component letter during local runs. Each trial lasted 2500 ms: the stimulus was presented for 200 ms, followed by a fixation point for 2300 ms. Inter-trial intervals were jittered with a varying number of TRs, allowing for a trial-by-trial analysis of how the subject's BOLD response varied with changes in reaction time (RT). The experimental protocol was approved by the Combined Neuroscience Institutional Review Board at the NIMH, and the National Clinical Trials identifier is NCT00006177.

The EPI time series went through the following preprocessing steps: slice timing and head motion corrections, spatial alignment to a Talairach template (TT_N27) at a voxel size of 3.5 × 3.5 × 3.5 mm³, smoothing with an isotropic FWHM of 6 mm, and scaling of each voxel time series by its mean value. The scaling step enables one to interpret each regression coefficient of interest as an approximate estimate of percent signal change relative to the temporal mean. The six runs of data were concatenated for the individual regression analysis, with the discontinuities across runs properly handled (Chen et al., 2012).

To capture the subtle HDR shape under each condition, two modeling approaches, ESM and ASM, were adopted for model comparison. With ESM, each trial was modeled with 10 tent basis functions, each spanning one TR (1.25 s). The subject's RT at each trial was incorporated as a per-trial modulation variable. In other words, two effects per condition were estimated in the time series regression at the individual level: one revealed the response curve associated with the average RT, while the other showed the marginal effect of RT (the change in response amplitude when RT increases by 1 s) at each time point subsequent to the stimulus. In addition, the following confounding effects were included in the model for each subject and each run: third-order Legendre polynomials accounting for slow drifts, incorrect trials (misses), censored time points with extreme head motion, and the six head motion parameters. The modeling strategy remained the same with ASM except that the three SPM basis functions (the canonical IRF plus its time and dispersion derivatives) were employed to model the BOLD responses instead of the 10 tents.
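Two of the individual-level steps above, the scaling to percent signal change and the per-trial RT modulation, can be sketched as follows (a minimal illustration with hypothetical numbers; the ×100 convention and the mean-centering of RT are assumptions consistent with common practice, not a transcript of the actual analysis scripts):

```python
import numpy as np

# Hypothetical voxel time series (304 volumes, arbitrary scanner units)
ts = np.random.default_rng(0).normal(1000.0, 10.0, size=304)

# Scale by the voxel's temporal mean (x100 assumed), so that a regression
# coefficient of 0.5 reads as a 0.5% signal change relative to the mean
ts_scaled = 100.0 * ts / ts.mean()

# Per-trial RT modulation: center RT at its mean so that the unmodulated
# regressors capture the response at the average RT
rt = np.array([0.61, 0.74, 0.55, 0.83])      # hypothetical reaction times (s)
rt_centered = rt - rt.mean()

# Each trial then contributes to two regressor sets per condition: one weighted
# by 1 (response at the average RT) and one weighted by rt_centered (marginal
# effect of RT at each time point after the stimulus).
```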
At the group level, it is the BOLD effects associated with the average RT that are of interest here. In addition to the estimated HDR profiles, three other explanatory variables are considered: a) a between-subjects factor, group (two levels: children and adults); b) a within-subject factor, condition (two levels: congruent and incongruent); and c) a quantitative covariate, age. The focus is on the interaction between group and condition in the HDR: do the two groups differ in the HDR profile contrast between the two conditions?

Preview

This paper is a sequel to our previous exploration (Chen et al., 2014) of the multivariate modeling (MVM) approach for FMRI group analysis. The layout is as follows. First, we explore and review various hypothesis testing strategies at the group level when the HDR is estimated through multiple basis functions. Second, simulations are used to reveal how each methodology performs in terms of controlling false positives and false negatives, and the performance of these methods is assessed by applying them to the experimental dataset at both the individual and group levels. Finally, we compare all the modeling methodologies for ASM and ESM, with and without dimensional reduction. The modeling strategies and testing methods discussed here are all performed at the voxel level. Multiple testing correction can be applied in the conventional fashion by controlling the false discovery rate (Benjamini and Hochberg, 1995) or the family-wise error rate through Monte Carlo simulations (3dClustSim in AFNI, Forman et al., 1995) or random field theory (Worsley et al., 1992).

Our major contribution here is to demonstrate the importance of accounting for shape differences and to offer testing approaches at the group level within an MVM platform, with modeling flexibility that is not available under the conventional GLM. Through our demonstration we propose that ESM be adopted whenever appropriate or possible, to identify the nuanced differences in HDR shape that would be difficult or impossible to reveal through FSM or ASM. Furthermore, we recommend that investigators report the effect estimates, such as the HDR curves, to substantiate the results in addition to the statistical significance. The modeling framework and functionality are publicly available in the program 3dMVM in the AFNI suite (Cox, 1996).

Throughout this article, regular italic letters (e.g., α) stand for scalars, and boldfaced italic letters in lower (a) and upper (X) case stand for column vectors and matrices, respectively. The word multivariate is used here in the sense of treating the effect estimates from the same subject, or from the levels of a within-subject factor, as instantiations of simultaneous response (or outcome) variables (e.g., the effect estimates for the HDR). This usage differs from the popular connotation in the FMRI field, in which the spatial structure (multiple voxels) is modeled as the simultaneous response variables, as in multivariate pattern analysis (Haxby, 2012), independent component analysis, and machine learning methods such as support vector machines. Major acronyms used in the paper are listed in Appendix A.