3.3. Loading Control Normalization
Loading controls are required to account for measurable effects caused by properties of the starting material, e.g., unequal total protein concentration or different cell numbers. A normalization step is therefore required before the actual data analysis. One group of approaches adjusts the total protein concentration of individual samples to a pre-defined value prior to spotting, with protein quantification assays such as BCA or Bradford frequently employed to determine the total protein concentration. In general, the accuracy of total protein assays is restricted by chemical interference with certain compounds and limited by a short linear range, not to mention the additional time needed for the experimental protocol. Alternatively, signals can be adjusted post detection. Post-printing normalization requires additional slides of a print run to be stained with a total protein dye, for example Fast Green FCF, Sypro Ruby or colloidal gold. Antibody-detected slides are normalized based on data of a corresponding normalizer slide via a spot-specific correction factor that reflects the deviation of the protein concentration determined for that spot from the median of all spots. Target protein signals are then corrected via division by the correction factors; rescaling can be carried out by multiplication of spot intensities with the median of the corresponding normalizer array. Housekeeping proteins such as β-Actin have also been used to normalize RPPA data [33,34]. However, even housekeeping proteins are subject to biological regulation, which limits the usefulness of these approaches.
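The spot-specific correction described above can be illustrated with a minimal Python sketch. It assumes that the correction factor is the ratio of a spot's total protein signal to the median of all spots on the normalizer slide; the function name and the intensity values are hypothetical and serve only as an illustration.

```python
from statistics import median

def normalize_by_total_protein(target, normalizer):
    """Correct antibody signals spot by spot using a total-protein slide.

    target, normalizer: equally long lists of spot intensities, where
    position i refers to the same spot on both slides.
    """
    if len(target) != len(normalizer):
        raise ValueError("both slides must contain the same spots")
    norm_median = median(normalizer)
    # Assumed definition: total protein of a spot relative to the slide median.
    factors = [n / norm_median for n in normalizer]
    # Divide each target signal by its spot-specific correction factor.
    corrected = [t / f for t, f in zip(target, factors)]
    return factors, corrected

# Hypothetical intensities for four spots
target = [1200.0, 800.0, 1500.0, 1000.0]
normalizer = [5000.0, 4000.0, 6000.0, 5000.0]
factors, corrected = normalize_by_total_protein(target, normalizer)
```

Under this definition, dividing by the raw normalizer signal and then multiplying by the normalizer median gives the same result, which is one way to read the rescaling step mentioned above.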
Different normalization approaches were specifically tailored towards the needs of RPPA data analysis, e.g., median loading, loading control, variable slope, and invariable protein set normalization, as reviewed in [35]. Median Loading (ML) normalization considers row and sample effects as additive at the log scale. The sample effect is estimated from the median protein expression estimates of the samples across all arrays. The main assumptions of the median loading approach are that all arrays are printed in a consistent manner and that changes observed for up- or downregulated proteins can still be seen after median normalization [36]. A key idea behind this approach is that the majority of target proteins assessed by RPPA will be comparable for the majority of samples. However, if only a small number of target proteins are probed by RPPA, or if only proteins subject to regulation are measured, the ML approach will be biased. Loading control normalization incorporates similar ideas, yet the value reflecting median expression is calculated individually for each target protein and then subtracted from a particular sample [35].
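A minimal sketch of the median loading idea is given below. It assumes log2-transformed signals, centres each array on its median across samples, and then takes the median of the centred values per sample as the loading effect. The function name, the data layout and the example values are hypothetical, and the published implementations may differ in detail.

```python
from statistics import median

def median_loading(log_data):
    """log_data: dict mapping sample name -> list of log2 signals,
    one value per array (antibody), in the same order for every sample."""
    samples = list(log_data)
    n_arrays = len(log_data[samples[0]])
    # Centre each array on its median across all samples.
    array_medians = [median(log_data[s][j] for s in samples)
                     for j in range(n_arrays)]
    centred = {s: [v - m for v, m in zip(log_data[s], array_medians)]
               for s in samples}
    # Sample (loading) effect: median of the centred values across arrays.
    loading = {s: median(centred[s]) for s in samples}
    # Effects are additive at the log scale, so the loading effect is subtracted.
    normalized = {s: [v - loading[s] for v in log_data[s]] for s in samples}
    return loading, normalized

# Hypothetical data: sample B is over-loaded by one log2 unit and its
# third protein is additionally upregulated; sample C is under-loaded.
log_data = {
    "A": [10.0, 12.0, 8.0],
    "B": [11.0, 13.0, 12.0],
    "C": [9.0, 11.0, 7.0],
}
loading, normalized = median_loading(log_data)
```

In this toy example the upregulated protein in sample B remains visible after normalization, which corresponds to the second assumption stated above. With only three arrays the median is of course fragile, which illustrates why the approach becomes biased when few targets are probed.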
Variable Slope (VS) normalization [17] takes into account the independent nature of individually stained RPPA slides. A slide-specific value is determined and included in the additive sample and row effect model in a multiplicative manner, thus yielding slightly different response curves for different slides. This approach was coupled with the “joint sample” model implemented in the suite of R packages “SuperCurve” (Table 1). These “joint sample” models use all the information of the array together with the individual protein concentrations for each sample to estimate parameters. The array information rests on assumptions such as that the surface chemistry, and therefore the antibody interactions, are similar across a slide probed with a specific antibody. For example, information available for each dilution point about the rate of signal increase is used to yield improved estimates of protein concentration with a lower variance. “SuperCurve” relies on a three-parameter logistic equation to model the dependency of signal intensities on unknown protein expression values.
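To illustrate what a three-parameter logistic response curve looks like, the following sketch uses one common parameterization, with a background level alpha, a dynamic range beta and a slope gamma. This form and the parameter names are assumptions made for illustration and are not taken from the “SuperCurve” documentation.

```python
import math

def logistic3(x, alpha, beta, gamma):
    """Signal predicted for log concentration x.

    alpha: background signal, beta: signal range up to saturation,
    gamma: slope of the response curve (assumed parameterization).
    """
    return alpha + beta / (1.0 + math.exp(-gamma * x))

def inverse_logistic3(y, alpha, beta, gamma):
    """Log concentration for an observed signal y.

    Only defined for alpha < y < alpha + beta, i.e. between
    background and saturation.
    """
    p = (y - alpha) / beta
    if not 0.0 < p < 1.0:
        raise ValueError("signal lies outside the range of the curve")
    return math.log(p / (1.0 - p)) / gamma

# Hypothetical curve parameters and a two-fold dilution series of one sample:
# each dilution step lowers the log2 concentration by one unit.
alpha, beta, gamma = 100.0, 1000.0, 1.0
sample_log_conc = 1.5
signals = [logistic3(sample_log_conc - step, alpha, beta, gamma)
           for step in range(4)]
recovered = [inverse_logistic3(y, alpha, beta, gamma) + step
             for step, y in enumerate(signals)]
```

Because all dilution points of a sample lie on the same curve, each point yields an estimate of the same undiluted concentration. A joint model fits the curve parameters and the sample concentrations together rather than inverting the curve point by point as done here.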
Recently, Liu et al. [35] employed an approach initially introduced for the analysis of high-throughput expression profiling data for loading control and variance stabilization, which was based on the invariant marker set concept [36]. This concept was adjusted to RPPA-specific settings by introducing a set of invariant proteins, so-called markers, that form a virtual reference sample used to normalize all samples. First, target protein signals are ranked and the variance of the ranks is calculated across all samples, and data showing the highest rank variance are removed from the RPPA data set. This selection process is repeated until the number of target proteins remaining has reached a pre-determined number. The data set reduced in this way is then trimmed further by removing the 25% highest and 25% lowest values. Next, averaging the remaining values of every protein across all samples generates the virtual reference sample (VS). The actual sample data are then normalized with respect to the virtual reference sample by lowess smoothing using an MA-plot approach as described in Pelz et al. [36]. So-called MA-plots or Bland-Altman plots are often used to visualize the distribution of pairwise comparisons in transcriptomic experiments. The x-axis presents the average log2 expression level and the y-axis reflects the log2 fold-change with respect to a reference sample. This concept was employed by the virtual reference sample (VS) approach and showed promising results with respect to loading effect correction and variance stabilization, and resulted in RPPA data that showed a good correlation with IHC/fluorescence in situ hybridization (FISH) data available for the same set of samples.
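The selection of invariant proteins and the construction of the virtual reference sample can be sketched as follows. This is an illustrative reading of the steps above and not the implementation of Liu et al. [35]. It assumes that proteins are ranked within each sample, that one protein is removed per iteration, and that the 25% trimming is applied to the values of each protein across samples. All function names and data are hypothetical, and the final lowess fit is omitted.

```python
from statistics import mean, pvariance

def ranks(values):
    """Rank values within one sample (0 = lowest); ties broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank
    return result

def select_invariant(data, n_keep):
    """data: dict protein -> list of log2 signals, one per sample.
    Repeatedly drops the protein whose within-sample rank varies most."""
    kept = list(data)
    n_samples = len(data[kept[0]])
    while len(kept) > n_keep:
        # Re-rank the remaining proteins inside every sample.
        per_sample = [ranks([data[p][s] for p in kept])
                      for s in range(n_samples)]
        rank_var = {p: pvariance([per_sample[s][k] for s in range(n_samples)])
                    for k, p in enumerate(kept)}
        kept.remove(max(kept, key=rank_var.get))
    return kept

def trimmed_mean(values, trim=0.25):
    """Mean after removing the lowest and highest `trim` fraction of values."""
    v = sorted(values)
    k = int(len(v) * trim)
    return mean(v[k:len(v) - k])

def ma_coordinates(sample, reference):
    """M = log2 ratio to the reference, A = mean log2 level (log2 inputs)."""
    m = [s - r for s, r in zip(sample, reference)]
    a = [(s + r) / 2 for s, r in zip(sample, reference)]
    return m, a

# Hypothetical log2 signals of four proteins in four samples
data = {
    "P1": [5.0, 5.1, 4.9, 5.0],
    "P2": [7.0, 7.1, 6.9, 7.0],
    "P3": [9.0, 9.1, 8.9, 9.0],
    "P4": [4.0, 10.0, 4.0, 10.0],   # strongly regulated, unstable rank
}
invariant = select_invariant(data, n_keep=3)
reference = [trimmed_mean(data[p]) for p in invariant]
first_sample = [data[p][0] for p in invariant]
m_values, a_values = ma_coordinates(first_sample, reference)
```

In a complete workflow, a lowess curve would be fitted to the M values as a function of the A values for the invariant proteins and then subtracted from the M values of all proteins in that sample. A lowess smoother such as the one provided by the statsmodels package could serve this purpose.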