3.1. Principal Component Analysis and Principal Variance Components Analysis

High-dimensional microarray data can be visualized by reducing their dimensionality with principal component analysis (PCA). The principal components, which represent the directions of maximal variation in the data, reveal similarities and differences between samples [52]. Figure 3 shows the first three principal components of the present microarray data set. Each sphere represents one sample, and its color indicates the batch of sample processing, referred to as “run”. In the unnormalized data (Figure 3a), two groups of samples can be observed: the first three runs cluster into one group, and the remaining three runs form the second. One group consists of samples from three different histological types of lung cancer and their statistically matched controls, whereas the second group comprises all four histological types of lung cancer and their matched controls, so that three of the four histological types are present in both groups. The PCA plot of the quantile-normalized data is depicted in Figure 3b. Compared to the unnormalized data, this normalization method does not change the grouping of the batches. The separation between the distinct runs within each group is even more pronounced, especially between runs 1, 2, and 3. After data adjustment with DWD (Figure 3c) and ComBat (Figure 3d), the previously observed groups become dispersed. In particular, the ComBat algorithm achieves a more even distribution of the batches, whereas with DWD most samples accumulate in the center, surrounded by a few outlying samples.

Figure 3. Principal component analysis (PCA) plot showing the first three principal components of the microarray data set: (a) unnormalized, (b) quantile normalized, (c) ComBat-adjusted, and (d) DWD-adjusted. Each sphere represents one sample; the color of the sphere indicates the experimental run.
Runs 1 to 6 are represented by the following colors: run 1 = orange, run 2 = cyan, run 3 = dark blue, run 4 = yellow, run 5 = dark green, run 6 = pink.

Principal variance components analysis (PVCA) is a novel method for estimating the sources of variability in microarray data. In this analysis, a threshold for the percentage of variability is defined, and the number of principal components needed to represent this percentage is included in the further steps [34]. Here, the threshold was set to 60%. The selected principal components (PCs) are matched to known sources of variation by variance component analysis (VCA). The resulting weighted proportion of variance shows the proportional effect of each source on the data set [28]. The PVCA of the unnormalized data (Figure 4a) shows that the highest contribution to variation (49.5%) is due to the experimental run. The second highest contribution (38.6%) is due to undefined residual effects, which represent the variance in the data set that cannot be explained by known factors. The third highest contribution comes from the combination of experimental run and sample type (lung cancer case or control). With quantile normalization (Figure 4b), the contribution of the experimental run increases to 52.5%. This effect is reduced to 0% with ComBat (Figure 4c) and to 0.001% with DWD (Figure 4d). The contribution of the combination of “experimental run” and “sample type” remains similar with quantile normalization and ComBat; only with DWD can a slight increase to 1.11% be observed. Other factors, such as age, sex, and smoking habit, have negligible effects (0.02%–0.06%), independent of the data pre-processing method. The sample type (lung cancer case or cancer-free control) has only a slightly higher effect on variance in each data set (0.11%–0.12%).
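
The selection and weighting steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: the function names `select_components` and `factor_share` and the synthetic data are hypothetical, and a simple between-group sum-of-squares ratio stands in for the mixed-model variance component fit that PVCA performs. The sketch only shows how a 60% threshold fixes the number of PCs and how the per-PC proportions are averaged with the explained variance as weights.

```python
import numpy as np


def select_components(X, threshold=0.60):
    """Return scores and variance fractions of the fewest PCs reaching `threshold`.

    X is a samples x features matrix (hypothetical layout for this sketch).
    """
    Xc = X - X.mean(axis=0)                       # center each feature
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    frac = s**2 / np.sum(s**2)                    # variance fraction per PC
    # First index at which the cumulative fraction reaches the threshold.
    k = int(np.searchsorted(np.cumsum(frac), threshold)) + 1
    k = min(k, len(s))
    return U[:, :k] * s[:k], frac[:k]


def factor_share(scores, weights, labels):
    """Weighted average over PCs of (between-group SS / total SS) for one factor.

    Simplification: a one-way sum-of-squares ratio replaces the
    mixed-model variance component estimate of a full PVCA.
    """
    labels = np.asarray(labels)
    shares = []
    for j in range(scores.shape[1]):
        pc = scores[:, j]
        total = np.sum((pc - pc.mean()) ** 2)
        between = sum(
            np.sum(labels == g) * (pc[labels == g].mean() - pc.mean()) ** 2
            for g in np.unique(labels)
        )
        shares.append(between / total)
    # Weights are the variance fractions of the selected PCs.
    return float(np.average(shares, weights=weights))


# Synthetic example (assumed data): 6 runs of 10 samples, 200 features,
# with an artificial shift applied to runs 4 to 6 to mimic a batch effect.
rng = np.random.default_rng(0)
runs = np.repeat(np.arange(6), 10)
X = rng.normal(size=(60, 200))
X[runs >= 3] += 2.0

scores, frac = select_components(X, threshold=0.60)
run_share = factor_share(scores, frac, runs)
print(f"PCs retained: {scores.shape[1]}, weighted share of 'run': {run_share:.2f}")
```

In this toy setting, the weighted share attributed to the run factor is large because the artificial shift dominates the first PC, which also carries the largest weight.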
Figure 4. Principal variance components analysis (PVCA) of the microarray data set: (a) unnormalized, (b) quantile normalized, (c) ComBat-adjusted, and (d) DWD-adjusted. Contribution to variance was estimated for the following factors: run = experimental runs (run 1 to run 6), type = sample type (cancer or control), sex = female/male, age = age group (0–56, 56–64, 64–70, 70–100 years), smoking = smoking habit (current, never, former smoker), resid = residual weighted average proportion variance. Single factors were investigated as well as combinations of factors.