2. Experimental Section

2.1. Cohorts and Samples

The study was conducted in the context of the European EnviroGenoMarkers project (Available online: www.envirogenomarkers.net) and involved individuals from the European Prospective Investigation into Cancer and Nutrition study (EPIC-ITALY). DNA extraction from buffy coats, CpG methylation profiling (using the Illumina Infinium Human Methylation 450K platform; see Section 2.2), and the corresponding data quality assessment and preprocessing were conducted as described previously [36]. To address unwanted technical variation in the DNA methylation analysis, normalization was carried out in two successive steps of intensity-based correction (within-chip, followed by across-all-probes), as previously described ([34]; see also Section 2.2), making use of the DNA methylation measured in multiple replicates of a technical quality control sample distributed among the study samples.

The available Italian cancer dataset comprised 261 samples: 131 controls, 48 breast cancer cases (BCCA), and 82 B-cell lymphoma cases (LYCA). The samples originated from two separate experimental studies of matched control and case samples that aimed to study epigenetic effects on lymphoma and breast cancer onset (see Table 1 for the distribution of samples across the two cohorts). The breast cancer dataset has been deposited in NCBI's Gene Expression Omnibus [37] and is accessible through GEO Series accession number GSE52635 (Available online: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52635). The lymphoma study provided the larger set of case samples.

The primary classification task studied here was the three-class problem (controls vs. the two cancer types). For this task, the control samples originating from the two separate studies were pooled into a single, wider control set appropriate for the three-class problem. From a statistical point of view, the larger control class benefits both the statistical power and the reliability of the analysis, as it covers a much broader set of phenotypes, reflecting how subjective the definition of a healthy person is, even by medical experts' standards.

For the three-class problem, the 261 samples (controls, BCCA, LYCA) were randomly split into two distinct sets: (i) a training set (50 controls, 35 BCCA, 46 LYCA) used for feature selection and for training and evaluating various classifiers through resampling; and (ii) an independent testing set (81 controls, 13 BCCA, 36 LYCA) used for blind testing of the classifiers. The training set sizes resulted from trying both to keep half of the samples in the training set (the rest forming the testing set) and to achieve a balanced occurrence of the classes, as well as a balanced occurrence of the two types of control samples originating from the separate studies; such balance allows the classifiers to be trained properly. After constructing the training set, the remaining samples of all three classes were allocated to the testing set. A minimal sketch of this class-wise split is given below.
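As an illustration only, the following is a minimal sketch of such a class-wise random split, assuming the class labels are available as a NumPy array of strings; the additional balancing of the two control origins described above is omitted, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# per-class training set sizes, as reported above
train_sizes = {"Control": 50, "BCCA": 35, "LYCA": 46}

def split_by_class(labels, train_sizes, rng):
    """Randomly draw a fixed number of training samples per class;
    all remaining samples go to the testing set."""
    train_idx, test_idx = [], []
    for cls, n_train in train_sizes.items():
        idx = rng.permutation(np.flatnonzero(labels == cls))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.array(train_idx), np.array(test_idx)

# labels = np.array([...])  # one of "Control"/"BCCA"/"LYCA" per sample
# train_idx, test_idx = split_by_class(labels, train_sizes, rng)
```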
For the two-class problems, the training and testing subsets were randomly drawn from the corresponding sets formed for the three-class problem (here the control samples are not pooled, but kept separate according to the study from which they originated). For the task controls vs. LYCA, the training set included 28 controls and 46 cases, and the testing set included 55 controls and 36 cases. For the task controls vs. BCCA, the training set included 22 controls and 35 cases, and the testing set included 26 controls and 13 cases. For the task BCCA vs. LYCA, the training set included 35 BCCA and 46 LYCA samples, and the testing set included 13 BCCA and 36 LYCA samples.

Table 1. Distribution of samples in the cohorts used in the study.

2.2. DNA Methylation Measurements and Pre-Processing

The dataset contains methylation data obtained with Illumina's more recent chip, the Infinium Human Methylation 450K BeadChip, which includes 485,577 probes (482,421 CpG sites, 3091 non-CpG sites, and 65 random SNPs); see Table 2 for the genomic distribution of CpGs classified into different groups: promoter, body, 3′UTR, and intergenic [26]. Methylation of each CpG site on this chip is measured from two channel intensities, IMeth and IUn-Meth, available for all probes, similarly to the two-channel arrays used in transcriptome analysis. Two metrics are commonly used to quantify DNA methylation. The first is the β-value, which measures the percentage of methylation (ranging from 0 to 1) and is defined as β = IMeth/(IMeth + IUn-Meth). The β-value has a direct, intuitive biological interpretation, expressing roughly the proportion of methylated copies of a CpG in the DNA extracted from a biological sample (a population of cells). The second metric, well known from its use in gene expression microarray analysis, is the M-value [38]: the logarithmic ratio of the methylated to the unmethylated signal intensity, M = log(IMeth/IUn-Meth), i.e., the logit of the β-value. The M-value, used here to quantify the methylation of CpG sites, is also statistically more valid and more widely applicable in differential and other statistical analyses than the alternative β-value metric, as it is approximately homoscedastic [38]. In agreement with the well-established statistical strategy of adopting transformations that render data distributions symmetric, which is widely applied in the processing of microarray data, M-values were used to measure and correct the DNA methylation signal intensities. This was done using a novel intensity-based method previously presented by the authors in [34], which additionally uses quality controls (QCs) corresponding to the same technical control sample embedded in the methylation arrays. The M-signal distribution is normalized taking into account the average intensity level of the two channels, I = 0.5 × log(IMeth × IUn-Meth), and the variation of the QCs incorporated in each chip, from which error estimates are derived.

Table 2. Genomic distribution of CpGs classified in different groups: promoter, body, 3′UTR and intergenic [26].

Normalization takes place in two successive steps: (i) within-chip; and (ii) across all probes. The three per-probe quantities defined above can be computed as in the sketch below.
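For concreteness, this is a direct transcription of the three formulas above; the base of the logarithm is not fixed by the text, so base 2 (the common convention for M-values) is assumed, and the small offset guarding against zero intensities is our addition.

```python
import numpy as np

def methylation_metrics(i_meth, i_unmeth, eps=1e-9):
    """Per-probe beta-value, M-value, and average intensity from the
    methylated/unmethylated channel intensities (NumPy arrays)."""
    beta = i_meth / (i_meth + i_unmeth + eps)                # beta = IMeth/(IMeth+IUnMeth)
    m = np.log2((i_meth + eps) / (i_unmeth + eps))           # M = log(IMeth/IUnMeth)
    avg = 0.5 * np.log2((i_meth + eps) * (i_unmeth + eps))   # I = 0.5*log(IMeth*IUnMeth)
    return beta, m, avg
```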
Within-chip normalization involves calculating an error estimator across all intensity levels, after partitioning the intensity space I into percentiles. The probe estimates of all arrays are then updated: the corrected M-value of a probe results from subtracting the error calculated for the corresponding intensity level. Normalization across all probes is applied next, exploiting the standard deviation of the M-values of each probe across all QCs. The M-values of each probe across all samples are then updated by subtracting the probe-based error estimate. Overall, the impact of technical bias on the signal estimates is mitigated through these consecutive normalization steps, as already shown in [34].

2.3. Statistical Pre-Selection of CpG Sites

Two statistical components were used to pre-select a wide subset of candidate biomarkers prior to applying the data-mining framework presented next. The statistics were applied to two separate experiments: controls vs. BCCA and controls vs. LYCA [39]. The goal was to pre-select a subset of CpG sites with which to feed the data-mining framework. These correspond to differentially methylated CpG sites whose differences are statistically significant and exceed technical variation, as extracted using the two statistical components for each of the separate experiments. The components are (i) a scaled coefficient of variation (Scaled CV) measurement and (ii) p-value measurements extracted by a t-test and corrected by bootstrap.

The Scaled CV is a robust measure of the real inter-class variability observed for a probe across the whole sample pool (controls ∪ cases), compared to the variability observed among the quality control samples, which reflects solely the technical variation. The greater the Scaled CV, the greater the real differential methylation (beyond technical signal variation), and the more reliable the CpG site is as a candidate differentially methylated site between the two sample categories (controls vs. cases). The bootstrap-corrected p-values originate from a standard paired t-test for extracting statistically significant differentially methylated probes (controls vs. BCCA or controls vs. LYCA). A paired t-test was possible because cases were matched with controls in each experiment in terms of certain characteristics, e.g., age, sex, body mass index, and pre-/post-menopausal status for the breast cancer samples and their controls. The classical statistical test is followed, however, by a bootstrap p-value correction that immunizes the statistical findings against the detrimental effect of multiple-hypothesis bias. The idea behind this p-value correction is to examine whether the p-values obtained from the statistical test are indeed that extreme, or whether they could represent random false selections (see [34]).

2.4. Methods for CpG Site Selection and Classification

Prior to the application of our feature selection and classification methodology, the M-value correction and the two statistical components described above were applied to the two separate experiments: controls vs. BCCA and controls vs. LYCA. For each experiment, a top-1% criterion on the bootstrap-corrected paired t-test p-values and a Scaled CV > 1 criterion were combined to derive a finally selected, robust subset of CpG sites that are both significantly differentially methylated and immunized against technical variation. A sketch of one plausible implementation of this pre-selection is given below.
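The following sketch illustrates one plausible reading of the two pre-selection components; the Scaled CV is interpreted as a ratio of coefficients of variation, and the bootstrap correction is approximated by sign-flipping the paired case-control differences. The exact procedures of [34] may differ, and all array names are hypothetical.

```python
import numpy as np
from scipy import stats

def scaled_cv(m_samples, m_qc):
    """Scaled CV: per-probe variability across study samples relative to the
    technical variability among QC replicates. Inputs: (n_probes, n) M-values."""
    cv_samples = m_samples.std(axis=1) / np.abs(m_samples.mean(axis=1))
    cv_qc = m_qc.std(axis=1) / np.abs(m_qc.mean(axis=1))
    return cv_samples / cv_qc

def bootstrap_corrected_pvalues(m_cases, m_controls, n_boot=1000, seed=0):
    """Paired t-test per probe (cases vs. matched controls), followed by a
    sign-flip bootstrap: the corrected p-value is the fraction of null
    p-values at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    diffs = m_cases - m_controls                       # (n_probes, n_pairs)
    _, p_obs = stats.ttest_1samp(diffs, 0.0, axis=1)   # paired test = 1-sample on diffs
    exceed = np.zeros_like(p_obs)
    for _ in range(n_boot):
        signs = rng.choice([-1.0, 1.0], size=diffs.shape[1])
        _, p_null = stats.ttest_1samp(diffs * signs, 0.0, axis=1)
        exceed += (p_null <= p_obs)
    return (exceed + 1) / (n_boot + 1)

# Combining the two criteria: top 1% of corrected p-values AND Scaled CV > 1.
# p_corr = bootstrap_corrected_pvalues(m_cases, m_controls)
# scv = scaled_cv(np.hstack([m_controls, m_cases]), m_qc)
# keep = (p_corr <= np.quantile(p_corr, 0.01)) & (scv > 1)
```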
Unifying the two selected robust subsets of CpG sites yielded a total of 5719 CpG sites (~1.2% of the genome-scale methylation array), and the methylation measurements of these CpG sites comprise the feature vectors used in this study.

In order to select the most important features (CpG sites), corresponding to candidate epigenetic biomarkers, and to test their relevance in terms of the accuracy they provide when fed to popular classifiers, an appropriate workflow was built on the RapidMiner platform (Available online: www.rapid-i.com). An evolutionary feature selection algorithm within this workflow, using the training set and an embedded k-nn classifier, was initially applied to select the most robust features, i.e., the CpG sites that best distinguish the three sample types. Alternatively, CpG selection was performed using the GORevenge algorithm [33], which, on a functional-analysis basis, selects genes (and consequently CpG sites, for our purposes) that play an important role in the mechanisms underlying the studied diseases, in terms of their centrality in the GO tree. Then, various classification modules fed with the selected features, i.e., k-nn (k-nearest neighbor) classifiers, a decision tree, and a feed-forward artificial neural network, were constructed and evaluated using the training set and leave-one-out resampling. Finally, all classifiers were tested on the totally unknown testing set. In the following, the feature selection methods (evolutionary selection and GORevenge-based selection) and the classifiers used are described.

2.4.1. Evolutionary Selection of CpG Sites

This feature selection method uses a genetic algorithm (GA) [40] to select the most robust features that feed an internally used 12-nn weighted classifier [41]. A GA mimics the mechanisms of natural evolution, applying biologically inspired operators to a randomly chosen initial population of candidate solutions in order to optimize a fitness criterion. In the current study, an initial population of 500 random chromosomes is created in the first step of the GA. Each chromosome, representing a candidate solution of the feature selection task, is a binary mask of N binary digits (0 or 1), equal in number to the dimensionality of the complete feature set, which reflect the use (or not) of the corresponding features. Each chromosome is restricted to a binary mask corresponding to a solution that includes a subset of features whose number lies in a predefined range. The genetic operators of selection, crossover, and mutation are then applied to the initial generation of chromosomes. The selection operator selects, through a roulette-wheel scheme, the chromosomes that participate in crossover and mutation, based on each chromosome's previously calculated fitness value, i.e., the total accuracy (number of samples correctly classified) that the corresponding feature subset yields with the 12-nn weighted classifier under three-fold cross-validation. The crossover operator mates random pairs of the selected chromosomes with probability Pc = 0.5 using a uniform crossover operator, while bits within a chromosome are mutated (switched from 0 to 1 or vice versa) with probability Pm = 1/N when the mutation operator is applied. The whole procedure is repeated until the stopping criterion of a maximum number of generations is reached, and the best-performing chromosome of the last generation represents the finally chosen feature subset; see the sketch below.
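The sketch below mirrors the GA described above (population of 500, roulette-wheel selection, uniform crossover with Pc = 0.5, bit-flip mutation with Pm = 1/N, fitness from a distance-weighted 12-nn classifier under three-fold cross-validation); the number of generations and the allowed range of selected features are not specified in the text and are assumptions here.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def fitness(mask, X, y):
    """3-fold CV accuracy of a distance-weighted 12-nn on the masked features."""
    if mask.sum() == 0:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=12, weights="distance")
    return cross_val_score(knn, X[:, mask.astype(bool)], y, cv=3).mean()

def evolve(X, y, pop_size=500, n_generations=50, pc=0.5, min_f=10, max_f=50):
    n = X.shape[1]
    pm = 1.0 / n                                   # Pm = 1/N, as in the text
    # initial population: random binary masks with a bounded feature count
    pop = np.zeros((pop_size, n), dtype=np.uint8)
    for chrom in pop:
        k = rng.integers(min_f, max_f + 1)
        chrom[rng.choice(n, size=k, replace=False)] = 1
    for _ in range(n_generations):
        fit = np.array([fitness(c, X, y) for c in pop])
        # roulette-wheel selection proportional to fitness
        parents = pop[rng.choice(pop_size, size=pop_size, p=fit / fit.sum())]
        children = parents.copy()
        # uniform crossover of random pairs with probability pc
        for i in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                swap = rng.random(n) < 0.5
                children[i, swap], children[i + 1, swap] = (
                    parents[i + 1, swap], parents[i, swap])
        # bit-flip mutation with probability pm per bit
        children ^= (rng.random(children.shape) < pm).astype(np.uint8)
        pop = children
    fit = np.array([fitness(c, X, y) for c in pop])
    return pop[fit.argmax()].astype(bool)          # mask of the selected CpG sites
```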
The k-nn weighted classifier was used internally in the evolutionary algorithm because of the rather low computational cost it incurs compared to alternatives, e.g., an artificial neural network, given the need to execute and evaluate the classifier over a large number of rounds within the feature selection algorithm. The setting k = 12 stems from the fact that the smallest class in the testing sets used contains 13 samples; we therefore wanted the classifier (optimized here and later applied to the testing sets) not to examine a neighborhood larger than the size of one class.

2.4.2. GORevenge-Based Selection of CpG Sites

Aiming at a methodology orthogonal to the evolutionary feature selection, we incorporated into our study the selection of CpG sites from the pool of genes corresponding to the initially pre-selected CpG sites, exploiting their putative functional role through GORevenge (Available online: www.grissom.gr/gorevenge, [33]). Starting from a list of genes/Gene Ontology (GO) terms, GORevenge exploits the functional information encoded in the GO tree semantics and outputs a series of functionally related genes/GO terms. The finally selected genes/GO terms may not be included in the input list; GORevenge can thus help elucidate hidden functional regulatory effects among genes and thereby promote a systems-level interpretation. GORevenge uses a stepwise mechanism, starting from the initially considered set of genes or GO terms. In the first phase, genes are collected that are linked not only to a given GO term but also to its neighboring terms, i.e., its parent and child GO terms. These genes are considered to belong to the same functional clique, which is defined using distance-based functional-similarity criteria. For genes annotated by several terms, a pruning phase follows, in which GO terms are eliminated when their pairwise distance falls below a certain similarity threshold. GORevenge incorporates the Resnik semantic similarity metric [42] and can probe the specific categories of the GO, i.e., molecular function (MF), biological process (BP), and cellular component (CC). Finally, a prioritized list of genes is exported based on the GO terms linked to them after the pruning stage, thus measuring the centrality of the genes in the functional mechanism indicated by the initial genes/GO terms.

GORevenge was applied here using as input the set of unique genes related to the statistically pre-selected CpG sites. Specifically, a set of 3415 unique gene ids was derived from the pre-selected 5719 CpG sites and the annotation of the Infinium Human Methylation 450K BeadChip. We applied GORevenge to this set of 3415 genes, using the Resnik semantic similarity metric, the Bubble genes algorithm, and a relaxation equal to 0.15 as algorithm parameters (see [33] for more details on the parameters). We retrieved 235 and 210 genes included in the submitted gene list, using the BP and MF functional aspects of the algorithm, respectively. The resulting genes correspond to a total of 249 unique gene ids, which in turn correspond to a total of 352 CpG sites based on the annotation of the chip. These features represent a small, conclusive set of genes that are simultaneously significantly differentially methylated and of high regulatory impact (they are linked with the highest number of distinct, highly non-overlapping biological processes), and may reliably be used for training predictive classifiers. The gene/CpG mappings around this step are sketched below.
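The following is a small sketch of the mapping between pre-selected CpG sites and gene ids around the GORevenge step, assuming a chip annotation table with hypothetical file and column names; GORevenge itself is run through its web interface and is not reproduced here.

```python
import pandas as pd

# Chip annotation and pre-selected CpG list (hypothetical file/column names).
annot = pd.read_csv("hm450_annotation.csv")                       # columns: cpg_id, gene_id
preselected = set(pd.read_csv("preselected_cpgs.csv")["cpg_id"])  # the 5719 CpG sites

# Forward mapping: pre-selected CpG sites -> unique gene ids (GORevenge input).
genes_in = annot.loc[annot["cpg_id"].isin(preselected), "gene_id"].dropna().unique()

# ... submit genes_in to the GORevenge web tool (BP and MF aspects, Resnik
# metric, Bubble genes algorithm, relaxation 0.15) and union the outputs ...
prioritized = set(pd.read_csv("gorevenge_genes.csv")["gene_id"])  # the 249 genes

# Reverse mapping: prioritized genes -> CpG sites used as classifier features.
cpg_features = annot.loc[annot["gene_id"].isin(prioritized), "cpg_id"].unique()
```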
2.4.3. Classification

Following the selection of CpG sites (referred to as features from now on), classification is performed to construct classifiers fed by the selected features and to measure their performance, thus validating the relevance of the selected features. Three weighted nearest-neighbor classifiers (k-nn with k = 1, 6, 12), a classification tree (with the Gini index as split criterion), and a feed-forward artificial neural network (ANN) with one hidden layer were used. All classification algorithms except the ANN are described in more detail in [28]. The ANN used here was trained with the back-propagation algorithm for 1000 epochs, with a learning rate of 0.3 and a momentum of 0.2, values found to be the best choices on a trial-and-error basis. The hidden layer used a sigmoid activation function and contained ((number of features + number of classes)/2 + 1) nodes. Classifier performance, in terms of total accuracy (number of samples correctly classified) and class sensitivity (number of true positives of a class correctly assigned to that class), was measured using the training set and leave-one-out resampling (tabular results are presented in the Supplementary Material). The same classification algorithms were then evaluated on the totally unknown, independent testing set: the classifiers were constructed once using the training set and applied to the samples of the testing set (results are presented here in tabular format and as bar plots). A minimal sketch of this evaluation protocol follows.
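A minimal sketch of this protocol using scikit-learn equivalents is given below; the original classifiers were RapidMiner modules, so the back-propagation details of MLPClassifier only approximate the described ANN, and the data variable names (X_train, y_train, X_test, y_test) are assumed.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hidden layer size rule from the text: (num. features + num. classes)/2 + 1.
n_hidden = (X_train.shape[1] + len(np.unique(y_train))) // 2 + 1

classifiers = {
    "1-nn": KNeighborsClassifier(n_neighbors=1, weights="distance"),
    "6-nn": KNeighborsClassifier(n_neighbors=6, weights="distance"),
    "12-nn": KNeighborsClassifier(n_neighbors=12, weights="distance"),
    "tree": DecisionTreeClassifier(criterion="gini"),
    "ann": MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                         solver="sgd", learning_rate_init=0.3, momentum=0.2,
                         max_iter=1000),
}

for name, clf in classifiers.items():
    # leave-one-out resampling on the training set
    y_loo = cross_val_predict(clf, X_train, y_train, cv=LeaveOneOut())
    # blind evaluation: fit once on the training set, apply to the testing set
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    print(name,
          "LOO accuracy:", accuracy_score(y_train, y_loo),
          "test accuracy:", accuracy_score(y_test, y_pred),
          "per-class sensitivity:", recall_score(y_test, y_pred, average=None))
```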