2.4. Methods for CpG Site Selection and Classification

Prior to the application of our feature selection and classification methodology, M-value correction and the two statistical components described above were applied to the two separate experiments: controls vs. BCCA and controls vs. LYCA. For each experiment, the top 1% criterion on the bootstrap-corrected paired t-test p-values and the Scaled CV > 1 criterion were combined to derive a final robust subset of CpG sites that are both significantly differentially methylated and resistant to technical variation. The union of the two robust subsets yielded a total of 5719 CpG sites (~1.2% of the genome-scale methylation array), and the methylation measurements at these sites comprise the feature vectors used in this study.

To select the most important features (CpG sites), corresponding to candidate epigenetic biomarkers, and to test their relevance in terms of the accuracy they provide when fed to popular classifiers, an appropriate workflow was built on the RapidMiner platform (available online: www.rapid-i.com). Within this workflow, an evolutionary feature selection algorithm with an embedded k-nn classifier was first applied to the training set in order to select the most robust features, i.e., the CpG sites that best distinguish the three sample types. Alternatively, CpG selection was performed using the GORevenge algorithm [33], which, on a functional analysis basis, selects genes (and consequently CpG sites, for our purposes) that play an important role in the mechanisms underlying the studied diseases, in terms of centrality in the GO tree. Various classification modules fed with the selected features, namely k-nn (k-nearest neighbor) classifiers, a decision tree, and a feed-forward artificial neural network, were then constructed and evaluated using the training set and leave-one-out resampling. Finally, all classifiers were tested on the completely unknown testing set. The feature selection methods (evolutionary selection and GORevenge-based selection) and the classifiers used are described below.

2.4.1. Evolutionary Selection of CpG Sites

This feature selection method uses a genetic algorithm (GA) [40] to select the most robust features feeding an internally used weighted 12-nn classifier [41]. A GA mimics the mechanisms of natural evolution, applying biologically inspired operators to a randomly chosen initial population of candidate solutions in order to optimize a fitness criterion. In the current study, an initial population of 500 random chromosomes is created in the first step of the GA. Each chromosome, representing a candidate solution of the feature selection task, is a mask of N binary digits (0 or 1), equal in number to the dimensionality of the complete feature set, which indicate whether the corresponding feature is used or not. Each chromosome is restricted to correspond to a solution whose number of included features lies in a predefined range. The genetic operators of selection, crossover, and mutation are then applied to the initial generation of chromosomes.

The selection operator selects, through a roulette-wheel scheme, the chromosomes that participate in crossover and mutation, based on each chromosome's previously calculated fitness value, i.e., the total accuracy (number of samples correctly classified) that the corresponding feature subset yields with the weighted 12-nn classifier under three-fold cross-validation. The crossover operator mates random pairs of the selected chromosomes with probability Pc = 0.5 using a uniform crossover scheme, while bits within a chromosome are mutated (switched from 0 to 1 or vice versa) with probability Pm = 1/N when the mutation operator is applied. The whole procedure is repeated until the stopping criterion, a maximum number of generations, is reached, and the best-performing chromosome of the last generation represents the finally chosen feature subset. The weighted k-nn classifier was used internally in the evolutionary algorithm because of its rather low computational cost compared to alternatives such as an artificial neural network, given that the classifier must be executed and evaluated over a large number of rounds within the feature selection algorithm. The setting k = 12 was chosen because the smallest class in the testing sets used contains 13 samples, so the classifier (optimized here and applied later to the testing sets) never examines a neighborhood larger than the size of one class.
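As a concrete illustration, a minimal sketch of this evolutionary selection loop is given below, with the fitness evaluated by scikit-learn's weighted k-nn under three-fold cross-validation. The population size, Pc, Pm, and k follow the text above, while the subset-size repair step, the helper names, and the toy data are illustrative assumptions of this sketch, not a reproduction of the original RapidMiner workflow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def repair(mask, lo, hi):
    """Force the number of selected features into [lo, hi] by random flips."""
    on, off = np.flatnonzero(mask), np.flatnonzero(~mask)
    if on.size > hi:
        mask[rng.choice(on, on.size - hi, replace=False)] = False
    elif on.size < lo:
        mask[rng.choice(off, lo - on.size, replace=False)] = True
    return mask

def fitness(mask, X, y):
    """Total accuracy of a weighted 12-nn classifier, 3-fold cross-validation."""
    clf = KNeighborsClassifier(n_neighbors=12, weights="distance")
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

def ga_select(X, y, lo=5, hi=30, pop_size=500, generations=50, pc=0.5):
    n = X.shape[1]
    pm = 1.0 / n                                    # bitwise mutation rate Pm = 1/N
    pop = [repair(rng.random(n) < 0.5, lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        fit = np.array([fitness(m, X, y) for m in pop])
        idx = rng.choice(pop_size, size=pop_size, p=fit / fit.sum())  # roulette wheel
        nxt = []
        for a, b in zip(idx[0::2], idx[1::2]):
            pa, pb = pop[a].copy(), pop[b].copy()
            if rng.random() < pc:                   # uniform crossover with Pc = 0.5
                swap = rng.random(n) < 0.5
                pa[swap], pb[swap] = pop[b][swap], pop[a][swap]
            for child in (pa, pb):
                flips = rng.random(n) < pm          # bit mutation
                child[flips] = ~child[flips]
                nxt.append(repair(child, lo, hi))
        pop = nxt
    fit = np.array([fitness(m, X, y) for m in pop])
    return pop[int(fit.argmax())]                   # best mask of the last generation

# Toy demonstration on synthetic "methylation" data (200 samples, 100 sites),
# run with a reduced population and generation count to keep it fast.
X, y = rng.random((200, 100)), rng.integers(0, 2, 200)
best = ga_select(X, y, pop_size=20, generations=5)
print(best.sum(), "CpG sites selected")
```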
2.4.2. GORevenge-Based Selection of CpG Sites

Aiming at a methodology orthogonal to evolutionary feature selection, we also selected CpG sites from the pool of genes corresponding to the initially pre-selected CpG sites, exploiting their putative functional role through GORevenge (available online: www.grissom.gr/gorevenge, [33]). Starting from a list of genes/gene ontology (GO) terms, GORevenge exploits the functional information encoded in the GO tree semantics and outputs a series of functionally related genes/GO terms. The finally selected genes/GO terms are not necessarily included in the input list; the algorithm can thus help elucidate hidden functional regulatory effects among genes and thereby promote a systems-level interpretation.

GORevenge uses a stepwise mechanism, starting from the initially considered gene set or GO terms. In the first phase, genes are collected that are linked not only to a given GO term but also to its neighboring ones, i.e., its parent and child GO terms. These genes are considered to belong to the same functional clique, defined through distance-based functional similarity criteria. For genes annotated by several terms, a pruning phase follows, in which GO terms are eliminated when the distance between them falls below a certain similarity threshold. GORevenge incorporates the Resnik semantic similarity metric [42] and can probe specific categories of the GO, i.e., molecular function (MF), biological process (BP), and cellular component (CC). Finally, a prioritized list of genes is exported based on the GO terms linked to them after the pruning stage, thus measuring the centrality of the genes in the functional mechanism suggested by the initial genes/GO terms.

GORevenge was applied here using as input the set of unique genes related to the statistically pre-selected CpG sites. Specifically, a set of 3415 unique gene ids was derived from the pre-selected 5719 CpG sites and the annotation of the Infinium Human Methylation 450K BeadChip. We applied GORevenge to this set of 3415 genes, using the Resnik semantic similarity metric, the Bubble genes algorithm, and a relaxation equal to 0.15 as algorithm parameters (see [33] for more details on the parameters). We retrieved 235 and 210 genes included in the submitted gene list, using the BP and MF functional aspects of the algorithm, respectively. The resulting genes correspond to a total of 249 unique gene ids, which in turn correspond to a total of 352 CpG sites based on the annotation of the chip. These features represent a small, conclusive set of genes that are simultaneously significantly differentiated and of high regulatory impact (they are linked with the highest number of distinct, highly non-overlapping biological processes), and may therefore reliably be used for training predictive classifiers.
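The gene-level mapping surrounding the GORevenge step (CpG sites to unique gene ids, then selected genes back to CpG sites) can be sketched as follows with pandas. The annotation column names ("IlmnID", "UCSC_RefGene_Name"), the semicolon convention for multi-gene annotations, and the toy manifest are assumptions for illustration; GORevenge itself runs as a web tool and is not reproduced here.

```python
import pandas as pd

# Toy stand-in for the Infinium 450K annotation table; CpGs annotated to
# several genes carry semicolon-separated gene symbols.
manifest = pd.DataFrame({
    "IlmnID": ["cg0001", "cg0002", "cg0003", "cg0004"],
    "UCSC_RefGene_Name": ["BRCA1", "BRCA1;NBR2", None, "TP53"],
})

def cpgs_to_genes(cpg_ids, manifest):
    """Map pre-selected CpG ids to the unique gene symbols annotating them."""
    rows = manifest[manifest["IlmnID"].isin(cpg_ids)]
    genes = set()
    for cell in rows["UCSC_RefGene_Name"].dropna():
        genes.update(cell.split(";"))
    return sorted(genes)

def genes_to_cpgs(gene_ids, manifest):
    """Map genes returned by GORevenge back to all CpG sites annotated to them."""
    target = set(gene_ids)
    annotated = manifest["UCSC_RefGene_Name"].fillna("").str.split(";")
    keep = annotated.apply(lambda gs: bool(target.intersection(gs)))
    return manifest.loc[keep, "IlmnID"].tolist()

# In the study: 5719 CpGs -> 3415 genes -> GORevenge -> 249 genes -> 352 CpGs.
print(cpgs_to_genes(["cg0001", "cg0002"], manifest))  # ['BRCA1', 'NBR2']
print(genes_to_cpgs(["BRCA1"], manifest))             # ['cg0001', 'cg0002']
```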
2.4.3. Classification

Following the selection of CpG sites (referred to as features from now on), classification is performed: classifiers fed with the selected features are constructed and their performance is measured, thus validating the relevance of the selected features. Three weighted nearest-neighbor classifiers (k-nn, with k = 1, 6, 12), a classification tree (with the GINI index as the split criterion), and a feed-forward artificial neural network (ANN) with one hidden layer were used. All classification algorithms except the ANN are described in more detail in [28]. The ANN used here was trained with the back-propagation algorithm for 1000 epochs, with a learning rate of 0.3 and a momentum of 0.2, values found to be the best choices on a trial-and-error basis. The hidden layer used a sigmoid activation function and contained ((num. of features + num. of classes)/2 + 1) nodes.

Classifier performance, in terms of total accuracy (number of samples correctly classified) and class sensitivity (number of true positives of a class correctly assigned to that class), was measured using the training set and leave-one-out resampling (tabular results presented in the Supplementary Material). The same classification algorithms were then evaluated on the completely unknown, independent testing set: the classifiers were constructed once using the training set and applied to the samples of the testing set (results presented here in tabular format and bar plots).
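The classifier battery and its leave-one-out evaluation can be approximated in scikit-learn as sketched below. The k values, GINI criterion, hidden-layer sizing, learning rate, momentum, and epoch count follow the text; the SGD solver, the softmax output layer of MLPClassifier, and the synthetic placeholder data are assumptions of this sketch and may differ from the original back-propagation network.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((40, 10))           # placeholder: 40 samples x 10 selected CpG sites
y = rng.integers(0, 3, 40)         # three classes: controls, BCCA, LYCA

# Hidden layer size per the text: (num. of features + num. of classes)/2 + 1
n_hidden = (X.shape[1] + len(np.unique(y))) // 2 + 1

models = {
    "1-nn":  KNeighborsClassifier(n_neighbors=1,  weights="distance"),
    "6-nn":  KNeighborsClassifier(n_neighbors=6,  weights="distance"),
    "12-nn": KNeighborsClassifier(n_neighbors=12, weights="distance"),
    "tree":  DecisionTreeClassifier(criterion="gini"),
    "ann":   MLPClassifier(hidden_layer_sizes=(n_hidden,), activation="logistic",
                           solver="sgd", learning_rate_init=0.3, momentum=0.2,
                           max_iter=1000),
}

# Leave-one-out resampling on the training set, as in the study.
for name, clf in models.items():
    acc = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"{name}: total accuracy = {acc:.3f}")
```

Per-class sensitivity can be obtained in the same setting from cross_val_predict and a confusion matrix, and the models fitted once on the full training set can then be applied to the independent testing set.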