1. Introduction

Epigenetic events comprise heritable modifications that regulate gene expression without altering the DNA sequence itself and can serve as regulatory mechanisms for a wide range of biological processes [1,2,3]. DNA methylation, one of the most common epigenetic modifications in the mammalian genome, is a chemical modification in which a methyl (CH3) group is added at the carbon 5 position of the cytosine ring. Most cytosine methylation occurs in the sequence context 5′CG3′, also called the CpG dinucleotide, and the human genome contains unmethylated segments interspersed with methylated regions [4].

The first epigenetic change in human tumors, global genomic DNA hypomethylation, was reported in the early 1980s, at about the same time the first genetic mutation in an oncogene was discovered [5,6]. Cancer was the first group of diseases considered for DNA methylation-targeted therapeutics [7], and DNA methylation is now implicated as a critical determinant in carcinogenesis, which has made it a topic of intense investigation in recent years [8]. Specifically, hypermethylation of CpG islands, typically sequences of 300–3000 base pairs located within or near approximately 40% of promoters [9], has been related to most types of cancer, including solid tumors (e.g., breast, colon, lung) and hematologic forms (e.g., leukemias) [4,10,11,12,13]. Widely observed epigenetic effects on gene expression and on proteins involved in cancer manifestation are the hypermethylation of tumor suppressor genes and the aberrant expression of DNMT genes [14].

On the other hand, hypomethylation is gaining importance, as it has been observed frequently in solid tumors [15], such as hepatocellular and prostate cancers, and it takes two forms. First, hypomethylation of transcription regulatory regions can take place (e.g., see [16] for the prognostic value of demethylation of the urokinase promoter in patients with breast carcinoma); however, this is much less frequent than hypermethylation of CpG islands overlapping promoters [17]. Second, global DNA hypomethylation is observed frequently in cancers, e.g., hypomethylation of DNA repeats, including tandem centromeric satellite α, juxtacentromeric (centromere-adjacent) satellite 2, interspersed Alu, and long interspersed element repeats [17,18,19]. Hypomethylation of DNA generally becomes more pronounced with tumor progression or the degree of malignancy [13,20]. DNA methylation in cancer cells may be modified by a number of factors, such as tissue aging, nutrition, and environment [21,22,23].

Recently, the rapid progress in microarray technologies has opened new avenues for the high-throughput monitoring of epigenetic effects [24,25]. One of the latest microarray platforms is Illumina’s Infinium Human Methylation 450K Bead Chip, which can detect CpG methylation changes at more than 480,000 cytosines distributed over the whole genome [26]. The nature of DNA methylation data imposes serious restrictions on the successful porting of popular statistical tools and methodologies developed for transcriptomic analysis. Thus, the pre-processing and analysis of targeted bisulfite sequencing microarrays is a challenging research area in which no gold-standard methods have yet been established.

From an artificial intelligence perspective, the identification of biomarkers is considered a complex task, in which feature selection is indispensable prior to the classification task that separates various physiological states (i.e., disease stages) [27,28]. The derivation of a small feature set that best explains the difference between the biological states aims to yield robust, well-performing classifiers and to render the problem computationally tractable.
Feature selection is, in general, a prerequisite for setting up reliable classification models in bioinformatics, given the usually high dimensionality of the feature spaces observed in microarray analyses [29]. While feature selection and classification methods have been comprehensively explored in the context of gene expression data, little work has been done on how to perform feature selection or classification in the context of epigenetic data. Given the importance of epigenomics in cancer and other complex genetic diseases, it is critical to identify the appropriate statistical methods to be used in this novel context. So far, related studies have focused on the derivation of biomarkers using cell lines or the 27K DNA Methylation Array by Illumina [30,31,32].

In the current study, we employ a data mining framework for the analysis of genome-scale epigenetic data produced by Illumina’s Infinium Human Methylation 450K Bead Chip [26]. We aim to examine, retrospectively, the manifestation of two cancer types (breast cancer and B-cell lymphoma) through genome-scale measurements of DNA methylation in blood samples collected from an Italian epidemiological cohort. At the time of collection, all donors were considered healthy. In this sense, the aim is to discover early molecular predictive markers of disease, probably years before its macroscopic observation, by correctly classifying samples into three disease-related categories. These correspond to healthy, breast cancer, and B-cell lymphoma categories, which are phenotypically considered essentially disparate and, at the same time, are extremely broad, encompassing very diverse combinations of molecular phenotypes (groups of genes or other molecules orchestrating disease emergence and progression).

For our purposes, we have developed a workflow consisting of a feature selection module and of three-class (control vs. two cancer types) or two-class (control vs. either cancer type, or one cancer type vs. the other) classification modules. Feature selection is based on two different methodologies: (i) an evolutionary algorithm, which belongs to the class of meta-heuristic optimization methods inspired by biological evolution, and (ii) the GORevenge algorithm, a previously published graph-theoretic methodology [33] that exploits semantics, i.e., data represented on structured knowledge models such as ontologies, included in the Gene Ontology (GO) tree. To the authors’ knowledge, this is the first time that an artificial intelligence-based pipeline has been applied to the extended version of the Illumina Bead Chip arrays.

Data came from an Italian epidemiological cohort consisting of samples organized in control, breast cancer, and B-cell lymphoma classes. The available samples were randomly split into two independent datasets: (i) a training set, used for feature selection, for training various popular classifiers, and for evaluating them through resampling, and (ii) a testing set, which consists of samples not involved at all in the training of the classifiers and serves as an independent set for a real-world evaluation scheme. The pre-processing methodology, previously presented by the authors in [34], includes (i) the correction of the methylation signals, using a novel intensity-based correction method and appropriate quality controls, and (ii) a statistical pre-selection of candidate CpG sites to be used for our data mining purposes in the current study. Data are analyzed with RapidMiner, a freely available open-source data mining platform that fully integrates the WEKA machine learning library and additionally supports the processing and use of data and metadata [35].

Results show that the subsets of features, corresponding to CpG sites, delivered by the feature selection modules could represent predictive biomarkers for the two cancer types studied.
Furthermore, the series of classifiers achieved encouraging classification performance. The gene enrichment and pathway analysis that followed evaluated the biological content of the subsets of CpG sites delivered by the two selection methods.
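
To make the general shape of such a workflow concrete, the following is a minimal Python sketch of a random train/test split followed by evolutionary (genetic-algorithm) feature selection that is scored by resampling on the training set only. It is an illustration under stated assumptions, not the pipeline used in this study, which was built in RapidMiner with WEKA classifiers. The data are synthetic values in [0, 1], the nearest-centroid classifier is a simple stand-in, and all function names and parameter values (population size, mutation rate, feature-count penalty) are hypothetical.

```python
import random

def make_synthetic(n_per_class=30, n_features=40, informative=(0, 1, 2), seed=0):
    """Synthetic stand-in for methylation data: values in [0, 1], two classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.random() for _ in range(n_features)]
            for j in informative:  # only these features differ between classes
                row[j] = min(1.0, max(0.0, rng.gauss(0.3 + 0.4 * label, 0.1)))
            X.append(row)
            y.append(label)
    return X, y

def split(X, y, test_fraction, rng):
    """Random split into a training set and an untouched testing set."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    n_test = round(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

def centroid_accuracy(Xtr, ytr, Xte, yte, mask):
    """Accuracy of a nearest-centroid classifier restricted to masked features."""
    cols = [j for j, m in enumerate(mask) if m]
    if not cols:
        return 0.0
    centroids = {}
    for label in set(ytr):
        rows = [x for x, t in zip(Xtr, ytr) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(Xte, yte):
        pred = min(centroids, key=lambda c: sum(
            (x[j] - m) ** 2 for j, m in zip(cols, centroids[c])))
        correct += pred == t
    return correct / len(yte)

def fitness(mask, X, y, folds=3, penalty=0.005):
    """Mean k-fold accuracy on the training set minus a small size penalty."""
    n, acc = len(X), 0.0
    for f in range(folds):
        te = [i for i in range(n) if i % folds == f]
        tr = [i for i in range(n) if i % folds != f]
        acc += centroid_accuracy([X[i] for i in tr], [y[i] for i in tr],
                                 [X[i] for i in te], [y[i] for i in te], mask)
    return acc / folds - penalty * sum(mask)

def evolve(X, y, rng, pop_size=20, generations=15, mutation_rate=0.05):
    """Genetic algorithm over binary feature masks (1 = feature kept)."""
    n_features = len(X[0])
    pop = [[int(rng.random() < 0.2) for _ in range(n_features)]
           for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(pop, key=lambda m: fitness(m, X, y), reverse=True)
        parents = ranked[: pop_size // 2]          # truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_features)     # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < mutation_rate) for bit in child])
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, X, y))

rng = random.Random(42)
X, y = make_synthetic()
Xtr, ytr, Xte, yte = split(X, y, 0.3, rng)
best_mask = evolve(Xtr, ytr, rng)                  # testing set is never seen here
selected = [j for j, m in enumerate(best_mask) if m]
test_acc = centroid_accuracy(Xtr, ytr, Xte, yte, best_mask)
print("selected features:", selected)
print("held-out accuracy:", round(test_acc, 3))
```

The design point the sketch tries to convey is the separation of roles: the fitness function only ever sees training samples, so the accuracy on the testing set remains an independent estimate of how the selected feature subset would perform on new samples.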