2.2. Related Works

This section summarizes the main features and capabilities of several well-known stand-alone and web-based software tools used to analyze and validate survival data.

Partial Cox regression analysis (PCR) [15] is a collection of command-line scripts written in R. It is based on a partial Cox regression method that constructs mutually uncorrelated components from microarray gene expression data in order to predict the survival of future patients. The R code of PCR is available upon request.

Significance analysis of microarrays (SAM) [16] is statistical software for finding significant genes in a set of microarray experiments. The data must be entered in an Excel spreadsheet arranged as follows: the first row holds the response variable, whereas all remaining rows contain gene expression data, one row per gene. The response variable may be a grouping (e.g., untreated, treated), a quantitative variable (e.g., blood pressure), or a possibly censored survival time. SAM is used to identify genomic features correlated with biological and/or clinical phenotypes of interest, including time-to-event clinical outcomes. SAM is freely available after registration as an R package or Excel add-in.

GenePattern [17] is software with a simple graphical user interface (GUI) that gives access to a large number of computational methods for analyzing genomic data, presented to the user as graphical modules. To analyze a particular dataset, users must manually define a customized pipeline from the available modules. Among its other statistical capabilities, GenePattern can create and visualize survival curves from censored data arranged in a “cls” file.
GenePattern is freely available, although on the first run the GenePattern registration page opens in the user's browser (modules run only on the GenePattern server).

PSPP (https://www.gnu.org/software/pspp/) is a free software application (a free alternative to SPSS) for the statistical analysis of sampled data. It comes with a graphical user interface that makes its capabilities easy to use. Advanced users can also run PSPP from the command-line interface to obtain better performance.

R (https://www.r-project.org) is an open source software environment for statistical and mathematical computing and graphics. R can be considered a different implementation of the S language, so much code written for S also runs under R.

SPSS (http://www-01.ibm.com/software/analytics/spss/) is a software package developed to support users in statistical data analysis. In addition to statistical analysis, SPSS provides data mining, text analytics, and data collection capabilities.

MATLAB (www.mathworks.com/products/matlab/) (matrix laboratory) is a numerical computing environment. MATLAB offers a variety of statistical tools to manipulate data and graphical techniques to produce quality plots. It also provides its own proprietary programming language, through which all MATLAB functionality is accessible, including the implementation of algorithms, the creation of user interfaces, and so on.

Kaplan-Meier Plotter [18] is an integrated database (containing breast, ovarian, and lung cancer information) and an online tool capable of uni/multivariate analysis for in silico validation of new biomarker candidates in non-small cell lung cancer. Univariate and multivariate Cox regression analyses are computed, and Kaplan-Meier survival plots with hazard ratios and log-rank p-values are drawn, by using R.
Net-Cox [19] is a network-based Cox regression model used for large-scale survival analysis across multiple ovarian cancer datasets. Datasets must be arranged as tab-delimited text files in order to be compatible with Net-Cox. In particular, compatible datasets are obtained by combining cancer information and patient clinical information, stored in separate files, into a single file arranged as a textual matrix. Net-Cox is freely available as a MATLAB plug-in.

Survcomp [20] is a freely available R-based Bioconductor package for survival risk model comparison. It provides functions to assess and compare the performance of risk prediction (survival) models.

cBioPortal [21] is a web-based resource for cancer genomics that provides visualization, analysis, and download of large-scale cancer genomics datasets. All of its visualization, querying, and analysis functionalities are easy to use through a graphical user interface, including survival analysis based on the Kaplan-Meier estimator and the log-rank test to compare multiple survival curves. cBioPortal is also available as CGDS-R/MATLAB packages, which provide a basic set of functions for querying the cancer genomic data server (CGDS) from the R/MATLAB platforms for statistical computing.

PrognoScan [10] is a freely available web database usable through a simple graphical user interface. PrognoScan makes it easy to search for relations between gene expression and patient prognosis, such as overall survival (OS) and disease-free survival (DFS), across a large collection of publicly available cancer microarray datasets. PrognoScan includes the Kaplan-Meier estimator and a plotting function for Kaplan-Meier curves.
SurvExpress [22] is a comprehensive gene expression database and web-based tool that provides survival analysis and risk assessment in cancer datasets starting from a biomarker gene list, reporting Kaplan-Meier plots, the log-rank test of differences between groups, and the hazard ratio estimate.

All of the tools listed above have been developed for the analysis and visualization of gene expression data, with the exception of R, MATLAB, SPSS, and PSPP, which can be classified as general numerical computing environments. Each tool requires that input datasets meet specific criteria; otherwise the tool cannot analyze the data. For example, to analyze gene expression data with SAM, the data must be put into an Excel spreadsheet in which the first row contains the response measurements, one per column, starting from the third column. The remaining rows contain gene expression measurements, one row per gene: “Column1” contains the gene name, “Column2” contains the gene ID, and the remaining columns contain the expression measurements as numbers. Missing expression measurements should be reported as either blank or non-numeric values. It is worth noting that the file format defined by SAM is not compatible with the file formats defined by the other software tools. The tools developed to analyze gene expression data cannot deal with non-numerical values, a limitation that makes it impossible for SAM and the other gene expression tools to analyze the DMET dataset, since the DMET dataset contains only non-numerical values.

General numerical analysis environments call for a different assessment when they are used to analyze DMET datasets. Although they are numerical analysis tools, they are not intended to work directly with SNP datasets, which is why SPSS, R, and so forth can analyze OS-datasets only after the dataset has been manually converted into a format compatible with the tool used for the analysis.
To the best of our knowledge, there are currently no tools for importing/exporting DMET/OS datasets in R, SPSS, MATLAB, and so on. For example, analyzing an OS-dataset with SPSS requires significant effort, because the user has to manually convert the genotype call of each of the 1936 probes (rows) for every subject (column). An OS-dataset with 100 subjects therefore requires 193,600 conversions (1936 × 100) to turn each SNP into a numerical value (i.e., assigning the value 1 to A/A, 2 to A/C, and so on). Moreover, in addition to translating the SNPs, users must remove special symbols such as “NoCall”, “ZeroCopyNumber”, “RareAllele”, and so on, because they are not useful for the data analysis. After this conversion step, the user can load the file into SPSS and must plot each probe individually, one after the other, in order to identify significant correlations between SNPs and overall survival. Another limitation of converting SNPs into numbers is that the plotted overall survival curves no longer show which SNP each curve refers to. The user has to remember this correspondence and manually associate each numerical value with the plotted SNP.

Another way to analyze DMET data with general numerical analysis environments is to write custom code that imports the DMET dataset and converts it into a format compatible with the tool. This requires advanced programming skills, and it is time consuming and error prone. It is therefore not an easy task for life scientists, who only need to analyze data and obtain clues about correlations between overall survival and SNPs. Conversely, OSAnalyzer differs from the tools listed above in that, to the best of our knowledge, it is the only one able to automatically analyze a whole DMET dataset annotated with clinical data.
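The genotype-to-number conversion described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in this work: the input layout (a dictionary from probe identifier to the list of calls, one per subject), the function names, and the rule of numbering genotypes in alphabetical order are hypothetical, and the probe name AM_00001 is invented for the example. Only the three special symbols named in the text are filtered.

```python
# Special symbols named in the text; the real list may be longer (assumption).
SPECIAL_CALLS = {"NoCall", "ZeroCopyNumber", "RareAllele"}


def build_code_table(dataset):
    """Assign 1, 2, 3, ... to the distinct genotypes, in alphabetical order."""
    genotypes = sorted(
        {call
         for calls in dataset.values()
         for call in calls
         if call not in SPECIAL_CALLS}
    )
    return {genotype: code for code, genotype in enumerate(genotypes, start=1)}


def encode_dataset(dataset):
    """Return (encoded, legend).

    encoded maps each probe to a list of integer codes, with None where the
    original call was a special symbol. legend maps each code back to its
    genotype, so the SNP behind a plotted curve is not lost.
    """
    table = build_code_table(dataset)
    encoded = {
        probe: [table.get(call) for call in calls]
        for probe, calls in dataset.items()
    }
    legend = {code: genotype for genotype, code in table.items()}
    return encoded, legend


# Toy OS-dataset: 2 probes (rows) x 4 subjects (columns); values are invented.
toy = {
    "AM_13458": ["A/A", "A/C", "NoCall", "C/C"],
    "AM_00001": ["G/G", "A/A", "RareAllele", "G/G"],
}
encoded, legend = encode_dataset(toy)

# Size of the manual task quoted in the text: 1936 probes x 100 subjects.
n_conversions = 1936 * 100
```

Keeping the legend next to the encoded values addresses the limitation noted above, since each numerical label can be mapped back to its SNP when the curves are plotted.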
Automatic data analysis makes it possible to highlight which genomic features are associated with clinical outcomes, including, for example, the response to certain treatments and the prognosis of patients under specific clinical scenarios. Thus, data analysis becomes straightforward: users do not have to worry about how data should be arranged or which settings to use or, even worse, manually inspect the whole dataset to detect significant clues. Manual analysis is time-consuming and increases the probability of introducing mistakes, reducing the accuracy of the results. Instead, OSAnalyzer spares users the manual analysis of all probes to figure out which probes are relevant from an overall survival or PFS point of view [9,23]. On the other hand, the current version of OSAnalyzer cannot analyze gene expression data in order to plot overall survival curves, and it offers limited data analysis functionalities.

Due to space limitations, we present only the comparison between OSAnalyzer and SPSS to demonstrate the reliability of our tool. To analyze OS-datasets with SPSS, the first operation is to convert the literal SNP symbols contained in the OS-dataset into numerical values. To speed up the translation process, we used regular expressions to convert each SNP in the dataset into a unique numerical identifier. The translated file is depicted in Figure 2. After loading the converted OS-dataset, the survival analysis in SPSS can be run from the configuration panel, accessible from the menu “Analyze > Survival > Kaplan-Meier…” (see Figure 3). After selecting the Kaplan-Meier function, the user has to set all of the configuration parameters and, in particular, has to enable the display of the survival curve from the “Options” menu.
To finish the setup, the user has to click on the “OK” button to start the computation of the overall survival curve for the selected probe (in this example, AM_13458); the result is presented to the user as depicted in Figure 4. It is worth noting that users have to manually investigate each probe one by one to identify significant correlations between SNPs and overall survival.

Conversely, with OSAnalyzer the same overall survival analysis requires less effort. Users only have to load the whole OS-dataset, and the tool presents the probes ranked according to the statistical significance of the log-rank test. Finally, comparing the survival curves obtained by OSAnalyzer with the survival curves produced by SPSS demonstrates the reliability of OSAnalyzer. Both curves present the same trends, both have median values equal to 0, and censored data show the same distribution on each curve. Moreover, both diagrams report a statistical significance equal to 0.087, meaning that the observed differences among the groups are not significant at the conventional 0.05 level and may be due to chance. Figure 4, obtained by using SPSS, is harder to interpret because, owing to the conversion step, it gives no information about which SNPs belong to probe AM_13458. In Figure 4 each curve is identified only by one of the values “1.00, 3.00, 6.00”. Instead, in Figure 5, each curve is identified by the corresponding SNP, making it easier to identify which SNP is responsible for shorter survival, toxicity, and so forth.
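To make the compared quantities concrete, the following minimal Python sketch computes a Kaplan-Meier curve and a log-rank test. It is not the implementation used by OSAnalyzer or SPSS. The function names and the toy data are hypothetical, and the log-rank test is deliberately restricted to two groups, where the p-value of the chi-square statistic with one degree of freedom has a closed form; probes with three genotype groups, as in Figure 4, would need the general k-group statistic.

```python
import math
from collections import defaultdict


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.

    times: follow-up time of each subject.
    events: 1 if the event (death) was observed, 0 if censored.
    """
    at_risk = len(times)
    deaths = defaultdict(int)
    leaving = defaultdict(int)  # subjects leaving the risk set at time t
    for t, e in zip(times, events):
        leaving[t] += 1
        deaths[t] += e
    survival, curve = 1.0, []
    for t in sorted(leaving):
        if deaths[t]:
            survival *= 1.0 - deaths[t] / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


def logrank_two_groups(times, events, groups):
    """Two-group log-rank test; returns (chi_square, p_value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two groups")
    first = labels[0]
    # Per time point: [deaths, deaths in first group, leaving, leaving in first]
    tally = defaultdict(lambda: [0, 0, 0, 0])
    for t, e, g in zip(times, events, groups):
        row = tally[t]
        row[0] += e
        row[2] += 1
        if g == first:
            row[1] += e
            row[3] += 1
    n = len(times)
    n1 = sum(g == first for g in groups)
    diff = 0.0      # observed minus expected deaths in the first group
    variance = 0.0  # hypergeometric variance summed over event times
    for t in sorted(tally):
        d, d1, out, out1 = tally[t]
        if d and n > 1:
            frac = n1 / n
            diff += d1 - d * frac
            variance += d * frac * (1.0 - frac) * (n - d) / (n - 1)
        n -= out
        n1 -= out1
    if variance == 0.0:
        return 0.0, 1.0
    chi_square = diff * diff / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi_square, math.erfc(math.sqrt(chi_square / 2.0))


# Toy example: six subjects, all events observed, two genotype groups.
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 1, 1, 1, 1]
groups = ["A/A", "A/A", "A/A", "A/C", "A/C", "A/C"]
curve = kaplan_meier(times, events)
chi_square, p_value = logrank_two_groups(times, events, groups)
```

Applying such a test to every probe and sorting the probes by p-value would give a ranking of the kind described above, without plotting each probe by hand.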