3. Results
This section describes the main features of the OSAnalyzer tool and presents a case study in which a genomics dataset annotated with clinical data is analyzed with OSAnalyzer.

3.1. OSAnalyzer
OSAnalyzer is a platform-independent application implemented entirely in the Java 6 programming language, which makes it available for the Windows, Linux, and Mac OS X operating systems.
OSAnalyzer provides a simple, essential graphical user interface (GUI) that gives users easy access to the tool's functionalities. OSAnalyzer is very simple to use: the analysis of a complete OS-dataset requires just a few mouse clicks, namely: (i) load the input OS-dataset by using the “File” command located in the menu bar, as shown in Figure 6a; as a result, OSAnalyzer shows the user a file system navigation window; (ii) browse the file system to select the file to analyze (see Figure 6b); and (iii) wait until OSAnalyzer finishes the computation and shows the results, sorted in descending order of the statistical significance of the log-rank test and presented in two separate result navigation panels, one for OS and one for PFS.
The use of two separate result navigation panels makes it simple to locate the most relevant results for OS and PFS. Thus, users can easily find the significance of a probe with respect to OS and PFS (see Figure 6c). The main goal of the result navigation panels is to simplify the analysis of the results. In fact, each meaningful probe is associated with three curves, obtained by comparing the detected alleles in pairs, which have different significance values. Thus, using the search function (located in the top right corner of the window), it is possible to see the log-rank test value of each curve obtained by comparing the three groups with one another (Figure 6c). By selecting one probe at a time and using the “plot-os” or “plot-pfs” button, it is possible to display the OS or PFS curves, as depicted in Figure 6d. Moreover, at the bottom of the window, OSAnalyzer provides a quick summary of the most important measures for comparing survival curves, such as the median and the hazard ratio for each curve, as shown in Figure 6d. Users can save to file the results and the curves that they consider relevant by right-clicking on the chart. Finally, OSAnalyzer can automatically annotate the relevant SNPs related to overall survival by using the annotation libraries provided by Affymetrix, or by retrieving further information from the dbSNP and PharmGKB databases. OSAnalyzer is distributed under a Creative Commons License and is freely downloadable for academic and not-for-profit institutions at: https://sites.google.com/site/overallsurvivalanalyzer/.
The automatic analysis of the whole microarray dataset spares users the manual analysis of every probe to determine which probes are relevant from an OS or PFS point of view. With this feature, users can analyze a whole DMET microarray dataset in one go, whereas other available tools force the user to organize the analysis of the whole dataset manually each time, increasing the risk of introducing mistakes. In this way, OSAnalyzer allows users to focus only on the analysis of the results.
The capabilities made available by OSAnalyzer are:
(i) Loading and analysis of OS-datasets: OSAnalyzer is currently able to parse data files in xlsx format (the file format defined by Excel) and CSV (comma-separated values) format, as well as tab-delimited files. In this way, users may also prepare their own datasets, e.g., by merging samples coming from different experimental batches;
(ii) Overall survival significance: OSAnalyzer automatically computes and visualizes the overall survival significance associated with all probes, showing users the probes ranked by p-value significance;
(iii) Progression-free survival significance: OSAnalyzer automatically computes and visualizes the progression-free survival significance associated with all probes, showing users the probes ranked by p-value significance;
(iv) Overall and progression-free survival curve visualizer: it is possible to display the survival curve associated with a selected probe. Furthermore, the current version of OSAnalyzer provides users with additional information on the median associated with each curve, the log-rank p-value, and the hazard ratio value.
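As a rough illustration of the loading step for delimited text files, the following Python sketch reads a comma- or tab-delimited OS-dataset and merges two batches. It is not OSAnalyzer's code (OSAnalyzer is written in Java); the function names read_os_dataset and merge_batches, the delimiter heuristic, and the column names used in any example input are assumptions made for illustration only. The xlsx format is not covered here.

```python
import csv
import io


def read_os_dataset(text):
    """Parse a comma- or tab-delimited dataset given as a string.

    Returns one dict per sample, keyed by the header names.
    All values are kept as strings; type conversion is left to the caller.
    """
    lines = text.splitlines()
    header = lines[0] if lines else ""
    # Assumption: the delimiter is whichever of tab/comma is more frequent
    # in the header line.
    delimiter = "\t" if header.count("\t") > header.count(",") else ","
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


def merge_batches(*batches):
    """Concatenate samples coming from different experimental batches."""
    merged = []
    for batch in batches:
        merged.extend(batch)
    return merged
```

Keeping every batch as a list of dictionaries makes merging a plain concatenation, provided that the batches share the same column names.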

3.2. Case Study
In this section, we discuss Kaplan-Meier (K and M) estimation in the context of overall survival, that is, survival up to the event of interest, and we show through a case study how to read the results of an overall survival analysis.
Generally, the purpose of overall survival analysis is to use the available data to estimate the chance of surviving to different times.
Clinical annotations are provided by the physicians and include overall survival, PFS, and response for each patient. These data are added to the DMET dataset (see Table 1) to obtain the OS-dataset, which is used to correlate each SNP of the ADME genes with OS and PFS.
The OS-dataset (shown in Table 2) is obtained by adding the following clinical annotations (temporal clinical trends for each patient). Overall survival data (expressed in months) are collected from the starting point, for example, when the treatment starts or when the subject is enrolled in the study, to the end point, that is, when the event of interest (i.e., death) occurs. The occurrence of the event of interest is recorded in the Status-OS variable, where 1 means that the event of interest occurred and 0 indicates censored data, i.e., the subject dropped out of the study for an unknown reason. PFS data (expressed in months) are collected for each subject, beginning when the subject starts the treatment and ending when the disease progresses or when the subject dies for any reason. The PFS-Status variable takes a value of 1 to indicate the occurrence of the event of interest and a value of 0 to indicate a censored observation. Finally, the response variable conveys the presence of metastasis when it equals 1 and the absence of metastasis when it equals 0.
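The structure of one OS-dataset row can be sketched in Python as follows. This is only an illustrative data model under stated assumptions: the class name Patient, its field names, and the helper group_by_genotype are hypothetical and do not come from OSAnalyzer; only the meaning of the 1/0 status codes follows the description above.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Patient:
    os_months: float    # time from start of treatment/enrollment to death or censoring
    status_os: int      # 1 = event of interest (death) observed, 0 = censored
    pfs_months: float   # time to progression, death for any reason, or censoring
    status_pfs: int     # 1 = event of interest observed, 0 = censored
    response: int       # 1 = metastasis present, 0 = absent
    genotypes: dict = field(default_factory=dict)  # probe id -> call, e.g. "C/T"


def group_by_genotype(patients, probe, endpoint="os"):
    """Split (time, status) pairs by the genotype detected for one probe."""
    groups = defaultdict(list)
    for p in patients:
        genotype = p.genotypes.get(probe)
        if genotype is None:
            continue  # no call available for this probe
        if endpoint == "os":
            groups[genotype].append((p.os_months, p.status_os))
        else:
            groups[genotype].append((p.pfs_months, p.status_pfs))
    return dict(groups)
```

Grouping the (time, status) pairs by genotype yields, for each probe, the groups whose survival curves are later compared.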
In this way, OSAnalyzer can compute OS and PFS according to the presence/absence of SNPs of the ADME genes for each probe and, by using the log-rank test, compare and rank each SNP according to the significance of its p-value.
These results may help clinicians to understand whether those SNPs play a role in improving the response to cancer treatments and, ultimately, patients’ outcome.
A K and M estimator provides a graph of the survival function that summarizes the time-related information. To illustrate OS analyses with OSAnalyzer, we randomly generated a synthetic dataset of 80 patients affected by an advanced cancer as the basis of an observational study of this disease.
Thus, survival analysis uses information from the whole follow-up period allowing us to illustrate the important point that comparative analysis between OS-curves depends upon the area under the K and M curves (AUC) and not only on differences based on single points, especially in real clinical studies.
The first step of the K and M analysis concerns data collection and arrangement. Data arrangement is necessary to put the data in the format expected by the chosen analysis tool. There are plenty of statistical analysis tools, either available under the GNU General Public License, such as OSAnalyzer (https://sites.google.com/site/overallsurvivalanalyzer/) and PSPP, or proprietary, such as SPSS and MATLAB, each with its own requirements in terms of data arrangement. All of the software cited above requires that data be arranged in tabular form, containing at least the following information: (i) serial times; (ii) status at serial time (1, event of interest; 0, censored); and (iii) other kinds of clinical data, such as response rate, histotype, sex, etc., as nominal variables.
In any case, before beginning the analysis of the data, it is necessary to choose the analysis tool. The choice must be made according to the type of data to analyze. For instance, analyzing an SNP DMET dataset with SPSS requires extra effort from the user to convert each SNP call (“A/A, A/T, ...”) into numerical values, because SPSS cannot analyze string values. Such a conversion must be done manually by the user, which increases the probability of introducing errors. It is worth noting that the translated file is also necessary for the PSPP software tool. To avoid this expensive step, OSAnalyzer can automatically analyze such SNP DMET datasets and, most importantly, provide users with all of the results ranked according to the statistical significance of the log-rank test.
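The genotype-to-number conversion mentioned above can be automated rather than done by hand. The following Python sketch shows one possible coding scheme, in which each distinct genotype string of a probe receives an integer code in alphabetical order. The function name encode_genotypes and the coding scheme itself are assumptions for illustration; they are not the procedure used by OSAnalyzer or prescribed by SPSS or PSPP.

```python
def encode_genotypes(calls):
    """Map genotype strings (e.g. "A/A", "A/T") to integer codes.

    Codes are assigned in alphabetical order of the distinct labels,
    so the same set of labels always yields the same mapping.
    Returns the encoded list and the label-to-code dictionary.
    """
    labels = sorted(set(calls))
    codes = {label: index for index, label in enumerate(labels)}
    return [codes[call] for call in calls], codes
```

Returning the mapping together with the encoded values lets the user translate the numeric groups back to genotypes when reading the results.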
To illustrate how this all works, we prepared a synthetic OS-dataset extended with temporal data for the subjects in each of the three groups corresponding to the three allele variants of each probe (80 subjects in total). The event of interest is “death”, represented by the symbol 1. To understand the K and M curves, let us look at Figure 7.
The lengths of the horizontal lines along the X-axis represent the duration of the survival time. The vertical drops indicate the end of an interval, due to the occurrence of the event of interest. The vertical lines themselves serve mainly an aesthetic function, making the curve more pleasing to observe; the vertical distance between horizontal lines, however, is crucial because it conveys the change in cumulative probability. The following is an example of how the points of a survival curve can be roughly interpreted. The cumulative probability of surviving a given time can be read on the Y-axis. For example, the probability of surviving 28 months for patients in the group labeled “T/T” is 60%; conversely, the probability of surviving the same time for patients belonging to the groups “C/T” and “C/C” is slightly more than 90%. It is worth noting that the steepness of the curve depends on the length of the horizontal lines, that is, on how long the event of interest is absent. Censored patients are another element that affects the survival estimate. They are represented as tick marks on the survival curves, and they affect the cumulative probability of the groups under investigation. In detail, the fourth and the fourteenth censored patients (represented by the ticks on the curve) in the “C/T” and “T/T” groups, respectively, contribute to reducing the probability of surviving at least 28 months, whereas the fifth censored patient in the “C/C” group does not change the probability of surviving 28 months. However, the censored values contained in the three groups contribute to reducing the cumulative survival across the intervals. Hence, we must be careful in interpreting anything beyond this point because our temporal data do not allow us to extrapolate any further hypothesis on survival.
It is worth noting that intervals (horizontal lines) in the K and M curve are constructed only for the events of interest and not for the censored patients. As stated, this is conveyed in Figure 7 by the corners joining horizontal and vertical segments. Thus, in groups “C/T”, “C/C”, and “T/T”, there are four, three, and nine events (vertical connections between the end of one interval and the beginning of the next) demarcating five, four, and ten intervals (horizontal lines), respectively. Note that censored patients produce no vertical changes. Moreover, Figure 7 clearly highlights the capability of the K and M method to deal with variable intervals.
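The way events and censored patients shape the curve can be made concrete with a minimal Python sketch of the standard product-limit (Kaplan-Meier) estimate. It is an illustration under stated assumptions, not OSAnalyzer's implementation: the function names kaplan_meier and survival_at are hypothetical, and each observation is assumed to be a (time, status) pair with status 1 for the event and 0 for censoring.

```python
def kaplan_meier(observations):
    """Product-limit estimate from a list of (time, status) pairs.

    Returns one (time, survival) step per distinct event time.
    Censored observations create no step; they only reduce the number
    of subjects at risk for the steps that follow.
    """
    ordered = sorted(observations)
    at_risk = len(ordered)
    survival = 1.0
    steps = []
    i = 0
    while i < len(ordered):
        t = ordered[i][0]
        deaths = 0
        leaving = 0
        # Collect every subject whose time equals t (events and censored).
        while i < len(ordered) and ordered[i][0] == t:
            deaths += ordered[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            steps.append((t, survival))
        at_risk -= leaving
    return steps


def survival_at(steps, t):
    """Read the cumulative survival probability at time t from the steps."""
    probability = 1.0
    for time, value in steps:
        if time > t:
            break
        probability = value
    return probability
```

For instance, for the hypothetical observations (5, 1), (8, 0), (12, 1), (20, 0), (28, 1), the function returns three steps, one per event, which demarcate four horizontal intervals. The patient censored at 8 months produces no drop, but makes the drop at 12 months larger because only three subjects are still at risk.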
The comparison of survival curves is the most important step in all medical oncology clinical trials. The shape of the curve is important to evaluate. Curves with many small steps usually have a higher number of participating subjects, whereas curves with large steps usually have a limited number of subjects and are, thus, less accurate. Although it is simple to see the difference between two survival curves, the difference must be quantified to assess its statistical significance. The log-rank test and the hazard ratio are the most common methods used for comparing survival curves. In detail, the log-rank test indicates whether two curves are statistically different, whereas the hazard ratio shows the increased rate of having an event in one curve versus the other.
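A minimal Python sketch of a two-group comparison is given below. It computes the standard log-rank chi-square statistic (one degree of freedom) and approximates the hazard ratio from the observed and expected numbers of events, (O1/E1)/(O2/E2). This is an assumption-laden illustration: the function name logrank is hypothetical, the observed/expected ratio is only one common approximation of the hazard ratio, and the text does not state which estimator OSAnalyzer uses.

```python
import math


def logrank(group_a, group_b):
    """Two-group log-rank test on lists of (time, status) pairs.

    Returns (chi2, p_value, hazard_ratio); the hazard ratio of group A
    versus group B is approximated by (O_a / E_a) / (O_b / E_b).
    """
    pooled = list(group_a) + list(group_b)
    event_times = sorted({t for t, status in pooled if status == 1})
    obs_a = obs_b = exp_a = exp_b = variance = 0.0
    for t in event_times:
        # Subjects still at risk just before time t in each group.
        n_a = sum(1 for time, _ in group_a if time >= t)
        n_b = sum(1 for time, _ in group_b if time >= t)
        # Events occurring exactly at time t in each group.
        d_a = sum(1 for time, s in group_a if time == t and s == 1)
        d_b = sum(1 for time, s in group_b if time == t and s == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        obs_b += d_b
        exp_a += d * n_a / n
        exp_b += d * n_b / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = (obs_a - exp_a) ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    if exp_a > 0 and exp_b > 0 and obs_b > 0:
        hazard_ratio = (obs_a / exp_a) / (obs_b / exp_b)
    else:
        hazard_ratio = float("nan")  # not defined for this input
    return chi2, p_value, hazard_ratio
```

Applying such a function to each pair of genotype groups of a probe and sorting the probes by p-value would give a ranking in the spirit of the one described in this section.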