PMC:5197943 / 6617-24675 JSONTXT


    2_test

    {"project":"2_test","denotations":[{"id":"27669316-15262801-69476992","span":{"begin":5568,"end":5570},"obj":"15262801"},{"id":"27669316-11309499-69476993","span":{"begin":5953,"end":5955},"obj":"11309499"},{"id":"27669316-16642009-69476994","span":{"begin":6665,"end":6667},"obj":"16642009"},{"id":"27669316-24367507-69476995","span":{"begin":8694,"end":8696},"obj":"24367507"},{"id":"27669316-23555212-69476996","span":{"begin":9091,"end":9093},"obj":"23555212"},{"id":"27669316-21903630-69476997","span":{"begin":9563,"end":9565},"obj":"21903630"},{"id":"27669316-23550210-69476998","span":{"begin":9783,"end":9785},"obj":"23550210"},{"id":"27669316-19393097-69476999","span":{"begin":10399,"end":10401},"obj":"19393097"},{"id":"27669316-24066126-69477000","span":{"begin":10845,"end":10847},"obj":"24066126"},{"id":"27669316-16118262-69477001","span":{"begin":15462,"end":15463},"obj":"16118262"},{"id":"27669316-20214778-69477002","span":{"begin":15464,"end":15466},"obj":"20214778"}],"text":"2. Materials and Methods\nOSAnalyzer is a software tool for the automatic analysis of associations between the genome variants of patients and their clinical condition, e.g., different responses to drugs. OSAnalyzer automatically computes the OS and PFS analysis of a whole DMET dataset previously annotated with the clinical data of patients.\nOSAnalyzer is based upon suitable data structures and algorithms able to optimize the computational effort necessary to compute the OS and PFS of a whole DMET dataset encompassing temporal events (such events are usually added to the DMET data only after the DMET analysis has been performed).\nFiles produced by the DMET platform are in a binary microarray data format (e.g., one Affymetrix CEL raw data file for each sample); to obtain a single table of alleles, the DMET console tool must be used to aggregate the data coming from all samples of a dataset. 
Moreover, the DMET console can export an SNP table either as a tab-delimited text file (.txt) or as a standard Excel file (.xls). In detail, the first row contains the identifiers of the samples, while the first column contains the identifiers of the probes. DMET console-exported outcomes are arranged as a large n × m matrix of single nucleotide polymorphisms (SNPs), where n is the number of probes (n = 1936 for the current DMET chips) and m is the number of samples (patients), as depicted in Table 1.\nA generic element (i,j) in Table 1 contains the i-th identified SNP in the j-th sample, so it has the form X/Y, where X, Y ∈ {A,T,C,G,-}.\nIn order to make the automatic computation of OS and PFS possible for OSAnalyzer, it was necessary to extend the output produced by the DMET platform by adding clinical information after the header row. For each sample (patient), the supplementary information consists of temporal information, i.e., the period between the start and end of the clinical observation, along with a status indicating whether or not the patient had a clinical event of interest, e.g., death or metastasis. The extended dataset, hereinafter called OS-dataset, is arranged as conveyed in Table 2.\nAfter annotating a DMET dataset, all the information necessary to compute the OS and PFS curves is available. The curves are obtained by implementing the well-known methodology defined by Kaplan and Meier [13]. The Kaplan-Meier (K and M) estimator is a non-parametric statistic used to estimate the survival function S(t), which is defined as the probability that an individual survives longer than time t.\nThe survival function S(t) is computed for each time interval; in particular, the survival probability is calculated as the ratio between the number of surviving patients and the number of at-risk patients. Patients who have died, dropped out, or have not yet reached that time are not counted as at risk. 
Patients who are lost for some reason are considered censored and are not included in the denominator. The probability of surviving to any point is estimated from the cumulative probability of surviving each of the preceding time intervals (calculated as the product of the preceding probabilities). The probability of surviving from 0 to t_1 is then estimated by Equation (1): S(t_1) = 1 − d_1/r_1 (1), where the ratio d_1/r_1 is the estimated proportion of patients dying in that interval. The estimated survivor function for any arbitrary time t is given by Equation (2): S(t) = ∏_{i=1}^{N} (1 − d_i/r_i) (2)\nIt is worth noting that the product is taken over all values of i for which t_i is less than the time t, where d_i is the number of patients who died at time t_i and r_i is the number of patients at risk at time t_i.\n\n2.1. OSAnalyzer and SNPs Handling for Computing Kaplan-Meier\nThe identifiers of SNPs (i.e., A/G, G/A, and so on) typical for the OS-dataset represented in Table 2 are used to divide a generic probe into groups of samples that have the same SNP, but they are not themselves needed to estimate K and M. At most three groups of samples can be present in each probe, given that there are three possible combinations of SNPs for each probe (e.g., if the probe contains the alleles A and G, then the possible combinations are A/A, G/A, and G/G). In this way, for each probe it is possible to group together the samples with the same polymorphism and evaluate the survival of the samples belonging to the identified groups. Before creating groups of samples identified by the same SNP, it is essential to link together all the SNPs and survival data for the same sample. This connection was realized by defining a virtual projection function (VPF) able to link together the survival data of each subject with their SNPs. 
The VPF virtually links together the survival data with all the SNPs belonging to the same sample without replicating the overall survival data for all the 1936 SNPs of each subject, as conveyed in Figure 1.\nAt the same time, this tight coupling between SNPs and the related survival data makes it possible to split each probe into groups and accurately compute the K and M estimator for each identified group. By means of the VPF, OSAnalyzer is able to compute K and M by using only temporal data, without needing to convert the literal SNPs to numerical values.\n\n2.2. Related Works\nIn this section we summarize the main features and capabilities of some well-known stand-alone and web-based software tools used to analyze and validate survival data.\nPartial Cox regression analysis (PCR) [15] is a collection of command-line scripts written in R. It uses a partial Cox regression method to construct mutually uncorrelated components from microarray gene expression data in order to predict the survival of future patients. The R code of PCR is available upon request.\nSignificance analysis of microarrays (SAM) [16] is a statistical software tool for finding significant genes in a set of microarray experiments. The data should be put in an Excel spreadsheet and arranged as follows: the first row of the spreadsheet has information about the response variable, whereas all remaining rows contain gene expression data, one row per gene. The response variable may be a grouping (like untreated, treated), a quantitative variable (like blood pressure), or a possibly censored survival time. SAM is used to identify genomic features correlated with biological and/or clinical phenotypes of interest, including time-to-event clinical outcomes. 
SAM is freely available after registration as an R package or Excel add-in.\nGenePattern [17] is a software tool with a simple graphical user interface (GUI) that provides access to a large number of computational methods, presented to the user as graphical modules, for analyzing genomic data. To analyze a particular dataset, users have to manually define a customized pipeline using the available modules. Among the other statistical capabilities provided by GenePattern, it is possible to create and visualize survival curves based on censored data arranged in a “cls” file. GenePattern is freely available, but registration is required before the first run (modules run only on the GenePattern server).\nPSPP (https://www.gnu.org/software/pspp/) is a free software application (a free alternative to SPSS) for statistical analysis of sampled data. It comes with a graphical user interface that makes it easy to use the available scientific capabilities. Furthermore, advanced users can run PSPP from the command line interface to obtain better performance.\nR (https://www.r-project.org) is an open source software environment for statistical and mathematical computing and graphics. Moreover, R can be considered a different implementation of the S language, so much code written for S also runs under R.\nSPSS (http://www-01.ibm.com/software/analytics/spss/) is a software package developed to help users in the statistical data analysis process. In addition to statistical analysis, SPSS provides users with data mining, text analytics, and data collection capabilities.\nMATLAB (www.mathworks.com/products/matlab/) (matrix laboratory) is a numerical computing environment. MATLAB offers a variety of statistical tools to manipulate data and graphical techniques to produce high-quality plots. 
MATLAB also has its own proprietary programming language, through which all the functionalities available in MATLAB can be used, including the implementation of algorithms, creation of user interfaces, and so on.\nKaplan-Meier Plotter [18] is an integrated database (containing breast, ovarian, and lung cancer information) and an online tool capable of uni/multivariate analysis for in silico validation of new biomarker candidates in non-small cell lung cancer. Univariate and multivariate Cox regression analysis and Kaplan-Meier survival plots with hazard ratio and log-rank p-values are calculated and plotted by using R.\nNet-Cox [19] is a network-based Cox regression model used for large-scale survival analysis across multiple ovarian cancer datasets. Datasets have to be arranged as a tab-delimited text file in order to be compatible with Net-Cox. In particular, compatible datasets are obtained by combining cancer information and patient clinical information stored in separate files into a single file arranged as a textual matrix. Net-Cox is freely available as a MATLAB plug-in.\nSurvcomp [20] is a freely available R-based Bioconductor package for survival risk model comparison. The survcomp package provides functions to assess and compare the performance of risk prediction (survival) models.\ncBioPortal [21] is a web-based resource for cancer genomics providing visualization, analysis, and downloading of large-scale cancer genomics datasets. All cBioPortal functionalities (visualization, querying, and analysis) are easy to use by means of a graphical user interface, including survival analysis based on Kaplan-Meier curves and the log-rank test to compare multiple survival curves. cBioPortal is also available as CGDS-R/MATLAB packages, which provide a basic set of functions for querying the cancer genomic data server (CGDS) via the R/MATLAB platform for statistical computing.\n
PrognoScan [10] is a freely available web database usable through a simple graphical user interface. PrognoScan makes it easy for users to search for relations between gene expression and patient prognosis, such as overall survival (OS) and disease-free survival (DFS), across a large collection of publicly available cancer microarray datasets. PrognoScan includes the Kaplan-Meier estimator and a plotter function to plot Kaplan-Meier curves.\nSurvExpress [22] is a comprehensive gene expression database and web-based tool providing survival analysis and risk assessment in cancer datasets using a biomarker gene list; its outputs include Kaplan-Meier plots, the log-rank test of differences between groups, and the hazard ratio estimate.\nAll of the tools listed above have been developed for the analysis and visualization of gene expression data, with the exception of R, MATLAB, SPSS, and PSPP, which can be classified as general numerical computing environments.\nEach tool requires that input datasets meet specific criteria; otherwise the tool cannot analyze the data. For example, to analyze gene expression data with SAM, the data must be put into an Excel spreadsheet where the first row contains the response measurements, one per column, starting from the third column. The remaining rows contain gene expression measurements, one row for every gene. “Column1” contains the gene name, “Column2” contains the Gene ID, whereas the remaining columns contain the expression measurements as numbers. Missing expression measurements should be reported as either blank or non-numeric values. It is worth noting that the file format defined by the SAM tool is not compatible with the file formats defined by other software instruments. 
The tools developed to analyze gene expression data cannot deal with non-numerical values, a limitation that makes it impossible for SAM and the other gene expression tools to analyze a DMET dataset, since a DMET dataset contains only non-numerical values.\nDifferent considerations apply to general numerical analysis environments when they are used to analyze DMET datasets. Although they are numerical analysis tools, they are not intended to work directly with SNP datasets; for this reason SPSS, R, and so forth can analyze OS-datasets only after the dataset has been manually converted into a format compatible with the tool used for the analysis. To the best of our knowledge, there are currently no tools for import/export of DMET/OS datasets in R, SPSS, MATLAB, and so on. For example, analyzing an OS-dataset with SPSS requires significant effort from the user, who has to manually convert every cell of the table, i.e., 1936 probes (rows) times the number of subjects (columns). Thus, an OS-dataset with 100 subjects requires 193,600 conversions to turn each SNP into a numerical value (e.g., assigning the value 1 to A/A, 2 to A/C, and so on). Moreover, in addition to the translation of SNPs, users must take into account the removal of special symbols such as “NoCall”, “ZeroCopyNumber”, “RareAllele”, and so on, because they are not useful for the data analysis. After this conversion step the user can upload the file in SPSS and must plot each probe individually, one after the other, in order to identify significant correlations between SNPs and overall survival. Another limitation of converting SNPs into numbers is that, when the overall survival curves are plotted, the link between each curve and its SNP is lost. This correspondence has to be remembered by the user, who has to manually associate each numerical value with the plotted SNP. 
Another way to analyze DMET data with general numerical analysis environments is to write custom code able to import the DMET dataset and convert it into a format compatible with the tool. This requires advanced programming skills, and it is time consuming and error prone. It is therefore not an easy task for life scientists, who only need to analyze data and obtain clues about correlations between overall survival and SNPs.\nConversely, OSAnalyzer differs from the tools listed above in that, to the best of our knowledge, it is the only one with the capability to automatically analyze a whole DMET dataset annotated with clinical data. Automatic data analysis makes it possible to highlight which genomic features are useful for the association with clinical outcomes, including, for example, the response to certain treatments and the prognosis of patients under specific clinical scenarios. Thus, data analysis becomes straightforward, without users having to worry about how data should be arranged or which settings to use or, even worse, having to manually investigate the whole dataset to detect significant clues. Manual analysis is time-consuming and may increase the probability of introducing mistakes that reduce the accuracy of the results. Instead, the use of OSAnalyzer avoids wasting time on the manual analysis of all probes in order to figure out which probes are relevant from an overall survival or PFS point of view [9,23]. On the other hand, the current version of OSAnalyzer cannot analyze gene expression data in order to plot overall survival curves, and it offers limited data analysis functionalities.\nDue to space limitations, we present only the comparison between OSAnalyzer and SPSS to demonstrate the reliability of our tool. To be able to analyze OS-datasets with SPSS, the first operation is to convert the literal SNP symbols contained in the OS-dataset into numerical values. 
To speed up the translation process, we used regular expressions to convert each SNP in the dataset into a unique numerical identifier. The translated file is depicted in Figure 2.\nAfter loading the converted OS-dataset, the survival analysis in SPSS can be done by using the configuration panel, accessible from the menu “Analyze \u003e Survival \u003e Kaplan-Meier…” (see Figure 3).\nAfter selecting the Kaplan-Meier function, the user has to set up all of the configuration parameters and, in particular, has to enable the display of the survival curve from the “Options” menu. To finish the setup, the user has to click on the “OK” button to start the computation of the overall survival curve for the selected probe (in this example, the probe is AM_13458); the result is shown to the user as depicted in Figure 4. It is worth noting that users have to manually investigate each probe one by one to identify significant correlations between SNPs and overall survival.\nConversely, using OSAnalyzer the same overall survival analysis requires less effort from the user. In fact, users only have to load the whole OS-dataset, and the tool presents the probes ranked according to the statistical significance of the log-rank test. Finally, comparing the survival curves obtained by OSAnalyzer with the survival curves produced by SPSS demonstrates the reliability of OSAnalyzer.\nBoth curves present the same trends, both have median values equal to 0, and censored data show the same distribution on each curve. Finally, both diagrams present a statistical significance equal to 0.087, meaning that the observed differences among the groups may be due to chance. Figure 4, obtained by using SPSS, is harder to interpret because, due to the conversion step, it does not provide any information about which SNPs belong to the probe AM_13458. In Figure 4 each curve is identified by one of the values “1.00, 3.00, 6.00”. 
Instead, in Figure 5, each curve is identified by the corresponding SNP, making it easier to identify which SNP is responsible for shorter survival, or toxicity, and so forth.\n"}
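
The per-probe procedure described in the text (group the samples of one probe by their literal SNP, then apply the Kaplan-Meier product of Equation (2) to each group) can be illustrated with a minimal Python sketch. This is not OSAnalyzer's implementation: the function names `kaplan_meier` and `km_by_genotype` and the toy inputs are assumptions made for illustration, and the curve is evaluated at each observed event time using the usual convention that patients still under observation at that time are at risk. The special symbols skipped by default are the ones named in the text.

```python
from collections import defaultdict


def kaplan_meier(times, events):
    """Kaplan-Meier curve as a list of (time, S(time)) at each event time.

    times:  follow-up duration of each patient
    events: 1 if the clinical event was observed, 0 if the patient is censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)          # r_i: patients still under observation
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0               # d_i: events observed exactly at time t
        leaving = 0              # events plus censored patients at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk   # factor (1 - d_i / r_i)
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def km_by_genotype(genotypes, times, events,
                   skip=("NoCall", "ZeroCopyNumber", "RareAllele")):
    """Group the samples of one probe by literal SNP and fit one curve each.

    The SNP strings (e.g. "A/G") are used only as group labels, so no
    conversion to numerical codes is needed (hypothetical helper).
    """
    groups = defaultdict(lambda: ([], []))
    for genotype, t, e in zip(genotypes, times, events):
        if genotype in skip:
            continue
        groups[genotype][0].append(t)
        groups[genotype][1].append(e)
    return {g: kaplan_meier(ts, es) for g, (ts, es) in groups.items()}


if __name__ == "__main__":
    # Toy probe with five samples (made-up values, not from the paper).
    curves = km_by_genotype(
        ["A/A", "A/G", "A/A", "NoCall", "A/G"],
        [5, 3, 8, 2, 6],
        [1, 1, 0, 1, 0],
    )
    for genotype, curve in curves.items():
        print(genotype, curve)
```

Because each curve stays keyed by its SNP string, the labels remain readable when the curves are plotted, which is the property the comparison between Figure 4 and Figure 5 highlights.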