
Mining the Dynamic Genome: A Method for Identifying Multiple Disease Signatures Using Quantitative RNA Expression Analysis of a Single Blood Sample
Abstract
Background: Blood has advantages over tissue samples as a diagnostic tool, and blood mRNA transcriptomics is an exciting research field. Realizing the full potential of blood transcriptomic investigations requires improved methods for gene expression measurement and data interpretation that can detect biological signatures within the “noisy” variability of whole blood.
Methods: We demonstrate collection tube bias compensation during the process of identifying a liver cancer-specific gene signature. The candidate probe set list for liver cancer was filtered based on previous repeatability performance obtained from technical replicates. We built a prediction model using differential pairs to reduce the impact of confounding factors. We compared prediction performance on an independent test set against that of an alternative model derived by Weka. The method was then applied to an independent set of 157 blood samples collected in PAXgene tubes.
Results: The model discriminated liver cancer equally well in both EDTA and PAXgene collected samples, whereas the Weka-derived model (using default settings) was not able to compensate for collection tube bias. Cross-validation results show that our procedure predicted membership of each sample within the disease groups and healthy controls.
Conclusion: Our versatile method for blood transcriptomic investigation overcomes several limitations hampering research in blood-based gene tests.
1. Introduction
Somatic Versus Dynamic Genome
The human genome can be explored in two different dimensions: the somatic genome and the dynamic genome. The somatic genome is the heritable DNA structure of an organism, with mutational heterogeneity that can be either the cause or effect of disease.
To investigate heritable factors and somatic mutations in disease, researchers have explored the somatic genome using methods such as DNA sequencing technologies, single nucleotide polymorphism arrays, and genome-wide association studies. The results have been biologically informative and have produced a few clear medical and clinical successes, such as BRCA and HER2 testing in breast cancer. In general, however, results have been disappointing [1]. A fundamental problem with cancer DNA genome studies is that the genetic mutations discovered for any given cancer type by modern sensitive analytical methods number in the hundreds, and determining the true somatic mutation(s) driving cancer progression is difficult in the clinical setting for any individual patient [2]. Furthermore, cancer therapies targeted at a single somatic mutation have proven to have limited effect over time, because of resistance caused by cellular somatic evolution [3]. Because human cancer is usually polyploid, containing multiple subpopulations of aneuploid tumor cells, aneuploidy is also a hallmark of cancer cells in general [4]. Ploidy sequencing has proved beneficial in revealing the somatic evolution of tumor cells [5]. However, the clinical application of DNA ploidy assessment to cancer detection is very limited, because whether aneuploidy is a cause of cancer remains controversial [6]. The other approach, exploring the dynamic genome, is one that we believe to be more powerful for clinical applications. This approach investigates not the DNA of an organism, but the transcriptional activity of an organism’s genes. The result of this activity is the transcriptome: the complete set of RNA transcripts present in a cell or tissue at any one time.
Although the DNA of a particular cell or tissue, the genome, is uniform throughout the organism and, except for infrequent random mutations, essentially unchanging, its transcriptome may vary according to the current physiological status of the cell, tissue or organism. Because mRNA profiles change in response to the cellular environment, the transcriptome is always changing in response to immune factors, drugs, disease onset and progression, and healing [7]. To date, the dynamic genome has best been interrogated using microarray studies. Microarray chips can provide a snapshot of an organism’s gene expression activity at a given time. Compared to RNA sequencing, microarray analysis is more established and more cost-effective for analyzing the expression of defined genes by high-throughput methods. Furthermore, microarray data are not as complex as RNA sequencing data, which makes them easier to analyze and to apply widely in various fields. However, traditional, tissue-based microarray studies have a number of disadvantages. For difficult-to-access organs such as lung, breast, prostate, heart, and brain, invasive biopsies can be obtained only in very late-stage disease, at transplant, or after death. For these reasons, tissue-based microarray analysis is less useful for research in early-stage disease, or in the clinic. One additional limiting factor in any tissue-based technology is the problem of heterogeneity. Diseased cells are not necessarily homogeneously distributed throughout tissue, and in cancer, malignant cells can differ from each other in their mutations [8]. Thus, analysis based on solid biopsy needs to take these factors into account by taking multiple invasive samples. However, this requirement increases the cost of the test and its inconvenience to patients. To avoid this problem, so-called “liquid biopsies” attempt to detect circulating tumor cells from a blood sample.
While the presence of circulating tumor cells is a strong prognostic factor for overall survival in certain cancer patients, the clinical significance of circulating tumor cells in most patients is still unknown [9]. Furthermore, because circulating tumor cells are very few in number in early-stage cancer, detecting the few cells that are present requires extreme analytical sensitivity and comes at the cost of many false positives, making testing unreliable. By contrast, white blood cells in peripheral blood provide a near-ideal diagnostic sample. Blood sampling is a long-established and well-accepted procedure for disease diagnosis and monitoring. Whole blood is easy to access, and patients and physicians are accustomed to blood sampling. White blood cells are much more abundant than circulating tumor cells, which eliminates the challenge of analytical sensitivity. Furthermore, because the sample is liquid, the distribution of cells is homogeneous. This reduces or eliminates the need for multiple samples. Another advantage of using whole blood is that the immune cells in blood are biologically affected by disease located elsewhere in the body, regardless of the tissue affected [10,11]. Thus the requirement for direct biopsy is reduced and even potentially eliminated. For these reasons, blood-based transcriptomics has significant advantages over tissue-based biopsy technology. We propose this concept as the Sentinel Principle®, which employs the circulating blood cells as “sentinels” for detecting and responding to micro-environmental changes in the body. Blood-based transcriptomics may also come to play an important role in detecting disease at an early stage in which clinically apparent pathological variants have not yet emerged. Blood cells act as transporter cells and as mediators of the immune response and are involved in the pathogenesis of many diseases [12].
Thus, when physiological or pathological insults occur anywhere in the organism, the gene expression profile of the peripheral blood cells will change in order to carry and transfer information that engages the immune system and maintains physiological homeostasis [13]. Our previous research has shown that peripheral blood cells respond differently to various pathological changes, and thus analysis of differential gene expression can distinguish between and among diseases [14]. Furthermore, since the interaction between the immune system and disease usually precedes the occurrence of clinically pathological variation, the study of blood cell profiles might make it possible to detect diseases at an early stage. For instance, according to the cancer immunoediting theory, cancer has a long equilibrium phase in which tumor cells survive immune elimination and maintain a state of functional tumor dormancy [15]. During the equilibrium phase, there is no clinically detectable pathological variation, but the interaction of the immune system against cancer cells occurs covertly in the body. Our hypothesis is that early-stage cancer and other diseases can be detected by analysis of variation in the gene expression profiles of peripheral blood. Although blood-based transcriptomics research has great potential in clinical application, it has its own set of limitations. To measure the expression levels of messenger RNA in whole blood optimally, several challenges need to be overcome. First, whole blood introduces the white blood cell population distribution as a sampling factor. Second, the dynamic composition of blood cells in response to a constantly changing environment means that even copy numbers normalized to total cell count and sample volume may exhibit variability. Third, sampling technology can introduce additional artefacts, such as differences between EDTA and PAXgene™ (PreAnalytiX) collection tubes and protocols.
Moreover, microarrays from different manufacturers each have their own peculiarities and need corrective measures tailored for each. These factors make the analysis of the data more complex and time consuming. A practical solution is required that can identify well-performing gene panels in the face of these challenges. The most common of these limitations, the interference derived from different blood sampling technologies, cannot be ignored in blood-based transcriptomics research. The conventional method for drawing blood uses EDTA collection tubes, which inhibit clotting but do not stabilise intracellular RNA. When EDTA blood collection tubes are used, intracellular RNA needs to be isolated within four hours of collection, as RNA degrades rapidly. Therefore, EDTA collection tubes are not practical for clinical applications when tests involve RNA and the collection sites are far from the laboratory. To overcome this problem, PAXgene tubes contain reagents that stabilise RNA, allowing easy collection, storage and transport of blood samples [16]. However, excessive globin mRNA levels interfere with transcript measurement and increase variability. This difficulty is addressed by the use of specifically designed reagents with different degrees of globin signal suppression. Thus, gene expression profiles derived from EDTA and PAXgene blood collection tubes are not completely consistent between samples drawn from the same patient and processed under different protocols, an inconsistency that may lead to confusing or contradictory results.
2. Experimental Section
In an earlier study, we published the predictive performance of gene panels identified by the method developed by our group, using a data set consisting of 631 blood samples collected in EDTA tubes and discriminating for three diseases with an area under the receiver operating characteristic curve (AUROC) ranging from 89% to 93% [17].
We later expanded the set to include 17 diseases represented by more than 1700 samples, and we were able to obtain similar prediction performance. However, because our samples were collected using EDTA tubes, the results would not be useful for many clinical applications, as discussed above. We have therefore transitioned to PAXgene tube collection and have started to rebuild our disease panels. This offered us an opportunity to evaluate the performance of our method against an established statistical package, Weka [18], for a study in which samples were collected initially using EDTA tubes and then PAXgene tubes. In this two-part study, we first present details of our method and demonstrate its ability to overcome a known and documented bias between samples collected in EDTA and PAXgene tubes. The PAXgene samples were processed using the NuGen reagent kit (NuGen, San Carlos, CA, USA). By extension, this method should be able to overcome other biases, which may not be known in advance. Then, we present the results of this method applied to a new cohort of samples representing ten cancers and diseases collected in PAXgene tubes and processed using the current 3′ IVT PLUS reagent kit (Affymetrix, Santa Clara, CA, USA). We switched from the NuGen kit because the new reagent kit has better repeatability performance in our laboratories.
2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression
For the initial part of the study, we used blood samples that we collected for a liver cancer (hepatocellular cancer) study. The samples comprised blood taken from hepatocellular cancer (HCC) patients, chronic hepatitis B (HpB) patients and healthy controls, all recruited in Malaysia under approved protocols [19]. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki.
Our training set included 26 HCC samples collected in PAXgene tubes and 20 HCC samples collected in EDTA tubes. We also had 28 blood samples taken from patients with chronic hepatitis B (HpB) infection collected in PAXgene tubes, 28 confirmed HCC-negative (control) samples also in PAXgene tubes and seven control samples collected in EDTA tubes. In addition, 830 samples from other studies (“other”) were collected in EDTA tubes. These “other” samples were assumed to be negative for HCC because it is a low-prevalence disease. The test set consisted of independent samples: 25 HCC collected in PAXgene tubes, 15 HCC collected in EDTA tubes, 27 HpB collected in PAXgene tubes, 27 controls collected in PAXgene tubes, 7 controls collected in EDTA tubes and 860 “other” samples collected in EDTA tubes (Table 1).
Table 1. Breakdown of samples for collection tube bias demonstration.
2.2. New Samples and Cross-Validation
After we demonstrated that this method works well in separating the three groups of liver cancer study samples (HCC, HpB, control), we proceeded to apply the same method to a new set of 157 samples representing ten diseases and healthy controls collected and processed in Penang and Shanghai (Table 2).
Table 2. New samples for cross-validation.
2.3. Methods
2.3.1. Blood Collection, RNA Isolation and RNA Quality Control
Peripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytiX, Hombrechtikon, Switzerland). Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples that did not meet the following quality criteria were excluded from microarray analysis: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0.
RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA).
2.3.2. Microarray Hybridization and Probe Set Quality
We used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data were extracted using the MAS5 analysis method, because each microarray needs to be evaluated independently. This technique is directed towards allowing single-sample predictions, which are more practical in a clinical setting. We followed the MAQC list for the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray and ignored probe sets that are not included in the MAQC list for Affymetrix microarrays [21]. We then conducted an experiment to identify probe sets that exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations.
Table 3. Four replicates across seven microarray lots.
These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays.
Table 4. Non-pooled sample replicates.
Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1).
The probe sets must also appear on the validated list published by the MAQC study and be verified as repeatable in our own EDTA and PAXgene replicate experiments. Finally, any outliers are also excluded because of their uncharacteristic expression values. These steps are summarized in Table 5.
Figure 1. Technical Replicate Hybridization. (a,b) Correlation between two replicate hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list.
Table 5. Probe Set Filtering.
2.3.3. Pairs of Genes
Pairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise that was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. Additionally, if a pair had little correlation with the disease under study (AUROC ≈ 0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation.
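
As a minimal sketch of the ratio-pair evaluation described in this section (not the authors' implementation), the following Python code scores a probe set pair by the log ratio of its two expression values and computes the AUROC of that score. The function names, the use of log2, and the toy expression values are assumptions made for illustration; only the 0.7 threshold is taken from the text.

```python
import math


def auroc(scores, labels):
    """AUROC by pairwise comparison (Mann-Whitney); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def pair_ratio_scores(expr_a, expr_b):
    """Per-sample log2 ratio; a factor shared by both probe sets cancels out."""
    return [math.log2(a / b) for a, b in zip(expr_a, expr_b)]


def is_significant_pair(expr_a, expr_b, labels, threshold=0.7):
    """Apply the AUROC >= 0.7 rule from the text to the ratio score."""
    auc = auroc(pair_ratio_scores(expr_a, expr_b), labels)
    return auc >= threshold, auc


# Hypothetical toy data: 5 disease samples (label 1) and 5 controls (label 0).
# A per-sample factor (0.5x to 8x) multiplies both probe sets, mimicking a
# confounder that affects overall signal level.
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
gene_a = [300, 600, 1200, 2400, 4800, 200, 400, 800, 1600, 3200]
gene_b = [200, 400, 800, 1600, 3200, 200, 400, 800, 1600, 3200]

single_gene_auc = auroc(gene_a, labels)
significant, pair_auc = is_significant_pair(gene_a, gene_b, labels)
print(f"gene A alone: {single_gene_auc:.2f}, ratio A/B: {pair_auc:.2f}")
```

In this toy data the single probe set reaches an AUROC of only 0.6, whereas the ratio, in which the shared factor cancels, separates the two groups completely. Swapping numerator and denominator would give 1 minus the AUROC, so a real search would need a convention for the direction of each pair.
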
The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by the electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal.
Figure 2. An EKG pair is used as a “suppressor” with the “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals, which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present.
2.3.4. One-Against-All (Orthogonality)
One of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also all kinds of other confounding factors. However, this breadth of information is also the key to the solution of the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each, and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial. This solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected only by a single condition. By setting up an analysis to look for only those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case. One beneficial side effect of this solution is that sample acquisition becomes much simpler; it is not necessary to find samples from patients with two or more diseases or conditions of interest.
The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and impossible to acquire. We assign the “other” samples to the “not-this-disease” group and take advantage of the relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles that makes each prediction panel more specific to the target disease condition.
2.3.5. Group Balance
A side effect of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined to either have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (<1%). The problem is that there are many more of these patients with unconfirmed diagnoses than there are colonoscopy-verified cancer-free patients. A model could therefore misclassify every confirmed colorectal cancer-negative patient as a false positive and still achieve a very high specificity by correctly predicting the samples from other studies as colorectal cancer-free.
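
The size of this effect can be illustrated with a short calculation. The sketch below is an illustration only: it borrows the HCC training set counts from Section 2.1 (28 PAXgene plus 7 EDTA confirmed controls, and 830 assumed-negative “other” samples) and assumes the worst case in which every confirmed control is misclassified; the function and variable names are hypothetical.

```python
def specificity(true_negatives, false_positives):
    """Fraction of disease-negative samples that are predicted negative."""
    return true_negatives / (true_negatives + false_positives)


confirmed_controls = 28 + 7   # verified disease-free samples (PAXgene + EDTA)
assumed_negatives = 830       # "other" samples, assumed negative by low prevalence

# Worst case: every confirmed control is a false positive,
# while every "other" sample is correctly predicted negative.
pooled = specificity(true_negatives=assumed_negatives,
                     false_positives=confirmed_controls)
confirmed_only = specificity(true_negatives=0,
                             false_positives=confirmed_controls)

print(f"pooled specificity:         {pooled:.1%}")
print(f"confirmed-control subgroup: {confirmed_only:.1%}")
```

Under these assumptions the pooled specificity is still about 96%, even though specificity within the verified control subgroup is zero, which is why the unweighted figure can be misleading.
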
To account for such bias, statisticians weight the relative contribution of high-quality data (verified pathology) more heavily than the larger amount of low-quality data (assumed colorectal cancer-negative pathology). In our approach, this accounting for bias is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies (Table 6).
Table 6. Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data; the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups.
2.3.6. Search Speed Optimization
However, the solution proposed above has the consequence of increasing both the size of the data set and the number of potential combinations to be evaluated to the point that it becomes too time-consuming to conduct a systematic search. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search. The efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius.
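
The π illustration introduced here can be sketched in a few lines of Python. This is an interpretation rather than the code behind Figure 3: it assumes that the systematic evaluation marches row by row over a regular grid covering the square from −1 to 1 on each axis, and that both estimates are read off after the same number of evaluated points. The function names, grid size and random seed are arbitrary choices made for the example.

```python
import math
import random


def pi_systematic(points_per_side, n_evaluated=None):
    """Estimate pi from the first n_evaluated grid points, visited row by row."""
    total = points_per_side ** 2
    if n_evaluated is None:
        n_evaluated = total
    if not 1 <= n_evaluated <= total:
        raise ValueError("n_evaluated must be between 1 and the grid size")
    step = 2.0 / points_per_side
    inside = 0
    for k in range(n_evaluated):
        row, col = divmod(k, points_per_side)
        x = -1.0 + (col + 0.5) * step   # cell-centre coordinates
        y = -1.0 + (row + 0.5) * step
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_evaluated


def pi_monte_carlo(n_points, seed=0):
    """Estimate pi from n_points drawn uniformly at random in the square."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n_points


# Same budget of 10,000 evaluations for both strategies.
budget = 10_000
partial_grid = pi_systematic(points_per_side=1000, n_evaluated=budget)
monte_carlo = pi_monte_carlo(budget)
print(f"systematic sweep after {budget} points: {partial_grid:.3f}")
print(f"Monte-Carlo after {budget} points:      {monte_carlo:.3f}")
print(f"reference value:                        {math.pi:.3f}")
```

With this setup the partial sweep has covered only a thin strip along one edge of the square, so its running estimate is far from π until most of the grid has been visited, whereas the random sample covers the whole square from the start. The same reasoning motivates sampling gene pair combinations at random instead of enumerating them in order.
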
For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units with an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the number of points inside the square is an approximation of π/4. A random search would select points at random rather than systematically march along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation.
Figure 3. Estimating the value of π numerically. (a) Convergence speed using a systematic search (b) and a Monte-Carlo random search (c).
2.3.7. New Samples and Cross-Validation
The method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partitioning the data into traditional training and test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times.
3. Results and Discussion
We applied these procedures to a set of samples with the complication that the samples had been collected using two different technologies: EDTA and PAXgene tubes. Although the raw expression levels exhibited differences between these two sets, using our strategy it was possible to find combinations of pairs that exhibited improved stability and were predictive of the underlying disease. The search identified a combination of 12 probe sets in six pairs from the filtered list of 586 candidates.
This combination scored the samples with fairly good overlap between the PAXgene and EDTA samples in each disease category in both the training and independent test sets (Figure 4). For comparison, we also plotted the predictions using a standard open-source statistical package, Weka. This analysis was conducted using the SimpleLogistic classifier function with default parameters, using data from all 9163 available probe sets without any filtering or weighting. The SimpleLogistic classifier has a built-in feature selection capability that selected the final set of 17 probe sets from the entire data (WekaRaw17Gene). This comparison using an unmodified version of Weka under default settings is not intended to critique Weka, but rather to highlight the effect of the additional steps described in this manuscript.
Figure 4. Prediction scores using the method described in this paper (LogReg_6Pairs) and Weka prediction using all data without any preprocessing (WekaRaw17Gene).
The test set results from the WekaRaw17Gene panel trained without sample weighting show that the liver cancer samples collected in EDTA tubes (median = −0.50) have dropped in prediction scores relative to the PAXgene cancer samples (median = +0.28) and are now in the same range as chronic hepatitis B (median = −0.74) and control patients (median = −0.94). By contrast, the predictions using our method align both liver cancer groups (EDTA median = +0.80, PAXgene median = +0.77), which are well separated from the HpB samples (median = −0.38) and control samples (EDTA median = −1.34, PAXgene median = −0.57). These results illustrate that the novel statistical method described above worked well in separating the three groups of liver cancer study samples (HCC, HpB, control), even under different conditions, such as chip lot or blood collection tubes.
Additionally, the loss of test set prediction accuracy resulting from the absence of the PAXgene HpB and other subgroups of samples can be overcome by using the information from other available PAXgene subgroups, as shown in the results presented in Figure 4. We then applied the same procedures to a group of blood samples from ten different diseases and controls collected in PAXgene tubes (Table 2). The results of 1000 iterations of 2-fold cross-validation for each disease’s gene pair panel are summarised in Table 7. When this method was used, all ten disease gene pair panels showed consistently high prediction performance: high sensitivity (mean 89%), high specificity for the healthy controls (mean 98%), high specificity for the other nine diseases (93%), and high AUROC (mean 96%). The prediction results for each individual subject for the risk of colorectal cancer are charted in Figure 5, which shows good discrimination between colorectal cancer and controls and the other nine diseases. All but one colorectal cancer subject achieved a score above the threshold value of 0, while only four of all the other subjects returned a false positive prediction. The other gene panels achieved similar results. Figure 6 is an example of the prediction of the risk of ten different diseases for a single liver cancer patient using the gene pair panels obtained using the method described in this paper. This liver cancer patient can be seen to be at high risk for liver cancer and at no higher than population average risk for the other nine diseases.
Table 7. Performance of gene pair panels of each disease using our statistical method (1000 iterations of 2-fold cross-validation).
Figure 5. Prediction of risk for colorectal cancer for individual subjects using the colorectal cancer gene pair panel identified by the method described in this paper.
Figure 6. Prediction of the risk of 10 different diseases for an individual liver cancer patient, using the gene pair panels obtained using the method described in this paper. This patient was known to have liver cancer and had no indication of any of the other diseases being evaluated.
4. Conclusions
We have presented a procedure that identifies a set of probe sets which demonstrate reliable expression levels for target genes. Using these, we evaluated ratio pairs to achieve self-normalization. By combining discriminative pairs and suppressor pairs, we found useful panels of gene pairs that are able to predict disease even across varying conditions such as chip lot or sample collection tube differences. For comparison, we also processed the data with a widely used machine learning package, Weka, using the SimpleLogistic model with automatic feature selection. Weka achieved the best overall discrimination with a panel of 17 probe sets, but was unable to suppress the bias introduced by the use of two different collection tubes. That is, whereas our method managed to align the liver cancer samples so that the majority (~75%) of both EDTA and PAXgene samples are predicted as true positive in the test set, the Weka predictions (based on data using unfiltered gene lists and without sample weighting) classified nearly half the liver cancer samples collected in EDTA tubes as false negatives and nearly all of the liver cancer samples collected in PAXgene tubes as true positives (Figure 4). Our method was then applied to a new cohort of samples collected in PAXgene tubes representing ten different diseases. The panels predicted with consistently high performance under repeated 2-fold cross-validation. We expect, based on our previous experience with EDTA data, that the prediction performance will hold when the sample number is increased. With increased sample numbers, it may even be possible to refine the gene panels and improve prediction performance.
With this method, the risk of a single patient having any of the ten diseases studied can be obtained simultaneously using a single blood sample. Figure 7 is a schematic representation of the process by which we identify panels of genes that are predictive of disease conditions. These panels can then be applied to the data from a single individual to make predictions of risk for these conditions.

Figure 7. Schematic representation of multiple disease prediction. The gene expression data from a reference population representing several disease conditions are filtered according to a Quality Assurance system based on repeatability data. These data are then analysed to derive a predictive model for each disease condition. These models can then be applied to the data from a new sample to make a risk prediction for that individual.

We present the methodology described in this paper as a minimum set of procedures optimized for noisy data, in which the performance of each component complements that of the others. For example, the use of seemingly non-informative genes is suggested by the fundamental concept of suppressor variables, which dates back to 1941 and is similar to the differential amplifier of 1934 [24] or even the Wheatstone bridge of 1843 [25]. This concept has been successfully applied in other areas of science and technology but appears to have been generally neglected by the community of scientists involved in genomic data analysis. We are convinced that a return to a more balanced, holistic approach to data analysis may help in extracting useful information from the mass of data that rapidly advancing modern technology can produce.

The circulating peripheral blood system is involved in the regulation, coordination, metabolism and immune maintenance of all cells, tissues and organs. Functions of blood include transporting nutrients, oxygen and biomolecules, and removing cellular waste.
Blood is further involved in immune surveillance throughout the body, and in the delivery of immune factors and mediators to sites of disease, infection and injury. Thus, the circulation and physiologically interactive nature of blood ensure that this system encounters, transmits, and is affected by a wide range of biological signals. Over the past decade, we have investigated the blood genetic signatures of a wide variety of diseases and conditions affecting numerous organs and functions, including psychiatric disorders, osteoarthritis, cardiovascular disease, gastrointestinal diseases, and cancer. It has been found that each disease has its characteristic expression spectrum of genetic signatures in peripheral blood, which makes it possible to detect disease anywhere in the body by assessing subtle changes in blood RNA. Based on these studies, we propose the Sentinel Principle®, which views the circulating blood cells as “sentinels” for detecting and responding to micro-environmental changes in the body. Accordingly, the current state of health or disease of an organism is conveyed in the blood through interactions between circulating blood cells and the body’s cells, tissues and organs. Since blood samples can be readily obtained non-invasively, the genetic signature derived from blood RNA provides an alternative to tissue biopsy for determining the diagnosis and prognosis of many different diseases. Overcoming the problems of blood-based transcriptomics discussed above will further extend the application of the Sentinel Principle not only to the diagnosis of multiple diseases from one blood sample, but also to other fields of personalized medicine such as active surveillance, prognosis, and drug response.

Title	Mining the Dynamic Genome: A Method for Identifying Multiple Disease Signatures Using Quantitative RNA Expression Analysis of a Single Blood Sample
Abstract	Background: Blood has advantages over tissue samples as a diagnostic tool, and blood mRNA transcriptomics is an exciting research field. To realize the full potential of blood transcriptomic investigations requires improved methods for gene expression measurement and data interpretation able to detect biological signatures within the “noisy” variability of whole blood. Methods: We demonstrate collection tube bias compensation during the process of identifying a liver cancer-specific gene signature. The candidate probe set list of liver cancer was filtered, based on previous repeatability performance obtained from technical replicates. We built a prediction model using differential pairs to reduce the impact of confounding factors. We compared prediction performance on an independent test set against prediction on an alternative model derived by Weka. The method was applied to an independent set of 157 blood samples collected in PAXgene tubes. Results: The model discriminated liver cancer equally well in both EDTA and PAXgene collected samples, whereas the Weka-derived model (using default settings) was not able to compensate for collection tube bias. Cross-validation results show our procedure predicted membership of each sample within the disease groups and healthy controls. Conclusion: Our versatile method for blood transcriptomic investigation overcomes several limitations hampering research in blood-based gene tests.
Body	1. Introduction Somatic Versus Dynamic Genome The human genome can be explored in two different dimensions: the somatic genome and the dynamic genome. The somatic genome is the heritable DNA structure of an organism, with mutational heterogeneity that can be either the cause or effect of disease. To investigate hereditable factors and somatic mutations in disease, researchers have explored the somatic genome, using such methods as DNA sequencing technologies, single nucleotide polymorphism arrays, and genome-wide association studies. The results have been biologically informative and have produced a few clear medical and clinical successes, such as BRCA and HER2 testing in breast cancer. However, in general, results have been disappointing [1]. A fundamental problem with cancer DNA genome studies is that the genetic mutations for any given cancer type discovered by modern sensitive analytical methods number in the hundreds, and the determination of the true somatic mutation(s) driving cancer progression is difficult in the clinical setting for any individual patient [2]. Furthermore, cancer therapies targeted at a single somatic mutation have proven to have limited effect over time, because of resistance caused by cellular somatic evolution [3]. As human cancer is usually polypoid, containing subpopulations of multiple aneuploid cancer tumor cells, aneuploidy is also the hallmark of cancer cells in general [4]. Ploidy sequencing has been proved to be beneficial in revealing the somatic evolution of cancer tumor cells [5]. However, the clinic application of DNA ploidy assessment is very limited for cancer detection because it is still controversial that aneuploidy is the cause of cancer [6]. The other approach—exploring the dynamic genome—is one that we believe to be more powerful for clinical applications. This approach investigates not the DNA of an organism, but the transcriptional activity of an organism’s genes. 
The result of this activity is the transcriptome: the complete set of RNA transcripts present in a cell or tissue at any one time. Although the DNA of a particular cell or tissue, the genome, is uniform throughout the organism and except for infrequent random mutations, essentially unchanging, its transcriptome may vary according to the current physiological status of the cell, tissue or organism. Since mRNA profiles will alter in response to the cellular environment, the transcriptome will always be changing in response to immune factors, drugs, disease onset and progression, and healing [7]. To date, the dynamic genome has best been interrogated using microarray studies. Microarray chips can provide a snapshot of an organism’s gene expression activity at a given time. Compared to RNA sequencing, microarray is more established and cost-effective in analyzing the expression of defined genes by high throughput methods. Furthermore, microarray data is not as complex as that of RNA sequencing, which make it easier to analyze and apply widely to various fields. However, traditional, tissue-based microarray studies have a number of disadvantages. Invasive biopsies can be obtained only in very late stage disease, at transplant or after death in the case of difficult-to-access organs such as lung, breast, prostate, heart, and brain. For these reasons, tissue-based microarray is less useful for research in early-stage disease, or in the clinic. One additional limiting factor in any tissue-based technology is the problem of heterogeneity. Diseased cells are not necessarily homogeneously distributed throughout tissue, and in cancer, malignant cells can differ from each other in their mutations [8]. Thus, analysis based on solid biopsy needs to take these factors into account by taking multiple invasive samples. However, this requirement increases the cost of the test and the test’s inconvenience to patients. 
To avoid this problem, so called “liquid biopsies” attempt to detect circulating tumor cells from a blood sample. While the presence of circulating tumor cells is a strong prognostic factor for overall survival in certain cancer patients, the clinical significance of circulating tumor cells in most patients is still unknown [9]. Furthermore since circulating tumor cells are very few in number in early stage cancer, analysis requires extreme analytical sensitivity to detect what very few cells are present and at a cost of many false positives, making testing unreliable. By contrast, white blood cells in peripheral blood provide a near ideal diagnostic sample. Blood sampling is a long-established and well-accepted procedure for disease diagnosis and monitoring. Whole blood is easy to access, and patients and physicians are accustomed to blood sampling. White blood cells are much more abundant than circulating tumor cells, which eliminates the challenge of analytical sensitivity. Furthermore, because the sample is liquid, the distribution of cells is homogeneous. This reduces or eliminates the need for multiple samples. Another advantage of using whole blood is that the immune cells in blood are biologically affected by disease located elsewhere in the body regardless of the tissue affected [10,11]. Thus the requirement for direct biopsy is reduced and even potentially eliminated. For these reasons blood-based transcriptomics has significant advantages over tissue-based biopsy technology. We propose this concept as the Sentinel Principle®, which employs the circulating blood cells as “sentinels” for detecting and responding to micro-environmental changes in the body. Blood-based transcriptomics may also come to play an important role in detecting disease at an early stage in which clinically apparent pathological variants have not yet emerged. 
Blood cells act as transporter cells and as mediators of the immune response and are involved in the pathogenesis of many diseases [12]. Thus when physiological or pathological insults occur anywhere in the organism, the gene expression profile of the peripheral blood cells will change in order to carry and to transfer information to engage the immune system and maintain physiological homeostasis [13]. Our previous research has shown that peripheral blood cells respond differently to various pathological changes and thus analysis of differential gene expression can distinguish between and among diseases [14]. Furthermore, since the interaction between the immune system and disease usually precedes the occurrence of clinically pathological variation, the study of blood cell profiles might make it possible to detect diseases at an early stage. For instance, according to the cancer immunoediting theory, cancer has a long equilibrium phase in which tumor cells survive immune elimination and maintain a state of functional tumor dormancy [15]. During the equilibrium phase, there is no clinically detectable pathological variation, but the interaction of the immune system against cancer cells occurs covertly in the body. Our hypothesis is that early stage cancer and other diseases can be detected by analysis of variation in the gene expression profiles of peripheral blood. Although blood-based transcriptomics research has great potential in clinical application, it has its own set of limitations. In order optimally to measure the expression levels of messenger RNA in whole blood, several challenges need to be overcome. First, whole blood introduces into sampling the factor of white blood cell population distribution. Second, the dynamic composition of blood cells in response to a constantly changing environment means that even copy numbers normalized to total cell count and sample volume may exhibit variability. 
Third, sampling technology can introduce additional artefacts, such as differences between EDTA and PAXgene™ (PreAnalytiX) collection tubes and protocols. Moreover, microarrays from different manufacturers each have their own peculiarities and need corrective measures tailored for each. These factors make the analysis of the data more complex and time consuming. A practical solution is required that can identify well-performing gene panels in the face of these challenges. The most common of these limitations, the interference derived from different blood sampling technologies, cannot be ignored in blood-based transcriptomics research. The conventional method for drawing blood uses EDTA collection tubes, which inhibit clotting but do not stabilise intracellular RNA. When EDTA blood collection tubes are used, intracellular RNA needs to be isolated within four hours of collection, as RNA degrades rapidly. Therefore, EDTA collection tubes are not practical for clinical applications when tests involve RNA and the collection sites are far from the laboratory. To overcome this problem, PAXgene tubes contain reagents that stabilise RNA, allowing easy blood collection, storage and transport of blood samples [16]. However, excessive globin mRNA levels interfere with transcript measurement and increase variability. This difficulty is addressed by the use of specifically designed reagents with different degree of globin signal suppression. Thus, gene expression profiles derived from EDTA and PAXgene blood collection tubes are not completely consistent between samples drawn from the same patient and processed under different protocols, an inconsistency that may lead to confusing or contradictory results. 2. 
Experimental Section In an earlier study we published the predictive performance of gene panels identified by the method developed by our group using a data set consisting of 631 blood samples collected in EDTA tubes and discriminating for three diseases with an area under the receiving operator characteristic curve (AUROC) ranging from 89% to 93% [17]. We later expanded the set to include 17 diseases represented by more than 1700 samples, and we were able to obtain similar prediction performance. However, because our samples were collected using EDTA tubes, the results would not be useful for many clinical applications, as discussed above. We have therefore transitioned to PAXgene tube collection and have started to rebuild our disease panels. This offered us an opportunity to evaluate the performance of our method against an established statistical package, Weka [18], for a study in which samples were collected initially using EDTA tubes and then PAXgene tubes. In this two-part study, we first present details of our method and demonstrate its ability to overcome a known and documented bias between samples collected in EDTA and PAXgene tubes. The PAXgene samples were processed using Nugen reagent kit (NuGen, San Carlos, CA, USA). By extension, this method should be able to overcome other biases, which may not be known in advance. Then, we present the results of this method applied to a new cohort of samples representing ten cancers and diseases collected in PAXgene tubes and processed using the current 3′ IVT PLUS reagent kit (Affymetrix, Santa Clara, CA, USA). We switched from the Nugen kit because the new reagent kit has better repeatability performance in our laboratories. 2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression For the initial part of the study, we used blood samples that we collected for a liver cancer (hepatocellular cancer) study. 
The samples comprised blood taken from hepatocellular cancer (HCC) patients, chronic Hepatitis B (HpB) patients and healthy controls, all recruited in Malaysia under approved protocols [19]. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki. Our training set consisted of HCC patients: 26 HCC collected in PAXgene tubes and 20 HCC collected in EDTA tubes. We also had 28 blood samples taken from patients with chronic hepatitis B (HpB) infection collected in PAXgene tubes, 28 confirmed HCC-negative (control) samples also in PAXgene tubes and seven control samples collected in EDTA tubes. In addition, 830 samples from other studies (“other”) were collected in EDTA tubes. These “other” samples were assumed to be negative for HCC because it is a low-prevalence disease. The test set consisted of independent samples: 25 HCC collected in PAXgene tubes, 15 HCC collected in EDTA tubes, 27 HpB collected in PAXgene tubes, 27 controls collected in PAXgene tubes, 7 controls collected in EDTA tubes and 860 “other” samples collected in EDTA tubes (Table 1). microarrays-04-00671-t001_Table 1 Table 1 Breakdown of samples for collection tube bias demonstration. 2.2. New Samples and Cross-Validation After we demonstrated that this method works well in separating the three groups of liver cancer study samples (HCC, HpB, control), we proceeded to apply the same method to a new set of 157 samples representing ten diseases and healthy controls collected and processed in Penang and Shanghai (Table 2). microarrays-04-00671-t002_Table 2 Table 2 New samples for cross-validation. 2.3. Methods 2.3.1. Blood Collection, RNA Isolation and RNA Quality Control Peripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytix, Hombrechtikon, Switzerland). 
Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked by using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples were excluded from microarray analysis that did not meet the following quality criteria: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0. RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA). 2.3.2. Microarray Hybridization and Probe Set Quality We used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data was extracted by using the MAS5 analysis method, as each microarray needs to be evaluated independently. This technique is directed towards allowing single sample predictions, which are more practical in a clinical setting. We followed the MAQC list for the Affymetrix GenChip Human Genome U133 Plus 2.0 microarray and ignored probe sets that are not included in the list of MAQC for Affymetrix microarrays [21]. We then conducted an experiment to identify probe sets which exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations. microarrays-04-00671-t003_Table 3 Table 3 Four replicates across seven microarray lots. These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. microarrays-04-00671-t004_Table 4 Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. 
Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1). The probe sets must also belong on the validated list published by the MAQC study as well as verified to be repeatable on our own EDTA and PAXgene replicate experiments. Finally, any outliers are also excluded because of their uncharacteristic expression value. These steps are summarized in Table 5. Figure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list. microarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering. 2.3.3. Pairs of Genes Pairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise which was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. Additionally, if a pair had little correlation with the disease under study (AUROC~0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. 
Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation. The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by the electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal. Figure 2 EKG pair is “suppressor” used with “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present. 2.3.4. One-Against-All (Orthogonality) One of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also all kinds of other confounding factors. However, this breadth of information is also the key to the solution of the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each, and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial. This solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected only by a single condition. By setting up an analysis to look for only those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case. 
One beneficial side effect of this solution is that sample acquisition becomes much simpler; it is not necessary to find samples from patients with two or more diseases or conditions of interest. The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and impossible to acquire. We assign the “other” samples to the “not-this-disease” group and take advantage of the relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles that makes each prediction panel more specific to the target disease condition. 2.3.5. Group Balance A side effect of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined to either have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (<1%). The problem is that there are many more of these patients with unconfirmed diagnoses as compared with the colonoscopy-verified cancer-free patients. It might be possible to incorrectly predict all confirmed colorectal cancer negative patients as false positive and still achieve a very high specificity by correctly predicting the samples from other studies as colorectal cancer-free. 
To account for such bias, statisticians weigh more heavily the relative contribution of high-quality data (verified pathology) relative to the larger amount of low-quality data (assumed colorectal cancer-negative pathology). In our approach, this accounting for bias is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies. (Table 6). microarrays-04-00671-t006_Table 6 Table 6 Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data, the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups 2.3.6. Search Speed Optimization However, the solution proposed above has the consequence of increasing both the size of the data set and the number of potential combinations to be evaluated to the point that it becomes too time-consuming to conduct a systematic search. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search. The efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius. 
For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units with an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the number of points inside the square is an approximation of π/4. A random search would select points at random rather than systematically march along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation. 2.3.7. New Samples and Cross-Validation The method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partition into a traditional training/test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times. Figure 3 Estimating the value of π numerically. (a) Convergence speed using a systematic search (b) and a Monte-Carlo random search (c). 3. Results and Discussion We applied these procedures to a set of samples that had the complication that the samples had been collected using two different technologies: EDTA and PAXgene tubes. Although the raw expression levels exhibited differences between these two sets, using our strategy it was possible to find combinations of pairs that exhibited improved stability and were predictive of the underlying disease. The search identified a combination of 12 probe sets in six pairs from the filtered list of 586 candidates. 
This combination scored the samples with fairly good overlap between the PAXgene and EDTA samples in each disease category in both the training and independent test sets (Figure 4). For comparison we also plotted the predictions using a standard open-source statistical package, Weka. This analysis was conducted using the SimpleLogistic classifier function with default parameters employing all available 9163 probe set data without any filtering or weighting. The SimpleLogistic classifier has a built-in feature selection capability that selected the final set of 17 probe sets from the entire data (WekaRaw17Gene). This comparison using an unmodified version of Weka under default settings is not intended to critique Weka, but rather to highlight the effect of the additional steps described in this manuscript. Figure 4 Prediction scores using the method described in this paper (LogReg_6Pairs) Weka prediction using all data without any preprocessing (Weka Raw17Gene). The test set results from the WekaRaw17Gene panel trained without sample weighting show that the liver cancer samples collected in EDTA tubes (median = −0.50) have dropped in prediction scores relative to the PAXgene cancer samples (median = +0.28) and are now in the same range as chronic hepatitis B (median = −0.74) and control patients (median = −0.94). By contrast, the predictions using our method aligns both liver cancer groups (EDTA median = +0.80, PAXgene median = +0.77) which are well separated from the HpB samples (median = −0.38) and control samples (EDTA median = −1.34, PAXgene median = −0.57). These results illustrate that the novel statistical method described above worked well in separating the three groups of liver cancer study samples (HCC, HpB, control), even under different conditions, such as chip lot or blood collection tubes. 
Additionally, the loss of test set prediction accuracy resulting from the absence of the PAXgene HpB and other subgroups of samples can be overcome by using the information from other available PAXgene subgroups, as shown in the results presented in Figure 4. We then applied the same procedures to a group of blood samples from ten different diseases and controls collected in PAXgene tubes (Table 2). The results of 1000 iterations of 2-fold cross-validation for each disease’s gene pair panel are summarised in Table 7. With this method, all ten disease gene pair panels showed consistently high prediction performance: high sensitivity (mean 89%), high specificity for the healthy controls (mean 98%), high specificity for the other nine diseases (93%), and high AUROC (mean 96%). The prediction results for each individual subject for the risk of colorectal cancer are charted in Figure 5, which shows good discrimination between colorectal cancer and both the controls and the other nine diseases. All but one colorectal cancer subject achieved a score above the threshold value of 0, while only four of all the other subjects returned a false positive prediction. The other gene panels achieved similar results. Figure 6 is an example of the prediction of the risk of ten different diseases for a single liver cancer patient using the gene pair panels obtained with the method described in this paper. This liver cancer patient can be seen to be at high risk for liver cancer and at no higher than population average risk for the other nine diseases.
Table 7 Performance of gene pair panels of each disease using our statistical method (1000 iterations of 2-fold cross validation).
Figure 5 Prediction of risk for colorectal cancer for individual subjects using the colorectal cancer gene pair panel identified by the method described in this paper.
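The repeated 2-fold cross-validation summarised in Table 7 can be illustrated with a minimal Python sketch. It is not the code used in this study: the single-score midpoint classifier (`fit_threshold`), the stratified splitting, and all function and parameter names are assumptions chosen only to keep the example self-contained.

```python
import random


def auroc(pos, neg):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_threshold(scores, labels):
    """Toy stand-in for model fitting (assumption): midpoint of the class means."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def repeated_two_fold(scores, labels, iterations=1000, seed=0):
    """Return mean sensitivity, specificity and AUROC over repeated 2-fold CV."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    if len(pos_idx) < 2 or len(neg_idx) < 2:
        raise ValueError("each class needs at least two samples")
    rng = random.Random(seed)
    sens, spec, aucs = [], [], []
    for _ in range(iterations):
        # Reshuffle, then split each class in half so both folds hold both classes.
        rng.shuffle(pos_idx)
        rng.shuffle(neg_idx)
        hp, hn = len(pos_idx) // 2, len(neg_idx) // 2
        fold_a = pos_idx[:hp] + neg_idx[:hn]
        fold_b = pos_idx[hp:] + neg_idx[hn:]
        # Each fold serves once as the training half and once as the test half.
        for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
            thr = fit_threshold([scores[i] for i in train],
                                [labels[i] for i in train])
            test_pos = [scores[i] for i in test if labels[i] == 1]
            test_neg = [scores[i] for i in test if labels[i] == 0]
            sens.append(sum(s > thr for s in test_pos) / len(test_pos))
            spec.append(sum(s <= thr for s in test_neg) / len(test_neg))
            aucs.append(auroc(test_pos, test_neg))
    n = len(sens)
    return sum(sens) / n, sum(spec) / n, sum(aucs) / n
```

Averaging over many random splits reduces the dependence of the estimate on any single partition, which matters when, as here, the cohort is too small for a separate held-out test set.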
Figure 6 Prediction of the risk of 10 different diseases for an individual liver cancer patient, using the gene pair panels obtained using the method described in this paper. This patient was known to have liver cancer and had no indication of any of the other diseases being evaluated.
4. Conclusions
We have presented a procedure that identifies a set of probe sets which demonstrate reliable expression levels for target genes. Using these, we evaluated ratio pairs to achieve self-normalization. By combining discriminative pairs and suppressor pairs, we found useful panels of gene pairs that are able to predict disease even across varying conditions such as chip lot or sample collection tube differences. For comparison, we also processed the data with a widely used machine learning package, Weka, using the SimpleLogistic model with automatic feature selection. Weka achieved the best overall discrimination with a panel of 17 probe sets, but was unable to suppress the bias introduced by the use of two different collection tubes. That is, whereas our method managed to align the liver cancer samples so that the majority (~75%) of both EDTA and PAXgene samples are predicted as true positives in the test set, the Weka predictions (based on unfiltered gene lists and without sample weighting) classified nearly half the liver cancer samples collected in EDTA tubes as false negatives and nearly all of the liver cancer samples collected in PAXgene tubes as true positives (Figure 4). Our method was then applied to a new cohort of samples collected in PAXgene tubes representing ten different diseases. The panels predicted with consistently high performance under repeated 2-fold cross-validation. We expect, based on our previous experience with EDTA data, that the prediction performance will hold when the sample number is increased. With increased sample numbers, it may even be possible to refine the gene panels and improve prediction performance.
With this method, the risk of a single patient having any of the ten diseases studied can be obtained simultaneously using a single blood sample. Figure 7 is a schematic representation of the process by which we identify panels of genes that are predictive of disease conditions. These panels can then be applied to the data from a single individual to make predictions of risk for these conditions.
Figure 7 Schematic representation of multiple disease prediction. The gene expression from a reference population representing several disease conditions is filtered according to a Quality Assurance system based on repeatability data. These data are then analysed to derive a predictive model for each disease condition. These models can then be applied to the data from a new sample to make a risk prediction for this individual.
We present the methodology described in this paper as a minimum set of procedures optimized for noisy data, in which the performance of each component complements that of the others. For example, the use of seemingly non-informative genes is suggested by the fundamental concept of suppressor variables, which dates back to 1941 and is similar to the differential amplifier of 1934 [24] or even the Wheatstone bridge of 1843 [25]. This concept has been successfully applied in other areas of science and technology but appears to have been generally neglected by the community of scientists involved in genomic data analysis. We are convinced that a return to a more balanced, holistic approach to data analysis may help in extracting useful information from the mass of data that can be obtained with rapidly advancing modern technology. The circulating peripheral blood system is involved in the regulation, coordination, metabolism and immune maintenance of all cells, tissues and organs. Functions of blood include transporting nutrients, oxygen and biomolecules, and removing cellular waste.
Blood is further involved in immune surveillance throughout the body, and in the delivery of immune factors and mediators to sites of disease, infection and injury. Thus, the circulation and physiologically interactive nature of blood ensure that this system encounters, transmits, and is affected by a wide range of biological signals. Over the past decade, we have investigated the blood genetic signatures of a wide variety of diseases and conditions affecting numerous organs and functions, including psychiatric disorders, osteoarthritis, cardiovascular disease, gastrointestinal diseases, and cancer. We have found that each disease has its characteristic expression spectrum of genetic signatures in peripheral blood, which makes it possible to detect disease anywhere in the body by assessing subtle changes in blood RNA. Based on these studies, we propose the Sentinel Principle®, which views the circulating blood cells as “sentinels” for detecting and responding to micro-environmental changes in the body. Accordingly, the current state of health or disease of an organism is conveyed in the blood through interactions between circulating blood cells and the body’s cells, tissues and organs. Since blood samples can be obtained readily, without the need for tissue biopsy, the genetic signature derived from blood RNA provides an alternative to tissue biopsy for determining the diagnosis and prognosis of many different diseases. Overcoming the problems of blood-based transcriptomics discussed above will further extend the application of the Sentinel Principle, not only to the diagnosis of multiple diseases from one blood sample, but also to other fields of personalized medicine such as active surveillance, prognosis, and drug response.
2. Experimental Section
In an earlier study, we reported the predictive performance of gene panels identified by the method developed by our group, using a data set of 631 blood samples collected in EDTA tubes and discriminating for three diseases with an area under the receiver operating characteristic curve (AUROC) ranging from 89% to 93% [17]. We later expanded the set to include 17 diseases represented by more than 1700 samples, and we were able to obtain similar prediction performance. However, because our samples were collected using EDTA tubes, the results would not be useful for many clinical applications, as discussed above. We have therefore transitioned to PAXgene tube collection and have started to rebuild our disease panels. This offered us an opportunity to evaluate the performance of our method against an established statistical package, Weka [18], for a study in which samples were collected initially using EDTA tubes and then PAXgene tubes. In this two-part study, we first present details of our method and demonstrate its ability to overcome a known and documented bias between samples collected in EDTA and PAXgene tubes. The PAXgene samples were processed using the NuGen reagent kit (NuGen, San Carlos, CA, USA). By extension, this method should be able to overcome other biases, which may not be known in advance. Then, we present the results of this method applied to a new cohort of samples representing ten cancers and diseases collected in PAXgene tubes and processed using the current 3′ IVT PLUS reagent kit (Affymetrix, Santa Clara, CA, USA). We switched from the NuGen kit because the new reagent kit has better repeatability performance in our laboratories.
2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression
For the initial part of the study, we used blood samples that we collected for a liver cancer (hepatocellular cancer) study.
The samples comprised blood taken from hepatocellular cancer (HCC) patients, chronic Hepatitis B (HpB) patients and healthy controls, all recruited in Malaysia under approved protocols [19]. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki. Our training set included samples from HCC patients: 26 collected in PAXgene tubes and 20 collected in EDTA tubes. We also had 28 blood samples taken from patients with chronic hepatitis B (HpB) infection collected in PAXgene tubes, 28 confirmed HCC-negative (control) samples also in PAXgene tubes and seven control samples collected in EDTA tubes. In addition, 830 samples from other studies (“other”) were collected in EDTA tubes. These “other” samples were assumed to be negative for HCC because it is a low-prevalence disease. The test set consisted of independent samples: 25 HCC collected in PAXgene tubes, 15 HCC collected in EDTA tubes, 27 HpB collected in PAXgene tubes, 27 controls collected in PAXgene tubes, 7 controls collected in EDTA tubes and 860 “other” samples collected in EDTA tubes (Table 1).
Table 1 Breakdown of samples for collection tube bias demonstration.
2.2. New Samples and Cross-Validation
After we demonstrated that this method works well in separating the three groups of liver cancer study samples (HCC, HpB, control), we proceeded to apply the same method to a new set of 157 samples representing ten diseases and healthy controls collected and processed in Penang and Shanghai (Table 2).
Table 2 New samples for cross-validation.
2.3. Methods
2.3.1. Blood Collection, RNA Isolation and RNA Quality Control
Peripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytiX, Hombrechtikon, Switzerland).
Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples that did not meet the following quality criteria were excluded from microarray analysis: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0. RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA).
2.3.2. Microarray Hybridization and Probe Set Quality
We used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data was extracted using the MAS5 analysis method, as each microarray needs to be evaluated independently. This technique is directed towards allowing single sample predictions, which are more practical in a clinical setting. We followed the MAQC list for the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray and excluded probe sets that are not on the MAQC list for Affymetrix microarrays [21]. We then conducted an experiment to identify probe sets that exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations.
Table 3 Four replicates across seven microarray lots.
These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays.
Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays.
Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1). The probe sets must also belong on the validated list published by the MAQC study as well as verified to be repeatable on our own EDTA and PAXgene replicate experiments. Finally, any outliers are also excluded because of their uncharacteristic expression value. These steps are summarized in Table 5. Figure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list. microarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering. 2.3.3. Pairs of Genes Pairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise which was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. Additionally, if a pair had little correlation with the disease under study (AUROC~0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. 
Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC, and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation.

The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal.

Figure 2. An EKG pair serves as the “suppressor” for a “raw” EEG pair, yielding a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals, which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present.

2.3.4. One-Against-All (Orthogonality)

One of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also many other confounding factors. However, this breadth of information is also the key to solving the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial. This solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected by only a single condition. By setting up an analysis to look only for those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case.
One beneficial side effect of this solution is that sample acquisition becomes much simpler: it is not necessary to find samples from patients with two or more diseases or conditions of interest. The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and practically impossible to acquire. We assign the “other” samples to the “not-this-disease” group and take advantage of their relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles, which makes each prediction panel more specific to the target disease condition.

2.3.5. Group Balance

A complication of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined either to have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (<1%). The problem is that patients with unconfirmed diagnoses far outnumber the colonoscopy-verified cancer-free patients. It would be possible to misclassify every confirmed colorectal cancer-negative patient as positive and still achieve very high specificity by correctly predicting the samples from other studies as colorectal cancer-free.
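A short worked example may make this pitfall concrete. The Python sketch below borrows the group sizes of the HCC training set from Section 2.1 (28 PAXgene plus 7 EDTA confirmed controls, and 830 “other” samples) purely as illustrative numbers; the worst-case classifier it describes is hypothetical, and the HpB samples are left out for simplicity.

```python
def specificity(true_negatives, false_positives):
    """Fraction of disease-negative samples predicted as negative."""
    return true_negatives / (true_negatives + false_positives)


confirmed_controls = 28 + 7  # verified negatives (PAXgene + EDTA)
assumed_negatives = 830      # "other" samples, assumed disease-free

# Hypothetical worst case: every confirmed control is called positive,
# while every "other" sample is correctly called negative.
pooled = specificity(
    true_negatives=assumed_negatives,
    false_positives=confirmed_controls,
)
confirmed_only = specificity(
    true_negatives=0,
    false_positives=confirmed_controls,
)

print(f"pooled specificity:            {pooled:.3f}")
print(f"specificity on verified cases: {confirmed_only:.3f}")
```

With these numbers the pooled specificity is 830/865, about 96%, even though not a single verified control was classified correctly, which is why the verified samples need additional weight.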
To account for such bias, statisticians weight the contribution of high-quality data (verified pathology) more heavily than that of the larger amount of low-quality data (assumed colorectal cancer-negative pathology). In our approach, this weighting is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies (Table 6).

Table 6. Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data; the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups.

2.3.6. Search Speed Optimization

The solution proposed above, however, increases both the size of the data set and the number of potential combinations to be evaluated, to the point that a systematic search becomes too time-consuming. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search.

The efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius.
For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units and an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the total number of points in the square is an approximation of π/4. A random search would select points at random rather than systematically marching along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation.

2.3.7. New Samples and Cross-Validation

The method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases, using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partitioning the data into traditional training and test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times.

Figure 3. Estimating the value of π numerically (a); convergence speed using a systematic search (b) and a Monte-Carlo random search (c).
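The π comparison of Section 2.3.6 can be sketched in a few lines of Python. This is a hedged illustration rather than the code behind Figure 3: it assumes the grid is traversed row by row and that the running estimate is read off before the traversal is complete, and the function names, the 100 × 100 grid size and the random seed are hypothetical choices.

```python
import math
import random


def grid_points(points_per_side):
    """Yield cell-centred grid points over [-1, 1] x [-1, 1], row by row."""
    step = 2.0 / points_per_side
    for i in range(points_per_side):
        for j in range(points_per_side):
            yield (-1.0 + (i + 0.5) * step, -1.0 + (j + 0.5) * step)


def random_points(n_points, seed=0):
    """Yield uniformly random points over the same square."""
    rng = random.Random(seed)
    for _ in range(n_points):
        yield (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))


def evaluations_to_converge(points, tolerance=0.05):
    """Number of evaluations after which the running estimate of pi
    stays within `tolerance` (relative) for the rest of the sequence.
    Returns one more than the sequence length if it never settles."""
    inside = 0
    last_bad = 0
    for count, (x, y) in enumerate(points, start=1):
        if x * x + y * y <= 1.0:
            inside += 1
        estimate = 4.0 * inside / count
        if abs(estimate - math.pi) > tolerance * math.pi:
            last_bad = count
    return last_bad + 1


grid_evals = evaluations_to_converge(grid_points(100))
mc_evals = evaluations_to_converge(random_points(10_000, seed=0))
print("systematic grid:", grid_evals, "evaluations")
print("Monte-Carlo:    ", mc_evals, "evaluations")
```

The completed grid gives an accurate estimate; the difference lies in the partial results. A row-by-row march samples one edge of the square first, so its running estimate is biased until most of the grid has been visited, whereas random points cover the whole square from the start. The exact counts depend on grid density, traversal order and seed.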
Title	2. Experimental Section
Section	2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression For the initial part of the study, we used blood samples that we collected for a liver cancer (hepatocellular cancer) study. The samples comprised blood taken from hepatocellular cancer (HCC) patients, chronic Hepatitis B (HpB) patients and healthy controls, all recruited in Malaysia under approved protocols [19]. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki. Our training set consisted of HCC patients: 26 HCC collected in PAXgene tubes and 20 HCC collected in EDTA tubes. We also had 28 blood samples taken from patients with chronic hepatitis B (HpB) infection collected in PAXgene tubes, 28 confirmed HCC-negative (control) samples also in PAXgene tubes and seven control samples collected in EDTA tubes. In addition, 830 samples from other studies (“other”) were collected in EDTA tubes. These “other” samples were assumed to be negative for HCC because it is a low-prevalence disease. The test set consisted of independent samples: 25 HCC collected in PAXgene tubes, 15 HCC collected in EDTA tubes, 27 HpB collected in PAXgene tubes, 27 controls collected in PAXgene tubes, 7 controls collected in EDTA tubes and 860 “other” samples collected in EDTA tubes (Table 1). microarrays-04-00671-t001_Table 1 Table 1 Breakdown of samples for collection tube bias demonstration. 2
Title	2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression
Table caption	microarrays-04-00671-t001_Table 1 Table 1 Breakdown of samples for collection tube bias demonstration.
Section	2.2. New Samples and Cross-Validation After we demonstrated that this method works well in separating the three groups of liver cancer study samples (HCC, HpB, control), we proceeded to apply the same method to a new set of 157 samples representing ten diseases and healthy controls collected and processed in Penang and Shanghai (Table 2). microarrays-04-00671-t002_Table 2 Table 2 New samples for cross-validation. 2
Title	2.2. New Samples and Cross-Validation
Table caption	microarrays-04-00671-t002_Table 2 Table 2 New samples for cross-validation.
Section	2.3. Methods 2.3.1. Blood Collection, RNA Isolation and RNA Quality Control Peripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytix, Hombrechtikon, Switzerland). Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked by using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples were excluded from microarray analysis that did not meet the following quality criteria: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0. RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA). 2.3.2. Microarray Hybridization and Probe Set Quality We used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data was extracted by using the MAS5 analysis method, as each microarray needs to be evaluated independently. This technique is directed towards allowing single sample predictions, which are more practical in a clinical setting. We followed the MAQC list for the Affymetrix GenChip Human Genome U133 Plus 2.0 microarray and ignored probe sets that are not included in the list of MAQC for Affymetrix microarrays [21]. We then conducted an experiment to identify probe sets which exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations. microarrays-04-00671-t003_Table 3 Table 3 Four replicates across seven microarray lots. These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). 
For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. microarrays-04-00671-t004_Table 4 Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1). The probe sets must also belong on the validated list published by the MAQC study as well as verified to be repeatable on our own EDTA and PAXgene replicate experiments. Finally, any outliers are also excluded because of their uncharacteristic expression value. These steps are summarized in Table 5. Figure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list. microarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering. 2.3.3. Pairs of Genes Pairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise which was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. 
Additionally, if a pair had little correlation with the disease under study (AUROC~0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation. The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by the electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal. Figure 2 EKG pair is “suppressor” used with “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present. 2.3.4. One-Against-All (Orthogonality) One of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also all kinds of other confounding factors. However, this breadth of information is also the key to the solution of the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each, and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial. This solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected only by a single condition. 
By setting up an analysis to look for only those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case. One beneficial side effect of this solution is that sample acquisition becomes much simpler; it is not necessary to find samples from patients with two or more diseases or conditions of interest. The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and impossible to acquire. We assign the “other” samples to the “not-this-disease” group and take advantage of the relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles that makes each prediction panel more specific to the target disease condition. 2.3.5. Group Balance A side effect of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined to either have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (<1%). 
The problem is that there are many more of these patients with unconfirmed diagnoses as compared with the colonoscopy-verified cancer-free patients. It might be possible to incorrectly predict all confirmed colorectal cancer negative patients as false positive and still achieve a very high specificity by correctly predicting the samples from other studies as colorectal cancer-free. To account for such bias, statisticians weigh more heavily the relative contribution of high-quality data (verified pathology) relative to the larger amount of low-quality data (assumed colorectal cancer-negative pathology). In our approach, this accounting for bias is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies. (Table 6). microarrays-04-00671-t006_Table 6 Table 6 Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data, the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups 2.3.6. Search Speed Optimization However, the solution proposed above has the consequence of increasing both the size of the data set and the number of potential combinations to be evaluated to the point that it becomes too time-consuming to conduct a systematic search. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search. 
The efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius. For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units with an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the number of points inside the square is an approximation of π/4. A random search would select points at random rather than systematically march along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation. 2.3.7. New Samples and Cross-Validation The method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partition into a traditional training/test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times. Figure 3 Estimating the value of π numerically. (a) Convergence speed using a systematic search (b) and a Monte-Carlo random search (c). 3.
Title	2.3. Methods
Section	2.3.1. Blood Collection, RNA Isolation and RNA Quality Control Peripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytix, Hombrechtikon, Switzerland). Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked by using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples were excluded from microarray analysis that did not meet the following quality criteria: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0. RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA).
Title	2.3.1. Blood Collection, RNA Isolation and RNA Quality Control
Section	2.3.2. Microarray Hybridization and Probe Set Quality We used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data was extracted by using the MAS5 analysis method, as each microarray needs to be evaluated independently. This technique is directed towards allowing single sample predictions, which are more practical in a clinical setting. We followed the MAQC list for the Affymetrix GenChip Human Genome U133 Plus 2.0 microarray and ignored probe sets that are not included in the list of MAQC for Affymetrix microarrays [21]. We then conducted an experiment to identify probe sets which exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations. microarrays-04-00671-t003_Table 3 Table 3 Four replicates across seven microarray lots. These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. microarrays-04-00671-t004_Table 4 Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1). The probe sets must also belong on the validated list published by the MAQC study as well as verified to be repeatable on our own EDTA and PAXgene replicate experiments. 
Finally, any outliers are also excluded because of their uncharacteristic expression value. These steps are summarized in Table 5. Figure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list. microarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering. 2
Title	2.3.2. Microarray Hybridization and Probe Set Quality
Table caption	microarrays-04-00671-t003_Table 3 Table 3 Four replicates across seven microarray lots. These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. m
Table caption	microarrays-04-00671-t004_Table 4 Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays.
Figure caption	Figure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list. m
Table caption	microarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering.
Section	2.3.3. Pairs of Genes Pairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise which was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. Additionally, if a pair had little correlation with the disease under study (AUROC~0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation. The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by the electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal. Figure 2 EKG pair is “suppressor” used with “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. 
Specifically selected raw signals which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present. 2
Title	2.3.3. Pairs of Genes
Figure caption	Figure 2 EKG pair is “suppressor” used with “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present.
Section	2.3.4. One-Against-All (Orthogonality) One of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also all kinds of other confounding factors. However, this breadth of information is also the key to the solution of the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each, and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial. This solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected only by a single condition. By setting up an analysis to look for only those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case. One beneficial side effect of this solution is that sample acquisition becomes much simpler; it is not necessary to find samples from patients with two or more diseases or conditions of interest. The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and impossible to acquire. We assign the “other” samples to the “not-this-disease” group and take advantage of the relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles that makes each prediction panel more specific to the target disease condition.
Title	2.3.4. One-Against-All (Orthogonality)
Section	2.3.5. Group Balance A side effect of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined to either have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (<1%). The problem is that there are many more of these patients with unconfirmed diagnoses than colonoscopy-verified cancer-free patients. A model could therefore misclassify every confirmed colorectal cancer-negative patient as a false positive and still achieve a very high specificity by correctly predicting the samples from other studies as colorectal cancer-free. To account for such bias, statisticians weight high-quality data (verified pathology) more heavily than the larger amount of low-quality data (assumed colorectal cancer-negative pathology). In our approach, this weighting is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies (Table 6).
Title	2.3.5. Group Balance
Table caption	microarrays-04-00671-t006_Table 6 Table 6 Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data, the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups
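The replication-with-noise weighting described in Section 2.3.5 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `replicate_with_noise`, the toy expression values, and the noise standard deviation of 0.05 are all assumptions made for the example; only the replication factor of 15 is taken from the text.

```python
import random


def replicate_with_noise(samples, n_copies, noise_sd, seed=0):
    """Return n_copies noisy copies of every sample.

    Each sample is a list of (log-scale) expression values. Independent
    Gaussian noise with standard deviation noise_sd is added to every value
    to mimic measurement uncertainty.
    """
    rng = random.Random(seed)
    replicated = []
    for sample in samples:
        for _ in range(n_copies):
            replicated.append([x + rng.gauss(0.0, noise_sd) for x in sample])
    return replicated


# Hypothetical toy data: two verified samples, three expression values each.
verified = [[7.1, 8.4, 6.0], [7.3, 8.1, 6.2]]

# Replicate each verified sample 15 times (noise level is an assumption).
balanced = replicate_with_noise(verified, n_copies=15, noise_sd=0.05)
print(len(balanced))  # 2 samples x 15 copies = 30
```

Replicating a sample k times acts like giving it a weight of k in the fit, while the added noise keeps the copies from being exact duplicates of a single measurement.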
Section	2.3.6. Search Speed Optimization However, the solution proposed above has the consequence of increasing both the size of the data set and the number of potential combinations to be evaluated to the point that it becomes too time-consuming to conduct a systematic search. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search. The efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius. For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units with an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the number of points inside the square is an approximation of π/4. A random search would select points at random rather than systematically march along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation.
Title	2.3.6. Search Speed Optimization
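The π example of Section 2.3.6 can be reproduced with a short sketch. It assumes that the "systematic" search marches row by row through a regular grid and that both strategies are stopped after the same limited budget of evaluated points; the grid size, the budget, and the function names are illustrative choices, not values from the paper.

```python
import math
import random


def grid_points(n_side):
    """Yield the cell centres of an n_side x n_side grid over [-1, 1]^2, row by row."""
    step = 2.0 / n_side
    for i in range(n_side):
        y = -1.0 + (i + 0.5) * step
        for j in range(n_side):
            x = -1.0 + (j + 0.5) * step
            yield x, y


def random_points(n_points, seed=0):
    """Yield n_points uniformly distributed points in [-1, 1]^2."""
    rng = random.Random(seed)
    for _ in range(n_points):
        yield rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)


def estimate_pi(points, budget):
    """Estimate pi as 4 x (fraction of the first `budget` points inside the unit circle)."""
    inside = 0
    used = 0
    for x, y in points:
        if used == budget:
            break
        used += 1
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / used


budget = 10_000  # one quarter of the 200 x 200 grid
print(abs(estimate_pi(grid_points(200), budget) - math.pi))       # partial sweep
print(abs(estimate_pi(random_points(budget), budget) - math.pi))  # random sampling
```

Under these assumptions the partial systematic sweep has only covered the lower strip of the square, so its running estimate is strongly biased, whereas the random sample covers the whole square from the start. Once the sweep is complete, the grid estimate is accurate as well; the advantage of random sampling lies in obtaining a usable answer early.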
Section	2.3.7. New Samples and Cross-Validation The method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partitioning the data into traditional training and test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times.
Title	2.3.7. New Samples and Cross-Validation
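Repeated 2-fold cross-validation, as described in Section 2.3.7, can be sketched as below. The classifier here is a deliberately simple stand-in (a threshold halfway between the two class means of a one-dimensional score), and the toy data, function names, and accuracy metric are assumptions for illustration; the paper's own model is the gene-pair panel, and it does not state whether the splits were stratified.

```python
import random
from statistics import mean


def fit_threshold(xs, ys):
    """Toy model: threshold midway between the class means of a 1-D score."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return mean(xs)  # degenerate half containing a single class
    return (mean(pos) + mean(neg)) / 2.0


def accuracy(threshold, xs, ys):
    """Fraction of samples whose predicted class (x > threshold) matches y."""
    return mean(1.0 if (x > threshold) == (y == 1) else 0.0 for x, y in zip(xs, ys))


def repeated_two_fold(xs, ys, n_iter=1000, seed=0):
    """Average test accuracy over n_iter random 2-fold splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    scores = []
    for _ in range(n_iter):
        rng.shuffle(idx)
        half = len(idx) // 2
        fold_a, fold_b = idx[:half], idx[half:]
        # Each half serves once as training set and once as test set.
        for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
            model = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
            scores.append(accuracy(model, [xs[i] for i in test], [ys[i] for i in test]))
    return mean(scores)


# Hypothetical, well-separated toy scores: 10 controls and 10 cases.
xs = [-1.0 + 0.01 * k for k in range(10)] + [1.0 + 0.01 * k for k in range(10)]
ys = [0] * 10 + [1] * 10
print(repeated_two_fold(xs, ys, n_iter=1000))
```

Repeating the random split many times averages out the dependence of the result on any one partition, which matters most when the sample number is small.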
Figure caption	Figure 3 Estimating the value of π numerically. (a) Convergence speed using a systematic search (b) and a Monte-Carlo random search (c).
Section	3. Results and Discussion We applied these procedures to a set of samples complicated by the use of two different collection technologies: EDTA and PAXgene tubes. Although the raw expression levels exhibited differences between these two sets, using our strategy it was possible to find combinations of pairs that exhibited improved stability and were predictive of the underlying disease. The search identified a combination of 12 probe sets in six pairs from the filtered list of 586 candidates. This combination scored the samples with fairly good overlap between the PAXgene and EDTA samples in each disease category in both the training and independent test sets (Figure 4). For comparison we also plotted the predictions using a standard open-source statistical package, Weka. This analysis was conducted using the SimpleLogistic classifier function with default parameters employing all available 9163 probe set data without any filtering or weighting. The SimpleLogistic classifier has a built-in feature selection capability that selected the final set of 17 probe sets from the entire data (WekaRaw17Gene). This comparison using an unmodified version of Weka under default settings is not intended to critique Weka, but rather to highlight the effect of the additional steps described in this manuscript. The test set results from the WekaRaw17Gene panel trained without sample weighting show that the liver cancer samples collected in EDTA tubes (median = −0.50) have dropped in prediction scores relative to the PAXgene cancer samples (median = +0.28) and are now in the same range as chronic hepatitis B (median = −0.74) and control patients (median = −0.94). 
By contrast, the predictions using our method align both liver cancer groups (EDTA median = +0.80, PAXgene median = +0.77), which are well separated from the HpB samples (median = −0.38) and control samples (EDTA median = −1.34, PAXgene median = −0.57). These results illustrate that the novel statistical method described above worked well in separating the three groups of liver cancer study samples (HCC, HpB, control), even under different conditions, such as chip lot or blood collection tubes. Additionally, the loss of test set prediction accuracy resulting from the absence of the PAXgene HpB and other subgroups of samples can be overcome by using the information from other available PAXgene subgroups, as shown in the results presented in Figure 4. We then applied the same procedures to a group of blood samples from ten different diseases and controls collected in PAXgene tubes (Table 2). The results of 1000 iterations of 2-fold cross-validation for each disease’s gene pair panel are summarised in Table 7. When this method was used, all ten disease gene pair panels showed consistently high prediction performance: high sensitivity (mean 89%), high specificity for the healthy controls (mean 98%), high specificity for the other nine diseases (93%), and high AUROC (mean 96%). The prediction results for each individual subject for the risk of colorectal cancer are charted in Figure 5, which shows good discrimination between colorectal cancer and controls and the other nine diseases. All but one colorectal cancer subject achieved a score above the threshold value of 0, while only four of all the other subjects returned a false positive prediction. The other gene panels achieved similar results. Figure 6 is an example of the prediction of the risk of ten different diseases for a single liver cancer patient using the gene pair panels obtained using the method described in this paper. 
This liver cancer patient can be seen to be at high risk for liver cancer and at no higher than population average risk for the other nine diseases.
Title	3. Results and Discussion
Figure caption	Figure 4 Prediction scores using the method described in this paper (LogReg_6Pairs) and Weka prediction using all data without any preprocessing (WekaRaw17Gene).
Table caption	microarrays-04-00671-t007_Table 7 Table 7 Performance of gene pair panels of each disease using our statistical method (1000 iterations of 2-fold cross validation).
Figure caption	Figure 5 Prediction of risk for colorectal cancer for individual subjects using the colorectal cancer gene pair panel identified by the method described in this paper.
Figure caption	Figure 6 Prediction of the risk of 10 different diseases for an individual liver cancer patient, using the gene pair panels obtained using the method described in this paper. This patient was known to have liver cancer and had no indication of any of the other diseases being evaluated.
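The sensitivity and specificity figures reported in Section 3 follow from comparing each prediction score with the threshold value of 0. The sketch below shows that calculation on hypothetical scores; the function name and the toy data are assumptions and do not reproduce the values in Table 7.

```python
def sensitivity_specificity(scores, is_case, threshold=0.0):
    """Return (sensitivity, specificity) for scores classified as positive when > threshold."""
    tp = sum(1 for s, c in zip(scores, is_case) if c and s > threshold)
    fn = sum(1 for s, c in zip(scores, is_case) if c and s <= threshold)
    tn = sum(1 for s, c in zip(scores, is_case) if not c and s <= threshold)
    fp = sum(1 for s, c in zip(scores, is_case) if not c and s > threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical prediction scores: three cases followed by three non-cases.
scores = [0.8, 0.5, -0.1, -1.2, -0.6, 0.2]
is_case = [True, True, True, False, False, False]

sens, spec = sensitivity_specificity(scores, is_case)
print(sens, spec)
```

In the one-against-all setting, specificity can be computed separately for healthy controls and for the other diseases by restricting the non-case samples to each subgroup in turn.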
Section	4. Conclusions We have presented a procedure that identifies a set of probe sets which demonstrate reliable expression levels for target genes. Using these, we evaluated ratio pairs to achieve self-normalization. By combining discriminative pairs and suppressor pairs, we found useful panels of gene pairs that are able to predict disease even across varying conditions such as chip lot or sample collection tube differences. For comparison, we also processed the data with a widely-used machine learning package, Weka, using the SimpleLogistic model with automatic feature selection. Weka achieved the best overall discrimination with a panel of 17 probe sets, but was unable to suppress the bias introduced by the use of two different collection tubes. That is, whereas our method managed to align the liver cancer samples so that the majority (~75%) of both EDTA and PAXgene samples are predicted as true positive in the test set, the Weka predictions (based on data using unfiltered gene lists and without sample weighting) classified nearly half the liver cancer samples collected in EDTA tubes as false negatives and nearly all of the liver cancer samples as true positives (Figure 4). Our method was then applied to a new cohort of samples collected in PAXgene tubes representing ten different diseases. The panels predicted with consistently high performance under repeated 2-fold cross-validation. We expect, based on our previous experience with EDTA data that the prediction performance will hold when the sample number is increased. It may even be possible that with increased sample numbers, the gene panels may be refined and result in improved prediction performance. With this method, the risk of a single patient having any of the ten diseases studied can be obtained simultaneously using a single blood sample. Figure 7 is a schematic representation of the process by which we identify panels of genes that are predictive of disease conditions. 
These panels can then be applied to the data from a single individual to make predictions of risk for these conditions. We present the methodology described in this paper as a minimum set of procedures optimized for noisy data, with the performance of each component complementary to that of the other components. For example, the use of seemingly non-informative genes is suggested by the fundamental concept of suppressor variables, which dates back to 1941 and is similar to the differential amplifier of 1934 [24] or even the Wheatstone bridge of 1843 [25]. This concept has been successfully applied in other areas of science and technology but appears to have been generally neglected by the community of scientists involved in genomic data analysis. We are convinced that a return to a more balanced holistic approach to data analysis may help in extracting useful information from the mass of data which can be obtained by rapidly advancing modern technology. The circulating peripheral blood system is involved in the regulation, coordination, metabolism and immune maintenance of all cells, tissues and organs. Functions of blood include transporting nutrients, oxygen and biomolecules, and removing cellular waste. Blood is further involved in immune surveillance throughout the body, and delivery of immune factors and mediators to sites of disease, infection and injury. Thus, the circulation and physiologically interactive nature of blood ensure that this system encounters, transmits, and is affected by a wide range of biological signals. 
Over the past decade, we have investigated the blood genetic signatures of a wide variety of diseases and conditions affecting numerous organs and functions, including psychiatric disorders, osteoarthritis, cardiovascular disease, gastrointestinal diseases, and cancer. It has been found that each disease has its characteristic expression spectrum of genetic signatures in peripheral blood, which make it possible to detect disease anywhere in the body by accessing subtle changes in blood RNA. Based on these studies, we address the Sentinel Principle® that views the circulating blood cells as “sentinels” for detecting and responding to micro-environmental changes in the body. Accordingly, the current state of health or disease of an organism is conveyed in the blood through interactions between circulating blood cells and the body’s cells, tissues and organs. Since blood samples can be readily obtained non-invasively, the genetic signature derived from blood RNA provides an alternative to tissue biopsy for determining the diagnosis and prognosis of many different diseases. Overcoming the problems of blood-based transcriptomics discussed above will further extend the application of the Sentinel Principle not only to the diagnosis of multiple diseases in one blood sample, but also to other fields of personalized medicine such as active surveillance, prognosis, drug response, and so on.
Title	4. Conclusions
Figure caption	Figure 7 Schematic representation of multiple disease prediction. The gene expression from a reference population representing several disease conditions is filtered according to a Quality Assurance system based on repeatability data. These data are then analysed to derive a predictive model for each disease condition. These models can then be applied to the data from a new sample to make a risk prediction for this individual.
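The self-normalizing property of ratio pairs mentioned in the Conclusions can be illustrated with a small sketch. It assumes log-scale expression values and a bias (for example from the collection tube) that shifts both members of a pair by the same additive amount; the probe names `probe_a` and `probe_b`, the values, and the offset are hypothetical.

```python
def pair_score(log_expr, probe_a, probe_b):
    """Log-ratio of two probe sets: a difference on the log scale."""
    return log_expr[probe_a] - log_expr[probe_b]


# Hypothetical log2 expression values for one sample.
sample = {"probe_a": 9.0, "probe_b": 7.5}

# The same sample with an assumed common offset of +0.7 on both probe sets.
offset = 0.7
shifted = {name: value + offset for name, value in sample.items()}

# Individual values differ, but the pair difference is unchanged
# (up to floating-point rounding).
print(pair_score(sample, "probe_a", "probe_b"))
print(pair_score(shifted, "probe_a", "probe_b"))
```

The cancellation is exact only when the bias affects both probe sets equally; a bias that differs between the two members of a pair would remain in the difference, which is one reason the candidate probe sets are first filtered on repeatability.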
