
PMC:4996407 / 11021-26084
Annnotations
2_test
{"project":"2_test","denotations":[{"id":"27600246-25569223-69478671","span":{"begin":2093,"end":2095},"obj":"25569223"},{"id":"27600246-19795455-69478672","span":{"begin":3914,"end":3916},"obj":"19795455"},{"id":"27600246-16964229-69478673","span":{"begin":5047,"end":5049},"obj":"16964229"},{"id":"27600246-19926881-69478674","span":{"begin":6853,"end":6855},"obj":"19926881"}],"text":"2. Experimental Section\nIn an earlier study we published the predictive performance of gene panels identified by the method developed by our group using a data set consisting of 631 blood samples collected in EDTA tubes and discriminating for three diseases with an area under the receiving operator characteristic curve (AUROC) ranging from 89% to 93% [17]. We later expanded the set to include 17 diseases represented by more than 1700 samples, and we were able to obtain similar prediction performance. However, because our samples were collected using EDTA tubes, the results would not be useful for many clinical applications, as discussed above.\nWe have therefore transitioned to PAXgene tube collection and have started to rebuild our disease panels. This offered us an opportunity to evaluate the performance of our method against an established statistical package, Weka [18], for a study in which samples were collected initially using EDTA tubes and then PAXgene tubes.\nIn this two-part study, we first present details of our method and demonstrate its ability to overcome a known and documented bias between samples collected in EDTA and PAXgene tubes. The PAXgene samples were processed using Nugen reagent kit (NuGen, San Carlos, CA, USA). By extension, this method should be able to overcome other biases, which may not be known in advance. Then, we present the results of this method applied to a new cohort of samples representing ten cancers and diseases collected in PAXgene tubes and processed using the current 3′ IVT PLUS reagent kit (Affymetrix, Santa Clara, CA, USA). We switched from the Nugen kit because the new reagent kit has better repeatability performance in our laboratories.\n\n2.1. Demonstration of EDTA and PAXgene Collection Tube Bias Suppression\nFor the initial part of the study, we used blood samples that we collected for a liver cancer (hepatocellular cancer) study. The samples comprised blood taken from hepatocellular cancer (HCC) patients, chronic Hepatitis B (HpB) patients and healthy controls, all recruited in Malaysia under approved protocols [19]. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted in accordance with the Declaration of Helsinki. Our training set consisted of HCC patients: 26 HCC collected in PAXgene tubes and 20 HCC collected in EDTA tubes. We also had 28 blood samples taken from patients with chronic hepatitis B (HpB) infection collected in PAXgene tubes, 28 confirmed HCC-negative (control) samples also in PAXgene tubes and seven control samples collected in EDTA tubes. In addition, 830 samples from other studies (“other”) were collected in EDTA tubes. These “other” samples were assumed to be negative for HCC because it is a low-prevalence disease. The test set consisted of independent samples: 25 HCC collected in PAXgene tubes, 15 HCC collected in EDTA tubes, 27 HpB collected in PAXgene tubes, 27 controls collected in PAXgene tubes, 7 controls collected in EDTA tubes and 860 “other” samples collected in EDTA tubes (Table 1).\nmicroarrays-04-00671-t001_Table 1 Table 1 Breakdown of samples for collection tube bias demonstration.\n\n2.2. New Samples and Cross-Validation\nAfter we demonstrated that this method works well in separating the three groups of liver cancer study samples (HCC, HpB, control), we proceeded to apply the same method to a new set of 157 samples representing ten diseases and healthy controls collected and processed in Penang and Shanghai (Table 2).\nmicroarrays-04-00671-t002_Table 2 Table 2 New samples for cross-validation.\n\n2.3. Methods\n\n2.3.1. Blood Collection, RNA Isolation and RNA Quality Control\nPeripheral whole blood was collected from patients in EDTA Vacutainer tubes (Becton Dickinson, Franklin Lakes, NJ, USA) and PAXgene tubes (PreAnalytix, Hombrechtikon, Switzerland). Whole blood RNA was isolated as described previously [20]. Isolated RNA was checked by using 2100 Bioanalyzer RNA 6000 Nano Chips (Agilent Technologies, Santa Clara, CA, USA). Samples were excluded from microarray analysis that did not meet the following quality criteria: RIN ≥ 7.0; 28S:18S rRNA ≥ 1.0. RNA quantity was determined by absorbance at 260 nm in a DU800 Spectrophotometer (Beckman Coulter, Brea, CA, USA) or a NanoDrop 2000c UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA).\n\n2.3.2. Microarray Hybridization and Probe Set Quality\nWe used the Affymetrix GeneChip Human Genome U133 Plus 2.0 microarray (Affymetrix) and the FDA-cleared, CE-IVD marked Affymetrix Gene Profiling Array cGMP U133 P2 microarray (Affymetrix) for this study. Expression data was extracted by using the MAS5 analysis method, as each microarray needs to be evaluated independently. This technique is directed towards allowing single sample predictions, which are more practical in a clinical setting.\nWe followed the MAQC list for the Affymetrix GenChip Human Genome U133 Plus 2.0 microarray and ignored probe sets that are not included in the list of MAQC for Affymetrix microarrays [21].\nWe then conducted an experiment to identify probe sets which exhibit stable results across 7 different chip lots using 4 replicates each, for a total of 28 hybridizations on a pooled mRNA sample extracted from whole blood (Table 3). Probe sets were ranked according to observed variability across the 28 hybridizations.\nmicroarrays-04-00671-t003_Table 3 Table 3 Four replicates across seven microarray lots. These findings were confirmed by two individual non-pooled samples, each run twice (Table 4). For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays.\nmicroarrays-04-00671-t004_Table 4 Table 4 Non-pooled samples replicates. For more details on these QC parameters, please refer to Affymetrix notes on U133Plus2 microarrays. Probe sets with expression levels lower than 100 were classified as too “noisy” based on the data from these repeated hybridizations, whereas those with expression levels greater than 10,000 were classified as “saturated” and unreliable for detecting change in expression (Figure 1). The probe sets must also belong on the validated list published by the MAQC study as well as verified to be repeatable on our own EDTA and PAXgene replicate experiments. Finally, any outliers are also excluded because of their uncharacteristic expression value. These steps are summarized in Table 5.\nFigure 1 Technical Replicate Hybridization. (a,b) Correlation between two replicates hybridizations of samples N23C and N82B; (c) distribution of replicate ratios for probe sets within the “filtered” list.\nmicroarrays-04-00671-t005_Table 5 Table 5 Probe Set Filtering.\n\n2.3.3. Pairs of Genes\nPairs of genes are the minimum unit for analysis when using self-normalization to suppress confounding factors, such as the diurnal cycle [22]. From the stable probe sets identified in the previous steps, pairs of genes were evaluated as ratios. A pair with an AUROC value of 0.7 or higher was classified as “significant”, and the pair was set aside as a candidate biomarker pair for further combination analysis. This AUROC value was selected from empirical experience as a good balance between potentially excluding valid probe sets and accepting those that would eventually prove to be unreliable in further evaluations. Pairs using the sum of the gene expression were also set aside as candidates if the pair had complementary noise which was reduced by the sum, as evidenced by a significant increase in AUROC above that of the individual genes. Additionally, if a pair had little correlation with the disease under study (AUROC~0.5), but showed good correlation with the significant pairs, then the pair was also set aside as a potential “suppressor” pair [23]. Finally, combinations of candidate biomarker and suppressor pairs were evaluated by AUROC and a short list was selected for validation on a test set or by multiple iterations of n-fold cross-validation. The concept of using pairs of genes is similar to the practice of using differential signals in electrocardiogram (EKG) and electroencephalogram (EEG) measurements (Figure 2). The desired signal is obscured by the electrical noise from both the external environment and spurious muscle contractions in the body of the patient. However, by selecting appropriate reference points to obtain a “suppressor” signal, it is possible to optimally recover the desired signal.\nFigure 2 EKG pair is “suppressor” used with “raw” EEG pair to obtain a clean EEG signal with reduced EKG artefacts. Specifically selected raw signals which seem to be useless noise, are combined to suppress masking noise, revealing the underlying useful information that was always present.\n\n2.3.4. One-Against-All (Orthogonality)\nOne of the difficulties of using whole blood is that its transcriptome contains information reflecting not only many diseases but also all kinds of other confounding factors. However, this breadth of information is also the key to the solution of the problem. Since so many factors are mirrored in the blood transcriptome, it should be possible to find a signature for each, and reduce the problem to a set of “independent” or orthogonal equations for which the solution becomes nearly trivial.\nThis solution is based on our hypothesis that, whereas many genes are affected by more than one disease or condition, there may exist combinations of genes that are affected only by a single condition. By setting up an analysis to look for only those combinations of genes that respond to a single condition in the explicit presence of confounding factors such as other diseases, we will be able to identify those genes that match the independent equation case. One beneficial side effect of this solution is that sample acquisition becomes much simpler; it is not necessary to find samples from patients with two or more diseases or conditions of interest. The practical consequence is not trivial: for diseases with very low prevalence rates, patients with multiple disease combinations would be vanishingly rare and impossible to acquire.\nWe assign the “other” samples to the “not-this-disease” group and take advantage of the relatively larger numbers to attenuate any “out of the ordinary” characteristics of an individual sample. As the number increases, the potential skew from any one individual is diluted. Additionally, the gene panel is trained to reject the signature of all other diseases included in the “other” samples. This is the “one-against-all” approach for analysing gene expression profiles that makes each prediction panel more specific to the target disease condition.\n\n2.3.5. Group Balance\nA side effect of employing the one-against-all approach is that clinical information about a patient in a research study is usually limited to the condition being studied. For instance, in a colorectal cancer study, all the patients under study will be endoscopically examined and determined to either have colorectal cancer (case) or to be free from colorectal cancer (control). However, it is not usually possible to know for certain whether patients from other studies are truly free from colorectal cancer. Researchers can only assume that it would be unlikely for these patients to have colorectal cancer, which is a disease with a low prevalence (\u003c1%). The problem is that there are many more of these patients with unconfirmed diagnoses as compared with the colonoscopy-verified cancer-free patients. It might be possible to incorrectly predict all confirmed colorectal cancer negative patients as false positive and still achieve a very high specificity by correctly predicting the samples from other studies as colorectal cancer-free. To account for such bias, statisticians weigh more heavily the relative contribution of high-quality data (verified pathology) relative to the larger amount of low-quality data (assumed colorectal cancer-negative pathology).\nIn our approach, this accounting for bias is achieved by replicating the samples that need to be weighted more heavily. We chose replication because it has the added benefit that we can introduce a controlled amount of random Gaussian noise to simulate the effect of measurement uncertainty and reduce the impact of any single data point that might skew the results. To increase the contribution of the confirmed non-cancer cases, we replicated each PAXgene HCC sample 15 times and each EDTA HCC sample 20 times, to balance them with each other and with the samples from other studies. (Table 6).\nmicroarrays-04-00671-t006_Table 6 Table 6 Training Set Group Balance. The two horizontal red arrows indicate the balance between PAXgene and EDTA collected data, the in-between vertical green arrow indicates the balance between HCC and Control subgroups. The two vertical green arrows indicate the balance between Cancer and Control/Other subgroups\n\n2.3.6. Search Speed Optimization\nHowever, the solution proposed above has the consequence of increasing both the size of the data set and the number of potential combinations to be evaluated to the point that it becomes too time-consuming to conduct a systematic search. Since the goal is only to find some combinations of genes that predict disease well enough to be useful, we used a Monte-Carlo approach to accelerate the search.\nThe efficiency of a random Monte-Carlo evaluation can be illustrated by comparing the numerical estimation of the value of π with and without Monte-Carlo acceleration. The mathematical constant, π, represents the ratio of the area within a circle to the square of its radius. For a circle of unit radius with an enclosed area of π units squared, the enclosing square has sides of 2 units with an area of 4 units squared. A systematic evaluation would divide the enclosing square into a grid with regularly spaced points and determine whether each of these points is within the circle or outside it. The ratio of the number of points within the circle to the number of points inside the square is an approximation of π/4. A random search would select points at random rather than systematically march along the grid. As illustrated in Figure 3, the Monte-Carlo estimate comes within 5% of the correct value much more rapidly than the systematic evaluation.\n\n2.3.7. New Samples and Cross-Validation\nThe method was applied to a group of blood samples from ten different diseases and controls. We used the entire data set of ten diseases and controls to identify gene expression biomarkers associated with each of the ten diseases using the statistical method described above (one-against-all). This is the first data set collected entirely in PAXgene tubes. Because the sample numbers are still small, it is not practical to evaluate prediction performance by partition into a traditional training/test sets. Instead, we performed a 2-fold cross-validation iterated 1000 times.\nFigure 3 Estimating the value of π numerically. (a) Convergence speed using a systematic search (b) and a Monte-Carlo random search (c).\n\n3. "}