3.2. Pathway Analysis

A pathway enrichment analysis of the differentially expressed genes was carried out against the Reactome pathway database using the ReactomePA package. Like gene set enrichment analysis, this produces a list of key terms, in this case pathways, each with a p-value for the probability that the observed distribution of expression occurred by chance. A cut-off of 0.05 was applied to the pathway p-values.

An example of the effect of normalization on the pathway analysis is given in Figure 2. In this case a large number of additional pathways are identified after farms normalization, but the pathways identified by the rma normalized analysis are conserved in the farms normalized results. It is possible that farms normalization produces a dataset that is more sensitive to changes in pathway regulation, but it is also possible that a number of these pathways are false positives.

Figure 2. Pathway Analysis for E-GEOD-43458, Normal-Adenocarcinoma. (left) After rma normalization; (right) After farms normalization.

3.2.1. Pathways that Are Differentially Expressed between Normal Lung Tissue and Tumors

Two of the datasets, E-GEOD-18842 and E-GEOD-19188, compare normal lung tissue to NSCLC or tumor tissue, respectively. Of these, E-GEOD-18842 has a much better defined list of differentially expressed genes. Again, this is partly a result of the larger size of the dataset, which gives a much lower standard error in the expression levels of the comparison groups and increases the power of the experiment, but noise is also a factor, as the final list contains almost all of the genes on the array. A cut-off had to be used for E-GEOD-19188, and only the top 2000 genes were used for pathway analysis.

For E-GEOD-18842, the three different normalization methods produce very similar pathway analyses. The rma and farms normalized data give identical pathway analyses and identify 16 significant pathways (Figure 3).
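The "significant" pathways counted here are those whose enrichment p-value falls below the 0.05 cut-off. As a rough illustration only, the sketch below scores pathways with an over-representation test, assuming a hypergeometric model (a common choice for this kind of analysis). The function names, gene identifiers and pathway names are hypothetical, no multiple-testing correction is applied, and the code is not the ReactomePA implementation.

```python
from math import comb


def enrichment_p_value(universe_size, pathway_size, selected_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    X is the number of pathway genes among `selected_size` genes drawn
    without replacement from a universe of `universe_size` genes, of which
    `pathway_size` belong to the pathway.
    """
    max_overlap = min(pathway_size, selected_size)
    favourable = sum(
        comb(pathway_size, k) * comb(universe_size - pathway_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(universe_size, selected_size)


def significant_pathways(pathways, selected_genes, universe, cutoff=0.05):
    """Return {pathway name: p-value} for pathways with p-value below `cutoff`."""
    universe = set(universe)
    selected = set(selected_genes) & universe
    significant = {}
    for name, members in pathways.items():
        members = set(members) & universe
        p = enrichment_p_value(
            len(universe), len(members), len(selected), len(members & selected)
        )
        if p < cutoff:
            significant[name] = p
    return significant


# Toy data (hypothetical): 20 genes on the array, 5 differentially expressed.
universe = [f"g{i}" for i in range(20)]
selected = ["g0", "g1", "g2", "g3", "g10"]
pathways = {
    "pathway_A": ["g0", "g1", "g2", "g3", "g4"],       # 4 of 5 members selected
    "pathway_B": ["g10", "g11", "g12", "g13", "g14"],  # 1 of 5 members selected
}
result = significant_pathways(pathways, selected, universe)
print(result)
```

In this toy case only pathway_A passes the cut-off (p = 76/15504, roughly 0.005), while pathway_B, with a single selected member, does not. When only a subset of genes is carried forward, as with the top 2000 genes for E-GEOD-19188, that subset would play the role of `selected_genes`.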
The results for the gcrma normalized data contain 12 of the 16 previously identified pathways: deposition of new CENPA-containing nucleosomes at the centromere; telomere maintenance; unwinding of DNA; and nucleosome assembly are absent, while separation of sister chromatids; resolution of sister chromatid cohesion; E2F mediated regulation of DNA replication; mitotic metaphase and anaphase; mitotic anaphase; DNA strand elongation; and G2/M checkpoints are added.

The results from E-GEOD-19188 are less definitive (Figure 4). From the rma and gcrma normalized data, over 40 pathways are identified as significant. There is considerable overlap between these two pathway sets, with 37 of the top 40 pathways conserved between the two analyses. The farms normalized data yield only 29 significant pathways, and some of those identified are notably different from the other results, especially the pathways associated with erythrocytes that are absent from the other pathway analyses (O2/CO2 exchange in erythrocytes, hemostasis, erythrocytes take up oxygen and release carbon dioxide, erythrocytes take up carbon dioxide and release oxygen). Associations between hemostasis, platelets and cancer were proposed in the 1980s, but these associations have only recently become the focus of renewed attention for pathway analysis [29,30,31].

Figure 3. Pathway Analysis for E-GEOD-18842, Normal-NSCLC. (left) After rma normalization; (right) After gcrma normalization.

The overlapping pathways between E-GEOD-18842 and E-GEOD-19188 contain many of the usual suspects, such as the control of mitosis and control of the cell cycle, including checkpoints. The E2F mediated regulation of DNA replication has also been identified as important in a number of different cancers, as this transcription factor regulates cyclin E [32]. Two other pathways of interest are the Polo-like kinase mediated events and the activation of ATR in response to replication stress.
Polo-like kinase 1 (PLK1) is involved in the DNA damage checkpoint and has been a target for cancer therapy [33,34]. Ataxia-telangiectasia and Rad3-related protein (ATR) halts DNA replication when replication forks are stalled and need to be repaired. The absence of the gene creates fragile sites in the chromosome. ATR has been targeted as a possible cancer preventative or as a means of boosting the effectiveness of existing therapies [35,36].

Figure 4. Pathway Analysis for E-GEOD-19188, Healthy-Tumor. (left) After rma normalization; (right) After farms normalization.

3.2.2. Pathways that Are Differentially Expressed between Normal Lung Tissue and Adenocarcinoma

Three datasets have a comparison between normal lung tissue and adenocarcinoma: E-GEOD-6044, E-GEOD-40275 and E-GEOD-43458. Of these, the E-GEOD-43458 dataset is the largest and is the only one that was specifically targeted to adenocarcinoma.

The smallest of these datasets is E-GEOD-6044. This dataset also used the hgfocus gene array and not the more complete hgu133a array. From the rma and farms normalized data fewer than 10 pathways were identified and these were: biological oxidation; axon guidance; hemostasis; platelet activation, signaling and aggregation; platelet degranulation; translocation of GLUT4 to the plasma membrane; response to elevated platelet cytosolic Ca2+; L1CAM interactions; phase 1-functionalisation of compounds; membrane trafficking; and extracellular matrix organization. It is noteworthy that none of these pathways were amongst those identified as differences between the normal and tumor tissues and that they are not involved in cell cycle or DNA regulation. Phase 1-functionalisation is the processing and export of proteins from the endoplasmic reticulum and the reactions involve cytochrome P-450. Sugar transporters have previously been identified as having a role in cancer [37]. The L1 cell adhesion molecule L1CAM is associated with invasive tumors and metastasis [38,39].
After gcrma normalization, pathway analysis identifies over 40 significant pathways (Supplementary Figure S1). These include the DNA regulation, mitotic and cell cycle pathways previously identified in the normal-tumor comparisons. They also include the meiosis pathways, negative regulation of rRNA expression and amyloids.

The E-GEOD-40275 dataset uses an exon array and so it can only be normalized using rma. The results of pathway analysis are very different from those for the E-GEOD-6044 dataset and the normal-tumor analysis. They include a number of cancer specific pathways involving mutants and also a group of pathways involved in mRNA regulation and processing (Figure 5).

Figure 5. Pathway Analysis for E-GEOD-40275, Normal-Adenocarcinoma.

The E-GEOD-43458 dataset could only be normalized with rma and farms, as gcrma is incompatible with the array type used. The results from the pathway analysis are shown in Figure 2. As stated above, there is a large difference in the number of pathways identified between the two analyses, with many more differentially expressed pathways after farms normalization. Many of them are previously identified pathways involved in cell cycle regulation and mitosis. Amyloids appear in both of the pathway analyses, and they were also identified in the gcrma normalized pathway analysis of E-GEOD-6044. Serum amyloid A protein has been known for some time to be a biomarker for lung cancer and nasopharyngeal cancer [40,41]. Signaling pathways including the Rho GTPases, Wnt and TCF are also present in the post-farms pathway analysis. These signaling pathways have previously been associated with cancer, although TCF has mainly been associated with colorectal cancers [42,43,44,45,46]. The L1CAM pathway is also identified as being significant, in agreement with the results from E-GEOD-6044.

3.2.3. Pathways that Are Differentially Expressed between Normal Lung Tissue and Squamous Cell Carcinoma

Two datasets are available to compare the transcriptomes of normal lung tissue and squamous cell carcinoma: E-GEOD-6044 and E-GEOD-40275. The E-GEOD-6044 data show a similar profile to the normal-adenocarcinoma pathway profile for E-GEOD-40275, including the TCF dependent signaling in response to Wnt (Supplementary Figures S2–S4). There are also shared pathways with the normal-adenocarcinoma analysis of the same dataset. The results suggest that squamous cell carcinoma has an effect on a more diverse range of pathways than adenocarcinoma.

The E-GEOD-40275 normal-squamous cell carcinoma comparison identifies platelet associated pathways as well as RNA processing as the most important groups of differentially expressed pathways (Figure 6). These overlap with the pathways identified in the normal-adenocarcinoma comparison carried out with the same dataset and also the normal-adenocarcinoma pathway differences identified using the E-GEOD-6044 data.

Figure 6. Pathway analysis for E-GEOD-40275, Normal-Squamous Cell Carcinoma.

3.2.4. Pathways that Are Differentially Expressed between Adenocarcinoma and Squamous Cell Carcinoma

From the previous analysis it is apparent that there is considerable overlap in the pathways that are differentially expressed in adenocarcinoma and squamous cell carcinoma. There are three datasets that allow direct comparison between the transcriptomes of the two NSCLC sub-types: E-GEOD-6044, E-GEOD-40275 and E-GEOD-50081 (the last of which was specifically created to compare gene expression between cancer sub-types). Table 2 shows that there are only a small number of differentially expressed genes between the two NSCLC sub-types except in the E-GEOD-50081 dataset.
It is therefore slightly surprising that, after pathway analysis, the large number of differentially expressed genes in E-GEOD-50081 reduces to 13 pathways: unwinding of DNA; type I hemidesmosome assembly; telomere maintenance; SIRT1 negatively regulates rRNA expression; packaging of telomere ends; nucleosome assembly; metabolism of porphyrins; heme degradation; glucuronidation; DNA strand elongation; deposition of new CENPA-containing nucleosomes at the centromere; condensation of prophase chromosomes; and chromosome maintenance.

The E-GEOD-6044 dataset has many fewer differentially expressed genes (although more than are differentially expressed between normal-adenocarcinoma and normal-squamous cell carcinoma). This produces a much more extensive list of pathways, with novel pathways involved in extra-cellular processes, such as extracellular matrix organization, cell-cell communication, cell-cell junction organization and immune system. The data, however, are very noisy, and there is poor overlap between the results from pathway analysis after the three different normalization methods, so these results should be considered with some caution.

Finally, the E-GEOD-40275 dataset also shows a small number of differentially expressed pathways. These agree with the extra-cellular pathways found in the E-GEOD-6044 results, lending support to those findings, and also include new pathways involved in fatty acid, triacylglycerol, and ketone body metabolism, as well as the metabolism of lipids and lipoproteins. There are also two pathways highlighted that are involved in the regulation of the peroxisome proliferator-activated receptor alpha (PPARA). There is some evidence that PPARA is associated with breast cancer, but this is the first time it has been identified as involved in lung cancer, although the gamma receptor has previously been identified as playing a role in inhibiting lung cancer cell growth [47,48].
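The overlaps between pathway lists reported throughout this section (for example, 12 of the 16 rma pathways for E-GEOD-18842 being retained after gcrma normalization, with 7 new pathways added) can be summarized with simple set operations. The sketch below is one possible way to do this and was not part of the analysis: the function name is hypothetical, the pathway labels are placeholders rather than real Reactome names, and the Jaccard index is an added summary measure.

```python
def compare_pathway_sets(reference, other):
    """Summarize how a second pathway list differs from a reference list."""
    reference, other = set(reference), set(other)
    conserved = reference & other
    union = reference | other
    return {
        "conserved": sorted(conserved),          # present in both lists
        "absent": sorted(reference - other),     # lost relative to the reference
        "added": sorted(other - reference),      # new relative to the reference
        # Jaccard index: shared pathways divided by all distinct pathways.
        "jaccard": len(conserved) / len(union) if union else 1.0,
    }


# Placeholder labels reproducing the counts quoted for E-GEOD-18842:
# 16 pathways after rma, of which 12 are kept after gcrma, plus 7 added.
shared = {f"shared_{i}" for i in range(12)}
rma_pathways = shared | {f"rma_only_{i}" for i in range(4)}
gcrma_pathways = shared | {f"gcrma_only_{i}" for i in range(7)}

summary = compare_pathway_sets(rma_pathways, gcrma_pathways)
print(len(summary["conserved"]), len(summary["absent"]), len(summary["added"]))
print(round(summary["jaccard"], 2))
```

With these counts the Jaccard index is 12/23, about 0.52. A value close to 1 would correspond to the near-identical results seen for rma and farms on E-GEOD-18842, and a low value to the poor agreement noted for E-GEOD-6044.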