PMC:4105829

Analyzing networks of phenotypes in complex diseases: methodology and applications in COPD
Abstract
Background
The investigation of complex disease heterogeneity has been challenging. Here, we introduce a network-based approach, using partial correlations, that analyzes the relationships among multiple disease-related phenotypes.
Results
We applied this method to two large, well-characterized studies of chronic obstructive pulmonary disease (COPD). We also examined the associations between these COPD phenotypic networks and other factors, including case-control status, disease severity, and genetic variants. Using these phenotypic networks, we have detected novel relationships between phenotypes that would not have been observed using traditional epidemiological approaches.
Conclusion
Phenotypic network analysis of complex diseases could provide novel insights into disease susceptibility, disease severity, and genetic mechanisms.
Background
Complex diseases like diabetes, stroke, many types of cancer, and chronic obstructive pulmonary disease (COPD) are likely heterogeneous syndromes composed of multiple disease subtypes that manifest a similar pathological or physiological outcome. These subtypes may have different genetic determinants. In order to understand this heterogeneity, a variety of clinical, physiological, imaging, pathological, and biochemical disease-related phenotypes have been analyzed [1]. In standard clinical epidemiological approaches, univariate and multivariate regression analyses are performed to determine significant and independent predictors of disease development. However, the available disease-related phenotypes may be crude assessments of disease pathophysiology; any analyses that are performed may be confounded by grouping multiple subtypes together. The challenge we face, in part, is deconvoluting these disease-related phenotypes and defining their relationships to one another and to specific genetic determinants.
Network analysis has the potential to provide a holistic approach to the understanding of disease complexity, rather than focusing on individual components of disease [2]. Network approaches can capture emergent properties that are not apparent when network components are analyzed in a pair-wise manner. However, network medicine approaches to complex diseases have largely focused on relating a disease to the underlying cellular and molecular interaction network [3]. Correlation-based networks have been frequently used to analyze gene expression data [4,5], but these methods have not been widely applied to the study of disease-related phenotypes. Barabási and colleagues [6] used diagnostic coding data to assess phenotypic network relationships between different disease categories, but not to analyze multiple quantitative phenotypes within one complex disease. Using COPD as an example, we describe the application of network inference methods to explore the relationships between disease-related phenotypes that have been found to be relevant in determining disease severity and outcome, and, ultimately, to begin to define the complex heterogeneity of the disease.
Methods
Network inference and comparison
To infer phenotypic networks, we used the Gaussian graphical model (GGM) introduced by [7] and [8]. Briefly, the model, which is based on the assumption that the variables have Gaussian distributions, infers the connection between each pair of variables and creates a phenotypic network based on partial correlations. Assume that we have P phenotype variables and K subjects. We begin by constructing a P×K matrix, Y, where we assume that the elements of Y follow a multivariate normal distribution:
$$Y_i = (y_{1i}, \ldots, y_{Pi})^T \sim N_P(\mu_Y, \Sigma_Y), \quad i = 1, \ldots, K.$$
Here, $y_{ji}$ represents the jth phenotype variable in the ith subject, $\mu_Y$ is the mean vector and $\Sigma_Y$ is the covariance matrix.
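As an illustration of how partial correlations follow from a Gaussian model of this kind, the minimal Python sketch below inverts the sample covariance matrix and rescales the result. It is not the estimation procedure used in this study: the function name `partial_correlations`, the simulated three-variable chain, and the use of the unregularized sample covariance are all assumptions made for the example.

```python
import numpy as np


def partial_correlations(Y):
    """Partial correlation matrix for a P x K data matrix (rows = variables).

    Assumption: the plain sample covariance is invertible (K much larger
    than P); no shrinkage or regularization is applied.
    """
    sigma = np.cov(Y)                # P x P sample covariance
    theta = np.linalg.inv(sigma)     # precision (inverse covariance) matrix
    d = np.sqrt(np.diag(theta))
    omega = -theta / np.outer(d, d)  # omega_jk = -theta_jk / sqrt(theta_jj * theta_kk)
    np.fill_diagonal(omega, 1.0)
    return omega


# Hypothetical simulated data: a chain y1 -> y2 -> y3, so y1 and y3 are
# marginally correlated but independent once y2 is controlled for.
rng = np.random.default_rng(0)
K = 5000
y1 = rng.normal(size=K)
y2 = y1 + rng.normal(size=K)
y3 = y2 + rng.normal(size=K)
Y = np.vstack([y1, y2, y3])

omega = partial_correlations(Y)
marginal = np.corrcoef(Y)
print("marginal corr(y1, y3):", round(float(marginal[0, 2]), 3))
print("partial corr(y1, y3): ", round(float(omega[0, 2]), 3))
```

In this toy chain the marginal correlation between y1 and y3 is clearly non-zero, while their partial correlation is close to zero, which is the distinction a partial-correlation network relies on.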
The covariance matrix $\Sigma_Y$ and the partial correlation matrix (denoted by Ω) for Y are estimated (see [9]). The partial correlation (PCOR) $\omega_{jk}$ measures the correlation between variable j and variable k while controlling for all other variables. Therefore, $\omega_{jk}$ represents the conditional dependency between variable j and variable k, with $\omega_{jk} = 0$ if the two variables are independent conditional on all other variables and $\omega_{jk} \neq 0$ if they are conditionally correlated. For each pair of variables that are conditionally dependent, the presumed causal relationship between the variables is a direct one and independent of all other variables. We assume that these partial correlations represent the hidden connections between phenotypic variables that may help to refine disease subtypes. Under the null hypothesis in which all variables are independent, Hotelling [10] gives the null distribution of sample partial correlation ω as
$$p(\omega \mid \kappa) = (1 - \omega^2)^{(\kappa - 3)/2} \, \frac{\Gamma(\kappa/2)}{\pi^{1/2} \, \Gamma((\kappa - 1)/2)},$$
where κ is the degrees of freedom (K−P+1). Therefore, we can compute the p-values for the estimated partial correlation coefficients for each pair of phenotypic variables and test for the presence of a significant connection between those variables in the phenotypic network. In addition, we can also test for differences in the network connectivity between two groups of subjects by permutation tests. For example, to test for differential connectivity between COPD cases and controls, we randomly swap the labels of cases and controls and calculate the PCORs in the shuffled groups, repeated 10,000 times, to obtain the distribution of PCORs under the null hypothesis in which the presence or absence of connections is not associated with the case-control status. The empirical p-values are reported. Analogously, we have also tested differential connectivity between different genotypes for two previously identified genome-wide significant SNPs associated with COPD using the same approach.
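The two tests described above can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the code used in this study: the helper names (`pcor_matrix`, `null_density`, `pcor_pvalue`, `permutation_pvalue`) are hypothetical, the edge p-value is obtained by numerically integrating the null density (assuming κ > 3), and the permutation statistic (the absolute difference in one PCOR between the two groups) and the (hits + 1)/(permutations + 1) convention are choices made for the example, since the text does not specify them. The demonstration uses 200 permutations on simulated data for speed, whereas the analysis described above used 10,000.

```python
import math

import numpy as np


def pcor_matrix(Y):
    """Partial correlations from a P x K matrix (sample covariance, no shrinkage)."""
    theta = np.linalg.inv(np.cov(Y))
    d = np.sqrt(np.diag(theta))
    omega = -theta / np.outer(d, d)
    np.fill_diagonal(omega, 1.0)
    return omega


def null_density(w, kappa):
    """Null density of a sample partial correlation w with kappa degrees of freedom."""
    if abs(w) >= 1.0:
        return 0.0  # boundary value, valid for kappa > 3
    log_c = (math.lgamma(kappa / 2) - 0.5 * math.log(math.pi)
             - math.lgamma((kappa - 1) / 2))
    return math.exp(log_c + (kappa - 3) / 2 * math.log1p(-w * w))


def pcor_pvalue(w, kappa, steps=20000):
    """Two-sided p-value: twice the integral of the null density from |w| to 1."""
    a = abs(w)
    if a >= 1.0:
        return 0.0
    h = (1.0 - a) / steps  # midpoint rule
    area = h * sum(null_density(a + (i + 0.5) * h, kappa) for i in range(steps))
    return min(1.0, 2.0 * area)


def permutation_pvalue(Y_a, Y_b, j, k, n_perm=1000, seed=0):
    """Empirical p-value for a difference in the (j, k) PCOR between two groups."""
    rng = np.random.default_rng(seed)
    observed = abs(pcor_matrix(Y_a)[j, k] - pcor_matrix(Y_b)[j, k])
    pooled = np.hstack([Y_a, Y_b])
    n_a = Y_a.shape[1]
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[1])  # shuffle group labels
        w_a = pcor_matrix(pooled[:, idx[:n_a]])[j, k]
        w_b = pcor_matrix(pooled[:, idx[n_a:]])[j, k]
        hits += int(abs(w_a - w_b) >= observed)
    return (hits + 1) / (n_perm + 1)


# Hypothetical simulated groups: variables 0 and 1 are linked in group A only.
rng = np.random.default_rng(1)
n = 300
a0 = rng.normal(size=n)
Y_a = np.vstack([a0, a0 + rng.normal(size=n), rng.normal(size=n)])
Y_b = rng.normal(size=(3, n))

p_edge = pcor_pvalue(pcor_matrix(Y_a)[0, 1], kappa=n - 3 + 1)  # kappa = K - P + 1
p_diff = permutation_pvalue(Y_a, Y_b, 0, 1, n_perm=200)
print("edge p-value in group A:", p_edge)
print("differential connectivity p-value:", p_diff)
```

Adding one to both the numerator and the denominator keeps the empirical p-value strictly positive, so its smallest attainable value is set by the number of permutations.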
Opgen-Rhein and Strimmer [11] have extended the GGM method to infer the directionality of the edges between each pair of variables. They proposed a test of directionality based on the log-ratios of standardized partial variances. This method enables identification of a “partially directed graph” where some of the significant edges identified by GGM methods will have directions, which might imply causality, while other edges remain undirected.
Study populations and phenotypic variable selection
COPD is a disease defined by abnormal physiology, with chronic airflow obstruction as the common, key feature [12]. Chronic airflow obstruction is characterized by reductions in the forced expiratory volume in one second (FEV1) and in the ratio of the FEV1 to the forced vital capacity (FVC), which are assessed by spirometry. Clinical epidemiological studies have identified multiple factors that contribute to COPD, including cigarette smoking (often quantified as pack-years, where an average of one pack of cigarettes smoked per day for one year is one pack-year) and increasing age. In addition, a variety of disease-related phenotypes have been studied related to imaging, exercise capacity, respiratory symptoms, and physiology. Computerized tomography (CT) imaging enables assessment of the severity and distribution of emphysema (the destruction of lung parenchyma) as well as thickening of airways [13-15]. The underlying assumption in our analysis is that these phenotypic variables are not independent, but, rather, interact to define distinct groups of patients (subtypes). By defining these subtypes, we might better be able to classify patients, understand their unique disease characteristics, and ultimately direct them to appropriate therapies.
The COPDGene Study [16] is a multi-center genetic and epidemiologic investigation to study COPD and other smoking-related lung diseases.
In this study, 10,192 smokers (including 6,784 non-Hispanic Whites (NHW) and 3,408 African-Americans (AA)) have completed a detailed protocol, including questionnaires, pre- and post-bronchodilator spirometry, high-resolution CT scanning of the chest, exercise capacity (assessed by six minute walk distance), and blood samples for genotyping. Samples were genotyped using the Illumina OmniExpress platform, which assayed genetic polymorphisms at over 700,000 sites along the genome; the genotype data have gone through standard quality-control procedures for genome-wide association analysis. Briefly, a total of 221 subjects and 83,423 markers were excluded for quality control reasons, including identity-by-descent, gender mismatches, genotype missingness, Hardy-Weinberg disequilibrium in controls, and low minor allele frequency. The details of the quality control procedures are available at http://www.copdgene.org/sites/default/files/GWAS_QC_Methodology_20121115.pdf.
For phenotypic network analysis, we selected 10 key quantitative COPD-related phenotypes based on clinical experts’ opinions (co-authors EKS, CPH, and MHC). The phenotypes were chosen to represent major disease-related components, including imaging, physiology, exercise capacity, and exacerbations, as well as important demographic variables (Table 1). Although over 300 variables were captured by questionnaires, clinical assessments, and CT scanning in COPDGene, we chose phenotypes to avoid duplicate assessment of the same aspect of the disease (e.g., lung function, emphysema severity, and airway wall thickness). For example, we included FEV1 but excluded FEV1/FVC, as they are both lung function phenotypes which assess airflow obstruction. Subjects with missing data in any of the 10 quantitative variables were excluded. Therefore, a complete set of 8,141 subjects was used in the following analyses, including 5,478 NHWs and 2,514 AAs.
Case subjects were defined by FEV1 <80% predicted and FEV1/FVC <0.7, while control subjects were defined by FEV1 ≥80% predicted and FEV1/FVC ≥0.7. In addition to assessment based on case-control status, we compared groups of subjects homozygous for risk- and non-risk alleles at known GWAS SNPs, excluding heterozygotes from the genotype-stratified phenotypic networks to maximize phenotypic effects. To assess the impact of including phenotypic variables that are not closely related to COPD on our phenotypic networks, we also created networks including heart rate and systolic blood pressure as well as networks including two randomly generated variables.
Table 1 Description of phenotypic variables
Variables (abbreviation) | Descriptions/Comments
FEV1 (% predicted FEV1) | Observed FEV1 (liters)/predicted FEV1 (liters), with predicted values from Hankinson reference equations
Emphysema (Emph) | % Emphysema at -950 Hounsfield units (HU)
Emphysema Distribution (EmphDist) | Log ratio of emphysema at -950 HU in the upper 1/3 of lung fields compared to the lower 1/3 of lung fields
Gas Trapping (GasTrap) | Air trapping at -856 HU on expiratory chest CT scan
Airway Wall Area (Pi10) | Square root of the wall area of a hypothetical 10 mm internal perimeter airway
Exacerbation frequency (ExacerFreq) | Number of COPD exacerbations during the year before study enrollment
Six minute walk distance (6MWD) | Measure of exercise capacity
BMI | Body Mass Index
Age | In years
Pack-Years (PackYear) | One pack-year is defined as smoking one pack (20 cigarettes) per day for one year
Evaluation of COPD Longitudinally to Identify Predictive Surrogate Endpoints (ECLIPSE, [17]) is a large longitudinal study of COPD patients and controls with comprehensive phenotyping similar to COPDGene. Therefore, we used a subset of 1,705 COPD cases (including 1,667 white subjects) with complete data for the 8 available quantitative variables at their baseline study visit to build phenotypic networks.
All variables in Table 1 were available in ECLIPSE, except for Emphysema Distribution and Gas Trapping. Therefore, networks with 8 variables were built for both COPDGene and ECLIPSE for comparison.
Results
Whole population phenotypic network in COPDGene
The ten selected COPD-related phenotypes in COPDGene were found to be highly connected in the whole study population. Out of 45 pairs of phenotypes, 37 had significant PCORs with p-values <0.05, and 29 pairs were significant with p-values <0.001 (density = 64.44%, where the density of a network is defined as the proportion of all possible connections in a network that are actual connections, see Figure 1 and Table 2). The most highly connected nodes were FEV1 and Gas Trapping (see Figure 1), with Gas Trapping significantly connected with all of the analyzed phenotypes. In addition, the 16 pairs that were not directly connected (p-values >0.001) were each connected through only one intermediate node based on shortest path analysis [18]. The majority of shortest paths connected through gas trapping (9 out of 16), suggesting that gas trapping is a “hub” in the phenotypic network. This finding is consistent with the high correlation observed between CT gas trapping and spirometric measures [19], and also with the observation that CT gas trapping encompasses the two major pathological processes in COPD: emphysema and small airway disease.
Most edges in this whole population network remained statistically significant after we stratified by race, although the NHW network edges were slightly more significant than those of the AA network, likely owing to the larger sample size and greater power. FEV1 and Gas Trapping remained highly connected in the race-stratified networks. The top four pairs (CT Emphysema/Gas Trapping, FEV1/Gas Trapping, FEV1/Pi10, and Gas Trapping/Age) all stayed consistently top-ranked for the whole population and race-stratified networks and were all highly significant (see Table 2).
Figure 1 Whole population network (N = 8,141).
Undirected edges denote partial correlation coefficients that were significant at p<0.001.
Table 2 Edges of whole population network with p-values < 0.001
Node 1 | Node 2 | Whole population P-value | Whole population PCOR | NHW P-value | NHW PCOR | AA P-value | AA PCOR
Emphysema | Gas Trapping
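The network density and the shortest-path "hub" count reported for the whole population network can be illustrated with the short Python sketch below. It is a sketch under stated assumptions: the edge list is a hypothetical toy network, not the edges of Table 2, and the helper names `density` and `shortest_path` are introduced only for this example. The density calculation uses the counts given in the text (29 significant edges among 10 phenotypes).

```python
from collections import deque
from itertools import combinations


def density(n_nodes, n_edges):
    """Fraction of all possible undirected edges that are present."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


def shortest_path(adj, src, dst):
    """Breadth-first shortest path in an unweighted graph (None if unreachable)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nxt in sorted(adj[node]):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None


# 29 edges at p < 0.001 among 10 phenotypes (45 possible pairs), as in the text.
print(round(100 * density(10, 29), 2))  # 64.44

# Hypothetical toy edge list for illustration only (NOT the Table 2 edges).
edges = [("GasTrap", "FEV1"), ("GasTrap", "Emph"),
         ("GasTrap", "Age"), ("FEV1", "Pi10")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# For each pair without a direct edge, record the single intermediate node
# when the shortest path has exactly one (path length of three nodes).
hub_counts = {}
for a, b in combinations(sorted(adj), 2):
    if b in adj[a]:
        continue
    path = shortest_path(adj, a, b)
    if path is not None and len(path) == 3:
        hub_counts[path[1]] = hub_counts.get(path[1], 0) + 1

print(hub_counts)
```

When several shortest paths of equal length exist, a breadth-first search returns only one of them, so hub counts on a real network depend on how such ties are resolved.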
