> top > docs > PMC:4996402 > spans > 3976-3977

PMC:4996402 / 3976-3977 JSONTXT

An Overview of NCA-Based Algorithms for Transcriptional Regulatory Network Inference Abstract In systems biology, the regulation of gene expressions involves a complex network of regulators. Transcription factors (TFs) represent an important component of this network: they are proteins that control which genes are turned on or off in the genome by binding to specific DNA sequences. Transcription regulatory networks (TRNs) describe gene expressions as a function of regulatory inputs specified by interactions between proteins and DNA. A complete understanding of TRNs helps to predict a variety of biological processes and to diagnose, characterize and eventually develop more efficient therapies. Recent advances in biological high-throughput technologies, such as DNA microarray data and next-generation sequence (NGS) data, have made the inference of transcription factor activities (TFAs) and TF-gene regulations possible. Network component analysis (NCA) represents an efficient computational framework for TRN inference from the information provided by microarrays, ChIP-on-chip and the prior information about TF-gene regulation. However, NCA suffers from several shortcomings. Recently, several algorithms based on the NCA framework have been proposed to overcome these shortcomings. This paper first overviews the computational principles behind NCA, and then, it surveys the state-of-the-art NCA-based algorithms proposed in the literature for TRN reconstruction. 1. Introduction For every soccer team, the coach is responsible for directing the team to victory. The primary aim is to score as many goals as possible and, at the same time, thwart the other team from doing the same. The coach may choose some players over others. Even among team members, some players attack, others defend, whereas some are good as half-back players. Moreover, the core players that form the playing team do not remain the same throughout the game. Keeping in mind the dynamics of the game, the coach may direct some players to replace others, ensuring the primary aim of the game, to win, remains intact. In close comparison with this framework, the cell does not operate very differently. The coach of the cell, the DNA within the nucleus, directs different team members, TF and genes, to execute cellular functions and complex biological processes, which help the cell to adapt to varying dynamics, including external stimuli, as well as internal changes. The team members, TFs and genes, work together to express or suppress different metabolic pathways at different instances of the cell’s life. Particularly, these TFs contain DNA binding domains that allow them to bind to specific regions of DNA, called promoters [1]. By binding to these promoters, TFs initiate the process of converting genes into proteins. Transcription factor activities (TFAs) refer not only to the connectivity of any particular TF, but also to its level of activity. The connectivity of a particular TF informs its team members to collaborate in order to regulate RNA polymerase, which in its turn controls in terms of expressing or suppressing genes. TFAs cannot be measured directly; rather, they can be inferred from gene expression data. Furthermore, TRN represent interactions between genes and TFAs within a cell and offer a global perspective in the cellular behavior. Understanding the structure of TRNs and estimating TFAs provide insight into the cellular dynamics present in healthy and diseased tissues and organs and hold the potential to help in diagnosing, characterizing and determining cures for various diseases [2]. In the literature, several computational frameworks have been proposed to analyze regulatory interactions, which are briefly summarized below. The first class models the TRN as a dynamic system. Particularly, [3] and [4] describe gene expression as a linear and continuous time first-order differential equation. On the other hand, Boolean network models [5,6] quantize gene expressions by only two discrete levels: ON and OFF. The expression level of each gene is the Boolean function of the expression levels of other genes. These methods are generally performed using a small number of time series data and, thus, lead to an under-determined problem [7]. Another approach for TRN reconstruction is referred to as the co-expression (or relevance) networks, in which two genes are connected if the similarity between them exceeds a predefined threshold. Examples of similarity measures used in constructing relevance networks include correlation [8] and mutual information [9,10]. Relevance networks are helpful to understand the fundamental topological features of biological networks, but they do not infer causal relations among genes. The algorithms falling into the third category are commonly described as probabilistic graphical models [11,12,13], which include Gaussian graphical models (GGMs) and Bayesian networks (BNs). In GGMs, the network or graph is constructed based on the notion of conditional independence, and two genes are connected if and only if they are independent given the expression levels of all other genes. GGMs are formulated using undirected graphs and represent an example of full conditional models, since the conditional dependency is considered with respect to all other genes. On the other hand, BNs entail directed acyclic graphs, and the conditional dependency is measured with respect to all subsets of the other genes [11,14]. One limitation of probabilistic graphic models is that they have strong assumptions on the joint distribution that prevent representing or interpreting some biological relationships. For example, cyclic graphs are not allowed in the BN framework. In this way, it ignores self-feedback loops among genes that are natural features in genetic networks. Additionally, the applications of probabilistic graphic models are generally limited to the network with the number of experimental measurements significantly larger than the number of genes, since analyzing the structure of large-scaled genetic networks using probabilistic graphic models is highly complex. Besides dynamic models, co-expression networks and probabilistic graphical models, structural equation modeling (SEM) also represents a widely-used technique for TRN inference [15,16]. Generally, an SEM consists of a structural model and a measurement model. The structural model describes the causal relations between the latent variables, while the measurement model depicts the relations between latent variables and observed measurements. Recently, studies dedicated to TRN inference using the network component analysis (NCA) technique have begun to emerge in the literature [17]. NCA establishes a parameter estimation problem and reconstructs TRNs following a statistical signal processing viewpoint. Since NCA-based algorithms do not require time series data, they can collect the experimental data from different time intervals and combine them to increase the samples size and prevent the under-determination problem. Even with a limited number of experiments, NCA-based algorithms are still able to reconstruct TRNs with a large number of TFs and genes (See Section 3.3 for more details). Moreover, NCA-based algorithms take advantage of some prior knowledge about the connectivity patterns of the genetic network, which is becoming available via high-throughput experiments [18] or data mining of interaction information [19,20,21]. The assumed mathematical model for NCA is represented by the following system of linear equations [17]:(1) X=AS+Γ where X∈RN×K represents the log ratios of expression values of N genes at K time points of the microarray dataset, A∈RN×M denotes the connectivity strength between N genes and M TFs, S∈RM×K stands for the activities of M TFs at K time points and Γ∈RN×K represents the measurement noise. Examples of two TRNs with six genes and four TFs, but different connectivity topologies, are shown in Figure 1. Generally, in Equation (1), X cannot be uniquely decomposed as the product of two matrices A and S, unless further constraints are imposed. Principal component analysis (PCA) [22] and independent component analysis (ICA) [23] represent two conventional statistical algorithms that can provide valid solutions provided that the input signals present in S are independent and/or orthogonal. However, such an assumption generally does not hold for biological signals in practice. Accounting for this fact, Liao et al. [17] proposes NCA, which incorporates the prior information about TF-gene regulation, to infer TRNs. As will be discussed in detail in Section 2, NCA is an iterative computational algorithm that ensures the uniqueness of decomposition solutions. Due to some drawbacks of NCA, such as the stringent conditions required to apply NCA, several alternative NCA-based algorithms have been proposed in the literature to improve NCA from different perspectives, such as less restrictive assumptions, lower computational complexity and higher robustness against noise, outliers and modeling errors. Figure 1 Examples of two transcription regulatory networks (TRNs) with six genes and four transcription factors (TFs), but different connectivity topologies. The rest of the paper, which proposes to provide a review of the major algorithms reported for NCA, is organized as follows. Section 2 introduces the NCA framework and the mathematical details of the NCA algorithm. Extensions of NCA are presented in Section 3. These extensions still rely on the NCA algorithm, but improve the applicability range of NCA by requiring less stringent assumptions. In Section 4, alternative NCA-based algorithms proposed in the literature for TRN inference are surveyed. A few illustrative computer simulation results highlighting the performance of major NCA algorithms are presented in Section 5. In addition, the comparison of these algorithms and some recommendations on how to choose the appropriate algorithm are discussed in Section 6 based on the simulation results in Section 5. Finally, Section 7 summarizes the content of this paper. 2. NCA In the case when both matrices A and S are unknown, the decomposition problem in Equation (1) admits an infinite number of solutions. Fortunately, prior information is becoming available for many biological systems, e.g., ChIP-on-chip (ChIP-on-chip (also known as ChIP-chip) represents a technology that combines chromatin immunoprecipitation (“ChIP”) with a DNA microarray (“chip”)) data indicate whether a certain gene interacts with a certain TF. This prior information is incorporated within NCA mathematically via the constraint A(I)=0, where I presents the indices of zero elements in the connectivity matrix A, indicating a certain level of connectivity information. NCA requires three identification criteria to ensure a unique solution up to a scalar ambiguity:1 The connectivity matrix A must be full-column rank.2 If a column of A is removed along with all of the rows corresponding to the nonzero entries of the removed column, the remaining matrix must still be full-column rank.3 The TFA matrix S must have full row rank. To test whether the system meets the above-mentioned first two criteria, matrix A must be first initialized based on the prior knowledge available about connectivity. Specifically, aij is assigned to zero if (i,j)∈I, and it assumes any arbitrary nonzero value otherwise. Once A is initialized, matrix A is tested to see if it presents a full-column rank. Then, we sequentially remove each column of A, as well as the genes connected to the removed TF and test whether the remaining reduced matrix still presents full-column rank. Consider TRNs in Figure 1 as an example. The initialized connectivity matrices for Figure 1a,b are illustrated in Figure 2a,b, respectively. The initialized connectivity matrix in Figure 2a is not identifiable, since the reduced matrix obtained by removing the first column along with the first, third and fifth rows is not full-column rank. This condition violates the second criterion of NCA. The initialized connectivity matrix in Figure 2b, on the other hand, satisfies all three identification criteria. In terms of the third criterion, it cannot be tested a priori, but it implies that the number of TFs must be less than or equal to the number of time points, i.e., M≤K. This rank criterion is verified after S is simulated using NCA [17]. Figure 2 An example of (a) a non-identifiable pattern and (b) an identifiable pattern. NCA aims to solve the following optimization problem:(2) minA,S ||X−AS||F2,s.t. A(I)=0, where ||·||F denotes the Frobenius norm. NCA employs an alternate least-squares (ALS) approach to iteratively update A and S. At iteration j, given A(j−1), i.e., the value of A at iteration (j−1), the estimate of S(j) is obtained by solving the following least-squares (LS) problem:(3) S(j)=argminS||X−A(j−1)S||F2s.t.  si,j(l)≤si,j≤si,j(u), where the constraint is included to ensure that the elements of S remain in the domain of biologically-sensitive values [17]. The optimization problem Equation (3) can be solved by standard convex optimization tools, such as the interior point method [24]. Once S(j) is obtained, the next step is to update A(j) via the following optimization problem:(4) A(j)=argminA||X−AS(j)||F2s.t.  A(I)=0, ai,j(l)≤ai,j≤ai,j(u), where the constraint ai,j(l)≤ai,j≤ai,j(u) is also used to constrain the domain of A. Particularly, eliminating the zero elements in A removes the connectivity constraint A(I)=0. This leads to a new least-squares problem with a lesser number of variables, which can also be solved using the same method employed to solve Equation (3). If the decrease in the total least-squares error after updating A is above a preset value e, the algorithm keeps running. Otherwise, it stops. A diagram illustrating the operation of the NCA is shown in Figure 3. Simulation results in [17] demonstrate that NCA was successfully applied to the microarray data generated from yeast Saccharomyces cerevisiae, and the activities of various TFs during the cell cycle were reconstructed. Figure 3 Network component analysis (NCA) algorithm. 3. Extensions of NCA Despite its successful implementation in yeast data, NCA exhibits several shortcomings, which prevent its application to a wide class of regulatory network inference problems. In the literature, several papers have been proposed to tackle these issues. In this section, we focus on several improvements for NCA proposed recently in the literature. In these works, the core estimation methods are identical to NCA, but some enhancements have been implemented to make the NCA algorithm more applicable to various setups. 3.1. Motif-Directed NCA In the original NCA work [17], the prior information about the connectivity matrix, i.e., A(I), is provided by high-throughput experiments. However, the high-throughput ChIP-on-chip data are not available for some common species, such as rodents and humans [25]. With respect to this fact, Wang et al. [25] proposed a motif-directed NCA (mNCA) algorithm, which incorporates the motif information to obtain the prior network structure information and to infer TRNs. Due to the fact that the regulation between TFs and genes occurs only after TFs bind to the DNA sequence motifs in the gene’s promoter region [25], the authors incorporate the motif information to recover the interaction between TFs and genes. Moreover, since the prior topology information, either from ChIP-on-chip data or motif analysis, comes from biological experiments, it may contain many false positives/negatives. Thus, a stability analysis is further proposed in [25] to extract stable TFAs from the NCA algorithm. Specifically, the authors of [25] intentionally perturb the connectivity information and use the Pearson correlation coefficient as a stability measurement to determine whether the estimated TFAs are stable or not. Experimental results on muscle regeneration microarray data demonstrate that mNCA is able to reveal important TFAs, as well as their connectivity strength to corresponding genes. 3.2. Generalized NCA The work in [26] proposed the generalized NCA (gNCA) in an attempt to improve the NCA criteria. gNCA extends the system identification criteria required by NCA by additionally incorporating the prior information about regulatory matrix S, such as the regulatory information obtained from regulatory gene knockouts (a gene knockout (KO) refers to a genetic technique through which one or more genes from an organism are made inoperative (“knocked out”)) [26]. Thus, for the gNCA criteria to guarantee a unique decomposition solution, they require a full column rank condition for A, a full row rank condition for S and an additional condition that preserves the essential features of A and S. In this way, given the topology information about S, the uniqueness of the decomposition problem might still be ensured by alternatively checking the gNCA criteria, even if the connectivity structure of A does not satisfy the NCA criteria. Even when the connectivity topology satisfies the NCA criteria, gNCA reduces the number of parameters to be estimated by combining the prior information about S. 3.3. Revised NCA The work in [27] also focuses on enhancing the NCA criteria. The work in [27] proposed revised NCA (NCAr), where the third criterion of NCA is revised to improve the applicability of NCA. As discussed earlier, to ensure a unique solution for the matrix factorization problem, the third criterion of NCA requires the matrix S to have full row rank, which implies that the number of TFs must be less than or equal to the number of experiments. This requirement significantly limits the sample size of TFs. The work in [27] revises the third criterion of NCA based on the observation that most of the genes are only regulated by a smaller number of TFs than the total number of TFs (i.e., the connectivity matrix A is row-wise sparse). In particular, this condition, instead of being associated with the rank properties of matrix S, is related to the rank properties of reduced-size matrices. Particularly, it requires that the number of experiments for each gene be greater than or equal to the number of TFs regulating that gene. The revised criterion enables NCA to be applicable to a wider class of TRN inference problems, since the number of TFs regulating a gene is generally less than five or six [27]. In this way, a large dimensional regulatory network can be uniquely inferred, even in the presence of a limited number of experiments. 3.4. Generalized-Framework NCA The original NCA work requires the biological system to satisfy all three criteria to ensure a unique decomposition up to a scaling factor. However, NCA only checks the compliance for the initialized matrix A. It may occur that the derived matrix A at certain iterations violates the NCA criteria. The work in [28] generalizes the NCA criteria, such that the system identification can be determined directly from the connectivity (topology) information, rather than checking the rank properties of the unknown connectivity matrix A. In other words, if a certain connectivity topology, i.e., A(I0), meets the newly-derived conditions, then all matrices A∈A(I0) satisfy the first and second criterion of NCA, and thus, they guarantee the feasibility of A during each iteration of the NCA algorithm. To deal with the issue that the connectivity topology does not satisfy the newly-derived conditions or the TF matrix does not satisfy the third criterion of NCA (for example, when M>K, the linear independence of TFs is violated), the authors in [28] alternatively seek to infer subnetworks by removing the selected TF node together with all of its associated genes until all of the system identification criteria of the reduced subnetwork are verified. The resulting algorithm is referred to as generalized-framework NCA (gfNCA) [28]. 4. Alternative NCA-Based Algorithms In this section, we review some alternative NCA-based algorithms that were also recently reported in the literature. Different from the algorithms discussed in Section 3, where mNCA, gNCA, NCAr and gfNCA utilize the NCA algorithm to infer TRNs, the algorithms discussed in this section focus on designing more efficient algorithms to estimate the matrices A and S in the NCA system model Equation (1). These algorithms can be roughly classified into two classes, namely the iterative and the non-iterative class. 4.1. Iterative NCA Algorithms As described in Section 2, NCA adopts the ALS approach to iteratively update matrices A and S. Therefore, NCA, along with all of the algorithms that employ ALS, such as mNCA, gNCA, NCAr and gfNCA, is an iterative method. Another example of iterative methods, referred to as robust NCA (ROBNCA), is reviewed next. Robust NCA ROBNCA [29] is a robust NCA-based approach that tries to cope with the possible noise and outliers present in the microarray data due to erroneous measurements and/or the abnormal response of genes [30]. To counteract the presence of outliers, the system model of TRNs is formulated as:(5) X=AS+O+Γ where matrix O models explicitly the presence of outliers. Since typically, only a few outliers exist, the outlier matrix O represents a column-sparse matrix. Accounting for the sparsity of matrix O, ROBNCA aims to solve the following optimization problem:(6) {A^,S^,O^}=argminA,S,O||X−AS−O||F2+λ0||O||0s.t.   A(I)=0, where ||O||0 denotes the number of nonzero columns in O and λ0 is a penalization parameter used to control the extent of sparsity of O. Due to the intractability and high complexity of computing the l0-norm-based optimization problem, the problem Equation (6) is relaxed to:(7) {A^,S^,O^}=argminA,S,O||X−AS−O||F2+λ2||O||2,cs.t.   A(I)=0 where ||O||2,c stands for the column-wise l2-norm sum of O, i.e., ||O||2,c=∑k=1K||ok||2, where ok denotes the k-th column of O. Since the optimization problem Equation (7) is not jointly convex with respect to {A,S,O}, an iterative algorithm is performed in [29] to optimize Equation (7) with respect to one parameter at a time. Towards this end, the ROBNCA algorithm at iteration j assumes that the values of A and O from iteration (j−1), i.e., A(j−1) and O(j−1), are known. Defining Y(j)=X−O(j−1), the update of S(j) can be calculated by carrying out the optimization problem:S(j)=argminS||Y(j)−A(j−1)S||F2 which admits a closed-form solution. The next step of ROBNCA at iteration j is to update A(j) while fixing O and S to O(j−1) and S(j), respectively. This can be performed via the following optimization problem:(8) A(j)=argminA||Y(j)−AS(j)||F2.s.t.   A(I)=0 The problem Equation (8) was also considered in the original NCA paper [17] in which a closed-form solution was not provided. Since this optimization problem has to be conducted at each iteration, a closed-form solution is derived in ROBNCA using the re-parameterization of variables and the Karush–Kuhn–Tucker (KKT) conditions to reduce the computational complexity and improve the convergence speed of the original NCA algorithm. In the last step, the iterative algorithm estimates the outlier matrix O by using the iterates A(j) and S(j) obtained in the previous steps, i.e., (9) O(j)=argminok||C(j)−O||22+λ2||O||2,c where C(j)=X−A(j)S(j). The solution to Equation (9) is obtained by using standard convex optimization techniques, and it can be expressed in a closed form. It can be observed that at each iteration, the updates of matrices A, S and O all assume a closed-form expression, and it is this aspect that significantly reduces the computational complexity of ROBNCA when compared to the original NCA algorithm. In addition, the term λ2||O||2,c guarantees the robustness of the ROBNCA algorithm against outliers. Simulation results in [29] also show that ROBNCA estimates TFAs and the TF-gene connectivity matrix with a much higher accuracy in terms of normalized mean square error than FastNCA [31] and non-iterative NCA (NINCA) [32], irrespective of varying noise, the level of correlation and outliers. 4.2. Non-Iterative NCA Algorithms This section presents four fundamental non-iterative methods, namely, fast NCA (FastNCA) [31], positive NCA (PosNCA) [33], non-negative NCA (nnNCA) [34] and non-iterative NCA (NINCA) [32]. These algorithms employ the subspace separation principle (SSP) and overcome some drawbacks of the existing iterative NCA algorithms. FastNCA utilizes SSP to preprocess the noise in gene expression data and to estimate the required orthogonal projection matrices. On the other hand, in PosNCA, nnNCA and NINCA, the subspace separation principle is adopted to reformulate the estimation of the connectivity matrix as a convex optimization problem. This convex formulation provides the following benefits: (i) it ensures a global solution; (ii) it allows usage of efficient convex programming techniques, like the interior point method [24]; and (iii) it offers the flexibility of adding additional convex constraints. Since SSP represents the core technique of these non-iterative NCA-based algorithms, this important concept is first explained in the next subsection. 4.2.1. Subspace Separation Principle Assume matrix X is decomposed into the sum of two other matrices X=B+Γ, where X∈RN×K(K

Document structure show

projects that have annotations to this span

Unselected / annnotation Selected / annnotation
2_test (1)