Briefings

PMC:6585387 JSON TXT 3 Projects

Systems Bioinformatics: increasing precision of computational diagnostics and therapeutics through network-based approaches Abstract Abstract Systems Bioinformatics is a relatively new approach, which lies in the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels using a bottom-up approach as in systems biology with a data-driven top-down approach as in bioinformatics. The advent of omics technologies has provided the stepping-stone for the emergence of Systems Bioinformatics. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Systems Bioinformatics is the framework in which systems approaches are applied to such data, setting the level of resolution as well as the boundary of the system of interest and studying the emerging properties of the system as a whole rather than the sum of the properties derived from the system’s individual components. A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers. Here, we provide evidence on how Systems Bioinformatics enhances computational therapeutics and diagnostics, hence paving the way to precision medicine. The aim of this review is to familiarize the reader with the emerging field of Systems Bioinformatics and to provide a comprehensive overview of its current state-of-the-art methods and technologies. Moreover, we provide examples of success stories and case studies that utilize such methods and tools to significantly advance research in the fields of systems biology and systems medicine. Introduction Biological data, either as large-scale omics or as classical biodata, are the footprints of biological mechanisms. These mechanisms consist of numerous synergistic effects emerging from various systems of interwoven biomolecules, cells and tissues. Therefore, it is necessary to explore them with a systemic approach to reveal the behaviour of the system as a whole rather than as the sum of its parts. Systems biology provides a holistic perspective on biological mechanisms via the integration of information and knowledge from multiple interdisciplinary fields (such as biology, chemistry, mathematics, computer science and physics). It aims to elucidate synergistic relationships between multiple factors in contrast to representing them as single entities and can lead to the generation of complex molecular networks of interactions modelled by computational or mathematical approaches. Systems biology harnesses its power from technological advances in the field of ‘omics’ and the advent of next-generation sequencing. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Bioinformatics and computational biology have made significant breakthroughs towards the analysis and interpretation of the data obtained from the above-mentioned omics technologies. The sheer size of data generated by these high-throughput methodologies, coupled with the need to analyse, integrate and concurrently interpret this avalanche of information in a systemic way, has paved the way to the upcoming field of Systems Bioinformatics. Systems Bioinformatics is a relatively new approach, which lies in the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels using a bottom-up approach—as adopted in systems biology—with a data-driven top-down approach used in bioinformatics. The bottom-up approach in systems biology typically brings together information from molecular cells and tissues in the framework of mathematical models to generate insights on the function and dynamic behaviour of cells, organs and organisms. The top-down approach uses bioinformatics methods to extract and analyse information from ‘omics’ data generated through high-throughput techniques. Initially, informatics approaches to systems biology focused mainly on modelling and simulation. However, owing to the lack of sufficient experimental data, these methods fell short in building reliable models. Subsequently, the explosion of multilevel data generation brought a plethora of new methods, tools and solutions capable of studying systemic properties. The application of systemic approaches such as information theory, statistical inference, probabilistic models, graph theory and further network science approaches in the analysis of biological data paved the way to the creation of a distinct field, namely, Systems Bioinformatics. Depending on the availability, the quality and the comprehensiveness of the data, Systems Bioinformatics’ methods contribute significant benefit in narrowing down the gap between genotype to phenotype as well as providing additional information regarding biomarker and drug discovery. These methods are applied to classical biological data, clinical/patient data and omics data as well. They are suitable for extracting precise and personalized results, thus, facilitating systems medicine (medicine that is in bidirectional interaction with computational multiscale analysis and modelling of disease-related mechanisms) and more specifically P4 Medicine (medicine that is personal, participatory, predictive and preventive) [1, 2] (as illustrated in Figure 1). Figure 1 Systems Bioinformatics. A schematic representation of the emergence of Systems Bioinformatics as a distinct discipline among other interrelated and interdependent disciplines. The information provided by Bioinformatics, Biology and Systems Biology is integrated in the Systems Bioinformatics framework through computational integration and network-based and other holistic approaches to tackle challenges in Systems Medicine and in particular P4 Medicine. The delivery of individually adapted medical care of high precision, based on multi-source patient information across various levels and in various scales, is the basic idea of modern medicine having various appellations depending on the emphasis given (e.g. translational/systems/P4/precision/personalized medicine). The omics spectrum offers the opportunity and the challenge for multiscale and multi-source analysis towards building a comprehensive profile of the Human System (Figure 2). The major challenges faced by Systems Bioinformatics towards this demanding form of medicine are: (i) the design and development of suitable bioinformatics pipelines to provide valid and sufficient biological information from the high-throughput molecular profiles of the patient, (ii) the development of robust information systems capable for data integration, information extraction and knowledge sharing, (iii) the construction of mathematical models to predict the evolution of a particular disease, its relation with the measured markers, its tolerance/resistance to various drug families and the existing risks to the patient. These challenges can be tackled with state-of-the-art computational methodologies and techniques, such as computational intelligence, machine learning, pattern recognition and data mining, modelling and simulation, network reconstruction and visualization, complex network analysis, deep learning, text mining/semantics and association analysis. Further to these, Systems Bioinformatics serves as the framework for the development of powerful computational methods and tools to create user-friendly platforms to visualize and analyse big and heterogeneous information in the form of a network. Figure 2 Network Integration. Multiscale and multisource data generated from the Human System can be represented in network form. These networks can be further analysed and, importantly, they can be integrated forming supernetworks and building a comprehensive profile of the Human System. This review is structured in three main sections. In the first section on ‘Systems Bioinformatics’ we begin with an overview of the systems theory approach for complex biological problems. We then provide an in-depth summary of the network science approach in System Bioinformatics. We introduce certain basic network measures, which are used to analyse the components of a network, both locally and globally, and discuss the biological interpretation of such measures. We then describe in detail key biological network construction methods followed by a discussion on module-based approaches and network signatures. Finally, we describe network manipulation methods in the ‘Network controllability’ and ‘Network integration’ subsections. In the following subsection we provide a short summary regarding modelling and simulation approaches followed by a short discussion on the infrastructures and data management challenges in the field of Systems Bioinformatics. In the last two sections we provide an overview of methods and case studies with regards to the Systems Bioinformatics applications in biomarker and drug discovery. Systems Bioinformatics Systems approaches Biological data have tremendously expanded both in size and complexity. Systems Bioinformatics focuses on the investigation of such vast and complex biological systems and their within interactions using a ‘holistic’ rather than a ‘reductionist’ approach, much like the systems biology field. A holistic approach to science and the analysis and description of a complex phenomenon emphasizes the whole and the interaction of its parts, whereas the reductionist approach focuses on the fundamental parts. In fact, the debate on reductionism versus holism has its roots in ancient years. According to reductionism proponents, the optimal method to understand any science is the decomposition in smaller components. Moreover, in its greedy form, reductionism may see the whole science as physics. Even in its layered-model form, reductionism considers human/health sciences as based on biology, biology based on chemistry and chemistry based on physics. On the other hand, a strong dissent has formulated a solid antireductionism trend. This trend has either epistemological or ontological origins, supporting that complete reductionism is technically impossible and that there are emergent laws that govern the system and cannot be derived from the laws governing the components of the system. Furthermore, it is supported that each system has a ‘buffering capacity’ where many micro-states correspond to fewer macro-states of the system, making reductionism to be considered as pointless after a certain decomposition level [3]. The reductionism’s approach in biology is epitomized by molecular biology, which in the past two decades has led to the generation of a plethora of omics data. These data provide information on the building blocks of the entire organism at different scales and for different types of cells, tissues and organs. Data on DNA fragments, genes, RNA fragments, peptides, proteins and metabolites measured in short time and space intervals provide a spatiotemporal distribution of these building blocks under various states of the organism. These interwoven building blocks control and are controlled by signals in a non-linear way. As a result, the understanding of the system requires something more than simply the bottom-up assembly of the system’ components. Systems theory, which is a holistic approach, addresses the limitations from the reductionism’s point of view by considering the system as a whole, adopting a top-down approach [4]. Thus, it studies the emergent properties of the system such as homeostasis, adaptivity, tolerance, stability and modularity, through some basic overlying hierarchical principles such as entropy, positive and negative feedback control [4]. In the context of Systems approaches, the graph theory and the further science of networks have been successfully applied to the investigation of complex phenomena across a range of different scientific disciplines. The theoretical context of complex networks approach includes concepts that are derived from information theory, dynamical systems, statistical physics and topology approaches, as well as several mathematical methods suited for the analysis of the interaction of components in a complex network. In the following subsection we provide an in-depth overview of such network approaches and examples of their use in Systems Bioinformatics. Networks Biological network basics Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for the understanding of the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed as networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A). It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the analysis and extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting one node to its neighbours. The probability distribution of the degrees over the whole network is called degree distribution. In random networks, most nodes have a similar number of links and their degree distribution follows the Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law, as most of the nodes have few links and only a few nodes are densely connected [7]. Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges. Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [i.e. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to indicate essential genes (i.e. critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they rest in critical junctions between hubs and can therefore represent bridges that allow groups of nodes to cross talk to each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections that if removed will result in the complete loss of connectivity between clusters of nodes, thus affecting greatly the overall topology and, as a result, the information propagation in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. Another measure is closeness (a measure of the average length of the shortest paths from one node to other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein to be functionally relevant for several other proteins. An example illustrating how network measures, such as network-efficiency and network-clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10] where a network was constructed using the alpha bandwidth (8–13 Hz) of the electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering-coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering-coefficients of all nodes). In Blain-Morales et al. [10] network efficiency was significantly decreased and network clustering-coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers for normal recovery brain networks post general anaesthesia induction. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion). Biological network construction methods Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks that rely on experimental evidence for specific molecular interactions such as PPI networks, metabolic networks and regulatory networks (transcription factor—gene networks, non-coding RNA—gene networks) [14–18], (ii) statistically inferred networks, which are based on statistical inference that rely on interactions between components established by means of statistical analysis. Evidence-based molecular networks : The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (i.e. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes to highlight significant genes for each disease. Certain approaches utilize biological knowledge, to address many biological problems and find genes related to the disease of interest. Construction of networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases. For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. For example, in a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data of spinal nerves led to the identification of 19 biological processes to be involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. Furthermore, PDI databases include the EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and the MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised owing to lack of available data [38]. Based on this limitation, many statistical network inference methods were developed to construct statistically inferred gene networks, based on omic data from high-throughput technologies, as they provide snapshots of the transcriptome under many tested experimental conditions [39]. Statistically inferred networks: A type of statistical inference network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42]. Sources like the BioGRID [43] database, allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind the network inference methods is to search for sets of co-expressed genes. Depending on the metric that is used, these methods can be classified into three major categories [38]: (i) Mutual Information-based methods, (ii) Correlation-based methods and (iii) Tree-based methods. Mutual information-based methods calculate the mutual information values of all pairs for a given gene expression profile, and if a pair’s corresponding value is larger than a given threshold then this pair of genes is considered as linked. The resulting network is constructed based on this threshold by including a weighted edge between two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood or Relatedness Network) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. In the case of correlation-based methods, the different algorithms calculate the correlation or the partial correlation between pairs of genes. These methods are implemented through algorithms like GeneNet, a statistical learning algorithm which allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm which calculates correlations across each pair of genes. It computes an adjacency matrix using the Spearman correlation, Lasso—a shrinkage and selection method for linear regression [51]—and Adaptive Lasso—another version of Lasso modified to include penalty weights [52]. In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based methods in regression is to recursively divide the learning sample with binary tests based each on one input variable (the expression of one gene). These binary tests are optimized to minimize, in the largest amount possible, the variance of the output variable, namely, the expression of another gene from the remaining in the subsets of samples. Candidate divisions compare values from the input variable with a threshold, which is determined during the tree growing. Tree-based ensemble methods are more enhanced than single trees, as they estimate the average predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical-inference methods are used to estimate the expression pattern relationships across all pairs of genes driving to co-expression network inference. However, correlation-based methods have a tendency to be algorithmically straightforward and computationally fast, but with the limitation that assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions but they can be computationally expensive. From another point of view, tree-based methods are non-parametric and consequently, they do not need to make any assumption about the nature of the data. Tree-based methods can deal effectively with high-dimensionality data. Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships. Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. In the case of Systems Bioinformatics there are relatively limited, yet of great interest, studies that have used such an approach [87]. ‘Driving nodes’ are highly important nodes in a network that governs its controllability. Control theory shows that to direct a complex network towards a desired state, there is a minimum number of driving nodes. Determining the minimum number of driving nodes can be demanding with respect to both computational resources and time, hence novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method that optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes are sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct queues or inputs, it is possible to activate specific cellular processes that determine disease outcome (i.e. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to perform classification of proteins with respect to their importance in the network. The authors quantified the effects of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets. Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. obesity) are more likely to share commonalties in the metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90] the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights in predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel, multiple-network-framework integration for epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors of this study also showed that epigenetic modules also estimate the survival time of patients, and this factor is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section. Systems modelling and simulation To test the validity and predict the behaviour of complex biochemical systems, such as gene networks, it is often required to describe the effects of multiple, simultaneous and dynamic interactions within the components of the system that are too complex to interpret intuitively. Developing and simulating mathematical models is essential in investigating such complex biological systems. These complex systems can be further explored using mathematical models to describe the valid structure (i.e. the components of the system and their interactions based on experimental data) and identify the basic underlying principles of their function to predict behavioural responses to a certain perturbation [92]. Two types of models are commonly used to describe biological processes such as gene networks—‘quantitative’ and ‘logical’ [93, 94]. Quantitative models use differential equations to describe the non-linear dynamic interactions in a network, whereas logical models use the Boolean approach to describe dynamics in a qualitative way. Quantitative models provide precise information and can be directly compared with experiments including time-dependent data. However, they require sufficient knowledge of the mechanistic details and kinetic parameters and, thus, they are limited to applications to networks which are well characterized and are of small to moderate size. Logical models do not require such information and can be applied to large-scale networks with known structure, yet only provide limited information, as they cannot provide quantitative predictions and assist in choosing better alternative behaviours. In summary, each modelling approach has its advantages and disadvantages and recent work suggests that hybrid approaches might be optimal for challenges in systems biology (for a detailed discussion see [93]). Mathematical models are indispensable in pharmacology and diagnostics. For example, spatio-temporal mathematical models of the blood coagulation network have been developed to aid drug development and diagnostics (as extensively reviewed in [95]). Another relevant application is the use of mathematical models of drug-targeted pathways (modelled with a set of differential equations based on the mass action law) to explore drug combinations [96]. Classical bioinformatics and systems biology can complement and strengthen each other in drug discovery and therapeutics where concrete predictions are required [92]. The value of combining high-throughput data with mathematical modelling is shown, for example, in devising personalized treatments in cancer (for extensive review see [97]). Integration of multi-omic data can be used as an additional constraint in constraint-based modelling in systems biology (to optimize parameter estimation and validation). For a recent survey summarizing constraint-based metabolomic modelling and multi-omic integration methods see [98]. Infrastructures and data management It is important to highlight some of the modern computing trends that play an important role in driving research in Systems Bioinformatics and facilitate the transition to personalized medicine. The main limiting factor for research laboratories specializing in Systems Bioinformatics is computational power and resources. Significant investment is required to attain high-performance computer (HPC) servers or clusters, which have the capacity to store, manage and process the vast amount of data generated from high-throughput omics technologies. Often sheer maintenance of these machines can be a costly and a limiting factor that disallows the exploitation of the full potential of HPC. Cloud computing promises to solve major issues of system administration for these computer clusters by allowing for the exploitation of HPC, stored and managed in an expert environment, as virtual resources that are made available through the internet. Tool availability is also a major issue and having organized platforms with tools like CytoScape [99], GATK [100], BLAST [101], omics assemblers (i.e. IDBA-UD [102]) and programming languages and packages like R's Bioconductor library for expression data analysis [103], JAVA, Python, SQL and others is of major importance for scientists to facilitate dissemination of algorithms, data and results. Systems bioinformatics applications In this section we present the impact of Systems Bioinformatics on diagnostics and therapeutics by highlighting success stories and cases in a formatted manner: introductory text/data set collection/network construction/network analysis/findings and significance of research. Systems bioinformatics applications in biomarker discovery The use of networks in computational diagnostics via the detection of molecular biomarkers is one of the hallmarks of Systems Bioinformatics. Numerous recent state-of-the-art studies have made use of such networks to characterize cellular systems by simultaneously analysing thousands of genes, proteins, isoforms and complexes to address issues of computational diagnostics. Here, we highlight a few studies, showcasing the essence of networks’ contribution in precision diagnostics. A recent study [62] used a machine learning approach, which integrates topological features from PPI networks, to identify candidate AD-associated genes. Data set collection: Positive and negative data sets were collected from Entrez Gene database at the National Centre for Biotechnology Information (NCBI). The positive data set consisted of 458 genes known to be associated with AD. The negative data set consisted of the additional 55 947 Entrez genes, excluding the AD-associated genes. Network construction: Human PPI data sets were extracted from a variety of sources including Online Predicted Human Interaction Database (OPID), STRING, MINT, BIND and InTAct databases. Network analysis: By utilizing the PPI networks, the authors extracted topological features for the AD- and non-AD-associated genes. These features included nine topological properties of the PPI network for each gene, namely, the average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, degree, eccentricity, neighbourhood connectivity, topological coefficient and radiality. Findings and significance of research: The authors further combined sequence features and functional annotations features and concurrently performed feature selection using seven methods including gain-ratio-based attribute evaluation, oneR algorithm, chi-square-based selection, correlation-based selection, information gain-based attribute evaluation and relief-based selection. The most important features were fed into 11 machine learning algorithms to generate classifiers using the training data set capable of predicting AD- and non-AD-associated genes using the selected network, sequence and functional features. Methods included Naive Bayes (NB), NB Tree, Bayes Net, Decision table/NB hybrid classifier, Random Forest, J48, Functional Tree, Locally Weighted Learning (J48 + k-nearest neighbour), Logistic Regression and Support Vector Machine. Training of sophisticated machine learning classifiers using systemic properties can be a key feature in generating personalized medicine diagnostic approaches. The authors finally combined diagnostics with therapeutics by screening 45 known anti-Alzheimer drugs from DrugBank against novel predicted probable AD targets, obtained from their trained classifiers, using molecular docking. They further proposed a novel candidate untried drug, AL-108, with high affinity to potential therapeutic targets. Additional tools were also used to validate preliminary findings, including molecular dynamics simulations and MM/GBSA calculations on the docked complexes [62]. Another interesting study [104] used data from TCGA [105] to successfully construct a multidimensional subnetwork atlas for cancer prognosis. The authors addressed how multiple genetic and epigenetic factors (i.e. gene expression, copy number variation, miRNA expression and DNA methylation) affect molecular states of networks and patient survival. Dat aset collection: The multidimensional cancer-associated data sets for 1027 patients for four cancer types were collected from TCGA Cancer Browser (https://genome-can cer.ucsc.edu/proj/site/hgHeatmap/). They contained clinical information, copy-number variation, promoter DNA methylation, mRNA-gene and miRNA expression data. They furthermore extracted PPI data from HPRD for network construction. To enrich these networks with additional miRNA-regulatory information the authors extracted miRNA and target gene information from two miRNA target databases [miRTarBase (Release 4.5) and TarBase v6], which provide experimentally validated miRNA–target interactions. Network construction: PPI interaction network was constructed using data collected from HPRD. Network analysis: The authors fitted a univariate Cox proportional hazards model between each molecular feature and patient survival time and thus scored each gene based on its significance to predict survival. Genes with a positive score were considered as survival-related genes. They next used this score (heat score) as the input into HotNet2, which uses a heat diffusion process and a statistical test-based algorithm to discover subnetwork signatures in the PPI network. Through this way subnetwork signatures of survival-related genes were determined both by the scores of their genes as well as gene topology in the PPI network. Findings and significance of research: The authors then used Monte Carlo cross-validation and permutation testing procedure to assess predictive power of the subnetworks on patient overall survival. They used a Cox proportional hazards model with L1 penalized log partial likelihood (LASSO) for feature selection to train the models based on the molecular profile of individual subnetworks. Finally, the prognostic outcomes for the training set were used to determine the regression coefficients. These coefficients were then used in the testing model to predict outcomes for patients in the test set and calculate the concordance index (C-index). Results reveal novel PPI subnetworks with significant prognostic capabilities for a variety of cancer types. The authors further validated their subnetworks by performing prognostic impact evaluation, functional enrichment analysis, drug target annotation, tumour stratification and independent validation. They highlighted distinct pathways in the underlying subnetworks as potential new targets for therapeutic intervention for certain cancer types. This study integrated the protein interactome with cancer genomics data, thus allowing for a systemic analysis of the molecular mechanisms that underlie genesis of cancer and provides new directions in personalized cancer therapy [104]. Another recent study [106] adopted an approach that uses an enriched library of single-stranded oligodeoxynucleotides to profile complex biological samples. This method allows for the analysis of systemic native biomolecules. The authors defined their method as Adaptive Dynamic Artificial Poly-ligand Targeting and further utilized it as a diagnostic tool to profile plasma exosome of cancer patients. They achieved high classification accuracy in breast cancer patients by analysing the circulating exosomes in their blood [106]. The online database MelGene is yet another example of successful integration of Systems Bioinformatics approaches in current research for molecular diagnostics. This tool provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma, including random-effects meta-analysis results of all eligible polymorphisms [107]. The MelGene proposed network connections highlight potentially new loci in relation to melanoma risk. Recent studies have shown that interpretation of proteomics data using network-based approaches can offer additional insights into the mechanistic and dynamics of protein assemblies, and hence into the molecular mechanisms underlying the system under study. Moreover, network-based approaches can be used to reconstruct a disease-perturbed cellular network model showing the interactions of identified differentially expressed proteins involved in selected cellular pathways related to the target pathophysiology. For example, Shirasaki et al. [108] have used affinity purification coupled to MS to investigate the proteome profile of Huntington’s disease. In particular, using a monoclonal antibody against huntingtin (Htt), they identified 747 proteins to be complexed with Htt. A systems-level view of Htt interactome was achieved by using WGCNA, which was used to construct weighted links between the Htt co-purifying proteins. Using topological overlap, the data were clustered into eight Htt-interactome modules that were related to distinct functional aspects such as brain region specificity, aging and protein aggregation modulation or Htt functions directly [108]. Moreover, several network-based approaches have been developed that can identify the cellular pathways which are altered under pathophysiological conditions, and can hence enrich biomarker discovery. For example, functional enrichment analysis of GO biological processes or KEGG pathways [37] of differentially expressed proteins can be performed using both free licence tools such as Database for Annotation, Visualization and Integrated Discovery (DAVID) [109], Protein ANalysis THrough Evolutionary Relationships (PANTHER) [110] and Gene Set Enrichment Analysis (GSEA) [34], as well as commercialized tools such as MetaCore [36] and Ingenuity Pathway Analysis. Furthermore, pathway topology approaches have been developed as alternative to enrichment analysis. For example, Signalling Pathway Impact Analysis [111] and Network Perturbation Amplitude [112] deliberate whether the proteins involved in functional modules defined by other databases interact with each other in cellular networks. Various tools are currently available, which can aid the Systems Bioinformatics application in biomarker discovery. GWAB, a recent tool, makes use of systems approaches and computational methods to boost weak association signals for Genome Wide Association Studies (GWAS), a common problem when analysing this type of data. This tool works by incorporating publicly available data in the form of using GWAS summary statistics (p-values) for SNPs along with reference genes for a disease of interest. The authors demonstrated the feasibility of boosting GWAS disease associations using gene networks and further present a web server for GWAB, for the network-based boosting of human GWAS data [113]. Other tools like GeneMANIA [114] and PINTA [115] allow for gene prioritization and gene function prediction and can greatly aid in computational diagnostics. Systems Bioinformatics applications in drug discovery Systems Bioinformatics contributes in computational therapeutics by providing tools and algorithms for novel drug discovery. Research in this direction is often done in close collaboration with pharmaceutical companies. One of the main challenges faced by both the research community and the industry is the prediction of adverse drug effects, especially during the early stages of drug development. These types of predictions can lead to significant cost reductions by allowing for accurate drug assessment and discontinuation of development for drugs with severe adverse effects. The use of human genetic variation has been known to play an important role in drug response [116]; however, the effect of this factor alone is not sufficient to provide a complete perspective on the matter in hand. Systems pharmacology is a term that is widely used today in many high-calibre, recently published studies [117–119]. Systems pharmacology is a systems biology approach, which focuses on enhancing the understanding of drugs function in the human body at a systems’ level, described by several types of networks, rather than looking at the effect of single molecular components. It shifts away from traditional practice, which considers the effects of a drug with respect to its target protein and instead strives to address the effects of the drug by considering a network of drug–target interactions. Systems Bioinformatics is a precious field in the neighbourhood of systems pharmacology that provides important methods and tools for multi-source and multilevel integration of the omics spectrum with drug networks shedding light in the area of modern drug discovery. A recent area of great interest where Systems Bioinformatics can be of substantial impact and value is the area of drug repurposing or repositioning [120]. This entails the use of Food and Drug Administration (FDA)-approved drugs to treat new diseases, which are different from the ones they were initially designed for. This allows for obvious shortcuts for pharmaceutical companies allowing them to by-pass the timely and costly process of FDA approval for novel drugs. Recent studies used gene expression data derived from microarrays or RNA-Seq data to obtain specific expression profiles for specific diseases of interest. By comparing these to collections of data sets from repositories such as CMap [121], Drugmap Central and more advanced versions like LINCS and the recent Drug Repurposing Hub [121, 122] allows for alternative drugs to be proposed for the treatment of diseases under investigation. In a recent study [38], this approach was used to devise drug/target networks obtained from algorithms of mutual information and co-expression networks aiming to gain insights into the treatment of breast cancer subtypes. Data collection: TCGA mRNA (microarray) gene expression data for Breast Invasive Carcinoma cases were obtained from Firehose (http://gdac.broadinstitute.org/). From a total of 587 samples (526 primary solid tumour samples and 61 primary solid normal samples—17.814 genes), the authors selected a subset of tumour data containing information regarding breast cancer staging, HER2, ER and PR status with their corresponding normal samples as well as breast cancer stages I, II, III and IV. Network construction: The authors examined three major categories of statistical network inference methods: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. They further utilized Biological information-based network methods and one ensemble scheme using all statistical network inference methods. They used the Cytoscape platform and more specifically the GeneMania plug-in to reconstruct the biological information-based gene network. This plug-in uses a large data set unifying functional networks comprising approximately 800 networks for six organisms including Homo sapiens. Using the H.sapiens network they constructed a sub-network for the top 1000 differentially expressed genes (DEGs) from the TCGA data set merging five Network types: Co-expression, Physical Interaction, Genetic interaction, Co-localization and Pathways. Network analysis: The authors further performed gene re-ranking using the underlying networks. To investigate the influence of the reconstructed 17 gene networks (12 statistically and 5 biologically inferred) on gene prioritization, they applied a method that allows for a custom network selection combining the log fold change absolute values with the selected underlying network topology to re-rank the initial DEGs. The basic idea of the method is the reconciliation of the gene expression values taking into account the underlying gene network topological features such as degree and betweenness. The network patterns were further analysed to investigate their exclusive contribution with respect to breast cancer subtypes and stages. The authors then performed drug repurposing using the up- and down-regulated genes forming disease signatures by querying them in a well-established drug repurposing pipeline, namely, LINCS-L1000 (http://www.lincscloud.org/), an advanced version of CMap. In summary, the authors obtained 63 unique drugs for the breast cancer stages and 58 for the breast cancer subtypes. To further examine the resulting drugs, the authors constructed super networks by combining top drugs extracted from their analysis with the FDA-approved breast cancer drugs, connecting them with their target genes and superimposing these on the gene expression networks. Findings and significance of research: The authors performed an analysis that concluded to eight network patterns, four for the stages (I, II, III and IV) and four for the subtypes (Triple Negative, Luminal A, Luminal B and HER2). These patterns were shown to highlight four exclusive stage-related pathways including phenylalanine metabolism for Stage II, peroxisome proliferator-activated signalling pathway and glycolysis and gluconeogenesis for Stage III and toll-like receptor signalling pathway for Stage IV. Finally, the authors performed drug repurposing to elucidate potential anti-breast-cancer properties for known drugs and they compared the molecular structure for their predicted re-purposed drugs against 25 FDA-approved drugs of clinical use. Two out of these 25 drugs (Gemcitabine and Palbociclib) were also found as repurposed drugs by the authors. In Stage I, two repurposed drugs, Clofarabine and Kinetin-riboside, were found to be structurally similar to Gemcitabine. Clofarabine seems to have potential efficacy in epigenetic therapy of solid tumours, especially at early stages of carcinogenesis. Another recent line of work [64] performed network-based in silico drug efficacy screening by exploiting network-based approaches. The authors investigated the association between drug targets and diseases, presenting a drug–disease proximity measure [64]. Data collection: The authors used 1489 diseases defined by Medical Subject Headings (MeSH) compiled in a recent study [90]. For each disease, the disease–gene associations were collected from OMIM and GWAS catalogue. For each disease, the authors extracted information on FDA-approved drugs from DrugBank and matched 79 of these diseases with at least one drug using tools like MEDI-HPS and Metab2Mesh resulting in 238 unique drugs and 384 targets. The authors took information published by [90] that contained experimentally documented human protein physical interactions from TRANSFAC, IntAct, MINT, BioGRID, HPRD, KEGG, BIGG, CORUM, PhosphoSitePlus and a large-scale signalling network. Network construction: The human PPI network was compiled using information extracted from the databases described above, to generate an elaborate human interactome. The largest connected component of this interactome was consequently used in their analysis, consisting of 141 150 interactions between 13 329 proteins. Entrez Gene IDs were used to map disease-associated genes to the corresponding proteins in the interactome. Network analysis: The proximity between a disease and a drug was evaluated using various distance measures that take into account the path lengths between drug targets and disease proteins. The authors focused on two types of network-based proximity relationships between drugs and disease proteins: (i) the most straightforward measure is the average shortest path length between all targets of a drug and the proteins involved in the same disease; (ii) the second proximity is the closest measure, representing the average shortest path length between the drug’s targets and the nearest disease protein. Findings and significance of research: The authors validated their approach and optimized their proximity thresholds by assessing how well relative proximity discriminates 402 known drug–disease pairs from the 18 162 unknown drug–disease pairs by comparing the area under Receiver Operating Characteristic curve for different distance measures. Based on these results the authors showed that network proximity delineates therapeutic effects of a drug. This approach of utilizing network proximity in the interactome for drug targets and diseases, allowed for increased understanding in the therapeutic effect of drugs. They made use of cases from Parkinson’s disease and several inflammatory disorders to further substantiate findings. This approach can potentially have significant applications in drug discovery, drug repurposing and assessment of drug adverse effects. Another study [123] led to the development of a current state-of-the-art tool that addresses computational therapeutics from a network perspective, the TCM-Mesh system. This tool allows for the high-throughput network pharmacology analysis for Traditional Chinese Medicine (TCM) [123]. Data set collection: TCM utilizes data curated from collections of 6235 herbs, 383 840 compounds, 14 298 genes, 6204 diseases, 144 723 gene–disease associations, 3 440 231 pairs of gene interactions, 163 221 side-effect records and 71 toxic records (data as of April 2017). The information for traditional Chinese herbs and traditional Chinese medicine preparation was extracted from TCM Database@Taiwan, TCMID, information of compounds and their targets; diseases and their related proteins were obtained from STITCH and OMIM, respectively; the protein interactions were obtained from STRING; the toxic and side-effect records of compounds were derived from TOXNET and SIDER. Network construction: The authors used Cytoscape as well as a web-based software to facilitate visualization of a compound–gene–disease network construction between TCM and treated diseases. Network analysis: The authors based their network analysis and scored their compounds using the combined score as defined and obtained from the STITCH database. This score represents the strength of the links between the compounds and their associated proteins. Findings and significance of research: The authors used 1293 FDA-approved drugs, as well as compounds from a herbal material Panax ginseng and a patented drug Liuwei Dihuang Wan for evaluating their database. By comparison of different databases, as well as checking against literature, they demonstrated the completeness, effectiveness and accuracy of the TCM-Mesh database and further aided in increased understanding of the molecular mechanisms of TCM action. Various tools are currently available, which can aid the Systems Bioinformatics application in drug discovery. For example, tools like Substructure-Drug-Target Network-Based Inference SDTNBI [124], C(2) Maps [125], Chem2Bio2RDF [126] and PROMISCUOUS [127] cumulatively provide integrated systems and pharmacology databases for chemoinformatics analysis, drug-target prediction, networks of disease–gene–drug connectivity relationships as well as drug repositioning analysis. For a full list of tools and databases adopting or supporting Systems Bioinformatics methodologies, see Table 1. A more comprehensive list of related tools and databases going back to 2010 can be found in Supplementary Table S1. Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling Discussion The concept of utilizing networks to visualize the complex interaction of mechanisms implicated in disease has been around for several years. However, two important breakthroughs separate previous network-based approaches and are currently driving the state-of-the-art in Systems Bioinformatics: (i) construction of multiple networks representing each level of the omics spectrum and the integration of these in a layered network that exchanges information within and between layers [62, 67, 97] and (ii) the advent of novel techniques and methodologies for analysing and understanding these networks using mathematical algorithms and approaches derived from graph theory and information theory [6]. Using these methods for extracting biologically meaningful information from multiple levels of the omics spectrum can provide the integrated systemic knowledge for the development of a comprehensive Human System profile, which increases diagnostic accuracy and concurrently allows for novel therapeutic advances and assess response to therapy. Nevertheless, the network-based approaches, either for evidence-based or for statistically inferred molecular networks, have a number of limitations. Specifically, networks based on experimental evidence are not complete as experiments are only a snapshot of the real biological world. Moreover, statistically inferred networks represent an undetermined computational problem because the number of the inferred relationships is much larger than the number of the independent measurements [196]. Owing to the lack of sufficient ground truth to validate the reconstructed molecular networks, special attention must be given when choosing benchmark data sets (e.g. existing curated databases with experimental information, medium-throughput experimental information and simulated data sets that mimic real data). To this extent, the DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative facilitates researchers from the Systems Bioinformatics field to assess the validity of the networks they are using and proceed with optimization and parameter-tuning regarding network reconstruction [197]. A subsequent limitation of the network-based approaches is the low overlap that the various network reconstruction approaches have, and the inadequacy in selecting the proper network each time. It is likely that computational approaches checking and exploiting complementarities and providing ensemble solutions of a network construction consensus will maximize the information content. In our opinion, Systems Bioinformatics might currently appear rather aspirational, yet, considering its potential it is likely to have a major impact on medicine and pharmacology in the next decade. The field of medicine is expected to benefit from the invaluable knowledge attained from Systems Bioinformatics methodologies. The molecular basis of complex, polygenic diseases is highly heterogeneous and affected by multiple factors simultaneously. These include genetic predisposition, multipart molecular mechanisms and effects of the environment, diet, drug administration/response and numerous other factors. Although it might not be possible to replace the use of traditional approaches for therapeutics and diagnosis with computational methods, yet, it is likely that Systems Bioinformatics will provide revolutionary approaches and tools to clinicians in order to demystify the complex nature of these diseases. Computational diagnostics and therapeutics, enhanced by Systems Bioinformatics approaches, will not only aid clinicians in patient consultation and care but will also catalyse significant breakthroughs in prognostic measures, detection of disease at an early onset and overall disease prevention. Supplementary Material bbx151_Supp Click here for additional data file.

Document structure show

Title	Systems Bioinformatics: increasing precision of computational diagnostics and therapeutics through network-based approaches
Abstract	Abstract Systems Bioinformatics is a relatively new approach, which lies in the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels using a bottom-up approach as in systems biology with a data-driven top-down approach as in bioinformatics. The advent of omics technologies has provided the stepping-stone for the emergence of Systems Bioinformatics. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Systems Bioinformatics is the framework in which systems approaches are applied to such data, setting the level of resolution as well as the boundary of the system of interest and studying the emerging properties of the system as a whole rather than the sum of the properties derived from the system’s individual components. A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers. Here, we provide evidence on how Systems Bioinformatics enhances computational therapeutics and diagnostics, hence paving the way to precision medicine. The aim of this review is to familiarize the reader with the emerging field of Systems Bioinformatics and to provide a comprehensive overview of its current state-of-the-art methods and technologies. Moreover, we provide examples of success stories and case studies that utilize such methods and tools to significantly advance research in the fields of systems biology and systems medicine.
Title	Abstract
Body	Introduction Biological data, either as large-scale omics or as classical biodata, are the footprints of biological mechanisms. These mechanisms consist of numerous synergistic effects emerging from various systems of interwoven biomolecules, cells and tissues. Therefore, it is necessary to explore them with a systemic approach to reveal the behaviour of the system as a whole rather than as the sum of its parts. Systems biology provides a holistic perspective on biological mechanisms via the integration of information and knowledge from multiple interdisciplinary fields (such as biology, chemistry, mathematics, computer science and physics). It aims to elucidate synergistic relationships between multiple factors in contrast to representing them as single entities and can lead to the generation of complex molecular networks of interactions modelled by computational or mathematical approaches. Systems biology harnesses its power from technological advances in the field of ‘omics’ and the advent of next-generation sequencing. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Bioinformatics and computational biology have made significant breakthroughs towards the analysis and interpretation of the data obtained from the above-mentioned omics technologies. The sheer size of data generated by these high-throughput methodologies, coupled with the need to analyse, integrate and concurrently interpret this avalanche of information in a systemic way, has paved the way to the upcoming field of Systems Bioinformatics. Systems Bioinformatics is a relatively new approach, which lies in the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels using a bottom-up approach—as adopted in systems biology—with a data-driven top-down approach used in bioinformatics. The bottom-up approach in systems biology typically brings together information from molecular cells and tissues in the framework of mathematical models to generate insights on the function and dynamic behaviour of cells, organs and organisms. The top-down approach uses bioinformatics methods to extract and analyse information from ‘omics’ data generated through high-throughput techniques. Initially, informatics approaches to systems biology focused mainly on modelling and simulation. However, owing to the lack of sufficient experimental data, these methods fell short in building reliable models. Subsequently, the explosion of multilevel data generation brought a plethora of new methods, tools and solutions capable of studying systemic properties. The application of systemic approaches such as information theory, statistical inference, probabilistic models, graph theory and further network science approaches in the analysis of biological data paved the way to the creation of a distinct field, namely, Systems Bioinformatics. Depending on the availability, the quality and the comprehensiveness of the data, Systems Bioinformatics’ methods contribute significant benefit in narrowing down the gap between genotype to phenotype as well as providing additional information regarding biomarker and drug discovery. These methods are applied to classical biological data, clinical/patient data and omics data as well. They are suitable for extracting precise and personalized results, thus, facilitating systems medicine (medicine that is in bidirectional interaction with computational multiscale analysis and modelling of disease-related mechanisms) and more specifically P4 Medicine (medicine that is personal, participatory, predictive and preventive) [1, 2] (as illustrated in Figure 1). Figure 1 Systems Bioinformatics. A schematic representation of the emergence of Systems Bioinformatics as a distinct discipline among other interrelated and interdependent disciplines. The information provided by Bioinformatics, Biology and Systems Biology is integrated in the Systems Bioinformatics framework through computational integration and network-based and other holistic approaches to tackle challenges in Systems Medicine and in particular P4 Medicine. The delivery of individually adapted medical care of high precision, based on multi-source patient information across various levels and in various scales, is the basic idea of modern medicine having various appellations depending on the emphasis given (e.g. translational/systems/P4/precision/personalized medicine). The omics spectrum offers the opportunity and the challenge for multiscale and multi-source analysis towards building a comprehensive profile of the Human System (Figure 2). The major challenges faced by Systems Bioinformatics towards this demanding form of medicine are: (i) the design and development of suitable bioinformatics pipelines to provide valid and sufficient biological information from the high-throughput molecular profiles of the patient, (ii) the development of robust information systems capable for data integration, information extraction and knowledge sharing, (iii) the construction of mathematical models to predict the evolution of a particular disease, its relation with the measured markers, its tolerance/resistance to various drug families and the existing risks to the patient. These challenges can be tackled with state-of-the-art computational methodologies and techniques, such as computational intelligence, machine learning, pattern recognition and data mining, modelling and simulation, network reconstruction and visualization, complex network analysis, deep learning, text mining/semantics and association analysis. Further to these, Systems Bioinformatics serves as the framework for the development of powerful computational methods and tools to create user-friendly platforms to visualize and analyse big and heterogeneous information in the form of a network. Figure 2 Network Integration. Multiscale and multisource data generated from the Human System can be represented in network form. These networks can be further analysed and, importantly, they can be integrated forming supernetworks and building a comprehensive profile of the Human System. This review is structured in three main sections. In the first section on ‘Systems Bioinformatics’ we begin with an overview of the systems theory approach for complex biological problems. We then provide an in-depth summary of the network science approach in System Bioinformatics. We introduce certain basic network measures, which are used to analyse the components of a network, both locally and globally, and discuss the biological interpretation of such measures. We then describe in detail key biological network construction methods followed by a discussion on module-based approaches and network signatures. Finally, we describe network manipulation methods in the ‘Network controllability’ and ‘Network integration’ subsections. In the following subsection we provide a short summary regarding modelling and simulation approaches followed by a short discussion on the infrastructures and data management challenges in the field of Systems Bioinformatics. In the last two sections we provide an overview of methods and case studies with regards to the Systems Bioinformatics applications in biomarker and drug discovery. Systems Bioinformatics Systems approaches Biological data have tremendously expanded both in size and complexity. Systems Bioinformatics focuses on the investigation of such vast and complex biological systems and their within interactions using a ‘holistic’ rather than a ‘reductionist’ approach, much like the systems biology field. A holistic approach to science and the analysis and description of a complex phenomenon emphasizes the whole and the interaction of its parts, whereas the reductionist approach focuses on the fundamental parts. In fact, the debate on reductionism versus holism has its roots in ancient years. According to reductionism proponents, the optimal method to understand any science is the decomposition in smaller components. Moreover, in its greedy form, reductionism may see the whole science as physics. Even in its layered-model form, reductionism considers human/health sciences as based on biology, biology based on chemistry and chemistry based on physics. On the other hand, a strong dissent has formulated a solid antireductionism trend. This trend has either epistemological or ontological origins, supporting that complete reductionism is technically impossible and that there are emergent laws that govern the system and cannot be derived from the laws governing the components of the system. Furthermore, it is supported that each system has a ‘buffering capacity’ where many micro-states correspond to fewer macro-states of the system, making reductionism to be considered as pointless after a certain decomposition level [3]. The reductionism’s approach in biology is epitomized by molecular biology, which in the past two decades has led to the generation of a plethora of omics data. These data provide information on the building blocks of the entire organism at different scales and for different types of cells, tissues and organs. Data on DNA fragments, genes, RNA fragments, peptides, proteins and metabolites measured in short time and space intervals provide a spatiotemporal distribution of these building blocks under various states of the organism. These interwoven building blocks control and are controlled by signals in a non-linear way. As a result, the understanding of the system requires something more than simply the bottom-up assembly of the system’ components. Systems theory, which is a holistic approach, addresses the limitations from the reductionism’s point of view by considering the system as a whole, adopting a top-down approach [4]. Thus, it studies the emergent properties of the system such as homeostasis, adaptivity, tolerance, stability and modularity, through some basic overlying hierarchical principles such as entropy, positive and negative feedback control [4]. In the context of Systems approaches, the graph theory and the further science of networks have been successfully applied to the investigation of complex phenomena across a range of different scientific disciplines. The theoretical context of complex networks approach includes concepts that are derived from information theory, dynamical systems, statistical physics and topology approaches, as well as several mathematical methods suited for the analysis of the interaction of components in a complex network. In the following subsection we provide an in-depth overview of such network approaches and examples of their use in Systems Bioinformatics. Networks Biological network basics Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for the understanding of the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed as networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A). It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the analysis and extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting one node to its neighbours. The probability distribution of the degrees over the whole network is called degree distribution. In random networks, most nodes have a similar number of links and their degree distribution follows the Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law, as most of the nodes have few links and only a few nodes are densely connected [7]. Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges. Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [i.e. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to indicate essential genes (i.e. critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they rest in critical junctions between hubs and can therefore represent bridges that allow groups of nodes to cross talk to each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections that if removed will result in the complete loss of connectivity between clusters of nodes, thus affecting greatly the overall topology and, as a result, the information propagation in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. Another measure is closeness (a measure of the average length of the shortest paths from one node to other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein to be functionally relevant for several other proteins. An example illustrating how network measures, such as network-efficiency and network-clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10] where a network was constructed using the alpha bandwidth (8–13 Hz) of the electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering-coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering-coefficients of all nodes). In Blain-Morales et al. [10] network efficiency was significantly decreased and network clustering-coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers for normal recovery brain networks post general anaesthesia induction. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion). Biological network construction methods Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks that rely on experimental evidence for specific molecular interactions such as PPI networks, metabolic networks and regulatory networks (transcription factor—gene networks, non-coding RNA—gene networks) [14–18], (ii) statistically inferred networks, which are based on statistical inference that rely on interactions between components established by means of statistical analysis. Evidence-based molecular networks : The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (i.e. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes to highlight significant genes for each disease. Certain approaches utilize biological knowledge, to address many biological problems and find genes related to the disease of interest. Construction of networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases. For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. For example, in a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data of spinal nerves led to the identification of 19 biological processes to be involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. Furthermore, PDI databases include the EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and the MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised owing to lack of available data [38]. Based on this limitation, many statistical network inference methods were developed to construct statistically inferred gene networks, based on omic data from high-throughput technologies, as they provide snapshots of the transcriptome under many tested experimental conditions [39]. Statistically inferred networks: A type of statistical inference network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42]. Sources like the BioGRID [43] database, allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind the network inference methods is to search for sets of co-expressed genes. Depending on the metric that is used, these methods can be classified into three major categories [38]: (i) Mutual Information-based methods, (ii) Correlation-based methods and (iii) Tree-based methods. Mutual information-based methods calculate the mutual information values of all pairs for a given gene expression profile, and if a pair’s corresponding value is larger than a given threshold then this pair of genes is considered as linked. The resulting network is constructed based on this threshold by including a weighted edge between two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood or Relatedness Network) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. In the case of correlation-based methods, the different algorithms calculate the correlation or the partial correlation between pairs of genes. These methods are implemented through algorithms like GeneNet, a statistical learning algorithm which allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm which calculates correlations across each pair of genes. It computes an adjacency matrix using the Spearman correlation, Lasso—a shrinkage and selection method for linear regression [51]—and Adaptive Lasso—another version of Lasso modified to include penalty weights [52]. In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based methods in regression is to recursively divide the learning sample with binary tests based each on one input variable (the expression of one gene). These binary tests are optimized to minimize, in the largest amount possible, the variance of the output variable, namely, the expression of another gene from the remaining in the subsets of samples. Candidate divisions compare values from the input variable with a threshold, which is determined during the tree growing. Tree-based ensemble methods are more enhanced than single trees, as they estimate the average predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical-inference methods are used to estimate the expression pattern relationships across all pairs of genes driving to co-expression network inference. However, correlation-based methods have a tendency to be algorithmically straightforward and computationally fast, but with the limitation that assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions but they can be computationally expensive. From another point of view, tree-based methods are non-parametric and consequently, they do not need to make any assumption about the nature of the data. Tree-based methods can deal effectively with high-dimensionality data. Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships. Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. In the case of Systems Bioinformatics there are relatively limited, yet of great interest, studies that have used such an approach [87]. ‘Driving nodes’ are highly important nodes in a network that governs its controllability. Control theory shows that to direct a complex network towards a desired state, there is a minimum number of driving nodes. Determining the minimum number of driving nodes can be demanding with respect to both computational resources and time, hence novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method that optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes are sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct queues or inputs, it is possible to activate specific cellular processes that determine disease outcome (i.e. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to perform classification of proteins with respect to their importance in the network. The authors quantified the effects of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets. Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. obesity) are more likely to share commonalties in the metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90] the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights in predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel, multiple-network-framework integration for epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors of this study also showed that epigenetic modules also estimate the survival time of patients, and this factor is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section. Systems modelling and simulation To test the validity and predict the behaviour of complex biochemical systems, such as gene networks, it is often required to describe the effects of multiple, simultaneous and dynamic interactions within the components of the system that are too complex to interpret intuitively. Developing and simulating mathematical models is essential in investigating such complex biological systems. These complex systems can be further explored using mathematical models to describe the valid structure (i.e. the components of the system and their interactions based on experimental data) and identify the basic underlying principles of their function to predict behavioural responses to a certain perturbation [92]. Two types of models are commonly used to describe biological processes such as gene networks—‘quantitative’ and ‘logical’ [93, 94]. Quantitative models use differential equations to describe the non-linear dynamic interactions in a network, whereas logical models use the Boolean approach to describe dynamics in a qualitative way. Quantitative models provide precise information and can be directly compared with experiments including time-dependent data. However, they require sufficient knowledge of the mechanistic details and kinetic parameters and, thus, they are limited to applications to networks which are well characterized and are of small to moderate size. Logical models do not require such information and can be applied to large-scale networks with known structure, yet only provide limited information, as they cannot provide quantitative predictions and assist in choosing better alternative behaviours. In summary, each modelling approach has its advantages and disadvantages and recent work suggests that hybrid approaches might be optimal for challenges in systems biology (for a detailed discussion see [93]). Mathematical models are indispensable in pharmacology and diagnostics. For example, spatio-temporal mathematical models of the blood coagulation network have been developed to aid drug development and diagnostics (as extensively reviewed in [95]). Another relevant application is the use of mathematical models of drug-targeted pathways (modelled with a set of differential equations based on the mass action law) to explore drug combinations [96]. Classical bioinformatics and systems biology can complement and strengthen each other in drug discovery and therapeutics where concrete predictions are required [92]. The value of combining high-throughput data with mathematical modelling is shown, for example, in devising personalized treatments in cancer (for extensive review see [97]). Integration of multi-omic data can be used as an additional constraint in constraint-based modelling in systems biology (to optimize parameter estimation and validation). For a recent survey summarizing constraint-based metabolomic modelling and multi-omic integration methods see [98]. Infrastructures and data management It is important to highlight some of the modern computing trends that play an important role in driving research in Systems Bioinformatics and facilitate the transition to personalized medicine. The main limiting factor for research laboratories specializing in Systems Bioinformatics is computational power and resources. Significant investment is required to attain high-performance computer (HPC) servers or clusters, which have the capacity to store, manage and process the vast amount of data generated from high-throughput omics technologies. Often sheer maintenance of these machines can be a costly and a limiting factor that disallows the exploitation of the full potential of HPC. Cloud computing promises to solve major issues of system administration for these computer clusters by allowing for the exploitation of HPC, stored and managed in an expert environment, as virtual resources that are made available through the internet. Tool availability is also a major issue and having organized platforms with tools like CytoScape [99], GATK [100], BLAST [101], omics assemblers (i.e. IDBA-UD [102]) and programming languages and packages like R's Bioconductor library for expression data analysis [103], JAVA, Python, SQL and others is of major importance for scientists to facilitate dissemination of algorithms, data and results. Systems bioinformatics applications In this section we present the impact of Systems Bioinformatics on diagnostics and therapeutics by highlighting success stories and cases in a formatted manner: introductory text/data set collection/network construction/network analysis/findings and significance of research. Systems bioinformatics applications in biomarker discovery The use of networks in computational diagnostics via the detection of molecular biomarkers is one of the hallmarks of Systems Bioinformatics. Numerous recent state-of-the-art studies have made use of such networks to characterize cellular systems by simultaneously analysing thousands of genes, proteins, isoforms and complexes to address issues of computational diagnostics. Here, we highlight a few studies, showcasing the essence of networks’ contribution in precision diagnostics. A recent study [62] used a machine learning approach, which integrates topological features from PPI networks, to identify candidate AD-associated genes. Data set collection: Positive and negative data sets were collected from Entrez Gene database at the National Centre for Biotechnology Information (NCBI). The positive data set consisted of 458 genes known to be associated with AD. The negative data set consisted of the additional 55 947 Entrez genes, excluding the AD-associated genes. Network construction: Human PPI data sets were extracted from a variety of sources including Online Predicted Human Interaction Database (OPID), STRING, MINT, BIND and InTAct databases. Network analysis: By utilizing the PPI networks, the authors extracted topological features for the AD- and non-AD-associated genes. These features included nine topological properties of the PPI network for each gene, namely, the average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, degree, eccentricity, neighbourhood connectivity, topological coefficient and radiality. Findings and significance of research: The authors further combined sequence features and functional annotations features and concurrently performed feature selection using seven methods including gain-ratio-based attribute evaluation, oneR algorithm, chi-square-based selection, correlation-based selection, information gain-based attribute evaluation and relief-based selection. The most important features were fed into 11 machine learning algorithms to generate classifiers using the training data set capable of predicting AD- and non-AD-associated genes using the selected network, sequence and functional features. Methods included Naive Bayes (NB), NB Tree, Bayes Net, Decision table/NB hybrid classifier, Random Forest, J48, Functional Tree, Locally Weighted Learning (J48 + k-nearest neighbour), Logistic Regression and Support Vector Machine. Training of sophisticated machine learning classifiers using systemic properties can be a key feature in generating personalized medicine diagnostic approaches. The authors finally combined diagnostics with therapeutics by screening 45 known anti-Alzheimer drugs from DrugBank against novel predicted probable AD targets, obtained from their trained classifiers, using molecular docking. They further proposed a novel candidate untried drug, AL-108, with high affinity to potential therapeutic targets. Additional tools were also used to validate preliminary findings, including molecular dynamics simulations and MM/GBSA calculations on the docked complexes [62]. Another interesting study [104] used data from TCGA [105] to successfully construct a multidimensional subnetwork atlas for cancer prognosis. The authors addressed how multiple genetic and epigenetic factors (i.e. gene expression, copy number variation, miRNA expression and DNA methylation) affect molecular states of networks and patient survival. Dat aset collection: The multidimensional cancer-associated data sets for 1027 patients for four cancer types were collected from TCGA Cancer Browser (https://genome-can cer.ucsc.edu/proj/site/hgHeatmap/). They contained clinical information, copy-number variation, promoter DNA methylation, mRNA-gene and miRNA expression data. They furthermore extracted PPI data from HPRD for network construction. To enrich these networks with additional miRNA-regulatory information the authors extracted miRNA and target gene information from two miRNA target databases [miRTarBase (Release 4.5) and TarBase v6], which provide experimentally validated miRNA–target interactions. Network construction: PPI interaction network was constructed using data collected from HPRD. Network analysis: The authors fitted a univariate Cox proportional hazards model between each molecular feature and patient survival time and thus scored each gene based on its significance to predict survival. Genes with a positive score were considered as survival-related genes. They next used this score (heat score) as the input into HotNet2, which uses a heat diffusion process and a statistical test-based algorithm to discover subnetwork signatures in the PPI network. Through this way subnetwork signatures of survival-related genes were determined both by the scores of their genes as well as gene topology in the PPI network. Findings and significance of research: The authors then used Monte Carlo cross-validation and permutation testing procedure to assess predictive power of the subnetworks on patient overall survival. They used a Cox proportional hazards model with L1 penalized log partial likelihood (LASSO) for feature selection to train the models based on the molecular profile of individual subnetworks. Finally, the prognostic outcomes for the training set were used to determine the regression coefficients. These coefficients were then used in the testing model to predict outcomes for patients in the test set and calculate the concordance index (C-index). Results reveal novel PPI subnetworks with significant prognostic capabilities for a variety of cancer types. The authors further validated their subnetworks by performing prognostic impact evaluation, functional enrichment analysis, drug target annotation, tumour stratification and independent validation. They highlighted distinct pathways in the underlying subnetworks as potential new targets for therapeutic intervention for certain cancer types. This study integrated the protein interactome with cancer genomics data, thus allowing for a systemic analysis of the molecular mechanisms that underlie genesis of cancer and provides new directions in personalized cancer therapy [104]. Another recent study [106] adopted an approach that uses an enriched library of single-stranded oligodeoxynucleotides to profile complex biological samples. This method allows for the analysis of systemic native biomolecules. The authors defined their method as Adaptive Dynamic Artificial Poly-ligand Targeting and further utilized it as a diagnostic tool to profile plasma exosome of cancer patients. They achieved high classification accuracy in breast cancer patients by analysing the circulating exosomes in their blood [106]. The online database MelGene is yet another example of successful integration of Systems Bioinformatics approaches in current research for molecular diagnostics. This tool provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma, including random-effects meta-analysis results of all eligible polymorphisms [107]. The MelGene proposed network connections highlight potentially new loci in relation to melanoma risk. Recent studies have shown that interpretation of proteomics data using network-based approaches can offer additional insights into the mechanistic and dynamics of protein assemblies, and hence into the molecular mechanisms underlying the system under study. Moreover, network-based approaches can be used to reconstruct a disease-perturbed cellular network model showing the interactions of identified differentially expressed proteins involved in selected cellular pathways related to the target pathophysiology. For example, Shirasaki et al. [108] have used affinity purification coupled to MS to investigate the proteome profile of Huntington’s disease. In particular, using a monoclonal antibody against huntingtin (Htt), they identified 747 proteins to be complexed with Htt. A systems-level view of Htt interactome was achieved by using WGCNA, which was used to construct weighted links between the Htt co-purifying proteins. Using topological overlap, the data were clustered into eight Htt-interactome modules that were related to distinct functional aspects such as brain region specificity, aging and protein aggregation modulation or Htt functions directly [108]. Moreover, several network-based approaches have been developed that can identify the cellular pathways which are altered under pathophysiological conditions, and can hence enrich biomarker discovery. For example, functional enrichment analysis of GO biological processes or KEGG pathways [37] of differentially expressed proteins can be performed using both free licence tools such as Database for Annotation, Visualization and Integrated Discovery (DAVID) [109], Protein ANalysis THrough Evolutionary Relationships (PANTHER) [110] and Gene Set Enrichment Analysis (GSEA) [34], as well as commercialized tools such as MetaCore [36] and Ingenuity Pathway Analysis. Furthermore, pathway topology approaches have been developed as alternative to enrichment analysis. For example, Signalling Pathway Impact Analysis [111] and Network Perturbation Amplitude [112] deliberate whether the proteins involved in functional modules defined by other databases interact with each other in cellular networks. Various tools are currently available, which can aid the Systems Bioinformatics application in biomarker discovery. GWAB, a recent tool, makes use of systems approaches and computational methods to boost weak association signals for Genome Wide Association Studies (GWAS), a common problem when analysing this type of data. This tool works by incorporating publicly available data in the form of using GWAS summary statistics (p-values) for SNPs along with reference genes for a disease of interest. The authors demonstrated the feasibility of boosting GWAS disease associations using gene networks and further present a web server for GWAB, for the network-based boosting of human GWAS data [113]. Other tools like GeneMANIA [114] and PINTA [115] allow for gene prioritization and gene function prediction and can greatly aid in computational diagnostics. Systems Bioinformatics applications in drug discovery Systems Bioinformatics contributes in computational therapeutics by providing tools and algorithms for novel drug discovery. Research in this direction is often done in close collaboration with pharmaceutical companies. One of the main challenges faced by both the research community and the industry is the prediction of adverse drug effects, especially during the early stages of drug development. These types of predictions can lead to significant cost reductions by allowing for accurate drug assessment and discontinuation of development for drugs with severe adverse effects. The use of human genetic variation has been known to play an important role in drug response [116]; however, the effect of this factor alone is not sufficient to provide a complete perspective on the matter in hand. Systems pharmacology is a term that is widely used today in many high-calibre, recently published studies [117–119]. Systems pharmacology is a systems biology approach, which focuses on enhancing the understanding of drugs function in the human body at a systems’ level, described by several types of networks, rather than looking at the effect of single molecular components. It shifts away from traditional practice, which considers the effects of a drug with respect to its target protein and instead strives to address the effects of the drug by considering a network of drug–target interactions. Systems Bioinformatics is a precious field in the neighbourhood of systems pharmacology that provides important methods and tools for multi-source and multilevel integration of the omics spectrum with drug networks shedding light in the area of modern drug discovery. A recent area of great interest where Systems Bioinformatics can be of substantial impact and value is the area of drug repurposing or repositioning [120]. This entails the use of Food and Drug Administration (FDA)-approved drugs to treat new diseases, which are different from the ones they were initially designed for. This allows for obvious shortcuts for pharmaceutical companies allowing them to by-pass the timely and costly process of FDA approval for novel drugs. Recent studies used gene expression data derived from microarrays or RNA-Seq data to obtain specific expression profiles for specific diseases of interest. By comparing these to collections of data sets from repositories such as CMap [121], Drugmap Central and more advanced versions like LINCS and the recent Drug Repurposing Hub [121, 122] allows for alternative drugs to be proposed for the treatment of diseases under investigation. In a recent study [38], this approach was used to devise drug/target networks obtained from algorithms of mutual information and co-expression networks aiming to gain insights into the treatment of breast cancer subtypes. Data collection: TCGA mRNA (microarray) gene expression data for Breast Invasive Carcinoma cases were obtained from Firehose (http://gdac.broadinstitute.org/). From a total of 587 samples (526 primary solid tumour samples and 61 primary solid normal samples—17.814 genes), the authors selected a subset of tumour data containing information regarding breast cancer staging, HER2, ER and PR status with their corresponding normal samples as well as breast cancer stages I, II, III and IV. Network construction: The authors examined three major categories of statistical network inference methods: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. They further utilized Biological information-based network methods and one ensemble scheme using all statistical network inference methods. They used the Cytoscape platform and more specifically the GeneMania plug-in to reconstruct the biological information-based gene network. This plug-in uses a large data set unifying functional networks comprising approximately 800 networks for six organisms including Homo sapiens. Using the H.sapiens network they constructed a sub-network for the top 1000 differentially expressed genes (DEGs) from the TCGA data set merging five Network types: Co-expression, Physical Interaction, Genetic interaction, Co-localization and Pathways. Network analysis: The authors further performed gene re-ranking using the underlying networks. To investigate the influence of the reconstructed 17 gene networks (12 statistically and 5 biologically inferred) on gene prioritization, they applied a method that allows for a custom network selection combining the log fold change absolute values with the selected underlying network topology to re-rank the initial DEGs. The basic idea of the method is the reconciliation of the gene expression values taking into account the underlying gene network topological features such as degree and betweenness. The network patterns were further analysed to investigate their exclusive contribution with respect to breast cancer subtypes and stages. The authors then performed drug repurposing using the up- and down-regulated genes forming disease signatures by querying them in a well-established drug repurposing pipeline, namely, LINCS-L1000 (http://www.lincscloud.org/), an advanced version of CMap. In summary, the authors obtained 63 unique drugs for the breast cancer stages and 58 for the breast cancer subtypes. To further examine the resulting drugs, the authors constructed super networks by combining top drugs extracted from their analysis with the FDA-approved breast cancer drugs, connecting them with their target genes and superimposing these on the gene expression networks. Findings and significance of research: The authors performed an analysis that concluded to eight network patterns, four for the stages (I, II, III and IV) and four for the subtypes (Triple Negative, Luminal A, Luminal B and HER2). These patterns were shown to highlight four exclusive stage-related pathways including phenylalanine metabolism for Stage II, peroxisome proliferator-activated signalling pathway and glycolysis and gluconeogenesis for Stage III and toll-like receptor signalling pathway for Stage IV. Finally, the authors performed drug repurposing to elucidate potential anti-breast-cancer properties for known drugs and they compared the molecular structure for their predicted re-purposed drugs against 25 FDA-approved drugs of clinical use. Two out of these 25 drugs (Gemcitabine and Palbociclib) were also found as repurposed drugs by the authors. In Stage I, two repurposed drugs, Clofarabine and Kinetin-riboside, were found to be structurally similar to Gemcitabine. Clofarabine seems to have potential efficacy in epigenetic therapy of solid tumours, especially at early stages of carcinogenesis. Another recent line of work [64] performed network-based in silico drug efficacy screening by exploiting network-based approaches. The authors investigated the association between drug targets and diseases, presenting a drug–disease proximity measure [64]. Data collection: The authors used 1489 diseases defined by Medical Subject Headings (MeSH) compiled in a recent study [90]. For each disease, the disease–gene associations were collected from OMIM and GWAS catalogue. For each disease, the authors extracted information on FDA-approved drugs from DrugBank and matched 79 of these diseases with at least one drug using tools like MEDI-HPS and Metab2Mesh resulting in 238 unique drugs and 384 targets. The authors took information published by [90] that contained experimentally documented human protein physical interactions from TRANSFAC, IntAct, MINT, BioGRID, HPRD, KEGG, BIGG, CORUM, PhosphoSitePlus and a large-scale signalling network. Network construction: The human PPI network was compiled using information extracted from the databases described above, to generate an elaborate human interactome. The largest connected component of this interactome was consequently used in their analysis, consisting of 141 150 interactions between 13 329 proteins. Entrez Gene IDs were used to map disease-associated genes to the corresponding proteins in the interactome. Network analysis: The proximity between a disease and a drug was evaluated using various distance measures that take into account the path lengths between drug targets and disease proteins. The authors focused on two types of network-based proximity relationships between drugs and disease proteins: (i) the most straightforward measure is the average shortest path length between all targets of a drug and the proteins involved in the same disease; (ii) the second proximity is the closest measure, representing the average shortest path length between the drug’s targets and the nearest disease protein. Findings and significance of research: The authors validated their approach and optimized their proximity thresholds by assessing how well relative proximity discriminates 402 known drug–disease pairs from the 18 162 unknown drug–disease pairs by comparing the area under Receiver Operating Characteristic curve for different distance measures. Based on these results the authors showed that network proximity delineates therapeutic effects of a drug. This approach of utilizing network proximity in the interactome for drug targets and diseases, allowed for increased understanding in the therapeutic effect of drugs. They made use of cases from Parkinson’s disease and several inflammatory disorders to further substantiate findings. This approach can potentially have significant applications in drug discovery, drug repurposing and assessment of drug adverse effects. Another study [123] led to the development of a current state-of-the-art tool that addresses computational therapeutics from a network perspective, the TCM-Mesh system. This tool allows for the high-throughput network pharmacology analysis for Traditional Chinese Medicine (TCM) [123]. Data set collection: TCM utilizes data curated from collections of 6235 herbs, 383 840 compounds, 14 298 genes, 6204 diseases, 144 723 gene–disease associations, 3 440 231 pairs of gene interactions, 163 221 side-effect records and 71 toxic records (data as of April 2017). The information for traditional Chinese herbs and traditional Chinese medicine preparation was extracted from TCM Database@Taiwan, TCMID, information of compounds and their targets; diseases and their related proteins were obtained from STITCH and OMIM, respectively; the protein interactions were obtained from STRING; the toxic and side-effect records of compounds were derived from TOXNET and SIDER. Network construction: The authors used Cytoscape as well as a web-based software to facilitate visualization of a compound–gene–disease network construction between TCM and treated diseases. Network analysis: The authors based their network analysis and scored their compounds using the combined score as defined and obtained from the STITCH database. This score represents the strength of the links between the compounds and their associated proteins. Findings and significance of research: The authors used 1293 FDA-approved drugs, as well as compounds from a herbal material Panax ginseng and a patented drug Liuwei Dihuang Wan for evaluating their database. By comparison of different databases, as well as checking against literature, they demonstrated the completeness, effectiveness and accuracy of the TCM-Mesh database and further aided in increased understanding of the molecular mechanisms of TCM action. Various tools are currently available, which can aid the Systems Bioinformatics application in drug discovery. For example, tools like Substructure-Drug-Target Network-Based Inference SDTNBI [124], C(2) Maps [125], Chem2Bio2RDF [126] and PROMISCUOUS [127] cumulatively provide integrated systems and pharmacology databases for chemoinformatics analysis, drug-target prediction, networks of disease–gene–drug connectivity relationships as well as drug repositioning analysis. For a full list of tools and databases adopting or supporting Systems Bioinformatics methodologies, see Table 1. A more comprehensive list of related tools and databases going back to 2010 can be found in Supplementary Table S1. Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling Discussion The concept of utilizing networks to visualize the complex interaction of mechanisms implicated in disease has been around for several years. However, two important breakthroughs separate previous network-based approaches and are currently driving the state-of-the-art in Systems Bioinformatics: (i) construction of multiple networks representing each level of the omics spectrum and the integration of these in a layered network that exchanges information within and between layers [62, 67, 97] and (ii) the advent of novel techniques and methodologies for analysing and understanding these networks using mathematical algorithms and approaches derived from graph theory and information theory [6]. Using these methods for extracting biologically meaningful information from multiple levels of the omics spectrum can provide the integrated systemic knowledge for the development of a comprehensive Human System profile, which increases diagnostic accuracy and concurrently allows for novel therapeutic advances and assess response to therapy. Nevertheless, the network-based approaches, either for evidence-based or for statistically inferred molecular networks, have a number of limitations. Specifically, networks based on experimental evidence are not complete as experiments are only a snapshot of the real biological world. Moreover, statistically inferred networks represent an undetermined computational problem because the number of the inferred relationships is much larger than the number of the independent measurements [196]. Owing to the lack of sufficient ground truth to validate the reconstructed molecular networks, special attention must be given when choosing benchmark data sets (e.g. existing curated databases with experimental information, medium-throughput experimental information and simulated data sets that mimic real data). To this extent, the DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative facilitates researchers from the Systems Bioinformatics field to assess the validity of the networks they are using and proceed with optimization and parameter-tuning regarding network reconstruction [197]. A subsequent limitation of the network-based approaches is the low overlap that the various network reconstruction approaches have, and the inadequacy in selecting the proper network each time. It is likely that computational approaches checking and exploiting complementarities and providing ensemble solutions of a network construction consensus will maximize the information content. In our opinion, Systems Bioinformatics might currently appear rather aspirational, yet, considering its potential it is likely to have a major impact on medicine and pharmacology in the next decade. The field of medicine is expected to benefit from the invaluable knowledge attained from Systems Bioinformatics methodologies. The molecular basis of complex, polygenic diseases is highly heterogeneous and affected by multiple factors simultaneously. These include genetic predisposition, multipart molecular mechanisms and effects of the environment, diet, drug administration/response and numerous other factors. Although it might not be possible to replace the use of traditional approaches for therapeutics and diagnosis with computational methods, yet, it is likely that Systems Bioinformatics will provide revolutionary approaches and tools to clinicians in order to demystify the complex nature of these diseases. Computational diagnostics and therapeutics, enhanced by Systems Bioinformatics approaches, will not only aid clinicians in patient consultation and care but will also catalyse significant breakthroughs in prognostic measures, detection of disease at an early onset and overall disease prevention. Supplementary Material bbx151_Supp Click here for additional data file.
Section	Introduction Biological data, either as large-scale omics or as classical biodata, are the footprints of biological mechanisms. These mechanisms consist of numerous synergistic effects emerging from various systems of interwoven biomolecules, cells and tissues. Therefore, it is necessary to explore them with a systemic approach to reveal the behaviour of the system as a whole rather than as the sum of its parts. Systems biology provides a holistic perspective on biological mechanisms via the integration of information and knowledge from multiple interdisciplinary fields (such as biology, chemistry, mathematics, computer science and physics). It aims to elucidate synergistic relationships between multiple factors in contrast to representing them as single entities and can lead to the generation of complex molecular networks of interactions modelled by computational or mathematical approaches. Systems biology harnesses its power from technological advances in the field of ‘omics’ and the advent of next-generation sequencing. These technologies provide a spectrum of information ranging from genomics, transcriptomics and proteomics to epigenomics, pharmacogenomics, metagenomics and metabolomics. Bioinformatics and computational biology have made significant breakthroughs towards the analysis and interpretation of the data obtained from the above-mentioned omics technologies. The sheer size of data generated by these high-throughput methodologies, coupled with the need to analyse, integrate and concurrently interpret this avalanche of information in a systemic way, has paved the way to the upcoming field of Systems Bioinformatics. Systems Bioinformatics is a relatively new approach, which lies in the intersection of systems biology and classical bioinformatics. It focuses on integrating information across different levels using a bottom-up approach—as adopted in systems biology—with a data-driven top-down approach used in bioinformatics. The bottom-up approach in systems biology typically brings together information from molecular cells and tissues in the framework of mathematical models to generate insights on the function and dynamic behaviour of cells, organs and organisms. The top-down approach uses bioinformatics methods to extract and analyse information from ‘omics’ data generated through high-throughput techniques. Initially, informatics approaches to systems biology focused mainly on modelling and simulation. However, owing to the lack of sufficient experimental data, these methods fell short in building reliable models. Subsequently, the explosion of multilevel data generation brought a plethora of new methods, tools and solutions capable of studying systemic properties. The application of systemic approaches such as information theory, statistical inference, probabilistic models, graph theory and further network science approaches in the analysis of biological data paved the way to the creation of a distinct field, namely, Systems Bioinformatics. Depending on the availability, the quality and the comprehensiveness of the data, Systems Bioinformatics’ methods contribute significant benefit in narrowing down the gap between genotype to phenotype as well as providing additional information regarding biomarker and drug discovery. These methods are applied to classical biological data, clinical/patient data and omics data as well. They are suitable for extracting precise and personalized results, thus, facilitating systems medicine (medicine that is in bidirectional interaction with computational multiscale analysis and modelling of disease-related mechanisms) and more specifically P4 Medicine (medicine that is personal, participatory, predictive and preventive) [1, 2] (as illustrated in Figure 1). Figure 1 Systems Bioinformatics. A schematic representation of the emergence of Systems Bioinformatics as a distinct discipline among other interrelated and interdependent disciplines. The information provided by Bioinformatics, Biology and Systems Biology is integrated in the Systems Bioinformatics framework through computational integration and network-based and other holistic approaches to tackle challenges in Systems Medicine and in particular P4 Medicine. The delivery of individually adapted medical care of high precision, based on multi-source patient information across various levels and in various scales, is the basic idea of modern medicine having various appellations depending on the emphasis given (e.g. translational/systems/P4/precision/personalized medicine). The omics spectrum offers the opportunity and the challenge for multiscale and multi-source analysis towards building a comprehensive profile of the Human System (Figure 2). The major challenges faced by Systems Bioinformatics towards this demanding form of medicine are: (i) the design and development of suitable bioinformatics pipelines to provide valid and sufficient biological information from the high-throughput molecular profiles of the patient, (ii) the development of robust information systems capable for data integration, information extraction and knowledge sharing, (iii) the construction of mathematical models to predict the evolution of a particular disease, its relation with the measured markers, its tolerance/resistance to various drug families and the existing risks to the patient. These challenges can be tackled with state-of-the-art computational methodologies and techniques, such as computational intelligence, machine learning, pattern recognition and data mining, modelling and simulation, network reconstruction and visualization, complex network analysis, deep learning, text mining/semantics and association analysis. Further to these, Systems Bioinformatics serves as the framework for the development of powerful computational methods and tools to create user-friendly platforms to visualize and analyse big and heterogeneous information in the form of a network. Figure 2 Network Integration. Multiscale and multisource data generated from the Human System can be represented in network form. These networks can be further analysed and, importantly, they can be integrated forming supernetworks and building a comprehensive profile of the Human System. This review is structured in three main sections. In the first section on ‘Systems Bioinformatics’ we begin with an overview of the systems theory approach for complex biological problems. We then provide an in-depth summary of the network science approach in System Bioinformatics. We introduce certain basic network measures, which are used to analyse the components of a network, both locally and globally, and discuss the biological interpretation of such measures. We then describe in detail key biological network construction methods followed by a discussion on module-based approaches and network signatures. Finally, we describe network manipulation methods in the ‘Network controllability’ and ‘Network integration’ subsections. In the following subsection we provide a short summary regarding modelling and simulation approaches followed by a short discussion on the infrastructures and data management challenges in the field of Systems Bioinformatics. In the last two sections we provide an overview of methods and case studies with regards to the Systems Bioinformatics applications in biomarker and drug discovery.
Title	Introduction
Figure caption	Figure 1 Systems Bioinformatics. A schematic representation of the emergence of Systems Bioinformatics as a distinct discipline among other interrelated and interdependent disciplines. The information provided by Bioinformatics, Biology and Systems Biology is integrated in the Systems Bioinformatics framework through computational integration and network-based and other holistic approaches to tackle challenges in Systems Medicine and in particular P4 Medicine.
Figure caption	Figure 2 Network Integration. Multiscale and multisource data generated from the Human System can be represented in network form. These networks can be further analysed and, importantly, they can be integrated forming supernetworks and building a comprehensive profile of the Human System.
Section	Systems Bioinformatics Systems approaches Biological data have tremendously expanded both in size and complexity. Systems Bioinformatics focuses on the investigation of such vast and complex biological systems and their within interactions using a ‘holistic’ rather than a ‘reductionist’ approach, much like the systems biology field. A holistic approach to science and the analysis and description of a complex phenomenon emphasizes the whole and the interaction of its parts, whereas the reductionist approach focuses on the fundamental parts. In fact, the debate on reductionism versus holism has its roots in ancient years. According to reductionism proponents, the optimal method to understand any science is the decomposition in smaller components. Moreover, in its greedy form, reductionism may see the whole science as physics. Even in its layered-model form, reductionism considers human/health sciences as based on biology, biology based on chemistry and chemistry based on physics. On the other hand, a strong dissent has formulated a solid antireductionism trend. This trend has either epistemological or ontological origins, supporting that complete reductionism is technically impossible and that there are emergent laws that govern the system and cannot be derived from the laws governing the components of the system. Furthermore, it is supported that each system has a ‘buffering capacity’ where many micro-states correspond to fewer macro-states of the system, making reductionism to be considered as pointless after a certain decomposition level [3]. The reductionism’s approach in biology is epitomized by molecular biology, which in the past two decades has led to the generation of a plethora of omics data. These data provide information on the building blocks of the entire organism at different scales and for different types of cells, tissues and organs. Data on DNA fragments, genes, RNA fragments, peptides, proteins and metabolites measured in short time and space intervals provide a spatiotemporal distribution of these building blocks under various states of the organism. These interwoven building blocks control and are controlled by signals in a non-linear way. As a result, the understanding of the system requires something more than simply the bottom-up assembly of the system’ components. Systems theory, which is a holistic approach, addresses the limitations from the reductionism’s point of view by considering the system as a whole, adopting a top-down approach [4]. Thus, it studies the emergent properties of the system such as homeostasis, adaptivity, tolerance, stability and modularity, through some basic overlying hierarchical principles such as entropy, positive and negative feedback control [4]. In the context of Systems approaches, the graph theory and the further science of networks have been successfully applied to the investigation of complex phenomena across a range of different scientific disciplines. The theoretical context of complex networks approach includes concepts that are derived from information theory, dynamical systems, statistical physics and topology approaches, as well as several mathematical methods suited for the analysis of the interaction of components in a complex network. In the following subsection we provide an in-depth overview of such network approaches and examples of their use in Systems Bioinformatics. Networks Biological network basics Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for the understanding of the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed as networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A). It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the analysis and extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting one node to its neighbours. The probability distribution of the degrees over the whole network is called degree distribution. In random networks, most nodes have a similar number of links and their degree distribution follows the Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law, as most of the nodes have few links and only a few nodes are densely connected [7]. Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges. Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [i.e. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to indicate essential genes (i.e. critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they rest in critical junctions between hubs and can therefore represent bridges that allow groups of nodes to cross talk to each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections that if removed will result in the complete loss of connectivity between clusters of nodes, thus affecting greatly the overall topology and, as a result, the information propagation in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. Another measure is closeness (a measure of the average length of the shortest paths from one node to other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein to be functionally relevant for several other proteins. An example illustrating how network measures, such as network-efficiency and network-clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10] where a network was constructed using the alpha bandwidth (8–13 Hz) of the electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering-coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering-coefficients of all nodes). In Blain-Morales et al. [10] network efficiency was significantly decreased and network clustering-coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers for normal recovery brain networks post general anaesthesia induction. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion). Biological network construction methods Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks that rely on experimental evidence for specific molecular interactions such as PPI networks, metabolic networks and regulatory networks (transcription factor—gene networks, non-coding RNA—gene networks) [14–18], (ii) statistically inferred networks, which are based on statistical inference that rely on interactions between components established by means of statistical analysis. Evidence-based molecular networks : The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (i.e. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes to highlight significant genes for each disease. Certain approaches utilize biological knowledge, to address many biological problems and find genes related to the disease of interest. Construction of networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases. For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. For example, in a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data of spinal nerves led to the identification of 19 biological processes to be involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. Furthermore, PDI databases include the EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and the MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised owing to lack of available data [38]. Based on this limitation, many statistical network inference methods were developed to construct statistically inferred gene networks, based on omic data from high-throughput technologies, as they provide snapshots of the transcriptome under many tested experimental conditions [39]. Statistically inferred networks: A type of statistical inference network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42]. Sources like the BioGRID [43] database, allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind the network inference methods is to search for sets of co-expressed genes. Depending on the metric that is used, these methods can be classified into three major categories [38]: (i) Mutual Information-based methods, (ii) Correlation-based methods and (iii) Tree-based methods. Mutual information-based methods calculate the mutual information values of all pairs for a given gene expression profile, and if a pair’s corresponding value is larger than a given threshold then this pair of genes is considered as linked. The resulting network is constructed based on this threshold by including a weighted edge between two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood or Relatedness Network) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. In the case of correlation-based methods, the different algorithms calculate the correlation or the partial correlation between pairs of genes. These methods are implemented through algorithms like GeneNet, a statistical learning algorithm which allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm which calculates correlations across each pair of genes. It computes an adjacency matrix using the Spearman correlation, Lasso—a shrinkage and selection method for linear regression [51]—and Adaptive Lasso—another version of Lasso modified to include penalty weights [52]. In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based methods in regression is to recursively divide the learning sample with binary tests based each on one input variable (the expression of one gene). These binary tests are optimized to minimize, in the largest amount possible, the variance of the output variable, namely, the expression of another gene from the remaining in the subsets of samples. Candidate divisions compare values from the input variable with a threshold, which is determined during the tree growing. Tree-based ensemble methods are more enhanced than single trees, as they estimate the average predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical-inference methods are used to estimate the expression pattern relationships across all pairs of genes driving to co-expression network inference. However, correlation-based methods have a tendency to be algorithmically straightforward and computationally fast, but with the limitation that assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions but they can be computationally expensive. From another point of view, tree-based methods are non-parametric and consequently, they do not need to make any assumption about the nature of the data. Tree-based methods can deal effectively with high-dimensionality data. Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships. Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. In the case of Systems Bioinformatics there are relatively limited, yet of great interest, studies that have used such an approach [87]. ‘Driving nodes’ are highly important nodes in a network that governs its controllability. Control theory shows that to direct a complex network towards a desired state, there is a minimum number of driving nodes. Determining the minimum number of driving nodes can be demanding with respect to both computational resources and time, hence novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method that optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes are sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct queues or inputs, it is possible to activate specific cellular processes that determine disease outcome (i.e. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to perform classification of proteins with respect to their importance in the network. The authors quantified the effects of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets. Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. obesity) are more likely to share commonalties in the metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90] the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights in predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel, multiple-network-framework integration for epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors of this study also showed that epigenetic modules also estimate the survival time of patients, and this factor is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section. Systems modelling and simulation To test the validity and predict the behaviour of complex biochemical systems, such as gene networks, it is often required to describe the effects of multiple, simultaneous and dynamic interactions within the components of the system that are too complex to interpret intuitively. Developing and simulating mathematical models is essential in investigating such complex biological systems. These complex systems can be further explored using mathematical models to describe the valid structure (i.e. the components of the system and their interactions based on experimental data) and identify the basic underlying principles of their function to predict behavioural responses to a certain perturbation [92]. Two types of models are commonly used to describe biological processes such as gene networks—‘quantitative’ and ‘logical’ [93, 94]. Quantitative models use differential equations to describe the non-linear dynamic interactions in a network, whereas logical models use the Boolean approach to describe dynamics in a qualitative way. Quantitative models provide precise information and can be directly compared with experiments including time-dependent data. However, they require sufficient knowledge of the mechanistic details and kinetic parameters and, thus, they are limited to applications to networks which are well characterized and are of small to moderate size. Logical models do not require such information and can be applied to large-scale networks with known structure, yet only provide limited information, as they cannot provide quantitative predictions and assist in choosing better alternative behaviours. In summary, each modelling approach has its advantages and disadvantages and recent work suggests that hybrid approaches might be optimal for challenges in systems biology (for a detailed discussion see [93]). Mathematical models are indispensable in pharmacology and diagnostics. For example, spatio-temporal mathematical models of the blood coagulation network have been developed to aid drug development and diagnostics (as extensively reviewed in [95]). Another relevant application is the use of mathematical models of drug-targeted pathways (modelled with a set of differential equations based on the mass action law) to explore drug combinations [96]. Classical bioinformatics and systems biology can complement and strengthen each other in drug discovery and therapeutics where concrete predictions are required [92]. The value of combining high-throughput data with mathematical modelling is shown, for example, in devising personalized treatments in cancer (for extensive review see [97]). Integration of multi-omic data can be used as an additional constraint in constraint-based modelling in systems biology (to optimize parameter estimation and validation). For a recent survey summarizing constraint-based metabolomic modelling and multi-omic integration methods see [98]. Infrastructures and data management It is important to highlight some of the modern computing trends that play an important role in driving research in Systems Bioinformatics and facilitate the transition to personalized medicine. The main limiting factor for research laboratories specializing in Systems Bioinformatics is computational power and resources. Significant investment is required to attain high-performance computer (HPC) servers or clusters, which have the capacity to store, manage and process the vast amount of data generated from high-throughput omics technologies. Often sheer maintenance of these machines can be a costly and a limiting factor that disallows the exploitation of the full potential of HPC. Cloud computing promises to solve major issues of system administration for these computer clusters by allowing for the exploitation of HPC, stored and managed in an expert environment, as virtual resources that are made available through the internet. Tool availability is also a major issue and having organized platforms with tools like CytoScape [99], GATK [100], BLAST [101], omics assemblers (i.e. IDBA-UD [102]) and programming languages and packages like R's Bioconductor library for expression data analysis [103], JAVA, Python, SQL and others is of major importance for scientists to facilitate dissemination of algorithms, data and results.
Title	Systems Bioinformatics
Section	Systems approaches Biological data have tremendously expanded both in size and complexity. Systems Bioinformatics focuses on the investigation of such vast and complex biological systems and their within interactions using a ‘holistic’ rather than a ‘reductionist’ approach, much like the systems biology field. A holistic approach to science and the analysis and description of a complex phenomenon emphasizes the whole and the interaction of its parts, whereas the reductionist approach focuses on the fundamental parts. In fact, the debate on reductionism versus holism has its roots in ancient years. According to reductionism proponents, the optimal method to understand any science is the decomposition in smaller components. Moreover, in its greedy form, reductionism may see the whole science as physics. Even in its layered-model form, reductionism considers human/health sciences as based on biology, biology based on chemistry and chemistry based on physics. On the other hand, a strong dissent has formulated a solid antireductionism trend. This trend has either epistemological or ontological origins, supporting that complete reductionism is technically impossible and that there are emergent laws that govern the system and cannot be derived from the laws governing the components of the system. Furthermore, it is supported that each system has a ‘buffering capacity’ where many micro-states correspond to fewer macro-states of the system, making reductionism to be considered as pointless after a certain decomposition level [3]. The reductionism’s approach in biology is epitomized by molecular biology, which in the past two decades has led to the generation of a plethora of omics data. These data provide information on the building blocks of the entire organism at different scales and for different types of cells, tissues and organs. Data on DNA fragments, genes, RNA fragments, peptides, proteins and metabolites measured in short time and space intervals provide a spatiotemporal distribution of these building blocks under various states of the organism. These interwoven building blocks control and are controlled by signals in a non-linear way. As a result, the understanding of the system requires something more than simply the bottom-up assembly of the system’ components. Systems theory, which is a holistic approach, addresses the limitations from the reductionism’s point of view by considering the system as a whole, adopting a top-down approach [4]. Thus, it studies the emergent properties of the system such as homeostasis, adaptivity, tolerance, stability and modularity, through some basic overlying hierarchical principles such as entropy, positive and negative feedback control [4]. In the context of Systems approaches, the graph theory and the further science of networks have been successfully applied to the investigation of complex phenomena across a range of different scientific disciplines. The theoretical context of complex networks approach includes concepts that are derived from information theory, dynamical systems, statistical physics and topology approaches, as well as several mathematical methods suited for the analysis of the interaction of components in a complex network. In the following subsection we provide an in-depth overview of such network approaches and examples of their use in Systems Bioinformatics.
Title	Systems approaches
Section	Networks Biological network basics Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for the understanding of the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed as networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A). It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the analysis and extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting one node to its neighbours. The probability distribution of the degrees over the whole network is called degree distribution. In random networks, most nodes have a similar number of links and their degree distribution follows the Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law, as most of the nodes have few links and only a few nodes are densely connected [7]. Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges. Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [i.e. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to indicate essential genes (i.e. critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they rest in critical junctions between hubs and can therefore represent bridges that allow groups of nodes to cross talk to each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections that if removed will result in the complete loss of connectivity between clusters of nodes, thus affecting greatly the overall topology and, as a result, the information propagation in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. Another measure is closeness (a measure of the average length of the shortest paths from one node to other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein to be functionally relevant for several other proteins. An example illustrating how network measures, such as network-efficiency and network-clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10] where a network was constructed using the alpha bandwidth (8–13 Hz) of the electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering-coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering-coefficients of all nodes). In Blain-Morales et al. [10] network efficiency was significantly decreased and network clustering-coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers for normal recovery brain networks post general anaesthesia induction. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion). Biological network construction methods Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks that rely on experimental evidence for specific molecular interactions such as PPI networks, metabolic networks and regulatory networks (transcription factor—gene networks, non-coding RNA—gene networks) [14–18], (ii) statistically inferred networks, which are based on statistical inference that rely on interactions between components established by means of statistical analysis. Evidence-based molecular networks : The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (i.e. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes to highlight significant genes for each disease. Certain approaches utilize biological knowledge, to address many biological problems and find genes related to the disease of interest. Construction of networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases. For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. For example, in a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data of spinal nerves led to the identification of 19 biological processes to be involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. Furthermore, PDI databases include the EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and the MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised owing to lack of available data [38]. Based on this limitation, many statistical network inference methods were developed to construct statistically inferred gene networks, based on omic data from high-throughput technologies, as they provide snapshots of the transcriptome under many tested experimental conditions [39]. Statistically inferred networks: A type of statistical inference network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42]. Sources like the BioGRID [43] database, allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind the network inference methods is to search for sets of co-expressed genes. Depending on the metric that is used, these methods can be classified into three major categories [38]: (i) Mutual Information-based methods, (ii) Correlation-based methods and (iii) Tree-based methods. Mutual information-based methods calculate the mutual information values of all pairs for a given gene expression profile, and if a pair’s corresponding value is larger than a given threshold then this pair of genes is considered as linked. The resulting network is constructed based on this threshold by including a weighted edge between two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood or Relatedness Network) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. In the case of correlation-based methods, the different algorithms calculate the correlation or the partial correlation between pairs of genes. These methods are implemented through algorithms like GeneNet, a statistical learning algorithm which allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm which calculates correlations across each pair of genes. It computes an adjacency matrix using the Spearman correlation, Lasso—a shrinkage and selection method for linear regression [51]—and Adaptive Lasso—another version of Lasso modified to include penalty weights [52]. In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based methods in regression is to recursively divide the learning sample with binary tests based each on one input variable (the expression of one gene). These binary tests are optimized to minimize, in the largest amount possible, the variance of the output variable, namely, the expression of another gene from the remaining in the subsets of samples. Candidate divisions compare values from the input variable with a threshold, which is determined during the tree growing. Tree-based ensemble methods are more enhanced than single trees, as they estimate the average predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical-inference methods are used to estimate the expression pattern relationships across all pairs of genes driving to co-expression network inference. However, correlation-based methods have a tendency to be algorithmically straightforward and computationally fast, but with the limitation that assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions but they can be computationally expensive. From another point of view, tree-based methods are non-parametric and consequently, they do not need to make any assumption about the nature of the data. Tree-based methods can deal effectively with high-dimensionality data. Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships. Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. In the case of Systems Bioinformatics there are relatively limited, yet of great interest, studies that have used such an approach [87]. ‘Driving nodes’ are highly important nodes in a network that governs its controllability. Control theory shows that to direct a complex network towards a desired state, there is a minimum number of driving nodes. Determining the minimum number of driving nodes can be demanding with respect to both computational resources and time, hence novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method that optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes are sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct queues or inputs, it is possible to activate specific cellular processes that determine disease outcome (i.e. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to perform classification of proteins with respect to their importance in the network. The authors quantified the effects of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets. Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. obesity) are more likely to share commonalties in the metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90] the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights in predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel, multiple-network-framework integration for epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors of this study also showed that epigenetic modules also estimate the survival time of patients, and this factor is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section.
Title	Networks
Section	Biological network basics Casting biological systems as networks and analysing their topology can be useful in understanding how such systems are organized. Graph theory provides a powerful mathematical framework for the understanding of the organization of such large and complex systems by considering them in the form of graphs [5, 6]. Graphs, also termed as networks, can be used to model the pairwise relations between objects. A network is a collection of nodes or vertices connected by edges, arcs or lines (as shown in Figure 3A). It may be undirected, meaning that there is no distinction between the two nodes associated with each edge, or its edges may be directed from one node to another (as shown in Figure 3B). In cell biology, nodes represent cellular components (e.g. proteins) and edges represent interactions or other relationships between these components [e.g. protein–protein interactions (PPIs)]. Basic network measures can be used to analyse the components of a network, both locally and globally, and facilitate the analysis and extraction of useful information from a biological network. The most elementary characteristic of a node is its degree, i.e. the number of edges connecting one node to its neighbours. The probability distribution of the degrees over the whole network is called degree distribution. In random networks, most nodes have a similar number of links and their degree distribution follows the Poisson distribution. In contrast, many real-world networks, including most biological networks, are scale free. This means that their degree distribution follows a power law, as most of the nodes have few links and only a few nodes are densely connected [7]. Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges. Nodes with high degree, known as ‘hubs’ (illustrated in Figure 3C), can be key players in molecular mechanisms such as (i) a protein interacting with multiple other proteins, (ii) regulation of multiple genes by a key transcription factor or (iii) multipart regulation by other regulatory elements [i.e. microRNAs (miRNAs)]. All these cellular processes may be highly significant in determining the outcome or phenotype of a disease of interest. In human cells, hub genes have been found to indicate essential genes (i.e. critical for survival) rather than disease genes [8]. Another node feature is its betweenness, i.e. the extent to which a node participates in the shortest paths connecting other nodes. Nodes with high betweenness, known as ‘bottlenecks’, can be extremely influential in a network in the sense that they rest in critical junctions between hubs and can therefore represent bridges that allow groups of nodes to cross talk to each other (as illustrated in Figure 3C). Importantly, some of these bottleneck nodes represent key connections that if removed will result in the complete loss of connectivity between clusters of nodes, thus affecting greatly the overall topology and, as a result, the information propagation in the network. In molecular terms, an example of bottleneck nodes is that of proteins whose loss of function leads to deactivation of specific processes. In directed regulatory protein networks, betweenness was shown to be a good predictor of essentiality [9]. Various other network features can be calculated to provide insights into biological networks. Another measure is closeness (a measure of the average length of the shortest paths from one node to other nodes), which indicates important nodes that can communicate quickly with other nodes of the network. For example, in a protein signalling network closeness can be interpreted as the ‘probability’ of a protein to be functionally relevant for several other proteins. An example illustrating how network measures, such as network-efficiency and network-clustering, can be used as biomarkers is the recent study of Blain-Morales et al. [10] where a network was constructed using the alpha bandwidth (8–13 Hz) of the electroencephalogram recordings during anaesthesia in healthy humans. Global network efficiency quantifies the efficiency of information exchange across the whole network and is defined as the average inverse shortest path length over all pairs of nodes. The clustering-coefficient is a measure of the degree to which nodes in a network tend to cluster together (the global measure is calculated by averaging the local clustering-coefficients of all nodes). In Blain-Morales et al. [10] network efficiency was significantly decreased and network clustering-coefficient was significantly increased during anaesthesia-induced unconsciousness. These measures returned to baseline 3 h post-recovery, suggesting that they could be used as potential biomarkers for normal recovery brain networks post general anaesthesia induction. Other network measures such as network size [11], density [11], PageRank versatility [12], path length [10] and modularity [10] can further be used to evaluate networks. For an extensive review of network measures, the reader is referred to [13]. Although topological properties from a graph-theory point of view do not always have a clear biological meaning, in many cases they can be good predictors of functional and disease modules (see [8] for further discussion).
Title	Biological network basics
Figure caption	Figure 3 Network Basics. (A) The basic elements of a network are illustrated in this simple network where a circle indicates a node and a line indicates an edge. (B) Networks can be either undirected (upper panel) or directed (lower panel). (C) Hubs (red nodes – or dark grey nodes in Black & White printing) and bottlenecks (green nodes – or medium grey nodes in Black & White printing) are illustrated in this sample graph. Two example modules (green and blue areas – or shadowed areas in Black & White printing) are illustrated as subgroups of nodes and their respective edges.
Section	Biological network construction methods Biological networks can be split into two broad categories that best characterize their underlying nature: (i) evidence-based molecular networks that rely on experimental evidence for specific molecular interactions such as PPI networks, metabolic networks and regulatory networks (transcription factor—gene networks, non-coding RNA—gene networks) [14–18], (ii) statistically inferred networks, which are based on statistical inference that rely on interactions between components established by means of statistical analysis. Evidence-based molecular networks : The information used to build networks of molecular interactions is obtained from small-, medium- or large-scale experimental data that are usually aggregated and available in online databases [19, 20]. A plethora of information can be derived from multiple resources including PPIs, gene regulatory relationships (including miRNAs) and metabolic pathways, using high-throughput (i.e. whole exome sequencing) and literature-curated data. In addition, valuable pre-compiled information can be derived from databases like Gene Ontology [21], REACTOME [22, 23] and literature-based annotations in Genome Recognition Analysis Internet Link [24]. The construction of biological interaction networks with the goal of uncovering causal relationships constitutes a major research topic in systems biology [25]. Many approaches have been developed to study the interactions among a large number of genes to highlight significant genes for each disease. Certain approaches utilize biological knowledge, to address many biological problems and find genes related to the disease of interest. Construction of networks requires knowledge of PPIs, protein–DNA interactions (PDIs) and/or protein–metabolite interactions (PMIs). Such data can be obtained from open-access databases. For example, PPI data can be obtained from the Search Tool for Recurring Instances of Neighbouring Genes (STRING) [26], the Human Protein Reference Database (HPRD) [27], the Biomolecular Interaction Network Database (BIND) [28], the Molecular INTeraction database (MINT) [29] and the Biological General Repository for Interaction Datasets (BioGRID) [30]. For example, in a recent study, a network-based analysis of mass-spectrometry (MS)-based proteomics data of spinal nerves led to the identification of 19 biological processes to be involved in retrograde motoneurodegeneration and neuroprotection after axonal damage [31]. In this study, the authors used nine public PPI databases to obtain protein interaction data. Furthermore, PDI databases include the EdgeExpressDB (FANTOM4-EEDB) [32], the Transcriptional Regulatory Element Database [33], MSigDB [34], MultiNet [35] and the MetaCore [36]. The KEGG pathway database [37] can be used to obtain PMI data. Nevertheless, as a large number of genes are not functionally characterized, these approaches are compromised owing to lack of available data [38]. Based on this limitation, many statistical network inference methods were developed to construct statistically inferred gene networks, based on omic data from high-throughput technologies, as they provide snapshots of the transcriptome under many tested experimental conditions [39]. Statistically inferred networks: A type of statistical inference network is the ‘co-expression network’, where genes are connected based on statistically significant correlated or anti-correlated (depending on the underlying question) expression profiles with respect to a disease of interest. Another type of statistically generated network is the ‘genetic network’ [40–42]. Sources like the BioGRID [43] database, allow researchers to investigate how the dysregulation of one gene affects the downstream response of another gene and, moreover, how this cascade of molecular functions influences specific disease phenotypes. The basic idea behind the network inference methods is to search for sets of co-expressed genes. Depending on the metric that is used, these methods can be classified into three major categories [38]: (i) Mutual Information-based methods, (ii) Correlation-based methods and (iii) Tree-based methods. Mutual information-based methods calculate the mutual information values of all pairs for a given gene expression profile, and if a pair’s corresponding value is larger than a given threshold then this pair of genes is considered as linked. The resulting network is constructed based on this threshold by including a weighted edge between two genes [44]. The weight can be calculated with several algorithms: ARACNE (Algorithm for the Reconstruction of Accurate Cellular Networks) [45], CLR (Context Likelihood or Relatedness Network) [46], MRNET (Maximum Relevance Minimum Redundancy) [47], MRNETB (Maximum Relevance Minimum Redundancy Backward) [47] and C3NET [48]. In the case of correlation-based methods, the different algorithms calculate the correlation or the partial correlation between pairs of genes. These methods are implemented through algorithms like GeneNet, a statistical learning algorithm which allows the assessment of Graphical Gaussian Models [49], and Weighted Correlation Network Analysis (WGCNA) [50], an algorithm which calculates correlations across each pair of genes. It computes an adjacency matrix using the Spearman correlation, Lasso—a shrinkage and selection method for linear regression [51]—and Adaptive Lasso—another version of Lasso modified to include penalty weights [52]. In the case of tree-based methods, algorithms use tree-based ensemble methods as feature selection techniques to solve a regression problem for each gene in the network. More specifically, the basic idea of tree-based methods in regression is to recursively divide the learning sample with binary tests based each on one input variable (the expression of one gene). These binary tests are optimized to minimize, in the largest amount possible, the variance of the output variable, namely, the expression of another gene from the remaining in the subsets of samples. Candidate divisions compare values from the input variable with a threshold, which is determined during the tree growing. Tree-based ensemble methods are more enhanced than single trees, as they estimate the average predictions of several trees. An example of a tree-based method is the GENIE3 algorithm [53], which emerged as the best performer in a significant network inference challenge [39]. In summary, statistical-inference methods are used to estimate the expression pattern relationships across all pairs of genes driving to co-expression network inference. However, correlation-based methods have a tendency to be algorithmically straightforward and computationally fast, but with the limitation that assume linear relationships among variables. In contrast, methods based on mutual information capture non-linear as well as linear interactions but they can be computationally expensive. From another point of view, tree-based methods are non-parametric and consequently, they do not need to make any assumption about the nature of the data. Tree-based methods can deal effectively with high-dimensionality data.
Title	Biological network construction methods
Section	Module-based approaches and network signatures Representing high-throughput data with networks often leads to complex and highly dense networks that cannot be easily interpreted by the human eye. To extract biologically meaningful information from these networks and establish links to disease, methods have been developed for scanning and parsing these networks. These methods allow for significant sub-networks to be highlighted in the sea of nodes and edges, often representing important ‘modules’ that are associated with a specific disease [8, 54, 55] (as illustrated in Figure 3C). This type of network ‘traversing’ can be performed between networks obtained from data from different phenotypes from the same disease (staging, subtyping) or from similar diseases (disease hierarchy). Through this way, common/shared modules can be identified by looking at network intersections. Alternatively, unique modules reflecting molecular signatures exclusive to specific conditions or phenotypes can be extracted. Depending on the integration mode, the identification of modules may lead to ‘active modules’ (by integrating molecular profiles and highlighting activity of nodes/interactions), ‘conserved modules’ (by comparing multiple species/states and concluding to conserved subnetworks), ‘differential modules’ (by comparing states and conditions and concluding to differentiated subnetworks) and ‘composite modules’ (by integrating multi-source information from complementary networks) [56]. Module identification can be performed using Systems Bioinformatics approaches and constitutes a powerful tool for delineating the systematic molecular basis of disease. Module identification thus abides by two assumptions: (i) modules that are specific to a disease of interest are expected to form dense clusters or hubs capable of detection by unsupervised network clustering algorithms, (ii) the functional relationship between the nodes residing within these clusters is expected to be similar with respect to underlying molecular mechanisms, biological processes and cellular/tissue localization [57]. Further functional significance of modules can be derived by using pathway enrichment analysis methods, either in the traditional sense (Fisher’s enrichment analysis) or by prioritizing/ranking pathways and genes based on network topological similarities to already validated disease network components. The latter case leads to another type of module functional annotation and assignment based on associations and connection to neighbouring nodes that are of known/validated functions. This approach has been used to elucidate functional and clinical significance of hazy areas of molecular networks for which associations between genes or proteins and specific disease phenotypes are not available in publicly available resources or through basic high-throughput analyses. There are several different types of example cases that have used network methods to gain information on functional modules and network signatures. Novel information for genes, pathways and other molecular interactions involved in numerous disorders has been discovered using network module-based approaches. These include disorders such as type 2 diabetes mellitus [58–61], Alzheimer’s disease (AD) [62, 63], Parkinson’s disease [60, 64], cardiovascular diseases, asthma [61, 65–67] and a variety of tumours [68, 69]. When existing bioinformatics resources and databases fall short in shedding light on a specific disorder of interest, novel experiments have to be designed and conducted to successfully unravel the clinicopathological, genetic and molecular mechanisms underlying disease. Success stories include studies for spinocerebellar ataxia [70], Huntington’s disease [71] and schizophrenia [72–74]. For example, a recent study utilized large-scale expression data to extract/identify biologically significant modules from gene expression networks [75]. It has been hypothesized that disease tissue specificity is governed by the expression of a specific functional disease module (sub-network) in the tissue of disease manifestation [75]. The authors of this study adopted this hypothesis and used systems approaches to investigate the tissue-specific expression patterns of disease genes in the human interactome. They observed that genes expressed in a specific tissue shared a topological neighbourhood in the human interactome network, in contrast to genes expressed in different tissues. The authors further provided evidence that expression of all components of the tissue-specific disease module was necessary for determining the disease outcome. The construction of this tissue-specific disease network further allowed for predictions on novel disease–tissue relationships.
Title	Module-based approaches and network signatures
Section	Network controllability Network controllability is defined as the potential to steer a network from a given initial state to a final desired state within a finite time and with appropriate inputs/modifications. Such modifications are also known as ‘network attacks’. Another Systems Bioinformatics’ emerging research direction related to the dynamic properties of complex networks is the way in which these properties are spreading and/or reforming during network attacks. The attack process is usually based on specific mathematical models that decide which nodes-edges or even specific hubs to remove, to examine issues like controllability [76], error tolerance [77], attack vulnerability [77, 78], robustness [79, 80], topological characteristics [81] and control centrality [82]. Cascade attacks [83], degree and betweenness-based attacks and prominence-based attacks are commonly used types of attacks [84]. It has been shown that intentional attacks as well as random failures may easily affect (or even destroy) network functions such as connectivity and synchronization [84, 85], while a small range of node failures can affect the network controllability [86]. In the case of Systems Bioinformatics there are relatively limited, yet of great interest, studies that have used such an approach [87]. ‘Driving nodes’ are highly important nodes in a network that governs its controllability. Control theory shows that to direct a complex network towards a desired state, there is a minimum number of driving nodes. Determining the minimum number of driving nodes can be demanding with respect to both computational resources and time, hence novel non-exhaustive algorithms for determining driving nodes are necessary. A recently published study [88] presents the actuation spectrum method that optimizes the trade-off between driving node prediction and time. The authors validate their methodology across numerous complex networks and show that a small number of driving nodes are sufficient to determine the state of a complex network. Another approach makes use of PPI networks and network controllability. By controlling the structure of human PPI networks, using the correct queues or inputs, it is possible to activate specific cellular processes that determine disease outcome (i.e. apoptosis). A recent study [89] utilized a PPI network of 6339 proteins and 34 813 interactions to perform classification of proteins with respect to their importance in the network. The authors quantified the effects of removing a specific protein from the network by calculating the number of remaining driving nodes. Results showed that the most important proteins according to this analysis were also the primary targets of disease-causing mutations, human viruses and drugs. This study showed that controllability of a network can provide crucial information for the shift between healthy and disease states, at the same time highlighting novel candidate drug targets.
Title	Network controllability
Section	Network integration A key approach in Systems Bioinformatics is the construction of multiple networks representing each level of the omics spectrum and their integration in a layered network that exchanges information within and between layers (Figure 2). Different disease modules have been shown to act in synergy. Thus, to obtain a holistic picture of the complex mechanisms that underlie disease manifestation, it is necessary to construct networks of integrated disease modules. Such an integration can be achieved via various ways: (i) By investigating gene association it is possible to construct connections between different disease modules by looking at shared, common genes. This can reflect the genetic basis of diseases and provide associations between diseases of a common genetic background. (ii) Superimposing gene networks with gene expression data from RNA-Seq or microarray analysis can further enrich these networks of disease modules. (iii) A common genetic basis for disease modules can also be established by analysing genetic variants or polymorphisms (i.e. SNPs, indels). Networks of disease modules sharing genetic mutations can lead to important findings such as establishment of linkage and associations of variants as well as environmental factors with disease modules. (iv) Protein interaction network modules can also be merged using common PPIs between disease modules and moreover, overlaying this information with proteomics expression data can provide valuable insights into the proteome of diseases of interest. (v) Looking at common pathways between disease sub-networks can also provide valuable clues as to similarities and/or differences between diseases of interest. (vi) Metabolic pathways can also provide additional information towards the understanding of enzyme catalytic activity for different disease modules. Disorders that affect specific metabolic pathways (i.e. obesity) are more likely to share commonalties in the metabolic networks than diseases that share a genetic basis. (vii) Disease modules can also be linked using regulatory information such as shared miRNA regulators, thus highlighting important commonalities or differences between diseases. Specific cellular components (or modules) associated with a disease are believed to share a topological neighbourhood within the human interactome [90]. In a recent study [90] the authors utilized novel mathematical conditions to map the topological relationships between diseases in the human interactome. They showed that diseases with common expression profiles, symptoms and comorbidity share overlapping modules in contrast to more phenotypically distinct diseases, which appear in distant topological neighbourhoods. These tools can provide valuable insights in predicting drug therapy for diseases with common phenotypes, even if they are genetically distinct. Another recent study [91] adopted a novel, multiple-network-framework integration for epigenetic modules. This method utilized the Epigenetic Module based on Differential Networks (EMDN) algorithm, which simultaneously analyses DNA methylation and gene expression data [91]. Using The Cancer Genome Atlas (TCGA) breast cancer data, the authors reported that the EMDN algorithm could recognize positively and negatively correlated modules. These modules can serve as biomarkers to predict/diagnose breast cancer subtypes by using methylation profiles, where positively and negatively correlated modules are of equal importance in the classification of cancer subtypes. The authors of this study also showed that epigenetic modules also estimate the survival time of patients, and this factor is critical for cancer therapy. Tools that analyse the structure and topology of these integrated networks are of extreme value and can provide insights into the synergistic role of multiple network components in diseases of interest. The methods for network analysis and integration we have discussed so far are mainly used to describe the topology of a biological network (or a set of networks). Although these methods capture the relationships between components, they fail to capture the dynamics, i.e. the time component is not modelled and, thus, simulations to obtain prediction of the evolution of the system cannot be performed. For example, static insights into the molecular basis of a disease do not provide a complete picture with regards to drug response without access to time-dependent data. Hence, the use of mathematical algorithms and computational tools for modelling the dynamics of these networks complements network analysis and is further detailed in the following section.
Title	Network integration
Section	Systems modelling and simulation To test the validity and predict the behaviour of complex biochemical systems, such as gene networks, it is often required to describe the effects of multiple, simultaneous and dynamic interactions within the components of the system that are too complex to interpret intuitively. Developing and simulating mathematical models is essential in investigating such complex biological systems. These complex systems can be further explored using mathematical models to describe the valid structure (i.e. the components of the system and their interactions based on experimental data) and identify the basic underlying principles of their function to predict behavioural responses to a certain perturbation [92]. Two types of models are commonly used to describe biological processes such as gene networks—‘quantitative’ and ‘logical’ [93, 94]. Quantitative models use differential equations to describe the non-linear dynamic interactions in a network, whereas logical models use the Boolean approach to describe dynamics in a qualitative way. Quantitative models provide precise information and can be directly compared with experiments including time-dependent data. However, they require sufficient knowledge of the mechanistic details and kinetic parameters and, thus, they are limited to applications to networks which are well characterized and are of small to moderate size. Logical models do not require such information and can be applied to large-scale networks with known structure, yet only provide limited information, as they cannot provide quantitative predictions and assist in choosing better alternative behaviours. In summary, each modelling approach has its advantages and disadvantages and recent work suggests that hybrid approaches might be optimal for challenges in systems biology (for a detailed discussion see [93]). Mathematical models are indispensable in pharmacology and diagnostics. For example, spatio-temporal mathematical models of the blood coagulation network have been developed to aid drug development and diagnostics (as extensively reviewed in [95]). Another relevant application is the use of mathematical models of drug-targeted pathways (modelled with a set of differential equations based on the mass action law) to explore drug combinations [96]. Classical bioinformatics and systems biology can complement and strengthen each other in drug discovery and therapeutics where concrete predictions are required [92]. The value of combining high-throughput data with mathematical modelling is shown, for example, in devising personalized treatments in cancer (for extensive review see [97]). Integration of multi-omic data can be used as an additional constraint in constraint-based modelling in systems biology (to optimize parameter estimation and validation). For a recent survey summarizing constraint-based metabolomic modelling and multi-omic integration methods see [98].
Title	Systems modelling and simulation
Section	Infrastructures and data management It is important to highlight some of the modern computing trends that play an important role in driving research in Systems Bioinformatics and facilitate the transition to personalized medicine. The main limiting factor for research laboratories specializing in Systems Bioinformatics is computational power and resources. Significant investment is required to attain high-performance computer (HPC) servers or clusters, which have the capacity to store, manage and process the vast amount of data generated from high-throughput omics technologies. Often sheer maintenance of these machines can be a costly and a limiting factor that disallows the exploitation of the full potential of HPC. Cloud computing promises to solve major issues of system administration for these computer clusters by allowing for the exploitation of HPC, stored and managed in an expert environment, as virtual resources that are made available through the internet. Tool availability is also a major issue and having organized platforms with tools like CytoScape [99], GATK [100], BLAST [101], omics assemblers (i.e. IDBA-UD [102]) and programming languages and packages like R's Bioconductor library for expression data analysis [103], JAVA, Python, SQL and others is of major importance for scientists to facilitate dissemination of algorithms, data and results.
Title	Infrastructures and data management
Section	Systems bioinformatics applications In this section we present the impact of Systems Bioinformatics on diagnostics and therapeutics by highlighting success stories and cases in a formatted manner: introductory text/data set collection/network construction/network analysis/findings and significance of research. Systems bioinformatics applications in biomarker discovery The use of networks in computational diagnostics via the detection of molecular biomarkers is one of the hallmarks of Systems Bioinformatics. Numerous recent state-of-the-art studies have made use of such networks to characterize cellular systems by simultaneously analysing thousands of genes, proteins, isoforms and complexes to address issues of computational diagnostics. Here, we highlight a few studies, showcasing the essence of networks’ contribution in precision diagnostics. A recent study [62] used a machine learning approach, which integrates topological features from PPI networks, to identify candidate AD-associated genes. Data set collection: Positive and negative data sets were collected from Entrez Gene database at the National Centre for Biotechnology Information (NCBI). The positive data set consisted of 458 genes known to be associated with AD. The negative data set consisted of the additional 55 947 Entrez genes, excluding the AD-associated genes. Network construction: Human PPI data sets were extracted from a variety of sources including Online Predicted Human Interaction Database (OPID), STRING, MINT, BIND and InTAct databases. Network analysis: By utilizing the PPI networks, the authors extracted topological features for the AD- and non-AD-associated genes. These features included nine topological properties of the PPI network for each gene, namely, the average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, degree, eccentricity, neighbourhood connectivity, topological coefficient and radiality. Findings and significance of research: The authors further combined sequence features and functional annotations features and concurrently performed feature selection using seven methods including gain-ratio-based attribute evaluation, oneR algorithm, chi-square-based selection, correlation-based selection, information gain-based attribute evaluation and relief-based selection. The most important features were fed into 11 machine learning algorithms to generate classifiers using the training data set capable of predicting AD- and non-AD-associated genes using the selected network, sequence and functional features. Methods included Naive Bayes (NB), NB Tree, Bayes Net, Decision table/NB hybrid classifier, Random Forest, J48, Functional Tree, Locally Weighted Learning (J48 + k-nearest neighbour), Logistic Regression and Support Vector Machine. Training of sophisticated machine learning classifiers using systemic properties can be a key feature in generating personalized medicine diagnostic approaches. The authors finally combined diagnostics with therapeutics by screening 45 known anti-Alzheimer drugs from DrugBank against novel predicted probable AD targets, obtained from their trained classifiers, using molecular docking. They further proposed a novel candidate untried drug, AL-108, with high affinity to potential therapeutic targets. Additional tools were also used to validate preliminary findings, including molecular dynamics simulations and MM/GBSA calculations on the docked complexes [62]. Another interesting study [104] used data from TCGA [105] to successfully construct a multidimensional subnetwork atlas for cancer prognosis. The authors addressed how multiple genetic and epigenetic factors (i.e. gene expression, copy number variation, miRNA expression and DNA methylation) affect molecular states of networks and patient survival. Dat aset collection: The multidimensional cancer-associated data sets for 1027 patients for four cancer types were collected from TCGA Cancer Browser (https://genome-can cer.ucsc.edu/proj/site/hgHeatmap/). They contained clinical information, copy-number variation, promoter DNA methylation, mRNA-gene and miRNA expression data. They furthermore extracted PPI data from HPRD for network construction. To enrich these networks with additional miRNA-regulatory information the authors extracted miRNA and target gene information from two miRNA target databases [miRTarBase (Release 4.5) and TarBase v6], which provide experimentally validated miRNA–target interactions. Network construction: PPI interaction network was constructed using data collected from HPRD. Network analysis: The authors fitted a univariate Cox proportional hazards model between each molecular feature and patient survival time and thus scored each gene based on its significance to predict survival. Genes with a positive score were considered as survival-related genes. They next used this score (heat score) as the input into HotNet2, which uses a heat diffusion process and a statistical test-based algorithm to discover subnetwork signatures in the PPI network. Through this way subnetwork signatures of survival-related genes were determined both by the scores of their genes as well as gene topology in the PPI network. Findings and significance of research: The authors then used Monte Carlo cross-validation and permutation testing procedure to assess predictive power of the subnetworks on patient overall survival. They used a Cox proportional hazards model with L1 penalized log partial likelihood (LASSO) for feature selection to train the models based on the molecular profile of individual subnetworks. Finally, the prognostic outcomes for the training set were used to determine the regression coefficients. These coefficients were then used in the testing model to predict outcomes for patients in the test set and calculate the concordance index (C-index). Results reveal novel PPI subnetworks with significant prognostic capabilities for a variety of cancer types. The authors further validated their subnetworks by performing prognostic impact evaluation, functional enrichment analysis, drug target annotation, tumour stratification and independent validation. They highlighted distinct pathways in the underlying subnetworks as potential new targets for therapeutic intervention for certain cancer types. This study integrated the protein interactome with cancer genomics data, thus allowing for a systemic analysis of the molecular mechanisms that underlie genesis of cancer and provides new directions in personalized cancer therapy [104]. Another recent study [106] adopted an approach that uses an enriched library of single-stranded oligodeoxynucleotides to profile complex biological samples. This method allows for the analysis of systemic native biomolecules. The authors defined their method as Adaptive Dynamic Artificial Poly-ligand Targeting and further utilized it as a diagnostic tool to profile plasma exosome of cancer patients. They achieved high classification accuracy in breast cancer patients by analysing the circulating exosomes in their blood [106]. The online database MelGene is yet another example of successful integration of Systems Bioinformatics approaches in current research for molecular diagnostics. This tool provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma, including random-effects meta-analysis results of all eligible polymorphisms [107]. The MelGene proposed network connections highlight potentially new loci in relation to melanoma risk. Recent studies have shown that interpretation of proteomics data using network-based approaches can offer additional insights into the mechanistic and dynamics of protein assemblies, and hence into the molecular mechanisms underlying the system under study. Moreover, network-based approaches can be used to reconstruct a disease-perturbed cellular network model showing the interactions of identified differentially expressed proteins involved in selected cellular pathways related to the target pathophysiology. For example, Shirasaki et al. [108] have used affinity purification coupled to MS to investigate the proteome profile of Huntington’s disease. In particular, using a monoclonal antibody against huntingtin (Htt), they identified 747 proteins to be complexed with Htt. A systems-level view of Htt interactome was achieved by using WGCNA, which was used to construct weighted links between the Htt co-purifying proteins. Using topological overlap, the data were clustered into eight Htt-interactome modules that were related to distinct functional aspects such as brain region specificity, aging and protein aggregation modulation or Htt functions directly [108]. Moreover, several network-based approaches have been developed that can identify the cellular pathways which are altered under pathophysiological conditions, and can hence enrich biomarker discovery. For example, functional enrichment analysis of GO biological processes or KEGG pathways [37] of differentially expressed proteins can be performed using both free licence tools such as Database for Annotation, Visualization and Integrated Discovery (DAVID) [109], Protein ANalysis THrough Evolutionary Relationships (PANTHER) [110] and Gene Set Enrichment Analysis (GSEA) [34], as well as commercialized tools such as MetaCore [36] and Ingenuity Pathway Analysis. Furthermore, pathway topology approaches have been developed as alternative to enrichment analysis. For example, Signalling Pathway Impact Analysis [111] and Network Perturbation Amplitude [112] deliberate whether the proteins involved in functional modules defined by other databases interact with each other in cellular networks. Various tools are currently available, which can aid the Systems Bioinformatics application in biomarker discovery. GWAB, a recent tool, makes use of systems approaches and computational methods to boost weak association signals for Genome Wide Association Studies (GWAS), a common problem when analysing this type of data. This tool works by incorporating publicly available data in the form of using GWAS summary statistics (p-values) for SNPs along with reference genes for a disease of interest. The authors demonstrated the feasibility of boosting GWAS disease associations using gene networks and further present a web server for GWAB, for the network-based boosting of human GWAS data [113]. Other tools like GeneMANIA [114] and PINTA [115] allow for gene prioritization and gene function prediction and can greatly aid in computational diagnostics. Systems Bioinformatics applications in drug discovery Systems Bioinformatics contributes in computational therapeutics by providing tools and algorithms for novel drug discovery. Research in this direction is often done in close collaboration with pharmaceutical companies. One of the main challenges faced by both the research community and the industry is the prediction of adverse drug effects, especially during the early stages of drug development. These types of predictions can lead to significant cost reductions by allowing for accurate drug assessment and discontinuation of development for drugs with severe adverse effects. The use of human genetic variation has been known to play an important role in drug response [116]; however, the effect of this factor alone is not sufficient to provide a complete perspective on the matter in hand. Systems pharmacology is a term that is widely used today in many high-calibre, recently published studies [117–119]. Systems pharmacology is a systems biology approach, which focuses on enhancing the understanding of drugs function in the human body at a systems’ level, described by several types of networks, rather than looking at the effect of single molecular components. It shifts away from traditional practice, which considers the effects of a drug with respect to its target protein and instead strives to address the effects of the drug by considering a network of drug–target interactions. Systems Bioinformatics is a precious field in the neighbourhood of systems pharmacology that provides important methods and tools for multi-source and multilevel integration of the omics spectrum with drug networks shedding light in the area of modern drug discovery. A recent area of great interest where Systems Bioinformatics can be of substantial impact and value is the area of drug repurposing or repositioning [120]. This entails the use of Food and Drug Administration (FDA)-approved drugs to treat new diseases, which are different from the ones they were initially designed for. This allows for obvious shortcuts for pharmaceutical companies allowing them to by-pass the timely and costly process of FDA approval for novel drugs. Recent studies used gene expression data derived from microarrays or RNA-Seq data to obtain specific expression profiles for specific diseases of interest. By comparing these to collections of data sets from repositories such as CMap [121], Drugmap Central and more advanced versions like LINCS and the recent Drug Repurposing Hub [121, 122] allows for alternative drugs to be proposed for the treatment of diseases under investigation. In a recent study [38], this approach was used to devise drug/target networks obtained from algorithms of mutual information and co-expression networks aiming to gain insights into the treatment of breast cancer subtypes. Data collection: TCGA mRNA (microarray) gene expression data for Breast Invasive Carcinoma cases were obtained from Firehose (http://gdac.broadinstitute.org/). From a total of 587 samples (526 primary solid tumour samples and 61 primary solid normal samples—17.814 genes), the authors selected a subset of tumour data containing information regarding breast cancer staging, HER2, ER and PR status with their corresponding normal samples as well as breast cancer stages I, II, III and IV. Network construction: The authors examined three major categories of statistical network inference methods: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. They further utilized Biological information-based network methods and one ensemble scheme using all statistical network inference methods. They used the Cytoscape platform and more specifically the GeneMania plug-in to reconstruct the biological information-based gene network. This plug-in uses a large data set unifying functional networks comprising approximately 800 networks for six organisms including Homo sapiens. Using the H.sapiens network they constructed a sub-network for the top 1000 differentially expressed genes (DEGs) from the TCGA data set merging five Network types: Co-expression, Physical Interaction, Genetic interaction, Co-localization and Pathways. Network analysis: The authors further performed gene re-ranking using the underlying networks. To investigate the influence of the reconstructed 17 gene networks (12 statistically and 5 biologically inferred) on gene prioritization, they applied a method that allows for a custom network selection combining the log fold change absolute values with the selected underlying network topology to re-rank the initial DEGs. The basic idea of the method is the reconciliation of the gene expression values taking into account the underlying gene network topological features such as degree and betweenness. The network patterns were further analysed to investigate their exclusive contribution with respect to breast cancer subtypes and stages. The authors then performed drug repurposing using the up- and down-regulated genes forming disease signatures by querying them in a well-established drug repurposing pipeline, namely, LINCS-L1000 (http://www.lincscloud.org/), an advanced version of CMap. In summary, the authors obtained 63 unique drugs for the breast cancer stages and 58 for the breast cancer subtypes. To further examine the resulting drugs, the authors constructed super networks by combining top drugs extracted from their analysis with the FDA-approved breast cancer drugs, connecting them with their target genes and superimposing these on the gene expression networks. Findings and significance of research: The authors performed an analysis that concluded to eight network patterns, four for the stages (I, II, III and IV) and four for the subtypes (Triple Negative, Luminal A, Luminal B and HER2). These patterns were shown to highlight four exclusive stage-related pathways including phenylalanine metabolism for Stage II, peroxisome proliferator-activated signalling pathway and glycolysis and gluconeogenesis for Stage III and toll-like receptor signalling pathway for Stage IV. Finally, the authors performed drug repurposing to elucidate potential anti-breast-cancer properties for known drugs and they compared the molecular structure for their predicted re-purposed drugs against 25 FDA-approved drugs of clinical use. Two out of these 25 drugs (Gemcitabine and Palbociclib) were also found as repurposed drugs by the authors. In Stage I, two repurposed drugs, Clofarabine and Kinetin-riboside, were found to be structurally similar to Gemcitabine. Clofarabine seems to have potential efficacy in epigenetic therapy of solid tumours, especially at early stages of carcinogenesis. Another recent line of work [64] performed network-based in silico drug efficacy screening by exploiting network-based approaches. The authors investigated the association between drug targets and diseases, presenting a drug–disease proximity measure [64]. Data collection: The authors used 1489 diseases defined by Medical Subject Headings (MeSH) compiled in a recent study [90]. For each disease, the disease–gene associations were collected from OMIM and GWAS catalogue. For each disease, the authors extracted information on FDA-approved drugs from DrugBank and matched 79 of these diseases with at least one drug using tools like MEDI-HPS and Metab2Mesh resulting in 238 unique drugs and 384 targets. The authors took information published by [90] that contained experimentally documented human protein physical interactions from TRANSFAC, IntAct, MINT, BioGRID, HPRD, KEGG, BIGG, CORUM, PhosphoSitePlus and a large-scale signalling network. Network construction: The human PPI network was compiled using information extracted from the databases described above, to generate an elaborate human interactome. The largest connected component of this interactome was consequently used in their analysis, consisting of 141 150 interactions between 13 329 proteins. Entrez Gene IDs were used to map disease-associated genes to the corresponding proteins in the interactome. Network analysis: The proximity between a disease and a drug was evaluated using various distance measures that take into account the path lengths between drug targets and disease proteins. The authors focused on two types of network-based proximity relationships between drugs and disease proteins: (i) the most straightforward measure is the average shortest path length between all targets of a drug and the proteins involved in the same disease; (ii) the second proximity is the closest measure, representing the average shortest path length between the drug’s targets and the nearest disease protein. Findings and significance of research: The authors validated their approach and optimized their proximity thresholds by assessing how well relative proximity discriminates 402 known drug–disease pairs from the 18 162 unknown drug–disease pairs by comparing the area under Receiver Operating Characteristic curve for different distance measures. Based on these results the authors showed that network proximity delineates therapeutic effects of a drug. This approach of utilizing network proximity in the interactome for drug targets and diseases, allowed for increased understanding in the therapeutic effect of drugs. They made use of cases from Parkinson’s disease and several inflammatory disorders to further substantiate findings. This approach can potentially have significant applications in drug discovery, drug repurposing and assessment of drug adverse effects. Another study [123] led to the development of a current state-of-the-art tool that addresses computational therapeutics from a network perspective, the TCM-Mesh system. This tool allows for the high-throughput network pharmacology analysis for Traditional Chinese Medicine (TCM) [123]. Data set collection: TCM utilizes data curated from collections of 6235 herbs, 383 840 compounds, 14 298 genes, 6204 diseases, 144 723 gene–disease associations, 3 440 231 pairs of gene interactions, 163 221 side-effect records and 71 toxic records (data as of April 2017). The information for traditional Chinese herbs and traditional Chinese medicine preparation was extracted from TCM Database@Taiwan, TCMID, information of compounds and their targets; diseases and their related proteins were obtained from STITCH and OMIM, respectively; the protein interactions were obtained from STRING; the toxic and side-effect records of compounds were derived from TOXNET and SIDER. Network construction: The authors used Cytoscape as well as a web-based software to facilitate visualization of a compound–gene–disease network construction between TCM and treated diseases. Network analysis: The authors based their network analysis and scored their compounds using the combined score as defined and obtained from the STITCH database. This score represents the strength of the links between the compounds and their associated proteins. Findings and significance of research: The authors used 1293 FDA-approved drugs, as well as compounds from a herbal material Panax ginseng and a patented drug Liuwei Dihuang Wan for evaluating their database. By comparison of different databases, as well as checking against literature, they demonstrated the completeness, effectiveness and accuracy of the TCM-Mesh database and further aided in increased understanding of the molecular mechanisms of TCM action. Various tools are currently available, which can aid the Systems Bioinformatics application in drug discovery. For example, tools like Substructure-Drug-Target Network-Based Inference SDTNBI [124], C(2) Maps [125], Chem2Bio2RDF [126] and PROMISCUOUS [127] cumulatively provide integrated systems and pharmacology databases for chemoinformatics analysis, drug-target prediction, networks of disease–gene–drug connectivity relationships as well as drug repositioning analysis. For a full list of tools and databases adopting or supporting Systems Bioinformatics methodologies, see Table 1. A more comprehensive list of related tools and databases going back to 2010 can be found in Supplementary Table S1. Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling Discussion The concept of utilizing networks to visualize the complex interaction of mechanisms implicated in disease has been around for several years. However, two important breakthroughs separate previous network-based approaches and are currently driving the state-of-the-art in Systems Bioinformatics: (i) construction of multiple networks representing each level of the omics spectrum and the integration of these in a layered network that exchanges information within and between layers [62, 67, 97] and (ii) the advent of novel techniques and methodologies for analysing and understanding these networks using mathematical algorithms and approaches derived from graph theory and information theory [6]. Using these methods for extracting biologically meaningful information from multiple levels of the omics spectrum can provide the integrated systemic knowledge for the development of a comprehensive Human System profile, which increases diagnostic accuracy and concurrently allows for novel therapeutic advances and assess response to therapy. Nevertheless, the network-based approaches, either for evidence-based or for statistically inferred molecular networks, have a number of limitations. Specifically, networks based on experimental evidence are not complete as experiments are only a snapshot of the real biological world. Moreover, statistically inferred networks represent an undetermined computational problem because the number of the inferred relationships is much larger than the number of the independent measurements [196]. Owing to the lack of sufficient ground truth to validate the reconstructed molecular networks, special attention must be given when choosing benchmark data sets (e.g. existing curated databases with experimental information, medium-throughput experimental information and simulated data sets that mimic real data). To this extent, the DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative facilitates researchers from the Systems Bioinformatics field to assess the validity of the networks they are using and proceed with optimization and parameter-tuning regarding network reconstruction [197]. A subsequent limitation of the network-based approaches is the low overlap that the various network reconstruction approaches have, and the inadequacy in selecting the proper network each time. It is likely that computational approaches checking and exploiting complementarities and providing ensemble solutions of a network construction consensus will maximize the information content. In our opinion, Systems Bioinformatics might currently appear rather aspirational, yet, considering its potential it is likely to have a major impact on medicine and pharmacology in the next decade. The field of medicine is expected to benefit from the invaluable knowledge attained from Systems Bioinformatics methodologies. The molecular basis of complex, polygenic diseases is highly heterogeneous and affected by multiple factors simultaneously. These include genetic predisposition, multipart molecular mechanisms and effects of the environment, diet, drug administration/response and numerous other factors. Although it might not be possible to replace the use of traditional approaches for therapeutics and diagnosis with computational methods, yet, it is likely that Systems Bioinformatics will provide revolutionary approaches and tools to clinicians in order to demystify the complex nature of these diseases. Computational diagnostics and therapeutics, enhanced by Systems Bioinformatics approaches, will not only aid clinicians in patient consultation and care but will also catalyse significant breakthroughs in prognostic measures, detection of disease at an early onset and overall disease prevention.
Title	Systems bioinformatics applications
Section	Systems bioinformatics applications in biomarker discovery The use of networks in computational diagnostics via the detection of molecular biomarkers is one of the hallmarks of Systems Bioinformatics. Numerous recent state-of-the-art studies have made use of such networks to characterize cellular systems by simultaneously analysing thousands of genes, proteins, isoforms and complexes to address issues of computational diagnostics. Here, we highlight a few studies, showcasing the essence of networks’ contribution in precision diagnostics. A recent study [62] used a machine learning approach, which integrates topological features from PPI networks, to identify candidate AD-associated genes. Data set collection: Positive and negative data sets were collected from Entrez Gene database at the National Centre for Biotechnology Information (NCBI). The positive data set consisted of 458 genes known to be associated with AD. The negative data set consisted of the additional 55 947 Entrez genes, excluding the AD-associated genes. Network construction: Human PPI data sets were extracted from a variety of sources including Online Predicted Human Interaction Database (OPID), STRING, MINT, BIND and InTAct databases. Network analysis: By utilizing the PPI networks, the authors extracted topological features for the AD- and non-AD-associated genes. These features included nine topological properties of the PPI network for each gene, namely, the average shortest path length, betweenness centrality, closeness centrality, clustering coefficient, degree, eccentricity, neighbourhood connectivity, topological coefficient and radiality. Findings and significance of research: The authors further combined sequence features and functional annotations features and concurrently performed feature selection using seven methods including gain-ratio-based attribute evaluation, oneR algorithm, chi-square-based selection, correlation-based selection, information gain-based attribute evaluation and relief-based selection. The most important features were fed into 11 machine learning algorithms to generate classifiers using the training data set capable of predicting AD- and non-AD-associated genes using the selected network, sequence and functional features. Methods included Naive Bayes (NB), NB Tree, Bayes Net, Decision table/NB hybrid classifier, Random Forest, J48, Functional Tree, Locally Weighted Learning (J48 + k-nearest neighbour), Logistic Regression and Support Vector Machine. Training of sophisticated machine learning classifiers using systemic properties can be a key feature in generating personalized medicine diagnostic approaches. The authors finally combined diagnostics with therapeutics by screening 45 known anti-Alzheimer drugs from DrugBank against novel predicted probable AD targets, obtained from their trained classifiers, using molecular docking. They further proposed a novel candidate untried drug, AL-108, with high affinity to potential therapeutic targets. Additional tools were also used to validate preliminary findings, including molecular dynamics simulations and MM/GBSA calculations on the docked complexes [62]. Another interesting study [104] used data from TCGA [105] to successfully construct a multidimensional subnetwork atlas for cancer prognosis. The authors addressed how multiple genetic and epigenetic factors (i.e. gene expression, copy number variation, miRNA expression and DNA methylation) affect molecular states of networks and patient survival. Dat aset collection: The multidimensional cancer-associated data sets for 1027 patients for four cancer types were collected from TCGA Cancer Browser (https://genome-can cer.ucsc.edu/proj/site/hgHeatmap/). They contained clinical information, copy-number variation, promoter DNA methylation, mRNA-gene and miRNA expression data. They furthermore extracted PPI data from HPRD for network construction. To enrich these networks with additional miRNA-regulatory information the authors extracted miRNA and target gene information from two miRNA target databases [miRTarBase (Release 4.5) and TarBase v6], which provide experimentally validated miRNA–target interactions. Network construction: PPI interaction network was constructed using data collected from HPRD. Network analysis: The authors fitted a univariate Cox proportional hazards model between each molecular feature and patient survival time and thus scored each gene based on its significance to predict survival. Genes with a positive score were considered as survival-related genes. They next used this score (heat score) as the input into HotNet2, which uses a heat diffusion process and a statistical test-based algorithm to discover subnetwork signatures in the PPI network. Through this way subnetwork signatures of survival-related genes were determined both by the scores of their genes as well as gene topology in the PPI network. Findings and significance of research: The authors then used Monte Carlo cross-validation and permutation testing procedure to assess predictive power of the subnetworks on patient overall survival. They used a Cox proportional hazards model with L1 penalized log partial likelihood (LASSO) for feature selection to train the models based on the molecular profile of individual subnetworks. Finally, the prognostic outcomes for the training set were used to determine the regression coefficients. These coefficients were then used in the testing model to predict outcomes for patients in the test set and calculate the concordance index (C-index). Results reveal novel PPI subnetworks with significant prognostic capabilities for a variety of cancer types. The authors further validated their subnetworks by performing prognostic impact evaluation, functional enrichment analysis, drug target annotation, tumour stratification and independent validation. They highlighted distinct pathways in the underlying subnetworks as potential new targets for therapeutic intervention for certain cancer types. This study integrated the protein interactome with cancer genomics data, thus allowing for a systemic analysis of the molecular mechanisms that underlie genesis of cancer and provides new directions in personalized cancer therapy [104]. Another recent study [106] adopted an approach that uses an enriched library of single-stranded oligodeoxynucleotides to profile complex biological samples. This method allows for the analysis of systemic native biomolecules. The authors defined their method as Adaptive Dynamic Artificial Poly-ligand Targeting and further utilized it as a diagnostic tool to profile plasma exosome of cancer patients. They achieved high classification accuracy in breast cancer patients by analysing the circulating exosomes in their blood [106]. The online database MelGene is yet another example of successful integration of Systems Bioinformatics approaches in current research for molecular diagnostics. This tool provides a comprehensive, regularly updated collection of data from genetic association studies in cutaneous melanoma, including random-effects meta-analysis results of all eligible polymorphisms [107]. The MelGene proposed network connections highlight potentially new loci in relation to melanoma risk. Recent studies have shown that interpretation of proteomics data using network-based approaches can offer additional insights into the mechanistic and dynamics of protein assemblies, and hence into the molecular mechanisms underlying the system under study. Moreover, network-based approaches can be used to reconstruct a disease-perturbed cellular network model showing the interactions of identified differentially expressed proteins involved in selected cellular pathways related to the target pathophysiology. For example, Shirasaki et al. [108] have used affinity purification coupled to MS to investigate the proteome profile of Huntington’s disease. In particular, using a monoclonal antibody against huntingtin (Htt), they identified 747 proteins to be complexed with Htt. A systems-level view of Htt interactome was achieved by using WGCNA, which was used to construct weighted links between the Htt co-purifying proteins. Using topological overlap, the data were clustered into eight Htt-interactome modules that were related to distinct functional aspects such as brain region specificity, aging and protein aggregation modulation or Htt functions directly [108]. Moreover, several network-based approaches have been developed that can identify the cellular pathways which are altered under pathophysiological conditions, and can hence enrich biomarker discovery. For example, functional enrichment analysis of GO biological processes or KEGG pathways [37] of differentially expressed proteins can be performed using both free licence tools such as Database for Annotation, Visualization and Integrated Discovery (DAVID) [109], Protein ANalysis THrough Evolutionary Relationships (PANTHER) [110] and Gene Set Enrichment Analysis (GSEA) [34], as well as commercialized tools such as MetaCore [36] and Ingenuity Pathway Analysis. Furthermore, pathway topology approaches have been developed as alternative to enrichment analysis. For example, Signalling Pathway Impact Analysis [111] and Network Perturbation Amplitude [112] deliberate whether the proteins involved in functional modules defined by other databases interact with each other in cellular networks. Various tools are currently available, which can aid the Systems Bioinformatics application in biomarker discovery. GWAB, a recent tool, makes use of systems approaches and computational methods to boost weak association signals for Genome Wide Association Studies (GWAS), a common problem when analysing this type of data. This tool works by incorporating publicly available data in the form of using GWAS summary statistics (p-values) for SNPs along with reference genes for a disease of interest. The authors demonstrated the feasibility of boosting GWAS disease associations using gene networks and further present a web server for GWAB, for the network-based boosting of human GWAS data [113]. Other tools like GeneMANIA [114] and PINTA [115] allow for gene prioritization and gene function prediction and can greatly aid in computational diagnostics.
Title	Systems bioinformatics applications in biomarker discovery
Section	Systems Bioinformatics applications in drug discovery Systems Bioinformatics contributes in computational therapeutics by providing tools and algorithms for novel drug discovery. Research in this direction is often done in close collaboration with pharmaceutical companies. One of the main challenges faced by both the research community and the industry is the prediction of adverse drug effects, especially during the early stages of drug development. These types of predictions can lead to significant cost reductions by allowing for accurate drug assessment and discontinuation of development for drugs with severe adverse effects. The use of human genetic variation has been known to play an important role in drug response [116]; however, the effect of this factor alone is not sufficient to provide a complete perspective on the matter in hand. Systems pharmacology is a term that is widely used today in many high-calibre, recently published studies [117–119]. Systems pharmacology is a systems biology approach, which focuses on enhancing the understanding of drugs function in the human body at a systems’ level, described by several types of networks, rather than looking at the effect of single molecular components. It shifts away from traditional practice, which considers the effects of a drug with respect to its target protein and instead strives to address the effects of the drug by considering a network of drug–target interactions. Systems Bioinformatics is a precious field in the neighbourhood of systems pharmacology that provides important methods and tools for multi-source and multilevel integration of the omics spectrum with drug networks shedding light in the area of modern drug discovery. A recent area of great interest where Systems Bioinformatics can be of substantial impact and value is the area of drug repurposing or repositioning [120]. This entails the use of Food and Drug Administration (FDA)-approved drugs to treat new diseases, which are different from the ones they were initially designed for. This allows for obvious shortcuts for pharmaceutical companies allowing them to by-pass the timely and costly process of FDA approval for novel drugs. Recent studies used gene expression data derived from microarrays or RNA-Seq data to obtain specific expression profiles for specific diseases of interest. By comparing these to collections of data sets from repositories such as CMap [121], Drugmap Central and more advanced versions like LINCS and the recent Drug Repurposing Hub [121, 122] allows for alternative drugs to be proposed for the treatment of diseases under investigation. In a recent study [38], this approach was used to devise drug/target networks obtained from algorithms of mutual information and co-expression networks aiming to gain insights into the treatment of breast cancer subtypes. Data collection: TCGA mRNA (microarray) gene expression data for Breast Invasive Carcinoma cases were obtained from Firehose (http://gdac.broadinstitute.org/). From a total of 587 samples (526 primary solid tumour samples and 61 primary solid normal samples—17.814 genes), the authors selected a subset of tumour data containing information regarding breast cancer staging, HER2, ER and PR status with their corresponding normal samples as well as breast cancer stages I, II, III and IV. Network construction: The authors examined three major categories of statistical network inference methods: (i) mutual information-based methods, (ii) correlation-based methods and (iii) tree-based methods. They further utilized Biological information-based network methods and one ensemble scheme using all statistical network inference methods. They used the Cytoscape platform and more specifically the GeneMania plug-in to reconstruct the biological information-based gene network. This plug-in uses a large data set unifying functional networks comprising approximately 800 networks for six organisms including Homo sapiens. Using the H.sapiens network they constructed a sub-network for the top 1000 differentially expressed genes (DEGs) from the TCGA data set merging five Network types: Co-expression, Physical Interaction, Genetic interaction, Co-localization and Pathways. Network analysis: The authors further performed gene re-ranking using the underlying networks. To investigate the influence of the reconstructed 17 gene networks (12 statistically and 5 biologically inferred) on gene prioritization, they applied a method that allows for a custom network selection combining the log fold change absolute values with the selected underlying network topology to re-rank the initial DEGs. The basic idea of the method is the reconciliation of the gene expression values taking into account the underlying gene network topological features such as degree and betweenness. The network patterns were further analysed to investigate their exclusive contribution with respect to breast cancer subtypes and stages. The authors then performed drug repurposing using the up- and down-regulated genes forming disease signatures by querying them in a well-established drug repurposing pipeline, namely, LINCS-L1000 (http://www.lincscloud.org/), an advanced version of CMap. In summary, the authors obtained 63 unique drugs for the breast cancer stages and 58 for the breast cancer subtypes. To further examine the resulting drugs, the authors constructed super networks by combining top drugs extracted from their analysis with the FDA-approved breast cancer drugs, connecting them with their target genes and superimposing these on the gene expression networks. Findings and significance of research: The authors performed an analysis that concluded to eight network patterns, four for the stages (I, II, III and IV) and four for the subtypes (Triple Negative, Luminal A, Luminal B and HER2). These patterns were shown to highlight four exclusive stage-related pathways including phenylalanine metabolism for Stage II, peroxisome proliferator-activated signalling pathway and glycolysis and gluconeogenesis for Stage III and toll-like receptor signalling pathway for Stage IV. Finally, the authors performed drug repurposing to elucidate potential anti-breast-cancer properties for known drugs and they compared the molecular structure for their predicted re-purposed drugs against 25 FDA-approved drugs of clinical use. Two out of these 25 drugs (Gemcitabine and Palbociclib) were also found as repurposed drugs by the authors. In Stage I, two repurposed drugs, Clofarabine and Kinetin-riboside, were found to be structurally similar to Gemcitabine. Clofarabine seems to have potential efficacy in epigenetic therapy of solid tumours, especially at early stages of carcinogenesis. Another recent line of work [64] performed network-based in silico drug efficacy screening by exploiting network-based approaches. The authors investigated the association between drug targets and diseases, presenting a drug–disease proximity measure [64]. Data collection: The authors used 1489 diseases defined by Medical Subject Headings (MeSH) compiled in a recent study [90]. For each disease, the disease–gene associations were collected from OMIM and GWAS catalogue. For each disease, the authors extracted information on FDA-approved drugs from DrugBank and matched 79 of these diseases with at least one drug using tools like MEDI-HPS and Metab2Mesh resulting in 238 unique drugs and 384 targets. The authors took information published by [90] that contained experimentally documented human protein physical interactions from TRANSFAC, IntAct, MINT, BioGRID, HPRD, KEGG, BIGG, CORUM, PhosphoSitePlus and a large-scale signalling network. Network construction: The human PPI network was compiled using information extracted from the databases described above, to generate an elaborate human interactome. The largest connected component of this interactome was consequently used in their analysis, consisting of 141 150 interactions between 13 329 proteins. Entrez Gene IDs were used to map disease-associated genes to the corresponding proteins in the interactome. Network analysis: The proximity between a disease and a drug was evaluated using various distance measures that take into account the path lengths between drug targets and disease proteins. The authors focused on two types of network-based proximity relationships between drugs and disease proteins: (i) the most straightforward measure is the average shortest path length between all targets of a drug and the proteins involved in the same disease; (ii) the second proximity is the closest measure, representing the average shortest path length between the drug’s targets and the nearest disease protein. Findings and significance of research: The authors validated their approach and optimized their proximity thresholds by assessing how well relative proximity discriminates 402 known drug–disease pairs from the 18 162 unknown drug–disease pairs by comparing the area under Receiver Operating Characteristic curve for different distance measures. Based on these results the authors showed that network proximity delineates therapeutic effects of a drug. This approach of utilizing network proximity in the interactome for drug targets and diseases, allowed for increased understanding in the therapeutic effect of drugs. They made use of cases from Parkinson’s disease and several inflammatory disorders to further substantiate findings. This approach can potentially have significant applications in drug discovery, drug repurposing and assessment of drug adverse effects. Another study [123] led to the development of a current state-of-the-art tool that addresses computational therapeutics from a network perspective, the TCM-Mesh system. This tool allows for the high-throughput network pharmacology analysis for Traditional Chinese Medicine (TCM) [123]. Data set collection: TCM utilizes data curated from collections of 6235 herbs, 383 840 compounds, 14 298 genes, 6204 diseases, 144 723 gene–disease associations, 3 440 231 pairs of gene interactions, 163 221 side-effect records and 71 toxic records (data as of April 2017). The information for traditional Chinese herbs and traditional Chinese medicine preparation was extracted from TCM Database@Taiwan, TCMID, information of compounds and their targets; diseases and their related proteins were obtained from STITCH and OMIM, respectively; the protein interactions were obtained from STRING; the toxic and side-effect records of compounds were derived from TOXNET and SIDER. Network construction: The authors used Cytoscape as well as a web-based software to facilitate visualization of a compound–gene–disease network construction between TCM and treated diseases. Network analysis: The authors based their network analysis and scored their compounds using the combined score as defined and obtained from the STITCH database. This score represents the strength of the links between the compounds and their associated proteins. Findings and significance of research: The authors used 1293 FDA-approved drugs, as well as compounds from a herbal material Panax ginseng and a patented drug Liuwei Dihuang Wan for evaluating their database. By comparison of different databases, as well as checking against literature, they demonstrated the completeness, effectiveness and accuracy of the TCM-Mesh database and further aided in increased understanding of the molecular mechanisms of TCM action. Various tools are currently available, which can aid the Systems Bioinformatics application in drug discovery. For example, tools like Substructure-Drug-Target Network-Based Inference SDTNBI [124], C(2) Maps [125], Chem2Bio2RDF [126] and PROMISCUOUS [127] cumulatively provide integrated systems and pharmacology databases for chemoinformatics analysis, drug-target prediction, networks of disease–gene–drug connectivity relationships as well as drug repositioning analysis. For a full list of tools and databases adopting or supporting Systems Bioinformatics methodologies, see Table 1. A more comprehensive list of related tools and databases going back to 2010 can be found in Supplementary Table S1. Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling D
Title	Systems Bioinformatics applications in drug discovery
Table caption	Table 1 Tools and databases for systems bioinformatics approaches in therapeutics, diagnostics, network visualization/analysis, integration and systems modelling
Section	Discussion The concept of utilizing networks to visualize the complex interaction of mechanisms implicated in disease has been around for several years. However, two important breakthroughs separate previous network-based approaches and are currently driving the state-of-the-art in Systems Bioinformatics: (i) construction of multiple networks representing each level of the omics spectrum and the integration of these in a layered network that exchanges information within and between layers [62, 67, 97] and (ii) the advent of novel techniques and methodologies for analysing and understanding these networks using mathematical algorithms and approaches derived from graph theory and information theory [6]. Using these methods for extracting biologically meaningful information from multiple levels of the omics spectrum can provide the integrated systemic knowledge for the development of a comprehensive Human System profile, which increases diagnostic accuracy and concurrently allows for novel therapeutic advances and assess response to therapy. Nevertheless, the network-based approaches, either for evidence-based or for statistically inferred molecular networks, have a number of limitations. Specifically, networks based on experimental evidence are not complete as experiments are only a snapshot of the real biological world. Moreover, statistically inferred networks represent an undetermined computational problem because the number of the inferred relationships is much larger than the number of the independent measurements [196]. Owing to the lack of sufficient ground truth to validate the reconstructed molecular networks, special attention must be given when choosing benchmark data sets (e.g. existing curated databases with experimental information, medium-throughput experimental information and simulated data sets that mimic real data). To this extent, the DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative facilitates researchers from the Systems Bioinformatics field to assess the validity of the networks they are using and proceed with optimization and parameter-tuning regarding network reconstruction [197]. A subsequent limitation of the network-based approaches is the low overlap that the various network reconstruction approaches have, and the inadequacy in selecting the proper network each time. It is likely that computational approaches checking and exploiting complementarities and providing ensemble solutions of a network construction consensus will maximize the information content. In our opinion, Systems Bioinformatics might currently appear rather aspirational, yet, considering its potential it is likely to have a major impact on medicine and pharmacology in the next decade. The field of medicine is expected to benefit from the invaluable knowledge attained from Systems Bioinformatics methodologies. The molecular basis of complex, polygenic diseases is highly heterogeneous and affected by multiple factors simultaneously. These include genetic predisposition, multipart molecular mechanisms and effects of the environment, diet, drug administration/response and numerous other factors. Although it might not be possible to replace the use of traditional approaches for therapeutics and diagnosis with computational methods, yet, it is likely that Systems Bioinformatics will provide revolutionary approaches and tools to clinicians in order to demystify the complex nature of these diseases. Computational diagnostics and therapeutics, enhanced by Systems Bioinformatics approaches, will not only aid clinicians in patient consultation and care but will also catalyse significant breakthroughs in prognostic measures, detection of disease at an early onset and overall disease prevention.
Title	Discussion
Section	Supplementary Material bbx151_Supp Click here for additional data file.
Title	Supplementary Material

Annnotations

blinded

PMC:6585387 JSONTXT 3 Projects

Document structure show

Annnotations

PMC:6585387 JSON TXT 3 Projects