
New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data

Abstract

Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology—type data exploration, including both generating new biological hypotheses and testing and validating existing ones. Novel aspects of the algorithm center on increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. As such, software scalability and usability on less-than-exotic computer hardware are a priority, as is the applicability of the algorithm and software to heterogeneous datasets containing many data types—single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, phenotypes, etc.

1. Background

Continuous progress in the development of high-throughput genotyping and sequencing technologies has led to an information overload problem that is likely to be exacerbated as the tools become even more accessible. Translated into the data analysis vernacular, the challenge is essentially threefold: (1) increasing the scalability of the data analysis methodology to accommodate large-scale datasets, (2) incorporating many heterogeneous data types into the analysis framework, and (3) accounting for, and interpreting, nonadditive interactions between the variables (genetic and otherwise). These three components are closely interrelated, in that trying to build more complex and expressive models (containing many data types and allowing for high-order variable interactions) necessarily and severely compromises scalability.
Indeed, baseline analysis of a typical genetic epidemiology dataset, for example, one million univariate single-nucleotide polymorphism (SNP)–phenotype tests, corrected for multiple comparisons, generated by a genome-wide association study (GWAS), is computationally undemanding, but does not even begin to address the latter two issues, while more sophisticated analysis methods tend to be NP-hard and therefore computationally infeasible for datasets containing large numbers of variables. Many such methods belong to the domain of systems biology data analysis. Their primary goal is to grasp the underlying biological system in its entirety, including the high-order interactions between the variables contained in the data, potentially leading to the generation of new biological hypotheses of varying levels of complexity. Concomitantly, additional challenges arise in (1) presenting the output of the systems biology data analysis method in a format that can be understood and interpreted by a human expert in a specific biomedical research field (e.g., genetic epidemiology of a particular trait/disease in the case of GWAS genotype–phenotype mapping) and (2) guarding against overfitting/overparameterization (learning spurious relationships from datasets of unfavorable dimensionality, that is, the too many variables—too few observations problem). A prototypical systems biology data analysis method is Bayesian network (BN) modeling, in which a graphical model (BN) of the joint probability distribution of the random variables contained in the dataset is reconstructed directly from the data. All observed biological variables and measurements are understood to have a probabilistic nature. Thus, in BNs, nodes correspond to random variables and edges to dependencies. The edges are directed, reflecting (sometimes arbitrary, strictly for the purposes of mathematical convenience) ancestor–descendant relationships between the variables. This directionality is important as it defines a unique representation for the multiplicative partitioning of the joint probability and, subsequently, a direction of inference in the BN once a BN structure (topology) is inferred. The absence of an edge between two variables indicates conditional independence (although this cannot be strictly guaranteed in practice). BN modeling, and network inference in general, has been extensively used in genetics, bioinformatics, and computational biology since the turn of the century (Friedman et al., 2000; Pe'er, 2005; Rodin et al., 2005; Djebbari and Quackenbush, 2008; Rodin et al., 2012; Liu et al., 2014; Lo et al., 2015; Tasaki et al., 2015; Li et al., 2016, to name but a few). A detailed treatment of BN methodology, while outside the scope of this communication, can be found in Pearl (1988), Pearl (2000), Heckerman (1995), and Chickering et al. (2004). We also refer the reader to Rodin et al. (2012) and references therein for interpretation of ancestor–descendant relationships, causality, directionality, and BN validation in the genetic epidemiology context (defined by datasets containing many SNP variables and phenotype and epidemiological variables, among others).
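For reference, the multiplicative partitioning of the joint probability mentioned above is the standard BN chain-rule factorization over variables x1,…,xN with parent (ancestor) sets πi:

$$P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid \pi_i)$$

so that each local conditional distribution P(xi | πi) can be estimated and scored separately once the network structure is fixed.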
Briefly, the advantages of BN modeling over simpler, less expressive, data analysis methods are (1) the ability to incorporate many different data types into the analysis, (2) the ability to account for high-order variable interactions (e.g., epistatic and gene–environment), and (3) an output (a graphical network model) that is easily understood and interpreted by a human expert. In addition, unlike certain other methods operating in Euclidean space, the BN approach is context independent and has a number of attractive theoretical properties allowing mixing of different data types in a theoretically sound probabilistic framework. Moreover, due to the natural sparseness of biological systems (i.e., each node in a network being directly connected to a limited number of other nodes), the resulting BN models are relatively easy to compartmentalize, which augurs well for both reconstruction and visualization scalability. Thus, BNs are an excellent biological modeling and hypothesis generation tool. Three major practical difficulties associated with BN modeling in our research context are (1) limited scalability, at least compared with the more simplistic analyses, (2) the absence of readily available software aimed primarily at biomedical data that can take advantage of known genetic data type structures and various omics formats, and (3) interpretation, validation (e.g., using statistical resampling), and visualization of larger BNs. A specific, but persistent, challenge, on both theoretical and practical levels, is combining continuous and discrete variables into a comprehensive hybrid probability model within the same BN.

2. Existing Algorithms and Software Packages

Two useful BN reconstruction software lists can be found at http://www.cs.ubc.ca/~murphyk/Software/bnsoft.html and http://www.kdnuggets.com/software/bayesian.html. These include both commercial and free general-purpose BN modeling software and BN-based classifiers. A more selective list of free packages, compared in the bioinformatics application context, is compiled in Paluszewski and Hamelryck (2010), with a special emphasis on dynamic BNs (DBNs). pwOmics (Wachter and Beisbarth, 2015) is the most recent implementation of DBN modeling in the omics data context. Another recent BN reconstruction algorithm (Jiang et al., 2010a, 2010b; Jiang and Neapolitan, 2012) is of special interest, as the authors attempted to increase the scalability of BN modeling to make it directly usable with large-scale genetic epidemiology datasets. However, it has limitations related to the constraints imposed on the general BN structure and/or on the variables effectively preselected for BN analysis. This work has been followed up in Neapolitan et al. (2014), Jiang and Neapolitan (2015), and Jiang et al. (2015), but remains subject to these and related limitations. Large-scale genetic epidemiology dataset BN analysis was also pursued in Han et al. (2012), at the cost of specifying a single target variable. BN Webserver (Ziebarth et al., 2013) is a comprehensive biological BN analysis tool, which, among other things, efficiently deals with hybrid models (heterogeneous variables/data types) in a biologically user-friendly manner; unfortunately, its scalability is essentially nonexistent (<20 nodes). Constructing multilevel gene regulatory networks (Guan et al., 2014) aims at ChIP-seq and gene expression data, but suffers from the same major shortcoming (limited scalability).
A more theoretically rigorous approach of effectively reducing a BN to a Markov neighborhood of a variable of interest (Gao and Ji, 2016) is intriguing, but has not yet been deployed in practice. Similarly, a theoretically attractive approach to inferring causality via intervention data (Cho et al., 2016) suffers, once again, from low scalability and limited deployment. In general, theoretical rigor and distributional flexibility on the one hand and scalability on the other tend to be mutually exclusive (see Yin et al., 2015a, for another recent example). As an important aside, when developing BNOmics, complete code transparency was a priority. This makes it much easier to change and augment the BN reconstruction engine (local search/optimization algorithm) on the fly. Therefore, BNOmics is explicitly designed to be sufficiently flexible to incorporate different variations of baseline search algorithms, network scoring functions, and discretization and imputation approaches. As such, the BNOmics engine is ideally suited to be incorporated into a typical comparative simulation study framework. It should be emphasized that, first and foremost, BNOmics is a prototype/proof-of-concept design of a research platform prioritizing simplicity, flexibility, and adaptability to various biomedical data analysis applications rather than an overly complex production-level software package with all imaginable options and extensions.

3. Algorithm and Implementation

BNOmics is realized as a series of Python scripts, including the data formatting and storage facilities, the actual BN reconstruction engine, the output interface, and various optional support routines (data reformatting plug-ins). A Python interpreter with a standard set of modules, as well as additional numerical libraries (numpy), is required to run the software. Help (readme) files and example input data files (see section 4 for the example application) are provided as part of the distribution. The computationally most intensive parts of the BN reconstruction engine are implemented in C++ using the ctypes interface.

3.1. Data storage and input format

The input data file is a plain, flat (variables by observations/individuals) text file in a format similar to the typical comma-delimited spreadsheet export file. Loading from other common file formats, streams, and strings is also supported. Because the basic BN reconstruction algorithm uses a multinomial local probability model, in the baseline implementation, discretization of continuous variables is necessary (but see section 3.2). Optional scripts are available for automated input file generation, including common discretization procedures (equal size bins, equal value ranges, entropy-based discretization, etc.). In the context of genetic epidemiology datasets, most variables are discrete by nature (e.g., SNPs, allelic states); however, one should be careful when discretizing continuous phenotypes or, for example, metabolomic measurements. Therefore, if possible, user-driven manual or semimanual discretization is advised (and can be easily accomplished on the fly within the Python environment—it is precisely this kind of flexibility that led us to choose Python over other languages).
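As a minimal illustration of this kind of preprocessing (a generic sketch, not the BNOmics API; the file layout with variable names in the first column is an assumption), a flat variables-by-observations table can be loaded and a continuous variable discretized into equal-frequency bins as follows:

```python
# Generic preprocessing sketch (not the BNOmics API): load a flat
# variables-by-observations table and discretize a continuous variable
# into equal-frequency (quantile-based) bins coded as 0, 1, 2, ...
import numpy as np

def load_flat_table(path, delimiter=","):
    """Read a comma-delimited file; assumes variable names in the first column."""
    names, rows = [], []
    with open(path) as fh:
        for line in fh:
            fields = line.rstrip("\n").split(delimiter)
            names.append(fields[0])
            rows.append(fields[1:])
    return names, rows

def discretize_equal_frequency(values, n_bins=3):
    """Map continuous values to integer bin codes using quantile cut points."""
    values = np.asarray(values, dtype=float)
    cuts = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, cuts)
```

Entropy-based (e.g., Fayyad–Irani style) cut points, mentioned in section 3.2, would replace the quantile step while keeping the same integer coding.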
Similarly, we advise carrying out user-driven missing value imputation before engaging the BNOmics software—although optional imputation routines (using majority, frequency, and proximity rules) are available, sensible imputation is highly dependent on the specific data type and the quality control procedures implemented during the data generation stage. For example, when analyzing metabolomic data, it is difficult to distinguish between a metabolite measurement missing due to a technical error, a low metabolite concentration, or the actual absence of the metabolite in the sample. Such technical artifacts have to be dealt with manually or semimanually, and with large datasets, the only practical way to do so is to algorithmically parse the data (which, again, is easily achieved by using a Python interpreter as a universal control interface). A more advanced imputation algorithm, based on local probabilistic inference in the immediate network neighborhood of the variable in question (with missing values), is also available, but its properties have not been comprehensively assessed yet and it will be described in detail elsewhere (a potential concern with such approaches is their general susceptibility to overfitting). There is no explicit algorithmic or software limit imposed on the input dataset size (number of variables and observations/individuals); however, data files larger than 2–3 GB are not recommended for a typical workstation (16 GB memory or less) installation without certain modifications (to put this in perspective, a 500K-variable GWAS dataset containing ∼2000 cases/controls fits in just under 1.5 GB). To optimize data storage, retrieval, and memory I/O access in these situations, the actual data (single-type variable values) have to be stored separately from the annotation file, row-wise, following the approach espoused by Nielsen and Mailund (2008) to compensate for disk-bound I/O latency. When dealing with extremely large data files, some form of batch data access may be required, with whole segments of the file loaded directly into RAM (which explains the need for a row-wise data structure). It should be noted that these issues are very architecture and problem specific and are best addressed on a case-by-case basis within the Python interface. Among the implemented speed/scalability improvement measures is the integer-type encoding and representation of the data in memory, which leads to a smaller footprint, faster access, and the ability to directly apply a number of efficient numerical operations. The input dataset entries are coded in 0, 1, 2,… format. Further improvements are a result of an optimized algorithm design (section 3.3). Together, these measures increase computational efficiency by approximately one to two orders of magnitude (depending on the data and compared with a typical BN software implementation) without imposing restrictions on BN structure. It remains to be mentioned that the current limit on the number of samples is flexible and depends on hardware. In a recent application, for example, the analysis of a dataset with ∼10⁷ samples (by 100+ variables) was completed in under 24 hours (results not shown). While the algorithm is, in principle, linear with respect to the number of samples, at some point the number of samples becomes a bottleneck (see section 4.2).
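A minimal sketch of the integer encoding and a simple majority-rule imputation of the kind mentioned above (a generic illustration rather than the BNOmics implementation; the "NA" missing-value marker and the -1 sentinel are assumptions):

```python
# Generic sketch of integer-type encoding (0, 1, 2, ...) and majority-rule
# imputation; the "NA" marker and -1 sentinel for missing values are assumptions.
import numpy as np

def encode_integer(column, missing="NA"):
    """Recode a categorical column into 0, 1, 2, ... codes; missing values become -1."""
    levels = sorted({v for v in column if v != missing})
    lookup = {v: i for i, v in enumerate(levels)}
    return np.array([lookup.get(v, -1) for v in column], dtype=np.int8)

def impute_majority(codes):
    """Replace missing (-1) entries with the most frequent observed code."""
    observed = codes[codes >= 0]
    mode = int(np.bincount(observed).argmax())
    imputed = codes.copy()
    imputed[imputed < 0] = mode
    return imputed
```

The compact int8 representation illustrates why integer encoding reduces the memory footprint and allows fast vectorized counting operations downstream.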
3.2. Continuous and discrete variables

In the baseline implementation, due to the limitations of linear Gaussian (for continuous variables) and hybrid BN local probability models (Friedman et al., 2000; Pe'er, 2005), continuous variables in the dataset have to be discretized. This is customary not just for BN modeling but also for many other data analysis methods. However, there are no commonly agreed upon discretization guidelines or standards in bioinformatics, and existing research tends to be very microarray data-centric (Vass et al., 2011). This paucity can, in large part, be attributed to our limited understanding of the theoretical motivation behind discretization. Indeed, the actual purpose of discretization is twofold: first, to come up with a partitioning (into bins or events) that best reflects characteristic features of the distribution of a continuous variable, and second, to preserve the intervariable relationships (correlation, dependency) and their relative strengths for subsequent data analysis by algorithms or statistical tests developed for discrete variables. While these two goals obviously overlap, our primary interest lies with the latter and not the former. For example, in the context of genetic epidemiology, we should aim to discretize the continuous variables, such as quantitative trait (QT) endpoints and intermediate phenotypes, in a manner that maintains the true relative strengths of SNP–phenotype associations. Consequently, our considerations are motivated by the fact that partitioning is equivalent to selecting only a few events from the sigma algebra of events associated with the distribution in question, and no selection is any more representative than any other when it comes to establishing conditional independence, because independence has to be established across all events. Another way of looking at it is that while any Borel measurable (reasonable and deterministic) partitioning function is independence-preserving, independence of the partitioned variables, sensu stricto, implies nothing with regard to the independence of the original variables. On the other hand, to falsify the independence assumption, it is sufficient to have at least one dependent event, which, however, is not known a priori. Hence, the only reasonably efficient way to approach the problem is from the perspective of falsification of the independence assumption, which is significantly less time-consuming than verifying independence over all possible events. In general, our principal proposition for manual or semimanual discretization is that the simplest discretization method (that satisfies basic common-sense, data type-specific requirements) should be chosen. While partitioning into fewer bins might be perceived as leading to increased information loss, this is, in fact, only an illusion generated by the poor interpretability of often complicated conditional independencies, which depend not so much on the finer features of a distribution as on the change in the conditional distribution over multiple scales (when conditioned on new variables). What matters most is how the bins/events of one variable intersect/interact with the bins/events of another variable or set of variables. Moreover, any concerns about choosing coarser over finer partitioning should be counterbalanced by the corresponding decrease in random noise and natural variation typical of biological measurements.
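A minimal numerical illustration of the point that independence of partitioned variables implies nothing about the original ones (the variables and the two-bin partition below are hypothetical):

```python
# Y is a deterministic function of X (maximal dependence), yet a coarse two-bin
# (sign) partition of X is exactly independent of Y, because Y is symmetric in X.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100_000)
y = x ** 2                              # fully determined by X

x_bin = (x > 0).astype(int)             # two-bin partition of X by sign

# The conditional distribution of Y is the same in both bins, so the partitioned
# X appears independent of Y even though the original X determines Y exactly.
print(y[x_bin == 0].mean(), y[x_bin == 1].mean())        # both ~1/3
print(np.quantile(y[x_bin == 0], [0.25, 0.5, 0.75]))
print(np.quantile(y[x_bin == 1], [0.25, 0.5, 0.75]))
```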
At the same time, it has been observed by us and others that partitioning into a lower number of bins tends to lead to higher edge density in reconstructed BNs (Clarke and Barton, 2000; Rodin et al., 2005), possibly because BN scoring metrics are biased toward fewer variable values. Regardless, coarser binning can be considered as boosting sensitivity and finer binning as boosting specificity. Technically, this is because conditional probability distributions with fewer states carry fewer constraints, which are therefore easier to satisfy when it comes to conditional entropy minimization (or conditional probability maximization). Such moderate overfitting, especially in the Markov locality of the continuous variables (e.g., QT phenotype variables in GWAS datasets), is actually beneficial for the purposes of exploratory data analysis (automated hypothesis generation). Typically, discretizing into two or three bins with entropy-based partitioning points (Fayyad and Irani, 1993), as long as it does not explicitly conflict with the observed nature of the continuous variable, is to be preferred. It is also advantageous from the computational and memory utilization efficiency viewpoint. This is the default option in BNOmics (maximum entropy clustering-based discretization into a minimal number of bins). However, there is also a novel option of treating both continuous and discrete variables simultaneously. It is fully functional, but at this time does not scale up as well as the purely discrete variable design (see section 4.3).

3.3. BN reconstruction algorithm

BN reconstruction is generally a two-stage process, involving model selection (search for the network structure, or topology, that best fits the data) and probabilistic inference propagation given the fixed network structure. The former is NP-hard, and therefore some local or heuristic search algorithm is usually employed (instead of exhaustive search) for any dataset of nontrivial size (starting with ∼20 nodes/variables, exact algorithms and computations become infeasible). The candidate network structures, or models, are evaluated using an objective scoring function (metric). These often incorporate, explicitly or implicitly, a model complexity penalty to prevent overfitting. In addition, an initial search state (network structure prior) has to be specified—selection of an optimal (or biologically meaningful) prior has received much attention (e.g., Friedman et al., 1999; Steele et al., 2009; Keilwagen et al., 2010; Zhang et al., 2014), in part because it may alleviate the scalability problem to a degree. The default BNOmics algorithm is completely agnostic with regard to the BN structure and prior and therefore is entirely data driven. However, restrictions on the network structure (i.e., forbidding or forcing edges between the network nodes) can easily be accommodated if desired. Once the input data file is initialized, the BNOmics algorithm works as follows: given the data D = {x1,…, xN}, where N is the number of variables, we aim to assign these variables to the N nodes of a BN (Fig. 1).

FIG. 1. BN reconstruction algorithm kernel pseudocode (Procedures 1 and 2). BN, Bayesian network.

Let I = {1,…,N} be the index set of nodes and πi be the parent set of the i-th node xi. We also introduce a real-valued objective (scoring) function F(xi, πi) that has two parameters, the node and its ancestor set. By changing the ancestor set, we can evaluate different subnetworks centered around the node.
The simplest (first-order) operator would be the addition of a node xj to the ancestor set πi of the node xi. By varying j over {k ∈ I : xk ∉ πi, k ≠ i}, we can score each configuration sij = F(xi, πi ∪ {xj}). Procedure 1 (Fig. 1) details the process of finding the best single ancestor addition to a node. It returns jmax, the index of the node that (when added as an ancestor) maximizes the scoring function. Thus, it generates the best edge to be added to the network in the immediate Markov neighborhood of xi (in its upper part—the parents of xi), the edge representing a pair of nodes (i, jmax). We call this a first-order search, as the only allowable operator is the addition of a single edge. After the addition of an edge, a search for an optimal edge to remove from the ancestor set is performed. If such an edge is found, it is dropped and the necessary changes are propagated through the structure. This combination of add and drop searches makes it possible to maintain a near-optimal ancestor set (finding an optimal ancestor set is a fundamentally difficult problem, and a principal contributor to the NP-hardness of network reconstruction, and as such is solved only approximately). Similarly, a second-order search allows adding two new edges, evaluating sijk = F(xi, πi ∪ {xj} ∪ {xk}), and so on. Such higher-order searches are obviously more computationally intensive, and the baseline BNOmics implementation uses first-order search only, although a stochastic second-order search has been implemented and tested (second-order search is quadratic, O(n²), in the number of variables, and if ∼n pairs of variables are sampled at random from the (n² − n)/2 total pairs, the second-order search becomes O(n)). It should be noted that first-order search is still capable of discovering higher-order interactions in the general sense as well as in the more specific biological sense (e.g., epistatic interactions between multiple SNPs in a Markov neighborhood of a phenotype node). Moreover, it is possible to arrive at the same ancestor set using different methods, as the only thing that matters from the perspective of the order of the accounted-for interactions is the ancestor set itself.
However, higher-order searches do bring us closer to the exhaustive search ideal (although in practice, they provide only marginal improvement, reflected in finer detail, over other search strategies). Because no network structure (topology) priors are assumed, we start with the empty network; all ancestor sets are initialized to the empty set, πi ← ∅, ∀i ∈ I. Equivalently, we assume independence of all nodes and then try to falsify this assumption by establishing edges. Then, Procedure 1 is applied to each node to obtain a set of N candidate best edges (i, j) with associated scores (first loop in Procedure 2, Fig. 1). The second loop in Procedure 2 (Fig. 1) selects the highest scoring edge and rearranges the current state of the network accordingly (performs bookkeeping). In effect, Procedure 2 reconstructs the network topology by incremental edge addition (forward selection), creating a partial ordering of the nodes along the way. In that respect, it is an elaborate extension of the basic greedy (hill-climbing) local search/optimization algorithm. It remains to be noted that different stopping criteria (in addition to the scoring function improvement-based one shown in Fig. 1) can be used; in practice, the second loop of Procedure 2 can be bounded by a preset iteration limit (i.e., CPU time) or, for smaller networks, the search can continue until full exhaustion (defined by the largest possible increase of the scoring function being commensurate with the machine epsilon/precision). In addition to the default gradient descent with constraints, stochastically perturbed restarts are available as an option (to deal with local minima); these are performed as stochastic relaxation over the structure and subsequent reconvergence until a more optimal structure is found. This algorithm compares favorably with other recent attempts to scale up BN reconstruction (see section 2 above; also Friedman et al., 1999), in that it does not limit the BN structure flexibility by treating the SNPs, gene expression measurements, or other subgroups of variables separately and testing them for pairwise correlation (implying, for example, linkage disequilibrium or first-order epistatic interaction in the case of SNPs) before reconstructing the BN. This said, the majority of BN reconstruction algorithms (Neapolitan, 2004), including the one detailed in this communication, reside somewhere between basic greedy search and exhaustive search (much closer to the former for any sufficiently large dataset), and comprehensive simulation studies are necessary to ascertain their relative performance and robustness within different application domains.
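To make the add/drop search concrete, the following is a simplified, self-contained sketch of the first-order greedy procedure described above. It is illustrative only (not the optimized BNOmics C++/ctypes engine), and the BIC-style multinomial score used here merely stands in for the metrics discussed in the next paragraph; the samples-by-variables integer array layout is an assumption.

```python
# Simplified, illustrative sketch of a first-order add/drop greedy search in the
# spirit of Procedures 1 and 2 (not the optimized BNOmics engine). `data` is an
# integer-coded numpy array with rows = samples and columns = variables.
from itertools import product
from math import log
import numpy as np

def local_score(data, i, parents):
    """BIC-style penalized log-likelihood of node i given its parent set."""
    parents = sorted(parents)
    n = data.shape[0]
    r_i = int(data[:, i].max()) + 1                      # number of states of x_i
    parent_states = [range(int(data[:, p].max()) + 1) for p in parents]
    loglik = 0.0
    for combo in product(*parent_states):                # one empty combo if no parents
        mask = np.ones(n, dtype=bool)
        for p, v in zip(parents, combo):
            mask &= data[:, p] == v
        counts = np.bincount(data[mask, i], minlength=r_i)
        total = counts.sum()
        loglik += sum(c * log(c / total) for c in counts if c > 0)
    q = int(np.prod([len(s) for s in parent_states])) if parents else 1
    return loglik - 0.5 * log(n) * q * (r_i - 1)         # complexity penalty

def creates_cycle(parents, child, new_parent):
    """True if adding edge new_parent -> child would close a directed cycle."""
    stack, seen = [new_parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def best_addition(data, i, parents):
    """Procedure 1 analogue: best single ancestor to add to node i, if any."""
    base = local_score(data, i, parents[i])
    best_j, best_gain = None, 0.0
    for j in range(data.shape[1]):
        if j == i or j in parents[i] or creates_cycle(parents, i, j):
            continue
        gain = local_score(data, i, parents[i] | {j}) - base
        if gain > best_gain:
            best_j, best_gain = j, gain
    return best_j, best_gain

def greedy_search(data):
    """Procedure 2 analogue: incremental edge addition with a drop step."""
    parents = {i: set() for i in range(data.shape[1])}
    while True:
        candidates = [(i,) + best_addition(data, i, parents) for i in parents]
        i, j, gain = max(candidates, key=lambda c: c[2])
        if j is None:                                    # no score improvement left
            break
        parents[i].add(j)
        for p in list(parents[i]):                       # drop step
            if local_score(data, i, parents[i] - {p}) > local_score(data, i, parents[i]):
                parents[i].discard(p)
    return parents                                        # parent (ancestor) sets per node
```

The stochastic restarts, iteration-limit stopping criteria, and the C++-accelerated counting described in the text would replace or augment the naive loops above.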
The default scoring measures used by BNOmics are MDL/BIC, which penalizes explicitly for model complexity (Schwarz, 1978), and K2 (Cooper and Herskovits, 1992), a metric directly proportional to the probability of observing the data given the BN structure (i.e., the marginal likelihood). Alternatives include AIC (Akaike, 1974) and different flavors of MDL/BIC (Rissanen, 1978; Suzuki, 1999). In brief, AIC and BIC are both MDL-type metrics differing in the complexity (penalty) term, while K2 is a BDM (Bayesian Dirichlet Metric), owing to its Dirichlet distributional assumption, which makes it potentially not as robust as the assumption-free MDL. These are realized in BNOmics as C++/ctypes modules. A novel entropy-based scoring function, similar to MDL but more flexible with respect to the model complexity penalty, is also available; it has not been extensively tested yet and will be described in detail elsewhere.

3.4. Output format and visualization

BNOmics outputs reconstructed BNs in the DOT graph description language format. Subsequently, DOT files can be edited manually and visualized using Graphviz (open-source graph visualization software, http://www.graphviz.org). Graphviz can apply numerous network layouts to generate publication-quality illustrations. It is also possible to visualize smaller fragments of a BN (e.g., an immediate Markov neighborhood of a certain radius around a node of interest, such as a phenotype), which is essential when visualizing BNs reconstructed from large datasets. In parallel, this allows Markov set-based classification (serving as both a variable selection routine and a classifier similar to naïve Bayes).

4. Results and Discussion

4.1. Example applications

The three applications below illustrate various aspects of BN analyses using BNOmics, namely, exhaustive BN structure search; scalability and visualization; and combining heterogeneous omics data and data types. The second and third applications relate to the Atherosclerosis Risk in Communities (ARIC) study, a comprehensive epidemiologic study of coronary heart disease (CHD) and its risk factors (ARIC Investigators, 1989), and are detailed strictly from the analysis methodology perspective, with the biological results to be discussed elsewhere.

4.1.1. Variation in the apolipoprotein E gene and plasma lipid and apolipoprotein E levels

This application illustrates exhaustive search in smaller BNs. Two datasets are included with the software distribution (available directly from the authors), both in self-explanatory .csv comma-delimited format and with SNP values recoded in 0, 1, 2,… format. These datasets were generated in a genetic epidemiology study of variation in the apolipoprotein E (APOE) gene and plasma lipid and APOE levels and were described in detail in Nickerson et al. (2000) and Stengård et al. (2002). Briefly, 20 SNPs in the APOE gene were genotyped in 702 African Americans from Jackson, Mississippi, and 854 non-Hispanic whites from Rochester, Minnesota. The datasets also contain plasma lipid and lipoprotein measurements and basic epidemiological variables. Three SNPs (designated #3937, #4075, and #4036 in the datasets) located in the coding region of the gene are associated with various phenotypes of interest. Importantly, SNPs #3937 and #4075 code for the E2, E3, and E4 APOE isoforms. Plasma APOE level was the primary phenotype of interest in the original study. Therefore, the immediate Markov neighborhood of the BN node associated with the APOE plasma level variable received the most scrutiny after the network reconstruction.
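The DOT output described in section 3.4 makes it straightforward to extract and visualize such neighborhoods. The following is a minimal sketch (a hypothetical helper, not part of the BNOmics distribution) of pulling a radius-k neighborhood of a node of interest out of a reconstructed network, represented here simply as a list of directed parent-to-child edges, and writing the subnetwork in DOT format for Graphviz; the node names in the example are hypothetical.

```python
# Minimal sketch (hypothetical helper, not part of the BNOmics distribution) of
# extracting a radius-k neighborhood of a node of interest from a reconstructed
# network (a list of directed parent -> child edges) and writing it out as DOT.
from collections import defaultdict

def neighborhood(edges, node, radius=1):
    """Nodes reachable from `node` within `radius` steps, in either direction."""
    parents, children = defaultdict(set), defaultdict(set)
    for a, b in edges:                       # edge a -> b
        children[a].add(b)
        parents[b].add(a)
    frontier, seen = {node}, {node}
    for _ in range(radius):
        frontier = {n for f in frontier for n in parents[f] | children[f]} - seen
        seen |= frontier
    return seen

def write_dot(edges, nodes, path):
    """Write the induced subnetwork on `nodes` as a DOT digraph."""
    with open(path, "w") as fh:
        fh.write("digraph BN {\n")
        for a, b in edges:
            if a in nodes and b in nodes:
                fh.write(f'  "{a}" -> "{b}";\n')
        fh.write("}\n")

# Example with hypothetical node names:
edges = [("SNP_4075", "APO_E"), ("SNP_3937", "APO_E"), ("APO_E", "TRIG"), ("CHOL", "HDL")]
sub = neighborhood(edges, "APO_E", radius=1)
write_dot(edges, sub, "apoe_neighborhood.dot")   # then: dot -Tpdf apoe_neighborhood.dot -o out.pdf
```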
The reconstructed BNs are shown in Figure 2 (Fig. 2a, African Americans; Fig. 2b, non-Hispanic whites). We refer the reader to Rodin et al. (2005) and Rodin et al. (2012) for the biological interpretation of the networks and will remark instead on the technical aspects of the network reconstruction and visualization.

FIG. 2. BNs reconstructed from the APOE datasets. (a) African Americans from Jackson, Mississippi; (b) non-Hispanic whites from Rochester, Minnesota. Numbers next to BN edges indicate edge strengths. See text for interpretation of edge strength and disconnected nodes. APO_E, APO_A, APO_B, TRIG, CHOL, and HDL stand for levels of apolipoproteins E, AI, and B, triglycerides, cholesterol, and high-density lipoprotein cholesterol, respectively. Number nodes indicate the corresponding APOE SNPs. APOE, apolipoprotein E; SNP, single-nucleotide polymorphism.

The BNs in Figure 2 were reconstructed using the MDL scoring metric. The continuous variables were discretized into three bins using entropy-based discretization with the MDL stopping criterion. The MDL scoring metric imposes a relatively high penalty on model complexity, resulting in comparatively sparse (i.e., slightly underfitting) reconstructed BNs. By varying the coefficients in the MDL/AIC metrics, the degree of overfitting can be adjusted. The search continued until full exhaustion (bounded by the machine epsilon, ∼2⁻⁵³). The absence of any edges (hanging nodes at the top of Fig. 2a, b) indicates that the corresponding nodes are independent of the other variables. A number next to an edge is proportional to the ratio of the model score of the BN with the edge to that of the BN without it and quantifies the edge strength (also reflected in edge color or thickness, which are among the output options). In both networks, the Markov neighborhoods of the APO_E node are consistent with what is known about the biology of ApoE and the genetic epidemiology of APOE variation. In fact, the primary reason behind using these two datasets as a benchmark example application throughout the BNOmics development is that the APOE system can serve as a well-established true positive control with effects of known strength. As we change different algorithmic parameters and software settings, it is possible to study the robustness, sensitivity, and specificity of BN reconstruction by comparing the resulting APOE networks. For example, while similar BNs were obtained by us previously using different scoring metrics and search engines (Rodin et al., 2005; Rodin et al., 2012), the current BNs (Fig. 2) show improved robustness due to the exhaustive model selection process. On a related note, we also applied BNOmics to the commonly used benchmark datasets Alarm (Beinlich et al., 1989), Asia (Lauritzen and Spiegelhalter, 1988), and other well-known small datasets (see http://www.bnlearn.com/ for a repository), resulting in networks similar or identical to the true BNs and to the BNs reconstructed by comparable software packages (results not shown).

4.1.2. GWAS analysis in the ARIC study

This application demonstrates BNOmics' scalability in both BN reconstruction and visualization. The ARIC study is a population-based prospective cohort study of CHD and its risk factors (ARIC Investigators, 1989). ARIC recruited 15,792 non-Hispanic white and African American individuals aged 45–64 years at baseline (1987–89), chosen by probability sampling.
The phenotypes of interest to the ARIC study include (among many others) CHD endpoint events (CHD deaths, myocardial infarction, and hospitalized congestive heart failure), stroke events, metabolic syndrome, diabetes, blood pressure, and blood lipids. DNA samples (∼900,000 SNPs genotyped using the Affymetrix 6.0 chip) have been collected from all members of the ARIC cohort. When reconstructing BNs from the ARIC GWAS data, we were particularly interested in blood lipid phenotypes, thus concentrating on visualization of the subnetworks containing third-, second-, and first-order (radius) Markov neighborhoods of the blood lipid variables (shown in Fig. 3a–c). It should be emphasized that the single reconstructed BN was built from all ∼900,000 SNP variables available—it is not shown for obvious reasons, as even the third-order Markov neighborhood subnetwork is already almost impossible to visualize. Figure 3d shows a subnetwork containing non-SNP variables only.

FIG. 3. (a–c) Visualization of the subnetworks of a BN reconstructed from the ARIC GWAS dataset. (a) Third-order (radius) Markov neighborhoods of blood lipid and epidemiological variables (nodes 1–8). Other number nodes correspond to the working SNP designations. Such a fine scale does not permit sensible visualization and is shown for methodology illustration purposes only. (b) Second-order (radius) Markov neighborhoods of blood lipid and epidemiological variables. (c) First-order (radius) Markov neighborhoods of blood lipid and epidemiological variables. Numbers next to BN edges indicate edge strengths. Sex, v1age01, hdl01, totchol, ldl02, trigs, bmi01, and glucos01 stand for gender, age, high-density lipoprotein cholesterol, total cholesterol, low-density lipoprotein cholesterol, triglycerides, BMI, and plasma glucose, respectively. Number nodes indicate the corresponding SNPs. (d) BN reconstructed from the eight non-SNP variables only, for comparison purposes. GWAS, genome-wide association study.

4.1.3. Metabolomic analysis in the ARIC study

This example illustrates BNOmics' ability to deal with heterogeneous (mixed) and unknown data types. One of the major challenges with data of this nature is that most variables are continuous with largely unknown (and different) distributions. A detailed description of metabolomic profile capture in a subset of ARIC study participants can be found in Zheng et al. (2014). Briefly, biochemical profiles of human sera (∼600 metabolites) from ∼2000 ARIC African American individuals were obtained using the current Metabolon platform (untargeted mass spectrometry). Approximately 350 known and ∼250 unknown metabolites were measured. For the purposes of this application, the primary phenotypes of interest were hypertension and heart failure (common in this subcohort). The analysis goals were to reconstruct the complete BN, to investigate the immediate network neighborhoods of the hypertension and heart failure nodes, and to possibly assist in identifying the unknown metabolites; 208 reliably reproducible metabolites were included in the final BN analysis, together with relevant epidemiological variables. The full BN is shown in Figure 4a; Figure 4b depicts the immediate (first-order) Markov neighborhood of the hypertension phenotype variable. One unknown metabolite (X–11372) was implicated and is currently being followed up.

FIG. 4. (a) BN reconstructed from the ARIC metabolomic profile dataset. (b) Visualization of a first-order (radius) Markov neighborhood subnetwork of the hypertension phenotype node (HYPERT05). Numbers next to BN edges indicate edge strengths.
Epidemiological and known metabolite node designations are largely self-explanatory (e.g., V1AGE01, glycerol). X—<…> nodes indicate unknown metabolites. See Zheng et al., 2014, for more detail.

4.2. Scalability

We have successfully applied BNOmics to various large-scale GWAS and metabolomic datasets, scalability being limited only by the hardware (memory) and preset computational time limits. Specifically, constructing a robust (converged) BN from a 900,000-SNP GWAS dataset took about 7 days on a regular 16 GB, 8-core workstation; a dataset with ∼100,000 variables required ∼27 hours, and one with ∼10,000 variables required ∼10 hours (note that all of the efficiency improvements outlined above in the data storage and input format section were implemented in a manner specific to each dataset). Therefore, we suggest that potential users experiment with maxing out BNOmics on their respective hardware platforms. We also found the scipy weave package (which allows inclusion of C/C++ code into Python) to be potentially useful when dealing with extremely large datasets. At this time, there is only limited provision for parallelization, but an improved version of the algorithm that can take advantage of parallel computing is currently in the works. This becomes especially relevant when the number of samples exceeds 10⁶. Currently, coarse multithreading/parallelization at the ancestor search level does not seem to provide significant benefit, because data passing to threads dominates the process for large sample sizes, and the resulting process tends to be even less efficient than the serial version of the algorithm. A finer granularity level for multithreading/parallelization is necessary. Concurrently, in the future, we plan to switch to contiguous C arrays for improved data storage and retrieval. In general, we intend to recode the BNOmics prototype completely in C and maintain two versions—a Python version for BN modeling research and methods testing and a C version for the actual data analyses. This said, BNOmics is already more efficient than commonly used general-purpose BN reconstruction packages (e.g., bnlearn, http://www.bnlearn.com/, and pebl, http://code.google.com/p/pebl-project/). It should be reemphasized that while BNOmics outputs (in DOT language) a complete BN, for sufficiently large networks, it is impractical to deal with the output in its entirety (indeed, it makes no sense to generate a .pdf file visualizing a network with more than 100–200 nodes). Therefore, users are advised to concentrate on smaller subnetworks by generating and visualizing lists of parents and children (i.e., immediate ancestors and descendants, respectively) of certain nodes of interest, such as phenotypes (see typical examples in the preceding section), or subgraphs defined by the order/radius of relationship (a certain number of generations up and down the family tree). In the future, we plan to extend this feature by adding an adjacency matrix representation of the ancestor–descendant relationships.

4.3. Future directions

We are committed to continuing open-source, flexible code development of BNOmics. While the basic modular organization of the package will remain intact, we plan to introduce major improvements in usability and efficiency in response to the increasing availability of very large-scale datasets (that are presently being generated as part of our ongoing research projects).
We are also soliciting ideas and suggestions for improvements from the greater community of potential users, with a special emphasis on (but certainly not limited to) automated format conversion for diverse datasets. We are currently carrying out analyses of epigenetic (methylome) and immune system datasets, to name just a couple. Another area of future concentration is using BNOmics to vary and compare, in the context of comprehensive simulation studies, different BN reconstruction alternatives with respect to scoring functions/metrics, local search/optimization algorithms, and discretization procedures. The Python code was developed from the very beginning with such simulation studies in mind, as various changes in the reconstruction scheme can be incorporated on the fly. In this sense, BNOmics can be used not only as a data analysis tool but also as a platform for the investigation, and improvement, of different aspects of the BN modeling process. Work currently in progress is itemized in Table 1.

Table 1. Work on the Algorithm/Software Improvements Currently in Progress

5. Conclusions

Three features of BNOmics set it apart from comparable alternatives—first, its high computational efficiency and scalability; second, the flexibility and open nature of the source code; and third, its immediate applicability to the large-scale datasets generated by omics studies. BNOmics has been very useful in our own research projects and collaborations. Currently, there is substantial interest in applying systems biology thinking and analysis methods to large-scale omics data (Qi et al., 2014; Agostinho et al., 2015; Sherif et al., 2015; Marini et al., 2015; Yin et al., 2015b; Kaiser et al., 2016). However, the assortment of workable systems biology data analysis tools is very limited, especially if the ultimate goal is reverse engineering of biological networks from massive flat datasets. The BN is a useful paradigm for biological network reconstruction and modeling, and BNOmics is a powerful implementation thereof.

6. Availability and Requirements

Project name: BNOmics
Project home page: TBA (City of Hope); available directly from the authors or at Bitbucket (baseline implementation)
Operating system(s): Source code is available for any standard Python implementation
Programming language: Python, C/C++
Other requirements: Python interpreter, linear algebra libraries (numpy), and optional C/C++ code inclusion libraries (scipy)
License: Worldwide nonexclusive, standard open-source
