Heuristic algorithms for feature selection under Bayesian models with block-diagonal covariance structure

Abstract

Background: Many bioinformatics studies aim to identify markers, or features, that can be used to discriminate between distinct groups. In problems where strong individual markers are not available, or where interactions between gene products are of primary interest, it may be necessary to consider combinations of features as a marker family. To this end, recent work proposes a hierarchical Bayesian framework for feature selection that places a prior on the set of features we wish to select and on the label-conditioned feature distribution. While an analytical posterior under Gaussian models with block covariance structures is available, the optimal feature selection algorithm for this model remains intractable since it requires evaluating the posterior over the space of all possible covariance block structures and feature-block assignments. To address this computational barrier, in prior work we proposed a simple suboptimal algorithm, 2MNC-Robust, with robust performance across the space of block structures. Here, we present three new heuristic feature selection algorithms.

Results: The proposed algorithms outperform 2MNC-Robust and many other popular feature selection algorithms on synthetic data. In addition, enrichment analysis on real breast cancer, colon cancer, and Leukemia data indicates they also output many of the genes and pathways linked to the cancers under study.

Conclusions: Bayesian feature selection is a promising framework for small-sample high-dimensional data, in particular biomarker discovery applications. When applied to cancer data these algorithms output many genes already shown to be involved in cancer as well as potentially new biomarkers. Furthermore, one of the proposed algorithms, SPM, outputs blocks of heavily correlated genes, particularly useful for studying gene interactions and gene networks.

Electronic supplementary material: The online version of this article (10.1186/s12859-018-2059-8) contains supplementary material, which is available to authorized users.

Background

Many bioinformatics studies aim to identify predictive biomarkers that can be used to establish diagnosis or prognosis, or to predict a drug response [1–3]. This problem can often be framed as a feature selection task, where the goal is to identify a list of features (molecular biomarkers) that can discriminate between groups of interest based on high-dimensional data from microarray, RNA-seq, or other high-throughput technologies. Initially, exploratory studies are often conducted on small samples to generate a shortlist of biomarker candidates before a large-sample validation study is performed [4]. However, such studies have too often been unsuccessful at producing reliable and reproducible biomarkers [5]. Biomarker discovery is inherently difficult, given the large number of features, highly complex interactions between genes and gene products, the enormous variety of dysfunctions that can occur, and many sources of error in the data. As a result, feature selection algorithms are often implemented without much consideration of the particular demands of the problem. For instance, variants of the t-test are perhaps the most widely implemented selection strategies in bioinformatics, but they can only detect strong individual features and fail to take correlations into account.
Given that molecular signaling is often inherently multivariate, there is a need for methods that can account for correlations and extract combinations of features as a marker family. Wrapper methods do this by ranking sets of features according to some objective function, usually the error of a classifier. However, methods based on classifier error are computationally expensive, and may not necessarily produce the best markers; indeed, strong features can be excluded if they are correlated with other strong features. Furthermore, analysis downstream from feature selection may include gene set enrichment analysis, where the hope is to identify known pathways or other biological mechanisms that contain a statistically significant number of genes in the reported gene set, or may involve the development of new pathways and gene networks. We are thus motivated to develop methods that not only select markers useful for discrimination, but select all relevant markers, even individually weak ones. To address this, in prior work we proposed a hierarchical Bayesian framework for feature selection, labeling features as “good” or “bad”, where good features are those we wish to select, i.e., biomarkers. This framework places a prior on the set of good features and the underlying distribution parameters. Three Gaussian models have been considered. Under independent features, Optimal Bayesian Filtering reports a feature set of a given size with a maximal expected number of truly good features (CMNC-OBF) [6]. Assuming fully dependent good features and independent bad features, 2MNC-DGIB is a fast suboptimal method that ranks features by evaluating all sets of size 2 [7]. Finally, assuming good and bad features are separately dependent, 2MNC-Robust proposes an approximation of the posterior on good features and uses a ranking strategy similar to 2MNC-DGIB to select features [8]. While 2MNC-DGIB has outstanding performance when its assumptions are satisfied [7], it performs poorly when bad features are dependent [9]. On the other hand, CMNC-OBF and 2MNC-Robust have been shown to have robust performance across Bayesian models with block-diagonal covariances [9]. CMNC-OBF is extremely fast and enjoys particularly excellent performance when markers are individually strong with low correlations, but, like all filter methods, may miss weak features that are of interest due to high correlations with strong features [6, 9]. 2MNC-Robust is computationally very manageable and generally improves upon CMNC-OBF in the presence of correlations. Although CMNC-OBF and 2MNC-Robust are robust to different block-diagonal covariance structures, they do not attempt to detect these underlying structures, and their assumptions and approximations constrain performance. Thus, in this work we propose three new feature selection algorithms that: (1) use an iterative strategy to update the approximate posterior used in 2MNC-Robust, (2) use a novel scoring function inspired by Bayes factors to improve overall rankings, and (3) attempt to actually detect the underlying block structure of the data. We show that these algorithms have comparable computation time to 2MNC-Robust, while outperforming 2MNC-Robust and many other popular feature selection algorithms on a synthetic Bayesian model assuming block-diagonal covariance matrices, and a synthetic microarray data model. Finally, we apply the proposed algorithms and CMNC-OBF to breast cancer, colon cancer, and AML datasets, and perform enrichment analysis on each to address validation. 
Feature selection model

We review a hierarchical Bayesian model that serves as a reference for the approximate posterior developed in 2MNC-Robust [8, 9] and will be used in the algorithms we present in the next section. Consider a binary feature selection problem with class labels $y = 0, 1$. Let $F$ be the set of feature indices. Assume features are partitioned into blocks, where features in each block are dependent, but features in different blocks are independent. Assume each block is either good or bad. A good block has different class-conditioned distributions between the two classes, while a bad block has the same distribution in both classes. We denote a partitioning of $F$ into good and bad blocks by $P=(P_G,P_B)$, and hereafter call it a feature partition, where $P_G=\{G_1,\cdots,G_u\}$ is the set of $u$ good blocks and $P_B=\{B_1,\cdots,B_v\}$ is the set of $v$ bad blocks. Furthermore, denote the set of all features in good blocks as good features, $G=\cup_{i=1}^{u} G_i$, and denote all features in bad blocks as bad features, $B=\cup_{j=1}^{v} B_j$. Denote the random feature partition by $\bar{P}=(\bar{P}_G,\bar{P}_B)$, the random set of good features by $\bar{G}$, and the random set of bad features by $\bar{B}$. We define $\pi(P)=\mathrm{P}(\bar{P}=P)$ to be the prior distribution on $\bar{P}$. Let $P$ be fixed. Let $\theta^P$ be the parameter describing the joint feature distribution of $P$.
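For concreteness, a feature partition can be represented directly in code. The sketch below (Python; block contents are invented for illustration and are not from the paper) shows one partition of ten features into two good blocks and two bad blocks, and recovers $G$ and $B$:

```python
# A feature partition P = (P_G, P_B): a minimal illustrative sketch.
# Block contents here are arbitrary examples, not from the paper.
P_G = [{0, 1, 2}, {3, 4}]          # u = 2 good blocks
P_B = [{5, 6}, {7, 8, 9}]          # v = 2 bad blocks

G = set().union(*P_G)              # good features: union of good blocks
B = set().union(*P_B)              # bad features: union of bad blocks

assert G.isdisjoint(B)             # the blocks partition the feature set F
F = G | B
print(sorted(G))                   # [0, 1, 2, 3, 4]
```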
Since blocks are independent of each other we can write $\theta^P = \left[\theta^{G_1}_0,\cdots,\theta^{G_u}_0,\theta^{G_1}_1,\cdots,\theta^{G_u}_1,\theta^{B_1},\cdots,\theta^{B_v}\right]$, where $\theta^{G_i}_y$ parametrizes class-$y$ features in $G_i$, and $\theta^{B_j}$ parametrizes features in $B_j$. Assume the $\theta^{G_i}_y$ and $\theta^{B_j}$ are independent given $P$, i.e., $\pi\left(\theta^P\right)=\prod_{i=1}^{u} \pi\left(\theta^{G_i}_0\right) \pi\left(\theta^{G_i}_1\right) \prod_{j=1}^{v} \pi\left(\theta^{B_j}\right)$. Given a training set, $\mathcal{S}$, of $n$ independent and identically distributed (i.i.d.) points, with $n_y$ points in each class, we have $\pi^*\left(\theta^{G_i}_y\right) \propto \pi\left(\theta^{G_i}_y\right) l\left(\mathcal{S}^{G_i}_y \mid \theta^{G_i}_y\right)$ and $\pi^*\left(\theta^{B_j}\right) \propto \pi\left(\theta^{B_j}\right) l\left(\mathcal{S}^{B_j} \mid \theta^{B_j}\right)$, where $\pi^*(\cdot)$ denotes posterior, $\mathcal{S}^{G_i}_y$ and $\mathcal{S}^{B_j}$ are the class-$y$ points in $G_i$ and the points in $B_j$, respectively, and $l(\cdot \mid \cdot)$ are the likelihoods. Following steps in [7, 10], we have

$$\pi^{*}(P) \propto \pi(P) \prod_{i=1}^{u} \left[\int \pi\left(\theta^{G_i}_0\right) l\left(\mathcal{S}^{G_i}_0 \mid \theta^{G_i}_0\right) d\theta^{G_i}_0 \int \pi\left(\theta^{G_i}_1\right) l\left(\mathcal{S}^{G_i}_1 \mid \theta^{G_i}_1\right) d\theta^{G_i}_1\right] \prod_{j=1}^{v} \int \pi\left(\theta^{B_j}\right) l\left(\mathcal{S}^{B_j} \mid \theta^{B_j}\right) d\theta^{B_j}. \tag{1}$$

In addition, the marginal posterior of a feature set $G$ is $\pi^*(G) = \sum_{P : \cup P_G = G} \pi^*(P)$, and the marginal posterior of a feature $f$ is $\pi^*(f) = \sum_{G : f \in G} \pi^*(G)$. Note $\pi^*(f)$ is different than $\pi^*(\{f\})$.
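As a toy illustration of these marginalizations (a sketch with invented posterior values, not results from the paper), given partition posteriors one can compute $\pi^*(G)$ and $\pi^*(f)$ by direct summation:

```python
# Toy marginalization over an enumerated set of feature partitions.
# The posterior values below are invented for illustration only.
from collections import defaultdict

# Each entry: (good set G = union of good blocks of P, posterior pi*(P)).
# Several partitions may induce the same good set G.
partition_posteriors = [
    (frozenset({0, 1}), 0.40),   # one good block {0, 1}
    (frozenset({0, 1}), 0.15),   # {0} and {1} as two good blocks
    (frozenset({2}),    0.30),
    (frozenset(),       0.15),   # no good features
]

post_G = defaultdict(float)
for good_set, p in partition_posteriors:
    post_G[good_set] += p        # pi*(G): sum over partitions with union(P_G) = G

post_f = defaultdict(float)
for good_set, p in post_G.items():
    for f in good_set:
        post_f[f] += p           # pi*(f): sum over sets G containing f

print(dict(post_f))              # e.g. {0: 0.55, 1: 0.55, 2: 0.3}
```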
Gaussian model

Here we solve Eq. (1) for jointly Gaussian features. We assume for a block $A$, $\theta^A_y=\left[\mu^A_y, \Sigma^A_y\right]$ and $\theta^A=\left[\mu^A,\Sigma^A\right]$, where $\mu^A_y$ and $\mu^A$ are the mean vectors, and $\Sigma^A_y$ and $\Sigma^A$ are the covariance matrices. Let $P$ be a feature partition. Suppose $A$ is a good block of $P$. Assume $\pi(\theta^A_y)$ is Normal-Inverse-Wishart (NIW). Hence, $\pi\left(\theta^A_y\right)=\pi\left(\Sigma^A_y\right) \pi\left(\mu^A_y \mid \Sigma^A_y\right)$, where

$$\begin{aligned} \pi\left(\Sigma^{A}_{y}\right) &= K^{A}_{y} \left|\Sigma^{A}_{y}\right|^{-\frac{\kappa^{A}_{y}+|A|+1}{2}} \exp\left(-0.5\, \text{Tr}\left(S^{A}_{y} \left(\Sigma^{A}_{y}\right)^{-1}\right) \right), \\ \pi\left(\mu^{A}_{y} \mid \Sigma^{A}_{y}\right) &= L^{A}_{y} \left|\Sigma^{A}_{y}\right|^{-0.5} \exp\left(-0.5\, \nu^{A}_{y} \left(\mu^{A}_{y}-m^{A}_{y}\right)^{T} \left(\Sigma^{A}_{y}\right)^{-1} \left(\mu^{A}_{y}-m^{A}_{y}\right) \right), \end{aligned}$$

where $|\cdot|$ applied to a matrix denotes determinant. $S^A_y$, $\kappa^A_y$, $m^A_y$, and $\nu^A_y$ are hyperparameters, which are assumed given and fixed. $S^A_y$ is an $|A| \times |A|$ matrix, where $|\cdot|$ applied to a set denotes cardinality.
For a proper prior, $S^A_y$ is symmetric and positive-definite, and $\kappa^A_y > |A|-1$. If $\kappa^A_y > |A|+1$, then $E\left(\Sigma^A_y\right) = S^A_y/\left(\kappa^A_y - |A| - 1\right)$. Furthermore, $m^A_y$ is an $|A| \times 1$ vector describing the average mean of features, and for a proper prior we need $\nu^A_y > 0$. $K^A_y$ and $L^A_y$ represent the relative weights of each distribution. For a proper distribution we have $K^A_y = \left|S^A_y\right|^{0.5 \kappa^A_y} 2^{-0.5 \kappa^A_y |A|} / \Gamma_{|A|}\left(0.5 \kappa^A_y\right)$ and $L^A_y = \left(2\pi/\nu^A_y\right)^{-0.5|A|}$, where $\Gamma_d$ denotes the multivariate gamma function.
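These normalization constants are most safely computed in log space. A small numerical sketch (assuming NumPy and SciPy; the function name is ours) that also checks the properness conditions above:

```python
import numpy as np
from scipy.special import multigammaln

def log_niw_constants(S, kappa, nu):
    """Log of K and L for a proper NIW prior on a block of size |A|.

    S: |A| x |A| symmetric positive-definite scale matrix
    kappa > |A| - 1 and nu > 0 are required for a proper prior.
    """
    d = S.shape[0]                       # block size |A|
    sign, logdet_S = np.linalg.slogdet(S)
    assert sign > 0 and kappa > d - 1 and nu > 0, "prior is not proper"
    log_K = (0.5 * kappa * logdet_S - 0.5 * kappa * d * np.log(2.0)
             - multigammaln(0.5 * kappa, d))
    log_L = -0.5 * d * np.log(2.0 * np.pi / nu)
    return log_K, log_L

# Example: a 2x2 block with an identity scale matrix.
log_K, log_L = log_niw_constants(np.eye(2), kappa=5.0, nu=1.0)
```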
Since the NIW distribution is a conjugate prior for the Gaussian, given the sample, $\pi^*\left(\theta^A_y\right)$ is again NIW with updated hyperparameters: $\kappa^{A^*}_y = \kappa^A_y + n_y$, $\nu^{A^*}_y = \nu^A_y + n_y$, $m^{A^*}_y = \frac{\nu^A_y m^A_y + n_y \hat{\mu}^A_y}{\nu^{A^*}_y}$, and

$$S^{A^{*}}_{y} = S^{A}_{y} + (n_{y}-1)\, \hat{\Sigma}^{A}_{y} + \frac{\nu^{A}_{y} n_{y}}{\nu^{A}_{y} + n_{y}} \left(\hat{\mu}^{A}_{y}-m^{A}_{y}\right)\left(\hat{\mu}^{A}_{y}-m^{A}_{y}\right)^{T},$$

where $\hat{\mu}^A_y$ and $\hat{\Sigma}^A_y$ are the class-conditioned sample mean and covariance on block $A$, respectively [11].
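The update can be written compactly for a single block; below is a sketch (NumPy; the function name is ours), applied per class for a good block and, as described next, to the pooled sample for a bad block:

```python
import numpy as np

def niw_update(S, kappa, m, nu, X):
    """Conjugate NIW hyperparameter update for one block.

    S, kappa, m, nu: prior hyperparameters for the block
    X: n x |A| array of observations restricted to the block
    Returns the posterior hyperparameters (S*, kappa*, m*, nu*).
    """
    n = X.shape[0]
    mu_hat = X.mean(axis=0)
    # (n - 1) times the sample covariance; guard the n = 1 case.
    scatter = ((n - 1) * np.cov(X, rowvar=False) if n > 1
               else np.zeros((X.shape[1], X.shape[1])))
    diff = (mu_hat - m).reshape(-1, 1)
    S_post = S + scatter + (nu * n / (nu + n)) * (diff @ diff.T)
    return S_post, kappa + n, (nu * m + n * mu_hat) / (nu + n), nu + n

# Example: update a 2-feature good block with 5 class-0 points.
rng = np.random.default_rng(0)
X0_block = rng.normal(size=(5, 2))
S_post, kappa_post, m_post, nu_post = niw_update(np.eye(2), 5.0,
                                                 np.zeros(2), 1.0, X0_block)
```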
Now suppose $A$ is a bad block. We assume the prior on $\theta^A$ is NIW with hyperparameters $S^A$, $\kappa^A$, $m^A$, and $\nu^A$, and relative weights $K^A$ and $L^A$. Given the sample, $\pi^*(\theta^A)$ is NIW with $\kappa^{A^*} = \kappa^A + n$, $\nu^{A^*} = \nu^A + n$, $m^{A^*} = \frac{\nu^A m^A + n \hat{\mu}^A}{\nu^{A^*}}$, and

$$S^{A^{*}} = S^{A} + (n-1)\, \hat{\Sigma}^{A} + \frac{\nu^{A} n}{\nu^{A}+n} \left(\hat{\mu}^{A}-m^{A}\right)\left(\hat{\mu}^{A}-m^{A}\right)^{T},$$

where $\hat{\mu}^A$ and $\hat{\Sigma}^A$ are the sample mean and covariance on block $A$, pooled over both classes, respectively [11]. As long as $\pi^*(P)$ is proper, using the normalization constant of the NIW distribution to compute the integrals in Eq. (1) we have

$$\pi^{*}(P) \propto \pi(P) \prod_{i=1}^{u} Q^{G_{i}}_{0} Q^{G_{i}}_{1} \left|S^{G_{i}^{*}}_{0}\right|^{-0.5 \kappa^{G_{i}^{*}}_{0}} \left|S^{G_{i}^{*}}_{1}\right|^{-0.5 \kappa^{G_{i}^{*}}_{1}} \prod_{j=1}^{v} Q^{B_{j}} \left|S^{B_{j}^{*}}\right|^{-0.5 \kappa^{B_{j}^{*}}},$$

where

$$\begin{aligned} Q^{A}_{y} &= K^{A}_{y} L^{A}_{y}\, 2^{0.5 \kappa^{A^{*}}_{y} |A|}\, \Gamma_{|A|}\left(0.5\kappa^{A^{*}}_{y}\right) \left({2 \pi}/{\nu^{A^{*}}_{y}} \right)^{0.5|A|}, \\ Q^{A} &= K^{A} L^{A}\, 2^{0.5 \kappa^{A^{*}} |A|}\, \Gamma_{|A|}\left(0.5 \kappa^{A^{*}}\right) \left({2 \pi}/{\nu^{A^{*}}} \right)^{0.5|A|}. \end{aligned}$$
Assuming: (1) $\pi(P)$ is such that the block structure, i.e., the number and size of good and bad blocks, is fixed, (2) for each good block $A$, $K^A_y$, $L^A_y$, $\kappa^A_y$, and $\nu^A_y$ do not depend on the feature indices in $A$, and (3) for each bad block $A$, $K^A$, $L^A$, $\kappa^A$, and $\nu^A$ do not depend on the feature indices in $A$,

$$\pi^{*}(P) \propto \pi(P) \left(\prod_{i=1}^{u} \left|S^{G_{i}^{*}}_{0}\right|^{\kappa^{G_{i}^{*}}_{0}} \left|S^{G_{i}^{*}}_{1}\right|^{\kappa^{G_{i}^{*}}_{1}} \prod_{j=1}^{v} \left|S^{B_{j}^{*}}\right|^{\kappa^{B_{j}^{*}}} \right)^{-0.5}.$$
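Under these assumptions, the unnormalized log posterior of a partition can be scored directly from the updated scale matrices. The sketch below (NumPy; reuses the niw_update helper above; the hyperparameter choices in default_prior are illustrative, and a uniform $\pi(P)$ over admissible partitions is assumed) makes this concrete:

```python
import numpy as np

def default_prior(d):
    # Illustrative NIW hyperparameters for a block of size d (our choice):
    # identity scale, kappa = d + 2 (> d + 1, so E(Sigma) exists), nu = 1.
    return np.eye(d), d + 2.0, np.zeros(d), 1.0

def log_partition_posterior(X0, X1, P_G, P_B, prior=default_prior):
    """log pi*(P) up to an additive constant, assuming a uniform pi(P)
    over partitions sharing the same block structure."""
    X = np.vstack([X0, X1])                 # pooled sample for bad blocks
    score = 0.0
    for block in P_G:                       # good blocks: classes kept separate
        idx = sorted(block)
        S, kappa, m, nu = prior(len(idx))
        for Xy in (X0[:, idx], X1[:, idx]):
            S_post, kappa_post, _, _ = niw_update(S, kappa, m, nu, Xy)
            score += -0.5 * kappa_post * np.linalg.slogdet(S_post)[1]
    for block in P_B:                       # bad blocks: classes pooled
        idx = sorted(block)
        S, kappa, m, nu = prior(len(idx))
        S_post, kappa_post, _, _ = niw_update(S, kappa, m, nu, X[:, idx])
        score += -0.5 * kappa_post * np.linalg.slogdet(S_post)[1]
    return score
```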
Methods

Here we describe the set selection methods used. Note we aim to find the set of true good features, rather than the true underlying feature partition. The Maximum Number Correct (MNC) criterion [7] outputs the set maximizing the expected number of correctly labeled features, and the Constrained MNC (CMNC) criterion outputs the set with the maximum expected number of correctly labeled features, constrained to having exactly $D$ selected features, where $D$ is a parameter of the optimization problem [9]. The solution of MNC is $\{f \in F : \pi^*(f) > 0.5\}$ [7] and the solution of CMNC is picking the top $D$ features with largest $\pi^*(f)$ [9]. Therefore, both MNC and CMNC require computing $\pi^*(f)$ for all $f \in F$, which is not computationally feasible for an arbitrary block structure unless $|F|$ is very small. We review two previously proposed algorithms, OBF and 2MNC-Robust, and then present three new algorithms.

Optimal Bayesian filter

Optimal Bayesian Filter (OBF) assumes all blocks have size one, i.e., all features are independent, and assumes the events $\{f \in \bar{G}\}$ are independent a priori. In this case $\pi^*(f)$ can be found in closed form with little computational cost [6, 9]. OBF is optimal under its modeling assumptions. As argued in [9], in the presence of correlation OBF is a robust suboptimal algorithm that can detect individually strong good features, i.e., those whose mean and/or variance is very different between the two classes, but it cannot take advantage of correlations to correctly label individually weak good features, those whose mean and variance are similar in both classes.

2MNC-Robust

The 2MNC algorithm [7] suggests approximating $\pi^*(f)$ using $\pi^*(G)$ for all sets $G$ such that $|G|=2$, and picking the top $D$ features. Since finding $\pi^*(G)$ for all feature partitions where $|\cup P_G|=2$ is typically infeasible, an approximate posterior, $\tilde{\pi}^*(G)$, is proposed [8], defined for all $G \subseteq F$ of size 2 (its Gaussian-model form is given in Eq. (2) below). The normalization constant is found such that $\sum_{G \subseteq F: |G|=2} \tilde{\pi}^*(G) = 1$. $\tilde{\pi}(G)$ mimics the role of $\pi(G) = \mathrm{P}(\bar{G}=G) = \sum_{P: \cup P_G = G} \pi(P)$. Using a suboptimal method that replaces $\pi^*(G)$ with $\tilde{\pi}^*(G)$ might affect one's choice of the value used as the prior of a feature set. For example, knowing $|\bar{G}| > 2$ implies $\pi(G)=0$ for all sets of size 2; however, 2MNC-Robust only evaluates such sets. In this case, $\tilde{\pi}(G) = \mathrm{P}(G \subseteq \bar{G})$ might be a suitable choice to replace $\pi(G)$. $\tilde{\pi}^*(f) = \sum_{G: f \in G} \tilde{\pi}^*(G)$ is the approximate marginal posterior of $f \in F$.
For the Gaussian model, if the number of good features is fixed and the hyperparameters do not depend on the feature indices,

$$\tilde{\pi}^{*}(G) \propto \tilde{\pi}(G) \left({\left|S^{G^{*}}_{0}\right|^{\kappa^{G^{*}}_{0}} \left|S^{G^{*}}_{1}\right|^{\kappa^{G^{*}}_{1}}}\Big/{\left|S^{G^{*}}\right|^{\kappa^{G^{*}}}} \right)^{-0.5}. \tag{2}$$

2MNC-Robust implements 2MNC with $\tilde{\pi}^*(f)$. As mentioned before, 2MNC-Robust does not tune itself to the underlying block structure of the data.

Recursive marginal posterior inflation

It is easy to show that $\sum_{f \in F} \tilde{\pi}^*(f) = 2$ when only sets of size 2 are used to find $\tilde{\pi}^*(f)$. Hence, under the MNC criterion one would pick at most 4 good features, implying we underestimate $\pi^*(f)$ by only using sets of size 2 when $|\bar{G}| \gg 2$. REcursive MArginal posterior INflation (REMAIN) aims to sequentially detect good features by rescaling $\tilde{\pi}^*(f) = \sum_{G: f \in G, |G|=2} \tilde{\pi}^*(G)$. We initialize REMAIN with the set of all features, $F_r = F$. Then, REMAIN uses the MArginal posterior INflation (MAIN) algorithm to identify several features as good, removes them from $F_r$, and feeds MAIN with the truncated $F_r$ to select additional features. This process iterates until MAIN does not output any features. REMAIN is nothing but repeated calls to MAIN with shrinking feature sets, making MAIN the heart of this algorithm. Pseudo-code of MAIN is provided in Algorithm 1, where $H(G)$ is the right-hand side of Eq. (2).
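The pair score $H(G)$ of Eq. (2) is the computational core of 2MNC-Robust and of the algorithms below. A sketch (NumPy; reuses niw_update and default_prior; a flat $\tilde{\pi}(G)$ is assumed, so the prior term drops out) of the score and of the 2MNC-Robust ranking built on it:

```python
import numpy as np
from itertools import combinations

def log_H(X0, X1, idx, prior=default_prior):
    """log of the right-hand side of Eq. (2) for feature set idx, with a
    flat prior term: -0.5 (k0* log|S0*| + k1* log|S1*| - k* log|S*|)."""
    S, kappa, m, nu = prior(len(idx))
    terms = []
    for X in (X0[:, idx], X1[:, idx], np.vstack([X0, X1])[:, idx]):
        S_post, kappa_post, _, _ = niw_update(S, kappa, m, nu, X)
        terms.append(kappa_post * np.linalg.slogdet(S_post)[1])
    return -0.5 * (terms[0] + terms[1] - terms[2])

def two_mnc_robust(X0, X1, D):
    """Rank features by the approximate marginal posterior computed from
    all pairs, and return the indices of the top D features."""
    p = X0.shape[1]
    logs = {G: log_H(X0, X1, list(G)) for G in combinations(range(p), 2)}
    c = max(logs.values())                        # stabilize exponentiation
    w = {G: np.exp(s - c) for G, s in logs.items()}
    total = sum(w.values())                       # so sum_G pi~*(G) = 1
    marg = np.zeros(p)
    for (i, j), v in w.items():
        marg[i] += v / total                      # pi~*(f); marginals sum to 2
        marg[j] += v / total
    return np.argsort(marg)[::-1][:D], marg
```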
Inputted with a feature set $F_t$, MAIN finds $\tilde{\pi}^*(f)$ using sets of size 2, and finds the set $G_s = \{f \in F_t : \tilde{\pi}^*(f) > T_1\}$. MAIN adds $G_s$ to $\tilde{G}$, the set of features in $F_t$ already labeled as good. It then updates $F_t$ to $F_t \setminus G_s$, and rescales the $\tilde{\pi}^*(f)$ of features $f \in F_t$ so that $\sum_{f \in F_t} \tilde{\pi}^*(f) = 2$. Note features in $\tilde{G}$ are used to compute $\tilde{\pi}^*(f)$ for features $f \in F_t$, but the $\tilde{\pi}^*(f)$ of features $f \in \tilde{G}$ are not used in the rescaling $\sum_{f \in F_t} \tilde{\pi}^*(f) = 2$. MAIN iterates until $G_s = \emptyset$, or $H(G) \le T_2$ for all $G \subseteq F_t$ with $|G| = 2$.
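A sketch of this loop, following our reading of the description above (Algorithm 1 itself is not reproduced in this extract; details such as the exact handling of the $T_2$ stopping rule are our interpretation; reuses log_H):

```python
import numpy as np
from itertools import combinations

def main_inflation(X0, X1, F_t, T1, T2):
    """MAIN sketch: iteratively threshold the rescaled approximate
    marginals of the remaining features. Returns the features labeled good."""
    F_t, G_tilde = list(F_t), []
    while True:
        # Pairs with at least one remaining feature; features already in
        # G_tilde still participate in pair scores with remaining features.
        pairs = [G for G in combinations(sorted(F_t) + G_tilde, 2)
                 if G[0] in F_t or G[1] in F_t]
        if not pairs:
            return G_tilde
        logs = {G: log_H(X0, X1, list(G)) for G in pairs}
        # Stopping rule: H(G) <= T2 for all pairs within F_t.
        inner = [s for G, s in logs.items() if G[0] in F_t and G[1] in F_t]
        if inner and max(inner) <= np.log(T2):
            return G_tilde
        c = max(logs.values())                    # stabilize exponentiation
        w = {G: np.exp(s - c) for G, s in logs.items()}
        marg = {f: 0.0 for f in F_t}
        for (i, j), v in w.items():
            for f in (i, j):
                if f in marg:
                    marg[f] += v
        Z = sum(marg.values())
        marg = {f: 2.0 * v / Z for f, v in marg.items()}  # rescale to sum to 2
        G_s = [f for f, v in marg.items() if v > T1]
        if not G_s:
            return G_tilde
        G_tilde += G_s
        F_t = [f for f in F_t if f not in G_s]
```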
Not finding new features in MAIN might be due to the remaining good features being weaker and independent of $\tilde{G}$. Hence, REMAIN removes $\tilde{G}$ from $F_r$, and feeds MAIN with the updated $F_r$. This way, features in $\tilde{G}$ are no longer used to compute $\tilde{\pi}^*(f)$ for any feature $f \in F_r$, making it easier to detect weaker good features that are independent of features already selected by REMAIN. Pseudo-code of REMAIN is provided in Algorithm 2. $T_1$ mimics the role of the threshold used in the MNC criterion; hence, $T_1 \in [0, 1]$. Recall that by evaluating sets of size 2 we underestimate $\tilde{\pi}^*(f)$ when $|\bar{G}| \gg 2$. Therefore, when confident that $|\bar{G}| \gg 2$, one might opt for smaller values of $T_1$ rather than values close to 1. As $T_2$ is a threshold over unnormalized posteriors, $H(G)$, extra care must be taken when setting $T_2$. We suggest $T_2 = n$ for high-dimensional feature selection applications, which is a good rule of thumb based on our simulation results and asymptotic analysis of $H(\cdot)$. Note the number of features reported by REMAIN is variable; however, one can easily obtain close to a desired number of selected features by tuning $T_1$. To illustrate, we provide an example based on the data generation model used in the “Synthetic microarray simulations” section, where we assume there are 100 markers, i.e., good features, and 4900 non-markers, i.e., bad features. We use the synergetic model with block size $k=5$ and correlation coefficients $\rho_0 = \rho_1 = 0.9$. Fixing $T_2 = n$, Table 1 lists the average number of markers and non-markers selected over 1000 iterations for $n = 20$ and $n = 100$ across different values of $T_1$. REMAIN outputs very few features when $T_1$ is large, and too many features when $T_1$ is extremely low.
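In code, REMAIN reduces to a short outer loop around MAIN; a sketch under the same assumptions as the MAIN sketch above, with the $T_2 = n$ rule of thumb as the default:

```python
def remain(X0, X1, T1, T2=None):
    """REMAIN sketch (our reading of Algorithm 2): call MAIN on a shrinking
    feature set until it returns no new features."""
    if T2 is None:
        T2 = X0.shape[0] + X1.shape[0]   # rule of thumb from the text: T2 = n
    F_r = list(range(X0.shape[1]))
    selected = []
    while F_r:
        G_tilde = main_inflation(X0, X1, F_r, T1, T2)
        if not G_tilde:
            break
        selected += G_tilde
        F_r = [f for f in F_r if f not in G_tilde]
    return selected
```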
The best choice for $T_1$ can vary greatly from case to case, but one strategy is to choose $T_1$ so that REMAIN selects close to a given number of features. For example, $T_1 = 0.05$ is a good choice in this simulation if one desires approximately 100 selected features. Another strategy is based on the number of features selected by REMAIN across various values of $T_1$. When $T_1$ is large, reducing $T_1$ only slightly increases the output feature size, for instance when $T_1 > 0.05$ in this simulation. However, one might observe a rapid increase in the output size upon slightly reducing $T_1$, for instance when $T_1$ changes from 0.05 to 0.01 in this simulation. For such observed patterns, the value at which this phenomenon occurs might be a desirable choice.

Table 1: Performance of REMAIN for various values of $T_1$

Posterior factor

Feature selection can be construed as a model selection problem where each model is a set of good features. Let $f$ be a feature. If $f$ is a good feature, we expect that if we add $f$ to any model $G$, i.e., a set of good features, then $\tilde{\pi}^*(G \cup \{f\}) / \tilde{\pi}^*(G) \gg 1$. If $f$ is a bad feature, we expect $\tilde{\pi}^*(G \cup \{f\}) / \tilde{\pi}^*(G)$ to be much smaller. Hence, if we average this ratio over a family of models that do not contain $f$, and denote it by $\beta(f)$, we expect $\beta(f) \gg 1$ if $f$ is a good feature, and comparable to or smaller than 1 if $f$ is a bad feature. $\tilde{\pi}^*(G \cup \{f\}) / \tilde{\pi}^*(G)$ is similar, but not identical, to the Bayes factor encountered in model selection [12], where here we compare a model with good feature set $G \cup \{f\}$ versus a model with good feature set $G$ excluding $f$. The posterior factor, $\beta(f)$, averages the approximate posterior ratio across all feature sets $G \subseteq F \setminus \{f\}$, i.e.,

$$\beta(f) = \frac{1}{|\Omega^f|} \sum_{G \in \Omega^f} \frac{\tilde{\pi}^{*}(G \cup \{f\})}{\tilde{\pi}^{*}(G)}, \tag{3}$$

where $\Omega^f = \{G \subseteq F : f \notin G\}$. As the summation over $\Omega^f$ is computationally infeasible, we propose the approximate posterior factor, hereafter denoted by $\tilde{\beta}$.
$$\tilde{\beta}(f) = \frac{1}{|F|-1} \sum_{f' \in F \setminus \{f\}} \frac{\tilde{\pi}^{*}(\{f, f'\})}{\tilde{\pi}^{*}(\{f'\})}. \tag{4}$$

We propose the approximate POsterior FActor-Constrained (POFAC) algorithm as follows: use $\tilde{\beta}(f)$ to rank features, and pick the top $D$ features. Note that $D$ is a parameter of the algorithm.
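A sketch of POFAC under the same assumptions as the earlier snippets (flat prior terms; singleton and pair posteriors scored by the unnormalized $H$ via log_H, since the size-specific normalization constants only rescale $\tilde{\beta}$ uniformly and so leave the ranking unchanged):

```python
import numpy as np

def pofac(X0, X1, D):
    """POFAC sketch: rank features by the approximate posterior factor of
    Eq. (4) and return the indices of the top D features."""
    p = X0.shape[1]
    log_h1 = np.array([log_H(X0, X1, [f]) for f in range(p)])  # singletons
    beta = np.zeros(p)
    for f in range(p):
        ratios = [np.exp(log_H(X0, X1, sorted([f, f2])) - log_h1[f2])
                  for f2 in range(p) if f2 != f]
        beta[f] = np.mean(ratios)                 # average over |F| - 1 pairs
    return np.argsort(beta)[::-1][:D], beta
```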
Sequential partition mustering

Sequential Partition Mustering (SPM) aims to improve feature selection performance by sequentially detecting good blocks and adding them to the set of previously selected features. To find a good block, we start with the most significant feature, i.e., the feature with largest $\tilde{\beta}$, and find the block containing it. Note we do not aim to find the structure of bad blocks, and as soon as we declare that no more good features remain the algorithm terminates. Suppose $u_0$ is the current most significant feature. In order to find the block containing $u_0$ we propose the Good Seed Grower (GSG) algorithm, which can be construed as a seed growing algorithm with $u_0$ as the seed. Pseudo-code of GSG is presented in Algorithm 3, where for any two non-empty disjoint sets $G_1, G_2 \subset F$,

$$\begin{aligned} C_{1}(G_{1},G_{2}) &= \frac{\tilde{\pi}(G_{1},G_{2})}{1-\tilde{\pi}(G_{1},G_{2})} \frac{Q^{G_{12}}_{0} Q^{G_{12}}_{1}}{Q^{G_{1}}_{0} Q^{G_{1}}_{1} Q^{G_{2}}_{0} Q^{G_{2}}_{1}} \left(\frac{\left|S^{G^{*}_{12}}_{0}\right|^{\kappa^{G^{*}_{12}}_{0}} \left|S^{G^{*}_{12}}_{1}\right|^{\kappa^{G^{*}_{12}}_{1}}}{\left|S^{G^{*}_{1}}_{0}\right|^{\kappa^{G^{*}_{1}}_{0}} \left|S^{G^{*}_{1}}_{1}\right|^{\kappa^{G^{*}_{1}}_{1}} \left|S^{G^{*}_{2}}_{0}\right|^{\kappa^{G^{*}_{2}}_{0}} \left|S^{G^{*}_{2}}_{1}\right|^{\kappa^{G^{*}_{2}}_{1}}}\right)^{-0.5}, \end{aligned}$$

$G_{12} = G_1 \cup G_2$, and $\tilde{\pi}(G_1, G_2)$ approximates $\pi(G_1, G_2)$, the prior probability that at least one of the features in $G_2$ is not independent of $G_1$. Note $\pi(G_1, G_2) = \sum_{P \in \mathcal{P}(G_1, G_2)} \pi(P)$, where $\mathcal{P}(G_1, G_2)$ is the family of feature partitions that contain a block $U$ such that $U \cap G_1 \neq \emptyset$ and $U \cap G_2 \neq \emptyset$. At each iteration, GSG finds the feature $u^*$ that maximizes $C_1(U, \{u^*\})$, where $U$ is the currently detected sub-block of the block containing $u_0$. GSG declares that $u^*$ and $U$ belong to the same block if $C_1(U, \{u^*\}) > T_3$, and adjoins $u^*$ to $U$; otherwise, it terminates and declares $U$ as the block containing $u_0$. Here we assume $T_3 = t_1 n^{t_2 |U|}$, where $t_1, t_2 > 0$ are parameters of GSG. While we have only considered one possible family of thresholds, we expect this family to be large enough for most practical purposes. Pseudo-code of SPM is explained in Algorithm 4. Let $F_t$ be the feature set used by SPM, initialized to $F_t = F$. We start with the most significant feature $u_0$ and find the block containing it, $U$. We then update $F_t$ to $F_t \setminus U$. If $\tilde{\beta}(f)$ …
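Although the extract cuts off mid-sentence above, the GSG growth loop it describes can be sketched as follows (our reading of Algorithm 3; reuses niw_update, default_prior, and log_niw_constants; the merge score is the log of the second and third factors of $C_1$, with the prior-odds factor treated as constant and absorbed into $t_1$):

```python
import numpy as np
from scipy.special import multigammaln

def log_good_block_evidence(X0, X1, idx, prior=default_prior):
    """log of Q0 Q1 |S0*|^(-0.5 k0*) |S1*|^(-0.5 k1*) for one good block."""
    d = len(idx)
    S, kappa, m, nu = prior(d)
    log_K, log_L = log_niw_constants(S, kappa, nu)
    total = 0.0
    for X in (X0[:, idx], X1[:, idx]):
        S_post, kappa_post, _, nu_post = niw_update(S, kappa, m, nu, X)
        log_Q = (log_K + log_L + 0.5 * kappa_post * d * np.log(2.0)
                 + multigammaln(0.5 * kappa_post, d)
                 + 0.5 * d * np.log(2.0 * np.pi / nu_post))
        total += log_Q - 0.5 * kappa_post * np.linalg.slogdet(S_post)[1]
    return total

def gsg(X0, X1, u0, candidates, t1, t2):
    """Good Seed Grower sketch: grow the block around seed u0, merging the
    candidate with the largest merge score until it drops below log T3."""
    U, pool = [u0], [f for f in candidates if f != u0]
    n = X0.shape[0] + X1.shape[0]
    while pool:
        def merge_gain(u):
            return (log_good_block_evidence(X0, X1, sorted(U + [u]))
                    - log_good_block_evidence(X0, X1, sorted(U))
                    - log_good_block_evidence(X0, X1, [u]))
        u_star = max(pool, key=merge_gain)
        log_T3 = np.log(t1) + t2 * len(U) * np.log(n)  # T3 = t1 * n^(t2 |U|)
        if merge_gain(u_star) <= log_T3:
            break
        U.append(u_star)
        pool.remove(u_star)
    return U
```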
