Proceedings of the World Congress on Engineering 2010 Vol I, WCE 2010, June 30 - July 2, 2010, London, U.K.

Markov Blanket based Feature Selection: A Review of Past Decade

Shunkai Fu and Michel C. Desmarais

Abstract—This paper summarizes the related works on feature selection via the induction of the Markov blanket, which can be traced back to 1996; the concept of the Markov blanket itself first appeared even earlier, in 1988. Our review covers not only a series of published algorithms, including KS, GS, IAMB and its variants, MMPC/MB, HITON-PC/MB, Fast-IAMB, PCMB and IPC-MB (ordered by time of appearance), but also why they were invented and their relative advantages as well as disadvantages, from both theoretical and practical viewpoints. Besides, we notice that all of these works are constraint learning, which depends on conditional independence tests to induce the target, instead of score-and-search, the other mainstream approach applied in the structure learning of the closely related concept of Bayesian network. Being the first to do so, we discuss the cause and argue that this choice is not accidental, though not in a formal way. The discussion covered here is believed to be a valuable reference for academic researchers as well as practitioners.

Index Terms—Feature selection, Markov Blanket.

I. INTRODUCTION

As of 1997, when a special issue of the journal Artificial Intelligence on relevance, including several papers on variable and feature selection, was published [1], [2], few explored domains used more than 40 features. The situation has changed considerably in the past decade, and currently domains involving many more variables, hundreds to thousands, are becoming common. Therefore, feature selection has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature reduction is to select a subset of input variables by eliminating features with little or no predictive ability, without sacrificing the performance of the model built on the chosen features. It is also known as variable selection, feature reduction, attribute selection or variable subset selection. By removing most of the irrelevant and redundant features from the data, feature reduction brings many potential benefits:

- Alleviating the effect of the curse of dimensionality, to improve prediction performance;
- Facilitating data visualization and data understanding, e.g. which features are important and how they are related to each other;
- Reducing the measurement and storage requirements;
- Speeding up the training and inference process;
- Enhancing model generalization.

A principled solution to the feature reduction problem is to determine a subset of features that can render the rest of the features independent of the variable of interest [3], [4], [5]. From a theoretical perspective, it is known that optimal feature reduction for supervised learning problems requires an exhaustive search over all possible subsets of features of the chosen cardinality, the complexity of which is exponential in the total number of features. In practice, the target is demoted to a satisfactory set of features instead of an optimal one, due to the lack of efficient algorithms.

Feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks all variables by a metric and eliminates those that do not achieve an adequate score.

(Shunkai Fu is with the Computer Engineering Department, Ecole Polytechnique de Montreal, Canada, and the Computer Science and Technology College, Donghua University, China (e-mail: [email protected]). Michel C. Desmarais is with the Computer Engineering Department, Ecole Polytechnique de Montreal, Canada (e-mail: [email protected]).)
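As a concrete illustration of feature ranking (ours, not from the paper), the sketch below scores each feature by its empirical mutual information with the target and eliminates those under a threshold; the data, function names and threshold are all hypothetical.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def rank_features(data, target, threshold=0.0):
    """Rank feature columns by MI with the target; drop low scorers."""
    scores = {name: mutual_information(col, target) for name, col in data.items()}
    return sorted((f for f, s in scores.items() if s > threshold),
                  key=lambda f: -scores[f])

# Toy data: f1 copies the target, f2 is a constant carrying no information.
target = [0, 1, 0, 1, 0, 1]
data = {"f1": [0, 1, 0, 1, 0, 1], "f2": [1, 1, 1, 1, 1, 1]}
print(rank_features(data, target))  # ['f1'] -- f2 has zero MI and is dropped
```

Note that, as the text goes on to explain, such per-feature scoring cannot detect features that are useful only in combination with others.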
Selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. In other words, relevance does not imply optimality [2]. Besides, it has been demonstrated that a variable which is irrelevant by itself can provide a significant performance improvement when taken together with others [2], [6].

Subset selection, by contrast, evaluates a subset of features that together have good predictive power, as opposed to sorting variables according to their individual relevance. Essentially, subset selection methods can be divided into wrappers, filters and embedded methods [6]. In the wrapper approach, the feature selection algorithm conducts a search through the space of possible combinations of features and evaluates each subset by utilizing the learning algorithm of interest as a black box [2]. Wrappers can be computationally expensive because model training and cross-validation must be repeated for each feature subset, and the outcome is tailored to a particular model. Filters are similar to wrappers in the search approach, but instead of evaluating against a predictor, a simpler measure is utilized as a preprocessing step. Therefore, filters work independently of the chosen predictor. However, filters have a weakness similar to feature ranking, since they imply that individually irrelevant features are useless, which is proved not true [2], [6]. Embedded methods perform variable selection in the process of training and are usually specific to given learning algorithms. Compared with wrappers, embedded methods may be more efficient in several respects: they make better use of the available data, without having to split the training data into training and validation sets, and they reach a solution faster by avoiding retraining a predictor from scratch for every variable subset investigated [6]. Embedded methods are found in decision trees such as CART, for example, which have a built-in mechanism to perform variable selection [7].

Since KS's work in 1996, there have been several attempts to make the learning procedure more efficient and effective, including GS (Grow-Shrink) [10], IAMB (Iterative Associative Markov Blanket) and its variants [4], [9], MMPC/MB (Max-Min Parents and Children/Markov Blanket) [11], HITON-PC/MB [12], Fast-IAMB [13], PCMB (Parent-Children Markov Blanket learning) [5] and IPC-MB (Iterative Parent and Children Markov Blanket learning, also known as BFMB) [14], [15].

In Section 2, the Markov blanket itself is defined, as well as its relation with the Bayesian network. Then, in Section 3, we review all major algorithms for learning the Markov blanket. In Section 4, we discuss why all these algorithms are constraint learning, instead of score-and-search, the other primary family of algorithms for inducing Bayesian networks. We conclude with a summary of all algorithms covered in this paper for quick reference.

ISBN: 978-988-17012-9-9; ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online)
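To make the wrapper approach from the taxonomy above concrete, here is a minimal greedy forward-selection sketch (our illustration, not from the paper); `evaluate` stands in for the black-box train-and-validate step, and the toy scorer is hypothetical.

```python
def wrapper_forward_select(features, evaluate, max_features=None):
    """Greedy wrapper: grow the subset while the black-box score improves.

    `evaluate(subset)` is assumed to train/cross-validate the learner of
    interest on `subset` and return a score (higher is better)."""
    selected, best = [], evaluate([])
    candidates = list(features)
    while candidates and (max_features is None or len(selected) < max_features):
        scored = [(evaluate(selected + [f]), f) for f in candidates]
        score, f = max(scored)
        if score <= best:          # no candidate improves the model: stop
            break
        selected.append(f)
        candidates.remove(f)
        best = score
    return selected

# Hypothetical scorer: only features "a" and "b" carry signal.
useful = {"a": 0.3, "b": 0.2}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset)
print(wrapper_forward_select(["a", "b", "c"], score))  # ['a', 'b']
```

Each call to `evaluate` corresponds to one full model-fitting run, which is exactly why wrappers become expensive as the feature space grows.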
II. BAYESIAN NETWORK AND MARKOV BLANKET

A Bayesian network is a graphical tool that compactly represents a joint probability distribution P over a set of random variables U using a directed acyclic graph (DAG), annotated with conditional probability tables giving the distribution of each node under any instantiation of its parents. The graph therefore represents qualitative information about the random variables (conditional independence properties), while the associated probability distribution, consistent with those properties, provides a quantitative description of how the variables relate to each other. One example of a Bayesian network is shown in Fig. 1. The probability distribution and the graph of a Bayesian network are connected by the Markov condition property: a node is conditionally independent of its non-descendants given its parents.

Fig. 1. An example of a Bayesian network. The gray nodes are the parents and children of T, while MB(T) additionally includes the texture-filled variable O.

Definition 1 (Faithfulness). A Bayesian network G and a joint distribution P are faithful to one another iff every conditional independence entailed by the graph G and the Markov condition is also present in P [8], [16].

Definition 2 (Markov blanket, probabilistic viewpoint). Given the faithfulness assumption, the Markov blanket of T, MB(T), is a minimal set conditioned on which all other nodes are independent of T, i.e. ∀X ∈ U \ MB(T) \ {T}, I(X, T | MB(T)).

Definition 3 (Markov blanket, graphical viewpoint). Given the faithfulness assumption, the Markov blanket of T is identical to the set of T's parents, children and children's other parents (spouses), i.e. MB(T) = Pa(T) ∪ Ch(T) ∪ Sp(T).

Theorem 1. If a Bayesian network G is faithful to a joint probability distribution P, then: (1) there is an edge between a pair of nodes X and Y iff X and Y are conditionally dependent given every other set of nodes; (2) for each triplet of nodes X, Y and Z in G such that X and Y are adjacent to Z but X is not adjacent to Y, X → Z ← Y is a subgraph of G iff X and Y are dependent conditioned on every other set of nodes that contains Z [17].

Given the faithfulness assumption, Definition 2 and Definition 3 define the Markov blanket from the probabilistic and the graphical viewpoints, respectively.

Koller and Sahami (KS) [3] first showed that the Markov blanket MB(T) of a given target variable T is the theoretically optimal set of features to predict T's value, although the Markov blanket itself is not a new concept and can be traced back to the work of Pearl [8] in 1988. Since full knowledge of MB(T) is enough to determine the probability distribution of T, and the values of all other variables then become superfluous, inducing MB(T) actually amounts to a procedure of feature selection [3], [4], [9]. From our point of view, Markov blanket based feature selection can be categorized as a filter, which means that it works independently of the later processing. Therefore, it is expected to be much more efficient than wrapper approaches. Furthermore, it overcomes the defect of conventional filters, since its output contains only relevant and useful attributes. Compared with embedded methods, it is obviously a more general choice and can work with all learning algorithms.

Given the faithfulness assumption, the Markov blanket of T is unique, and it becomes trivial to retrieve it from the corresponding Bayesian network over the problem domain: it is composed of T's parents, children and spouses (Fig. 1). However, this requires the Bayesian network to be ready in advance. Indeed, traditionally we would have to learn the target Bayesian network first to get the Markov blanket of some variable, but the structure learning of a Bayesian network is known to be an NP-complete problem. Therefore, an ideal solution would allow us to induce the Markov blanket without having the whole Bayesian network ready first, which potentially reduces the time complexity greatly, so that we can solve larger-scale problems with the same computing resources.

Definition 3 plus Theorem 1 provide the topology information referred to by the more recent and finer algorithms such as MMPC/MB, HITON-PC/MB, PCMB and IPC-MB. Of course, the faithfulness assumption is the basis for all of them, including GS, IAMB and its variants. Luckily enough, the vast majority of distributions are faithful in the sample limit [18].
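Definition 3 makes the graphical reading of the Markov blanket mechanical: given a DAG, MB(T) can simply be collected as parents, children and spouses. A minimal sketch (our illustration, not from the paper; the example graph is hypothetical):

```python
def markov_blanket(dag, t):
    """MB(T) = parents ∪ children ∪ spouses, read off a DAG given as a
    dict mapping each node to the set of its parents (Definition 3)."""
    parents = set(dag[t])
    children = {x for x, pa in dag.items() if t in pa}
    spouses = {p for c in children for p in dag[c]} - {t}
    return parents | children | spouses

# Hypothetical DAG in the flavor of Fig. 1: A -> T <- B, T -> C <- S.
dag = {"A": set(), "B": set(), "S": set(),
       "T": {"A", "B"}, "C": {"T", "S"}}
print(sorted(markov_blanket(dag, "T")))  # ['A', 'B', 'C', 'S']
```

The spouse S enters the blanket only because it shares the child C with T, which is exactly the part of MB(T) that the v-structure test of Theorem 1, point (2), is needed to recover when the DAG is not given.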
III. ALGORITHMS FOR LEARNING MARKOV BLANKET (1996 – PRESENT)

A. KS

Pearl was the first to define the concept and study the properties of the Markov blanket, in his early work on Bayesian networks [8]. However, Koller and Sahami's work towards optimal feature selection was the first to recognize that the Markov blanket of the target of interest is the theoretically optimal set of features to predict its value [3]. Their finding attracted many subsequent trials over the past decade, aiming at inducing the Markov blanket with better performance.

Koller and Sahami proposed a theoretically justified framework for optimal feature selection based on using cross-entropy to minimize the amount of predictive information lost during feature elimination [3]. They also proposed an approximate algorithm based on their theoretical model, and this algorithm has been referred to as KS by many since then. KS is the first algorithm for feature selection to employ the concept of the Markov blanket. Although its framework is theoretically sound, the proposed algorithm itself doesn't guarantee a correct outcome. The KS algorithm requires two parameters: (1) the number of variables to retain, and (2) the maximum number of variables the algorithm is allowed to condition on. These two limits help to reduce the search complexity greatly, but at a sacrifice of correctness [4], [5].

B. GS

The GS (Grow-Shrink) algorithm [10] was proposed in 1999 to induce the Bayesian network automatically by first identifying each node's Markov blanket, and then connecting the nodes in a maximally consistent way. It employs the independence properties of the underlying network to discover parts of its structure, just like the SGS and PC algorithms in [17]. However, the design of GS enables it to address the two known shortcomings of previous work which prevented them from becoming more widespread.
These two disadvantages are exponential execution time and proneness to errors in the dependence tests used. The former is addressed in two ways. One is by identifying the local neighborhood of each variable in the Bayesian network as a pre-processing step, in order to facilitate the recovery of the local structure around each variable in polynomial time under the assumption of a bounded neighborhood size. The second, randomized version goes one step further, employing a user-specified number of randomized tests in order to ascertain the same result with high probability. The second disadvantage, namely the proneness to errors, is also addressed by the randomized version, by using multiple data sets and Bayesian accumulation of evidence.

Like any constraint-based learning algorithm, GS depends on two basic assumptions: faithfulness and correct/reliable conditional independence (CI) tests. The second assumption is required in practice because the result of a statistical test is trustworthy only when the number of observations is large enough. Actually, these two assumptions are also the basis of all the following algorithms.

As its name indicates, GS proceeds in two steps, growing greedily first and then shrinking by removing false positives. It is the first algorithm of this kind proved correct, but it is not efficient and can't scale to large applications. Nevertheless, the soundness of the algorithm makes it a proven starting point for subsequent research.

In [10], Margaritis and Thrun also proposed a randomized version of the GS algorithm to solve problems involving a large number of variables or variables with many possible values. It requires a manually defined parameter to reduce the number of conditional tests, similar to the KS algorithm; hence, it cannot guarantee correct output, and we leave it without further discussion.

C. IAMB and Its Variants

IAMB was proposed in 2003 for classification problems in microarray research, where thousands of attributes are quite common. It is an algorithm based on the same two assumptions as GS, sound in theory and especially simple in implementation. The IAMB algorithm is structurally similar to GS, consisting of two phases, growing and shrinking. However, there is one important difference: GS orders the variables once, when they are considered for inclusion in the first step, according to their strength of association with T given the empty set; it then admits into the candidate blanket the next variable in the ordering that is not conditionally independent of T given the current candidate blanket. One problem with this heuristic is that the spouses of T are typically associated with T only very weakly given the empty set (their association with T goes only through a confounding common descendant), so they are considered for inclusion late in the first phase, their associations being weaker than those of T's ancestors. In turn, this implies that more false positives will enter during the first step, and the conditional tests of independence will become unreliable much sooner than when using IAMB's heuristic. In IAMB, the set of attributes is reordered each time a new attribute enters the blanket in the growing phase, based on updated CI testing results, which allows IAMB to perform better than GS since fewer false positives will be added during the first phase [4], [13].

In spite of the improvement, IAMB is still not data efficient, since by design its CI tests may condition on the whole MB(T) or an even larger set, though this is not necessary, as discovered by later work. This point was also noticed by its authors, and several variants of IAMB were proposed, like interIAMB, IAMBnPC and their combined version interIAMBnPC [9]. InterIAMBnPC employs two methods to reduce the possible size of the conditioning sets: (1) it interleaves the growing phase of IAMB with the pruning phase, attempting to keep the size of the candidate MB(T) as small as possible during all steps of the algorithm's execution; (2) it substitutes the shrinking phase as implemented in IAMB with the PC algorithm. InterIAMB and IAMBnPC are similar to InterIAMBnPC, but they only either interleave the first two phases or rely on PC for the backward phase, respectively.
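The two-phase grow-shrink scheme shared by GS and IAMB can be sketched as follows (our illustration, not from the paper); `indep` stands for an assumed CI-test oracle, and the toy graph is hypothetical. IAMB's refinement, re-ranking the candidates against the current blanket on every admission, is omitted here for brevity.

```python
def grow_shrink(t, variables, indep):
    """Two-phase sketch: `indep(x, t, cond)` is an assumed CI-test oracle
    returning True when X is independent of T given `cond` on the data."""
    mb = []
    # Growing phase: greedily admit any variable still dependent on T
    # given the current candidate blanket.
    changed = True
    while changed:
        changed = False
        for x in variables:
            if x != t and x not in mb and not indep(x, t, mb):
                mb.append(x)
                changed = True
    # Shrinking phase: remove false positives that turn out independent
    # of T given the rest of the blanket.
    for x in list(mb):
        rest = [y for y in mb if y != x]
        if indep(x, t, rest):
            mb.remove(x)
    return mb

# Hypothetical oracle for a graph A -> T -> B plus an isolated node N:
def indep(x, t, cond):
    return x == "N"   # only N is independent of T under any conditioning

print(grow_shrink("T", ["A", "B", "N"], indep))  # ['A', 'B']
```

In a real implementation, `indep` would be a statistical test (e.g. chi-square or G^2) on data, which is why both phases degrade when the conditioning sets grow and samples run short.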
D. MMPC/MB

Although the variants of IAMB achieve better data efficiency than IAMB, they are still far from satisfactory. The breakthrough was not made until the introduction of MMPC/MB, in which the sample requirement depends on the underlying connectivity present in the target graph faithful to the data, instead of on the size of the Markov blanket as required by the previous algorithms.

The overall MMMB algorithm is composed of two steps. Firstly, it depends on MMPC to induce PC(T), the set of nodes directly connected to T, i.e. T's parents and children. Then it attempts to identify the remaining nodes of the Markov blanket, i.e. the spouses of T. The spouses of T are the parents of the common children of T, which suggests that they should belong to the union of PC(X) over each X ∈ PC(T). So, MMPC is applied to each X ∈ PC(T) to induce X's parents and children, which are viewed as spouse candidates containing false ones to be filtered out with further checking. To determine whether such a candidate Y is a spouse, we actually need to recognize the so-called v-structure, i.e. T → X ← Y. Therefore, the underlying connectivity is critical for the induction, and Theorem 1 tells us how to determine the corresponding connectivity.

Although the MMPC/MB algorithm is proved not sound by Pena et al. [5], the proposed direction is recognized by many: HITON-PC/MB, PCMB and IPC-MB all follow the similar two-phase framework of MMPC/MB.

E. HITON-PC/MB

HITON-PC/MB [12] is also the work of the authors of IAMB, and can be viewed as a further trial to make the induction of the Markov blanket more data efficient to meet the challenges in practice. As mentioned at the end of the previous subsection, HITON-PC/MB works in a similar manner to MMPC/MB, with the exception that it interleaves the addition and removal of nodes, aiming at removing false positives as early as possible so that the conditioning set stays as small as possible. Unfortunately, HITON-PC/MB is also proved not sound in [5]. However, it is still viewed as another meaningful trial towards an efficient learning algorithm of the Markov blanket without having to learn the whole Bayesian network.

F. Fast-IAMB

Fast-IAMB [13] is also the work of the author of GS. Similar to GS and IAMB, Fast-IAMB contains a growing phase and a shrinking phase. During the growing phase of each iteration, it sorts the attributes that are candidates for admission to MB(T) from most to least conditionally dependent, according to a heuristic function (a conditional statistical test). Each such sorting step is potentially expensive since it involves calculating the test value between T and each member of the set of attributes left unchecked. The key idea behind Fast-IAMB is to reduce the number of such tests by adding not one, but a number of attributes at a time after each reordering. Following a modification of the Markov blanket, Fast-IAMB speculatively adds one or more attributes of highest test significance without re-sorting after each modification as IAMB does, which (hopefully) adds more than one true member of the blanket. Thus, the cost of re-sorting the remaining attributes after each Markov blanket modification can be amortized over the addition of multiple attributes. The question arises: how many attributes should be added to the blanket in each iteration? The following heuristic is used in [13]: dependent attributes are added as long as the conditional independence tests remain reliable, i.e. there is enough data for conducting them. In conclusion, Fast-IAMB realizes a fast induction by greedily adding as many candidates as possible in the growing phase.
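Fast-IAMB's speculative batch admission can be sketched as follows (our illustration, not from the paper); `assoc` and `reliable` are assumed oracles standing in for the association ranking and the test-reliability check, and the toy scores are hypothetical.

```python
def fast_iamb_grow(t, variables, assoc, reliable):
    """Sketch of Fast-IAMB's growing phase: after each re-sorting, admit
    a batch of the most associated attributes instead of a single one,
    as long as the CI tests stay reliable.

    `assoc(x, t, mb)` is an assumed strength-of-association score and
    `reliable(mb)` an assumed check that enough data remains for tests."""
    mb = []
    candidates = [v for v in variables if v != t]
    while candidates:
        # Sort once, from most to least conditionally dependent on T.
        candidates.sort(key=lambda x: -assoc(x, t, mb))
        added = False
        for x in list(candidates):               # speculative batch add
            if assoc(x, t, mb) <= 0:             # independent: stop growing
                candidates = []
                break
            if not reliable(mb + [x]):           # tests would be unreliable
                break
            mb.append(x)
            candidates.remove(x)
            added = True
        if not added:
            break
    return mb

# Hypothetical scores: A and B depend on T, N does not; tests stay
# reliable up to blankets of size 2.
assoc = lambda x, t, mb: {"A": 0.9, "B": 0.5, "N": 0.0}[x]
reliable = lambda mb: len(mb) <= 2
print(fast_iamb_grow("T", ["A", "B", "N"], assoc, reliable))  # ['A', 'B']
```

The shrinking phase (as in GS and IAMB) would then remove any false positives that the speculative batches let through.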
G. PCMB

Following the idea of MMPC/MB and HITON-PC/MB, PCMB [5] was also proposed to conquer the data inefficiency problem of IAMB, and, more importantly, it is the first such trial proved theoretically sound. PCMB requires the same two assumptions as MMPC/MB and HITON-PC/MB: faithfulness and correct statistical tests. Similarly, PCMB induces the MB via the recognition of direct connections, i.e. the parents and children of the variable of interest, just as MMPC/MB and HITON-PC/MB do, which may explain where its name comes from.

PCMB claims to scale to thousands of features as IAMB does [5], but it is able to achieve much higher accuracy than IAMB given the same amount of data [14], [16], which exactly reflects its data efficiency advantage. However, when given enough training data, meaning that both algorithms can search as far as they need to, PCMB is known to be much more time-consuming than IAMB in achieving the same result. Unfortunately, we rarely have such an ideal condition in practice, and very often, given limited instances, we have to stop the search due to unreliable statistical tests. In conclusion, what is gained by a finer algorithm like PCMB is exchanged for more consumption of computing resources, as compared with a "naïve" one like IAMB.
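The PC-then-spouses skeleton shared by MMPC/MB, HITON-PC/MB, PCMB and IPC-MB can be sketched as follows (our illustration, not from the paper); `pc_of` and `is_collider` stand in for assumed oracles, and the example graph is hypothetical.

```python
def pc_then_spouses(t, pc_of, is_collider):
    """Shared two-phase skeleton of MMPC/MB, HITON-PC/MB, PCMB, IPC-MB.

    `pc_of(x)` is an assumed oracle returning the parents/children of x;
    `is_collider(t, x, y)` is an assumed test confirming T -> X <- Y."""
    pc = set(pc_of(t))
    spouses = set()
    # Spouse candidates live in the PC sets of T's parents/children;
    # keep only those forming a v-structure T -> X <- Y (Theorem 1).
    for x in pc:
        for y in set(pc_of(x)) - pc - {t}:
            if is_collider(t, x, y):
                spouses.add(y)
    return pc | spouses

# Hypothetical graph: A -> T -> C <- S, so PC(T) = {A, C} and spouse S.
pc_sets = {"T": {"A", "C"}, "A": {"T"}, "C": {"T", "S"}, "S": {"C"}}
collider = lambda t, x, y: x == "C" and y == "S"
print(sorted(pc_then_spouses("T", pc_sets.get, collider)))  # ['A', 'C', 'S']
```

The algorithms in this family differ mainly in how `pc_of` is computed and how aggressively false candidates are filtered, which is where their soundness and efficiency trade-offs come from.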
H. IPC-MB

IPC-MB [15], or BFMB in its first published version [14], is the most recent published progress on this topic, aiming at even better performance than PCMB. It has a framework similar to MMPC/MB, HITON-PC/MB and PCMB: it first recognizes the nodes directly connected to T, known as candidate parents and children; then it repeats the local search given each candidate found, which enables us to recognize not only the false positives but also the candidate spouses. Correct spouses are then recognized based on the second point of Theorem 1. Compared with PCMB, IPC-MB determines the connectivity of any pair of variables in a "smarter" manner, and the overall heuristic followed by IPC-MB is described below:

- IPC-MB proceeds by checking and removing false positives. Considering that the size of MB(T) is normally much smaller than that of U, filtering out negatives is believed to be a much easier job than directly recognizing positives;
- Recognizing and removing as many negatives as possible, as early as possible, is an effective way to reduce noise and to avoid conditioning on an unnecessarily large conditioning set, which is the precondition for reliable CI tests and for the success of learning. Besides, it saves computing time by preventing needless tests;
- IPC-MB filters negatives by conditioning on the empty set first. Then one variable is allowed in the conditioning set, and the checking continues. This procedure iterates with an increasing conditioning set size, so that more and more negatives are removed.

So, the decision on a negative is made with as small a conditioning set as possible, and as early as possible, which is the most important factor for the success of IPC-MB, considering that the reliability of the CI tests is the dominant factor influencing the performance of this kind of algorithm. IPC-MB is declared to offer the best trade-off among all published works of this type, in terms of soundness, time efficiency, data efficiency and information found [15], [16].
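IPC-MB's filtering heuristic, i.e. trying to rule out each candidate with the smallest conditioning set first and growing the set size only when needed, can be sketched as follows (our illustration, not from the paper; `indep` is an assumed CI-test oracle and the example graph is hypothetical).

```python
from itertools import combinations

def filter_negatives(t, candidates, indep, max_size=2):
    """IPC-MB-flavoured filtering sketch: discard each candidate with the
    smallest conditioning set that renders it independent of T, starting
    from the empty set; `indep` is an assumed CI-test oracle."""
    remaining = set(candidates)
    for size in range(max_size + 1):          # empty set first, then larger
        for x in list(remaining):
            others = remaining - {x}
            for cond in combinations(sorted(others), min(size, len(others))):
                if indep(x, t, set(cond)):    # x ruled out: a negative
                    remaining.discard(x)
                    break
    return remaining

# Hypothetical oracle for A -> T <- B with a noise node N independent of T:
indep = lambda x, t, cond: x == "N"
print(sorted(filter_negatives("T", {"A", "B", "N"}, indep)))  # ['A', 'B']
```

Because most negatives fall to the small conditioning sets tried first, the expensive, less reliable tests with large conditioning sets are rarely reached, which is the point of the heuristic.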
IV. WHY ALL CONSTRAINT LEARNING

Regarding the structure learning of Bayesian networks, there are two primary approaches: constraint-based learning and score-and-search. Constraint-based learning depends on a series of conditional independence (CI) tests to induce a Bayesian network in agreement with the test results. Score-and-search, by contrast, defines a global measure (or score) which evaluates a given Bayesian network model as a function of the data, and then searches the space of possible Bayesian network models with the goal of finding one with the optimal score. In the past years, the score-and-search approach has received more attention due to several known advantages [19].

Given the fact that the Markov blanket is actually part of the target Bayesian network under the faithfulness assumption, is it possible to apply these two mature frameworks to learning the Markov blanket? Even though the score-and-search approach attracted more attention than constraint learning over the past decade for inducing Bayesian network structure, it is constraint learning that is preferred and experimented with by researchers working on Markov blanket learning; GS, IAMB and its variants, MMPC/MB, HITON-PC/MB, PCMB and IPC-MB are all such examples. We believe this is not accidental, even though there is no explicit explanation of the choice in the published articles, so we are going to share some reasoning to make up for this loss, though not in a formal manner.

Given a problem over U, we don't know how many variables belong to MB(T), to say nothing of which ones exactly. With the score-and-search approach, we would have to measure all possible subsets S ⊆ U, i.e. the power set of U, of which the complexity is 2^|U|, exponential in the number of variables. The KS algorithm executes in a similar manner, and it requires specifying the target size of MB(T). Though specifying the size of MB(T) could limit KS's running to a predictable scale of space, it obviously may prevent KS from producing correct results, since it is impossible to "guess" the exact size of the target each time. So, even assuming that we could define a perfect measure (or scoring mechanism), it is not acceptable in practice to search an exponentially expanding space. Therefore, a more affordable as well as sound solution to induce MB(T) is expected. To achieve this goal, some additional guidance is needed to figure out a finer search strategy, i.e. "constraining" the search in a more efficient manner so that as much fruitless effort as possible is avoided. With the defining property of the Markov blanket, i.e. T is independent of any X ∉ MB(T) conditioned on MB(T), and T is dependent on any X ∈ MB(T) conditioned on MB(T) \ {X}, we may reduce the search space greatly and obtain a more efficient algorithm. IAMB is one such important progress, and it indeed excels the previous naïve approach, like KS, with an obvious gain in efficiency. Since then, all efforts have followed the constraint learning approach, proposing one algorithm after another aiming at better performance, with more and more heuristics introduced. Of course, the procedure is becoming more and more sophisticated, though that is not what we expect.

V. CONCLUSION

In this paper, we reviewed the published algorithms for feature subset selection via the learning of the Markov blanket of a given target of interest. Those covered in the discussion are listed in Table 1 for quick reference. To the best of our knowledge, IPC-MB achieves the best trade-off as compared with the others, in terms of effectiveness, time efficiency, data efficiency and topology information inferred. Besides, we also discussed why all these algorithms are categorized as constraint-based learning, though not in a formal manner.
Table 1. Summary of the reviewed algorithms for learning the Markov blanket.

Name                  | Pub. year | Comments
KS                    | 1996      | Not sound. The first of this type. Requires specifying the MB size in advance.
GS                    | 1999      | Sound in theory. Proposed to learn the Bayesian network via the induction of the neighbors of each variable; the first such algorithm proved correct. Works in two phases: grow and shrink.
IAMB and its variants | 2003      | Sound in theory. Actually a variant of GS; simple to implement and time efficient, but very poor on data efficiency. IAMB's variants achieve better data efficiency than IAMB.
MMPC/MB               | 2003      | Not sound. The first to make use of the underlying topology information; much more data efficient, but much slower, compared to IAMB.
HITON-PC/MB           | 2003      | Not sound. Another trial to use the topology information to enhance data efficiency; data efficient, but much slower, compared to IAMB.
Fast-IAMB             | 2005      | Sound in theory. No fundamental difference compared to IAMB; adds candidates more greedily to speed up the learning; still poor on data efficiency.
PCMB                  | 2006      | Sound in theory. Data efficient by making use of topology information, but poor on time efficiency. Distinguishes spouses from parents/children, and some children from parents.
IPC-MB                | 2007/2008 | Sound in theory. The most data efficient compared with previous ones, and much faster than PCMB. Distinguishes spouses from parents/children, and some children from parents. Best trade-off among this family of algorithms.

REFERENCES

[1] Blum, A. and P. Langley, Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 1997. 97(1-2).
[2] Kohavi, R. and G.H. John, Wrappers for Feature Subset Selection. Artificial Intelligence, 1997. 97(1-2).
[3] Koller, D. and M. Sahami, Toward Optimal Feature Selection. In ICML, 1996.
[4] Tsamardinos, I., C.F. Aliferis, and A.R. Statnikov, Time and sample efficient discovery of Markov blankets and direct causal relations. In the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003. ACM.
[5] Peña, J.M., et al., Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 2007. 45(2).
[6] Guyon, I. and A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 2003. 3.
[7] Breiman, L., et al., Classification and Regression Trees. 1984: Wadsworth.
[8] Pearl, J., Probabilistic Reasoning in Intelligent Systems. 1988, San Mateo: Morgan Kaufmann.
[9] Tsamardinos, I., C.F. Aliferis, and A.R. Statnikov, Algorithms for Large Scale Markov Blanket Discovery. In the Sixteenth International Florida Artificial Intelligence Research Society Conference, 2003. St. Augustine, Florida, USA: AAAI Press.
[10] Margaritis, D. and S. Thrun, Bayesian Network Induction via Local Neighborhoods. In Advances in Neural Information Processing Systems, 1999. Denver, Colorado, USA: The MIT Press.
[11] Tsamardinos, I., L.E. Brown, and C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 2006. 65(1): p. 31-78.
[12] Aliferis, C.F., I. Tsamardinos, and A.R. Statnikov, HITON: A Novel Markov Blanket Algorithm for Optimal Variable Selection. In American Medical Informatics Association Annual Symposium, 2003.
[13] Yaramakala, S. and D. Margaritis, Speculative Markov Blanket Discovery for Optimal Feature Selection. In ICDM, 2005.
[14] Fu, S.-K. and M.C. Desmarais, Local Learning Algorithm for Markov Blanket Discovery. In Australian Conference on Artificial Intelligence, 2007. Gold Coast, Australia: Springer.
[15] Fu, S.-K. and M.C. Desmarais, Fast Markov Blanket Discovery Algorithm via Local Learning within Single Pass. In Canadian Conference on AI, 2008. Windsor, Canada: Springer.
[16] Fu, S.-K. and M.C. Desmarais, Tradeoff Analysis of Different Markov Blanket Local Learning Approaches. In Advances in Knowledge Discovery and Data Mining, 12th Pacific-Asia Conference (PAKDD), 2008. Osaka, Japan: Springer.
[17] Spirtes, P., C. Glymour, and R. Scheines, Causation, Prediction, and Search (2nd Edition). 2001: The MIT Press.
[18] Pearl, J., Causality: Models, Reasoning, and Inference. 2000: Cambridge University Press.
[19] Heckerman, D., A Bayesian Approach to Learning Causal Networks. In the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, 1995. Montreal, Quebec, Canada: Morgan Kaufmann.