Markov Blanket based Feature Selection: A Review of Past Decade

Proceedings of the World Congress on Engineering 2010 Vol I
WCE 2010, June 30 - July 2, 2010, London, U.K.
Shunkai Fu and Michel C. Desmarais

Abstract—This paper summarizes the related works on feature selection via the induction of the Markov blanket, which can be traced back to 1996; the concept of Markov blanket itself appeared even earlier, in 1988. Our review not only covers a series of published algorithms, including KS, GS, IAMB and its variants, MMPC/MB, HITON-PC/MB, Fast-IAMB, PCMB and IPC-MB (ordered by their time of appearance), but also why they were invented and their relative advantages and disadvantages, from both theoretical and practical viewpoints. Besides, it is noticed that all of these works are constraint learning, which depends on conditional independence tests to induce the target, instead of score-and-search, the other mainstream approach applied in the structure learning of the closely related concept of Bayesian network. Being the first to do so, we discuss the cause and argue that this choice is not accidental, though not in a formal way. The discussion covered here is believed to be a valuable reference for academic researchers as well as practitioners.

Index Terms—Feature selection, Markov Blanket.

Shunkai Fu is with the Computer Engineering Department, Ecole Polytechnique de Montreal, Canada, and the Computer Science and Technology College, Donghua University, China (e-mail: [email protected]).
Michel C. Desmarais is with the Computer Engineering Department, Ecole Polytechnique de Montreal, Canada (e-mail: [email protected]).

I. INTRODUCTION

As of 1997, when a special issue of the journal Artificial Intelligence on relevance, including several papers on variable and feature selection, was published [1], [2], few explored domains used more than 40 features. The situation has changed considerably in the past decade, and currently domains involving many more variables, hundreds to thousands, are becoming common. Therefore, feature selection has been an active research area in the pattern recognition, statistics and data mining communities. The main idea of feature reduction is to select a subset of input variables by eliminating features with little or no predictive ability, but without sacrificing the performance of the model built on the chosen features. It is also known as variable selection, feature reduction, attribute selection or variable subset selection. By removing most of the irrelevant and redundant features from the data, feature reduction brings many potential benefits:

- Alleviating the effect of the curse of dimensionality to improve prediction performance;
- Facilitating data visualization and data understanding, e.g. which are the important features and how they are related with each other;
- Reducing the measurement and storage requirements;
- Speeding up the training and inference process;
- Enhancing model generalization.

A principled solution to the feature reduction problem is to determine a subset of features that can render the rest of the features independent of the variable of interest [3], [4], [5]. From a theoretical perspective, it is known that optimal feature reduction for supervised learning problems requires an exhaustive search of all possible subsets of features of the chosen cardinality, of which the complexity is an exponential function of the total number of features. In practice, the target is demoted to a satisfactory set of features instead of an optimal set, due to the lack of efficient algorithms.

Feature selection algorithms typically fall into two categories, feature ranking and subset selection. Feature ranking ranks all variables by a metric and eliminates those that do not achieve an adequate score. Selecting the most relevant variables is usually suboptimal for building a predictor, particularly if the variables are redundant. In other words, relevance does not imply optimality [2]. Besides, it has been demonstrated that a variable which is irrelevant by itself can provide a significant performance improvement when taken together with others [2], [6].

Subset selection, however, evaluates a subset of features that together have good predictive power, as opposed to sorting variables according to their individual relevance. Essentially it can be divided into wrappers, filters and embedded methods [6]. In the wrapper approach, the feature selection algorithm conducts a search through the space of possible combinations of features and evaluates each subset by utilizing the learning algorithm of interest as a black box [2]. Wrappers can be computationally expensive because model training and cross-validation must be repeated over each feature subset, and the outcome is tailored to a particular model. Filters are similar to wrappers in the search approach, but instead of evaluating against a predictor, a simple filter is utilized as preprocessing. Therefore, filters work independently of the chosen predictor. However, filters have a similar weakness as feature ranking, since they imply that irrelevant features are useless, though that is proved not true [2], [6]. Embedded methods perform variable selection in the process of training and are usually specific to given learning algorithms. Compared with wrappers, embedded methods may be more efficient in several respects: they make better use of the available data without having to split the training data into a training and validation set; they reach a solution faster by avoiding retraining a predictor from scratch for every variable subset to investigate [6]. Embedded methods are found in decision trees such as CART, for example, which have a built-in mechanism to perform variable selection [7].

Koller and Sahami (KS) [3] first showed that the Markov blanket (MB) of a given target variable T is the theoretically optimal set of features to predict T's value, although the Markov blanket itself is not a new concept and can be traced back to the work of Pearl [8] in 1988. Based on the findings that full knowledge of MB(T) is enough to determine the probability distribution of T, and that the values of all other variables then become superfluous, inducing MB(T) actually is a procedure of feature selection [3], [4], [9]. From our point of view, Markov blanket based feature selection can be categorized as a filter method, which means that it works independently of the later processing. Therefore, it is expected to be much more efficient than wrapper approaches. Furthermore, it conquers the defect of conventional filters, with an output containing only relevant and useful attributes. Compared with embedded methods, it is obviously a more general choice and could work with all learning algorithms.

Since KS's work in 1996, there have been several attempts to make the learning procedure more efficient and effective, including GS (Grow-Shrink) [10], IAMB (Iterative Associative Markov Blanket) and its variants [4], [9], MMPC/MB (Max-Min Parents and Children/Markov Blanket) [11], HITON-PC/MB [12], Fast-IAMB [13], PCMB (Parent-Children Markov Blanket learning) [5] and IPC-MB (Iterative Parent and Children Markov Blanket learning, also known as BFMB) [14], [15].

In Section 2, the Markov blanket itself is defined, as well as its relation with the Bayesian network. Then, in Section 3, we review all major algorithms for learning the Markov blanket. In Section 4, we discuss why all these algorithms are constraint learning, instead of score-and-search, the other primary family of algorithms for inducing Bayesian networks. We conclude with a summary of all algorithms covered in this paper, for quick reference.

II. BAYESIAN NETWORK AND MARKOV BLANKET

A Bayesian network is a graphical tool that compactly represents a joint probability distribution P over a set of random variables U using a directed acyclic graph (DAG) G annotated with conditional probability tables specifying the distribution of each node given any instantiation of its parents. Therefore, the graph represents qualitative information about the random variables (conditional independence properties), while the associated probability distribution, consistent with such properties, provides a quantitative description of how the variables relate to each other. One example of a Bayesian network is shown in Fig.1. The probability distribution P and the graph G of a Bayesian network are connected by the Markov condition property: a node is conditionally independent of its non-descendants, given its parents.

Definition 1 (Faithfulness). A Bayesian network G and a joint distribution P are faithful to one another iff. every conditional independence entailed by the graph G and the Markov condition is also present in P [8, 16].
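All of the algorithms reviewed below decide such independence relations through statistical tests on data. As a minimal, self-contained illustration (our own sketch, not code from any of the cited papers), the G2 (log-likelihood ratio) statistic commonly used for testing I(X, Y | Z) on discrete samples can be computed as follows; the function name and data layout are our own choices:

```python
import math
from collections import Counter

def g2(data, i, j, cond=()):
    """G2 statistic for testing I(X_i, X_j | X_cond) on discrete samples.

    data: iterable of tuples (one tuple per observation);
    i, j: column indices of the two tested variables;
    cond: tuple of column indices of the conditioning set.
    A value near 0 supports conditional independence.
    """
    # Joint counts n(z, x, y) over strata z of the conditioning set.
    n_xyz = Counter((tuple(r[k] for k in cond), r[i], r[j]) for r in data)
    n_xz, n_yz, n_z = Counter(), Counter(), Counter()
    for (z, x, y), c in n_xyz.items():
        n_xz[(z, x)] += c
        n_yz[(z, y)] += c
        n_z[z] += c
    # G2 = 2 * sum n(z,x,y) * ln( n(z,x,y) * n(z) / (n(z,x) * n(z,y)) )
    stat = 0.0
    for (z, x, y), c in n_xyz.items():
        stat += 2.0 * c * math.log(c * n_z[z] / (n_xz[(z, x)] * n_yz[(z, y)]))
    return stat
```

On a balanced toy sample where column 1 duplicates column 0 while column 2 is unrelated, the statistic is large for the dependent pair and exactly zero for the independent one; in practice the statistic would be compared against a chi-square threshold to decide (in)dependence.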
Given the faithfulness assumption, the Markov blanket of T is unique, and it becomes trivial to retrieve it from the corresponding Bayesian network over the problem domain. It is known to be composed of T's parents, children and spouses (Fig.1). However, this requires the Bayesian network to be ready in advance. Indeed, traditionally, we have to learn the target Bayesian network first to get the Markov blanket of some variable, but the structure learning of a Bayesian network is known to be an NP-complete problem. Therefore, an ideal solution will allow us to induce the Markov blanket without having to have the whole Bayesian network ready first, which, potentially, reduces the time complexity greatly so that we can solve larger scale problems with the same computing resources.

Fig.1. An example of a Bayesian network. The variables of MB(T) are in gray, while [...] additionally includes the texture-filled variable O.

Definition 2 Markov Blanket (Probability viewpoint). Given the faithfulness assumption, the Markov blanket of T is a minimal set conditioned on which all other nodes are independent of T, i.e. ∀X ∈ U\MB(T)\{T}, I(X, T | MB(T)).

Definition 3 Markov Blanket (Graphical viewpoint). Given the faithfulness assumption, the Markov blanket of T is identical to T's parents, children and children's parents (spouses), i.e. MB(T) = Pa(T) ∪ Ch(T) ∪ Sp(T).

Theorem 1. If a Bayesian network G is faithful to a joint probability distribution P, then: (1) there is an edge between the pair of nodes X and Y iff. X and Y are conditionally dependent given any other set of nodes; (2) for each triplet of nodes X, Y and Z in G such that X and Y are adjacent to Z but X is not adjacent to Y, X → Z ← Y is a subgraph of G iff. X and Y are dependent conditioned on every other set of nodes that contains Z [17].

Given the faithfulness assumption, Definition 2 and Definition 3 define the Markov blanket from the probability and graphical views respectively. Definition 3 plus Theorem 1 are the topology information referred to by the more recent and finer algorithms such as MMPC/MB, HITON-PC/MB, PCMB and IPC-MB. Of course, the faithfulness assumption is the basis for all of them, including GS, IAMB and its variants. Luckily enough, the vast majority of distributions are faithful in the sample limit [18].

III. ALGORITHMS FOR LEARNING MARKOV BLANKET (1996 – PRESENT)

A. KS

Pearl was the first to define the concept and study the properties of the Markov blanket in his early work on Bayesian networks [8]. However, Koller and Sahami's work towards optimal feature selection is the original one to recognize that the Markov blanket of the target of interest is the theoretically optimal set of features to predict its value [3]. Their finding attracted many following trials in the past decade aiming at inducing the Markov blanket with better performance.

Koller and Sahami proposed a theoretically justified framework for optimal feature selection based on using cross-entropy to minimize the amount of predictive information lost during feature elimination [3]. They also proposed an approximate algorithm based on their theoretical model, and this algorithm has been referred to as KS by many since then. KS is the first algorithm for feature selection to employ the concept of Markov blanket. Although it is theoretically sound, the proposed algorithm itself does not guarantee a correct outcome. The KS algorithm requires two parameters: (1) the number of variables to retain, and (2) the maximum number of variables the algorithm is allowed to condition on. These two limits help reduce the search complexity greatly, but at a sacrifice of correctness [4], [5].

B. GS

The GS algorithm [10] was proposed in 1999 to induce the Bayesian network automatically by first identifying each node's Markov blanket, then connecting nodes in a maximally consistent way. It employs independence properties of the underlying network to discover parts of its structure, just like the SGS and PC algorithms in [17]. However, the design of GS enables it to address the two known shortcomings of previous work which are preventing
them from becoming more widespread. These two disadvantages are: exponential execution time, and proneness to errors in the dependence tests used. The former is addressed in two ways. One is by identifying the local neighborhood of each variable in the Bayesian network as a pre-processing step, in order to facilitate the recovery of the local structure around each variable in polynomial time under the assumption of bounded neighborhood size. The second, randomized version goes one step further, employing a user-specified number of randomized tests in order to ascertain the same result with high probability. The second disadvantage of this research approach, namely proneness to errors, is also addressed by the randomized version, by using multiple data sets and Bayesian accumulation of evidence.

Like the other constraint-based learning algorithms, GS depends on two basic assumptions: faithfulness, and correct/reliable conditional independence (CI) tests. Here, the second assumption is required in practice, since only when the number of observations is large enough is the result of a statistical test trustworthy. Actually, these two assumptions are also the basis of the following algorithms.

As its name indicates, GS proceeds in two steps, growing greedily first, then shrinking by removing false positives. It is the first algorithm proved correct, but it is not efficient and cannot scale to large-scale applications. However, the soundness of the algorithm makes it a proven subject for future research.

In [10], Margaritis and Thrun also proposed a randomized version of the GS algorithm to solve problems involving a large number of variables or variables with many possible values. It requires a manually defined parameter to reduce the number of conditional tests, similar to the KS algorithm; hence, it cannot guarantee correct output, and it is ignored here without further discussion.

C. IAMB and Its Variants

IAMB was proposed in 2003 for classification problems in microarray research, where thousands of attributes are quite common. It is an algorithm based on the same two assumptions as GS, sound in theory and especially simple in implementation. The IAMB algorithm is structurally similar to GS, consisting of two phases – growing and shrinking. However, there is one important difference: GS orders the variables, when they are considered for inclusion in the first step, according to their strength of association with T given the empty set. It then admits into the candidate MB(T) the next variable in the ordering that is not conditionally independent of T given the current candidate MB(T). One problem with this heuristic is that when MB(T) contains spouses of T, the spouses are typically associated with T very weakly given the empty set and are considered for inclusion in MB(T) late in the first phase (associations between spouses and T are only through confounding/common descendant variables, thus they are weaker than those ancestors' associations with T). In turn, this implies that more false positives will enter MB(T) during the first step, and the conditional tests of independence will become unreliable much sooner than when using IAMB's heuristic. IAMB, in contrast, reorders the set of attributes each time a new attribute enters the blanket in the growing phase, based on updated CI testing results, which allows IAMB to perform better than GS since fewer false positives will be added during the first phase [4, 13].

In spite of the improvement, IAMB is still not data efficient, since its CI tests may be conditioned on the whole MB(T) or an even larger set due to its design, though this is not necessary, as discovered by later work. This point was also noticed by its authors, and several variants of IAMB were proposed, like interIAMB, IAMBnPC and their combined version interIAMBnPC [9]. InterIAMBnPC employs two methods to reduce the possible size of the conditioning sets: (1) it interleaves the growing phase of IAMB with the pruning phase, attempting to keep the size of MB(T) as small as possible during all steps of the algorithm's execution; (2) it substitutes the shrinking phase as implemented in IAMB with the PC algorithm instead. InterIAMB and IAMBnPC are similar to InterIAMBnPC, but they only either interleave the first two phases or rely on PC for the backward phase, respectively.

D. MMPC/MB

Although the variants of IAMB achieve better data efficiency than IAMB, they are still far from satisfactory. A breakthrough was not made until the introduction of MMPC/MB, in which the sample requirement depends on the underlying connectivity present in the target graph faithful to the data, instead of on the size of the Markov blanket as required by the previous algorithms.
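Before turning to the topology-based algorithms, the grow-shrink scheme shared by GS, IAMB and Fast-IAMB can be reduced to the following sketch. This is an illustration of the scheme only, with `assoc` and `indep` as hypothetical stand-ins for the statistical association measure and CI test a real implementation would supply:

```python
def grow_shrink_mb(variables, target, assoc, indep):
    """IAMB-style Markov blanket induction (sketch).

    assoc(X, T, S): association strength of X with T given set S (assumed helper);
    indep(X, T, S): True iff X is conditionally independent of T given S.
    """
    mb = set()
    # Growing phase: re-rank the candidates after every admission, and admit
    # the strongest variable still dependent on the target given current mb.
    while True:
        cands = [x for x in variables
                 if x != target and x not in mb and not indep(x, target, mb)]
        if not cands:
            break
        mb.add(max(cands, key=lambda x: assoc(x, target, mb)))
    # Shrinking phase: remove false positives admitted early in the growth.
    for x in sorted(mb):
        if indep(x, target, mb - {x}):
            mb.discard(x)
    return mb
```

With a perfect CI oracle for the chain A → T → B, with C unrelated to T, the sketch returns {A, B}, i.e. Pa(T) ∪ Ch(T); real data replaces the oracle with thresholded statistical tests, which is where the reliability issues discussed above arise.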
The overall MMMB algorithm is composed of two steps. Firstly, it depends on MMPC to induce the nodes directly connected to T, i.e. PC(T), T's parents and children. Then it attempts to identify the remaining nodes of the Markov blanket, i.e. the spouses of T. The spouses of T are the parents of the common children of T, which suggests that they should belong to ∪X∈PC(T) PC(X). So, MMPC is applied to each X ∈ PC(T) to induce X's parents and children, which are viewed as spouse candidates that contain false ones to be filtered out with further checking. To determine whether Y ∈ ∪X∈PC(T) PC(X) is a spouse, we actually need to recognize the so-called v-structure, i.e. T → X ← Y. Therefore, the underlying connectivity is critical for doing the induction, and Theorem 1 tells us how to determine the corresponding connectivity.

Although the MMPC/MB algorithm was proved not sound by Peña et al. [5], the proposed direction has been recognized by many. HITON-PC/MB, PCMB and IPC-MB all follow the similar two-phase framework of MMPC/MB.

E. HITON-PC/MB

HITON-PC/MB [12] is also the work of the authors of IAMB, and can be viewed as a trial to make the induction of the Markov blanket even more data efficient, to meet the challenges of practice.

As mentioned at the end of the previous subsection, HITON-PC/MB works in a similar manner as MMPC/MB, with the exception that it interleaves the addition and removal of nodes, aiming at removing false positives as early as possible so that the conditioning sets are kept as small as possible. Unfortunately, HITON-PC/MB is also proved not sound in [5]. However, it is still viewed as another meaningful trial towards an efficient learning algorithm for the Markov blanket without having to learn the whole Bayesian network.

F. Fast-IAMB

Fast-IAMB [13] is the work of the author of GS as well. Similar to GS and IAMB, Fast-IAMB contains a growing phase and a shrinking phase. During the growing phase of each iteration, it sorts the attributes that are candidates for admission to MB(T) from most to least conditionally dependent, according to a heuristic function (e.g. a conditional statistical test). Each such sorting step is potentially expensive, since it involves the calculation of the test value between T and each member of the candidate set, which contains those left unchecked. The key idea behind Fast-IAMB is to reduce the number of such tests by adding not one, but a number of attributes at a time after each reordering of the remaining attributes following a modification of the Markov blanket. Fast-IAMB speculatively adds one or more attributes of highest significance without re-sorting after each modification as IAMB does, which (hopefully) adds more than one true member of the blanket. Thus, the cost of re-sorting the remaining attributes after each Markov blanket modification can be amortized over the addition of multiple attributes.

The question arises: how many attributes should be added to the blanket in each iteration? The following heuristic is used in [13]: dependent attributes are added as long as the conditional independence tests are reliable, i.e. there is enough data for conducting them.

In conclusion, Fast-IAMB realizes a fast induction by greedily adding as many candidates as possible in the growing phase.

G. PCMB

Following the idea of MMPC/MB and HITON-PC/MB, PCMB [5] was also proposed to conquer the data inefficiency problem of IAMB, and, more importantly, it is the first such trial proved theoretically sound.

PCMB requires the same two assumptions as MMPC/MB and HITON-PC/MB: faithfulness and correct statistical tests. Similarly, PCMB induces MB(T) via the recognition of direct connections, i.e. the parents and children of any variable of interest, just like MMPC/MB and HITON-PC/MB do, which may explain where its name comes from.

PCMB claims to scale to thousands of features as IAMB does [5], but it is able to achieve much higher accuracy than IAMB given the same amount of data [14, 16], which exactly reflects its data efficiency advantage. However, when given enough training data, which means that both algorithms can search as far as they can, PCMB is known to be much more time-consuming than IAMB to achieve the same result. Unfortunately, we rarely have such an ideal condition in practice, and very often, given limited instances, we have to stop the search due to unreliable statistical tests. In conclusion, what is gained by a finer algorithm like PCMB is exchanged for more consumption of computing resources, as compared to a "naïve" one like IAMB.
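The two-phase, topology-driven framework shared by MMPC/MB, HITON-PC/MB, PCMB and IPC-MB (recover PC(T) first, then the spouses through the v-structures of Theorem 1) can be sketched as below; `pc` and `is_v_structure` are assumed oracles that a concrete algorithm would implement with CI tests:

```python
def mb_from_topology(target, pc, is_v_structure):
    """Two-phase Markov blanket recovery (sketch of the shared framework).

    pc(X): candidate parents/children of X (assumed subroutine);
    is_v_structure(T, X, Y): True iff T -> X <- Y is supported
    (Theorem 1, point 2). Both stand in for CI-test based procedures.
    """
    pc_t = set(pc(target))                # phase 1: direct neighbors of T
    spouses = set()
    for x in pc_t:                        # phase 2: search PC of each neighbor
        for y in pc(x):
            if y != target and y not in pc_t and is_v_structure(target, x, y):
                spouses.add(y)            # y is a parent of T's child x
    return pc_t | spouses
```

For the toy graph A → T, T → C, D → C, the sketch returns {A, C, D}: the parent A, the child C, and the spouse D reached only through the v-structure T → C ← D.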
H. IPC-MB

IPC-MB [15], or BFMB in its first published version [14], is the most recent published progress on this topic, aiming at even better performance than PCMB. It has a similar framework to MMPC/MB, HITON-PC/MB and PCMB: firstly it recognizes those nodes directly connected to T, known as candidate parents and children; then it repeats the local search given each candidate found, which enables us to recognize not only the false positives, but also the candidate spouses. Correct spouses are recognized further based on the second point of Theorem 1. Compared with PCMB, IPC-MB determines the connectivity of any pair of variables in a "smarter" manner, and the overall heuristic followed by IPC-MB is described below:

- IPC-MB proceeds by checking and removing false positives. Considering that the size of MB(T) is normally much smaller than that of U, filtering out negatives is believed to be a much easier job than directly recognizing positives;
- Recognizing and removing as many negatives as possible, and as early as possible, is an effective way to reduce noise and to avoid conditioning on unnecessarily large conditioning sets, which is the precondition for reliable CI tests and for the success of learning. Besides, it saves computing time by preventing needless tests;
- IPC-MB filters negatives by conditioning on the empty set first. Then, one variable is allowed into the conditioning set, and the checking continues. This procedure iterates with increasing conditioning set size, resulting in more and more negatives being removed. So, the decision on a negative is made with as small a conditioning set as possible, and as early as possible, which is the key factor for the success of IPC-MB, considering that the reliability of the CI tests is the main factor influencing the performance of this kind of algorithm.

IPC-MB is declared to offer the best trade-off among all published works of this type, in terms of soundness, time efficiency, data efficiency and information found [15, 16].

IV. WHY ALL CONSTRAINT LEARNING

Regarding the structure learning of Bayesian networks, there are two primary approaches, i.e. constraint-based learning and score-and-search. Constraint-based learning depends on a series of conditional independence (CI) tests to induce a Bayesian network in agreement with the test results. Score-and-search, however, defines a global measure (or score) which evaluates a given Bayesian network model as a function of the data, and then searches the space of possible Bayesian network models with the goal of finding one with optimal score. In the past years, the score-and-search approach has received more attention due to several known advantages [19].

Given the fact that the Markov blanket actually is part of the target Bayesian network under the faithfulness assumption, is it possible to apply these two mature frameworks to learning the Markov blanket? Even though the score-and-search approach attracted more attention than constraint learning in the past decade for inducing Bayesian network structure, it is constraint learning that is preferred and experimented with by researchers on Markov blanket learning. IAMB and its variants, GS, MMPC/MB, HITON-PC/MB, PCMB and IPC-MB are all such examples. We believe this is not accidental, even though there is no explicit explanation of the choice in the published articles, so we are going to share some thoughts to fill this gap, though not in a formal manner.

Given a problem over U, we do not know how many variables belong to MB(T), to say nothing of which ones exactly. With the search-and-score approach, we would have to measure all possible subsets S ⊆ U, i.e. the power set of U, of which the complexity is 2^|U|, i.e. exponential in the number of variables. The KS algorithm executes in a similar manner, and it requires specifying the target size of MB(T). Though specifying the size of MB(T) could limit KS's running to a predictable scale of space, obviously it may prevent KS from producing correct results, since it is impossible to "guess" the exact size of the target each time. So, even assuming that we could define a perfect measure (or scoring mechanism), it is not acceptable in practice if we have to do the search in an exponentially expanding space. Therefore, a more affordable as well as sound solution to induce MB(T) is expected. To achieve this goal, some additional guidance is needed to figure out a finer search strategy, i.e. "constraining" the search in a more efficient manner so that as much fruitless effort as possible can be avoided. With the property of the Markov blanket, i.e. that T is independent of any X ∉ MB(T) conditioned on MB(T), and T is dependent on any X ∈ MB(T) conditioned on
MB(T)\{X}, we may reduce the search space greatly and get a more efficient algorithm. IAMB is one such important progress, and it indeed excels the previous naïve approach, like KS, with an obvious gain in efficiency. Since then, all efforts began to follow the constraint learning approach, proposing one algorithm after another aiming at better performance, with more and more heuristics introduced. Of course, the procedure is becoming more and more sophisticated, though that is not what we expect.

V. CONCLUSION

In this paper, we review the published algorithms on feature subset selection via the learning of the Markov blanket of a given target of interest. Those covered in the discussion are listed in Table 1 for quick reference. To the best of our knowledge, IPC-MB achieves the best trade-off as compared with the others, in terms of effectiveness, time efficiency, data efficiency and topology information inferred. Besides, we also discuss why all these algorithms are categorized as constraint-based learning, though not in a formal manner.

Table 1. Conclusion on the related algorithms for learning the Markov blanket

KS (1996):
- Not sound
- The first one of this type
- Requires specifying the MB size in advance

GS (1999):
- Sound in theory
- Proposed to learn the Bayesian network via the induction of the neighbors of each variable
- The first proved algorithm of this kind
- Works in two phases: grow and shrink

IAMB and its variants (2003):
- Sound in theory
- Actually a variant of GS
- Simple to implement
- Time efficient
- Very poor on data efficiency
- IAMB's variants achieve better data efficiency than IAMB

MMPC/MB (2003):
- Not sound
- The first to make use of the underlying topology information
- Much more data efficient compared to IAMB
- Much slower compared to IAMB

HITON-PC/MB (2003):
- Not sound
- Another trial to make use of the topology information to enhance data efficiency
- Data efficient compared to IAMB
- Much slower compared to IAMB

Fast-IAMB (2005):
- Sound in theory
- No fundamental difference as compared to IAMB
- Adds candidates more greedily to speed up the learning
- Still poor on data efficiency

PCMB (2006):
- Sound in theory
- Data efficient by making use of topology information
- Poor on time efficiency
- Distinguishes spouses from parents/children
- Distinguishes some children from parents/children

IPC-MB (2007/2008):
- Sound in theory
- Most data efficient compared with the previous ones
- Much faster than PCMB on computing
- Distinguishes spouses from parents/children
- Distinguishes some children from parents/children
- Best trade-off among this family of algorithms

REFERENCES

[1] Blum, A. and P. Langley, Selection of Relevant Features and Examples in Machine Learning. Artificial Intelligence, 1997. 97(1-2).
[2] Kohavi, R. and G.H. John, Wrappers for Feature Subset Selection. Artificial Intelligence, 1997. 97(1-2).
[3] Koller, D. and M. Sahami. Toward Optimal Feature Selection. in ICML. 1996.
[4] Tsamardinos, I., C.F. Aliferis, and A.R. Statnikov. Time and sample efficient discovery of Markov blankets and direct causal relations. in the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003: ACM.
[5] Peña, J.M., et al., Towards scalable and data efficient learning of Markov boundaries. International Journal of Approximate Reasoning, 2007. 45(2).
[6] Guyon, I. and A. Elisseeff, An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 2003. 3.
[7] Breiman, L., et al., Classification and Regression Trees. 1984: Wadsworth.
[8] Pearl, J., Probabilistic Reasoning in Intelligent Systems. 1988, San Mateo: Morgan Kaufmann.
[9] Tsamardinos, I., C.F. Aliferis, and A.R. Statnikov. Algorithms for Large Scale Markov Blanket Discovery. in the Sixteenth International Florida Artificial Intelligence Research Society Conference. 2003. St. Augustine, Florida, USA: AAAI Press.
[10] Margaritis, D. and S. Thrun. Bayesian Network Induction via Local Neighborhoods. in Advances in Neural Information Processing Systems. 1999. Denver, Colorado, USA: The MIT Press.
[11] Tsamardinos, I., L.E. Brown, and C.F. Aliferis, The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 2006. 65(1): p. 31-78.
[12] Aliferis, C.F., I. Tsamardinos, and A.R. Statnikov. HITON: A novel Markov blanket algorithm for optimal variable selection. in American Medical Informatics Association Annual Symposium. 2003.
[13] Yaramakala, S. and D. Margaritis. Speculative Markov Blanket Discovery for Optimal Feature Selection. in ICDM. 2005.
[14] Fu, S.-K. and M.C. Desmarais. Local Learning Algorithm for Markov Blanket Discovery. in Australian Conference on Artificial Intelligence. 2007. Gold Coast, Australia: Springer.
[15] Fu, S.-K. and M.C. Desmarais. Fast Markov Blanket Discovery Algorithm via Local Learning within Single Pass. in Canadian Conference on AI. 2008. Windsor, Canada: Springer.
[16] Fu, S.-K. and M.C. Desmarais. Tradeoff Analysis of Different Markov Blanket Local Learning Approaches. in Advances in Knowledge Discovery and Data Mining, 12th Pacific-Asia Conference (PAKDD). 2008. Osaka, Japan: Springer.
[17] Spirtes, P., C. Glymour, and R. Scheines, Causation, Prediction, and Search (2nd Edition). 2001: The MIT Press.
[18] Pearl, J., Causality: Models, Reasoning, and Inference. 2000: Cambridge University Press.
[19] Heckerman, D. A Bayesian Approach to Learning Causal Networks. in the Eleventh Annual Conference on Uncertainty in Artificial Intelligence. 1995. Montreal, Quebec, Canada: Morgan Kaufmann.
ISBN: 978-988-17012-9-9
ISSN: 2078-0958 (Print); ISSN: 2078-0966 (Online)