Density-Linked Clustering

Document technical information

Format pdf
Size 304.2 kB
First found May 22, 2018

Document content analysis

Category Also themed
Language
English
Type
not defined
Concepts
no text concepts found

Persons

Organizations

Places

Transcript

DeLiClu: Boosting Robustness, Completeness, Usability,
and Efficiency of Hierarchical Clustering by a Closest
Pair Ranking
Elke Achtert, Christian Böhm, and Peer Kröger
Institute for Computer Science, University of Munich, Germany
{achtert,boehm,kroegerp,}@dbs.ifi.lmu.de
Abstract. Hierarchical clustering algorithms, e.g. Single-Link or OPTICS compute the hierarchical clustering structure of data sets and visualize those structures by means of dendrograms and reachability plots. Both types of algorithms
have their own drawbacks. Single-Link suffers from the well-known single-link
effect and is not robust against noise objects. Furthermore, the interpretability
of the resulting dendrogram deteriorates heavily with increasing database size.
OPTICS overcomes these limitations by using a density estimator for data grouping and computing a reachability diagram which provides a clear presentation
of the hierarchical clustering structure even for large data sets. However, it requires a non-intuitive parameter ε that has significant impact on the performance
of the algorithm and the accuracy of the results. In this paper, we propose a novel
and efficient k-nearest neighbor join closest-pair ranking algorithm to overcome
the problems of both worlds. Our density-link clustering algorithm uses a similar density estimator for data grouping, but does not require the ε parameter
of OPTICS and thus produces the optimal result w.r.t. accuracy. In addition, it
provides a significant performance boosting over Single-Link and OPTICS. Our
experiments show both, the improvement of accuracy as well as the efficiency
acceleration of our method compared to Single-Link and OPTICS.
1
Introduction
Hierarchical clustering methods determine a complex, nested cluster structure which
can be examined at different levels of generality or detail. The complex cluster structure can be visualized by concepts like dendrograms or reachability diagrams. The
most well-known hierarchical clustering method is Single-Link [1] and its variants like
Complete-Link and Average-Link [2]. Single-Link suffers from the so-called singlelink effect which means that a single noise object bridging the gap between two actual
clusters can hamper the algorithm in detecting the correct cluster structure. The time
complexity of Single-Link and its variants is at least quadratic in the number of objects.
Another hierarchical clustering algorithm is OPTICS [3], which follows the idea
of density-based clustering [4], i.e. clusters are regions of high data density separated
by regions of lower density. OPTICS solves some of the problems of Single-Link but
only to the expense of introducing new parameters minPts and ε. The latter is not very
intuitive and critical for both, performance of the algorithm and accuracy of the result.
If ε is chosen too low, fundamental information about the cluster structure is lost, if it
is chosen too high the performance of the algorithm decreases dramatically.
In this paper, we introduce a novel hierarchical clustering algorithm DeLiClu (Density Linked Clustering) that combines the advantages of OPTICS and Single-Link by
fading out their drawbacks. Our algorithm is based on a closest pair ranking (CPR). The
objective of a CPR algorithm is: given two sets R and S of feature vectors, determine
in a first step that pair of objects (r, s) ∈ (R × S) having minimum distance, in the next
step the second pair, and so on. Well-known CPR algorithms like [5] operate on static
data sets which are not subject to insertions or deletions after initialization of the ranking. Our new DeLiClu algorithm, however, needs a ranking algorithm where after each
fetch operation for a new pair (r, s) the object s is deleted from S and inserted into R.
We show how the ranking algorithm can be modified to allow the required update operations without much additional overhead and how Single-Link can be implemented on
top of a CPR. This allows the use of an index structure which makes the algorithm more
efficient without introducing the parameter ε like OPTICS does. Finally, we describe
how the density-estimator of OPTICS can be integrated into our solution.
The rest of this paper is organized as follows: Sec. 2 discusses related work. In Sect
3 our novel algorithm is described. Sec. 4 presents an experimental evaluation. Sec. 5
concludes the paper.
2
Related Work
Hierarchical Clustering. Hierarchical clustering algorithms produce a nested sequence
of clusters, resulting in a binary tree-like representation, a so-called dendrogram. The
root of the dendrogram represents one single cluster, containing the n data points of the
entire data set. Each of the n leaves of the dendrogram corresponds to one single cluster
which contains only one data point. Hierarchical clustering algorithms primarily differ
in the way they determine the similarity between clusters. The most common method
is the Single-Link method [1] which measures the similarity between two clusters by
the similarity of the closest pair of data points belonging to different clusters. This
approach suffers from the so-called single-link effect, i.e. if there is a chain of points
between two clusters then the two clusters may not be separated. In the Complete-Link
method the distance between two clusters is the maximum of all pairwise distances
between the data points in the two clusters. Average-Link clustering merges in each
step the pair of clusters having the smallest average pairwise distance of data points in
the two clusters. A major drawback of the traditional hierarchical clustering methods
is that dendrograms are not really suitable to display the full hierarchy for data sets of
more than a few hundred compounds. Even for a small amount of data, a reasonable
interpretation of the dendrogram is almost impossible due to its complexity. The singlelink effect can also be seen in the figure: as an impact of the connection line between
the two clusters Single-Link computes no clearly separated clusters.
OPTICS [3] is another hierarchical clustering algorithm, but uses the concept of
density based clustering and thus reduces significantly the single-link effect. Additionally, OPTICS is specifically designed to be based on range queries which can be
efficiently supported by index-based access structures. The density estimator used by
OPTICS consists of two values for each object, the core distance and the reachability
distance w.r.t. parameters minPts ∈ N and ε ∈ R. The clustering result can be displayed
in a so-called reachability plot that is more appropriate for very large data sets than a
dendrogram. A reachability plot consists of the reachability values on the y-axis of all
objects plotted according to the cluster order on the x-axis. The “valleys” in the plot
represent the clusters, since objects within a cluster have lower reachability distances
than objects outside a cluster. Figure 1 shows examples of reachability plots with different parameter settings for ε and minPts. The effect of minPts to the resulting cluster
structure is depicted in the left part of Figure 1. The upper part shows a reachability plot
resulting from an OPTICS run with minPts = 2 where no meaningful cluster structure
has been detected. If the value of minPts is increased as in the lower part of the figure,
the two clusters in the data set can be seen as valleys in the reachability plot. The second
parameter ε is much more difficult to determine but has a considerable impact on the
efficiency and the accuracy of OPTICS. If ε is chosen too small, fundamental information about the cluster structure will be lost. The right part of figure 1 shows this effect in
the upper diagram where the information about clusters consisting of data points with
reachability values greater than ε = 12 is no longer existent.
Closest Pair Ranking. The closest pair problem is a classical problem of computational
geometry [6]. The intention is to find those two points from given data sets R and S
whose mutual distance is the smallest. The CPR determines in the first step that pair of
objects in R × S having the smallest distance, in the next step the second pair, etc. The
number of pairs to be reported is a priori unknown. In the database context the CPR
problem was introduced first in [5], calling it distance join. An incremental algorithm
based on the R-Tree family is proposed. For each data set R and S a spatial index is
constructed as input. The basic algorithm traverses the two index structures, starting at
the root of the two trees. The visited pairs of nodes are kept in a priority queue sorted
by their distances. If the first entry of the priority queue exists of a pair of data points,
then the pair is reported as the next closest pair. Otherwise, the pair is expanded and all
possible pairs formed by inspecting the children of the two nodes are inserted into the
priority queue. The algorithm terminates if all closest pairs are reported or the query is
stopped by the user. CPR algorithms operate on static data sets, i.e. they do not support
insertions or deletions of objects after initializing the ranking query. Our new DeLiClu
algorithm, however, needs shifting object s from S to R after reporting pair (r, s). In
Section 3 we propose a solution for this special case.
3
Density-Linked Clustering
Our new algorithm DeLiClu combines the advantages of Single-Link and OPTICS by
fading out the drawbacks mentioned in Section 2. To achieve these requirements we introduce a density-smoothing factor minPts into hierarchical clustering and use as representation of the clustering result reachability plots like OPTICS. In contrast to OPTICS
we avoid the introduction of the non-intuitive parameter ε which is critical for both,
performance of the algorithm and completeness of the result. In addition, we improve
the performance over both algorithms by applying powerful database primitives such as
the similarity join and a CPR, and by applying index structures for feature spaces.
minPts = 2
data set
minPts = 40
İ = 12
data set
İ = 60
Fig. 1. Impact of parameters minPts and ε
3.1
General Idea of DeLiClu
Typical hierarchical clustering algorithms work as follows: They keep two separate sets
of points, those points which have already been placed in the cluster structure and those
which have not. In each step, one point of the latter set is selected and placed in the
first set. The algorithm always selects that point which minimizes the distance to any
of the points in the first set. Assume the algorithm has already done part of its work,
and some of the points have already been placed in the cluster structure. What actually
happens then is that the closest pair is selected between the set R of those points which
are already assigned to clusters and the set S of the points which are not yet processed.
This means, we can also reformulate the main loop of the general algorithm into:
determine the closest pair (r, s) ∈ (R × S);
migrate s from S into R;
append s to cluster structure / reachability plot;
Note that we still have to render more precisely what exactly we mean by the notion
closest pair because we have to integrate the density-based smoothing factor minPts
into this notion. Additionally, since the reachability plot shows for each object its reachability distance we have to define a proper density distance for our DeLiClu algorithm.
However, this will be done in Section 3.3 and until then, we simply mean the closest
pair according to the Euclidean distance and assign each object with its closest pair or
nearest neighbor distance to the reachability plot.
If the closest pair from (R × S) would be determined in each step from scratch, we
would do a lot of unnecessary work. Instead, we like to save the status of processing
from one call of the closest pair determination to the next one. But since we migrate
object s from S to R after the closest pair (r, s) has been reported, we need a ranking
algorithm which supports insertions or deletions after initialization. We show in the next
section how the standard algorithm [5] can be extended to allow the required object
migration during the ranking. The core of our DeLiClu clustering algorithm now is:
1. Let R contain an arbitrary start object from data set D;
2. Let S be D \ R;
3. Initialize the CPR over (R × S);
4. Take the next pair (r, s) from the ranking;
5. Migrate s from S into R;
6. Append s to the reachability plot;
7. Continue with step (4) until all points are handled;
The critical remaining aspects are the migration of point s from S into R (step 5) and the
introduction of the density-based smoothing factor minPts and a proper density distance
definition.
3.2
Closest Pair Ranking with Object Migration
The original algorithm for CPR without object migration requires the two data sets to be
stored in hierarchical index structures such as R-trees [7]. The algorithm uses a priority
queue into which pairs of nodes and pairs of data objects can be inserted. The entries
in the priority queue are ordered by ascending distances between the pair of objects
(nodes, respectively) in the data space. Upon each request, the algorithm dequeues the
top pair. If it is a pair of data objects, it is reported as the result of the request. Otherwise,
the pair is expanded, i.e. for all pairs of child nodes the distances are determined and the
pairs are inserted into the queue. Several strategies exist to decide which of the elements
of a pair is expanded (left, right, or both). We assume here a symmetric expansion of
both elements of the pair. Further, we assume that both indexes have exactly the same
structure. Although the tree for R initially contains only the arbitrarily chosen start
element, we use a full copy of the directory of S for convenience, because this method
facilitates insertion of any element of S into R. We simply use the same path as in the
tree storing S. No complex insert and split algorithm has to be applied for insertion.
Whenever a new element s is inserted into the index storing the data set R, we have
to determine a suitable path P = (root, node1 , ..., nodeh , s) from the root to a leaf node
for this element (including the element itself). Comparing the nodes of this path with
the nodes of the index for S, we observe that some node pairs might already have been
inserted into the priority queue, others may not. Some of the pairs (e.g. (rootR , rootS ))
might have even already been removed from the priority queue. We call such removed
pairs processed. Processed pairs are a little bit problematic because they require catchup work for migrated objects. Processed pairs can be easily found by traversing the tree
S top-down. A pair should be in the queue if the parent pair has already been processed
(i.e. has a distance smaller than the current top element of the priority queue), but the
pair itself has a distance higher than the top element.
After a pair of objects (r, o) has been processed, the formerly not handled object
o is migrated from S to the set of already processed objects R. The catch-up work
which now has to be done consists of the insertion of all pairs of objects (nodes, respectively) (o, s) ∈ R × S into the priority queue for which the parent pair of nodes
(o.parent, s.parent) has already been processed. The complete recursive method is
called reInsertExpanded and is shown in Figure 2. Initially, reInsertExpanded is
called with the complete path of the migrated object o in R and the root node of S.
reInsertExpanded(Object[] path, Object o)
if (path[0], o) is a pair of objects then
insert the pair (path[0], o) into priority queue;
if (path[0], o) is a pair of nodes and has not yet been expanded then
insert the pair (path[0], o) into priority queue;
if (path[0], o) is a pair of nodes and has already been expanded then
determine all child nodes ochild of o;
reInsertExpanded(tail(path), ochild );
Fig. 2. Algorithm reInsertExpanded
3.3
The Density Estimator MinPts
Until now, we have re-engineered the Single-Link method without applying any density
estimator for enhancing the robustness. Our re-engineering has great impact on the
performance of the algorithm because now a powerful database primitive is applied to
accelerate the algorithm. We will show in Section 4 that the performance is significantly
improved. But our new implementation also offers an easy way to integrate the idea of
the density estimator minPts into the algorithm without using the difficult parameter ε of
OPTICS. To determine the reachability distance of an object shown in the reachability
plot we consider additionally the k-nearest neighbor distance of the point where k =
minPts. We call this distance density distance and it is formally defined as follows:
Definition 1 (density distance). Let D be a set of objects, q ∈ D and D IST be a
distance function on objects in D. For minPts ∈ N, minPts ≤ |D| let r be the minPtsnearest neighbor of q w.r.t. D IST. The density distance of an object p ∈ D relative from
object q w.r.t. minPts is defined as
D EN D ISTminPts (p, q) = max{D IST(q, r), D IST(q, p)}.
The density distance of of an object p relative from object q is an asymmetric distance measure that takes the density around p into account and is defined as the maximum value of the minPts-nearest neighbor distance of p and the distance between p and
q. Obviously, the density distance of DeLiClu is equivalent to the reachability distance
of OPTICS w.r.t. the same parameter minPts and parameter ε = ∞. Our algorithm
DeLiClu can adopt the density-based smoothing factor minPts by ordering the priority queue using the density distance rather than the Euclidean distance. The rest of the
algorithm remains unchanged. Obviously, this modification can be done without introducing the parameter ε. The cluster hierarchy is always determined completely, unlike
in OPTICS. And in contrast to OPTICS a guaranteed complete cluster result is not
payed with performance deterioration.
The k-nearest neighbor distance where k = minPts can be determined for all points
in a preprocessing step which applies a k-nearest neighbor join of the data set. Some
methods have been proposed for this purpose [8, 9] but unfortunately none for the simple R-tree and its variants. Therefore, we apply a new algorithm which is described in
the next section.
3.4
The k-NN Join on the R-tree
The k-nn join combines each of the points of R with its k nearest neighbors in S.
Algorithms for the k-nn join have been reported in [8] and in [9]. The first algorithm is
based on the MuX-index structure [10], the latter is on top of a grid order. Unfortunately,
there is no k-nn join algorithm for the R-tree family. Thus, in the following we present
a k-nn join algorithm based on the R-tree [7] and its variants, e.g. R*-tree[11].
Formally we define the k-nn join as follows:
Definition 2 (k-nn join R n S). Let R and S be sets of objects, and D IST be a distance
function between objects in R and S. RnS is the smallest subset of R×S that contains
for each point of R at least k points of S and for which the following condition holds:
∀(r, s) ∈ R n S, ∀(r, s0 ) ∈ R × S \ R n S : D IST(r, s) < D IST(r, s0 )
Essentially, the k-nn join combines each point of the data set R with its k-nearest
neighbors in the data set R. Each point of R appears in the result set exactly k times.
Points of S may appear once, more than once (if a point is among the k-nearest neighbors of several points in R) or not at all (if a point does not belong to the k-nearest
neighbors of any point in R).
For the k-nn join R n S based on the R-tree it is assumed that each data set R
and S is stored in an index structure belonging to the R-tree family. The data set R of
which the nearest neighbors are searched for each point is denoted as the outer point set.
Consequently, S is the inner point set. The data pages of R and S are processed in two
nested loops whereas each data page of the outer set R is accessed exactly once. The
outer loop iterates over all data pages pr of the outer point set R which are accessed in
an arbitrary order. For each data page pr, the data pages ps of the inner point set S are
sorted in ascending order to their distance to pr. For each point r stored in the data page
pr, a data structure for the k- nearest neighbor distances, short a k-nn distance list, is
allocated. The distances of candidate points are maintained in these k-nn distance lists
until they are either discarded and replaced by smaller distances of better candidate
points or until they are confirmed to be the actual nearest neighbor distances of the
corresponding point. A distance is confirmed if it is guaranteed that the database cannot
contain any points being closer to the given object than this distance. The last distance
value in the k-nn distance list belonging to a point r is the (actual) k-nn distance of
r: points and data pages beyond that distance need not to be considered. The pruning
distance of a data page is the maximum (actual) k-nn distance of all points stored in
this page. All data pages ps ∈ S having a distance from a given data page pr ∈ R
that exceeds the pruning distance of the data page pr can be safely neglected as joinpartners of that data page pr. Thus, in the inner loop only those data pages ps have to be
considered having a distance to the current data page pr less or equal than the pruning
distance of pr. Analogous, all points s of a data page ps having a distance to a current
point r greater than the current k-nn distance of r can be safely pruned and do not have
to be taken into consideration as candidate points.
3.5
Algorithm DeLiClu
The algorithm DeLiClu is given in Figure 3. In a preprocessing step, the k-nearest
neighbor distance for all points is determined as described in Section 3.4. In the follow-
DeLiClu(SetOfObjects S)
kNNJoin(S,S);
copy the index storing S to the index storing R;
s := start object ∈ S;
write (s, ∞) to output;
migrate s from S to R;
add pair (S.root, R.root) to priority queue;
while S 6= ∅ do
p:= minimum pair in priority queue;
if p = (nS , nR ) is a pair of nodes then
insert all combinations of (nS .children, nr .children) into priority queue;
else p = (s, r) is a pair of objects
write (s, denDist(s, r)) to output;
reInsertExpanded(path(s), root);
Fig. 3. Algorithm DeLiClu
ing R, denotes the set of objects already processed and S indicates the set of objects
which are still not yet handled. The algorithm starts with an arbitrary chosen start object
s ∈ S, migrates s from S to R and writes s with a density distance of infinity to output.
Note that migration of s from S to R means, that s is stored in the index structure of
R in the same path as in S. Thus, we do not need any complex insert or split algorithm
upon object migration. The two index structures of R and S only need to have the same
structure, i.e. the same directory and data nodes although the tree for R initially contains
no point.
The algorithm uses a priority queue into which pairs of nodes and pairs of data
objects from S × R can be inserted. The entries in the priority queue are sorted in
ascending order by the distance between the nodes of the pair or the density distance
between the objects of the pair. The first pair inserted into the queue is the pair of
nodes existing of the root of the index of S and the root of the index of R. In each
step, the top pair having minimum distance is dequeued from the priority queue. If
it is a pair (ns , nr ) of nodes, the pair will be expanded, i.e. all combinations of the
children of ns with the children of nr are inserted into the priority queue. Otherwise,
if the top pair of the priority queue consists of a pair (s, r) of data objects from S ×
R, the not yet processed object s ∈ S is written to output with the density distance
D EN D ISTminPts (s, r). Afterwards, s is migrated from S to R. As described in Section
3.2, objects belonging to already expanded nodes of the path of s have to be reinserted
into the priority queue by invoking the algorithm reinsertExpanded (see Figure 2). The
algorithm terminates if all objects are moved from S to R.
4
Experimental Evaluation
All experiments have been performed on Linux workstations with two 64-bit 1.8 GHz
CPU and 8 GB main memory. We used a disk with a transfer rate of 45 MB/s, a seek
time of 4 ms and a latency delay of 2 ms. For either technique a LRU cache of about
120 000
20 000
80 000
60 000
40 000
DeLiClu
OPTICS
SLINK
20 000
0
CPU + I/O-Time [sec]
CPU + I/O-Time [sec]
100 000
DeLiClu
OPTICS
SLINK
15 000
10 000
5 000
0
2
5
dimensionality
10
(a) Performance w.r.t. dimensionality
10
20
30
40
50
60
70
size [*1,000]
80
90
100
(b) Performance w.r.t. size
Fig. 4. Performance analysis
50% of the data set size was allocated. The OPTICS algorithm was supported by an
R-tree index structure. Unless otherwise specified, the minPts parameter of DeLiClu
and OPTICS was set to 5. The ε-parameter of OPTICS was set to the optimal value
w.r.t. accuracy. Performance is presented in terms of the elapsed time including I/O and
CPU-time. Beside synthetic data sets, we used a data set containing 500,000 5D featurevectors generated from the SEQUOIA benchmark and the El Nino data set from the UCI
KDD data repository, containing about 800 9D data objects.
Performance speed-up. We first compared the performance of the methods. As it can
be seen in Figure 4(a) DeLiClu significantly outperforms OPTICS and SLINK w.r.t.
the dimensionality of the database. In Figure 4(b), we can observe that DeLiClu also
outperforms SLINK and OPTICS w.r.t. the number of data objects is. Obviously, the
speed-up of DeLiClu grows significantly with increasing database size. Similar results
can be made on the SEQUOIA benchmark (results are not shown due to space limitations). DeLiClu achieved a speed-up factor of more than 20 over OPTICS and a
speed-up factor of more than 50 over SLINK.
Improvement of accuracy. The significant effect of parameter ε on the results of the
OPTICS algorithm is shown in Figure 5 (El Nino data). The left part of the figure shows
a reachability plot resulting from the new algorithm DeLiClu, the middle part of the
figure shows a reachability plot resulting from an OPTICS run with parameter ε chosen
too small. For this experiment, ε was set to a value for which the runtime of OPTICS
was approximately the same as for DeLiClu. Apparently, OPTICS lost a significant part
of the whole cluster information due to the wrongly chosen ε. The interpretability of
the dendrogram depicted in the right part of the figure is very weak in comparison with
the reachability plot resulting from the DeLiClu algorithm. DeLiClu generates strongly
separated clusters which cannot be seen in the dendrogram. Similar results have been
achieved on the SEQUOIA benchmark.
Fig. 5. Comparison of accuracy on real-world data set (El Nino data set)
5
Conclusions
We proposed the new algorithm DeLiClu based on a novel closest pair ranking algorithm that efficiently computes the hierarchical cluster structure. DeLiClu shows improved robustness over Single-Link w.r.t. noise and avoids the single-link effect by using a density estimator. In contrast to OPTICS it guarantees the complete determination
of the cluster structure. It has an improved usability over OPTICS by avoiding the nonintuitive parameter ε. Our experimental evaluation showes that DeLiClu significantly
outperforms Single-Link and OPTICS in terms of robustness, completeness, usability
and efficiency.
References
1. Sibson, R.: SLINK: An optimally efficient algorithm for the single-link cluster method. The
Computer Journal 16 (1973)
2. Jain, A.K., Dubes, R.C.: Algorithms for Clustering Data. Prentice Hall (1988)
3. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: Ordering points to identify
the clustering structure. In: Proc. SIGMOD. (1999)
4. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. KDD. (1996)
5. Hjaltason, G.R., Samet, H.: Incremental distance join algorithms for spatial databases. In:
Proc. SIGMOD. (1998)
6. Preparata, F.P., Shamos, M.I.: Computational Geometry: An Introduction. Springer Verlag
(1985)
7. Guttman, A.: R-Trees: A dynamic index structure for spatial searching. In: Proc. SIGMOD.
(1984)
8. Böhm, C., Krebs, F.: The k-nearest neighbor join: Turbo charging the KDD process. KAIS
6 (2004)
9. Xia, C., Lu, H., Ooi, B.C., Hu, J.: GORDER: An efficient method for KNN join processing.
In: Proc. VLDB. (2004)
10. Böhm, C., Kriegel, H.P.: A cost model and index archtecture for the similarity join. In: Proc.
ICDE. (2001)
11. Beckmann, N., Kriegel, H.P., Schneider, R., Seeger, B.: The R*-Tree: An efficient and robust
access method for points and rectangles. In: Proc. SIGMOD. (1990)

Similar documents

×

Report this document