Scientific Cooperations International Workshops on Electrical and Computer Engineering Subfields
22-23 August 2014, Koc University, ISTANBUL/TURKEY
Analysis to find the most effective features to predict
breast cancer: a data mining approach
Fariba Tat, Sahar Azadi Ghale Taki, Sadjaad Ozgoli, Mahdi Sojoodi
Electrical and Computer Engineering, Tarbiat Modares University
Tehran, Iran
[email protected], [email protected], [email protected], [email protected]

Mohammad Esmaeil Akbari
Chairman of Cancer Research Center, Prof. of the Department of Surgery, Shahid Beheshti University of Medical Sciences
Tehran, Iran
[email protected]
Abstract— This paper aims to present a hybrid intelligence
model that uses the cluster analysis techniques with feature
selection for analyzing clinical cancer diagnoses. Our model
provides an option of selecting a subset of salient features for
performing clustering and comprehensively considers the use of
most existing models that use all the features to perform
clustering. In particular, we study the methods by selecting
salient features to identify clusters using a comparison of
coincident quantitative measurements. When applied to
benchmark breast cancer datasets, experimental results indicate
that our method outperforms several benchmark filter- and
wrapper-based methods in selecting features used to discover
natural clusters, maximizing the between-cluster scatter and
minimizing the within-cluster scatter toward a satisfactory
clustering quality. The experimental dataset is based on the data
gathered in a hospital in Tehran.
Keywords—cluster analysis; feature selection; cancer diagnoses.
Breast cancer is the second leading cause of death after
cardiovascular diseases in the world. Health professionals are
seeking ways for suitable treatment and quality of care in these
groups of patients. Survival prediction is important for both
physicians and patients in order to choose the best way of
management. Today, diagnosing a disease is a vital task in
medicine. It is essential to reach a correct diagnosis with the
help of clinical examination and investigations. Computer-based
decision support systems can play an important role in accurate
diagnosis and cost-effective treatment. Most hospitals hold a
huge amount of patient data that is rarely used to support
clinical diagnosis. Why can we not use this data for clinical
diagnosis and patient management? Is it possible to build a
prediction system for a specific disease in a specific region
using data mining techniques [1]? Data mining is the
computational process of discovering patterns in large data
sets, involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems
[2]. Basically, data mining is concerned with processing data
and identifying patterns and trends in information. In other
words, data mining means collecting and processing data in a
systematic manner using computer-based programs, and
subsequently building an aid for disease prediction or patient
management. Data mining principles have been known for many
years, but with the advent of information technology they are
now even more prevalent. Data mining is not all about the
database software you are using: you can perform data mining
with comparatively modest database systems and simple tools,
including tools you create and write yourself, or with
off-the-shelf software packages. Complex data mining benefits
from past experience and from algorithms implemented in
existing software packages, with certain tools gaining a greater
affinity or reputation for particular techniques [3]. Data
mining is routinely used in a large number of fields, including
engineering, medicine, crime analysis, expert prediction, Web
mining, and mobile computing [4]. Machine
learning [5], [6] is concerned with the design and development
of algorithms that allow computers to learn behaviors from
databases, automatically recognizing complex patterns and
making intelligent decisions based on data. However, the sheer
volume of available data poses a major obstacle to discovering
patterns. Feature selection attempts to select a subset of
attributes based on information gain [7].
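As a toy illustration of the information-gain criterion just mentioned, the following sketch ranks two hypothetical discrete features by the entropy reduction they give on a made-up label; the data and feature names are invented for illustration and are not from the paper's dataset.

```python
# Hypothetical sketch of information-gain-based feature selection on a
# tiny, made-up discrete dataset (not the paper's hospital data).
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy reduction in `labels` after splitting on `feature_values`."""
    n = len(labels)
    total = entropy(labels)
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy records: two binary features and a binary class label.
f1 = [0, 0, 1, 1, 1, 0]   # perfectly predicts the label below
f2 = [0, 1, 0, 1, 0, 1]   # uninformative
label = [0, 0, 1, 1, 1, 0]

gains = {"f1": information_gain(f1, label), "f2": information_gain(f2, label)}
best = max(gains, key=gains.get)
print(best, gains)  # f1 has the higher gain
```

A feature-selection step of this kind would keep the top-ranked features and discard the rest before clustering or classification.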
Classification is performed to assign the given set of input data
to one of many categories [8]. Data analysis procedures can be
dichotomized as either exploratory or confirmatory, based on
the availability of appropriate models for the data source, but a
key element in both types of procedures (whether for
hypothesis formation or decision-making) is the grouping, or
classification, of measurements based on either (i) goodness-of-fit to a postulated model, or (ii) natural groupings (clustering)
revealed through analysis. Cluster analysis is the organization
of a collection of patterns (usually represented as a vector of
measurements, or a point in a multidimensional space) into
clusters based on similarity. Intuitively, patterns within a valid
cluster are more similar to each other than they are to a pattern
belonging to a different cluster. It is important to understand
the difference between clustering (unsupervised classification)
and discriminant analysis (supervised classification). In
supervised classification, we are provided with a collection of
labeled (pre-classified) patterns; the problem is to label a newly
encountered, yet unlabeled, pattern. Typically, the given
labeled (training) patterns are used to learn the descriptions of
classes which in turn are used to label a new pattern. In the
case of clustering, the problem is to group a given collection of
unlabeled patterns into meaningful clusters. In a sense, labels
are associated with clusters also, but these category labels are
data driven; that is, they are obtained solely from the data.
Clustering is useful in several exploratory pattern-analysis,
grouping, decision-making, and machine-learning situations,
including data mining, document retrieval, image
segmentation, and pattern classification [9].
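The between-cluster and within-cluster scatter quantities that the abstract says a good clustering should respectively maximize and minimize can be illustrated with a small sketch; the 1-D data here are invented for illustration only.

```python
# Hedged sketch: within-cluster and between-cluster scatter for a toy
# 1-D clustering. A good clustering has small within-cluster scatter
# and large between-cluster scatter.
def mean(xs):
    return sum(xs) / len(xs)

def scatter(clusters):
    """Return (within, between) scatter for a list of 1-D clusters."""
    all_points = [x for c in clusters for x in c]
    overall = mean(all_points)
    # Within: squared deviations of points from their own cluster mean.
    within = sum(sum((x - mean(c)) ** 2 for x in c) for c in clusters)
    # Between: size-weighted squared deviations of cluster means from
    # the overall mean.
    between = sum(len(c) * (mean(c) - overall) ** 2 for c in clusters)
    return within, between

w, b = scatter([[1.0, 1.2, 0.8], [5.0, 5.4, 4.6]])
print(w, b)  # tight, well-separated clusters: small w, large b
```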
In [10], supervised learning classifiers such as Naïve Bayes,
SVM with an RBF kernel, RBF neural networks, decision trees
(J48), and simple CART are compared to find the best classifier
for breast cancer datasets (WBC and Breast Tissue). The
experimental results show that the SVM-RBF kernel is more
accurate than the other classifiers; it scores an accuracy of
96.84% on WBC and 99.00% on Breast Tissue. In [11], the
performance of C4.5, Naïve Bayes, Support Vector Machine (SVM),
and K-Nearest Neighbor (K-NN) is compared to find the best
classifier for WBC; SVM proves to be the most accurate, with an
accuracy of 96.99%. In [12], the performance of a decision tree
classifier (CART), with and without feature selection, is
evaluated on the Breast Cancer, WBC, and WDBC datasets. Without
feature selection, CART achieves an accuracy of 69.23% on Breast
Cancer, 94.84% on WBC, and 92.97% on WDBC. With feature
selection (Principal Components Attribute Eval), it scores
70.63% on Breast Cancer, 96.99% on WBC, and 92.09% on WDBC. With
feature selection (ChiSquared Attribute Eval), it scores 69.23%
on Breast Cancer, 94.56% on WBC, and 92.61% on WDBC. In [13],
the C4.5 decision tree method obtains 94.74% accuracy using
10-fold cross validation on the WDBC dataset. In [14], a neural
network classifier is used on the WPBC dataset and achieves an
accuracy of 70.725%. In [15], a hybrid method is proposed that
enhances the classification accuracy on the WDBC dataset to
95.96% with 10-fold cross validation. In [16], a linear
discriminant analysis method obtains 96.8% accuracy on the WDBC
dataset. In [17], an accuracy of 95.06% is obtained with
neuro-fuzzy techniques on the WDBC dataset. In [18], an accuracy
of 95.57% is obtained with a supervised fuzzy clustering
technique on the WDBC dataset.
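For readers unfamiliar with how the k-fold cross-validation accuracies quoted above are produced, the following rough sketch shows the fold mechanics; the "classifier" is a trivial majority-class predictor on synthetic labels, purely for illustration, not any of the methods surveyed.

```python
# Hedged sketch of k-fold cross-validation accuracy estimation with a
# trivial majority-class "classifier" on synthetic labels.
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, k=10):
    accs = []
    for fold in k_fold_indices(len(y), k):
        test = set(fold)
        train_y = [y[i] for i in range(len(y)) if i not in test]
        # Trivial model: predict the majority class of the training folds.
        pred = max(set(train_y), key=train_y.count)
        correct = sum(1 for i in fold if y[i] == pred)
        accs.append(correct / len(fold))
    return sum(accs) / len(accs)

# 70/30 synthetic label split: majority-class accuracy is 0.7 on average.
y = [0] * 70 + [1] * 30
print(round(cross_val_accuracy(None, y), 3))
```

A real evaluation would train the actual classifier on the k-1 training folds in each round instead of taking the majority class.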
The major contributions of the current work are twofold.
First, we have developed a K-means variant that can incorporate
background knowledge in the form of instance-level constraints,
thus demonstrating that this approach is not limited to a single
clustering algorithm. In particular, we have presented our
modifications to the K-means algorithm and demonstrated its
performance on six data sets.
Second, we have used the best feature for classification, so
we can predict the time of cancer recurrence, which is very
important for preventing death due to cancer.
In the next section, we provide some background on
clustering algorithms such as K-means and on SVM. Next, we
describe our methods and results in Sections 3 and 4. Finally,
Section 5 summarizes our contributions [19].
To evaluate the effectiveness of NDR for capturing cluster
structures, two classical clustering algorithms, HC and K-means,
are applied to the reduced feature spaces. These two algorithms
are representative of two widely used clustering approaches. HC
uses agglomerative and divisive strategies and divides data into
a sequence of nested partitions, where partitions at one level
are joined as clusters at the next level; the number of clusters
can be determined immediately at whatever level the application
requires. K-means, by contrast, is one of the most widely used
center-based clustering algorithms; it uses a partitioning
strategy to assign objects to a fixed number of clusters. The
algorithm regards data vectors as a point set in a
high-dimensional space. Given the desired number of clusters,
K-means randomly selects a centroid for each cluster and
allocates each data point to one of the clusters based on its
minimum distance to the centroids. After the necessary
optimization steps, K-means can generate a good clustering
solution [11],[12].
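The K-means steps just described (random centroids, nearest-centroid assignment, mean update, repeat) can be sketched as follows; this is a minimal pure-Python illustration on invented 2-D points, not the implementation used in the paper.

```python
# Minimal K-means sketch: pick random centroids, assign each point to
# its nearest centroid, recompute centroids as cluster means, and
# repeat until the assignment stops changing.
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2
                                  for p, c in zip(pt, centroids[j])))
            for pt in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its cluster.
        for j in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, assignment

pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9)]
centroids, labels = kmeans(pts, k=2)
print(labels)  # the two left points share one label, the two right the other
```

As the K-means subsection below notes, the result depends on the random initialization, which is why restarts or hybrid schemes are used in practice.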
A. K-means
Clustering is an important and popular technique in data
mining. It partitions a set of objects such that objects in the
same cluster are more similar to each other than to objects in
different clusters, according to certain predefined criteria.
K-means is a simple yet efficient method for data clustering.
However, K-means tends to converge to local optima and depends
on the initial values of the cluster centers. Many heuristic
algorithms have been introduced to overcome this local-optima
problem; nevertheless, these algorithms suffer from several
shortcomings of their own [22]. In this paper, we present an
efficient hybrid evolutionary data clustering algorithm,
referred to as K-MCI, in which we combine K-means with modified
cohort intelligence. Our proposed algorithm is tested on several
standard data sets from the UCI Machine Learning Repository, and
its performance is compared with other well-known algorithms
such as K-means, K-means++, cohort intelligence (CI), modified
cohort intelligence (MCI), genetic algorithms (GA), simulated
annealing (SA), tabu search (TS), ant colony optimization (ACO),
honeybee mating optimization (HBMO), and particle swarm
optimization (PSO). The simulation results are very promising in
terms of solution quality and convergence speed.
With the development of clinical technologies, different
tumor features have been collected for breast cancer diagnosis.
Filtering all of the pertinent feature information to support
clinical disease diagnosis is a challenging and time-consuming
task. The objective of this research is to diagnose breast cancer
based on the extracted tumor features. Feature extraction and
selection are critical to the quality of classifiers built
through data mining methods. To extract useful information and
diagnose the tumor, a hybrid of K-means and support vector
machine (K-SVM) algorithms is developed. The K-means algorithm
is used to recognize the hidden patterns of the benign and
malignant tumors separately. The membership of each tumor in
these patterns is calculated and treated as a new feature in the
training model. Then, a support vector machine (SVM) is used to
obtain a new classifier to differentiate incoming tumors. Based
on 10-fold cross validation, the proposed methodology improves
the accuracy to 97.38% when tested on the Wisconsin Diagnostic
Breast Cancer (WDBC) data set from the University of California,
Irvine machine learning repository. Six abstract tumor features
are extracted from the 32 original features for the training
phase. The results not only illustrate the capability of the
proposed approach for breast cancer diagnosis, but also show
time savings during the training phase. Physicians can also
benefit from the mined abstract tumor features by better
understanding the properties of different types of tumors [23].
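The K-SVM feature-construction idea described above can be sketched roughly as follows: cluster each class separately, then use a sample's distance to the learned patterns as new features. The toy 2-D data, the choice of one pattern per class, and the omission of the SVM training step are all simplifications for illustration.

```python
# Hedged sketch of the K-SVM feature construction: learn one "pattern"
# (centroid) per class, then turn each sample into distance-to-pattern
# membership features. The SVM that the paper trains on these features
# is not reproduced here.
import math

def centroid(points):
    return tuple(sum(x) / len(points) for x in zip(*points))

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 2-D tumor measurements, one pattern per class for simplicity.
benign = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
malignant = [(4.0, 4.2), (4.1, 3.9), (3.9, 4.0)]

patterns = [centroid(benign), centroid(malignant)]

def membership_features(sample):
    """Distance of a sample to each class pattern, used as new inputs."""
    return [dist(sample, p) for p in patterns]

feats = membership_features((1.1, 1.0))
print([round(f, 2) for f in feats])  # closer to the benign pattern
```

With more clusters per class (k > 1 in the K-means step), the same idea yields one distance feature per learned pattern.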
In this paper, we have investigated two data mining
techniques, clustering and classification, and used them to
predict the survivability rate on a breast cancer data set, with
the aim of finding the most suitable technique for predicting
the cancer survivability rate [24]. The 24 of the 76 available
features used in our study are listed in Table I.
[Table I, flattened during extraction; the recoverable feature
names include: family history; size of the original tumor;
marital status; lymph nodes; cancer found in the lymph nodes;
tumor protein; radiotherapy; antigen identified by monoclonal
antibody.]
To put the data gathered in a hospital in Tehran into
numerical format, a coding scheme is used. This coding is shown
in Table II for variables describing the patient in general and
in Table III for variables describing the cancer.
[Tables II and III, flattened during extraction; the pairing of
variables with codings is only partly recoverable. Variables
mentioned include: type of abortion; number of pregnancies;
estrogen receptor; progesterone receptor; duration of hormone
use; human epidermal growth factor receptor 2; blood type AB.
Recoverable code groups include: female 0 / male 1; collegiate 1
/ diploma 2 / school 3 / guidance 4 / illiterate 5; criminal 1 /
medical 2 / C/M 3 / unknown 6 / no 9; yes 1 / yes-no 2 / unknown
6 / no 9; single 1 / married 2 / divorced 3 / widowed 4 /
unknown 6; first degree 1 / second degree 2 / yes-unknown 3 /
unknown 6 / no 9; hysterectomy 1 / natural 2; BCS 1 / MRM 2;
AXLND-padding 1 / AXLND-drainage 2 / SLN-drainage 3; stage as a
number on a scale of 0 through IV; grade as a way of classifying
cancer cells; yes 1 / neo 2 / no 9; tamoxifen 1 / letrozole 2 /
aromysin 3 / unknown 6; padding 8 / herceptin 8 / no 9 /
drainage 10 / SLN-padding 12.]
In addition, the coding shown in Table IV is used to encode
the result of the test on each feature.
[Table IV: yes = positive, no = negative.]
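A coding scheme of this kind can be applied programmatically. The sketch below uses a small dictionary of codes taken from the recoverable table entries; the field names (such as `radiotherapy`) are assumptions for illustration, not the paper's exact schema.

```python
# Hedged sketch: applying a coding scheme like Tables II-IV to turn
# categorical patient records into numeric codes. Field names are
# assumed; the numeric codes follow the recoverable table entries.
CODES = {
    "gender": {"female": 0, "male": 1},
    "marital_status": {"single": 1, "married": 2, "divorced": 3,
                       "widowed": 4, "unknown": 6},
    "radiotherapy": {"yes": 1, "yes/no": 2, "unknown": 6, "no": 9},
}

def encode(record):
    """Map each categorical field of a patient record to its code."""
    return {field: CODES[field][value] for field, value in record.items()}

row = encode({"gender": "female", "marital_status": "married",
              "radiotherapy": "yes"})
print(row)
```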
In the next section, the results of clustering and
classification are discussed.
We cluster the dataset into three clusters, each selecting
some of the features. The clustering shows that some features,
such as p53, are more effective than others for predicting
breast cancer. In this study, the models were evaluated based on
the accuracy measures discussed above (classification accuracy,
sensitivity, and specificity). The results were achieved using
74 features for each model and are based on the average results
obtained on the test dataset. 983 of the 1621 patients were
chosen as training data and clustered into three clusters; the
results are given in Table V. The classification algorithm uses
LVI as the label and predicts cancer more precisely; the
resulting accuracy numbers are also listed in Table V. The
simulation results were obtained using RapidMiner.
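The evaluation measures mentioned above (classification accuracy, sensitivity, specificity) can be computed from a binary confusion matrix as follows; the counts in the example are invented for illustration and are not the paper's results.

```python
# Sketch of the evaluation measures for a binary classifier, computed
# from confusion-matrix counts (tp, tn, fp, fn). Example counts are
# made up for illustration.
def evaluate(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

acc, sens, spec = evaluate(tp=80, tn=95, fp=5, fn=20)
print(acc, sens, spec)  # 0.875 0.8 0.95
```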
Feature selection is one of the most effective methods to
enhance data representation and to improve performance in terms
of specified criteria, e.g., generalization classification
accuracy. In the literature, many studies select a subset of
salient features using supervised rather than unsupervised
learning. When class labels are absent during training, feature
selection in unsupervised learning is integral, but its extended
application is rarely studied in the literature. The objective
of this study is to select salient features that can be used to
identify interesting clusters in the analysis of cancer
diagnosis. Specifically, we highlight three qualitative
principles that help users analyze clinical cancer diagnoses
using clusters built from a subset of salient features. First,
the clusters built from a subset of salient features are more
practical and interpretable than those built from all of the
features, which include noise. Second, the clustering results
give clinical doctors an understanding of the context of
clinical cancer diagnoses. Finally, a search for relevant
records based on the clusters obtained when noisy features are
ignored is more efficient. These three principles rely on the
discovery of natural clusters using salient features and are
applicable only to unsupervised learning. To demonstrate the
usefulness of these three qualitative principles, we use
coincident quantitative measurements to analyze the salient
features for discovering clusters. The experiments on the cancer
(Diagnostic) and cancer (Original) datasets demonstrate that the
selected features are effective for discovering natural
clusters. Based on a performance evaluation using well-known
validations in statistical modeling and cluster analysis, our
analysis provides an interesting perspective on feature
selection for discovering clusters.

[Table V, reconstructed from the flattened extraction:
Recurrence — training data: 91.5%; test data: 89.7%.
Label LVI (shown to be the best feature for prediction) —
training data: 97%; test data: 91%.]
[1] M. C. Tayade and M. P. M. Karandikar, "Role of data mining
techniques in the healthcare sector in India," Sch. J. Appl.
Med. Sci. (SJAMS), vol. 1, no. 3, pp. 158–160, 2013.
[2] S. Chakrabarti, M. Ester, U. Fayyad, J. Gehrke, J. Han, S.
Morishita, G. Piatetsky-Shapiro, and W. Wang, "Data mining
curriculum: A proposal (Version 0.91)," 2004.
[3] I. H. Witten and E. Frank, Data Mining: Practical Machine
Learning Tools and Techniques. Morgan Kaufmann, 2005.
[4] M. Kantardzic, Data Mining: Concepts, Models, Methods, and
Algorithms. John Wiley & Sons, 2011.
[5] T. M. Mitchell, "Machine learning and data mining," Commun.
ACM, vol. 42, no. 11, pp. 30–36, 1999.
[6] S. B. Kotsiantis, "Supervised machine learning: a review of
classification techniques," Informatica, vol. 31, no. 3, 2007.
[7] S. G. Jacob and R. G. Ramani, "Efficient classifier for
classification of prognostic breast cancer data through data
mining techniques," in Proceedings of the World Congress on
Engineering and Computer Science, 2012, vol. 1.
[8] X. Wu and V. Kumar, The Top Ten Algorithms in Data Mining.
CRC Press, 2009.
[9] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering:
a review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, 1999.
[10] S. Aruna, D. S. Rajagopalan, and L. V. Nandakishore,
"Knowledge based analysis of various statistical tools in
detecting breast cancer," Comput. Sci. Inf. Technol. (CSIT),
vol. 2, pp. 37–45, 2011.
[11] A. Christobel and Y. Sivaprakasam, "An empirical comparison
of data mining classification methods," Int. J. Comput. Inf.
Syst., vol. 3, 2011.
[12] D. Lavanya and D. K. U. Rani, "Analysis of feature
selection with classification: breast cancer datasets," Indian
J. Comput. Sci. Eng. (IJCSE), 2011.
[13] E. Osuna, R. Freund, and F. Girosi, "Training support
vector machines: an application to face detection," in Proc.
IEEE Computer Society Conf. on Computer Vision and Pattern
Recognition, 1997, pp. 130–136.
[14] V. N. Chunekar and H. P. Ambulgekar, "Approach of neural
network to diagnose breast cancer on three different data sets,"
in Proc. Int. Conf. on Advances in Recent Technologies in
Communication and Computing (ARTCom '09), 2009, pp. 893–895.
[15] D. Lavanya and K. Usha Rani, "Ensemble decision tree …,"
Converg. Serv., vol. 2, no. 1, 2012.
[16] B. Šter and A. Dobnikar, "Neural networks in medical
diagnosis: comparison with other methods," 1996.
[17] T. Joachims, "Transductive inference for text
classification using support vector machines," in Proc. ICML,
1999, vol. 99, pp. 200–209.
[18] J. Abonyi and F. Szeifert, "Supervised fuzzy clustering for
the identification of fuzzy classifiers," Pattern Recognit.
Lett., vol. 24, no. 14, pp. 2195–2207, 2003.
[19] K. Wagstaff, C. Cardie, S. Rogers, and S. Schrödl,
"Constrained K-means clustering with background knowledge," in
Proc. ICML, 2001, vol. 1, pp. 577–584.
[20] J. Shi and Z. Luo, “Nonlinear dimensionality reduction of gene
expression data for visualization and clustering analysis of cancer tissue
samples,” Comput. Biol. Med., vol. 40, no. 8, pp. 723–732, Aug. 2010.
[21] J. B. Tenenbaum, V. De Silva, and J. C. Langford, “A global geometric
framework for nonlinear dimensionality reduction,” Science, vol. 290,
no. 5500, pp. 2319–2323, 2000.
[22] G. Krishnasamy, A. J. Kulkarni, and R. Paramesran, "A
hybrid approach for data clustering based on modified cohort
intelligence and K-means," Expert Syst. Appl., vol. 41, no. 13,
pp. 6009–6016, Oct. 2014.
[23] B. Zheng, S. W. Yoon, and S. S. Lam, “Breast cancer diagnosis based
on feature extraction using a hybrid of K-means and support vector
machine algorithms,” Expert Syst. Appl., vol. 41, no. 4, Part 1, pp.
1476–1482, Mar. 2014.
[24] V. Chaurasia and S. Pal, "Data mining techniques: to
predict and resolve breast cancer survivability," IJCSMC, 2014.
