Data Mining: Concepts and Techniques (2nd
Jiawei Han and Micheline Kamber
Morgan Kaufmann Publishers, 2006
Bibliographic Notes for Chapter 2
Data preprocessing is discussed in a number of textbooks, including English [Eng99], Pyle [Pyl99], Loshin [Los01],
Redman [Red01], and Dasu and Johnson [DJ03]. More specific references to individual preprocessing techniques
are given below.
Methods for descriptive data summarization have been studied in the statistics literature long before the onset
of computers. Good summaries of statistical descriptive data mining methods include Freedman, Pisani and
Purves [FPP97], and Devore [Dev95]. For statistics-based visualization of data using boxplots, quantile plots,
quantile-quantile plots, scatter plots, and loess curves, see Cleveland [Cle93].
For discussion regarding data quality, see Redman [Red92], Wang, Storey, and Firth [WSF95], Wand and
Wang [WW96], Ballou and Tayi [BT99], and Olson [Ols03]. Potter’s Wheel (http://control.cx.berkely.edu/abc),
the interactive data cleaning tool described in Section 2.3.3, is presented in Raman and Hellerstein [RH01]. An
example of the development of declarative languages for the specification of data transformation operators is given
in Galhardas, Florescu, Shasha, et al. [GFS+ 01]. The handling of missing attribute values is discussed in Friedman
[Fri77], Breiman, Friedman, Olshen, and Stone [BFOS84], and Quinlan [Qui89]. A method for the detection of
outlier or “garbage” patterns in a handwritten character database is given in Guyon, Matic, and Vapnik [GMV96].
Binning and data normalization are treated in many texts, including Kennedy, Lee, Van Roy, et al. [KLV+ 98],
Weiss and Indurkhya [WI98], and Pyle [Pyl99]. Systems that include attribute (or feature) construction include
BACON by Langley, Simon, Bradshaw, and Zytkow [LSBZ87], Stagger by Schlimmer [Sch86], FRINGE by Pagallo
[Pag89], and AQ17-DCI by Bloedorn and Michalski [BM98]. Attribute construction is also described in Liu and
Motoda [LM98, Le98]. Dasu, Johnson, Muthukrishnan, and Shkapenyuk [DJMS02] developed a system called
Bellman wherein they propose a set of methods for building a data quality browser by mining on the structure of
A good survey of data reduction techniques can be found in Barbará, Du Mouchel, Faloutos, et al. [BDF+ 97].
For algorithms on data cubes and their precomputation, see Sarawagi and Stonebraker [SS94], Agarwal, Agrawal,
Deshpande, et al. [AAD+ 96], Harinarayan, Rajaraman, and Ullman [HRU96], Ross and Srivastava [RS97], and
Zhao, Deshpande, and Naughton [ZDN97]. Attribute subset selection (or feature subset selection) is described
in many texts, such as Neter, Kutner, Nachtsheim, and Wasserman [NKNW96], Dash and Liu [DL97], and Liu
and Motoda [LM98, LM98b]. A combination forward selection and backward elimination method was proposed in
Siedlecki and Sklansky [SS88]. A wrapper approach to attribute selection is described in Kohavi and John [KJ97].
Unsupervised attribute subset selection is described in Dash, Liu, and Yao [DLY97]. For a description of wavelets
for dimensionality reduction, see Press, Teukolosky, Vetterling, and Flannery [PTVF96]. A general account of
wavelets can be found in Hubbard [Hub96]. For a list of wavelet software packages, see Bruce, Donoho, and Gao
[BDG96]. Daubechies transforms are described in Daubechies [Dau92]. The book by Press, et al. [PTVF96]
includes an introduction to singular value decomposition for principal components analysis. Routines for PCA are
included in most statistical software packages, such as SAS (www.sas.com/SASHome.html ).
An introduction to regression and log-linear models can be found in several textbooks, such as James [Jam85],
Dobson [Dob01], Johnson and Wichern [JW02], Devore [Dev95], and Neter, Kutner, Nachtsheim, and Wasserman
[NKNW96]. For log-linear models (known as multiplicative models in the computer science literature), see Pearl
[Pea88]. For a general introduction to histograms, see Barbará et al. [BDF+ 97] and Devore and Peck [DP97].
For extensions of single attribute histograms to multiple attributes, see Muralikrishna and DeWitt [MD88] and
Poosala and Ioannidis [PI97]. Several references to clustering algorithms are given in Chapter 7 of this book,
Data Mining: Concepts and Techniques
Han and Kamber, 2006
which is devoted to the topic. A survey of multidimensional indexing structures is given in Gaede and Günther
[GG98]. The use of multidimensional index trees for data aggregation is discussed in Aoki [Aok98]. Index trees
include R-trees (Guttman [Gut84]), quad-trees (Finkel and Bentley [FB74]), and their variations. For discussion
on sampling and data mining, see Kivinen and Mannila [KM94] and John and Langley [JL96].
There are many methods for assessing attribute relevance. Each has its own bias. The information gain measure
is biased towards attributes with many values. Many alternatives have been proposed, such as gain ratio (Quinlan
[Qui93]), which considers the probability of each attribute value. Other relevance measures include the gini
index (Breiman, Friedman, Olshen, and Stone [BFOS84]), the χ2 contingency table statistic, and the uncertainty
coefficient (Johnson and Wichern [JW02]). For a comparison of attribute selection measures for decision tree
induction, see Buntine and Niblett [BN92]. For additional methods, see Liu and Motoda [LM98]b, Dash and Liu
[DL97], and Almuallim and Dietterich [AD91].
Liu, Hussain, Tan, and Dash [LHTD02] performed a comprehensive survey of data discretization methods.
Entropy-based discretization with the C4.5 algorithm is described in Quinlan [Qui93]. In Catlett [Cat91], the D-2
system binarizes a numerical feature recursively. ChiMerge by Kerber [Ker92] and Chi2 by Liu and Setiono [LS95]
are methods for the automatic discretization of numerical attributes that both employ the χ2 statistic. Fayyad and
Irani [FI93] apply the minimum description length principle to determine the number of intervals for numerical
discretization. Concept hierarchies and their automatic generation from categorical data are described in Han and
[AAD+ 96] S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and
S. Sarawagi. On the computation of multidimensional aggregates. In Proc. 1996 Int. Conf. Very
Large Data Bases (VLDB’96), pages 506–521, Bombay, India, Sept. 1996.
H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proc. 1991 Nat. Conf.
Artificial Intelligence (AAAI’91), pages 547–552, Anaheim, CA, July 1991.
P. M. Aoki. Generalizing “search” in generalized search trees. In Proc. 1998 Int. Conf. Data Engineering (ICDE’98), pages 380–389, Orlando, FL, Feb. 1998.
D. Barbará, W. DuMouchel, C. Faloutos, P. J. Haas, J. H. Hellerstein, Y. Ioannidis, H. V. Jagadish,
T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Servcik. The New Jersey data reduction report.
Bull. Technical Committee on Data Engineering, 20:3–45, Dec. 1997.
A. Bruce, D. Donoho, and H.-Y. Gao. Wavelet analysis. In IEEE Spectrum, pages 26–35, Oct 1996.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth
International Group, 1984.
E. Bloedorn and R. S. Michalski. Data-driven constructive induction: A methodology and its applications. In H. Liu H. Motoda, editor, Feature Selection for Knowledge Discovery and Data Mining.
Kluwer Academic, 1998.
W. L. Buntine and T. Niblett. A further comparison of splitting rules for decision-tree induction.
Machine Learning, 8:75–85, 1992.
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Comm. ACM,
J. Catlett. Megainduction: Machine Learning on Very large Databases. Ph.D. Thesis, University of
W. Cleveland. Visualizing Data. Hobart Press, 1993.
I. Daubechies. Ten Lectures on Wavelets. Capital City Press, 1992.
J. L. Devore. Probability and Statistics for Engineering and the Science (4th ed.). Duxbury Press,
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or how
to build a data quality browser. In Proc. 2002 ACM-SIGMOD Int. Conf. on Management of Data
(SIGMOD’02), pages 240–251, Madison, WI, June 2002.
M. Dash and H. Liu. Feature selection methods for classification. Intelligent Data Analysis, 1:131–156,
Data Mining: Concepts and Techniques
Han and Kamber, 2006
M. Dash, H. Liu, and J. Yao. Dimensionality reduction of unsupervised data. In Proc. 1997 IEEE Int.
Conf. Tools with AI (ICTAI’97), pages 532–539, IEEE Computer Society, 1997.
A. J. Dobson. An Introduction to Generalized Linear Models (2nd ed.). Chapman and Hall, 2001.
J. Devore and R. Peck. Statistics: The Exploration and Analysis of Data. Duxbury Press, 1997.
L. English. Improving Data Warehouse and Business Information Quality: Methods for Reducing
Costs and Increasing Profits. John Wiley & Sons, 1999.
R. A. Finkel and J. L. Bentley. Quad-trees: A data structure for retrieval on composite keys. ACTA
Informatica, 4:1–9, 1974.
U. Fayyad and K. Irani. Multi-interval discretization of continuous-values attributes for classification
learning. In Proc. 1993 Int. Joint Conf. Artificial Intelligence (IJCAI’93), pages 1022–1029, Chambery,
D. Freedman, R. Pisani, and R. Purves. Statistics (3rd ed.). W. W. Norton & Co., 1997.
J. H. Friedman. A recursive partitioning decision rule for nonparametric classifiers. IEEE Trans. on
Comp., 26:404–408, 1977.
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language,
model, and algorithms. In Proc. 2001 Int. Conf. on Very Large Data Bases (VLDB’01), pages 371–380,
Rome, Italy, Sept. 2001.
V. Gaede and O. Günther. Multidimensional access methods. ACM Comput. Surv., 30:170–231, 1998.
I. Guyon, N. Matic, and V. Vapnik. Discoverying informative patterns and data cleaning. In U. M.
Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 181–203. AAAI/MIT Press, 1996.
A. Guttman. R-tree: A dynamic index structure for spatial searching. In Proc. 1984 ACM-SIGMOD
Int. Conf. Management of Data (SIGMOD’84), pages 47–57, Boston, MA, June 1984.
J. Han and Y. Fu. Dynamic generation and refinement of concept hierarchies for knowledge discovery in
databases. In Proc. AAAI’94 Workshop Knowledge Discovery in Databases (KDD’94), pages 157–168,
Seattle, WA, July 1994.
V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. In Proc. 1996
ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’96), pages 205–216, Montreal, Canada,
B. B. Hubbard. The World According to Wavelets. A. K. Peters, 1996.
M. James. Classification Algorithms. John Wiley & Sons, 1985.
G. H. John and P. Langley. Static versus dynamic sampling for data mining. In Proc. 1996 Int. Conf.
Knowledge Discovery and Data Mining (KDD’96), pages 367–370, Portland, OR, Aug. 1996.
R. A. Johnson and D. A. Wichern. Applied Multivariate Statistical Analysis (5th ed.). Prentice Hall,
R. Kerber. Discretization of numeric attributes. In Proc. 1992 Nat. Conf. Artificial Intelligence
(AAAI’92), pages 123–128, AAAI/MIT Press, 1992.
R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97:273–324,
R. L Kennedy, Y. Lee, B. Van Roy, C. D. Reed, and R. P. Lippman. Solving Data Mining Problems
Through Pattern Recognition. Prentice Hall, 1998.
Chapter 2 Data Preprocessing
J. Kivinen and H. Mannila. The power of sampling in knowledge discovery. In Proc. 13th ACM Symp.
Principles of Database Systems, pages 77–85, Minneapolis, MN, May 1994.
H. Liu and H. Motoda (eds.). Feature Extraction, Construction, and Selection: A Data Mining Perspective. Kluwer Academic, 1998.
[LHTD02] H. Liu, F. Hussain, C. L. Tan, and M. Dash. Discretization: An enabling technique. Data Mining and
Knowledge Discovery, 6:393–423, 2002.
H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic,
D. Loshin. Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann,
H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In Proc. 1995
IEEE Int. Conf. Tools with AI (ICTAI’95), pages 388–391, Washington, DC, Nov. 1995.
P. Langley, H. A. Simon, G. L. Bradshaw, and J. M. Zytkow. Scientific Discovery: Computational
Explorations of the Creative Processes. MIT Press, 1987.
M. Muralikrishna and D. J. DeWitt. Equi-depth histograms for extimating selectivity factors for multidimensional queries. In Proc. 1988 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’88),
pages 28–36, Chicago, IL, June 1988.
[NKNW96] J. Neter, M. H. Kutner, C. J. Nachtsheim, and L. Wasserman. Applied Linear Statistical Models (4th
ed.). Irwin, 1996.
J. E. Olson. Data Quality: The Accuracy Dimension. Morgan Kaufmann, 2003.
G. Pagallo. Learning DNF by decision trees. In Proc. 1989 Int. Joint Conf. Artificial Intelligence
(IJCAI’89), pages 639–644, Morgan Kaufmann, 1989.
J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kauffman, 1988.
V. Poosala and Y. Ioannidis. Selectivity estimation without the attribute value independence assumption. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB’97), pages 486–495, Athens, Greece,
[PTVF96] W. H. Press, S. A. Teukolosky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes in C: The
Art of Scientific Computing. Cambridge University Press, 1996.
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
J. R. Quinlan. Unknown attribute values in induction. In Proc. 1989 Int. Conf. Machine Learning
(ICML’89), pages 164–168, Ithaca, NY, June 1989.
J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
T. Redman. Data Quality: Management and Technology. Bantam Books, 1992.
T. Redman. Data Quality: The Field Guide. Digital Press (Elsevier), 2001.
V. Raman and J. M. Hellerstein. Potter’s wheel: An interactive data cleaning system. In Proc. 2001
Int. Conf. on Very Large Data Bases (VLDB’01), pages 381–390, Rome, Italy, Sept. 2001.
K. Ross and D. Srivastava. Fast computation of sparse datacubes. In Proc. 1997 Int. Conf. Very Large
Data Bases (VLDB’97), pages 116–125, Athens, Greece, Aug. 1997.
J. C. Schlimmer. Learning and representation change. In Proc. 1986 Nat. Conf. Artificial Intelligence
(AAAI’86), pages 511–515, Philadelphia, PA, 1986.
Data Mining: Concepts and Techniques
Han and Kamber, 2006
W. Siedlecki and J. Sklansky. On automatic feature selection. Int. J. Pattern Recognition and Artificial
Intelligence, 2:197–220, 1988.
S. Sarawagi and M. Stonebraker. Efficient organization of large multidimensional arrays. In Proc. 1994
Int. Conf. Data Engineering (ICDE’94), pages 328–336, Houston, TX, Feb. 1994.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans.
Knowledge and Data Engineering, 7:623–640, 1995.
Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Comm. ACM,
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’97),
pages 159–170, Tucson, AZ, May 1997.