Data Mining - Motivation

"Computers have promised us a fountain of wisdom but delivered a flood of data."
"It has been estimated that the amount of information in the world doubles every 20 months." (Frawley, Piatetsky-Shapiro, Matheus, 1992)

Knowledge Discovery in Databases (KDD)

● Mining for nuggets of knowledge in mountains of data.

Definition

● Data Mining is a non-trivial process of identifying
  ● valid,
  ● novel,
  ● potentially useful,
  ● ultimately understandable
  patterns in data. (Fayyad et al. 1996)
● It employs techniques from
  ● machine learning
  ● statistics
  ● databases
● Or maybe: Data Mining is torturing your database until it confesses. (Mannila (?))

Knowledge Discovery in Databases: Key Steps

Key steps in the Knowledge Discovery cycle:
1. Data Cleaning: remove noise and inconsistent data
2. Data Integration: combine multiple data sources
3. Data Selection: select the parts of the data that are relevant for the problem
4. Data Transformation: transform the data into a suitable format (e.g., a single table, by summary or aggregation operations)
5. Data Mining: apply machine learning and machine discovery techniques
6. Pattern Evaluation: evaluate whether the found patterns meet the requirements (e.g., interestingness)
7. Knowledge Presentation: present the mined knowledge to the user (e.g., visualization)

Data Mining is a Process!

● The steps are not followed linearly, but in an iterative process.
[Figure: the KDD process. Source: http://alg.ncsa.uiuc.edu/tools/docs/d2k/manual/dataMining.html, after Fayyad, Piatetsky-Shapiro, Smyth, 1996]

Another Process Model

[Figure: the CRISP-DM process model. Source: http://www.crisp-dm.org/]

Pre-Processing

● Databases are typically not made to support analysis with a data mining algorithm
  → pre-processing of the data is necessary
● Pre-processing techniques:
  ● Data Cleaning: remove inconsistencies from the data
  ● Feature Engineering: find the right feature/attribute set
    ● Feature Subset Selection: select appropriate feature subsets
    ● Feature Transformation: bring attributes into a suitable form (e.g., discretization)
    ● Feature Construction: construct derived features
  ● Sampling: select appropriate subsets of the data

Unsupervised vs. Supervised Pre-processing

● Unsupervised: do not use information about the learning task
  ● only prior information (from knowledge about the data) and information about the distribution of the training data
● Supervised: use information about the learning task
  ● e.g., look at the relation of an attribute to the class attribute
● WARNING: pre-processing may only use information from the training data!
  ● compute the pre-processing model from the training data
  ● apply the model to training and test data
  ● otherwise information from the test data may be captured in the pre-processing step → biased evaluation
  ● in particular: apply pre-processing separately to every fold in cross-validation (see the sketch after the Unsupervised FSS slide below)

Feature Subset Selection

● Databases are typically not collected with data mining in mind
● Many features may be
  ● irrelevant
  ● uninteresting
  ● redundant
● Removing them can
  ● increase efficiency
  ● improve accuracy
  ● prevent overfitting
● Feature Subset Selection techniques try to determine appropriate features automatically

Unsupervised FSS

● Using domain knowledge
  ● some features may be known to be irrelevant, uninteresting or redundant
● Random Sampling
  ● select a random sample of the features
  ● may be appropriate in the case of many weakly relevant features and/or in connection with ensemble methods
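To make the warning about train/test separation concrete, here is a minimal sketch (not part of the original slides; all function and variable names are made up) of fitting a simple min-max scaling model on the training part of each cross-validation fold only, and then applying it to both the training and the test part of that fold:

```python
import numpy as np

def minmax_fit(X_train):
    """Compute the pre-processing model (per-feature min and max) from training data only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def minmax_apply(X, model):
    """Apply a previously fitted scaling model to arbitrary data (training or test)."""
    lo, hi = model
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def cross_validation_folds(n, k=10, seed=0):
    """Yield (train_idx, test_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

# X and y are assumed to be a feature matrix and a class vector (not defined here):
# for train_idx, test_idx in cross_validation_folds(len(X)):
#     model = minmax_fit(X[train_idx])               # fitted on the training fold only
#     X_tr = minmax_apply(X[train_idx], model)
#     X_te = minmax_apply(X[test_idx], model)        # no information from the test fold leaks in
#     ...                                            # train on X_tr, evaluate on X_te
```

The same pattern applies to any other pre-processing model, e.g., discretization boundaries or a selected feature subset: fit on the training fold, apply to both folds.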
Supervised FSS

● Filter approaches:
  ● compute some measure for estimating the ability to discriminate between classes
  ● typically measure feature weights and select the best n features
  ● problems:
    ● redundant features (correlated features will all have similar weights)
    ● dependent features (some features may only be important in combination, e.g., XOR/parity problems)
● Wrapper approaches:
  ● search through the space of all possible feature subsets
  ● each searched subset is tried with the learning algorithm

Supervised FSS: Filters

● for each attribute A: W[A] = feature weight according to some measure of discrimination
  ● e.g., decision tree splitting criteria (entropy/information gain, Gini index, ...), attribute weighting criteria (Relief, ...), etc.
● select the n features with the highest W[A]
● Basic idea:
  ● a good attribute should discriminate between the different classes
  ● use a measure of discrimination to determine which attributes to select
● Disadvantage:
  ● the quality of each attribute is measured in isolation
  ● some attributes may only be useful in combination with others

RELIEF (Kira & Rendell, ICML-92)

● Basic idea: in a local neighborhood around an example R, a good attribute A should allow to
  ● discriminate R from all examples of different classes (the set of misses)
    ● therefore the probability that the attribute has a different value for R and a miss M should be high
  ● have the same value for all examples of the same class as R (the set of hits)
    ● therefore the probability that the attribute has a different value for R and a hit H should be low
● → try to estimate and maximize
  $W[A] = P(a_R \neq a_M) - P(a_R \neq a_H)$
  where $a_X$ is the value of attribute A in example X

RELIEF (Kira & Rendell, ICML-92)

● set all attribute weights W[A] = 0.0
● for i = 1 to m (← user-settable parameter)
  ● select a random example R
  ● find
    ● H: nearest neighbor of the same class (near hit)
    ● M: nearest neighbor of a different class (near miss)
  ● for each attribute A
    $W[A] \leftarrow W[A] - \frac{d(A,H,R)}{m} + \frac{d(A,M,R)}{m}$
    where d(A,X,Y) is the distance in attribute A between examples X and Y (normalized to the [0,1] range)
● (a code sketch of Relief follows after the Properties slide below)

FSS: Wrapper Approach (John, Kohavi, Pfleger, ICML-94)

● Wrapper Approach:
  ● try a feature subset with the learner
  ● improve it by modifying the feature sets based on the result
  ● repeat
[Figure by Kohavi & John]

FSS: Wrapper Approach

● Forward selection (also sketched in code below):
  1. start with an empty feature set F
  2. for each attribute A
     a) F = F ∪ {A}
     b) estimate the accuracy of the learning algorithm on F
     c) F = F \ {A}
  3. F = F ∪ {attribute with the highest estimated accuracy}
  4. goto 2. unless the estimated accuracy decreases significantly
● Backward elimination:
  ● start with the full feature set F
  ● try to remove attributes
● Bi-directional search is also possible

Example: Forward Search

● Attrs: current set of attributes; Est: accuracy estimated by the wrapper; Real: "real" accuracy
[Figure by John, Kohavi & Pfleger]

Properties

● Advantages:
  ● finds a feature set that is tailored to the learning algorithm
  ● considers combinations of features, not only individual feature weights
  ● can eliminate redundant features (picks only as many as the algorithm needs)
● Disadvantage:
  ● very inefficient: many learning cycles necessary
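The Relief weight update above translates almost directly into code. The following is a minimal sketch, assuming a numeric feature matrix whose attributes are already scaled to [0,1], so that the per-attribute distance d(A, X, Y) is simply the absolute difference; the function name and parameters are illustrative, not prescribed by the original algorithm:

```python
import numpy as np

def relief(X, y, m=100, seed=0):
    """Relief feature weighting (Kira & Rendell, 1992) for numeric features scaled to [0,1]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(m):                           # m randomly drawn reference examples
        r = rng.integers(n)
        dist = np.abs(X - X[r]).sum(axis=1)      # total distance of every example to R
        dist[r] = np.inf                         # exclude R itself
        same, diff = (y == y[r]), (y != y[r])
        H = np.where(same, dist, np.inf).argmin()   # near hit: closest example of the same class
        M = np.where(diff, dist, np.inf).argmin()   # near miss: closest example of another class
        # attribute-wise update: reward separation from the miss, punish separation from the hit
        W += (np.abs(X[r] - X[M]) - np.abs(X[r] - X[H])) / m
    return W

# Filter-style usage: keep the n features with the highest Relief weights, e.g.
# weights = relief(X, y); best = np.argsort(weights)[::-1][:n]
```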
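The forward-selection wrapper can be sketched just as compactly. Here `estimate_accuracy` stands for whatever accuracy estimate the wrapper uses (e.g., cross-validation with the chosen learner) and is assumed rather than defined on the slides; the stopping rule is simplified to "stop when accuracy no longer improves":

```python
def forward_selection(attributes, estimate_accuracy, min_improvement=0.0):
    """Greedy forward selection: grow the feature set F one attribute at a time,
    always adding the attribute whose addition gives the highest estimated accuracy."""
    F, best_acc = set(), 0.0
    while True:
        candidates = [(estimate_accuracy(F | {A}), A) for A in attributes - F]
        if not candidates:
            return F                              # all attributes selected
        acc, A = max(candidates)                  # best single extension of F
        if acc <= best_acc + min_improvement:     # stop when accuracy no longer improves
            return F
        F, best_acc = F | {A}, acc

# 'attributes' is a set of attribute names/indices; estimate_accuracy(features)
# would internally train and evaluate the learner using only the given feature subset.
```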
Comparison Wrapper / Relief

● Note: RelieveD is a version of Relief that uses all examples instead of a random sample
● on these datasets: forward selection reduces the number of attributes without an increase in error
● in general, it may also reduce the error
[Figure by John, Kohavi & Pfleger]

Feature Transformation

● bring features into a usable form
● numerization: some algorithms can only use numeric data
  ● nominal → binary: a nominal attribute with n values is converted into n binary attributes
  ● binary → numeric: binary features may be viewed as special cases of numeric attributes with two values
● discretization: some algorithms can only use categorical data
  ● transform numeric attributes into a number of (ordered) categorical values

Discretization

● Supervised vs. Unsupervised:
  ● Unsupervised: only look at the distribution of values of the attribute
  ● Supervised: also consider the relation of attribute values to class values
● Merging vs. Splitting:
  ● Merging (bottom-up discretization): start with a set of intervals (e.g., each point is an interval) and successively combine neighboring intervals
  ● Splitting (top-down discretization): start with a single interval and successively split the interval into sub-intervals

Unsupervised Discretization

● domain-dependent:
  ● suitable discretizations are often known
  ● age (0-18) → baby (0-3), child (3-6), school child (6-10), teenager (11-18)
● equal-width:
  ● divide the value range into a number of intervals of equal width
  ● age (0-18) → (0-3, 4-7, 8-11, 12-15, 16-18)
● equal-frequency:
  ● divide the value range into a number of intervals so that (approximately) the same number of data points fall into each interval
  ● e.g., N = 5: each interval will contain 20% of the training data
  ● good for non-uniform distributions (e.g., salary)

Supervised Discretization: Chi-Merge (Kerber, AAAI-92)

● Basic Idea: merge neighboring intervals if the class information is independent of the interval an example belongs to
● initialization:
  ● sort examples according to the feature value
  ● construct one interval for each value
● interval merging:
  ● compute the $\chi^2$ value for each pair of adjacent intervals:
    $\chi^2 = \sum_{i=1}^{2} \sum_{j=1}^{c} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$ with $N_i = \sum_{j=1}^{c} A_{ij}$, $C_j = \sum_{i=1}^{2} A_{ij}$, $E_{ij} = \frac{N_i \cdot C_j}{N}$
    where $A_{ij}$ = number of examples in the i-th interval that are of class j, and $E_{ij}$ = expected number of examples in the i-th interval that are of class j = number of examples in the i-th interval ($N_i$) × fraction ($C_j / N$) of (all) examples of class j
  ● merge the pair with the lowest $\chi^2$ value
  ● stop when the $\chi^2$ values of all pairs exceed a significance threshold

Supervised Discretization: Entropy-Split (Fayyad & Irani, IJCAI-93)

● Basic Idea: grow a decision tree using a single numeric attribute and use the value ranges in the leaves as ordinal values
● initialization:
  ● initialize the intervals with a single interval covering all examples S
  ● sort all examples according to the attribute value
  ● initialize the set of possible split points
    ● simple: all values
    ● more efficient: only between class changes in the sorted list
● interval splitting (sketched in code below):
  ● select the split point T with the minimum weighted entropy
    $T_{max} = \arg\min_T \left( \frac{|S_{A<T}|}{|S|} \, Entropy(S_{A<T}) + \frac{|S_{A \geq T}|}{|S|} \, Entropy(S_{A \geq T}) \right)$
  ● recursively apply Entropy-Split to $S_{A<T_{max}}$ and $S_{A \geq T_{max}}$
  ● stop
    ● when a given number of splits is achieved, or
    ● when splitting would yield too small intervals, or
    ● via an MDL-based stopping criterion (Fayyad & Irani, 1993)
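As an illustration of the splitting procedure, here is a compact sketch of recursive entropy-based splitting for a single numeric attribute. It stops at a fixed recursion depth and a minimum interval size; the MDL-based criterion of Fayyad & Irani is omitted, and all names are mine rather than from the original paper. The commented call at the end uses the temperature data from the following example slide:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Class entropy of a sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def entropy_split(values, labels, max_depth=3, min_size=2):
    """Return cut points for one numeric attribute via simplified entropy-based splitting."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]

    def split(lo, hi, depth):
        if depth == 0 or hi - lo < 2 * min_size:
            return []
        def weighted(i):                              # weighted entropy of the two half-intervals
            n = hi - lo
            return (i - lo) / n * entropy(y[lo:i]) + (hi - i) / n * entropy(y[i:hi])
        i = min(range(lo + min_size, hi - min_size + 1), key=weighted)
        cut = (v[i - 1] + v[i]) / 2.0                 # threshold between adjacent sorted values
        return split(lo, i, depth - 1) + [cut] + split(i, hi, depth - 1)

    return split(0, len(v), max_depth)

# Example (temperature data from the next slide):
# entropy_split([64,65,68,69,70,71,72,72,75,75,80,81,83,85],
#               ["y","n","y","y","y","n","n","y","y","y","n","y","y","n"])
```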
Example

Temperature:  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Play:         Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
(Slide taken from Witten & Frank)

Unsupervised Feature Construction

● based on domain knowledge
  ● Example: Body Mass Index, BMI = weight [kg] / height [m]²
● automatic; examples:
  ● kernel functions
    ● may be viewed as feature construction modules → support vector machines
  ● principal components analysis
    ● transforms an n-dimensional space into a lower-dimensional subspace without losing much information
  ● GLEM
    ● uses an Apriori-like algorithm to compute all conjunctive combinations of basic features that occur at least n times
    ● application: constructing evaluation functions for the game Othello

Supervised Feature Construction

● use the class information to construct features that help to solve the classification problem
● Examples:
  ● Wrapper approach
    ● use a rule or decision tree learning algorithm
    ● observe frequently co-occurring features or feature values
    ● encode them as separate features
  ● Neural Network
    ● nodes in hidden layers may be interpreted as constructed features

Scalability

● databases are often too big for machine learning algorithms
  ● ML algorithms require frequent counting operations and multidimensional access to data
  ● only feasible for data that can be held in main memory
● two strategies to make DM algorithms scalable:
  ● design algorithms that are explicitly targeted towards minimizing the number of database operations (e.g., Apriori)
  ● use sampling to work on subsets of the data

Sampling

● Random Sampling
  ● select a random subset of the data
● Progressive Sampling
  ● start with a small sample and increase the sample size
    ● arithmetic: increase the sample size by a fixed number of examples
    ● geometric: multiply the size by a fixed factor (e.g., double the size)
  ● stop when convergence is detected
● Sequential Sampling
  ● rule out solution candidates based on significant differences

Windowing

● Idea:
  ● focus the learner on the parts of the search space that are not yet correctly covered
● Algorithm (a code sketch closes this section):
  1. Initialize the window with a random subsample of the available data
  2. Learn a theory from the current window
  3. If the learned theory correctly classifies all examples (including those outside of the window), return the theory
  4. Add some misclassified examples to the window and goto 2.
● Properties:
  ● may learn a good theory from a subset of the data
  ● problems with noisy data

Follow-Up Courses (Weiterführende Lehrveranstaltungen)

● Computer Poker Challenge, consisting of:
  ● Seminar KE und ML in Spielen
  ● Praktikum aus Künstliche Intelligenz
  ● NOTE: starts in the second week of March!
  ● participation in an international computer poker competition
● Maschinelles Lernen: Statistische Verfahren 1 + 2 (Roth/Schiele)
● Neural Networks (Stibor)
● Einführung in die Künstliche Intelligenz (Fürnkranz, 3+1)
● Web Mining (next offered in SS09)
● Data und Knowledge Engineering (A. Buchmann, Fürnkranz)
● Seminars (varying topics, e.g., Mining in Graphs)
● Student assistants (Hiwis) wanted!
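Finally, referring back to the Windowing slide above, a minimal sketch of the windowing loop; `learn` and `classifies_correctly` are placeholders for an arbitrary learner and its prediction check, and the parameter values are arbitrary:

```python
import random

def windowing(examples, learn, classifies_correctly, init_size=100, add_per_round=50):
    """Windowing: learn from a growing subsample until the theory covers all examples."""
    random.seed(0)
    window = random.sample(examples, min(init_size, len(examples)))
    while True:
        theory = learn(window)                                    # learn from the current window
        errors = [e for e in examples if not classifies_correctly(theory, e)]
        if not errors:                                            # consistent with all data
            return theory
        window += random.sample(errors, min(add_per_round, len(errors)))
```

As noted on the slide, this loop assumes that a consistent theory can eventually be learned; with noisy data it may never terminate, which is why windowing is problematic there.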