Data Mining




Knowledge Discovery can be seen as the process of identifying novel, useful and understandable patters in large data sets. The goal of classification [1] is to predict the value (the class) of a user-specified goal attribute based on the values of other attributes, the so-called predicting attributes.


Classification rules can be considered a particular kind of prediction rules where the rule antecedent (“IF part”) contains a combination - typically, a conjunction - of conditions on predicting attribute values, and the rule consequent (“THEN part”) contains a predicted value for the goal attribute.


Complete classification may be infeasible when there are a very large number of class attributes. Partial classification, known as nugget discovery, seeks to find patterns that represent a “strong” description of a particular class. The consequent is fixed to be a particular named class. Given a record t, antecedent(t) is true if t satisfies the predicate antecedent. Similarly consequent(t) is true if t satisfies the predicate consequent. The subsets defined by the antecedent or consequent are the sets of records for which the relevant predicate is true. Three sets of records are defined [1]:


1. A = {t Î D | antecedent(t)} i.e. the set of records defined by the antecedent,

2. B = {t Î D | consequent(t)} i.e. the set of records defined by the consequent,

3. C = {t Î D | antecedent(t) and consequent(t)},


The cardinality of these sets are a,b and c respectively. The confidence conf(r) and the coverage cov(r) of a rule r are:


* conf(r ) = sup(r )/ sup (ra) = c/a


* cov(r ) = sup(r )/ sup (rc) = c/b


A strong rule may be defined as one that meets certain confidence and coverage thresholds normally set by a user.


Partial Classification


Iglesia et al [1] used NSGA-II for nugget discovery. An alternative algorithm ARAC, which can deliver the Pareto global optimal front of all partial classification rules above a specified confidence/coverage threshold, was used for the analysis of the NSGA-II results.


The objectives used are conf(r) and conf(r).


The antecedent comprises a conjunction of Attribute Tests, ATs. A binary encoded string is used to represent the solution as a conjunctive rule. The first part of the string represents the m nominal attributes. Each numeric attribute is represented by a set of Gray-coded lower and upper limits using 10 bits. For all attributes, when the data is loaded, the maximum and minimum values are calculated and stored.


The second part of the string represents categorical attributes, with as many bits for each attribute as distinct values the categorical attribute can take. If a bit assigned to a categorical attribute is set to 0 then the corresponding label is included as an inequality in one of the conjuncts.


To evaluate a solution, the bit string is first decoded, and the data in the database is scanned. For a database with n attributes, the ATs for nominal attributes can be expressed in various forms such as:


ATj = v where v is a value from the domain of ATj, for some 1 ≤ j ≤ n. A database record x meets this simple value test if x[ATj] = v.


ATj ≠ v for some 1 ≤ j ≤ n. A record x meets this inequality test if x[ATj] ≤ v.


A bit string decoded represents to the following format:


IF l1 ≤ AT1 ≤ u1 ÙÙ lm ≤ ATm ≤ um Ù ATm+1 ≠ lm+1 ÙÙ ATn ≠ ln THEN c


where l1 is given by the first p bits of the binary string u1 is given by the following p bits, etc. If a lower limit for the i-th attribute is set to its lowest possible attribute value or the upper limit is set to its highest possible value then there is no need to include that limit in the decoded nugget. If a categorical attribute has a value of 1 for all the bits allocated to its labels then there is no need to include the attribute. For each record the values of the fields are compared against the nuggets, and the class is also compared. The values c, a are updated accordingly whereas b and d are evaluated at the data loading stage. Once all the records have been examined the coverage and confidence are calculated for each nugget.


The initialization of the population with random solutions provided many solutions of very poor quality. The performance of NSGA-II improved if all solutions were initialized with the default rule with some of the bits mutated. The default rule is the rule that predicts the class without any pre-conditions.


Good performance for NSGA-II was obtained for a crossover probability in the range 0.6--0.8. The results are sensitive on the mutation probability and a value of 0.02 was optimal. The optimization requires 80--100 generations using a population size of 120--140 solutions.


The algorithm was applied among other dataset to a Contraceptive Method database, a subset of a contraceptive prevalence survey. The samples are married women who were either not pregnant or did not know if they were at the time of the interview. The problem is to predict the current contraceptive method choice (42% “no use”, 23% “long-term methods”, or 35% “short-term methods” cases) of a woman based on demographic and socio-economic characteristics. The results of NSGA-II were compared with results obtained with ARAC which is able to find all non-dominated solutions, thus the global optimal Pareto front. For large databases the computational time increases rapidly and the complexity is such that its actual computational time is unknown. The classification time by MOEA is proportional to the database size, but the number of rules found is limited by the population size. The results showed that NSGA-II fairly well reproduced the non-dominated front obtained by ARAC.


For large databases, MOEA can be used to find a good approximation of the Pareto optimal set of rules. ARAC can be used to find an initial set of rules, to be used for the initialization of the MOEA algorithm. For large databases, the initial set of solutions found by ARAC may be a constrained set, but then MOEA can be used to drive the search further without constraints. NSGA-II and ARAC can be used in combination for knowledge discovery for large databases for the partial classification task.



References


[1] B. de la Iglesia, G. Richards, M. S. Philpott and V. J. Rayward-Smith, The application and effectiveness of a multi-objective metaheuristic algorithm with respect to data mining task of partial classification, submitted for publication to European Journal of Operational Research, 2003.


BACK