About IlliMine Demo Download Download


Features

The IlliMine project is an open-source project to provide various tools for data mining and machine learning. Currently, the project features algorithms in the following set of areas: Data Cubing, Association Mining, Sequential Mining, Other. Online demos of some of the algorithms may be found here. Other resources (publications, README's, etc) are available here.


Data Cubing

Data cubing is process of computing the set of all possible group-by's from a base table. It facilitates many OLAP operations such as drilling-down or roll-up. Our package features several algorithms for computing the full data cube (all group-by's), iceberg cubes (group-by's satisfying a minimum support value), and closed cubes (group-by's which are not subsumed by any other group-by's).

BUC

BUC is "Bottom-Up Computation" of full data cubes or iceberg cubes. The algorithm was published by Beyer and Ramakrishnan in SIGMOD'99. It uses the so-called Apriori concept to compute iceberg cubes. That is, if a group-by cannot satisfy the minimum support, none of the group-by's that are more general than it can satisfy the minimum support either. And thus, the computation can stop along that particular group-by branch.

Star-Cubing

Star-Cubing uses a hyper-tree structure, called Star-Tree to facilitate either full or iceberg cube computation. Each level in the tree represents a dimension in the base cuboid. The algorithm takes the advantages of the top-down and bottom-up models. On the global computation order, it uses the simultaneous aggregation similar to MultiWay. However, it has a sub-layer underneath based on the bottom-up model by exploring the notion of shared dimension. This integration allows the algorithm to aggregate on multiple dimensions, while it can still partition parent group-by's and use Apriori-based pruning on child group-by's.
Star-Cubing also introduced a special value "Star" for attributes whose count is less than the minimum support, which has been proved useful for skewed data. It performs well on dense, skewed and not-so-sparse data.

Frag-Cubing

Frag-Cubing focuses on the problem of high-dimensional OLAP. Traditional cubing technologies quickly become ineffective when the number of dimensions in the data increases. This is a well-known deficiency. In Frag-Cubing, low dimensional shell-frags (ie, fragments of the high dimensional cube) are constructed. Combined with inverted indexes, they facilitate query processing in the original high-dimensional data cube.

MM-Cubing

MM-Cubing is a density-based method, where MM represents major and minor values. By distinguishing those values, the lattice space is factorized into one dense subspace and several sparse subspaces, and MM-Cubing applies specialized methods for each subspace. The dense subspace is computed by simultaneous aggregation. The other subspaces are solved by recursive calls.

C-Cubing

In this algorithm, C-Cubing, a new closedness-checking method is proposed. With a very small overhead, the closedness information of a cell can be aggregated as a measure, and the closedness-checking can be simply done by testing this special measure. The new approach is entitled aggregation-based c-checking. The new method C-Cubing is implemented in two successful iceberg cubing algorithms: MM-Cubing and Star-Cubing.

Association Mining

Association mining (frequent itemset mining) is one of the biggest problems in data mining. An association rule has the form X -> Y where X and Y are sets of items. The problem of finding association rules (with support and confidence thresholds) can be viewed as two separate sub-problems. The first is finding the set of frequent itemsets. The second is constructing the rules from the frequent itemsets.

Most association mining algorithms focus on the frequent pattern mining problem. Let I be a set of items, and a transaction database DB = {T_1, T_2, ..., T_n}, where T_i is a transaction which contains a set of items in I. The support (or occurrence frequency) of a pattern A, which is a set of items, is the number of transactions containing A in DB. A is a frequent pattern if A's support is no less than a predefined minimum support threshold. Given a transaction database DB and a minimum support threshold, the problem is to find the complete set of frequent patterns.

FPGrowth

A frequent pattern tree, or FP-tree for short, is constructed, which is an extended prex-tree structure storing crucial, quantitative information about frequent patterns. Only frequent length-1 items will have nodes in the tree, and the tree nodes are arranged in such a way that more frequently occurring nodes will have better chances of sharing nodes than less frequently occurring ones.
An FP-tree-based pattern fragment growth mining method, is developed, which starts from a frequent length-1 pattern (as an initial prefix pattern), examines only its conditional pattern base (a sub-database which consists of the set of frequent items co-occurring with the prefix pattern), constructs its (conditional) FP-tree, and performs mining recursively with such a tree. The pattern growth is achieved via concatenation of the prefix pattern with the new ones generated from a conditional FP-tree. Since the frequent itemset in any transaction is always encoded in the corresponding path of the frequent pattern trees, pattern growth ensures the completeness of the result.
In the IlliMine package, there are 2 different implementations of FPGrowth.

CLOSET+

An itemset A is a frequent closed itemset if it is frequent and there exists no proper superset A' such that A's support (frequency count) is equal to that of A.
CLOSET+ follows the popular divide-and-conquer paradigm and the depth-first search strategy which has been shown to be a winner for mining long patterns by several efficient frequent pattern mining algorithms. It uses FP-tree as the compression technique. A depth-first search and horizontal format-based method like CLOSET+ will compute the local frequent items of a certain prefix by building and scanning its projected database. A hybrid tree-projection method is introduced to improve the space efficiency.
Unlike frequent itemset mining, during the closed itemset mining process there may exist some prefix itemsets that are unpromising to be used to grow closed itemsets. Such unpromising prefix itemsets are detected and removed as quickly as possible. Other techniques are introduced to prune the search space and speed up the mining process.

RPMine

A major challenge in frequent pattern mining is the sheer size of its mining results. In many cases, a high minimum support threshold may discover only common sense patterns but a low one may generate an explosive number of output patterns, which severely restricts its usage.
This work studies the problem of compressing frequent-pattern sets. Typically, frequent patterns can be clustered with a tightness measure delta (called delta-cluster), and a representative pattern can be selected for each cluster. Unfortunately, finding a minimum set of representative patterns is NP-Hard. This works develops two greedy methods, RPglobal and RPlocal. The former has the guaranteed compression bound but higher computational complexity. The latter sacrifices the theoretical bounds but is far more efficient. Performance study shows that the compression quality using RPlocal is very close to RPglobal, and both can reduce the number of closed frequent patterns by almost two orders of magnitude.

Sequential Mining

Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified minimum support threshold, sequential pattern mining is to find all of the frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than minimum support.

PrefixSpan

The general idea here is to examine only the prefix subsequences and project only their corresponding postfix subsequences into projected databases. In each projected database, sequential patterns are grown by exploring only local frequent patterns. To further improve mining efficiency, two kinds of database projections are explored: level-by-level projection and bi-level projection, and an optimization technique which explores pseudo-projection is developed.

CloSpan

Closed sequential patterns are frequent subsequences for which there are no super-sequences that have the same support. The CloSpan algorithm aims to discover such patterns.

IncSpan

Many real life sequence databases grow incrementally. It is undesirable to mine sequential patterns from scratch each time when a small set of sequences grow, or when some new sequences are added into the database. Incremental algorithm should be developed for sequential pattern mining so that mining can be adapted to incremental database updates.
Several novel ideas are introduced in the algorithm development: (1) maintaining a set of "almost frequent" sequences as the candidates in the updated database, which has several nice properties and leads to efficient techniques, and (2) two optimization techniques, reverse pattern matching and shared projection, are designed to improve the performance. Reverse pattern matching is used for matching a sequential pattern in a sequence and prune some search space. Shared projection is designed to reduce the number of database projections for some sequences which share a common prefix.

Other

gSpan

This algorithm investigate new approaches for frequent graph-based pattern mining in graph datasets and propose a novel algorithm called gSpan (graph-based Substructure pattern mining), which discovers frequent substructures without candidate generation. gSpan builds a new lexicographic order among graphs, and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent connected subgraphs efficiently.

CPAR

CPAR (Classification based on Predictive Association Rules), which combines the advantages of both associative classification and traditional rule-based classification. Instead of generating a large number of candidate rules as in associative classification, CPAR adopts a greedy algorithm to generate rules directly from training data. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid over-fitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction.

Features

Overview
Data Cubing
Association Mining
Sequential Mining
Other

About

History
Licence
Contributors






Home © 2005 University of Illinois