Features
The IlliMine project is an open-source project to provide various tools
for data mining and machine learning. Currently, the project features
algorithms in the following set of areas:
Data Cubing,
Association Mining,
Sequential Mining,
Other. Online demos of some of the algorithms may
be found here. Other resources (publications,
README's, etc) are available here.
|
Data Cubing
Data cubing is process of computing the set of all possible
group-by's from a base table. It facilitates many OLAP operations such
as drilling-down or roll-up. Our package features several algorithms
for computing the full data cube (all group-by's), iceberg cubes
(group-by's satisfying a minimum support value), and closed cubes
(group-by's which are not subsumed by any other group-by's).
BUC
BUC is "Bottom-Up Computation" of full data cubes or iceberg cubes.
The algorithm was published by Beyer and Ramakrishnan in SIGMOD'99.
It uses the so-called Apriori concept to compute iceberg cubes. That
is, if a group-by cannot satisfy the minimum support, none of the
group-by's that are more general than it can satisfy the minimum support
either. And thus, the computation can stop along that particular
group-by branch.
Star-Cubing
Star-Cubing uses a hyper-tree structure, called Star-Tree to facilitate
either full or iceberg cube computation. Each level in the tree
represents a dimension in the base cuboid. The algorithm takes the
advantages of the top-down and bottom-up models. On the global
computation order, it uses the simultaneous aggregation similar to
MultiWay. However, it has a sub-layer underneath based on the bottom-up
model by exploring the notion of shared dimension. This integration
allows the algorithm to aggregate on multiple dimensions, while it can
still partition parent group-by's and use Apriori-based pruning on child
group-by's.
Star-Cubing also introduced a special value "Star" for attributes whose
count is less than the minimum support, which has been proved useful for
skewed data. It performs well on dense, skewed and not-so-sparse data.
Frag-Cubing
Frag-Cubing focuses on the problem of high-dimensional OLAP.
Traditional cubing technologies quickly become ineffective when the
number of dimensions in the data increases. This is a well-known
deficiency. In Frag-Cubing, low dimensional shell-frags (ie, fragments
of the high dimensional cube) are constructed. Combined with inverted
indexes, they facilitate query processing in the original
high-dimensional data cube.
MM-Cubing
MM-Cubing is a density-based method, where MM represents major and minor
values. By distinguishing those values, the lattice space is factorized
into one dense subspace and several sparse subspaces, and MM-Cubing
applies specialized methods for each subspace. The dense subspace is
computed by simultaneous aggregation. The other subspaces are solved by
recursive calls.
C-Cubing
In this algorithm, C-Cubing, a new closedness-checking method is
proposed. With a very small overhead, the closedness information of a
cell can be aggregated as a measure, and the closedness-checking can be
simply done by testing this special measure. The new approach is
entitled aggregation-based c-checking. The new method C-Cubing is
implemented in two successful iceberg cubing algorithms: MM-Cubing and
Star-Cubing.
|
Association Mining
Association mining (frequent itemset mining) is one of the biggest
problems in data mining. An association rule has the form X -> Y where
X and Y are sets of items. The problem of finding association rules
(with support and confidence thresholds) can be viewed as two separate
sub-problems. The first is finding the set of frequent itemsets. The
second is constructing the rules from the frequent itemsets.
Most association mining algorithms focus on the frequent pattern
mining problem. Let I be a set of items, and a transaction
database DB = {T_1, T_2, ..., T_n}, where T_i is a transaction which
contains a set of items in I. The support (or occurrence frequency) of a
pattern A, which is a set of items, is the number of transactions
containing A in DB. A is a frequent pattern if A's support is no less
than a predefined minimum support threshold. Given a transaction
database DB and a minimum support threshold, the problem is to find the
complete set of frequent patterns.
FPGrowth
A frequent pattern tree, or FP-tree for short, is constructed, which is
an extended prex-tree structure storing crucial, quantitative
information about frequent patterns. Only frequent length-1 items will
have nodes in the tree, and the tree nodes are arranged in such a way
that more frequently occurring nodes will have better chances of sharing
nodes than less frequently occurring ones.
An FP-tree-based pattern fragment growth mining method, is developed,
which starts from a frequent length-1 pattern (as an initial prefix
pattern), examines only its conditional pattern base (a sub-database
which consists of the set of frequent items co-occurring with the prefix
pattern), constructs its (conditional) FP-tree, and performs mining
recursively with such a tree. The pattern growth is achieved via
concatenation of the prefix pattern with the new ones generated from a
conditional FP-tree. Since the frequent itemset in any transaction is
always encoded in the corresponding path of the frequent pattern trees,
pattern growth ensures the completeness of the result.
In the IlliMine package, there are 2 different implementations of
FPGrowth.
CLOSET+
An itemset A is a frequent closed itemset if it is frequent and there
exists no proper superset A' such that A's support (frequency count) is
equal to that of A.
CLOSET+ follows the popular divide-and-conquer paradigm and the
depth-first search strategy which has been shown to be a winner for
mining long patterns by several efficient frequent pattern mining
algorithms. It uses FP-tree as the compression technique. A depth-first
search and horizontal format-based method like CLOSET+ will compute the
local frequent items of a certain prefix by building and scanning its
projected database. A hybrid tree-projection method is introduced to
improve the space efficiency.
Unlike frequent itemset mining, during the closed itemset mining process
there may exist some prefix itemsets that are unpromising to be used to
grow closed itemsets. Such unpromising prefix itemsets are detected and
removed as quickly as possible. Other techniques are introduced to
prune the search space and speed up the mining process.
RPMine
A major challenge in frequent pattern mining is the sheer size of its
mining results. In many cases, a high minimum support threshold may
discover only common sense patterns but a low one may generate an
explosive number of output patterns, which severely restricts its usage.
This work studies the problem of compressing frequent-pattern sets.
Typically, frequent patterns can be clustered with a tightness measure
delta (called delta-cluster), and a representative pattern can be
selected for each cluster. Unfortunately, finding a minimum set of
representative patterns is NP-Hard. This works develops two greedy
methods, RPglobal and RPlocal. The former has the guaranteed
compression bound but higher computational complexity. The latter
sacrifices the theoretical bounds but is far more efficient.
Performance study shows that the compression quality using RPlocal is
very close to RPglobal, and both can reduce the number of closed
frequent patterns by almost two orders of magnitude.
|
Sequential Mining
Given a set of sequences, where each sequence consists of a list of
elements and each element consists of a set of items, and given a
user-specified minimum support threshold, sequential pattern mining is
to find all of the frequent subsequences, i.e., the subsequences whose
occurrence frequency in the set of sequences is no less than minimum
support.
PrefixSpan
The general idea here is to examine only the prefix subsequences and
project only their corresponding postfix subsequences into projected
databases. In each projected database, sequential patterns are grown by
exploring only local frequent patterns. To further improve mining
efficiency, two kinds of database projections are explored:
level-by-level projection and bi-level projection, and an optimization
technique which explores pseudo-projection is developed.
CloSpan
Closed sequential patterns are frequent subsequences for which there are
no super-sequences that have the same support. The CloSpan algorithm
aims to discover such patterns.
IncSpan
Many real life sequence databases grow incrementally. It is undesirable
to mine sequential patterns from scratch each time when a small set of
sequences grow, or when some new sequences are added into the database.
Incremental algorithm should be developed for sequential pattern mining
so that mining can be adapted to incremental database updates.
Several novel ideas are introduced in the algorithm development: (1)
maintaining a set of "almost frequent" sequences as the candidates in
the updated database, which has several nice properties and leads to
efficient techniques, and (2) two optimization techniques, reverse
pattern matching and shared projection, are designed to improve the
performance. Reverse pattern matching is used for matching a sequential
pattern in a sequence and prune some search space. Shared projection is
designed to reduce the number of database projections for some sequences
which share a common prefix.
|
Other
gSpan
This algorithm investigate new approaches for frequent graph-based
pattern mining in graph datasets and propose a novel algorithm called
gSpan (graph-based Substructure pattern mining), which discovers
frequent substructures without candidate generation. gSpan builds a new
lexicographic order among graphs, and maps each graph to a unique
minimum DFS code as its canonical label. Based on this lexicographic
order, gSpan adopts the depth-first search strategy to mine frequent
connected subgraphs efficiently.
CPAR
CPAR (Classification based on Predictive Association Rules), which
combines the advantages of both associative classification and traditional
rule-based classification. Instead of generating a large number of
candidate rules as in associative classification, CPAR adopts a greedy
algorithm to generate rules directly from training data. Moreover, CPAR
generates and tests more rules than traditional rule-based classifiers to
avoid missing important rules. To avoid over-fitting, CPAR uses expected
accuracy to evaluate each rule and uses the best k rules in prediction.
|
|
 |
|