Author: Jian Pei
, Jianyong Wang
Sampled data: N/A
Sequential Pattern Mining Using PrefixSpan
Prepared by Jianyong Wang, UIUC
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu
1. System usage
1.1 Input parameters
1st argument: Dataset file name
2nd argument: Relative support in decimal
3rd argument: Largest item number found in sequences
Usage example: prefixspan C10T8S8I8.data 0.005 1000
prefixspan is the executable file name (In this package, we have two
executable files, prefixspan.exe can output the frequent sequences while
prefixspan_no_output.exe does not output the frequent sequences).
C10T8S8I8.data is the dataset file being mined, 0.005 is the relative
support, 1000 is the largest itemID in dataset file (i.e., there are
totally 1000 distinct items in C10T8S8I8.data).
As to the dataset file format, see section 1.3.
There are two programs in this package, prefixspan and
prefixspan_no_output will not output the frequent sequences. For prefixspan,
the discovered frequent sequences are printed into a file called
"frequent.dat". In addition, if program encounters errors, they will be
printed to a file named "error.tmp". Program status as it is executing and
the final results (such as timing) are printed to stdout (console).
Each line in the result file, "frequent.dat", contains a frequent sequence
in the form:
(itemID00, ... , itemID0i) (itemID10, ... , itemID1j) ...
(itemIDm0, ... , itemIDmn) : Relative support
Here is an example:
(242) (17 261) : 0.004087
1.3 Input file format
PrefixSpan takes binary input file format. Each integer is represented by
4 bytes, in the endian format of the machine.
Usually a sequence database consists of a series of sequences, and each
sequence is composed of several transactions (also called events) sorted
in time ascending order. Each distinct item in the database should be
assigned an itemID numbered beginning from 0, and the items in each
transaction should be sorted in ascending order according to their itemID.
Also, the sequential database being mined is in binary format, each sequence
ends with a -2 and every transaction is followed by a -1. Here is a
2 5 7 -1 1 2 -1 3 9 -1 4 -1 -2
This sequence equals <(2 5 7)(1 2)(3 9)(4)>, there are totally 4 transactions
(i.e., events), and each transaction contains 3, 2, 2, 1 items respectively.
An example input file:
5 7 -1 1 2 -1 3 9 -1 4 -1 -2
1 3 8 -1 4 -1 -2
4 -1 4 5 -1 -2