Orange-Associate scripting documentation

This module implements the FP-growth [1] frequent pattern mining algorithm, with a bucketing optimization [2] for conditional databases containing few items.

The entry points are frequent_itemsets(), association_rules(), and rules_stats() functions below.

[1]: J. Han, J. Pei, Y. Yin, R. Mao.
Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. 2004. https://www.cs.sfu.ca/~jpei/publications/dami03_fpgrowth.pdf
[2]: R. Agrawal, C. Aggarwal, V. Prasad.
Depth first generation of long patterns. 2000. http://www.cs.tau.ac.il/~fiat/dmsem03/Depth%20First%20Generation%20of%20Long%20Patterns%20-%202000.pdf
[3]: R. Agrawal, et al.
Fast Discovery of Association Rules. 1996. http://cs-people.bu.edu/evimaria/cs565/advances.pdf

Examples

Here’s an example from R. Agrawal’s original Apriori article [3 § 12.2.2]. Given a database of transactions:

>>> T = [[1,    3, 4   ],
...      [   2, 3,    5],
...      [1, 2, 3,    5],
...      [   2,       5]]

We can enumerate all frequent itemsets with support of at least two transactions:

>>> from orangecontrib.associate.fpgrowth import *  
>>> itemsets = frequent_itemsets(T, 2)

Note that the functions in this module return generators. The results space can explode quickly and easily grow too large to fit in RAM; by using generators, you can filter the results to your liking as you iterate over them.

>>> itemsets
<generator object ...>
>>> list(itemsets)
[(frozenset({1}), 2),
 (frozenset({2}), 3),
 (frozenset({3}), 3),
 (frozenset({1, 3}), 2),
 (frozenset({2, 3}), 2),
 (frozenset({5}), 3),
 (frozenset({2, 5}), 3),
 (frozenset({3, 5}), 2),
 (frozenset({2, 3, 5}), 2)]
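
The supports above are easy to verify by hand: an itemset's support is simply the number of transactions that contain all of its items. A minimal stand-alone check (support() below is a helper written for this illustration, not part of the module):

```python
T = [[1, 3, 4],
     [2, 3, 5],
     [1, 2, 3, 5],
     [2, 5]]

def support(itemset, transactions):
    # Count the transactions that contain every item of the itemset.
    return sum(set(itemset) <= set(t) for t in transactions)

support({2, 5}, T)   # 3: the second, third and fourth transactions
support({1, 3}, T)   # 2: the first and third transactions
```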

We can try it on a larger, more realistic database of categorical values:

>>> import Orange
>>> data = Orange.data.Table('zoo')
>>> data
[[1, 0, 0, 1, 0, ... | mammal] {aardvark},
 [1, 0, 0, 1, 0, ... | mammal] {antelope},
 [0, 0, 1, 0, 0, ... | fish] {bass},
 [1, 0, 0, 1, 0, ... | mammal] {bear},
 [1, 0, 0, 1, 0, ... | mammal] {boar},
 ...
]

We can’t use the table data directly; we first have to one-hot transform it:

>>> X, mapping = OneHot.encode(data, include_class=True)

We get a database we can use to find frequent itemsets, and a mapping we will use later to revert the transformation.

>>> X
array([[False,  True, ...,  True, False],
       [False,  True, ...,  True, False],
       [ True, False, ..., False, False],
       ...,
       [False,  True, ...,  True, False],
       [ True, False, ..., False, False],
       [ True, False, ..., False, False]], dtype=bool)
>>> sorted(mapping.items())
[(0, (0, 0)),
 (1, (0, 1)),
 (2, (1, 0)),
 (3, (1, 1)),
 ...
 (40, (16, 4)),
 (41, (16, 5)),
 (42, (16, 6))]

We want itemsets with >40% support:

>>> itemsets = dict(frequent_itemsets(X, .4))
>>> len(itemsets)
520

The transaction-coded items corresponding to class values are:

>>> class_items = {item
...                for item, var, _ in OneHot.decode(mapping, data, mapping)
...                if var is data.domain.class_var}
>>> sorted(class_items)
[36, 37, 38, 39, 40, 41, 42]

That makes sense as our class variable has seven values:

>>> data.domain.class_var.values
['amphibian', 'bird', 'fish', 'insect', 'invertebrate', 'mammal', 'reptile']

Now we can generate all association rules that have a consequent equal to one of the class values and >80% confidence (i.e. classification rules):

>>> rules = [(P, Q, supp, conf)
...          for P, Q, supp, conf in association_rules(itemsets, .8)
...          if len(Q) == 1 and Q & class_items]
>>> len(rules)
18
>>> rules
[(frozenset({17, 2, 19, 20, 7}), frozenset({41}), 41, 1.0),
 (frozenset({17, 2, 19, 7}), frozenset({41}), 41, 1.0),
 ...
 (frozenset({20, 7}), frozenset({41}), 41, 1.0),
 (frozenset({7}), frozenset({41}), 41, 1.0)]

To make them more helpful, we can use the mapping to transform the rules’ items back into table domain values, e.g. for the first five rules:

>>> names = {item: '{}={}'.format(var.name, val)
...          for item, var, val in OneHot.decode(mapping, data, mapping)}
>>> for ante, cons, supp, conf in rules[:5]:
...     print(', '.join(names[i] for i in ante), '-->',
...           names[next(iter(cons))],
...           '(supp: {}, conf: {})'.format(supp, conf))
backbone=1, feathers=0, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
feathers=0, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, venomous=0 --> type=mammal (supp: 41, conf: 0.87...)

Reference documentation, with further examples, follows below.

fpgrowth.frequent_itemsets(X, min_support=0.2)

Generator yielding frequent itemsets from database X.

Parameters:
  • X (list or numpy.ndarray or scipy.sparse.spmatrix or iterator) – The database of transactions where each transaction is a collection of integer items. If numpy.ndarray, the items are considered to be indices of non-zero columns.
  • min_support (float or int) – If a float in range (0, 1), the relative minimal support, i.e. the fraction of transactions an itemset must appear in to be considered frequent. If an int > 1, the absolute number of instances. Note that general iterators don’t have a defined length, so when X is an iterator you need to pass the minimal support as an absolute int.
Yields:
  • itemset (frozenset) – Iteratively yields all itemsets (as frozensets of item indices) with support greater than or equal to the specified min_support.
  • support (int) – Itemset’s support as a number of instances.
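
When X is a plain iterator, only the absolute form of min_support can be used, but converting a fraction to a count is straightforward if you know the number of transactions. A sketch (absolute_support() is a hypothetical helper written here; rounding up is an assumption about the conversion, not a documented detail of the library):

```python
from math import ceil

def absolute_support(min_support, n_transactions):
    """Convert a relative support in (0, 1) to an absolute count.
    Hypothetical helper; the round-up behaviour is an assumption."""
    if 0 < min_support < 1:
        return ceil(min_support * n_transactions)
    return int(min_support)

absolute_support(.4, 101)   # 41 of 101 instances
absolute_support(10, 500)   # already absolute: 10
```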

Examples

Suppose we have a database of 50 transactions with 100 possible items:

>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((50, 100)) > .9

Convert it to a sparse matrix to show this type is supported too:

>>> from scipy.sparse import lil_matrix  # other types would convert to LIL anyway
>>> X = lil_matrix(X)

Count the number of itemsets of at least two items with at least 5% support:

>>> sum(1 for itemset, support in frequent_itemsets(X, .05)
...     if len(itemset) >= 2)
72

Let’s get all the itemsets with at least 20% support:

>>> gen = frequent_itemsets(X, .2)
>>> gen
<generator object ...>
>>> itemsets = list(gen)
>>> itemsets
[(frozenset({4}), 11), (frozenset({25}), 10)]

We get the same result by specifying the support as an absolute number:

>>> list(frequent_itemsets(X, 10)) == itemsets
True

So items ‘4’ and ‘25’ (the fifth and twenty-sixth columns of X) are the only items (and itemsets) that appear 10 or more times. Let’s check this:

>>> (X.sum(axis=0) >= 10).nonzero()[1]
array([ 4, 25])

Conclusion: Given databases of uniformly distributed random data, there’s not much to work with.

fpgrowth.association_rules(itemsets, min_confidence, itemset=None)

Generate association rules ([3] § 12.3) from a dict of itemsets’ supports (as returned by frequent_itemsets()). If itemset is provided, only generate the rules for that itemset.

Parameters:
  • itemsets (dict) – A dict mapping itemsets to their supports. Can be generated by feeding the output of frequent_itemsets() to the dict constructor.
  • min_confidence (float) – Minimum confidence, defined as itemset_support / antecedent_support.
  • itemset (frozenset) – The itemset whose association rules we are interested in.
Yields:
  • antecedent (frozenset) – The LHS of the association rule.
  • consequent (frozenset) – The RHS of the association rule.
  • support (int) – The number of instances supporting (containing) this rule.
  • confidence (float) – total_support / lhs_support.
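
The yielded values follow directly from these definitions: confidence is the itemset’s support divided by the antecedent’s support. A naive re-implementation over an itemsets dict, using the supports from the very first example above (illustrative only; naive_rules() is not part of the module, and the library’s generator is far more efficient):

```python
from itertools import combinations

def naive_rules(itemsets, min_confidence):
    # For every frequent itemset, try every non-empty proper subset
    # as the antecedent; confidence = itemset_support / antecedent_support.
    for itemset, support in itemsets.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = support / itemsets[lhs]
                if confidence >= min_confidence:
                    yield lhs, itemset - lhs, support, confidence

itemsets = {frozenset({2}): 3, frozenset({5}): 3, frozenset({2, 5}): 3}
list(naive_rules(itemsets, .8))
# Yields both {2} -> {5} and {5} -> {2}, each with support 3, confidence 1.0
```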

Examples

>>> np.random.seed(0)
>>> N = 100
>>> X = np.random.random((N, 100)) > .9

Find all itemsets with at least 5% support:

>>> itemsets = dict(frequent_itemsets(X, .05))
>>> len(itemsets)
116

Generate all association rules from these itemsets with minimum 50% confidence:

>>> rules = association_rules(itemsets, .5)
>>> rules
<generator object ...>
>>> rules = list(rules)
>>> len(rules)
7
>>> rules
[(frozenset({36}), frozenset({25}), 5, 0.55...),
 (frozenset({63}), frozenset({58}), 5, 0.5),
 ...
 (frozenset({30}), frozenset({32}), 5, 0.55...),
 (frozenset({75}), frozenset({98}), 5, 0.5)]

Or only the rules for a particular itemset:

>>> list(association_rules(itemsets, .3, frozenset({75, 98})))
[(frozenset({75}), frozenset({98}), 5, 0.5),
 (frozenset({98}), frozenset({75}), 5, 0.45...)]

fpgrowth.rules_stats(rules, itemsets, n_examples)

Generate additional stats for rules generated by association_rules().

Parameters:
  • rules (iterable) – Rules as output by association_rules().
  • itemsets (dict) – The itemsets as obtained by dict(frequent_itemsets(...)).
  • n_examples (int) – The total number of instances (for calculating coverage, lift, and leverage).
Yields:
  • antecedent (frozenset) – The LHS of the association rule.
  • consequent (frozenset) – The RHS of the association rule.
  • support (int) – Support as an absolute number of instances.
  • confidence (float) – The confidence, calculated as: total_support / lhs_support.
  • coverage (float) – Calculated as: lhs_support / n_examples
  • strength (float) – Calculated as: rhs_support / lhs_support
  • lift (float) – Calculated as: n_examples * total_support / lhs_support / rhs_support
  • leverage (float) – Calculated as: (total_support * n_examples - lhs_support * rhs_support) / n_examples**2
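
These formulas can be checked by hand. A stand-alone sketch (rule_stats() is a hypothetical helper, not the module’s API) that reproduces the first rule of the example below, where the printed statistics imply total_support = 3, lhs_support = 4, rhs_support = 6 and n_examples = 30:

```python
def rule_stats(total_support, lhs_support, rhs_support, n_examples):
    # Direct transcription of the formulas above.
    confidence = total_support / lhs_support
    coverage = lhs_support / n_examples
    strength = rhs_support / lhs_support
    lift = n_examples * total_support / (lhs_support * rhs_support)
    leverage = (total_support * n_examples
                - lhs_support * rhs_support) / n_examples ** 2
    return confidence, coverage, strength, lift, leverage

rule_stats(3, 4, 6, 30)
# (0.75, 0.133..., 1.5, 3.75, 0.0733...)
```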

Examples

>>> N = 30
>>> X = np.random.random((N, 50)) > .9
>>> itemsets = dict(frequent_itemsets(X, .1))
>>> rules = association_rules(itemsets, .6)
>>> list(rules_stats(rules, itemsets, N))
[(frozenset({15}), frozenset({0}), 3, 0.75, 0.13..., 1.5, 3.75, 0.073...),
 (frozenset({47}), frozenset({22}), 3, 0.6, 0.16..., 1.4, 2.57..., 0.061...),
 (frozenset({27}), frozenset({22}), 4, 0.66..., 0.2, 1.16..., 2.85..., 0.086...),
 (frozenset({19}), frozenset({22}), 3, 0.6, 0.16..., 1.4, 2.57..., 0.061...)]

class fpgrowth.OneHot

Encode a discrete Orange.data.Table into a 2D array of binary attributes.

static encode(table, include_class=False)

Return a tuple of (bool (one hot) ndarray, {col: (variable_index, value_index)} mapping)

If the input table is sparse, a list of nonzero column indices per row (LIL rows) is returned instead of the one-hot ndarray.
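
The shape of the returned mapping is simple to illustrate with a pure-Python stand-in (one_hot() below is written for this illustration only; the real encoder works on an Orange.data.Table and returns a numpy array):

```python
def one_hot(rows, n_values):
    """One boolean column per (variable, value) pair, plus the
    {column: (variable_index, value_index)} mapping."""
    mapping = {}
    for var, n in enumerate(n_values):
        for val in range(n):
            mapping[len(mapping)] = (var, val)
    X = [[row[var] == val for var, val in mapping.values()]
         for row in rows]
    return X, mapping

# Two discrete variables with 2 and 3 values, respectively.
X, mapping = one_hot([[0, 2], [1, 0]], [2, 3])
# mapping == {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1), 4: (1, 2)}
```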

static decode(itemset, table, mapping)

Yield sorted (item, variable, value) tuples (one for each item)

fpgrowth.preprocess(table)

This function applies a one-hot transform to an Orange data table, making it suitable as an X input into frequent_itemsets() above.

For more fine-grained control, use the OneHot methods directly.

Parameters:
  • table (Orange.data.Table) – The table to encode into X compatible with frequent_itemsets() above.
Returns:
  • X – The table’s X with the one-hot transform applied.
Return type:
  • numpy.ndarray

Examples

For a more concrete example, i.e. using non-uniform data:

>>> from Orange.data import Table
>>> table = Table('voting')
>>> table
[[n, y, n, y, y, ... | republican],
 [n, y, n, y, y, ... | republican],
 [?, y, y, ?, y, ... | democrat],
 [n, y, y, n, ?, ... | democrat],
 [y, y, y, n, y, ... | democrat],
 ...
]

Table, as-is, can’t be used with frequent_itemsets() directly (or rather, it can, but it would produce garbage). We first need to one-hot transform it, i.e. make a binary column for each value of each of its discrete variables.

>>> X = preprocess(table)
>>> X
array([[ True, False, False, ...,  True,  True, False],
       [ True, False, False, ..., False,  True, False],
       ...,
       [ True, False,  True, ...,  True,  True, False],
       [ True, False, False, ..., False,  True, False]], dtype=bool)

Now we can use it.

Note: the transformation includes the class if it’s discrete. For finer-grained control, including the mapping of variable values to columns, use the OneHot class directly.