Orange3-Associate scripting documentation¶
This module implements the FP-growth [1] frequent pattern mining algorithm with bucketing optimization [2] for conditional databases of a few items.
The entry points are the frequent_itemsets(), association_rules(), and rules_stats() functions below.
[1]: J. Han, J. Pei, Y. Yin, R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. 2004. https://www.cs.sfu.ca/~jpei/publications/dami03_fpgrowth.pdf
[2]: R. Agrawal, C. Aggarwal, V. Prasad. Depth first generation of long patterns. 2000. http://www.cs.tau.ac.il/~fiat/dmsem03/Depth%20First%20Generation%20of%20Long%20Patterns%20%202000.pdf
[3]: R. Agrawal, et al. Fast Discovery of Association Rules. 1996. http://cspeople.bu.edu/evimaria/cs565/advances.pdf
Examples
Here’s an example from R. Agrawal’s original Apriori article [3 § 12.2.2]. Given a database of transactions:
>>> T = [[1, 3, 4],
...      [2, 3, 5],
...      [1, 2, 3, 5],
...      [2, 5]]
We can enumerate all frequent itemsets with support of at least two transactions:
>>> from orangecontrib.associate.fpgrowth import * # doctest: +SKIP
>>> itemsets = frequent_itemsets(T, 2)
Note that the functions in this module produce generators. The result space can explode quite quickly and can easily be too large to fit in your RAM. By using generators, you can filter the results to your liking as they are produced.
>>> itemsets
<generator object ...>
>>> list(itemsets)
[(frozenset({1}), 2),
(frozenset({2}), 3),
(frozenset({3}), 3),
(frozenset({1, 3}), 2),
(frozenset({2, 3}), 2),
(frozenset({5}), 3),
(frozenset({2, 5}), 3),
(frozenset({3, 5}), 2),
(frozenset({2, 3, 5}), 2)]
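Because the generator yields (itemset, support) pairs lazily, results can be filtered as they stream by, without materializing the whole list first. A minimal pure-Python sketch with made-up pairs (no Orange required; the pairs only mimic the shape of the real output):

```python
# Made-up (itemset, support) pairs in the same shape that
# frequent_itemsets() yields; a real run would supply the generator.
pairs = iter([
    (frozenset({1}), 2),
    (frozenset({2, 5}), 3),
    (frozenset({2, 3, 5}), 2),
])

# Lazily keep only itemsets of at least two items; nothing is
# materialized until the generator is actually consumed.
big = ((s, supp) for s, supp in pairs if len(s) >= 2)
print(list(big))
```

The same pattern works for any predicate, e.g. thresholding on support or requiring a particular item to be present.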
We can try it with a larger and more real-world database of categorical values:
>>> import Orange
>>> data = Orange.data.Table('zoo')
>>> data
[[1, 0, 0, 1, 0, ... | mammal] {aardvark},
 [1, 0, 0, 1, 0, ... | mammal] {antelope},
 [0, 0, 1, 0, 0, ... | fish] {bass},
 [1, 0, 0, 1, 0, ... | mammal] {bear},
 [1, 0, 0, 1, 0, ... | mammal] {boar},
 ...
]
We can’t use the table data directly; we first have to one-hot transform it:
>>> X, mapping = OneHot.encode(data, include_class=True)
We get a database we can use to find frequent itemsets, and a mapping we will use later to revert the transformation.
>>> X
array([[False, True, ..., True, False],
[False, True, ..., True, False],
[ True, False, ..., False, False],
...,
[False, True, ..., True, False],
[ True, False, ..., False, False],
[ True, False, ..., False, False]], dtype=bool)
>>> sorted(mapping.items())
[(0, (0, 0)),
(1, (0, 1)),
(2, (1, 0)),
(3, (1, 1)),
...
(40, (16, 4)),
(41, (16, 5)),
(42, (16, 6))]
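Each mapping entry maps a one-hot column index back to a (variable_index, value_index) pair in the table's domain. A toy sketch of decoding an item by hand (the variable and value names below are hypothetical stand-ins, not from the zoo dataset):

```python
# Toy mapping: one-hot column index -> (variable_index, value_index).
mapping = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (1, 1)}

# Hypothetical variables and their value names, standing in for a
# Table's domain (a real domain would supply these).
variables = [('hair', ['0', '1']),
             ('feathers', ['0', '1'])]

def item_name(item):
    # Look the column up in the mapping, then resolve names in the domain.
    var_idx, val_idx = mapping[item]
    name, values = variables[var_idx]
    return '{}={}'.format(name, values[val_idx])

print(item_name(3))  # feathers=1
```

OneHot.decode does this lookup for you against the real table domain; the sketch only shows the shape of the mapping.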
We want itemsets with >40% support:
>>> itemsets = dict(frequent_itemsets(X, .4))
>>> len(itemsets)
520
The transaction-coded items corresponding to class values are:
>>> class_items = {item
...                for item, var, _ in OneHot.decode(mapping, data, mapping)
...                if var is data.domain.class_var}
>>> sorted(class_items)
[36, 37, 38, 39, 40, 41, 42]
That makes sense as our class variable has seven values:
>>> data.domain.class_var.values
['amphibian', 'bird', 'fish', 'insect', 'invertebrate', 'mammal', 'reptile']
Now we can generate all association rules that have consequent equal to one of the class values and >80% confidence (i.e. classification rules):
>>> rules = [(P, Q, supp, conf)
...          for P, Q, supp, conf in association_rules(itemsets, .8)
...          if len(Q) == 1 and Q & class_items]
>>> len(rules)
18
>>> rules
[(frozenset({17, 2, 19, 20, 7}), frozenset({41}), 41, 1.0),
(frozenset({17, 2, 19, 7}), frozenset({41}), 41, 1.0),
...
(frozenset({20, 7}), frozenset({41}), 41, 1.0),
(frozenset({7}), frozenset({41}), 41, 1.0)]
To make them more helpful, we can use mapping to transform the rules’ items back into table domain values, e.g. for the first five rules:
>>> names = {item: '{}={}'.format(var.name, val)
...          for item, var, val in OneHot.decode(mapping, data, mapping)}
>>> for ante, cons, supp, conf in rules[:5]:
...     print(', '.join(names[i] for i in ante), '-->',
...           names[next(iter(cons))],
...           '(supp: {}, conf: {})'.format(supp, conf))
backbone=1, feathers=0, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
feathers=0, breathes=1, venomous=0, milk=1 --> type=mammal (supp: 41, conf: 1.0)
backbone=1, feathers=0, breathes=1, venomous=0 --> type=mammal (supp: 41, conf: 0.87...)
The reference with further examples follows below.

fpgrowth.frequent_itemsets(X, min_support=0.2)¶
Generator yielding frequent itemsets from database X.
Parameters:
 X (list or numpy.ndarray or scipy.sparse.spmatrix or iterator) – The database of transactions where each transaction is a collection of integer items. If numpy.ndarray, the items are considered to be the indices of non-zero columns.
 min_support (float or int) – If a float in range (0, 1), the percent of minimal support for an itemset to be considered frequent. If an int > 1, the absolute number of instances. Note that general iterators don’t have a defined length, so with them you need to pass the absolute minimal support as an int.
Yields:
 itemset (frozenset) – Iteratively yields all itemsets (as frozensets of item indices) with support greater or equal to the specified min_support.
 support (int) – Itemset’s support as a number of instances.
Examples
Take a database of 50 transactions and 100 possible items:
>>> import numpy as np
>>> np.random.seed(0)
>>> X = np.random.random((50, 100)) > .9
Convert it to sparse so we show this type is supported:
>>> from scipy.sparse import lil_matrix  # other types would convert to LIL anyway
>>> X = lil_matrix(X)
Count the number of itemsets of at least two items with support greater than 4%:
>>> sum(1 for itemset, support in frequent_itemsets(X, .05)
...     if len(itemset) >= 2)
72
Let’s get all the itemsets with at least 20% support:
>>> gen = frequent_itemsets(X, .2)
>>> gen
<generator object ...>
>>> itemsets = list(gen)
>>> itemsets
[(frozenset({4}), 11), (frozenset({25}), 10)]
We get the same result by specifying the support as absolute number:
>>> list(frequent_itemsets(X, 10)) == itemsets
True
So the items ‘4’ and ‘25’ (the fifth and twenty-sixth columns of X) are the only items (and itemsets) that appear 10 or more times. Let’s check this:
>>> (X.sum(axis=0) >= 10).nonzero()[1]
array([ 4, 25])
Conclusion: Given databases of uniformly distributed random data, there’s not much to work with.

fpgrowth.association_rules(itemsets, min_confidence, itemset=None)¶
Generate association rules ([3] § 12.3) from a dict of itemsets’ supports (from frequent_itemsets()). If itemset is provided, only generate its rules.
Parameters:
 itemsets (dict) – A dict mapping itemsets to their supports. Can be generated by feeding the output of frequent_itemsets() to the dict constructor.
 min_confidence (float) – Confidence percent. Defined as itemset_support / antecedent_support.
 itemset (frozenset) – The itemset whose association rules we are interested in.
Yields:
 antecedent (frozenset) – The LHS of the association rule.
 consequent (frozenset) – The RHS of the association rule.
 support (int) – The number of instances supporting (containing) this rule.
 confidence (float) – total_support / lhs_support.
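The confidence value can be recomputed by hand from the supports dict; a small sketch with made-up supports (the numbers are illustrative, not from a real run):

```python
# Made-up supports dict, in the shape of dict(frequent_itemsets(...)).
itemsets = {
    frozenset({2}): 3,
    frozenset({5}): 3,
    frozenset({2, 5}): 3,
}

# Confidence of the rule {2} -> {5}: support of the combined itemset
# divided by the support of the antecedent.
ante, cons = frozenset({2}), frozenset({5})
conf = itemsets[ante | cons] / itemsets[ante]
print(conf)  # 1.0
```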
Examples
>>> np.random.seed(0)
>>> N = 100
>>> X = np.random.random((N, 100)) > .9
Find all itemsets with at least 5% support:
>>> itemsets = dict(frequent_itemsets(X, .05))
>>> len(itemsets)
116
Generate all association rules from these itemsets with minimum 50% confidence:
>>> rules = association_rules(itemsets, .5)
>>> rules
<generator object ...>
>>> rules = list(rules)
>>> len(rules)
7
>>> rules
[(frozenset({36}), frozenset({25}), 5, 0.55...),
 (frozenset({63}), frozenset({58}), 5, 0.5),
 ...
 (frozenset({30}), frozenset({32}), 5, 0.55...),
 (frozenset({75}), frozenset({98}), 5, 0.5)]
Or only the rules for a particular itemset:
>>> list(association_rules(itemsets, .3, frozenset({75, 98})))
[(frozenset({75}), frozenset({98}), 5, 0.5),
 (frozenset({98}), frozenset({75}), 5, 0.45...)]

fpgrowth.rules_stats(rules, itemsets, n_examples)¶
Generate additional stats for rules generated by association_rules().
Parameters:
 rules (iterable) – Rules as output by association_rules().
 itemsets (dict) – The itemsets as obtained by dict(frequent_itemsets(…)).
 n_examples (int) – The total number of instances (for calculating coverage, lift, and leverage).
Yields:
 antecedent (frozenset) – The LHS of the association rule.
 consequent (frozenset) – The RHS of the association rule.
 support (int) – Support as an absolute number of instances.
 confidence (float) – The confidence percent, calculated as: total_support / lhs_support.
 coverage (float) – Calculated as: lhs_support / n_examples.
 strength (float) – Calculated as: rhs_support / lhs_support.
 lift (float) – Calculated as: n_examples * total_support / lhs_support / rhs_support.
 leverage (float) – Calculated as: (total_support * n_examples - lhs_support * rhs_support) / n_examples**2.
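These statistics can be checked by hand; a sketch with toy support counts (the numbers below are illustrative, not from a real run):

```python
# Toy counts, illustrative only.
n_examples = 30
total_support = 3   # support of the whole rule (antecedent and consequent)
lhs_support = 4     # support of the antecedent alone
rhs_support = 6     # support of the consequent alone

coverage = lhs_support / n_examples
strength = rhs_support / lhs_support
lift = n_examples * total_support / lhs_support / rhs_support
leverage = (total_support * n_examples
            - lhs_support * rhs_support) / n_examples**2

print(coverage, strength, lift, leverage)
```

Lift > 1 means the antecedent and consequent co-occur more often than independence would predict; leverage expresses the same excess co-occurrence as an absolute frequency difference.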
Examples
>>> N = 30
>>> X = np.random.random((N, 50)) > .9
>>> itemsets = dict(frequent_itemsets(X, .1))
>>> rules = association_rules(itemsets, .6)
>>> list(rules_stats(rules, itemsets, N))
[(frozenset({15}), frozenset({0}), 3, 0.75, 0.13..., 1.5, 3.75, 0.073...),
 (frozenset({47}), frozenset({22}), 3, 0.6, 0.16..., 1.4, 2.57..., 0.061...),
 (frozenset({27}), frozenset({22}), 4, 0.66..., 0.2, 1.16..., 2.85..., 0.086...),
 (frozenset({19}), frozenset({22}), 3, 0.6, 0.16..., 1.4, 2.57..., 0.061...)]

class fpgrowth.OneHot¶
Encode a discrete Orange.data.Table into a 2D array of binary attributes.

static encode(table, include_class=False)¶
Return a tuple of (bool (one-hot) ndarray, {col: (variable_index, value_index)} mapping).
If the input table is sparse, a list of non-zero column indices per row (LIL rows) is returned instead of the one-hot ndarray.
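Conceptually, the encoding makes one boolean column per (variable, value) pair and records which column came from where. A minimal pure-Python sketch of the idea (toy data; not the actual implementation):

```python
# Toy categorical table: rows are instances, columns are discrete
# variables stored as integer value indices (a stand-in for an Orange Table).
table = [[0, 2],
         [1, 0],
         [1, 2]]
n_values = [2, 3]  # number of distinct values per variable

# Build the {column: (variable_index, value_index)} mapping first...
mapping, col = {}, 0
for var, k in enumerate(n_values):
    for val in range(k):
        mapping[col] = (var, val)
        col += 1

# ...then one boolean row per instance, one column per (var, value) pair.
X = [[row[var] == val for _, (var, val) in sorted(mapping.items())]
     for row in table]

print(X[0])        # [True, False, False, False, True]
print(mapping[4])  # (1, 2)
```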

static decode(itemset, table, mapping)¶
Yield sorted (item, variable, value) tuples (one for each item).


fpgrowth.preprocess(table)¶
This function applies a one-hot transform to an Orange data table, making it suitable as an X input into frequent_itemsets() above.
For more fine-grained control, use the OneHot methods directly.
Parameters: table (Orange.data.Table) – The table to encode into an X compatible with frequent_itemsets() above.
Returns: X – The table’s X with the one-hot transform applied.
Return type: numpy.ndarray
Examples
For a more concrete example, i.e. using non-uniform data:
>>> from Orange.data import Table
>>> table = Table('voting')
>>> table
[[n, y, n, y, y, ... | republican],
 [n, y, n, y, y, ... | republican],
 [?, y, y, ?, y, ... | democrat],
 [n, y, y, n, ?, ... | democrat],
 [y, y, y, n, y, ... | democrat],
 ...
]
Table, as-is, can’t be used with frequent_itemsets() directly (it can, but it would produce garbage). We first need to one-hot transform it, i.e. make binary columns for each value of each of its discrete variables.
>>> X = preprocess(table)
>>> X
array([[ True, False, False, ..., True, True, False],
       [ True, False, False, ..., False, True, False],
       ...,
       [ True, False, True, ..., True, True, False],
       [ True, False, False, ..., False, True, False]], dtype=bool)
Now we can use it.
Note: the transformation includes the class if it’s discrete. For finer-grained control, including the mapping of variable values to columns, use the OneHot class directly.