1
More precisely: their Cartesian product.
2
See [Quinlan1992].
3
See e.g. [Mitchell1996] for details on ordinary decision tree learning and attribute-value representation of data.
4
The educated reader may have already realized that we are actually using the purely functional, lazy language Haskell for representation. This may allow us to exploit the powerful mathematical tool of equational reasoning to analyze and transform learnt models. More details can be found here: http://www.haskell.org
5
Also sometimes referred to as sum type.
6
The keyword type can be used to introduce type synonyms for readability. For example, Spiced is more intuitive than just saying Bool for boolean values.
7
We will ignore handling of non-discrete (e.g. numeric) data in this paper, though this is can be easily integrated into the new concept.
8
Algebraic datatypes are also often referred to as polynomial or sum-of-product types.
9
The name given to some set of values.
10
A more formal representation of this algorithm will be available in an upcoming, longer version of this paper, which will appear as technical report at: http://www.oefai.at/oefai/tr-online
11
A fairly fully-featured implementation, which can handle industrial size data efficiently, can be found here: http://www.oefai.at/~markus/aifad
12
See [Shannon and Weaver1949] on the mathematical theory behind this concept.
13
Alternatives in the corresponding sum type.
14
Side note: C4.5 uses the slightly adapted gain ratio criterion: it first selects input variables that have at least average gain and chooses the one of those having the highest gain ratio. Unfortunately, the rationale behind this idea as explained in [Quinlan1986] and repeated in other publications (e.g. [Mitchell1996]) is mathematically not justified: they claim that as H(A) becomes smaller, the gain ratio becomes larger. Looking at the fraction, this may seem right at first sight, but is invalid, because the information gain is dependent on H(A): variables that contain little information are less likely to evenly partition other data, which generally implies smaller gains. The empirical results concerning reduced number of nodes in decision trees can be easily explained theoretically, but do absolutely not imply anything about predictive accuracy. The claim that predictive accuracy can be higher in certain constructed cases still allows that it could be smaller in others. Experiments on 32 datasets of real data that we have performed did not show any mentionable advantage for either gain ratio or the modified criterion.
15
All numbers rounded to two digits after the decimal point.
16
Note, how different types of sub-variables are added to the input variables of each partition!
17
Note that ∀ X. Hd(X) ≤ Hi(X).
18
It is possible to continue learning in default branches by continuing with the dataset before the split but using the remaining variables for learning instead. This, however, raises some interesting questions concerning computational complexity that we do not attempt to answer here.
19
This, too, is implemented in the AIFAD-system.
20
A factor of five is usually only observed on data containing many missing values, which are encoded in a structured way.
21
For details on this languages, see: http://www.ocaml.org
22
See e.g. [Floyd and Beigel1994] for an introduction to the syntax of formal languages.
23
See e.g. [Stenlund1972] for details.
24
A thorough introduction to natural deduction can be found in [Girard1987].
25
For details on such rewritings see e.g. [Bundy1983].
26
See e.g. [Kearns and Vazirani1994] for an introduction.
27
See e.g. [Taylor1999] for details on category theory.
28
A good introduction to this approach of deriving fold/unfold-functions from type definitions can be found in [Meijer and Hutton1995].
29
Or are continuous on infinite data. See [Winskel1993] on the relevance of continuity for describing the observable behavior of programs.