The educated
reader may have already realized that we are actually using the purely
functional, lazy language Haskell for representation. This may
allow us to exploit the powerful mathematical tool of equational
reasoning to analyze and transform learnt models. More details can
be found here: http://www.haskell.org
The keyword type can be used to introduce
type synonyms for readability. For example, Spiced is more intuitive
than just saying Bool for boolean values.
A more formal
representation of this algorithm will be available in an upcoming,
longer version of this paper, which will appear as technical report at:
http://www.oefai.at/oefai/tr-online
Side note: C4.5 uses the slightly adapted
gain ratio criterion: it first selects input variables that
have at least average gain and chooses the one of those having the
highest gain ratio. Unfortunately, the rationale behind this idea
as explained in [Quinlan1986] and repeated in other publications
(e.g. [Mitchell1996]) is mathematically not justified:
they claim that as H(A) becomes smaller, the gain ratio
becomes larger. Looking at the fraction, this may seem right at first
sight, but is invalid, because the information gain is dependent on
H(A): variables that contain little information are less
likely to evenly partition other data, which generally implies smaller
gains. The empirical results concerning reduced number of nodes in
decision trees can be easily explained theoretically, but do absolutely
not imply anything about predictive accuracy. The claim that predictive
accuracy can be higher in certain constructed cases still allows that
it could be smaller in others. Experiments on 32 datasets of real data
that we have performed did not show any mentionable advantage for either
gain ratio or the modified criterion.
It is
possible to continue learning in default branches by continuing with the
dataset before the split but using the remaining variables for learning
instead. This, however, raises some interesting questions concerning
computational complexity that we do not attempt to answer here.