In this section we will show that ordinary decision tree learning can be cleanly and efficiently generalized such that instead of only handling usual attribute-value representations it can operate on arbitrarily structured, even recursive input and output data. We call this algorithm10 AIFAD11 (Automated Induction of Functions over Algebraic Datatypes). Note, however, that this generalization will not allow us to induce recursive functions, which is still an open but extremely interesting research topic.
Assume that we have a product of sum types on both the left- and right hand side of a classification task, e.g.:
(Meal, Drink) → (Satisfaction, Satisfaction)
where the corresponding type constructors are defined as follows:
data Meal = Pizza PizzaTopping
| WienerSchnitzel Diameter
| TapirSoup Spiced
data PizzaTopping = Cheese | Tomatoes | Mushrooms
data Diameter = Large | Small
data Spiced = False | True
Let’s consider the following dataset of input-output mappings that we want to learn:
(WienerSchnitzel Large, Water) → (Very Low, High). (WienerSchnitzel Small, Beer) → (Very Low, Low). (TapirSoup True, Beer) → (Very Low, High). (TapirSoup False, Beer) → (Very High, High).
The first thing we can do is look out for (parts of) structures on the right hand side and see if they are unique for all samples. One of the following three scenarios can happen:
In our example dataset case two applies: the topmost constructor of all first satisfaction values is unique. Thus, we can already construct part of the model:
let
(v1, v2) = …
in
(Very v1, v2)
The factorized set of samples now looks as follows:
(WienerSchnitzel Large, Water) → (Low, High). (WienerSchnitzel Small, Beer) → (Low, Low). (TapirSoup True, Beer) → (Low, High). (TapirSoup False, Beer) → (High, High).
Since we are not done yet, we will have to choose an input variable to discriminate further. Before we can do this, we will have to get rid of redundant information by factoring out common structure again, thus leaving only input variables whose topmost constructors are not unique. In our example above this is already the case. If, instead, we had the following input data:
(WienerSchnitzel Large, Water). (WienerSchnitzel Large, Beer). (WienerSchnitzel Small, Beer). (WienerSchnitzel Small, Beer).
Then we would have to factorize it such that the data constructor WienerSchnitzel is dropped, revealing the non-redundant diameter information beneath it. Of course, this process would have to be repeated recursively if sub-attributes were redundant, too. Some old variables may completely vanish and arbitrarily many new ones may enter the game depending on the structure of the input data.
If new variables get added, we will have to record for each of them, which variables (and possibly sub-attributes) have to be matched in order to access the new variables. In the upper case this would mean recording for the freshly added diameter variable that we have to match the meal variable first to make the diameter of the WienerSchnitzel available. Of course, the match case would only contain two branches then: one for WienerSchnitzel whereas the other would have to be taken in any other case, since we do not have any concrete samples for refining this partition. The match would then look as follows (the underscore being a catch-all dummy variable):
case meal of
WienerSchnitzel diameter → …
_ → …
It might be tempting to think that we could immediately add a pattern match to the model to get at the non-redundant constructors. But this is a bad idea for several reasons:
The classification process should be driven as far as possible by available data and at the same time require as little data as possible to give a result, which would be violated here.
Therefore, the optimum method to destruct (match) input data and to construct output data is by matching (redundant) input data as late as possible down the model tree while constructing (redundant) output data as early as possible up the tree. This is, why we have to delay matching by remembering for newly added variables, which matches still need to be performed directly in advance.
If we now assume that indeed the diameter had been chosen for partitioning the training data, the model would be constructed like this:
case meal of
WienerSchnitzel diameter →
case diameter of
Large → …
Small → …
_ → …
This guarantees that meal is matched as late as possible, i.e. when diameter is really needed.
Decision tree learning as performed by e.g. ID3 or C4.5 uses greedy heuristics for choosing variables that build on the concept of statistical entropy12. The idea is to consume (match) as little information as possible on the left hand side of the samples while gaining (producing, constructing) as much information as possible on the right hand side. To do this we will first have to generalize the common notions of entropy (average information contents) from attribute-value representations to algebraic datatypes. Though, formally speaking, there is only one statistically correct entropy measure, we will see that several other reasonable entropy-like measures can be designed, all of which have different properties and trade-offs.
For one non-structured variable X containing NX elements that come from classes13 C1,…,Ck, entropy in bits H(X) is defined as:
| H(X) = − |
| pX(c) × log2(pX(c)) |
where function “classes” returns the number of constructors used to form the discriminated union of variable X, and pX(c) is the relative frequency (probability) of class c in variable X (cardX(c) returns the number of values in variable X starting with constructor c):
| pX(c) = |
|
As a side note, calculation of entropy can be rearranged for slightly more efficient computation as follows if the class histogram is known:
| H(X) = log2(NX) − |
|
If we want to compute the joint (dependent) entropy of a product of n variables X1,…,Xn, the corresponding formula is:
| Hd(X1,…,Xn) = |
| H(Xi | Xi−1,…,X1)) |
More pragmatically speaking, we first compute the entropy for the first variable. Then we split up the remaining variables into partitions depending on the class values of the first variable and recurse for each of those. The returned entropies of these partitions are weighted by their size and summed up:
| Hd(X1,…,Xn) = H(X1) + |
| pX1(c) × Hd(X2,…,Xn|X1=c) |
We call the above entropy formula dependent entropy. If the variables X1,…,Xn are independent of each other, i.e. the partitions are equally large at each split, the joint (independent) entropy reduces to:
| Hi(X1,…,Xn) = |
| H(Xi) |
If we happen to know that our data has this property or that existing dependencies are negligible, we can exploit this to greatly improve performance of entropy calculation, which is one of the most expensive parts of the whole algorithm. Therefore, it may sometimes be beneficial to use the above formula instead: we call this independent entropy.
Now we will consider variables that have substructure and call this deep entropy. If we do not do this, i.e. if we only consider topmost constructors of values as above, then we refer to shallow entropy. Shallow entropy is useful if sub-variables are (almost) identical for all values having a specific constructor, i.e. contribute little to nothing to the total entropy.
There is actually only a small difference to the calculation of shallow entropy as described in the previous subsection: whenever we compute the entropy of one variable, i.e. when we “recognize” its values, then we may add sub-variables associated with each constructor to the partitions of values arising from the entropy calculation. For dependent, deep entropy calculation this is formally expressed as follows:
|
where svars(c) returns the number of sub-variables for constructor c, and X1|cn is the n-th sub-variable of variable X1 limited to values tagged with constructor c.
We may again define an independent version of deep entropy calculation:
| Hid(X1,…,Xn) = |
|
|
Thus, we have finally defined four measures useful for determining average information contents, the deep, dependent entropy being the statistically exact one for algebraic datatypes.
Having a measure for average information contents, we can apply the usual heuristics as found in ID3 or C4.5 to select an input variable for partitioning datasets. The information gain is the number of bits gained by partitioning a dataset on an input variable and is defined as follows:
|
Of course, this gain could be defined analogously using a different information measure. ID3 applies the gain criterion by selecting the input variable that boasts the highest gain.
The gain ratio expresses the number of bits gained divided by the number of bits consumed by using a certain input variable for partitioning. The variable having the highest gain ratio is considered to be best:
| GainRatioA(X1,…,Xn) = |
|
Since possible sub-variables associated with variable A remain in the set of input variables, only the shallow entropy H(A) is required here.14.
Getting back to our example dataset, we now demonstrate how to choose one of the two visible input variables. The right hand side of the dataset is:
(Low, High). (Low, Low). (Low, High). (High, High).
Dependent entropy gives us a value of 1.5 bits, whereas independent entropy would give us 1.62 bits15. Splitting the dataset using variable meal would result in the following two partitions, the first one for meal values WienerSchnitzel, the second one for TapirSoup16:
(Large, Water) → (Low, High). (Small, Beer) → (Low, Low).
(True, Beer) → (Low, High). (False, Beer) → (High, High).
Entropy would now decrease to 1 bit, irrespective of whether we used the dependent or independent formula. Therefore, the information gain would be 0.5 bits for dependent entropy and 0.62 bits for independent entropy. Since the chosen input variable meal contains 2 bits of information in the matched (topmost) constructor, the dependent gain ratio is 0.25, the independent one 0.31.
If we had chosen input variable drink instead, the following partitions would arise, the first one for drink Water, the second one for Beer:
WienerSchnitzel Large → (Low, High).
WienerSchnitzel Small → (Low, Low). TapirSoup True → (Low, High). TapirSoup False → (High, High).
Here we would get dependent entropy of 1.19 bits and independent entropy of 1.38 bits 17, leading to a dependent information gain of 0.31 and an independent one of 0.25. Since the information consumed by choosing variable drink is 0.81 bits, the dependent gain ratio would therefore be 0.31 / 0.81 = 0.38, the independent one 0.25 / 0.81 = 0.30 (rounded).
Taking dependent gain ratio as selection criterion, we would favor variable drink (0.38) over meal (0.25). We will continue with the choice using dependent entropy, but would like to point out here that independent entropy would have chosen otherwise, since there variable drink has a gain ratio of 0.30 whereas the one for meal would have been 0.31. Our choice yields the following, still incomplete model:
let
(v1, v2) =
case drink of
Water → …
Beer → …
_ → …
in
(Very v1, v2)
Note that we also need to learn a model for cases when some specified kind of drink has never been observed before, as happens above for Wine. The unobserved constructors are omitted and a dummy binding “_” is used instead to indicate this. Common decision tree learning now simply estimates the most likely value for the default case18.
Continuing in the default branch of the previous model, we will show in this subsection how the most likely value of output data can be calculated. Let’s consider the following example data:
(Low, Low). (Low, Very High). (High, High). (Very High, High).
If we only considered each variable separately when choosing the most frequent element as estimate for the most likely one, we would choose the constructor Low for the first variable and the constructor High for the second, both of which occur twice. Thus, our estimate for the most likely value for both variables would be (Low, High). But interestingly, this estimate is itself not a value of the example data!
Therefore, calculating the most likely value independently is an
adventurous method. The machine learning system might predict observations
that it has actually never seen before, which is a kind of inductive
generalization that has not previously existed in traditional decision
tree learning. This property can be both an advantage and a disadvantage:
on the one hand, the system might benefit from this new generalization
capability in terms of accuracy while allowing very fast estimation of
this value. On the other hand, it may construct values in such a way
that they do not actually make sense. If the latter can lead to harmful
classifications, it’s better to not use this feature and apply dependent
calculation of the most likely value instead.
How do we calculate the most likely value dependently? There are several
useful ways of doing it. One is that we assume that all values are equally
weighted and that we pick the one that occurs most frequently, i.e. it
must be absolutely equivalent to other values. This, however, is not very
useful when we have to deal with many variables on the right hand side:
it would be extremely unlikely that two values are equivalent, making all
choices of observed values possible. On the other hand, this most likely
value can be computed in linear time with respect to the amount of data.
Another possibility is the following: we pick the value that has more
information in common with all other values in the dataset than any
other candidate. Unfortunately, this requires O(n2) time in the
worst case. Still, this method is especially a reasonable thing to
do if we assess classification performance with a similar measure of
commonality. Therefore, we leave the description of this method to
section 3.4 on assessing accuracy.
Back to our example and the default branch after choosing variable
drink, we now have the following right hand side of data:
(Low, High). (Low, Low). (Low, High). (High, High).
No matter, which method of computing the most likely value we choose from above, the estimate for the most likely value will be (Low, High). Therefore, our model now looks as follows:
let
(v1, v2) =
case drink of
Water → …
Beer → …
_ → (Low, High)
in
(Very v1, v2)
In the branches still missing for drinks Water and Beer we only need to start all over with the new partitions, normalizing and recursively partitioning as required. Finally, we would get the following model:
let
(v1, v2) =
case drink of
Water → (Low, High)
Beer →
case meal of
WienerSchnitzel _ → (Low, Low)
TapirSoup spiced →
let
v3 =
case spiced of
False → High
True → Low
in
(v3, High)
_ → (Low, High)
_ → (Low, High)
in
(Very v1, v2)
The big advantage of such models is, as we can see, that the let-abstractions allow us to predict parts of structured values before we have knowledge (due to discrimination) about others. It might be even more human readable if we used where-abstractions, which place the abstracted part after the one where it is to be inserted, but this is only a minor syntactic issue.
We first define information contents for terms: it expresses the number of bits required to encode a given term in a given set of type equations. If, for example, some value is of a sum type that consists of four constructors, we would need two bits to encode the topmost constructor. Then we continue with all substructures possibly associated with this constructor, counting bits as we go along. Given the type equations of our example, the information contents of Pizza (Cheese Mozzarella) would be log2(3) + log2(3) + log2(2) = 4.17 bits.
When comparing two values, we can now know how much information the two terms have in common by counting the number of bits of all constructors on which they agree. This gives us a numeric estimate of commonality. For instance, comparing the value from above to Pizza (Cheese Gorgonzola), we would get a common information contents of 3.17 bits (accuracy of 76%), because the two topmost constructors are the same. This way we can evaluate a model of the generalized decision tree learner on a test set and compare the results to the expected values, which allows us to compute a percentage of correctly predicted bits.
By computing the average common information contents for all values in a set, we can choose the one that has the highest amount of common bits with all other values, which is one possible method of calculating a most likely value.
Even though the algorithm can sometimes immediately see that parts of structures can be factored out using let-abstractions, in other cases this is only possible after sub-models have been learnt. In the standard decision tree learner C4.5 this transformation is coined suing and can there only be usefully applied to leaf nodes. E.g. consider this example data that maps a boolean to a satisfaction value:
True → Low. True → Low. True → High. False → Low. False → Low. False → High.
The right hand side is not unique. But after splitting according to the boolean and computing the most likely value for each branch, we get the following model:
case boolean of
True → Low
False → Low
As we can see, no matter what boolean we observe, the most likely value will be Low. Therefore, the choice can be eliminated and the model would simply reduce to the single leaf node Low. Generalizing this to algebraic datatypes, this is quite a bit trickier. Let’s extend the above example to two satisfaction values on the right hand side of the data:
True → (Low, Low). True → (Low, Low). True → (High, Low). False → (Low, High). False → (Low, High). False → (High, High).
This time the model for this data will look as follows:
case boolean of
True → (Low, Low)
False → (Low, High)
Now we cannot reduce this to a leaf node anymore. But we can factor out the left part of the pairs in each branch using a let-abstraction:
let
v1 =
case boolean of
True → Low
False → High
in
(Low, v1)
We would even have to consider common substructures and also expect cases where let-abstractions share common structure with leaf nodes. In extreme scenarios parts of a leaf node could even propagate through let-abstractions up to the top of the model! Though it is a bit challenging to implement this transformation, the algorithm itself is efficient and always yields a unique solution, i.e. a minimal factorization of a model19. The obvious advantage of such model rewritings is lower model complexity, which leads to more informative models for humans and allows more accurate model comparisons in terms of complexity for estimating the generalization capability.
We have so far left out the point that it may be convenient for the user to specify data using products of products. This can make specifications more modular, e.g.:
data Kind = Starter | Main data Size = Small | Large type Meal = (Kind, Size) type Meals = (Meal, Meal)
A legal value of type Meals could be:
((Starter, Small), (Main, Large))
To treat such nested products efficiently within the algorithm, it is generally a good idea to flatten them internally, i.e. that each product consists of sums only. We thus invent an internal type for Meals that looks as follows:
(Kind, Size, Kind, Size)
and whose values are converted accordingly.
The same flattening happens with arguments of constructors. This way the algorithm can directly iterate over sums (e.g. during entropy calculation) instead of having to decompose products for each computation. Unflattening, which we have to do when converting internal values back to their original representation, is as easy a transformation as flattening, because the flattened and unflattened specifications are isomorphic.
The AIFAD-implementation of the generalized algorithm has the same time complexity as ordinary decision tree learning, including structured data. Comparative experiments with C4.5 show that there is usually only a factor of two to five20 difference in speed, even though the implementation of the algorithm is not specialized towards unstructured representations, implemented in the high-level functional language OCaml21 rather than the low-level language C and not yet fully optimized.
What concerns memory consumption, there is a small disadvantage: since structured data does not allow for certain in-place operations without significant loss of time efficiency, memory requirements are higher than for C4.5. More precisely, there is an additional logarithmic factor involved, which depends on the number of samples. This, however, has only lead to memory problems in our experiments once: on a dataset containing about 600,000 samples and more than 40 variables, which is huge even in industrial terms. It very rarely happens that AIFAD consumes more than ten times as much memory as C4.5, and logarithmic factors usually become negligible with high numbers of samples.