Previous Up Next

2  Intuitive example: Algebraic datatypes make tasty meals

Let’s put ourselves into the role of a restaurant owner who wants to revise his menu and prices to suit the tastes of his typical customers, thus allowing him to maximize his profit by optimizing his customers’ satisfaction. His unimaginative cook only knows how to prepare a small variety of meals and drinks, which he enumerates as follows:

  data Meal  = Pizza | WienerSchnitzel | TapirSoup
  data Drink = Water | Beer | Wine

The waitress was instructed to ask the customers about their satisfaction:

  data Satisfaction = Low | High

By gaining knowledge (learning) about the customers’ preferences or, in other terms, by finding a predictively most accurate function

  rate :: (Meal, Drink) → Satisfaction

from the set of meals and drinks1 to the set of satisfaction values, the restaurant owner hopes to improve his business. After having collected some observations regarding customer satisfaction, he leaves this task of finding the explaining function to his computer, which runs a classification program in the spirit of C4.52. It can output its results graphically as decision trees3, e.g.:


Figure 1: Decision tree (graphical representation)

Or as semantically equivalent text representation using case-statements:

  case meal of
    TapirSoup → High
    Pizza →
      case drink of
        Water → High
        Beer → Low
        Wine → Low
    WienerSchnitzel →
      case drink of
        Water → Low
        Beer → High
        Wine → Low

We will use textual representations as given above4 for the rest of this paper.

2.1  Cooking indistinguishable food can make meals uneatable

The restaurant owner’s greediness makes him ask for even more accurate predictions of customer satisfaction. From his cook he learns that it may well depend on additional attributes associated with each kind of meal, for example the diameter of the WienerSchnitzel, the kind of PizzaTopping or whether the TapirSoup is spiced or not.

How can we add this information in traditional representations as used e.g. by most decision tree learners and related approaches? These algorithms usually only expect a fixed number of attributes with a certain number of tags standing for discrete values, but in our case this may not be enough. One could imagine the following workarounds:

  1. Adding an attribute PizzaTopping is certainly possible, e.g.:
      data PizzaTopping = Cheese | Tomatoes | Mushrooms
    

    We would then have to revise the type of the function we want to learn to:

      rate :: (Meal, PizzaTopping, Drink) → Satisfaction
    

    But unfortunately, this would allow rather non-sensical models like:

      case meal of
        TapirSoup → High
        Pizza → Low
        WienerSchnitzel →
          case pizza_topping of
            Cheese → Low
            Tomatoes → Low
            Mushrooms → High
    

    Surely, nobody would seriously intend to eat a WienerSchnitzel with a PizzaTopping: a rather uneatable combination.

  2. Another solution attempt to get the menu recipes right could be to replace the tag Pizza with the tags CheesePizza, TomatoePizza and MushroomPizza. This approach is inappropriate for various reasons:
  3. Still another workaround suggests the introduction of an additional boolean attribute for each associated attribute to indicate whether it is relevant in a certain context or not. This leads to more redundant data, because irrelevant cases require a dummy value for the associated attribute. Furthermore, the machine learning system would require ways of figuring out when some attribute is actually relevant for classification, which is a waste of computation time.

The best we can do is therefore not to pretend that information can be represented with atomic elements (value tags, numeric variables) alone: we need ways to cope with structured data values.

2.2  Spicing meals with algebraic datatypes

Algebraic datatypes have become very widespread in modern functional and logic programming languages (e.g. Haskell, ML, Mercury, etc.) because of their nice formal properties. An introduction to the formal semantics of programming languages, including a thorough formal treatment of algebraic datatypes can be found in [Winskel1993]. We are highly convinced that modern developments in the field of formal semantics can be fruitfully applied to machine learning, too, as our work attempts to show. Previous attempts of introducing such concepts can be traced to e.g. [Flach2000] and [Bowers et al.2000], who try to integrate them with inductive logic and functional programming (ILP and IFP).

In contrast to their work we focus on algebraic datatypes, which have the convenient property that their values can only be constructed in a unique, unambiguous way. Thus, we can derive explicit notions of information contents and entropy for them in later parts of this work. The generalization of decision tree learning, which we will later describe, is essentially a step towards inductive functional programming (IFP), but the generalized data representation itself is oblivious to the style of induction as it is to the general-purpose programming languages mentioned above. It may therefore as well be usefully applied within the framework of inductive logic programming (ILP).

2.2.1  Meals that make a difference

The first concept necessary for structuring information is a facility that allows us to make a difference between values we are concerned about, i.e. being able to partition the set of all values into subsets. Without being able to distinguish between values, we could never gain information: all would be the same to us.

We can represent this partitioning by labeling values that belong to a specific domain with a unique name (tag) or, in other terms, by forming a discriminated union5 of the partitions of values in question. This indispensable concept is necessarily present in all representations, though most often in a limited form.

The limitation that often applies as, for example, in the mentioned attribute-value representation that is commonly used in data-mining, is that the discriminated union is only allowed for atomic values, i.e. ones that cannot be further described. These atomic values are exactly identified by their tag alone. E.g. we only say Pizza if it is not of interest or cannot be known whether the value in question has further attributes like PizzaTopping that distinguish it from others.

If, however, the value can be distinguished from others that otherwise share some property, we will have to mention the distinguishing factors. For instance, we might want to revise our representation of meals as follows:

  data Meal = Pizza PizzaTopping
            | WienerSchnitzel Diameter
            | TapirSoup Spiced

  data PizzaTopping = Cheese | Tomatoes | Mushrooms
  type Diameter     = Float
  type Spiced       = Bool
  data Bool         = False | True

Each of the tags Pizza, WienerSchnitzel and TapirSoup labels a value that is described by another attribute (another set of values). This way we can discriminate between meals more precisely. This other attribute may again be a discriminated union of sets (types) of values as in the data definition of PizzaTopping or equivalent6 to some other type, e.g. the floating point numbers Float7. The following are examples of values legal with respect to our upper specification:

  Pizza Mushrooms
  WienerSchnitzel 23.7
  TapirSoup True

Note that we can refine attributes even more deeply as with the following alternative definitions:

  data PizzaTopping = Cheese CheeseSort | Tomatoes | Mushrooms
  data CheeseSort   = Mozzarella | Gorgonzola

With this definition an example for a legal value would be:

  Pizza (Cheese Mozzarella)

Such nested discriminated unions also make data specifications much more modular: we can refine any part without having to change any of the other definitions as we would have to when applying the clumsy tag replacement workaround mentioned previously.

2.2.2  Products of ingredients allow for varied meals

Because we will often need more than one attribute to specify the detailed properties of some value, we will also require the concept of (Cartesian) products of sets (types) of values8. For example, the diameter of a Pizza may also yield valuable information about customer satisfaction. Thus, we could define meals as follows (the other definitions remain the same):

  data Meal = Pizza Diameter PizzaTopping
            | WienerSchnitzel Diameter
            | TapirSoup Spiced

A “legal” meal could then be:

  Pizza 22.86 (Cheese Gorgonzola)

Such products of types (here of Diameter and PizzaTopping) greatly increase the number of degrees of freedom we have to describe some value (e.g. some meal).

If we take a look at the combination of the two previously introduced concepts (sums and products), we notice that the commonly used representation of data, namely the attribute-value representation, is just a special case: ordinary decision tree learners require data represented as a product of non-structured sums, the product being the Cartesian product of all attributes (variables), the sum being used to collect values into the various attributes. Therefore, our proposal, which allows sums and products at any level to create structured values, is a true generalization of the more common representation.

What we would like to know now is how models that map between such data representations can look like, i.e. what kind of functions an extended machine learning system could induce. Let’s take this example:

  case meal of
    TapirSoup _ → High
    Pizza diameter topping →
      if diameter < 10 then Low
      else
        case topping of
          Cheese cheese_sort →
            case cheese_sort of
              Mozzarella → Low
              Gorgonzola → High
          Tomatoes → Low
          Mushrooms → High
    WienerSchnitzel _ →
      case drink of
        Water → Low
        Beer → High
        Wine → Low

We consider the case when a Pizza is served to explain how this model should be interpreted: its associated attributes Diameter and PizzaTopping are bound to the variables diameter and topping respectively. The right hand side of this case may now use these variables if required. Here, the diameter is compared to the value 10 in a conditional expression. If the pizza is too small, the resulting value of type Satisfaction will be Low. Otherwise it depends on the PizzaTopping, whether the customer will be satisfied. In the case of a Cheese-topping, we will also have to consider what sort of cheese we are observing.

Sometimes an associated attribute may be irrelevant to explain customer satisfaction. In our example, one can see this in the case of TapirSoup and WienerSchnitzel: their associated attributes are not considered, which is indicated by the dummy variable “_”, which can be used if some value should not be bound to an identifier.

The main concept behind the interpretation of our model is usually called pattern matching, because the data which we are analyzing is matched against patterns that contain variables. The henceforth bound variables can then be used on the right hand side of patterns for further computations. As can be hopefully seen, these kinds of models are an extremely expressive vehicle for describing functional relations between data structures.

2.3  Unbounded customer satisfaction with inductively defined data structures

Though algebraic datatypes (sum of products types) as presented so far are already fairly expressive, there is another simple concept that can be applied to increase expressiveness — even infinitely!

Our restaurant owner’s waitress might come to the conclusion that a simple-minded classification of customer satisfaction into Low and High may not be sufficient to reflect reality appropriately. What is missing so far is the possibility to speak about various degrees of satisfaction. Assuming that human satisfaction can be unbounded in both directions and given that people usually do not speak about their satisfaction in numeric terms, we need a discrete representation that can encode structured values of unlimited size.

The clever waitress therefore advises her boss to revise the definition of Satisfaction as follows:

  data Satisfaction = Low | High | Very Satisfaction

This definition differs from all previously presented ones as it mentions its type constructor9 Satisfaction not only on the left but also on the right hand side. This still leaves Low and High as values of satisfaction, but also allows for unbounded degrees like Very Low or Very (Very High), etc. Achieving this kind of unbounded expressiveness with usual encodings is impossible as was argued in section 2.1. Note that mutually recursive definitions of datatypes are also possible, i.e. the type constructor of one definition appears on the right hand side of some other definition and vice versa.

As a side note, it should be pointed out that there are always only finitely many models (functions) for non-recursive data structures, but (usually) infinitely many for recursive ones. More precisely, recursive data structures on the right hand side and non-recursive ones on the left hand side of a classification task usually imply a countable infinity of possible models, while in the opposite case there are even uncountably many ones.

Search space sizes for problems can be easily calculated by treating the sum operator in type definitions as arithmetic sum, the product as arithmetic product and the function type constructor (the “arrow”) as exponentiation, while data constructors are interpreted as the unit element. This is useful for getting an idea about the complexity of the learning task.

2.4  Partial knowledge of tastes

Since our new definition of Satisfaction also imposes structure on result (class) values, it may be interesting to point out another property we can exploit for machine learning: it is possible to only partially predict result values!

Let’s assume that exotic meals like TapirSoup will lead to “very” different degrees of satisfaction among customers. Some will like the exotic taste a lot, others might feel great disgust. So the satisfaction value might be either Very High or Very Low, but the argument to Very may not be clear in advance. Some machine learning system could therefore indicate to the user that the potential outcome of consumption will be extreme by predicting Very followed by a (possibly more biased) classification of the remaining structure. This may yield valuable information for risk-averse restaurant owners who dislike extremes and prefer easily predictable customer satisfaction. The commonly used atomic class values would not allow us to explicitly represent this kind of knowledge in a model.

Of course, this partial classification also applies to products of values: some parts of a product may already be uniquely predictable, whereas others still require further learning to be distinguishable.

An example combining both aspects of partial classification would be the following, where a pair (Cartesian product) of satisfaction values (e.g. for a couple having dinner) are predicted simultaneously:

  let v1 =
    case drink of
      Water → Low
      Beer → High
      Wine → High
  in
  (Very v1, Low)

This might indicate to the restaurant owner that men (left element of the pair) always tend to extreme levels of satisfaction, which, however, depend on the kind of drink, whereas their spouse would be mildly discontent in any case.


Submitted to: Machine Learning Journal, Special Issue on Inductive Logic Programming and Relational Learning
Copyright   ©  2003 Kluwer Academic Publishers
Author: Markus Mottlmarkus@oefai.at⟩
Previous Up Next