A theory of the learnable

A theory of the learnable - Valiant, CACM 1984


Today’s paper choice comes from the recommended study list of Prof. Margaret Martonosi’s 2015 Princeton course on “Great moments in computing.” A list I’m sure we’ll be dipping into again!

There is a rich theory of computation when it comes to what we can calculate. But as of 1984, there was no corresponding theory of what computers can learn. (We’re still debating theories of learning today.) What is learnable by a computer, and what is not? And how should we think about this question?

The problem is to discover good models that are interesting to study for their own sake and that promise to be relevant both to explaining human experience and to building devices that can learn. The models should also shed light on the limits of what can be learned, just as computability does on what can be computed.

Forget deep learning, neural networks and so on for a moment. Valiant is striving for a deeper level of understanding, something fundamental that helps explain the limits of learnability, regardless of your particular learning algorithm of choice. We’ll therefore need a suitably abstract definition of learning, and Valiant provides us with this:

… we shall say that a program for performing a task has been acquired by learning if it has been acquired by any means other than explicit programming.

For the purposes of the paper though, Valiant narrows the problem space down to learning to recognise whether or not a given set of data exemplifies a given concept. In other words, what can we say about binary classifiers?

We will have two parts to our learning machines: a learning protocol, which specifies the manner in which information is obtained from the outside, and a deduction procedure, by which a correct recognition algorithm for the concept to be learned is deduced.

The learner must have access to a supply of data, for which Valiant provides an EXAMPLES routine that the learner may call at will. Each invocation produces one positive example at random, drawn from the underlying (but unknown) distribution of examples in the problem domain. That sounds a bit like supervised learning of course, except that we only ever see examples of the positive class. However, we can make up our own data and present it to an ORACLE routine, which will tell the learner whether or not the data positively exemplifies the concept.
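To make the protocol concrete, here is a minimal Python sketch. Everything in it is a hypothetical stand-in for the paper’s abstractions: vectors are 0/1 tuples, f is the target concept, and positive_examples/weights encode the hidden distribution D.

    import random

    def make_protocol(f, positive_examples, weights):
        """Build the two routines of the learning protocol for a target
        concept f. EXAMPLES draws a positive example (a vector v with
        f(v) = 1) according to the hidden distribution D, encoded here
        by `weights`; ORACLE labels any vector the learner constructs."""
        def EXAMPLES():
            return random.choices(positive_examples, weights=weights)[0]

        def ORACLE(v):
            return f(v) == 1

        return EXAMPLES, ORACLE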

One nice consequence of formulating the problem this way is that human teachers can easily play the role of ORACLE. We may not be able to describe exactly how (for example) we deduce that we’re looking at an image of a cat, at least not in a way that a computer could reproduce our reasoning, but we certainly know an image of a cat when we see one. One way of generating labelled data today, of course, is to use crowd-worker oracles!

Under these conditions, Valiant shows that it is possible to design learning machines with all three of the following properties:

  1. The machines can provably learn whole classes of concepts. Furthermore, these classes can be characterised.
  2. The classes of concepts are appropriate and nontrivial for general-purpose knowledge.
  3. The computational process by which the machines deduce the desired programs requires a feasible (i.e., polynomial) number of steps.

That’s an impressive beginning for a theory of learnability! Somewhere though, Valiant suggests, there is a limit to what can be learned in this way, beyond which we might need a programmer to stitch together learned building blocks. The results of a fully developed learnability theory would then tell us the maximum granularity of the single concepts that can be acquired without programming. Empirically, we continue to push that boundary forward today.

The setup: learning Boolean functions

Without loss of generality, Valiant considers learning functions of Boolean variables. There is no assumption of independence between these variables, and I can encode anything I like this way. For example, I can have a Boolean variable which indicates whether or not a certain input feature has the value “3”. I can also have a Boolean variable that is a function of other variables. If we have t such Boolean variables, then there are 2^t combinations. There is a function f from these combinations to {0,1} (indicating whether or not a particular combination is an instance of the category we want to learn to detect).
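For instance, a throwaway encoding of my own (not the paper’s) that turns a record with arbitrary features into Boolean variables:

    def encode(record):
        """One indicator variable per interesting (feature, value) pair,
        plus a variable that is a function of the other two."""
        size_is_3 = int(record["size"] == 3)
        colour_is_red = int(record["colour"] == "red")
        both = size_is_3 & colour_is_red   # derived from the others
        return (size_is_3, colour_is_red, both)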

Now consider a function F whose domain is a superset of f’s, and for which x \in \mathrm{dom}\ f \Rightarrow F(x) = f(x). (I.e., it gives us some additional variables on which f does not depend). This function F is what Valiant calls a concept.

Underlying the concept F is a probability distribution D over the set of input vectors v for which F(v) = 1: each such D(v) \geq 0, and together the D(v) sum to one.

(When we are given examples, each variable may take on the value 1, 0, or ‘unknown’).

A more precise definition of learnability

We can now say more precisely what Valiant means by ‘learnability.’

A class X of programs is learnable with respect to a given learning protocol if and only if there exists an algorithm A (the deduction procedure) invoking the protocol with the following properties:

  1. The algorithm runs in polynomial time in an adjustable parameter h, in the various parameters that quantify the size of the program to be learned, and in the number of variables t.
  2. For all programs f \in X and all distributions D over vectors v on which f outputs 1, the algorithm will deduce with probability at least 1 - h^{-1} a program g \in X that never outputs one when it should not, but outputs one almost always when it should. (The sum of D(v) over all v where f(v) = 1 and g(v) \neq 1 is at most h^{-1}; see the restatement below.)
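Restating condition 2 compactly in the notation above (alongside the requirement that g(v) = 1 \Rightarrow f(v) = 1, i.e., no false positives):

\displaystyle \Pr\left[\sum_{v :\, f(v) = 1,\ g(v) \neq 1} D(v) \leq h^{-1}\right] \geq 1 - h^{-1}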

The problems considered in the paper do not require “two-sided error” (permitting both false positives and false negatives), but the definition extends naturally to that more general case.

Valiant’s framework has come to be known as probably approximately correct learning (PAC learning).

A combinatorial upper bound

Valiant gives results for conjunctive normal-form (CNF) expressions, disjunctive normal-form (DNF) expressions, and arbitrary expressions in which each variable occurs just once. In all cases the probabilistic analysis depends on a lemma about a function L(h,S), defined for any real number h greater than one and any positive integer S.

Let L(h,S) be the smallest integer such that in L(h,S) independent Bernoulli trials each with probability at least h^{-1} of success, the probability of having fewer than S successes is less than h^{-1}… The following simple upper bound holds for the whole range of values of S and h and shows that L(h,S) is essentially linear both in h and in S.

\displaystyle L(h,S) \leq 2h(S + \ln h)
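The bound is trivial to evaluate; here is a throwaway helper (mine, not the paper’s) to get a feel for the numbers:

    import math

    def L_upper_bound(h, S):
        """The lemma's bound 2h(S + ln h) on L(h, S): enough Bernoulli
        trials, each succeeding with probability at least 1/h, that
        fewer than S successes occur with probability below 1/h."""
        return math.ceil(2 * h * (S + math.log(h)))

    # e.g. the EXAMPLES budget for k-CNF learning below, L(h, (2t)^(k+1)),
    # with h = 10, t = 8 variables, and k = 1:
    print(L_upper_bound(10, (2 * 8) ** 2))   # -> 5167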

Conjunctive normal-form expressions

A k-CNF expression is a CNF expression in which each clause is the sum of at most k literals; for example, (x_1 + \bar{x}_2)(x_2 + x_3) is a 2-CNF expression over three variables.

Key result:

For any positive integer k, the class of k-CNF expressions is learnable via an algorithm A that uses L = L(h, (2t)^{k+1}) calls of EXAMPLES and no calls of ORACLE, where t is the number of variables.

A quick sketch of the algorithm, which works a bit like a sieve, is as follows: start off with a formula g that contains the product of all possible clauses of up to k literals. Ask for L examples, and for each provided example strike out all the clauses in g not determined to be true by the example. “The claim is that with high probability the value of g after the Lth execution of this block will be the desired approximation to the formula f that is being learned.”
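Here is a minimal Python sketch of that sieve, under the same representational assumptions as the earlier snippets (the helper names are mine): a literal is an (index, polarity) pair, a clause is a set of literals, and g is a set of clauses standing for their product.

    import itertools

    def all_clauses(t, k):
        """Every possible clause of at most k literals over t variables.
        The literal (i, True) is x_i; (i, False) is its negation."""
        literals = [(i, pol) for i in range(t) for pol in (True, False)]
        return {frozenset(c)
                for size in range(1, k + 1)
                for c in itertools.combinations(literals, size)}

    def clause_true(clause, v):
        """Is this clause (a sum of literals) satisfied by example v?"""
        return any((v[i] == 1) == pol for i, pol in clause)

    def learn_k_cnf(t, k, EXAMPLES, L):
        """Start with g as the product of all clauses of up to k
        literals, then strike out every clause not made true by a
        positive example."""
        g = all_clauses(t, k)
        for _ in range(L):
            v = EXAMPLES()
            g = {c for c in g if clause_true(c, v)}
        return g   # with high probability, a good approximation to f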

Disjunctive normal-form expressions

A monotone DNF expression is one in which no variable is negated (e.g., x_1 x_2 + x_3).

Key result:

The class of monotone DNF expressions is learnable via an algorithm B that uses L = L(h,d) calls of EXAMPLES and dt calls of ORACLE, where d is the degree of the DNF expression f to be learned (the number of monomials in it) and t the number of variables.

(The algorithm begins with an ‘empty’ function g, and with each successive positive example v adds just enough to g to make g(v) = 1.)
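A sketch of that procedure in Python, on the same assumptions as before (monomials as frozensets of variable indices; the shrink-to-a-minimal-example step is my reading of “adds just enough”):

    def learn_monotone_dnf(t, EXAMPLES, ORACLE, L):
        """g is a set of monomials, each a frozenset of variable indices
        whose product forms one term of the monotone DNF."""
        g = set()   # the 'empty', identically-false function

        def g_value(v):
            return any(all(v[i] == 1 for i in m) for m in g)

        for _ in range(L):
            v = list(EXAMPLES())     # a positive example, 0/1 values
            if g_value(v):
                continue             # g(v) = 1 already; nothing to add
            # Shrink v towards a minimal positive example: switch off
            # each set bit in turn, keeping it off whenever ORACLE says
            # the result is still positive (at most t calls per example).
            for i in range(t):
                if v[i] == 1:
                    v[i] = 0
                    if not ORACLE(tuple(v)):
                        v[i] = 1     # this variable is essential
            g.add(frozenset(i for i in range(t) if v[i] == 1))
        return g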

μ-expressions

A μ-expression is an expression in which each variable appears at most once. We can assume monotone μ-expressions (no negations), since we can always relabel negated variables with new names that denote their negation.

Given a slightly more powerful oracle (one that can answer questions of the form “does there exist an input vector w such that…”), μ-expressions are learnable in O(t^3) calls to the oracle.

Scratching the surface

I feel like I’ve only scratched the very surface of what Valiant has to tell us in this paper. The work is credited with launching the field of computational learning theory. Valiant also published a subsequent book in 2013, ‘Probably Approximately Correct,’ written for the lay person.