The Problem of Induction and Machine Learning

An exchange with Bruce Nielson

Because of my fight against induction I have sometimes been described as an irrationalist. But if it is true that there is no induction, then it is irrational to continue to have faith in induction.

Who is more rational? The believer in generalizing induction, or the critic who tries hard to anticipate, and to eliminate, the mistakes that may lurk in his conjectures - and who warns the scientist that his admirable and growing scientific knowledge remains uncertain conjectural knowledge? Is it irrational to teach the distinction between truth and certainty? To say that we ought to search for truth, which we may discover, though we may never be certain of it?

- K.P. The Non-Existence of Probabilistic Inductive Support, p. 316


Bruce Nielson and Daniel C. Elton have recently uploaded a paper to the arXiv titled Induction, Popper, and Machine Learning (hereafter abbreviated IPML), which just so happen to be three subjects I’m intensely interested in. Bruce and I thought it would be fun to have a friendly “adversarial collaboration”, wherein I’ll critique his paper on my blog, and he’ll respond to my critique on his.

For readers unfamiliar with Bruce Nielson, he hosts The Four Strands blog and The Theory of Anything podcast, both of which are great. In particular, check out his piece on determinism, which points out the all-important distinction between determinism and predictability (something even Popper missed), and his masterfully argued piece on animal suffering.

Meanwhile, Dan Elton is doing important work on FDA reform over on his More is Different blog (among other interests). I particularly enjoyed his piece titled A list of possible FDA reforms, where he provides a list of practical and actionable improvements to the FDA drug approval process (and pairs nicely with Scott Alexander’s piece Adumbrations Of Aducanumab).

And with that, let the critiques begin!

Induction From Scratch

There seems to be a significant amount of confusion in the twitter-verse about what exactly we mean when we speak of “The Problem of Induction” (TPoI), so let’s start from the beginning.

The clearest formulation of TPoI I’ve seen comes from a single paragraph buried in a not-very-well-known paper Popper and Miller wrote in 1983, titled The Non-Existence of Probabilistic Inductive Support. I’ve included the paragraph itself in a footnote1 below, and will rewrite it here with added exposition and updated notation. (Note that I’m lifting some text from my original rewrite of Popper and Miller’s proof, available here).

Logical Entailment

We first introduce the concept of logical entailment. Given two statements $x, y$, we write

\[\begin{aligned} x &\vdash y \end{aligned}\]

if $x$ entails $y$. For example, if

\[\begin{aligned} &x =\text{All men are mortal}\\ &y =\text{Socrates is a man}\\ &z =\text{Socrates is mortal} \end{aligned}\]

we can write $(x \wedge y) \vdash z$, which can be read “$x$ and $y$, therefore $z$”. Another example is equation 2 in IPML:

\[\begin{aligned} &S1 \rightarrow White\\ &\neg White\\ &\therefore \neg S1 \end{aligned}\]

which can be rewritten2 using the turnstile symbol $\vdash$ as:

\[\begin{aligned} &x =\text{S1 is white}\\ &y =\text{That bird is white}\\ &z =\text{That bird is S1}, \end{aligned}\]

where

\[(x \wedge \lnot y) \vdash (\lnot z).\]

This can be read “$x$ and not $y$, therefore not $z$”.
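Since entailment between propositional statements is just a truth-table property, we can check it mechanically. Below is a brute-force sketch in Python (the helper `entails` and the lambda encodings are mine, for illustration only):

```python
from itertools import product

def entails(premise, conclusion, n_vars):
    """Semantic entailment by brute force: premise |- conclusion iff no
    truth assignment makes the premise true and the conclusion false."""
    for vals in product([False, True], repeat=n_vars):
        if premise(*vals) and not conclusion(*vals):
            return False  # countermodel found: entailment fails
    return True

# IPML's equation 2 (modus tollens): (S1 -> White) and (not White) |- not S1
premise    = lambda s1, white: ((not s1) or white) and (not white)
conclusion = lambda s1, white: not s1
print(entails(premise, conclusion, 2))  # True: truth moves to the conclusion

# Non-entailment: assuming x and y tells us nothing about an unrelated z
premise2    = lambda x, y, z: x and y
conclusion2 = lambda x, y, z: z
print(entails(premise2, conclusion2, 3))  # False: a countermodel exists
```

The first check is the modus tollens above; the second shows that assuming $x$ and $y$ buys us nothing about a statement $z$ they don’t entail.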

But what does “entailment” really mean? I like to think of it as an operation which allows us to “move truth around”, from one sentence over to the next. If $x$ “entails” $y$, what we’re saying is, if we’re willing to assume sentences on the left of the turnstile are true (which we will represent with the color green)

\[\begin{aligned} \textcolor{green}{(x \wedge y)}\vdash z, \end{aligned}\]

then we automatically know sentences on the right of the turnstile3 are true,

\[\textcolor{green}{(x \wedge y)\vdash z} .\]

In this way, we have moved the property is true from $(x \wedge y)$ over to $z$4.

Synonyms or near-synonyms for “logical entailment” you’re likely to encounter are: “logical consequence”, “derived”, “deduced”, “follows from”, “logically valid”, “necessarily true”, “necessary truth”, “thus”, “logical conclusion”, “therefore”, “hence”, etc. (And when Popper uses the word “logical”, this is what he has in mind - similarly with Deutsch and “derive”.)

Now consider an example where we don’t have logical entailment:

\[\begin{aligned} x =~&\text{All men are mortal} \\ y =~&\text{Socrates is a man} \\ z =~&\text{Water is comprised of two parts hydrogen}\\ ~&\text{ and one part oxygen.} \end{aligned}\]

True as \(z\) may be, it isn’t entailed by \((x \wedge y)\), so assuming the truth value of $x$ and $y$ won’t give us any additional insight into the truth value of $z$ - we have to get that from elsewhere:

\[\textcolor{green}{(x \wedge y)} \nvdash z .\]

Formalizing The Problem

Once we have the turnstile symbol in hand, The Problem Of Induction becomes trivial. We have a “general law” statement $G$, which is assumed to hold across time and space, and we have a set of observed evidence statements $e_1, e_2, …, e_N$, which are assumed to be true in a particular spatio-temporal region.

We refer to the conjunction of the observed evidence as $E = (e_1 \wedge e_2 \wedge \ldots \wedge e_N)$. By definition, the general law $G$ logically entails each of the evidence statements

\[\begin{aligned} &G \vdash e_i &\forall e_i \in E. \end{aligned}\]

For example, consider evidence statements

\[\begin{aligned} &e_1 = \text{All observed swans in Austria in 1796 are white}\\ &e_2 = \text{All observed swans in Austria in 1797 are white}\\ &... \end{aligned}\]

And consider the generalization

\[\begin{aligned} &G=\text{All swans are white.} \end{aligned}\]

Then $G \vdash E$ is logically valid, and referred to as deductive inference.

(That is, truth can flow “downhill”, from the general $G$ to the specific $E$.)

However, $E \vdash G$ is logically invalid, and referred to as inductive inference.5

(That is, truth cannot flow “uphill”, from the specific $E$ to general $G$. )

This is the classical Problem of Induction.6

A brief public service announcement

The problem of induction states that no matter how many evidence statements $e_i$ we have, truth can never flow from the specific $E$ to general $G$. However, if you see a bunch of white swans, the problem of induction does not prohibit you from:

- Noticing that every swan you’ve seen so far has been white
- Conjecturing that all swans are white
- Assuming the next swan you see will be white
- Reasoning that it probably will be

These are all completely fine. What’s wrong with them? Nothing - because none of them claims that the conclusion is logically entailed by the evidence.

For example, if I say “Susan is going to be late coming into work today, she has been late in the past”, this is a perfectly reasonable argument to make. Sure, “Susan will be late today” doesn’t logically follow from “Susan has been late in the past”, but who is claiming it needs to? We rarely have, and don’t require, logical entailment between all clauses of an argument, because that would be completely stifling, and would make conversation impossible. And if we did require it, it would no longer be called “an argument”, it would be called “a proof”.

(Note this clears up the whole “you can’t get an ought from an is” thing too - by “get”, Hume meant logical entailment. Nothing prevents us from conjecturing an “ought” from an “is”.)

The problem of induction and machine learning

With the setup out of the way, we can now turn to the paper. I claim that a misunderstanding about TPoI has led Bruce to see a problem where none exists. (VM: Or perhaps not, see below for a response from Bruce.)

The problem he’s interested in is: Machine learning is induction, but machine learning works really well. But Popper disproved induction. So did Popper disprove machine learning, or does machine learning disprove Popper?

Popper claimed to have refuted the idea that induction provides a foundation for knowledge…Machine learning is also taught as being rooted in induction. Given the success of machine learning, does this mean Popper was wrong that induction is a refuted theory?

Induction, Popper, and Machine Learning, p.1 (emph. added, here and throughout)

In Section 2 they describe the problem of induction:

However, deductive logic does not allow us to generalize from a specific statement like this. The fact that [the first swan] is white does not allow us to assume that [the next swan] will also be colored white.

But what if we see hundreds or even thousands of swans and all of them are white? Is there some point at which we can rightly assume that we can now logically reason that all swans are white? In other words, is it valid to reason

\[\begin{aligned} x & = \text{Observed swans 1-1000 are white} \\ y & = \text{All swans are white}\\ x & \vdash y~?\\ \end{aligned}\]

The supposed ability to reason from specific statements to universal statements is the method of induction.

Induction, Popper, and Machine Learning, p.1-2 (notation changed for consistency with above.)

Do you see what happened there? Logical entailment has gotten mixed up with reasoning itself. The mistake persists into the next paragraph:

However, Hume pointed out that no matter how many specific statements we observe, we are never justified in reasoning to a universal statement.[7] And, in fact, the discovery of actual black swans showed that this was the case. This is because it is logically invalid to ever reason from a specific statement to a universal statement.

Induction, Popper, and Machine Learning, p.2

Logical entailment and reasoning are not the same thing. The problem of induction does not prohibit us from reasoning - it’s simply a statement about the limitations of logical entailment. From wiki:

Reason and logic can however be thought of as distinct, although logic is one important aspect of reason. Author Douglas Hofstadter, in Gödel, Escher, Bach, characterizes the distinction in this way: Logic is done inside a system while reason is done outside the system by such methods as skipping steps, working backward, drawing diagrams, looking at examples, or seeing what happens if you change the rules of the system.

The problem, I think, is the use of the word “justified”. When Hume spoke of “justification”, he meant deductive logic - i.e logical entailment. And when he asked if such-and-such is “justified”, like going from an “is” to an “ought”, or from the specific to the general, what he meant was: “Is there any way to have factual statements logically entail moral statements?” or “Is there any way to have evidence statements logically entail general law statements?”

(Answer: No, but that’s okay, because we can conjecture instead).

But over time this has been forgotten, and people fill in their own meaning of the word “justified”, and conflate logical entailment with assuming, reasoning, and all sorts of other concepts which aren’t logical entailment. Of course we’re justified in reasoning towards general laws - how else are we supposed to get them?

Did Popper refute induction?

How can machine learning’s inductive roots be squared with Popper’s refutation of Induction?

Induction, Popper, and Machine Learning, p.4

You’ll often hear that “Popper refuted induction” or “Popper solved the problem of induction” or that “Popper showed induction is false” or “Popper showed induction is a myth” or a hundred other variants of this claim. While all sorta-true, phrasing like this is liable to conflate concepts like logical entailment, reasoning, and justifiable assumptions into a big muddy mess, as the emphasized sentences in the above quotation highlight. Forget that all important adjective “logical”, and you’re just a few steps away from thinking Popper disproved reasoning itself.

To avoid confusion, I suggest a better way to think about it is: Popper realized there was no problem of induction, and that the conjectural-deductive approach to knowledge was just fine. Guess a general law $G$, deduce the logical consequences $E$, and check to see if they are falsified by experiment, argument, or proof.

In symbols, Popper realized7 that while it is impossible for truth to flow from specific to general,

\[\begin{aligned} &\left(... e_{i-1} \wedge e_{i} \wedge e_{i+1} ... \right) \nvdash G, & &~~~~~(1)\\ \end{aligned}\]

what is possible is for falsity to flow from specific to general:

\[\begin{aligned} &\left(... e_{i-1} \wedge \neg e_{i} \wedge e_{i+1} ... \right) \vdash \neg G, & &(2)\\ \end{aligned}\]

and that this is the basis for all science. (See footnote here8 for a plain English translation of these equations.)
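The asymmetry in equations (1) and (2) is easy to dramatize in code (a toy sketch; the function `refuted` and the swan encoding are mine):

```python
# G = "All swans are white", encoded as a predicate over observations.
def refuted(general_law, observations):
    """Equation (2): a single contradicting observation entails not-G."""
    return any(not general_law(obs) for obs in observations)

G = lambda colour: colour == "white"

confirming = ["white"] * 10_000            # ten thousand white swans
print(refuted(G, confirming))              # False: G survives, yet remains unproven (eq. 1)

print(refuted(G, confirming + ["black"]))  # True: falsity flows from specific to general
```

No run of the first kind can ever flip $G$ to “proven true”; a single observation of the second kind flips it to “proven false”.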

Rather than thinking of Popper as the person who “refuted induction”, it’s a bit more accurate to think of him as the person who argued against the dominant cultural orthodoxy of his time. This was an orthodoxy called Empiricism, the theory that all knowledge is derived - note the word, cf. above - from the senses.

“Empiricism”, read Bertrand Russell to the Aristotelian Society at 55, Russell Square, London, W.C.1, on April 6th, 1936, at 8 p.m.:

is “the theory that all knowledge is derived from sense experience.” Accepting this definition, three questions arise before we can discuss whether empiricism is true or false. We must ask what is meant by “knowledge,” what by “derived from,” and what by “sense experience.”

Russell, The Limits of Empiricism, p. 1 (emph. added)

Popper, who was present in the audience that day, recounts the rest of the story in Realism and the Aim of Science (see footnotes for added commentary):

Believing that our empirical knowledge was obtained by induction9, and deeply impressed by the force of Hume’s criticism10, Russell suggested that we had to assume some principle of induction,11 which could not be based upon induction in its turn.

Realism and the Aim of Science, p. 12, (cont. below, emph. added throughout)

If we were to be unkind for pedagogical purposes, we might say that Russell was trying to use his “principle of induction” to weasel his way out of what Hume showed to be logically impossible. Popper, unimpressed, voiced his concerns, and was politely laughed at by all:

Having been invited to participate in the discussion, I said that I did not believe in induction at all… This statement (which I formulated as pointedly as I could manage with the little English at my disposal) was taken for a joke by the audience, who graciously laughed and clapped. I then suggested that the whole trouble was due to the mistaken belief that scientific knowledge was an especially strict or certain or august kind of knowledge. This statement met with the same reception as the first.

The “Problem of induction” is simply that everyone at the time was sure that knowledge (i.e absolutely true and certain general law \(G\)) was derived (i.e logically entailed) from the senses (i.e observed evidence statements \(E\)), but Hume showed this was logically impossible, and that was the problem.

This would be like if everyone believed in miracles, but then David Hume came around and showed that miracles were logically impossible, which created “The Problem of Miracles”, which Russell and a bunch of other philosophers tried to get out of by assuming “A Principle of Miracles”, which Popper argued against for his entire career, and became known as “the guy who refuted miracles”.

We should also remember that what Popper was challenging wasn’t some esoteric philosophy held by only a few fancypants philosophers - it was the very dictionary definition of the word “know”, used in implied agreement by every English speaker on the planet:

I concluded with an attempt to explain that, in the usual sense of ‘know’, whenever I know that it is raining, it must be true that it is raining; for if it is not true, then I simply cannot know that it is raining, however sincerely I may believe that I know it. In this sense of the word, ‘knowledge’ always means ‘true and certain knowledge’… But, I said, there was no such thing as scientific knowledge in this sense.

The audience was confused, baffled. What sort of priggish snob could say such an awful thing?

My little speech was well received, which was gratifying, but for the wrong reasons, which was puzzling. For it had been taken as an attack upon science, and perhaps even as an expression of a somewhat superior attitude towards it.

Admittedly, I had attacked, by implication, Science with a capital ‘S’, and those of its devotees, who were ready to take its pronouncements as gospel truth. … Thus the dismissal of Science with a capital ‘S’ although implicit in what I had said, had not been my main point at all.

Rather, what I had hoped to convey to the audience was this: if we assume that what is called ‘scientific knowledge’ consists only of guesses or conjectures, then this assumption is sufficient for solving the problem of induction.

This, my main point, was lost.

What Popper did wasn’t “refute induction”, what he did was demolish in writing every single counterargument against his position he could find, and then some extra ones, just to be safe - and in so doing, birthed the scientific method as we know it today.

(Counterarguments like: “Have you considered using the probability calculus!” or “But we actually care about certainty, not truth!” or “Maybe the probability calculus will give us a justification!” or “Yes but have you considered assuming the future will be like the past!” or “How about we all just try using the probability calculus again!” or … and then he died, of exhaustion).

So, in summary:

So what exactly did Popper refute?

Not induction - Hume did that - but all the various attempts to get around Hume’s refutation.

Then why is it common to hear that “Popper refuted induction”?

Because he took on a dogma, and became known as the guy who “refuted” induction. Karl Popper “refuted” induction like Christopher Hitchens “refuted” religion.

What about machine learning? Refuted by Popper?

No, and in Realism, Popper explains this himself. Writing of an “inductive machine”, a “simplified universe”, and “individual events with certain properties” (which we would now call “a computer”, “a dataset”, and “a feature vector”), he says:

13. An Inductive Machine.

In spite of my objections to the inductive theory of learning, I believe in the possibility of inductive machines — of a certain kind. In order to see the possibility of an inductive machine, we consider a simplified universe which the machine will have to investigate. The universe may consist, for example, of ‘individuals’ or ‘individual events’, with a certain limited number of properties. We choose for our individuals balls of one inch diameter. An individual event may consist in the appearance of a ball at the end of a pipe; it then rolls down a groove and disappears into a hole. There is to be such an individual event every second.

The balls may have certain properties. They may be of steel or of brass, and they may be painted in a number of different colours. We now assume that our artificial universe may be operated in accordance with a number of ‘natural laws’. For example, we may so adjust it that steel balls always come in sequences of three, and after these a longish but varying sequence of copper balls; or we may so adjust it that, out of each hundred steel balls, 10 balls are blue, 20 green, 30 red, and 40 yellow: this would be a statistical ‘law’. Other ‘laws’ can be arranged.

The induction machine will have to be equipped with a detector (perhaps a magnetic needle) which allows it to distinguish between steel and copper balls, and another detector for colours. Moreover, it will have to be equipped with counting devices. It will have to be able to draw up statistics of the various distinguishable occurrences, and to calculate averages. If it is further adjusted to the kind of law (not to the actual laws) which the universe may exhibit—laws of succession, general or conditional frequencies of a certain stability, etc.—then it may be so constructed as to be quite efficient, for example, in formulating hypotheses and in testing and eliminating them. Thus it may detect simple regularities in the universe without any difficulty: it may thus learn from experience.

Realism and the Aim of Science, p. 320, (emph. both his and mine, see original)

Then comes the hammer blow:

If a machine can learn from experience in this way, that is, by applying the simple inductive rule, is it not obvious that we can do the same? Of course I never said that we cannot learn from experience. Nor did I ever say that we cannot successfully use the simple inductive rule—if the objective conditions are appropriate. But I do assert that we cannot find out by induction whether the objective conditions are appropriate for using the simple inductive rule.

If our machine is successful in its stage (i), then it is so because we have constructed our ‘world’ in such a way that the simple inductive rule can be successfully applied to it.

Realism and the Aim of Science, p. 322, (ditto)

So the reason why machine learning doesn’t conflict with Popper/Hume’s refutation of induction is that the datasets have been specifically chosen such that induction in this limited setting is possible. This is how we square machine learning’s inductive roots with Popper’s refutation of induction. One last question:
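Popper’s ball universe is small enough to sketch directly (a loose reconstruction in Python; the structure and all names are mine). The statistical “law” is wired into the universe by construction, which is exactly why the counting machine succeeds:

```python
import random

random.seed(0)

def universe(n_events):
    """One ball per 'second'. Out of each hundred balls: 10 blue, 20 green,
    30 red, 40 yellow - a statistical 'law' we build into the world."""
    colours = ["blue"] * 10 + ["green"] * 20 + ["red"] * 30 + ["yellow"] * 40
    return [random.choice(colours) for _ in range(n_events)]

def inductive_machine(events):
    """Count occurrences and conjecture a statistical 'law' (the observed
    frequencies) - the kind of law the machine was adjusted for in advance."""
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
    return {colour: count / len(events) for colour, count in counts.items()}

law = inductive_machine(universe(100_000))
print(law)  # close to {'blue': 0.1, 'green': 0.2, 'red': 0.3, 'yellow': 0.4}
```

The machine “learns from experience” only because we rigged the experience; it has no way to find out, by induction, that the rigging holds.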

Why is the problem of induction so difficult to understand?

We now see why our inductive machine is so simple, compared with almost all attempts to construct an inductive logic. The reason is this. We have not tried to establish the validity of our machine by any inductive logic, or to establish a link between the machine and probability logic. We have not done so because it cannot be done. An inductive logic is precisely an attempt to establish the impossible. It is this which makes books on inductive logic so complicated and loosely reasoned, and which is responsible for the introduction of endless and complicated substitutes for simple argument.

Realism and the Aim of Science, p. 324, (emph. mine).

Comparison of Methods

In Section 3.2 D&B propose a dichotomy between “Baconian induction” vs “statistical induction”, summarized in Table 1.

| Baconian Induction | Statistical Induction |
| --- | --- |
| Logically reasons from specific statements to universal statements | Based on frequentism |
| Obtains certainty, or at least ‘justification’ for knowledge | Offers no certainty nor justification |
| Is the basis for scientific explanatory hypotheses | Is not the basis for scientific explanatory hypotheses. Instead it creates useful heuristics |
| Doesn’t exist | Exists |

I found this table a bit confusing. When we say “machine learning is induction”, what we mean is that we give the machine a dataset (“observations”), ask it to find some patterns, give it another dataset, and ask it to make predictions (“generalization to the unseen”).

This is related to, but different from, statistical inference and statistical modelling, which in this context can be thought of as methods for extracting useful information from data, given a number of statistical assumptions which we know in reality to be false.

(For example: IF we are willing to assume the number of hospital visits in December next year will resemble the number of hospital visits in December in previous years, THEN we should staff this many nurses and doctors next year.)
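That IF/THEN can be written out in a few lines (every number below is hypothetical, chosen only to make the arithmetic concrete):

```python
import statistics

# Hypothetical hospital-visit counts for the last five Decembers.
past_december_visits = [4210, 4350, 4155, 4402, 4287]

mean = statistics.mean(past_december_visits)
sd = statistics.stdev(past_december_visits)

# IF next December resembles past Decembers, THEN plan for a
# high-but-plausible load, e.g. the mean plus two standard deviations.
planned_visits = mean + 2 * sd

# Hypothetical staffing ratio: one nurse per 150 monthly visits.
nurses_needed = round(planned_visits / 150)
print(round(planned_visits), nurses_needed)
```

The IF is doing all the epistemological work here: the arithmetic is sound, but nothing in it guarantees that next December will resemble the past ones.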

Viewed as an epistemology (which is about figuring out what is actually true) this is an obvious no-go, because we know the input-assumptions are false. However, viewed as a tool that can help us save lives, it’s not too bad. Machine learning and statistics are not epistemologies trying to ascertain true theories of reality - they are useful tools to help us find patterns in carefully curated datasets, in order for us to obtain actionable information, like how many nurses and doctors to staff next December.

I therefore propose the following update to Table 1:


| Baconian Method / Bayesian Epistemology | Scientific Method | Machine Learning / Statistical Inference |
| --- | --- | --- |
| Inductive | Deductive | Inductive |
| Probability | Content | Probability |
| Seeks certainty | Seeks truth | Seeks patterns |
| Theories cannot be shown to be true or false, only more or less probable | Theories cannot be shown to be true, but can be shown to be false | Is a tool, doesn’t theorize |
| A highly corroborated theory is one which is best supported by the evidence | A highly corroborated theory is one which has resisted repeated attempts at falsification | N/A |
| Knowledge consists of highly probable theories | Knowledge consists of bold theories with high empirical content | N/A |
| Knowledge is produced by induction/probabilistic induction | Knowledge is produced by conjecture and criticism | N/A |
| Refuted by Popper | Proposed by Popper | Predicted by Popper (see above) |

Modified from slides 45-46 here.



A Novel Metaphor-Based Metaheuristic

We now pivot to the second half of the paper.

Earlier, I mentioned that Bruce was interested in the question of reconciling Popper’s disproof of induction with the success of machine learning. This leads him to propose, in Section 2.3, a novel optimization algorithm inspired by the work of Donald T. Campbell: the Universal Darwin algorithm (UD), which is claimed to generalize the entire textbook of Russell & Norvig.

UD is an evolutionary algorithm, part of a species of optimization algorithm known as “metaheuristics”, which have a bad reputation within the AI community for “vagueness, lack of conceptual elaboration, poor experiments, and ignorance of previous literature.”

Particularly prone is a class of evolutionary algorithms called “metaphor-based metaheuristics”, which are often inspired by naturally occurring phenomena. In 2012, Kenneth Sörensen, a researcher from the University of Antwerp, wrote a paper titled Metaheuristics - the metaphor exposed, describing the problem with these algorithms. He writes:

Imagine the following situation. On browsing your weekly copy of Nature, you stumble upon an article entitled “A novel food-based theory of particle physics.” In this paper, the authors claim to offer truly new insight into the working of the universe on its smallest scale.

The standard theory of particle physics, it is argued, has so far been rather unsuccessful in truly explaining the nature of the universe, like any of the other theories that precede it. By relating each force and each particle (called “ingredients” in the new theory) to its culinary equivalent, the author insists, a much more powerful explanation than all previous theories combined is exposed.

Quarks, for example, the particles of which protons and neutrons are composed, are called “meat,” whereas leptons (such as the electron) are “vegetables.” Combining “meat” with “vegetables,” the new theory suggests, evidently gives rise to a “dish” (an atom). Photons, the particles that transfer the electromagnetic force, the paper claims, can be best understood in terms of the “taste” that a dish produces.

Similarly, bosons, fermions, gluons, and all other elementary particles are related to some cookbook terms. The infamous Higgs boson that gives other particles their mass is, of course, the “salt” particle in the new theory, which gives other “dishes” their “flavor.”

Sörensen continues:

There is not much doubt that the reaction of the scientific community over this article would be one of ridicule, if not outrage. In letters to the editor (who would probably be fired for allowing this article to be published), the question would rightfully be asked whether any new contribution was made other than a re-iteration of existing knowledge. The authors would be widely criticized for their attempt to change the vocabulary of the standard theory of particle physics, a tool that has been instrumental in allowing scientists across the world to communicate unambiguously. Steps would certainly be taken for this type of “research” never to be published again in a reputable journal.

The above story may strike the reader as very unlikely, yet it is not as far-fetched as it may seem at first sight. On the contrary, in the research field of optimization using metaheuristics, contributions akin to the hypothetical paper described above are frighteningly common. For a few decades, every year has seen the publication of several papers claiming to present a “novel” method for optimization, based on a metaphor of a process that is often seemingly completely unrelated to optimization.

The jumps of frogs, the refraction of light, the flowing of water to the sea, an orchestra playing, sperm cells moving to fertilize an egg, the spiraling movements of galaxies, the colonizing behavior of empires, the behavior of bats, birds, ants, bees, flies, and virtually every other species of insects – it seems that there is not a single natural or man-made process that cannot be used as a metaphor for yet another “novel” optimization method. Invariably, the authors of such papers promise that their “new” method is superior to methods previously published in the literature.

And with the Universal Darwin algorithm, we can now add “Popperian Epistemology” to the list of naturally occurring phenomena which have inspired metaphor-based metaheuristics:

The final Universal Darwin algorithm is therefore simply ‘variation and selection’ though to put a finer point on it we’ll summarize this meta-algorithm as:

  1. Start with a problem
  2. Conjecture a solution(s) to the problem (i.e. create variants)
  3. Measure how well the proposed solution(s) solve the problem
  4. Over time, retain the better solution(s)
  5. Repeat from step 2 until the problem is sufficiently solved or an optima is reached

(emph. mine)

The problem, of course, is that “Start with a problem” isn’t exactly something you can program in Python.

Optimization is a core component of machine learning, and has a long and rich mathematical history tracing back to Fermat, Lagrange, Newton and Gauss. Developing a novel optimization algorithm requires sophisticated mathematical techniques from a range of different areas, including at a minimum probability theory, linear algebra, and multivariate calculus. Combinatorial optimization is arguably even more challenging, as it deals with discrete parameters, meaning one loses access to the gradient information that can be used to direct one towards a local or global optimum.

We can tell UD is a metaphor and not an algorithm by running a simple mental test: If implemented, would the implementation be the same or different than any of the algorithms it claims to generalize? If same, metaphor. Else, novel generalization.
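Running that test on UD is instructive. Here are its five steps instantiated for a toy problem (all the concrete choices - the objective, the integer representation, the ±1 mutation - are mine, because the meta-algorithm doesn’t supply them):

```python
import random

random.seed(1)

# 1. Start with a problem: maximize f over the integers in [-100, 100].
f = lambda x: -(x - 37) ** 2

solution = random.randint(-100, 100)
for _ in range(10_000):
    variant = solution + random.choice([-1, 1])  # 2. Conjecture a variant
    if f(variant) > f(solution):                 # 3. Measure how well it does
        solution = variant                       # 4. Retain the better solution
    if f(solution) == 0:                         # 5. Repeat until solved
        break

print(solution)  # 37
```

Once the problem-specific blanks are filled in, what remains is ordinary stochastic hill climbing - i.e. one of the very algorithms UD claims to generalize. Same implementation, so by the test above: metaphor.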

Developing new optimization algorithms is challenging, and requires a deep working familiarity with existing literature and existing techniques. The reason why we fell out of love with algorithms inspired by nature is that it’s much easier to develop novel metaphors than novel algorithms. “Collect some criticism” or “start with a problem” is excellent research advice for humans, who already have the ability to recognize and solve novel problems, but as instructions to be programmed into a computer, it’s unfortunately too vague to be of much use.

This is why I view the Universal Darwin algorithm as a non-solution to the non-problem of reconciling Popper and machine learning - it’s “Universal” because it uses language so expansive that every single algorithm in Russell & Norvig can be described as falling under it:

A survey of Russell & Norvig, the most popular introductory text on AI, finds that many existing AI algorithms already utilize universal Darwinism as the basis for how they work. For example, all of the following popular AI algorithms are really evolutionary algorithms of variation and selection and thus fall under the Universal Darwin algorithm:

  1. Problem Solving by Searching Example: A-Star works by trying out every possible variant (often guided by an admissible heuristic) until the shortest path is found.

  2. Non-Optimal Search Algorithms Example: hill climbing, gradient descent, simulated annealing, and genetic algorithms all utilize variants and selection of the best variants.

  3. Adversarial Search Example: Minimax algorithms search every possible move utilizing a heuristic such as a board evaluation algorithm as a proxy for the best board position.

  4. Constraint Satisfaction Problems Example: Many CSP algorithms try out each possible combination to find an acceptable solution.

  5. Logic and Planning Problems Example: The DPLL algorithm does recursive depth-first search enumerating possible models.
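
The survey’s point is easy to check mechanically: almost any iterative search really can be phrased as “generate variants, keep the best”. Here is a minimal hill-climbing sketch in that vocabulary (the objective and neighborhood function are toy choices of mine, not from the paper):

```python
import random

def hill_climb(score, start, neighbors, iters=1000, seed=0):
    """Generic 'variation and selection' loop."""
    rng = random.Random(seed)
    best = start
    for _ in range(iters):
        candidate = rng.choice(neighbors(best))  # variation: propose a variant
        if score(candidate) >= score(best):      # selection: keep the better one
            best = candidate
    return best

# Toy example: maximize -(x - 3)^2 over the integers
result = hill_climb(lambda x: -(x - 3) ** 2, start=0,
                    neighbors=lambda x: [x - 1, x + 1])
print(result)  # → 3
```

Note that the loop says nothing about *how* variants are proposed or scored, which is where all of the actual algorithmic content lives.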

Of course, Bruce knows that a theory which explains everything explains nothing, and he offers the naive Bayes classifier as an example of an algorithm which wouldn’t count as Universal Darwinism. But in equation 4 of IPML they have mistaken the probabilistic model for the classifier itself. The actual Naive Bayes classifier algorithm is12

\[\hat{y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} \mid C_{k}\right)\]

for class labels $\{C_k\}_{k=1}^K$ and feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$. The argmax implies a for-loop, so the naive Bayes classifier can be written in the style of the Universal Darwin algorithm as follows:

  1. Start with a problem: Given a feature vector $\mathbf{x}$, classify it into one of $K$ classes.
  2. Conjecture solutions to the problem: For $k \in \{1, \ldots, K\}$…
  3. Measure how well the proposed solution(s) solve the problem: …Compute the joint distribution $p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} \mid C_{k}\right)$
  4. Over time, retain the better solution(s): Save whichever $k$ has the maximum value of the joint computed in step 3.
  5. Repeat from step 2 until the problem is sufficiently solved or an optima is reached: The problem is sufficiently solved; return $C_k$.
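
The five steps above can be run literally. A minimal sketch of naive Bayes written as that explicit conjecture-and-select loop (the spam/ham numbers are a made-up toy example, not from IPML):

```python
import math

def naive_bayes_predict(x, priors, likelihoods):
    """priors: {class: p(C_k)}; likelihoods: {class: [per-feature {value: p(x_i|C_k)}]}."""
    best_k, best_score = None, -math.inf
    for k in priors:  # step 2: "conjecture" each class label in turn
        # step 3: score the conjecture via log p(C_k) + sum_i log p(x_i | C_k)
        score = math.log(priors[k]) + sum(
            math.log(likelihoods[k][i][xi]) for i, xi in enumerate(x)
        )
        if score > best_score:  # step 4: retain the better "solution"
            best_k, best_score = k, score
    return best_k  # step 5: problem sufficiently solved

# Hypothetical two-class, two-binary-feature example
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham":  [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}],
}
print(naive_bayes_predict([1, 0], priors, likelihoods))  # → spam
```

The for-loop over $k$ is the entire “evolutionary” content: one pass of exhaustive variation, with a running argmax as selection.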

Something, something, “easy to vary”, cough, cough.

~

What is Bruce’s goal here though? He wants to be able to make the following claim:

\[\begin{aligned} x =~&\text{The Universal Darwin algorithm creates knowledge}\\ y =~&\text{Most existing AI algorithms are the Universal Darwin Algorithm}\\ z =~&\text{Most existing AI algorithms create knowledge.}\\ \end{aligned}\]

While it is true that \((x \wedge y) \vdash z\), this is only because he has defined knowledge creation circularly to mean “the output of the Universal Darwin Algorithm”:

Universal Darwinism works for the very simple reason that if you compare two (or more) variants as possible solutions to a problem and keep the better solutions – while discarding the worse ones – ultimately you will select and retain the best variants, at least until a maxima is reached. Whether or not it is a global or local maxima depends on the specifics of the problem and the chosen algorithm. For our purposes of this paper, we will refer to this process of improvement of variants as knowledge-creation.

Now we’re in the situation where the output of any algorithm at all counts as “knowledge-creation” - even purely inductive machine learning methods, which Popper showed could never produce knowledge.

I think it’s easy for people who have read David Deutsch to dismiss all AI researchers as being out to lunch because said researchers aren’t familiar with Popper. But there’s a reason we tend not to invoke Popperian concepts when developing new algorithms - it’s because we want these algorithms to actually work. Which means they have to be implementable on a computer (preferably in Python and available open source on GitHub), and also means they have to be able to solve some novel real-world problem that existing algorithms couldn’t solve, like detecting handwritten digits, imaging distant black holes, predicting how proteins will fold, or creating stable multi-sided marketplaces like Uber Eats. The Universal Darwin Algorithm unfortunately falls short on both counts.

Addenda

Bruce and Dan were kind enough to provide initial reactions to an early draft of this post, and have asked for their responses to be included below.

Reactions from Bruce

I like the article. I think it contains one big misunderstanding of my position – though that is perhaps my own fault for not explaining it well in the limited space of the paper. But let me make my actual view clear to you here.

I completely agree with you that saying “Popper disproved induction” is a misleading statement at best. It’s only “true” if we start with the assumption that the term “induction” is not merely about a conjecture (as you correctly point out) but also includes justification. And in fact (as you also point out!) ‘induction’ *was* defined this way in the minds of most at the time of Popper and – more to the point – continues to be so in the minds of many today.

Thus when I initially speak of ‘induction’ and say things like ‘does machine learning challenge Popper?’ I am writing with the knowledge that many in the audience (apparently not yourself) have yet to realize that induction doesn’t have to contain the concept of justification. I’m setting up a problem that they can’t solve without accepting that the word induction might also sometimes point to a form that includes no justification and that machine learning is this form of induction. (Or at least in part. I’ll explain more in a more detailed response.) This is what I had in mind with the idea of ‘statistical induction’ being different from ‘Baconian induction.’ But my answer to the question is “no, it does not pose a problem for Popper.”

So I feel you are accidentally misrepresenting my position when you say “I claim that a misunderstanding about TPoI has led Bruce to see a problem where none exists.” In fact, there clearly is a problem and it is exactly what you yourself go on to explain – that people dogmatically attach justification to the concept of induction even today. I can easily point to examples of this problem both in social media conversations but also coming from luminaries like David Deutsch. I’ve read multiple papers that make some form of this mistake. So this is a real enough problem.

Your point is that it’s a linguistics problem rather than a real problem and I am actually agreeing with you on that. So on this point, I think we actually agree but are going about it differently. You want to rehabilitate ‘induction’ to drop the implication of justification and I am just creating two different terms instead: Baconian vs. statistical induction. But conceptually this seems like we share a common goal and may even agree on this point.

Reactions from Dan

I agree 100% with the criticisms you’ve expressed and I’m extremely appreciative that you wrote this. Admittedly this paper was rushed to the press. We will be going back to improve it. The intention was not to say that inductive methods don’t exist. The paper was mainly written by Bruce, but one of the things I added was a plug for Gelman’s synthesis - that Bayesian methods exist and are useful, but ultimately (for any computationally bounded system) must be contained within a different process that operates at a higher level. I believe something similar holds for deep learning methods, which can sometimes be seen as approximations of Bayesian methods and in the very least are inductive in the sense of fitting patterns in big data.

My interest in this line of work stems from a dissatisfaction with the agents foundation framework that is currently dominant in the AI safety world which views all intelligent systems as approximations of Hutter’s idealized AIXI agent. I agree the framework of Universal Darwinism we discuss is currently too “meta” to say anything interesting, but I view it as a step towards a foundation for understanding intelligent agents which synthesizes both Popper and Bayes.

Footnotes

  1. The real idea of induction is this. We obtain (by observation, according to the usual belief) a number of statements such as “This swan is white” or “This raven is black”. Having not found any counterexample within a finite spatio-temporal region - say Austria in 1796 - we can sum up these statements by what can be called the evidence e, for example: “All swans in Austria in 1796 are white” (or perhaps “All observed swans in Austria in 1796 are white”). We then generalize the evidence e, beyond the original spatio-temporal region (of observation). The evidence statement e can be described as the conjunction of the statements accepted as established (by observation). The statement that generalizes (extends) our knowledge beyond the region of established statements is the generalization g. For example, g may be “All swans in Europe during the 18th century are white”. Accordingly, the generalization always entails the evidence; and so, for any g that is a generalization of e, we have $g \vdash e$.

    The Non-Existence of Probabilistic Inductive Support, pp. 311-312

  2. I prefer the turnstile symbol over the standard notation because our eyes are used to reading statements from left to right, and because it emphasizes that we should think about logical operations as we do mathematical operations. So while mathematical operations use the equality symbol “$=$” and operate on numbers, logical operations use the turnstile symbol “$\vdash$” and operate on statements expressed in a natural language, like English.

  3. The dictionary definition of “turnstile” is “a mechanical gate consisting of revolving horizontal arms fixed to a vertical post, allowing only one person at a time to pass through”, which is great intuition to have when thinking about moving truth from place to place. And the word infer means to “carry forward”.

  4. If we consider the example

    \[\begin{aligned} &x =\text{The show starts at 7:00pm}\\ &y =\text{It is 7:15pm}\\ &z =\text{We are late for the show} \end{aligned}\]

    then \((x \wedge y) \vdash z\), and we see that logical entailment lets us extract useful secondary information from initial propositions, namely that if \((x \wedge y)\) is true, then \(z\) is true, and we should hurry. 

  5. i.e. $E \nvdash G$.

  6. I’m assuming you’re reading this because you’ve heard of the problem of induction from elsewhere, but if not, Ben and I discuss the “why should you care about any of this at all” question here.

  7. Well Hume actually was the one to realize it, but let’s not complicate the storyline. 

  8. This warrants further clarification. In English, \(\neg e_{i}\) says “Not all observed swans in Austria in year $i$ were white” - i.e. at least one observed swan was not white. Since the general law $G$ entails each $e_{i}$, observing \(\neg e_{i}\) logically entails \(\neg G\), and the general law $G$ is considered falsified. So while no amount of observed white swans can tell us the truth of the general law $G$, it only takes a single observed non-white swan \(\neg e_{i}\) to tell us that general law $G$ is false.

  9. I.e. equation (1), despite the fact that it is logically invalid.

  10. It was Hume who pointed out that equation (1) is, in fact, logically invalid.

  11. As a way to get around the fact that equation (1) is logically invalid.

  12. For an overview of Naive Bayes, see, for instance, here or here.