An exchange with Bruce Nielson
Because of my fight against induction I have sometimes been described as an irrationalist. But if it is true that there is no induction, then it is irrational to continue to have faith in induction.
…
Who is more rational? The believer in generalizing induction, or the critic who tries hard to anticipate, and to eliminate, the mistakes that may lurk in his conjectures - and who warns the scientist that his admirable and growing scientific knowledge remains uncertain conjectural knowledge? Is it irrational to teach the distinction between truth and certainty? To say that we ought to search for truth, which we may discover, though we may never be certain of it?
- K.P. The Non-Existence of Probabilistic Inductive Support, p. 316
Bruce Nielson and Daniel C. Elton have recently uploaded a paper to the arXiv titled Induction, Popper, and Machine Learning (hereafter abbreviated IPML), which just so happen to be three subjects I'm intensely interested in. Bruce and I thought it would be fun to have a friendly "adversarial collaboration", wherein I'll critique his paper on my blog, and he'll respond to my critique on his.
For readers unfamiliar with Bruce Nielson, he hosts The Four Strands blog and The Theory of Anything podcast, both of which are great. In particular, check out his piece on determinism, which points out the all-important distinction between determinism and predictability (something even Popper missed), and his masterfully argued piece on animal suffering.
Meanwhile, Dan Elton is doing important work on FDA reform over on his More is Different blog (among other interests). I particularly enjoyed his piece titled A list of possible FDA reforms, where he provides a list of practical and actionable improvements to the FDA drug approval process (which pairs nicely with Scott Alexander's piece Adumbrations Of Aducanumab).
And with that, let the critiques begin!
There seems to be a significant amount of confusion in the Twitter-verse about what exactly we mean when we speak of "The Problem of Induction" (TPoI), so let's start from the beginning.
The clearest formulation of TPoI I've seen comes from a single paragraph buried in a not-very-well-known paper Popper and Miller wrote in 1983, titled The Non-Existence of Probabilistic Inductive Support. I've included the paragraph itself in a footnote^{1} below, and will rewrite it here with added exposition and updated notation. (Note that I'm lifting some text from my original rewrite of Popper and Miller's proof, available here).
We first introduce the concept of logical entailment. Given two statements $x, y$, we write
\[\begin{aligned} x &\vdash y \end{aligned}\]if $x$ entails $y$. For example, if
\[\begin{aligned} &x =\text{All men are mortal}\\ &y =\text{Socrates is a man}\\ &z =\text{Socrates is mortal} \end{aligned}\]we can write $(x \wedge y) \vdash z$, which can be read "$x$ and $y$, therefore $z$". Another example is equation 2 in IPML
\[\begin{aligned} &S 1 \rightarrow White\\ &\neg White\\ &\therefore \neg S 1 \end{aligned}\]which can be rewritten^{2} using the turnstile symbol $\vdash$ as:
\[\begin{aligned} &x =\text{S1 is white}\\ &y =\text{That bird is white}\\ &z =\text{That bird is S1}, \end{aligned}\]where
\[(x \wedge \lnot y) \vdash (\lnot z).\]This can be read "$x$ and not $y$, therefore not $z$".
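Entailment in this propositional setting can be checked mechanically. Below is a small Python sketch (my own construction, not code from IPML) that verifies the modus tollens pattern above by brute force: a premise entails a conclusion exactly when no truth assignment makes the premise true while the conclusion is false. Here the universal statement "S1 is white" is encoded as the implication $z \rightarrow y$ (if that bird is S1, then that bird is white).

```python
from itertools import product

def entails(premise, conclusion, n_vars):
    """Semantic entailment by brute force: premise |- conclusion iff
    no truth assignment makes the premise true and the conclusion false."""
    return all(conclusion(*v)
               for v in product([False, True], repeat=n_vars)
               if premise(*v))

# Encode x = "S1 is white" as the implication z -> y.
x_and_not_y = lambda y, z: ((not z) or y) and (not y)   # (z -> y) and not y
not_z       = lambda y, z: not z

assert entails(x_and_not_y, not_z, n_vars=2)                  # (x ∧ ¬y) ⊢ ¬z
assert not entails(lambda y, z: y, lambda y, z: z, n_vars=2)  # y ⊬ z
```

The second assertion shows the asymmetry that matters for everything below: a specific statement on its own never entails another specific statement, let alone a universal one.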
But what does "entailment" really mean? I like to think of it as an operation which allows us to "move truth around", from one sentence over to the next. If $x$ "entails" $y$, what we're saying is, if we're willing to assume sentences on the left of the turnstile are true (which we will represent with the color green)
\[\begin{aligned} \textcolor{green}{(x \wedge y)}\vdash z, \end{aligned}\]then we automatically know sentences on the right of the turnstile^{3} are true,
\[\textcolor{green}{(x \wedge y)\vdash z} .\]In this way, we have moved the property "is true" from $(x \wedge y)$ over to $z$^{4}.
Synonyms or near-synonyms for "logical entailment" you're likely to encounter are: "logical consequence", "derived", "deduced", "follows from", "logically valid", "necessarily true", "necessary truth", "thus", "logical conclusion", "therefore", "hence", etc. (And when Popper uses the word "logical", this is what he has in mind - similarly with Deutsch and "derive".)
Now consider an example where we don't have logical entailment:
\[\begin{aligned} x =~&\text{All men are mortal} \\ y =~&\text{Socrates is a man} \\ z =~&\text{Water is comprised of two parts hydrogen}\\ ~&\text{ and one part oxygen.} \end{aligned}\]True as \(z\) may be, it isn't entailed by \((x \wedge y)\), so assuming the truth value of $x$ and $y$ won't give us any additional insight into the truth value of $z$ - we have to get that from elsewhere:
\[\textcolor{green}{(x \wedge y)} \nvdash z .\]Once we have the turnstile symbol in hand, The Problem Of Induction becomes trivial. We have a "general law" statement $G$, which is assumed to hold across time and space, and we have a set of observed evidence statements $e_1, e_2, \ldots, e_N$, which are assumed to be true in a particular spatio-temporal region.
We refer to the conjunction of the observed evidence as $E = (e_1 \wedge e_2 \wedge \ldots \wedge e_N)$. By definition, the general law $G$ logically entails each of the evidence statements
\[\begin{aligned} &G \vdash e_i &\forall e_i \in E. \end{aligned}\]For example, consider evidence statements
\[\begin{aligned} &e_1 = \text{All observed swans in Austria in 1796 are white}\\ &e_2 = \text{All observed swans in Austria in 1797 are white}\\ &... \end{aligned}\]And consider the generalization
\[\begin{aligned} &G=\text{All swans are white.} \end{aligned}\]Then $G \vdash E$ is logically valid, and referred to as deductive inference.
(That is, truth can flow "downhill", from the general $G$ to the specific $E$.)
However, $E \vdash G$ is logically invalid, and referred to as inductive inference.^{5}
(That is, truth cannot flow "uphill", from the specific $E$ to general $G$.)
This is the classical Problem of Induction.^{6}
The problem of induction states that no matter how many evidence statements $e_i$ we have, truth can never flow from the specific $E$ to general $G$. However, if you see a bunch of white swans, the problem of induction does not prohibit you from:
These are all completely fine. Whatâs wrong with them?
For example, if I say "Susan is going to be late coming into work today, she has been late in the past", this is a perfectly reasonable argument to make. Sure, it isn't the case that "Susan will be late today" logically follows from "Susan has been late in the past", but who is claiming it needs to? We rarely have, and don't require, logical entailment between all clauses of an argument, because that would be completely stifling, and would make conversation impossible. And if we did require it, it would no longer be called "an argument", it would be called "a proof".
(Note this clears up the whole "you can't get an ought from an is" thing too - by "get", Hume meant logical entailment. Nothing prevents us from conjecturing an "ought" from an "is".)
With the setup out of the way, we can now turn to the paper. I claim that a misunderstanding about TPoI has led Bruce to see a problem where none exists. (VM: Or perhaps not, see below for a response from Bruce.)
The problem he's interested in is: Machine learning is induction, but machine learning works really well. But Popper disproved induction. So did Popper disprove machine learning, or does machine learning disprove Popper?
Popper claimed to have refuted the idea that induction provides a foundation for knowledge… Machine learning is also taught as being rooted in induction. Given the success of machine learning, does this mean Popper was wrong that induction is a refuted theory?
Induction, Popper, and Machine Learning, p.1 (emph. added, here and throughout)
In Section 2 they describe the problem of induction:
However, deductive logic does not allow us to generalize from a specific statement like this. The fact that [the first swan] is white does not allow us to assume that [the next swan] will also be colored white.
But what if we see hundreds or even thousands of swans and all of them are white? Is there some point at which we can rightly assume that we can now logically reason that all swans are white? In other words, is it valid to reason
\[\begin{aligned} x & = \text{Observed swans 1-1000 are white} \\ y & = \text{All swans are white}\\ x & \vdash y~?\\ \end{aligned}\]The supposed ability to reason from specific statements to universal statements is the method of induction.
Induction, Popper, and Machine Learning, p.1-2 (notation changed for consistency with above.)
Do you see what happened there? Logical entailment has gotten mixed up with reasoning itself. The mistake persists into the next paragraph:
However, Hume pointed out that no matter how many specific statements we observe, we are never justified in reasoning to a universal statement.[7] And, in fact, the discovery of actual black swans showed that this was the case. This is because it is logically invalid to ever reason from a specific statement to a universal statement.
Induction, Popper, and Machine Learning, p.2
Logical entailment and reasoning are not the same thing. The problem of induction does not prohibit us from reasoning - it's simply a statement about the limitations of logical entailment. From wiki:
Reason and logic can however be thought of as distinct, although logic is one important aspect of reason. Author Douglas Hofstadter, in Gödel, Escher, Bach, characterizes the distinction in this way: Logic is done inside a system while reason is done outside the system by such methods as skipping steps, working backward, drawing diagrams, looking at examples, or seeing what happens if you change the rules of the system.
The problem, I think, is the use of the word "justified". When Hume spoke of "justification", he meant deductive logic - i.e. logical entailment. And when he asked if such-and-such is "justified", like going from an "is" to an "ought", or the specific to the general, what he meant was: "Is there any way to have factual statements logically entail moral statements" or "is there any way to have evidence statements logically entail general law statements"?
(Answer: No, but that's okay, because we can conjecture instead).
But over time this has been forgotten, and people fill in their own meaning of the word "justified", and conflate logical entailment with assuming, reasoning, and all sorts of other concepts which aren't logical entailment. Of course we're justified in reasoning towards general laws - how else are we supposed to get them?
How can machine learning's inductive roots be squared with Popper's refutation of Induction?
Induction, Popper, and Machine Learning, p.4
You'll often hear that "Popper refuted induction" or "Popper solved the problem of induction" or that "Popper showed induction is false" or "Popper showed induction is a myth" or a hundred other variants of this claim. While all sorta-true, phrasing like this is liable to conflate concepts like logical entailment, reasoning, and justifiable assumptions into a big muddy mess, as the emphasized sentences in the above quotation highlight. Forget that all-important adjective "logical", and you're just a few steps away from thinking Popper disproved reasoning itself.
To avoid confusion, I suggest a better way to think about it is: Popper realized there was no problem of induction, and that the conjectural-deductive approach to knowledge was just fine. Guess a general law $G$, deduce the logical consequences $E$, and check to see if they are falsified by experiment, argument, or proof.
In symbols, Popper realized^{7} that while it is impossible for truth to flow from specific to general,
\[\begin{aligned} &\left(... e_{i-1} \wedge e_{i} \wedge e_{i+1} ... \right) \nvdash G, & &~~~~~(1)\\ \end{aligned}\]what is possible is for falsity to flow from specific to general:
\[\begin{aligned} &\left(... e_{i-1} \wedge \neg e_{i} \wedge e_{i+1} ... \right) \vdash \neg G, & &(2)\\ \end{aligned}\]and that this is the basis for all science. (See footnote here^{8} for a plain English translation of these equations.)
Rather than thinking of Popper as the person who "refuted induction", it's a bit more accurate to think of him as the person who argued against the dominant cultural orthodoxy of his time. This was an orthodoxy called Empiricism, the theory that all knowledge is derived - note the word, cf. above - from the senses.
"Empiricism", read Bertrand Russell to the Aristotelian Society at 55, Russell Square, London, W.C.1, on April 6th, 1936, at 8 p.m.:
is "the theory that all knowledge is derived from sense experience." Accepting this definition, three questions arise before we can discuss whether empiricism is true or false. We must ask what is meant by "knowledge," what by "derived from," and what by "sense experience."
Russell, The Limits of Empiricism, p. 1 (emph. added)
Having been present in the audience that day, In Realism and the Aim of Science, Popper recounts the rest of the story (see footnotes for added commentary):
Believing that our empirical knowledge was obtained by induction^{9}, and deeply impressed by the force of Hume's criticism^{10}, Russell suggested that we had to assume some principle of induction,^{11} which could not be based upon induction in its turn.
Realism and the Aim of Science, p. 12, (cont. below, emph. added throughout)
If we were to be unkind for pedagogical purposes, we might say that Russell was trying to use his "principle of induction" to weasel his way out of what Hume showed to be logically impossible. Popper, unimpressed, voiced his concerns, and was politely laughed at by all:
Having been invited to participate in the discussion, I said that I did not believe in induction at all… This statement (which I formulated as pointedly as I could manage with the little English at my disposal) was taken for a joke by the audience, who graciously laughed and clapped. I then suggested that the whole trouble was due to the mistaken belief that scientific knowledge was an especially strict or certain or august kind of knowledge. This statement met with the same reception as the first.
The "Problem of induction" is simply that everyone at the time was sure that knowledge (i.e. absolutely true and certain general law \(G\)) was derived (i.e. logically entailed) from the senses (i.e. observed evidence statements \(E\)), but Hume showed this was logically impossible, and that was the problem.
This would be like if everyone believed in miracles, but then David Hume came around and showed that miracles were logically impossible, which created "The Problem of Miracles", which Russell and a bunch of other philosophers tried to get out of by assuming "A Principle of Miracles", which Popper argued against for his entire career, and became known as "the guy who refuted miracles".
We should also remember that what Popper was challenging wasn't some esoteric philosophy held by only a few fancypants philosophers - it was the very dictionary definition of the word "know", used in implied agreement by every English speaker on the planet:
I concluded with an attempt to explain that, in the usual sense of "know", whenever I know that it is raining, it must be true that it is raining; for if it is not true, then I simply cannot know that it is raining, however sincerely I may believe that I know it. In this sense of the word, "knowledge" always means "true and certain knowledge"… But, I said, there was no such thing as scientific knowledge in this sense.
The audience was confused, baffled. What sort of priggish snob could say such an awful thing?
My little speech was well received, which was gratifying, but for the wrong reasons, which was puzzling. For it had been taken as an attack upon science, and perhaps even as an expression of a somewhat superior attitude towards it.
Admittedly, I had attacked, by implication, Science with a capital "S", and those of its devotees, who were ready to take its pronouncements as gospel truth. … Thus the dismissal of Science with a capital "S" although implicit in what I had said, had not been my main point at all.
Rather, what I had hoped to convey to the audience was this: if we assume that what is called "scientific knowledge" consists only of guesses or conjectures, then this assumption is sufficient for solving the problem of induction. …
This, my main point, was lost.
What Popper did wasn't "refute induction", what he did was demolish in writing every single counterargument against his position he could find, and then some extra ones, just to be safe - and in so doing, birthed the scientific method as we know it today.
(Counterarguments like: "Have you considered using the probability calculus!" or "But we actually care about certainty, not truth!" or "Maybe the probability calculus will give us a justification!" or "Yes but have you considered assuming the future will be like the past!" or "How about we all just try using the probability calculus again!" or … and then he died, of exhaustion).
So, in summary:
Not induction - Hume did that - but all the various attempts to get around Hume's refutation.
Because he took on a dogma, and became known as the guy who "refuted" induction. Karl Popper "refuted" induction like Christopher Hitchens "refuted" religion.
No, and in Realism, Popper explains this himself. Writing of an "inductive machine", a "simplified universe", and "individual events with certain properties" (which we would now call "a computer", "a dataset", and "a feature vector"), he says:
13. An Inductive Machine.
In spite of my objections to the inductive theory of learning, I believe in the possibility of inductive machines - of a certain kind. In order to see the possibility of an inductive machine, we consider a simplified universe which the machine will have to investigate. The universe may consist, for example, of "individuals" or "individual events", with a certain limited number of properties. We choose for our individuals balls of one inch diameter. An individual event may consist in the appearance of a ball at the end of a pipe; it then rolls down a groove and disappears into a hole. There is to be such an individual event every second.
The balls may have certain properties. They may be of steel or of brass, and they may be painted in a number of different colours. We now assume that our artificial universe may be operated in accordance with a number of "natural laws". For example, we may so adjust it that steel balls always come in sequences of three, and after these a longish but varying sequence of copper balls; or we may so adjust it that, out of each hundred steel balls, 10 balls are blue, 20 green, 30 red, and 40 yellow: this would be a statistical "law". Other "laws" can be arranged.
âŠ
The induction machine will have to be equipped with a detector (perhaps a magnetic needle) which allows it to distinguish between steel and copper balls, and another detector for colours. Moreover, it will have to be equipped with counting devices. It will have to be able to draw up statistics of the various distinguishable occurrences, and to calculate averages. If it is further adjusted to the kind of law (not to the actual laws) which the universe may exhibit - laws of succession, general or conditional frequencies of a certain stability, etc. - then it may be so constructed as to be quite efficient, for example, in formulating hypotheses and in testing and eliminating them. Thus it may detect simple regularities in the universe without any difficulty: it may thus learn from experience.
Realism and the Aim of Science, p. 320, (emph. both his and mine, see original)
Then comes the hammer blow:
If a machine can learn from experience in this way, that is, by applying the simple inductive rule, is it not obvious that we can do the same? Of course I never said that we cannot learn from experience. Nor did I ever say that we cannot successfully use the simple inductive rule - if the objective conditions are appropriate. But I do assert that we cannot find out by induction whether the objective conditions are appropriate for using the simple inductive rule.
If our machine is successful in its stage (i), then it is so because we have constructed our "world" in such a way that the simple inductive rule can be successfully applied to it.
Realism and the Aim of Science, p. 322, (ditto)
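Popper's toy universe is easy to simulate. The sketch below is my own construction (not Popper's specification): a stream of balls is generated according to a hidden statistical "law", and a frequency-counting "machine" recovers it. The point of the quote survives intact in code: the machine succeeds only because the universe was built so that frequency counting is the right rule to apply.

```python
import random
from collections import Counter

random.seed(0)

# The hidden statistical "law" of the simplified universe:
# 10% blue, 20% green, 30% red, 40% yellow (per Popper's example).
LAW = {"blue": 0.10, "green": 0.20, "red": 0.30, "yellow": 0.40}

def universe(n):
    """Emit n 'individual events' (coloured balls) according to LAW."""
    colours, weights = zip(*LAW.items())
    return random.choices(colours, weights=weights, k=n)

def inductive_machine(stream):
    """The machine is pre-adjusted to the KIND of law (stable colour
    frequencies): it simply counts and computes relative frequencies."""
    counts = Counter(stream)
    total = len(stream)
    return {colour: counts[colour] / total for colour in counts}

estimate = inductive_machine(universe(10_000))
# The machine "learns from experience" - but only because we built
# the universe so that the simple inductive rule applies to it.
for colour, p in LAW.items():
    assert abs(estimate[colour] - p) < 0.05
```

If the hidden law were instead, say, a frequency that drifts over time, the same counting machine would converge on nothing useful, and nothing inside the machine could tell it so - which is exactly Popper's point.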
So the reason why machine learning doesn't conflict with Popper/Hume's refutation of induction is that the datasets have been specifically chosen such that induction in this limited setting is possible. This is how we square machine learning's inductive roots with Popper's refutation of induction. One last question:
We now see why our inductive machine is so simple, compared with almost all attempts to construct an inductive logic. The reason is this. We have not tried to establish the validity of our machine by any inductive logic, or to establish a link between the machine and probability logic. We have not done so because it cannot be done. An inductive logic is precisely an attempt to establish the impossible. It is this which makes books on inductive logic so complicated and loosely reasoned, and which is responsible for the introduction of endless and complicated substitutes for simple argument.
Realism and the Aim of Science, p. 324, (emph. mine).
In Section 3.2 D&B propose a dichotomy between "Baconian induction" vs. "statistical induction", summarized in Table 1.
| Baconian Induction | Statistical Induction |
| --- | --- |
| Logically reasons from specific statements to universal statements | Based on frequentism |
| Obtains certainty, or at least "justification" for knowledge | Offers no certainty nor justification |
| Is the basis for scientific explanatory hypotheses | Is not the basis for scientific explanatory hypotheses. Instead it creates useful heuristics |
| Doesn't exist | Exists |
I found this table a bit confusing. When we say "machine learning is induction", what we mean is that we give the machine a dataset ("observations"), ask it to find some patterns, give it another dataset, and ask it to make predictions ("generalization to the unseen").
This is related to, but different from, statistical inference and statistical modelling, which in this context can be thought of as methods for extracting useful information from data, given a number of statistical assumptions which we know in reality to be false.
(For example: IF we are willing to assume the number of hospital visits in December next year will resemble the number of hospital visits in December in previous years, THEN we should staff this many nurses and doctors next year.)
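As a concrete (and entirely made-up) illustration of that IF-THEN structure, here is the staffing calculation as code; all numbers and the capacity assumption are hypothetical:

```python
import math

# Hypothetical historical data: December hospital visits in past years.
past_december_visits = [4120, 4385, 4250, 4490]

# IF next December resembles past Decembers (a statistical assumption
# we know may fail), THEN staff for the historical mean plus a margin.
expected_visits = sum(past_december_visits) / len(past_december_visits)
visits_per_nurse = 150   # assumed capacity: visits one nurse can handle
safety_margin = 1.10     # assumed 10% buffer over the historical mean

nurses_needed = math.ceil(expected_visits * safety_margin / visits_per_nurse)
```

The output is actionable (a staffing number), but the IF is doing all the epistemic work: nothing in the calculation tells us whether next December will in fact resemble the past.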
Viewed as an epistemology (which is about figuring out what is actually true) this is an obvious no-go, because we know the input-assumptions are false. However, viewed as a tool that can help us save lives, it's not too bad. Machine learning and statistics are not epistemologies trying to ascertain true theories of reality - they are useful tools to help us find patterns in carefully curated datasets, in order for us to obtain actionable information, like how many nurses and doctors to staff next December.
I therefore propose the following update to Table 1:
| Baconian Method / Bayesian Epistemology | Scientific Method | Machine Learning / Statistical Inference |
| --- | --- | --- |
| Inductive | Deductive | Inductive |
| Probability | Content | Probability |
| Seeks certainty | Seeks truth | Seeks patterns |
| Theories cannot be shown to be true or false, only more or less probable | Theories cannot be shown to be true, but can be shown to be false | Is a tool, doesn't theorize |
| A highly corroborated theory is one which is best supported by the evidence | A highly corroborated theory is one which has resisted repeated attempts at falsification | N/A |
| Knowledge consists of highly probable theories | Knowledge consists of bold theories with high empirical content | N/A |
| Knowledge is produced by induction/probabilistic induction | Knowledge is produced by conjecture and criticism | N/A |
| Refuted by Popper | Proposed by Popper | Predicted by Popper (see above) |
Modified from slides 45-46 here.
We now pivot to the second half of the paper.
Earlier, I mentioned that Bruce was interested in the question of reconciling Popper's disproof of induction with the success of machine learning. This leads him to propose a novel optimization algorithm in Section 2.3, inspired by the work of Donald T. Campbell - The Universal Darwin algorithm (UD), which generalizes the entire textbook of Russell & Norvig.
UD is an evolutionary algorithm, part of a species of optimization algorithm known as "metaheuristics", which have a bad reputation within the AI community for "vagueness, lack of conceptual elaboration, poor experiments, and ignorance of previous literature."
Particularly prone are a type of evolutionary algorithm called "metaphor-based metaheuristics", which are often inspired by naturally occurring phenomena. In 2012, Kenneth Sörensen, a researcher from the University of Antwerp, wrote a paper titled Metaheuristics - the metaphor exposed, describing the problem with these "metaphor-based metaheuristic" algorithms. He writes:
Imagine the following situation. On browsing your weekly copy of Nature, you stumble upon an article entitled "A novel food-based theory of particle physics." In this paper, the authors claim to offer truly new insight into the working of the universe on its smallest scale.
The standard theory of particle physics, it is argued, has so far been rather unsuccessful in truly explaining the nature of the universe, like any of the other theories that precede it. By relating each force and each particle (called "ingredients" in the new theory) to its culinary equivalent, the author insists, a much more powerful explanation than all previous theories combined is exposed.
Quarks, for example, the particles of which protons and neutrons are composed, are called "meat," whereas leptons (such as the electron) are "vegetables." Combining "meat" with "vegetables," the new theory suggests, evidently gives rise to a "dish" (an atom). Photons, the particles that transfer the electromagnetic force, the paper claims, can be best understood in terms of the "taste" that a dish produces.
Similarly, bosons, fermions, gluons, and all other elementary particles are related to some cookbook terms. The infamous Higgs boson that gives other particles their mass is, of course, the "salt" particle in the new theory, which gives other "dishes" their "flavor."
Sörensen continues:
There is not much doubt that the reaction of the scientific community over this article would be one of ridicule, if not outrage. In letters to the editor (who would probably be fired for allowing this article to be published), the question would rightfully be asked whether any new contribution was made other than a re-iteration of existing knowledge. The authors would be widely criticized for their attempt to change the vocabulary of the standard theory of particle physics, a tool that has been instrumental in allowing scientists across the world to communicate unambiguously. Steps would certainly be taken for this type of "research" never to be published again in a reputable journal.
The above story may strike the reader as very unlikely, yet it is not as far-fetched as it may seem at first sight. On the contrary, in the research field of optimization using metaheuristics, contributions akin to the hypothetical paper described above are frighteningly common. For a few decades, every year has seen the publication of several papers claiming to present a "novel" method for optimization, based on a metaphor of a process that is often seemingly completely unrelated to optimization.
The jumps of frogs, the refraction of light, the flowing of water to the sea, an orchestra playing, sperm cells moving to fertilize an egg, the spiraling movements of galaxies, the colonizing behavior of empires, the behavior of bats, birds, ants, bees, flies, and virtually every other species of insects - it seems that there is not a single natural or man-made process that cannot be used as a metaphor for yet another "novel" optimization method. Invariably, the authors of such papers promise that their "new" method is superior to methods previously published in the literature.
And with the Universal Darwin algorithm, we can now add "Popperian Epistemology" to the list of naturally occurring phenomena which have inspired metaphor-based metaheuristics:
The final Universal Darwin algorithm is therefore simply "variation and selection" though to put a finer point on it we'll summarize this meta-algorithm as:
- Start with a problem
- Conjecture a solution(s) to the problem (i.e. create variants)
- Measure how well the proposed solution(s) solve the problem
- Over time, retain the better solution(s)
- Repeat from step 2 until the problem is sufficiently solved or an optima is reached
(emph. mine)
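Taken literally, these five steps describe a generic evolutionary loop. Here is a minimal Python sketch of my own devising (a (1+1)-style hill climber on a made-up fitness function), with each step annotated; note that step 1 - choosing the problem and encoding it as a fitness function - is done entirely by the human before the loop starts:

```python
import random

random.seed(1)

def fitness(x):
    """Step 1 (supplied by the human): the 'problem' is to maximize
    this function over the integers 0..100. Peak is at x = 73."""
    return -(x - 73) ** 2

best = random.randint(0, 100)
for _ in range(2000):
    # Step 2: conjecture a variant of the current solution.
    candidate = min(100, max(0, best + random.choice([-3, -2, -1, 1, 2, 3])))
    # Step 3: measure how well the variant solves the problem.
    # Step 4: retain the better solution.
    if fitness(candidate) > fitness(best):
        best = candidate
# Step 5: stop after a fixed budget of conjectures.
```

Every decision that actually matters algorithmically - the representation, the variation operator, the acceptance rule, the stopping budget - lives outside the five steps.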
The problem, of course, is that "Start with a problem" isn't exactly something you can program in Python.
Optimization is a core component of machine learning, and has a long and rich mathematical history tracing back to Fermat, Lagrange, Newton and Gauss. To develop a novel optimization algorithm requires sophisticated mathematical techniques from a range of different areas, including at a minimum probability theory, linear algebra, and multivariate calculus. Combinatorial optimization is arguably even more challenging, as it deals with discrete parameters, meaning one loses access to gradient information which can be used to direct one towards a local or global optimum.
We can tell UD is a metaphor and not an algorithm by running a simple mental test: If implemented, would the implementation be the same or different than any of the algorithms it claims to generalize? If same, metaphor. Else, novel generalization.
Developing new optimization algorithms is challenging, and requires a deep working familiarity with existing literature and existing techniques. The reason we fell out of love with algorithms inspired by nature is that it's much easier to develop novel metaphors than novel algorithms. "Collect some criticism" or "start with a problem" is excellent research advice for humans, who already have the ability to recognize and solve novel problems, but as instructions which can be programmed into a computer, it's unfortunately too vague to be of much use.
This is why I view the Universal Darwin algorithm as a non-solution to the non-problem of reconciling Popper and machine learning - it's "Universal" because it uses language so expansive that every single algorithm in Russell & Norvig can be described as falling under it:
A survey of Russell & Norvig, the most popular introductory text on AI, finds that many existing AI algorithms already utilize universal Darwinism as the basis for how they work. For example, all of the following popular AI algorithms are really evolutionary algorithms of variation and selection and thus fall under the Universal Darwin algorithm:
Problem Solving by Searching Example: A-Star works by trying out every possible variant (often guided by an admissible heuristic) until the shortest path is found.
Non-Optimal Search Algorithms Example: hill climbing, gradient descent, simulated annealing, and genetic algorithms all utilize variants and selection of the best variants.
Adversarial Search Example: Minimax algorithms search every possible move utilizing a heuristic such as a board evaluation algorithm as a proxy for the best board position.
Constraint Satisfaction Problems Example: Many CSP algorithms try out each possible combination to find an acceptable solution.
Logic and Planning Problems Example: The DPLL algorithm does recursive depth-first search enumerating possible models.
Of course, Bruce knows that a theory which explains everything explains nothing, and he offers the naive Bayes classifier as an example of an algorithm which wouldn't count as Universal Darwinism. But in equation 4 of IPML they have mistaken the probabilistic model for the classifier itself. The actual naive Bayes classifier algorithm is^{12}
\[\hat{y}=\underset{k \in\{1, \ldots, K\}}{\operatorname{argmax}} p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} \mid C_{k}\right)\]for class labels $\{C_k\}_{k=1}^K$ and feature vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$. The argmax implies a for-loop, so the naive Bayes classifier can be written in the style of the Universal Darwin algorithm as follows:
- Start with a problem: Given a feature vector $\mathbf{x}$, classify it into one of $K$ classes.
- Conjecture solutions to the problem: For $k \in \{1, \ldots, K\}$...
- Measure how well the proposed solution(s) solve the problem: ...Compute the joint distribution $p\left(C_{k}\right) \prod_{i=1}^{n} p\left(x_{i} \mid C_{k}\right)$
- Over time, retain the better solution(s): Save whichever $k$ has the maximum value of the joint computed in step 3.
- Repeat from step 2 until the problem is sufficiently solved or an optimum is reached: The problem is sufficiently solved; return $C_k$.
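To make the point concrete, here is the naive Bayes classifier written out as exactly that loop. The priors and likelihoods below are made-up numbers for a toy two-class, two-binary-feature problem - my own illustration, not an example from IPML:

```python
import math

# Hypothetical model parameters: priors p(C_k) and Bernoulli
# likelihoods p(x_i = 1 | C_k) for two binary features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.7],
    "ham":  [0.1, 0.3],
}

def classify(x):
    """Naive Bayes, written as the five 'Universal Darwin' steps."""
    best_class, best_score = None, -math.inf
    for c in priors:                        # step 2: "conjecture" each class k
        score = priors[c]                   # step 3: compute the joint
        for xi, p in zip(x, likelihoods[c]):
            score *= p if xi else (1 - p)   # p(x_i | C_k)
        if score > best_score:              # step 4: "retain" the better solution
            best_class, best_score = c, score
    return best_class                       # step 5: all K tried; return the argmax

print(classify([1, 1]))  # spam
print(classify([0, 0]))  # ham
```

Any exhaustive argmax can be narrated this way, which is precisely why the criterion fails to exclude anything.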
Something, something, "easy to vary", cough, cough.
~
What is Bruce's goal here though? He wants to be able to make the following claim:
\[\begin{aligned} x =~&\text{The Universal Darwin algorithm creates knowledge}\\ y =~&\text{Most existing AI algorithms are the Universal Darwin Algorithm}\\ z =~&\text{Most existing AI algorithms create knowledge.}\\ \end{aligned}\]While it is true that \((x \wedge y) \vdash z\), this is only because he has defined knowledge creation circularly to mean "the output of the Universal Darwin Algorithm":
Universal Darwinism works for the very simple reason that if you compare two (or more) variants as possible solutions to a problem and keep the better solutions - while discarding the worse ones - ultimately you will select and retain the best variants, at least until a maxima is reached. Whether or not it is a global or local maxima depends on the specifics of the problem and the chosen algorithm. For our purposes of this paper, we will refer to this process of improvement of variants as knowledge-creation.
Now we're in the situation where the output of any algorithm at all counts as "knowledge-creation" - even purely inductive machine learning methods, which Popper showed could never produce knowledge.
I think it's easy for people who have read David Deutsch to dismiss all AI researchers as being out to lunch because said researchers aren't familiar with Popper. But there's a reason we tend not to invoke Popperian concepts when developing new algorithms - it's because we want these algorithms to actually work. Which means they have to be implementable on a computer (preferably in python and available open source on GitHub) and also means they have to be able to solve some novel real-world problem that existing algorithms couldn't solve, like detecting handwritten digits, imaging distant black holes, predicting how proteins will fold, or creating stable multi-sided marketplaces like Uber Eats. The Universal Darwin Algorithm unfortunately falls short on both counts.
Bruce and Dan were kind enough to provide initial reactions to an early draft of this post, and have asked for their responses to be included below.
I like the article. I think it contains one big misunderstanding of my position - though that is perhaps my own fault for not explaining it well in the limited space of the paper. But let me make my actual view clear to you here.
I completely agree with you that saying "Popper disproved induction" is a misleading statement at best. It's only "true" if we start with the assumption that the term "induction" is not merely about a conjecture (as you correctly point out) but also includes justification. And in fact (as you also point out!) "induction" *was* defined this way in the minds of most at the time of Popper and - more to the point - continues to be so in the minds of many today.
Thus when I initially speak of "induction" and say things like "does machine learning challenge Popper?" I am writing with the knowledge that many in the audience (apparently not yourself) have yet to realize that induction doesn't have to contain the concept of justification. I'm setting up a problem that they can't solve without accepting that the word induction might also sometimes point to a form that includes no justification and that machine learning is this form of induction. (Or at least in part. I'll explain more in a more detailed response.) This is what I had in mind with the idea of "statistical induction" being different than "baconian induction." But my answer to the question is "no, it does not pose a problem for Popper."
So I feel you are accidentally misrepresenting my position when you say "I claim that a misunderstanding about TPoI has led Bruce to see a problem where none exists." In fact, there clearly is a problem and it is exactly what you yourself go on to explain - that people dogmatically attach justification to the concept of induction even today. I can easily point to examples of this problem both in social media conversations but also coming from luminaries like David Deutsch. I've read multiple papers that make some form of this mistake. So this is a real enough problem.
Your point is that it's a linguistics problem rather than a real problem and I am actually agreeing with you on that. So on this point, I think we actually agree but are going about it differently. You want to rehabilitate "induction" to drop the implication of justification and I am just creating two different terms instead: baconian vs statistical induction. But conceptually this seems like we share a common goal and may even agree on this point.
I agree 100% with the criticisms you've expressed and I'm extremely appreciative that you wrote this. Admittedly this paper was rushed to the press. We will be going back to improve it. The intention was not to say that inductive methods don't exist. The paper was mainly written by Bruce, but one of the things I added was a plug for Gelman's synthesis - that Bayesian methods exist and are useful, but ultimately (for any computationally bounded system) must be contained within a different process that operates at a higher level. I believe something similar holds for deep learning methods, which can sometimes be seen as approximations of Bayesian methods and in the very least are inductive in the sense of fitting patterns in big data.
My interest in this line of work stems from a dissatisfaction with the agents foundation framework that is currently dominant in the AI safety world which views all intelligent systems as approximations of Hutter's idealized AIXI agent. I agree the framework of Universal Darwinism we discuss is currently too "meta" to say anything interesting, but I view it as a step towards a foundation for understanding intelligent agents which synthesizes both Popper and Bayes.
The real idea of induction is this. We obtain (by observation, according to the usual belief) a number of statements such as "This swan is white" or "This raven is black". Having not found any counterexample within a finite spatio-temporal region - say Austria in 1796 - we can sum up these statements by what can be called the evidence e, for example: "All swans in Austria in 1796 are white" (or perhaps "All observed swans in Austria in 1796 are white"). We then generalize the evidence e, beyond the original spatio-temporal region (of observation). The evidence statement e can be described as the conjunction of the statements accepted as established (by observation). The statement that generalizes (extends) our knowledge beyond the region of established statements is the generalization g. For example, g may be "All swans in Europe during the 18th century are white". Accordingly, the generalization always entails the evidence; and so, for any g that is a generalization of e, we have $g \vdash e$.
- K.P. The Non-Existence of Probabilistic Inductive Support, p. 311-312
I prefer the turnstile symbol over the standard notation because our eyes are used to reading statements from left to right, and because it emphasizes that we should think about logical operations as we do mathematical operations. So while mathematical operations use the equality symbol "$=$" and operate on numbers, logical operations use the turnstile symbol "$\vdash$" and operate on statements expressed in a natural language, like English. ↩
The dictionary definition of "turnstile" is "a mechanical gate consisting of revolving horizontal arms fixed to a vertical post, allowing only one person at a time to pass through", which is great intuition to have when thinking about moving truth from place to place. And the word infer means to "carry forward". ↩
If we consider the example
\[\begin{aligned} &x =\text{The show starts at 7:00pm}\\ &y =\text{It is 7:15pm}\\ &z =\text{We are late for the show} \end{aligned}\]then \((x \wedge y) \vdash z\), and we see that logical entailment lets us extract useful secondary information from initial propositions, namely that if \((x \wedge y)\) is true, then \(z\) is true, and we should hurry. ↩
i.e. $E \nvdash G$. ↩
I'm assuming you're reading this because you've heard of the problem of induction from elsewhere, but if not, Ben and I discuss the "why should you care about any of this at all" question here. ↩
Well Hume actually was the one to realize it, but let's not complicate the storyline. ↩
This warrants further clarification. In English, \(\neg e_{i}\) says "Not all observed swans in Austria in year $i$ were white" - that is, at least one observed swan was not white - which logically entails that "All swans are white" is false. Thus the general law $G$ is considered falsified (\(\neg G\)). So while no amount of observed white swans can tell us the truth of the general law $G$, it only takes a single observed non-white swan $\neg e_{i}$ to tell us that general law $G$ is false. ↩
I.e. equation (1), despite the fact that it is logically invalid. ↩
It was Hume who pointed out that equation (1) is, in fact, logically invalid. ↩
As a way to get around the fact that equation (1) is logically invalid. ↩
For an overview of Naive Bayes see, for instance, here or here. ↩