# Learning From Experience

###### 17th Jun 2020

###### Reading Time: 17 minutes

###### Epistemology, Philosophy

In my previous Epistemology post, I wrote about how we learn about the world around us, focusing mostly on the idea that much of what we think we know is told to us by people we trust, and we believe them. But how do we know we should believe them?

## Statistical Inference

*Statistics* is the "collection, analysis, interpretation, and presentation of masses of numerical data". *Probabilities* are descriptions or predictions, expressed as a ratio of the number of outcomes we're interested in to the total number of possible outcomes. For example, the chance of drawing any particular card out of a thoroughly shuffled deck of 52 cards is \( \frac{1}{52} \): the outcome we're looking for (1 card) divided by the total number of possible outcomes (52 cards).

By the same logic, the odds of drawing a `3` card would be \( \frac{4}{52} = \frac{1}{13} \). The chances of drawing a card from a particular suit would be \( \frac{13}{52} = \frac{1}{4} \), the chances of drawing a red card are \( \frac{26}{52} = \frac{1}{2} \), and so on. These are *descriptive statistics* - statistics used to analyze numerical properties of a set of data, such as probabilities, when all the possibilities in that set are known.
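These descriptive probabilities are simple enough to check in code. Here's a quick sketch in Python using exact fractions (the variable names are just for illustration):

```python
from fractions import Fraction

DECK_SIZE = 52

# The outcome we want, divided by the total number of possible outcomes.
specific_card = Fraction(1, DECK_SIZE)   # any one particular card
three = Fraction(4, DECK_SIZE)           # four 3s in the deck
suit = Fraction(13, DECK_SIZE)           # thirteen cards per suit
red = Fraction(26, DECK_SIZE)            # twenty-six red cards

print(specific_card, three, suit, red)   # 1/52 1/13 1/4 1/2
```

`Fraction` reduces automatically, which is why \( \frac{4}{52} \) prints as `1/13`.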

Statistics can also be used to make predictions about an entire set of data when we only know some of the data. The idea is that there must be a true probability distribution for any set or "population" of data, even if we don't know what all those values are. *Statistical inference* is the idea that we can use what we *do* know about a specific set of data to make educated guesses about the parts we *don't* know. The more we know about a set of data, the safer the assumption that the data we have are *representative* of the data as a whole.

Pierre-Simon Laplace, a French polymath of the late 18th and early 19th centuries and a major early contributor to probability theory, proposed a basic example of statistical inference called the Rule of Succession. Laplace came up with it as an answer to the sunrise problem - "how likely is it that the sun will rise tomorrow?" It is a simple model for estimating the likelihood of a future occurrence (your "hypothesis") based on past experience:

\[ \frac{(s+1)}{(n+2)} \]

In this expression, *s* stands for the number of times your hypothesis has been true, and *n* is the number of opportunities where it could have been right or wrong. For example: what are the odds of the sun rising tomorrow, based purely on your personal experience?

Let's say that you're exactly 20 years old: 20 years times 365.25 days a year equals 7,305 days. That'll be *n*, the number of opportunities in your life the sun's had to rise in the morning. Since the sun has risen every single morning of your life, *s* will also be 7,305. Plugging these into the Rule of Succession, we get:

\[ \frac{(7,305+1)}{(7,305+2)} = \frac{(7,306)}{(7,307)} = 0.99986314492952 \]

Since the ratio \( \frac{7,306}{7,307} \) doesn't reduce to a simpler fraction, I've converted it to a decimal number, 0.999863... When expressed this way, a *probability* is a number between 0 and 1 representing our quantifiable level of certainty based on the information we have about a larger, unknown data set - how many total sunrises there will ever be.

Because probabilities are ratios where the numerator is always a non-negative number no larger than the denominator, they can never be larger than 1 or less than 0. A probability of 1 would mean absolute certainty that our hypothesis is correct, while a probability of 0 would mean absolute certainty that it is wrong. A probability of 0.5 would mean it's equally likely to go either way.

0.999863 is pretty close to 1, meaning our certainty is high. Probabilities can be multiplied by 100 to convert them to percentages, which may feel more familiar: this means we can be 99.9863% certain that the sun will rise tomorrow, at least if we base our guess only on the life experience of a 20-year-old. Pretty good odds.

Why isn't the Rule of Succession just \( \frac{s}{n} \)? That would make some intuitive sense: if we were merely asking for the frequency of sunrises in the past - the descriptive statistics about only the data we have - the answer *would* be 100%. However, the Rule of Succession specifically asks for the chances of the *next* event, recognizing that we don't know for sure how many sunrises will happen in the future.
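The Rule of Succession is a one-liner in code. Here's a minimal sketch in Python (the function name is my own):

```python
from fractions import Fraction

def rule_of_succession(s: int, n: int) -> Fraction:
    """Estimated probability of the next success, given
    s successes in n past opportunities (Laplace)."""
    return Fraction(s + 1, n + 2)

# A 20-year-old's sunrise estimate: 7,305 sunrises in 7,305 days.
p = rule_of_succession(7305, 7305)
print(float(p))  # 0.9998631...
```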

## Imperfect Data

Statistical inference requires the recognition that our data is imperfect; we are basing these probabilities only on the data we have, and our data sets from the real world are always incomplete. Prediction in the real world is like drawing from a deck of cards where we don't know how many cards there are or what all the faces might be, so we can never be completely certain what will happen next. Something that hasn't happened before might happen next time; something that's been reliable for many years might fail.

The essential uncertainty of not knowing what all the possible values might be means that inferential probabilities can never be exactly 0 or 1. In other words, we can never be completely certain a hypothesis is right or wrong until we know all of the outcomes - something which is often objectively impossible. This principle of uncertainty is called Cromwell's Rule, based on a memorable quote from a letter Oliver Cromwell wrote to the General Assembly of the Church of Scotland in 1650:

> I beseech you, in the bowels of Christ, think it possible that you may be mistaken.

Without perfect data, perfect predictions can't be made. In the real world, we are almost always dealing with imperfect data. So how can we use statistical inference to make our best guesses?

## Starting From Indifference

A second reason that \( \frac{s}{n} \) would make for a lousy Rule of Succession is that it would tell us absolutely nothing about an unfamiliar situation. If we wanted to speculate about an event that we had no prior experience with, the expression would evaluate to \( \frac{s}{n} = \frac{0}{0} \). Division by zero is meaningless, so this tells us nothing.

Fortunately, there's another rule in probability that can help us out when confronted by a novel situation: the Principle of Indifference. This says that, given no prior information, we spread the probabilities out evenly over the number of possible outcomes. In other words, given two possibilities and no way of knowing what's more likely, the chances are even; each possibility gets a probability of 0.5 (50%).

Laplace's Rule of Succession accounts for this beginning uncertainty: filling in the Rule of Succession with zeros, we get:

\[ \frac{(0+1)}{(0+2)} = \frac{(1)}{(2)} =0.5 = 50\% \]

So with no evidence at all, a new hypothesis has equal chances of being true or false. As soon as we get more information, however, our ability to estimate probabilities will get better and better. For example, if the new hypothesis turns out to be false after the first attempt, our estimate of the probability of it being correct drops to \( \frac{(0+1)}{(1+2)} = \frac{1}{3} = 0.3\overline3 \) or 33.3%. Conversely, if it turns out to be correct, \( \frac{(1+1)}{(1+2)} = \frac{2}{3} = 0.6\overline6 \) or 66.6%.
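Those first few updates are easy to verify with the same ratio (a sketch, using exact fractions):

```python
from fractions import Fraction

def rule_of_succession(s, n):
    return Fraction(s + 1, n + 2)

print(rule_of_succession(0, 0))  # 1/2 - no evidence: indifference
print(rule_of_succession(0, 1))  # 1/3 - one failure so far
print(rule_of_succession(1, 1))  # 2/3 - one success so far
```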

As we gain experience with the hypothesis, these results will (we hope) start to converge toward the real ratio of possibilities in the larger, unknown set of data. The more times our hypothesis is validated, the more confidence we can have that it is correct. A simple example can illustrate this and maybe shed some light on our epistemological issue: how do we know we can trust someone and believe what they say?

## Who Can We Trust?

To answer whether or not we can trust someone, we have to start with a hypothesis: "this person is trustworthy." Then, as we accumulate experience with them, we can test that hypothesis and see how likely it is to be correct. One way we can evaluate their trustworthiness is by asking how often the things they say turn out to be true.

These could be either statements of fact they make ("it's definitely going to rain today"), things they promise ("I will meet you at 7"), or any other assertion where we can easily determine whether their words matched up with the truth. Each time they tell the truth, we'll give them a point (*s*). Each time they lie, we'll increment the denominator (*n*) without giving them a point.

Where do we start? Well, how much do we trust a person we just met? Going back to the Principle of Indifference, we can set our trust at 0.5; perfectly even odds, 50/50, a coin toss, whether we can trust them or not. In reality, we use a number of conscious and unconscious biases to determine a starting point of trust. Some of these prejudices are fair and some of them aren't, but let's say for the sake of simplicity for now that these biases even out.

Let's introduce some characters: Alice, Bob, Carol, Dan, and Randy. After starting out each relationship unbiased, we've now had ten interactions with each of them:

Alice has been trustworthy 10 out of 10 times, a really reliable person: \[ \frac{(10+1)}{(10+2)} = \frac{11}{12} = 0.916\overline6 = 91.6\% \]

Bob has been trustworthy 9 out of 10 times - well, things happen, right?: \[ \frac{(9+1)}{(10+2)} = \frac{10}{12} = 0.83\overline3 = 83.3\% \]

Carol has only come through for us 6 out of 10 times - kind of a flake: \[ \frac{(6+1)}{(10+2)} = \frac{7}{12} = 0.583\overline3 = 58.3\% \]

Dan is a real stinker, only telling the truth 3 out of 10 times: \[ \frac{(3+1)}{(10+2)} = \frac{4}{12} = 0.33\overline3 = 33.3\% \]

Randy's completely unpredictable, having told the truth 5 out of 10 times: \[ \frac{(5+1)}{(10+2)} = \frac{6}{12} = 0.5 = 50\% \]
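All five estimates follow the same pattern, so they collapse into a short loop (a sketch; names and counts are taken from the examples above):

```python
from fractions import Fraction

def rule_of_succession(s, n):
    return Fraction(s + 1, n + 2)

# (name, trustworthy interactions) out of 10 opportunities each
people = [("Alice", 10), ("Bob", 9), ("Carol", 6), ("Dan", 3), ("Randy", 5)]

for name, s in people:
    print(f"{name}: {float(rule_of_succession(s, 10)):.3f}")
```

This prints 0.917, 0.833, 0.583, 0.333, and 0.500 respectively, matching the hand calculations.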

The complement of a probability is the probability that your hypothesis is false. Alice, who seems pretty reliable, can still be predicted to be untrustworthy \( 1 - 0.916\overline6 = 0.083\overline3 \) or 8.3% of the time. Dan, however, who is only trustworthy 3 out of 10 times, can be counted on to let us down \( 1 - 0.3\overline3 = 0.6\overline6 \) or 66.6% of the time.

As we gain more experience with someone, we gain more data to feed into our equation. For example, if Alice turns out to be trustworthy one more time, her trustworthiness improves:

\[ \frac{(11+1)}{(11+2)} = \frac{12}{13} = 0.923 = 92.3\% \]

If Alice lets us down, however, her trustworthiness decreases:

\[ \frac{(10+1)}{(11+2)} = \frac{11}{13} = 0.846 = 84.6\% \]

Dan, who's pretty untrustworthy in our estimation, would have to work really hard to regain our trust. He'd have to be trustworthy all of the next 4 chances just to get back to 50%:

\[ \frac{(7+1)}{(14+2)} = \frac{8}{16} = \frac{1}{2} = 0.5 \]

He'd have to act trustworthy the next *36 times in a row* to get to the level of trust Bob was at after the first 10 opportunities:

\[ \frac{(39+1)}{(46+2)} = \frac{40}{48} = \frac{5}{6} = 0.83\overline3 \]
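That 36-in-a-row figure can be checked by brute force (a sketch, starting from Dan's 3-out-of-10 record and aiming for Bob's \( \frac{5}{6} \) level):

```python
from fractions import Fraction

def rule_of_succession(s, n):
    return Fraction(s + 1, n + 2)

s, n = 3, 10   # Dan's starting record
streak = 0
while rule_of_succession(s, n) < Fraction(5, 6):  # Bob's 83.3% level
    s += 1     # another trustworthy act...
    n += 1     # ...at another opportunity
    streak += 1
print(streak)  # 36
```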

By contrast, after 46 iterations, if Alice continued to prove reliable every time, her reliability would reach 0.979. If, at opportunity 47, she slipped up and acted untrustworthy, her reliability would only decrease to 0.959:

\[ \frac{(46+1)}{(47+2)} = \frac{47}{49} = 0.959 \]

If, starting at opportunity 47, Alice were to act untrustworthy six times in a row - nearly as many times as Dan failed in his first 10 opportunities - she would only fall to 0.87:

\[ \frac{(46+1)}{(52+2)} = \frac{47}{54} = 0.87 \]

While Dan, at the same number of opportunities, having acted reliably 45 times in a row since his initial missteps, still lags behind her at 0.851:

\[ \frac{(45+1)}{(52+2)} = \frac{46}{54} = 0.851 \]

There's an old saying, "*you never get a second chance to make a first impression*". Once an impression of misbehavior has been established, it can be very difficult to correct it, while someone with an excellent track record can slip here or there without damaging their reputation too much. This very basic approach to probability illustrates why: when data sets are small, small differences in numbers create big differences in probabilities. As the data grows, small differences in the data make less and less difference in the resulting probabilities.

Playing these examples out over 60 opportunities: Alice acts trustworthy from the first opportunity to the 46th, and then untrustworthy from opportunity 47 - 60. Dan acts untrustworthy his first 7 opportunities, and then works hard to make up for it, acting trustworthy from opportunity 8 - 60. Despite all her good behavior, Alice's trustworthiness never reached 1 and never could, due to Cromwell's Rule. It took Dan 43 more opportunities - until the 53rd overall - to catch up with Alice, and only because she started failing. If Alice had kept up her trustworthiness, Dan would never have caught up.
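The crossover in that scenario can be reproduced in a few lines (a sketch; the 60-opportunity schedule comes straight from the example):

```python
from fractions import Fraction

def rule_of_succession(s, n):
    return Fraction(s + 1, n + 2)

# Alice: trustworthy for opportunities 1-46, untrustworthy afterward.
# Dan: untrustworthy for his first 7 opportunities, trustworthy afterward.
alice_s = dan_s = 0
crossover = None
for n in range(1, 61):
    alice_s += 1 if n <= 46 else 0
    dan_s += 1 if n > 7 else 0
    if crossover is None and rule_of_succession(dan_s, n) >= rule_of_succession(alice_s, n):
        crossover = n

print(crossover)  # 53: the first opportunity where Dan's estimate matches Alice's
```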

So, the Rule of Succession gives us a very basic way to estimate probabilities, based only on past results. It illustrates many of the basic concepts behind inferential probability in a way that's hopefully easy to grasp: the inherent uncertainty of the future, the way probabilities build from a neutral initial position to greater and greater certainty, and the impact of additional information on future estimates.

But this simplistic model assumes that we can tell when we've been lied to and when we haven't. How do we actually know that? To answer that question, we have to delve into *conditional probability*, which gives us a framework for modeling how we can iterate on our probabilities, factoring in new evidence as we get it. To do this, we use Bayes' Theorem.

## Bayesian Inference

Bayes' Theorem was developed by Thomas Bayes in *An Essay towards solving a Problem in the Doctrine of Chances*, read to the Royal Society and published posthumously in 1763. The Theorem is a statement of *conditional probability*, meaning a probability that takes specific evidence into account. It is the basis of *Bayesian inference*, a process of improving estimates of future probabilities based on additional evidence. Bayes' Theorem presented in modern probability notation is:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]

Where:

- \( P(A) \) is the Probability of A
- \( P(B) \) is the Probability of B
- \( P(A|B) \) is the Probability of A, given the evidence B
- \( P(B|A) \) is the Probability of B, given the evidence A

\( P(A) \) and \( P(B) \) are *independent* probabilities. They represent our best guess at the likelihood of those events, without regard for other evidence. \( P(A|B) \) and \( P(B|A) \) are *conditional* probabilities. They represent the probability of an event (the first term) if another event (the second term) occurs.

*The Boy Who Cried Wolf* is one of Aesop's Fables and a well-known morality tale in the Western world. The story is about a shepherd boy who grows bored and decides to amuse himself by crying out that a wolf is attacking his flock. His fellow villagers rush to help, but there is no wolf. He does it again, and the villagers once again come to help, but find no wolf.

The third time, a wolf really does come, but the shepherd's cries are ignored by the villagers, who have learned not to trust him. We can analyze this story using Bayes' Theorem to see how the villagers' trust in the shepherd boy was eroded over time, based on the evidence that he's not a reliable reporter of wolf attacks.

Let's start by defining our independent probabilities. We'll say that \( P(A) \) is the village's best estimate of the probability of a wolf attack. We can relabel it as \( P(\text{Wolf}) \) for clarity. \( P(B) \) will be our best guess at the probability of the shepherd boy crying "wolf" - for any reason whatsoever. We'll rename it \( P(\text{Cry}) \).

Now let's define our conditional probabilities. \( P(A|B) \) will be our estimate of the probability that a wolf is attacking, based on the shepherd's cries. We can restate this as \( P(\text{Wolf}|\text{Cry}) \) for clarity.

Finally, we'll need to estimate the probability that the boy will cry "wolf" when there really is a wolf, \( P(\text{Cry|Wolf}) \). Putting those into Bayes' Theorem, our equation now looks like:

\[ P(\text{Wolf|Cry}) = \frac{P(\text{Cry|Wolf})P(\text{Wolf})}{P(\text{Cry})} \]

The values in this equation represent:

\( P(\text{Wolf|Cry}) \): The probability of a Wolf, given that the boy is crying wolf. This is what we're trying to estimate.

\( P(\text{Cry|Wolf}) \): Our estimate of the probability of the boy crying wolf, when there really is a Wolf.

\( P(\text{Wolf}) \): Our estimate of the base probability that a Wolf will actually appear on a given day.

\( P(\text{Cry}) \): Our estimate of the probability of the boy crying wolf, for any reason.

Now we need to estimate the probabilities that make up the inputs for our equation. What's the probability the boy will cry wolf when there *really is* a wolf? For now, let's assume for simplicity that the boy will *always* raise the alarm if there's really a wolf: \( P(\text{Cry|Wolf}) = 1 \)

Next, what's the actual probability of a wolf showing up on any given day? Let's say that, based on the village's prior experience (using the Rule of Succession or similar), we estimate that wolves will show up to harass the village flock about 5 days out of 100: \( P(\text{Wolf}) = 0.05 \)

Finally, what's the probability that the boy will cry wolf, regardless of the reason? To begin with, let's say he'll only cry wolf if there is a wolf, so the probability of him crying wolf is identical to the probability of a wolf showing up: \( P(\text{Cry}) = 0.05 \)

Putting all these together, the equation becomes:

\[ P(\text{Wolf|Cry}) = \frac{1(0.05)}{0.05} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.05} \]

\[ P(\text{Wolf|Cry}) = 1 = 100\%\]
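Bayes' Theorem itself is also a one-liner in code. Here's a minimal sketch in Python (the function and argument names are my own):

```python
def bayes(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# The villagers' starting point: the boy only cries wolf when
# there really is one, so P(Cry) equals P(Wolf).
p_cry_given_wolf = 1.0   # he always raises the alarm for a real wolf
p_wolf = 0.05            # wolves show up about 5 days in 100
p_cry = 0.05             # for now, he only cries when there's a wolf

print(bayes(p_cry_given_wolf, p_wolf, p_cry))  # 1.0
```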

So the boy starts out 100% trustworthy. Of course, this would not happen with real data, because none of the prior probabilities would be 0 or 1. But it's a good starting point for this example. Now let's see what happens when he starts to prove untrustworthy.

Let's say the boy has built up 100 days of trustworthiness by the time he's given the shepherd job - after all, he probably grew up in the village, so they've known him his whole life, and they wouldn't give an important, unsupervised job to an untrustworthy young man. However, by the afternoon of his first day, the boy gets bored. He thinks it will be funny to call the alarm falsely, so he cries "Wolf! Wolf!" The townsfolk all grab their pitchforks and slings and run to the field, only to see the boy laughing merrily at their concern. They are a generous and trusting folk, and some may chuckle a little at the prank - but this definitely hurts the town's opinion of the boy's trustworthiness.

The villagers now know that the shepherd might cry wolf when there *isn't* one. We can use the Rule of Succession \( \frac{s+1}{n+2} \) ratio to see how likely they think this is. The boy has a total of 100 trustworthy days out of 101 opportunities, so:

\[ \frac{(100 + 1 )}{(101 + 2)} = \frac {101}{103} = 0.98 \]

Subtracting 0.98 from 1 gives us an estimate of the likelihood that the boy is crying wolf when there *isn't* one: \( 1 - 0.98 = 0.02 \), or 2% of the time.

In order to include this new information in our denominator, we'll need to break down how we calculate \( P(\text{Cry}) \), which is the probability that the boy will cry wolf for any reason. We can break this down into the sum of all probabilities why the boy might cry wolf: the probability that he'll cry wolf because there *is* a wolf, plus the probability that he'll cry wolf when there isn't a wolf. We could even break it down further into every possible reason why he might cry wolf, but let's stick to these two for now. Here's the basic equation for calculating the denominator in Bayes' Theorem:

\[ P(B) = P(B|A)P(A) + P(B|\neg A)P(\neg A) \]

This introduces a new term, \( \neg A \) ("not A"), which just means any event *other* than A. The expression simply means that \( P(B) \) is equal to the sum of all its possible causes: either B is caused by A, or B is caused by something *other* than A. If we wanted to break it down into more possibilities, we could, by simply adding more probabilities like \( P(B|C)P(C) \), \( P(B|D)P(D) \), and so on, but for now let's just stick to the two possibilities: \( P(A) \) or \( P(\neg A) \).

Before, when the villagers assumed he was perfectly trustworthy, there *was* no "not A" possibility in their minds, which is why \( P(B) \) ended up equal to \( P(A) \). Now, however, they have a value for how often he might cry wolf when there isn't one: 0.02. Since wolves show up 5% of the time, they can also derive a value for how often there *isn't* a wolf: \( 1 - 0.05 = 0.95 \). So let's factor in this new information about the shepherd boy's reliability:

\[ P(B) = P(B|A)P(A) + P(B|\neg A)P(\neg A) \]

\[ P(\text{Cry}) = P(\text{Cry|Wolf})P(\text{Wolf}) + P(\text{Cry}|\neg \text{Wolf})P(\neg \text{Wolf}) \]

\[ P(\text{Cry}) = 1(0.05) + 0.02(0.95) \]

\[ P(\text{Cry}) = 0.05 + 0.019 \]

\[ P(\text{Cry}) = 0.069 \]

So now the villagers think the boy is 6.9% likely to cry wolf on any given day, even though there's only a 5% chance of there actually being a wolf. Now, let's redo our equation with that denominator and see how trustworthy they think he is after this first prank:

\[ P(\text{Wolf|Cry}) = \frac{P(\text{Cry|Wolf})P(\text{Wolf})}{P(\text{Cry})} \]

\[ P(\text{Wolf|Cry}) = \frac{1(0.05)}{0.069} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.069} \]

\[ P(\text{Wolf|Cry}) = 0.724 = 72.4\% \]
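The whole first-prank update chains these steps together. Here's a sketch in Python, keeping the rounded 0.02 from the text (with full precision the result comes out slightly higher):

```python
def rule_of_succession(s, n):
    return (s + 1) / (n + 2)

p_cry_given_wolf = 1.0
p_wolf = 0.05

# After the prank: 100 trustworthy days out of 101 opportunities,
# rounded to two places as in the text.
p_cry_given_no_wolf = round(1 - rule_of_succession(100, 101), 2)  # 0.02

p_cry = p_cry_given_wolf * p_wolf + p_cry_given_no_wolf * (1 - p_wolf)
p_wolf_given_cry = p_cry_given_wolf * p_wolf / p_cry
print(p_wolf_given_cry)  # ~0.7246, the 72.4% above
```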

So in one thoughtless act, the shepherd boy has gone from 100% trustworthy to only 72.4% trustworthy; the villagers now think that if the boy cries wolf, there's only a 72.4% chance that there actually is a wolf. What happens if he repeats his prank the following day?

\[ \frac{(100+1)}{(102+2)} = \frac{101}{104} = 0.97 \]

\[ P(\text{Cry}) = 1(0.05) + 0.03(0.95) \]

\[ P(\text{Cry}) = 0.05 + 0.0285 \]

\[ P(\text{Cry}) = 0.0785 \]

\[ P(\text{Wolf|Cry}) = \frac{1(0.05)}{0.0785} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.0785} \]

\[ P(\text{Wolf|Cry}) = 0.636 = 63.6\% \]

On the third day, the villagers' confidence that there's actually a wolf has dropped to 63.6% - less than 2/3 odds that there's actually a wolf. No wonder the villagers don't show up!

Obviously these numbers were contrived to fit the story, but hopefully it makes the example easy to follow. In reality, the shepherd would probably have much more trust built up with the villagers to begin with - but they'd also be evaluating his actions using a theory of mind, not pure probabilities. Still, by quantifying the probabilities we can illustrate numerically how fragile trust can be, and how easily trust built up over many opportunities can be quickly eroded by a few missed opportunities.

Here's the whole theorem played out assuming the boy had 300 trustworthy days built up before deciding to play his prank:

\[ P(\text{Wolf|Cry}) = \frac{P(\text{Cry|Wolf})P(\text{Wolf})}{P(\text{Cry|Wolf})P(\text{Wolf}) + P(\text{Cry}|\neg \text{Wolf})P(\neg \text{Wolf})} \]

Day 1:

\[ P(\text{Wolf|Cry}) = \frac{1(0.05)}{1(0.05) + (1 - (301/303))(0.95)} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.05 + 0.00627} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.05627} = 0.889 = 88.9\% \]

Day 2:

\[ P(\text{Wolf|Cry}) = \frac{1(0.05)}{1(0.05) + (1 - (301/304))(0.95)} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.05 + 0.00937} \]

\[ P(\text{Wolf|Cry}) = \frac{0.05}{0.05937} = 0.842 = 84.2\% \]
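That day-by-day erosion generalizes into a short helper (a sketch: `trust_after_pranks` is my own name, it holds \( P(\text{Cry|Wolf}) = 1 \) and \( P(\text{Wolf}) = 0.05 \) fixed, and it computes with full precision rather than hand-rounded intermediate values, so its results may differ slightly from figures rounded by hand):

```python
def trust_after_pranks(trustworthy_days: int, pranks: int,
                       p_wolf: float = 0.05) -> float:
    """P(Wolf|Cry) after a run of false alarms, assuming the boy
    always cries wolf when there really is one."""
    s = trustworthy_days
    n = trustworthy_days + pranks
    p_cry_given_no_wolf = 1 - (s + 1) / (n + 2)   # Rule of Succession
    p_cry = 1.0 * p_wolf + p_cry_given_no_wolf * (1 - p_wolf)
    return p_wolf / p_cry

for day in (1, 2):
    print(f"Day {day}: {trust_after_pranks(300, day):.3f}")
```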

What this hopefully illustrates is how Bayes' Theorem improves on the simple Rule of Succession as a tool for inference: it accounts for much more of the available information. Using it effectively means gathering more evidence, but it also lets us take into account the many upstream probabilities on which the probability of interest depends.

Keep in mind that in these examples, we're holding various quantities constant (and breaking Cromwell's Rule). In reality, the probability that the boy would cry wolf every time there was a wolf wouldn't be 100% - it might be close, but it would be less. Similarly, the probability of a wolf showing up would change slightly every day - decreasing every day there was no wolf, and rising after its appearance on the third day.

Of course, the real lesson of the boy who cried wolf story isn't that the villagers were right to stop responding to the alarms. It's that they should've replaced the boy with someone more trustworthy.

The real world is full of questions that require us to evaluate evidence, and Bayes' Theorem helps us break down the components of a probability in a way that can help us think more clearly about what those components are. It may require us to spend a lot of time gathering evidence in order to accurately estimate the likelihood of our hypothesis, but the more evidence we gather, the sharper our predictions will be.

## Cheat Sheet for Bayes' Theorem

A quick reference for using Bayes' Theorem to estimate probabilities. Try calculating some on your own and see how it goes!

\( P(A|B) \): The probability of \( A \), given the evidence \( B \).

\( P(B|A) \): The probability of \( B \), given the evidence \( A \).

\( P(A) \): The probability of \( A \).

\( P(B) \): The probability of \( B \), which consists of:

\( P(B) = P(B|A)P(A) + P(B|\neg A)P(\neg A) \)

\( P(B|\neg A) \): The probability of B *without* A.

\( P(\neg A) \): The probability of not A (the inverse of \( P(A) \)).

The whole equation:

\[ P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\neg A)P(\neg A)} \]