Evaluating Predictions in Hindsight

Posted on April 16, 2020 by TheZvi

Epistemic Status: Confident I have useful things to say, but apologies for the long post because I don’t think it’s worth my time to make it shorter. Better to get thoughts down for those who want to read them.

Scott Alexander’s latest post points to the question of how best to evaluate predictions. The way he characterized leading predictions on Trump and Brexit, that ‘prediction is hard,’ instinctively bothered me. Characterizing the coronavirus situation the same way bothered me even more.

(Before anyone asks, I was ahead of the public but definitely dropped the ball on the early coronavirus prediction front, in the sense that I failed to make a lot of money and failed to warn the public, and waited far too long to move out of New York City. A large part of that fault was that I predicted things badly by my own standards. A large part was other failures. I take full responsibility. But that’s not what I want to talk about right now.)

How does one evaluate past predictions?

As someone who used to place wagers and/or price prediction markets for a living, and later traded with what I believe are some of the best traders on the planet, I’ve thought about this question a lot.

We can divide situations into easy mode, where there are a large number of independent predictions made and robust markets or probabilities against which one can evaluate those predictions, and hard mode, where this is not true, and you are often evaluating an individual prediction of a singular event.

Easy Mode

Most of my time was spent in ‘easy mode.’ Here, easy mode is when one is predicting lots of things for which there are established market prices, or at a minimum baseline fair values that one can be evaluated against. You have a lot of data points, and are comparing your predictions and decisions to a known baseline.

Easy mode makes it realistic to seek a metric that cannot be easily fooled, where you can use your results as evidence to prove what you are doing ‘works’ in some sense.

There is no one metric that is best even in easy mode. There are a few different ones that have merit. I’ll go through at least some of them.

Method One: Money Talks, Bull*** Walks

Did you make money?

If you did, congratulations. Good predicting.

If you didn’t, sorry. Bad predicting. If you didn’t bet, it doesn’t count.

This method has a lot to recommend it. It’s especially great over long periods of time with lots of distinct opportunities of relatively constant size and odds, where gains or losses from individual trades are well bounded.

There are, however, some severe problems, and one should also seek other methods.

If your trades have tail risk, and can have huge positive or negative payoffs, then that risk can often dominate your expected value (your Alpha) but not impact your observed results, or impact your observed results out of proportion to (or even in the opposite direction of) the Alpha involved.
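To put rough numbers on the tail risk point, here is a minimal sketch. The payoff sizes and probabilities are made up for illustration, not taken from any real trading record.

```python
# Hypothetical trade: usually wins a little, rarely loses a lot.
WIN_PROB = 0.99      # assumed chance of the small win
WIN_SIZE = 1.0       # units won on a normal trade
LOSS_SIZE = 200.0    # units lost when the tail event hits


def expected_value_per_trade(win_prob, win_size, loss_size):
    """True edge (alpha) per trade, tail included."""
    return win_prob * win_size - (1.0 - win_prob) * loss_size


def chance_of_clean_record(win_prob, n_trades):
    """Probability the tail never shows up in n independent trades."""
    return win_prob ** n_trades


ev = expected_value_per_trade(WIN_PROB, WIN_SIZE, LOSS_SIZE)
clean = chance_of_clean_record(WIN_PROB, 50)
print(f"EV per trade: {ev:+.2f} units")
print(f"Chance that 50 trades show nothing but profit: {clean:.0%}")
```

Under these assumed numbers the strategy loses about one unit per trade in expectation, yet more often than not a 50-trade record shows only winners. The observed results say almost nothing about the Alpha.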

If you sometimes trade big and sometimes trade small, that reflects your confidence and should be weighed, but also can lead to your bigger trades being all that matters. Which trades are big is often a matter of opportunity and circumstance, or an attempt to manipulate results, rather than reflecting what we want to measure.

Often trades and predictions are highly correlated even if they don’t look correlated.

Trading results often reflect other trading skills, such as speed and negotiation and ability to spot obvious errors. It’s very possible to have a trading strategy that spends most of its time doing things at random, but occasionally someone else typos or makes a huge mental error or there’s a bug in someone’s code, or you find a really dumb counter-party who likes to play very big, and suddenly you make a ton.

The exact method of trading, and which instruments are used, often has a dramatic effect on results even though it expresses the same underlying predictions.

Adverse selection shows up in real world trades where and how you least expect it to. Also exactly how you most expect it to.

And so on. It gets ugly.

These and other things allow someone trying to demonstrate skill to manipulate their results and usually get away with it, or for someone with good luck to look much better than they are, often for a remarkably long time.

In general, see the book Fooled by Randomness, and assume it’s worse than that.

Still, it’s money, and it’s useful.

Method Two: Trading Simulation, Where Virtual Money Talks

In this method, we start with a set of predictions, or a model that makes those predictions from data. Then we establish rules for how it will trade based on that information, and see whether the system makes money.

The big risk is that we can cheat. Our system would have made lots of money. Uh huh. Lots of ways to cheat.

Thus, the best simulations are where the person making the predictions is distinct from the person running the simulation, and the simulation runs in real time.

My good friend Seth Burn will run these simulations on sports models. He’ll take Nate Silver’s or ESPN’s or someone else’s predictions, translate them into win probabilities if necessary, then see what would happen if they were allowed to wager at market odds using Kelly betting. Sometimes it goes well. Other times it goes poorly. I generally consider making such models go full Kelly, without adjusting beliefs at all for market odds, a bit harsh. You Never Go Full Kelly. But I do get it.

When I run simulations on my own stuff, or often other people’s stuff, I will instead use threshold betting. If something is good enough, one unit will be wagered, and sometimes this will be scaled up to two or three units gradually as perceived edge improves. But we won’t force the system to Go Full Kelly. Because that would lead to heavily distorted results. The way you know if your plan is profitable is if it can make money a little at a time, not whether it would get lucky or blow itself up if you didn’t take reasonable precautions. And again, You Never Go Full Kelly.
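As a rough sketch of the difference between the two sizing rules, assuming a binary contract bought at a market price between 0 and 1 that pays 1 if the event happens. The function names and the specific thresholds (`min_edge`, `step`, `max_units`) are illustrative assumptions, not the actual rules Seth or I use.

```python
def kelly_fraction(model_prob, market_price):
    """Full-Kelly fraction of bankroll to stake on 'yes' at this price.

    Buying at price m a contract that pays 1 gives net odds b = (1 - m) / m,
    so the standard Kelly formula (b*p - q) / b simplifies to (p - m) / (1 - m).
    Returns 0 when the model sees no edge on the 'yes' side.
    """
    edge = model_prob - market_price
    if edge <= 0:
        return 0.0
    return edge / (1.0 - market_price)


def threshold_units(model_prob, market_price, min_edge=0.03, step=0.05, max_units=3):
    """Flat-unit sizing: one unit once the edge clears min_edge, plus one
    more unit per additional `step` of edge, capped at max_units."""
    edge = model_prob - market_price
    if edge < min_edge:
        return 0
    return min(max_units, 1 + int((edge - min_edge) / step))


for model_prob in (0.52, 0.60, 0.90):
    print(model_prob,
          f"full Kelly stakes {kelly_fraction(model_prob, 0.50):.0%} of bankroll,",
          f"threshold rule bets {threshold_units(model_prob, 0.50)} unit(s)")
```

A model that says 90% against a 50% market gets most of its bankroll wagered under full Kelly, so a few such disagreements decide the whole simulation. The threshold rule caps the damage (and the luck) at a few units.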

Simulated trading is vital if you are planning to do actual trading. If you don’t simulate the things you intend to actually do, you can find yourself effectively testing a dramatically different hypothesis than the hypothesis you expected to test. That can end very badly.

These types of tests are good sanity checks and gut checks, all around. They make it much harder to fool yourself, if implemented reasonably.

Of course, in other ways, they make it easier to fool yourself.

Overfitting on results of any kind is highly dangerous, and this can encourage that and make it much worse. Often simulations are doing much more highly correlated things than one realizes, on any number of levels. Unscrupulous people can of course easily manipulate such results; it can become the worst kind of p-hacking taken up a level.

A big risk is that you can think that your predictions are good because you have a handful of data errors. If your predictions are remotely sane, then any large error in historical prices will be something your simulation jumps on. You’ll make a ton on those, whereas in real life any attempt to take advantage of those opportunities would not have been allowed, and also would not have been all that impressive an act of prediction. Guarding against this is super important, and usually involves manually looking at any situations where you think your edge is super large to ensure your recorded market prices are real.

Most of all, this method doesn’t actually reward accurate predictions. It rewards predictions that tend to disagree in the correct direction. That’s a very different thing.

Thus, think of this as an indicative and necessary method of evaluation wherever it is available, but in no way as a sufficient method, even when implemented properly in real time. But certainly, if the real time simulated test keeps working, I will consider updating my priors away from the market prices, and putting real money on the line after a while.

Method Three: The Green Knight Test

The Green Knight test gets its name from a character in the Arthurian legend. You get to swing at The Green Knight, then The Green Knight gets to swing at you.

Thus, you get to trade against the market at its fair price. Then the market gets to trade against you, at the model’s fair price, for the same amount. So if it’s a prediction market and the market says 50% and you say 60%, your net price is 55%. Whereas if you say 90%, your average price will be 70%, and you’ll do a lot worse.
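The arithmetic above can be sketched as follows. The `true_prob` argument is an assumed ‘real’ probability used only to score the result, and 58% is a number I made up for the example.

```python
def directional_pnl(model_prob, market_prob, true_prob):
    """Expected profit from trading one contract at the market's price only."""
    if model_prob >= market_prob:
        return true_prob - market_prob
    return market_prob - true_prob


def green_knight_pnl(model_prob, market_prob, true_prob):
    """Expected profit over both legs of the Green Knight test.

    Leg 1: you trade one contract against the market at the market's price.
    Leg 2: the market trades one contract against you at your model's price.
    Both legs put you on the side your model favors, so your average entry
    price is the midpoint of the two prices.
    """
    avg_price = (model_prob + market_prob) / 2.0
    if model_prob >= market_prob:
        return 2.0 * (true_prob - avg_price)   # long two contracts
    return 2.0 * (avg_price - true_prob)       # short two contracts


TRUE_PROB = 0.58  # assumption for illustration
for model_prob in (0.60, 0.90):
    print(model_prob,
          f"directional: {directional_pnl(model_prob, 0.50, TRUE_PROB):+.2f}",
          f"green knight: {green_knight_pnl(model_prob, 0.50, TRUE_PROB):+.2f}")
```

Both models disagree with the market in the correct direction, so both make the same money trading at the market’s price. Only the Green Knight version punishes the 90% model for being badly wrong about the magnitude.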

How much you are allowing yourself to consider market prices, when deciding on your own beliefs, is your decision. If the answer isn’t ‘quite a lot’ it can get very expensive.

The point of The Green Knight Test is to use markets and trades to put you to the test, but to treat your model and the market as equals. The question is not whether you can directionally spot market inefficiencies. That’s (relatively) easy. I firmly believe that one can spot some amount of inefficiency in any market.

The question is, can you come up with better values than the market? That’s very, very hard if your process doesn’t heavily weigh the existing market prices. If you can pass this test without looking directly at the market prices at all, and you’ve confirmed that the market prices in question were real, your prices really are better than the market’s prices.

The even harder version of the test is to fully reverse the scenario. You take only the role of the market maker, allowing the market to trade at your model’s fair prices. If you can survive without a substantial loss, now you can fully reject the market’s prices, and treat your model’s prices as real.

The advantage of The Green Knight Test is it reminds you exactly how much you do not know, and holds you to a very high standard. Unless you are doing a pure math exercise like pricing derivatives, it’s expected that you will fail this test. It’s perfectly fine. The goal is to fail it less, and to remember that you fail it. Except when you actually pass it, then the sky’s the limit.

And yes, on one occasion that didn’t involve a derivative, I did pass this test convincingly. That’s a story for another day.

Method Four: Log Likelihood

I have no idea why I needed actual Eliezer Yudkowsky to first point out to me I should be using this, but once he did point this out it became obvious. Log likelihood for probabilistic outcomes is the obvious go-to standard thing to try.

If your goal is to reward accuracy and punish inaccuracy, log likelihood will do that in expectation. Your score on any given event is the natural log of your model’s probability of the outcome that happened.

Every time you improve your probability estimates, your expected score improves. Make your model worse, and it gets worse. Be highly overconfident and it will cost you quite a lot.
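A minimal sketch of the scoring rule and of the incentive it creates. The 75% ‘true’ probability is an assumption for the example, and the function names are mine.

```python
import math


def log_likelihood(predictions):
    """Sum of ln(probability assigned to what actually happened).

    predictions: iterable of (prob_of_yes, happened) pairs, with
    prob_of_yes strictly between 0 and 1 and happened a bool.
    Scores are always <= 0; closer to 0 is better.
    """
    total = 0.0
    for prob_yes, happened in predictions:
        prob_of_outcome = prob_yes if happened else 1.0 - prob_yes
        total += math.log(prob_of_outcome)
    return total


def expected_score(stated_prob, true_prob):
    """Expected per-event score when you say stated_prob on a true_prob shot."""
    return (true_prob * math.log(stated_prob)
            + (1.0 - true_prob) * math.log(1.0 - stated_prob))


# Suppose the event is really a 75% shot.
for stated in (0.75, 0.70, 0.99):
    print(f"say {stated:.0%}: expected score {expected_score(stated, 0.75):.3f}")
```

Saying the true number gives the best expected score, a small miss costs a little, and the 99% claim costs roughly twice the honest score per event.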

The best feature of log likelihood is that it provides perfect incentives.

The problem is that when you look at a score, you have no idea what you are looking at. There is no intuitive association between an LL score and a level of accuracy in prediction. Part of that is that we’re not used to them. The bigger issue is that a score doesn’t mean anything outside of the exact context and sample the score is based upon.

LL scores only mean something when you compare model one to model two on the exact same set of predictions.

They are all but useless with even tiny variations in what predictions are being scored. One additional unlikely event happening, or even one event being a foregone conclusion rather than a coin flip, will wipe out massive gains from model improvements, sometimes across thousands of predicted events.
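To get a feel for the scale, here is a back-of-the-envelope sketch under assumed numbers: every ordinary event is a true 60% shot, the old model says 55%, the improved model says 60%, and the one extra event is something the model called 1% that happened anyway.

```python
import math


def expected_score(stated_prob, true_prob):
    """Expected per-event log score when you say stated_prob on a true_prob shot."""
    return (true_prob * math.log(stated_prob)
            + (1.0 - true_prob) * math.log(1.0 - stated_prob))


# Expected gain per event from fixing the model (assumed numbers).
gain_per_event = expected_score(0.60, 0.60) - expected_score(0.55, 0.60)

# Score lost on one extra event called 1% that happened anyway.
surprise_cost = -math.log(0.01)

events_to_offset = surprise_cost / gain_per_event
print(f"gain per event: {gain_per_event:.4f}")
print(f"cost of one surprise: {surprise_cost:.2f}")
print(f"events of improvement needed to offset it: {events_to_offset:.0f}")
```

With these numbers it takes around nine hundred events of genuine improvement to pay for one surprise that the other sample did not happen to contain.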

What is meaningful is, we have this set of predictions, and we compare it to the market’s implicit predictions, and/or to another model or version of the model, and see which is better. Now we can get an idea of the magnitude of improvement (although again, what that magnitude means won’t be intuitive, other than to compare different score gaps with each other).

All of that skepticism assumes that everyone’s model is doing something sane. If someone is making huge mistakes, LL scores will pick it up very loudly as long as there is time to get punished for those mistakes enough times. If you’re going around saying 99% on 75% shots, or 20% on 50% shots, that will cut through a lot of noise.

Of course, if you were making errors that severe, there hopefully isn’t much need to use LL in order to realize that.

Method Five: Calibration Testing

This is the way Scott Alexander scores his predictions.

The principle is that your 60% predictions should happen 60% of the time, your 70% predictions should happen 70%, and so on. If they happen more often than that, you’re under-confident. If they happen less than that, you’re over-confident.
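A bare-bones version of the check, run on a made-up record. This is a sketch of the idea, not Scott’s actual scoring code.

```python
from collections import defaultdict


def calibration_table(predictions):
    """Group predictions by stated probability and compare to the hit rate.

    predictions: iterable of (stated_prob, happened) pairs.
    Returns {stated_prob: (count, observed_frequency)}.
    """
    buckets = defaultdict(list)
    for stated_prob, happened in predictions:
        buckets[stated_prob].append(1 if happened else 0)
    return {
        prob: (len(hits), sum(hits) / len(hits))
        for prob, hits in sorted(buckets.items())
    }


# Hypothetical record: ten 60% calls (7 hit) and ten 90% calls (8 hit).
record = ([(0.6, True)] * 7 + [(0.6, False)] * 3
          + [(0.9, True)] * 8 + [(0.9, False)] * 2)

for prob, (count, freq) in calibration_table(record).items():
    if freq > prob:
        verdict = "under-confident"
    elif freq < prob:
        verdict = "over-confident"
    else:
        verdict = "on target"
    print(f"{prob:.0%} bucket: {count} calls, {freq:.0%} happened ({verdict})")
```

With only ten calls per bucket the verdicts are mostly noise, which is part of why this works far better in easy mode than on a handful of predictions.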

This is certainly a useful thing to check. If you’re consistently coming in with bad calibration, or are reliably badly calibrated at a particular point (e.g. perhaps your 10% chances are really 5%, but your 30%+ chances are roughly fair) then you can correct that particular mistake.

At a minimum, this is a bar that any predictor needs to clear if it wants to keep making probabilistic predictions with a straight face.

If you won’t put probabilities on your predictions, this test won’t work, except that we’ve already shown you aren’t doing very good predicting.

In most cases this will quickly reveal that someone isn’t trying to choose realistic probabilities. They’re saying words that they think will have a particular impact.

Such people can still be making useful predictions. To choose a very blatant example of someone doing this constantly, when Scott Adams says something is 100% going to happen, he neither believes this nor considers himself to be lying. To him, that’s just ‘good persuasion’ to anchor people high and force them to update. What he means is, ‘I think event X is more likely than you would think, so increase your probability estimate of X.’

There might or might not be a ‘substantially more than 50%’ actual prediction in there. If you read more than that into his statement, he’d say that’s your fault for being bad at persuasion.

Certainly he does not think that the numerous times a 100% to happen thing did not happen should send him to Bayes’ hell or cause people to dismiss his statements as worthless. He also doesn’t think one should ignore such misses, but why would you take someone’s stated numbers seriously?

Thus, asking if someone is well-calibrated is a way of asking if they are for reals attempting to provide accurate information, and if they have developed some of the basic skills required to do so. Learning whether this is so is very good and useful.

The problem with calibration testing is that you can get a perfect score on calibration without providing any useful predictions.

The direct cheat is one option. It’s very easy to pick things in the world that are 90% to happen, or 75%, or 50%, or 1%, if you are making up the statements yourself.

The more subtle cheat is another. You can have your 75% predictions be half things that are definitely true, and half things that are true half the time. Maybe you’re making a real error when you conflate them. Maybe you’re doing it on purpose. Hard to say.
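Here is a small sketch of why the subtle cheat passes a calibration test but not a log likelihood comparison, assuming the 75% bucket really is an even mix of sure things and coin flips.

```python
import math


def expected_score(stated_prob, true_prob):
    """Expected log score per event; terms with zero weight are skipped."""
    score = 0.0
    if true_prob > 0:
        score += true_prob * math.log(stated_prob)
    if true_prob < 1:
        score += (1.0 - true_prob) * math.log(1.0 - stated_prob)
    return score


# Assumed mix behind the "75%" bucket: half sure things, half coin flips.
true_probs = [1.0, 0.5]

# Calibration only sees the overall hit rate of the bucket.
hit_rate = sum(true_probs) / len(true_probs)

# Average expected log score if everything is lumped together at 75% ...
lumped = sum(expected_score(0.75, p) for p in true_probs) / len(true_probs)
# ... versus telling the two kinds of statement apart.
separated = sum(expected_score(p, p) for p in true_probs) / len(true_probs)

print(f"hit rate of the bucket: {hit_rate:.0%}")
print(f"lumped at 75%: {lumped:.3f} per event")
print(f"separated:     {separated:.3f} per event")
```

The bucket comes in at exactly 75%, so calibration is perfect, while the forecaster who can tell the two kinds of statement apart gets a clearly better score on the same events.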

This is typically what happens when people who are ‘well-calibrated’ give 90% (or 95% or 98% or 99.9%) probabilities. They’re mostly building in a chance they are making a stupid mistake or misunderstood the question, or other similar possibilities. Which you have to do.

Calibration is a good sanity check. It’s no substitute for actual evaluation.

Method Six: The One Mistake Rule

This method is where you look for an obviously wrong probability. Obviously wrong can be on the level of ‘a human who understands the space would know this instantly’ or it can be on the level of ‘upon reflection that number can’t possibly be right, or it contradicts your other answers that you’re still sticking with.’ The level required to spot a mistake, and how big a mistake you can spot, are ways of measuring how good the predictions are.

Often when you find an obviously wrong statement, you find something important about whoever made the statement. In many cases, you learn that person is a bullshit artist. In other cases, you learn that there’s something important they don’t or didn’t know or understand, or something they overlooked. Or you find something important about their world view that caused this strange answer.

And of course sometimes they’re right and you’re wrong. Also a great time to learn something.

Same thing for a model. If you find a model saying something clearly wrong, then you can use that to find a flaw in the model. Ideally you can then fix the flaw. Failing that, you hope to know what the flaw is so you can correct for it if it happens again – you can flag the model explicitly as not taking factor X into account.

Other times they made a sign or data entry error. There’s always bugs in the code. It’s not always a revelation.

That leads into the concept of evaluating an individual prediction. Which is what one must do in hard mode.

Hard Mode

In hard mode, our metrics don’t work. We need to use reason to think carefully about particular spots.

Looking back, we ask the question of whether our predictions and probabilities were good, what reasonable predictions and probabilities would have been and why, and what information we should have looked for or that would have changed our opinions. There are a few different ways to evaluate.

One question to ask is, suppose we were to rewind time. How often would things again turn out the way they did, versus another way? How close was this event’s outcome? Could random events from there have changed the outcome often? What about initial conditions you had no way of knowing about? What about conditions you didn’t know about but could have checked, or should have checked, and what were those conditions? What would have had to have gone differently?

In some cases, one looks back and the result looks inevitable. In others, it was anything but inevitable, and if it had rained in different cities, or one person makes a different hard decision, or news stories happen to slant a different way or something, on the crucial day the other candidate gets elected. In others, it was inevitable if you were omniscient, but given your information it was anyone’s game.

Sports are a great tool for this question because remarkably few things in sports are truly inevitable. Sports are full of guessing games and physical randomness. Any Given Sunday really does mean something, and one can look back and say a game was 50% versus 65% vs. 80% vs. 95% vs. 99% vs. 99.9% vs. 99.99% for the favorite to win. The question of ‘what was the real probability’ is truly meaningful. Someone who said the wrong number by a sufficient margin can be objectively wrong, regardless of whether that favorite actually won.

That’s not true for many other things, but it is a useful perspective to treat them as if it were. It is more true, from more perspectives and in more ways, than people think.

Obviously this is not an exact science.

For sports, I could go into endless examples and the merits of various methods of evaluation. One good standard there is ‘what would be the odds if they played another game next week?’ Which has some weird stuff in it but is mostly a concrete way of thinking about ‘what would happen if we re-ran the event and randomized the details of the initial conditions?’

Another good general approach is ‘what do I now know that I didn’t know before, and how does that change my prediction?’ Where did my model of events go wrong?

A third thing to do is to look at the components of your predictions. In hindsight, do the implied conditional probabilities make sense? When things started to happen, how did you update your model? If they had gone differently, how would you have updated, and would those updates have added up to an expected value close to zero?
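The last question in that list is a consistency check you can actually compute. A tiny sketch with invented numbers:

```python
def expected_update(prior, branches):
    """Probability-weighted average change in belief across possible news.

    branches: list of (chance_of_branch, posterior_if_branch) pairs whose
    chances sum to 1. A coherent plan for updating gives a result of zero.
    """
    return sum(chance * (posterior - prior) for chance, posterior in branches)


# Hypothetical plan: start at 30%; good news (40% likely) moves you to 50%,
# bad news (60% likely) moves you down to about 17%.
coherent = expected_update(0.30, [(0.40, 0.50), (0.60, 1.0 / 6.0)])

# Same plan, except bad news only moves you down to 25%.
lopsided = expected_update(0.30, [(0.40, 0.50), (0.60, 0.25)])

print(f"coherent plan: {coherent:+.3f}")
print(f"lopsided plan: {lopsided:+.3f}")
```

If the weighted updates do not come out near zero, as in the second plan, then either the starting number or the planned reactions were wrong before anything happened.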

A fourth thing to do is look at the hidden assumptions. What are your predictions assuming about the world that you didn’t realize you were assuming, or that turned out not to be true? Often you can learn a lot here.

A key takeaway from doing my analysis below of various predictions is that my opinion of the prediction often depends almost not at all on the outcome. Your prediction’s logic is still its logic. In many cases, the actual outcome is only one additional data point.

One cannot point out too many times how easy it is to fool yourself with such questions, if you are looking to be fooled, or even not looking to not be fooled.

Since most of my audience is not deep into the sportsball, I will illustrate further only with non-sports examples.

It makes sense to start with the two that inspired this post, then go from there.

Note that I’ll be doing political analysis, but keeping this purely to probabilities of events. No judgments here, no judgments in the comments, please.

Scott’s two examples

Scott’s two examples from his recent post were Brexit and the 2016 Presidential Election.

In both cases, predictors that are at least trying to try, such as Nate Silver and Tetlock’s forecasters, put the chances of things going the historical way at roughly 25% right before the elections happened. Also in both cases, mainstream pundits and conventional wisdom mostly claimed at the time that the chance was far lower, in many cases very close to (but not quite) 0%. In both cases, there were people who predicted the other outcome and thought it was likely to happen, but not many. Also in both cases, the result may have partially been caused by the expectation of the other result. If voters had realized the elections were close, voters might have decided differently.

Importantly, in both cases, the polls, which are the best first-level way to predict any election, had the wrong side ahead but by amounts that historically and statistically were insufficient to secure victory.

Both elections were very close. Remain had almost as many votes as leave, to the extent that different weather in different areas of the United Kingdom could have made the difference (London voted heavily remain, other places for leave). Trump lost the popular vote and barely won the electoral college, after many things broke his way in the final week and day.

These are textbook cases, in this system, of results that were very much not inevitable. It is very, very easy to tell stories of slightly different sequences of events in the final week or days that end in the opposite result. If everything visible had been the same but the outcome went the other way, it would not have been more surprising than what happened even in hindsight.

As we were warned would happen, both results were then treated as far more inevitable than they actually were. Media and people in general rushed to form a narrative that these results were always going to happen. The United Kingdom treated a tiny majority as an inviolate will of the people rather than what it was, evidence that the country was about evenly split. Everyone wrote about the United States completely differently than if a hundred thousand votes had been distributed differently, or any number of decisions had been made a different way.

If you bet on Trump or on Leave at the available market prices, you made a great trade.

But, if you claimed that those sides were definitely going to win, that it was inevitable (e.g. the Scott Adams position) then you were more wrong than those who said the same thing about Remain and Clinton. This seems clear to me despite your side actually winning.

The only way to believe that predicting a Trump win as inevitable was a reasonable prediction is to assume facts about the world not in evidence. To me, it is a claim that the election either was stolen, or would have been stolen if Trump had been about to lose it. Same or similar thing with Leave.

The generalized version of that, as opposed to election fraud, is a more common pattern than is commonly appreciated. The way that things that look close are actually inevitable is that the winning side had lots of things up their sleeve, or had effectively blocked the scenarios where they might lose, in ways that are hard to observe. Try to change the outcome and the world pushes back hard. They didn’t pull out their ace in the hole because they didn’t need it, but it was there.

I don’t think that Nate Silver’s ~25% chance for Trump (and 10% chance to win despite losing the popular vote!) was merely what Scott Alexander called it, a bad prediction but ‘the best we could do.’ I think it was actually a pretty great prediction. The reasonable hindsight range is something like 20% to 40%. You need to give a decent chunk of the distribution to Trump, and he can’t be the favorite. If your prediction was way off of this in either direction I think you were wrong. I think Remain vs. Leave follows a very similar pattern.

(For the 2020 Election, I similarly think that anyone who thinks either candidate is a huge favorite is wrong, and will almost certainly in hindsight still have been wrong in this way regardless of the eventual outcome, because so many things could happen on multiple fronts. To be confident you’d need to be confident at a minimum of the politics and the economics and the epidemiology. That doesn’t mean it will be close on election day, or in October.)

Scott’s calibration exercise

Scott’s predictions are a clean set of probabilities that are clearly fair game. Sticking there seems reasonable.

Let’s look at Scott’s predictions for the year 2019 next. How do they look?

By his convention, strikethroughs mean it didn’t happen, lack of a strikethrough means it happened.

Politics (Reminder, strategic discussions only, please)

1. Donald Trump remains president: 90%
2. Donald Trump is impeached by the House: 40%

The House impeached Trump for something that, as of the time of the prediction, hadn’t happened yet. It is clear the actual barrier to convincing Pelosi was high. If things had been enough worse, impeachment might not have happened because of resignation. So you could reasonably say the 40% number looks somewhat high in hindsight. The argument for it not being high is if you think Trump always keeps escalating until impeachment happens, especially if you think Trump actively wanted to be impeached. I’m inclined to say that on its own 40% seems reasonable, as would have 20% or 30%.

The 90% number is all-cause remaining President. Several percent of the time Trump dies of natural causes, as he’s in his 70s. Several percent more has to be various medical conditions that prevent him from serving. Again, he’s in his 70s. World leaders also sometimes get shot, we’ve lost multiple presidents that way. Also, he’s impulsive and weird and looks like he often hates being president so maybe he decides to declare America great again and quit. And if there’s a dramatic change to world conditions and the USA doesn’t have a president anymore, he’s not president. Small probabilities but they add up. The majority of the 10% has to be baked in. We can reduce some of those a little in hindsight but not much.

So saying 90% is actually giving a very small probability of Trump leaving office for other reasons, especially given a 40% chance of impeachment – his probability of surviving politically conditional on the House being willing to impeach has to be at least 90%. Given the ways Trump did react and might have reacted to such conditions, and that some of the time the underlying accusations are much worse than what we got, this looks overconfident at 90% and I’d prefer to see 80%. But a lot of that is the lack of precision available when you only predict by 10% increments; 85% would have been fine.

~~3. Kamala Harris leads the Democratic field: 20%~~
~~4. Bernie Sanders leads the Democratic field: 20%~~
5. Joe Biden leads the Democratic field: 20%
~~6. Beto O’Rourke leads the Democratic field: 20%~~

(Disclosure, at PredictIt I sold at various points all but three candidates, one of those three was Joe Biden, and my mistake in hindsight was not waiting longer to sell a few of them along with not selling one of the other two when I had the chance).

Scott’s nominee predictions, however, seem really sloppy. These four candidates were not equally likely. The prediction markets didn’t think so, their backgrounds and the polls didn’t think so. The dynamics we saw play out don’t think so, either. Things came down to a former vice president to a popular president who led in the polls most of the way versus the previous cycle’s runner up.

Putting them on equal footing with a random congressman from Texas who lost a close race once while looking exciting, or a more traditionally plausible alternative candidate like Kamala Harris, doesn’t age well.

Nor does having these all be 20% and adding to 80%, leaving 20% for the other 16 or so candidates including Elizabeth Warren, plus any unexpected late entries.

The defense of the 20% on Biden is to say Biden was known to be old and a terrible candidate who predictably ran a terrible primary campaign, so he was overrated even though he ended up winning, while Harris and O’Rourke were plausibly very good candidates given what we knew at the time. I do think there’s broad range for such arguments, but not to this extent.

This is where calibration makes you look good but shouldn’t. Name the four leading candidates (or at least four plausible-to-be-top-four candidates, to be generous) and give them each 20% and your calibration will look mostly fine even if that evaluation doesn’t make sense and the remaining field is really more like 30-40% than 20%.

This is also where the human element can warp your findings. There’s a lot of ‘X has to be higher than Y’, or ‘X ~= Y here looks sloppy’ or ‘X can’t be an underdog given Z’ or what not. We have a lot of rules of thumb, and those who break those rules will look worse than they deserve, while those that follow those rules but otherwise talk nonsense will look better.

As usual, use a variety of evaluation methods and switch them up when it looks like someone might be Goodharting.

7. Trump is still leading in prediction markets to be Republican nominee: 70%
8. Polls show more people support the leading Democrat than the leading Republican: 80%

This 70% number seems like a miss low to me if you accept Scott’s other predictions above. In Scott’s model, Trump is 90% to be President, which means he’s twice as likely to be President but losing the nomination fight (20%) as he is to be out of office (10%), despite at the time facing zero credible opposition. If you again take out the 5%+ chance that Trump is physically unfit for office and leaves because of it, that makes it many times more likely to Scott that Trump can’t get the nomination but stays President, versus him stepping down. I can’t come up with a good defense of less than 80% or so in this context.
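The arithmetic behind that complaint, as a quick sketch. It assumes that Trump cannot be leading the nomination markets without still being President, which is my simplification and not something Scott stated.

```python
def implied_split(p_president, p_leads_nomination):
    """Split the outcomes implied by the two stated probabilities.

    Assumes 'leads the nomination markets' only happens in worlds where
    he is still President (a simplifying assumption for this sketch).
    Returns (p_out_of_office, p_president_but_trailing).
    """
    p_out_of_office = 1.0 - p_president
    p_president_but_trailing = p_president - p_leads_nomination
    return p_out_of_office, p_president_but_trailing


out_of_office, trailing = implied_split(0.90, 0.70)
ratio = trailing / out_of_office
print(f"out of office:            {out_of_office:.0%}")
print(f"President but not leading: {trailing:.0%}")
print(f"ratio: {ratio:.1f}x")
```

Raising the nomination number to 80% would bring the two scenarios level, which is why I can’t defend anything lower.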

Predicting the Democratic candidate as likely to be ahead seems right, as that had been largely both true and stable for a while for pretty much any plausible Democratic candidate. 80% seems a little overconfident if we’re interpreting this as likely voters, but not crazy. A year is a long time, the baseline scenario was for a pretty good economy, and without anything especially good for Trump happening we saw some close polls.

Of course, if we interpret this as all Americans then 80% seems too low, since non-voters and especially children overwhelmingly support Democrats. And if we literally read this as people anywhere then it should be 95% or more. A reminder of how important it is to word predictions carefully.

9. Trump’s approval rating below 50: 90%
~~10. Trump’s approval rating below 40: 50%~~

90% seems overconfident to me, although 80% would have been too low. It’s saying that the world is definitely in ‘nothing matters’ mode and meaningful things are unlikely to happen. This of course goes along with the 80% chance he’ll be behind in the polls, since if he’s above 50% approval he’s going to be ahead in the polls almost every time.

50% for approval ratings below 40 seems clearly more right than 40% or 60% would have been. This is an example of predictions needing to be evaluated at the appropriate level of precision. It’s easy to say “roughly 50%” here, so the ‘smart money’ is the ones who can say 53% instead of 50% and have it be accurate. So credit here for staying sane, which is something.

~~11. Current government shutdown ends before Feb 1: 40%~~
12. Current government shutdown ends before Mar 1: 80%
13. Current government shutdown ends before Apr 1: 95%
~~14. Trump gets at least half the wall funding he wants from current shutdown: 20%~~
15. Ginsberg still alive: 50%

I would not have been 95% confident that the shutdown wouldn’t extend past April 1. It doesn’t seem implausible to me at all that the two sides could have deadlocked for much longer, since it’s a zero-sum game with at least one of the players as a pure zero-sum thinker and where the players hate each other. There were very plausible paths where there were no reasonable lines of retreat. Once we get into March, chances of things resolving seem like they go down, not up. I think the 40% and 80% predictions look slightly high, but reasonable.
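
One way to see what the three shutdown numbers imply is to convert the cumulative probabilities into conditional ones: given that the shutdown is still running at the start of a window, how likely is it to end inside that window? This is a minimal sketch; the helper name `conditional_end_probs` is made up for the example.

```python
# Scott's cumulative probabilities that the shutdown has ended before each date.
cumulative = [("Feb 1", 0.40), ("Mar 1", 0.80), ("Apr 1", 0.95)]

def conditional_end_probs(cumulative):
    """Chance the shutdown ends in each window, given it was still running
    at the start of that window."""
    results = []
    prev = 0.0
    for label, p_by in cumulative:
        still_running = 1.0 - prev
        results.append((label, (p_by - prev) / still_running))
        prev = p_by
    return results

for label, p in conditional_end_probs(cumulative):
    print(f"ends before {label}, given still running at window start: {p:.0%}")
```

The stated numbers imply that a shutdown still running on Mar 1 has about a 75% chance of ending during March, up from roughly 67% for February. The conditional chance of resolution rises over time, which is the opposite of the "down, not up" intuition above.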

I am not enough of a medical expert to speak to Ginsburg’s chances of survival, but I’m guessing 50% was too low.

ECON AND TECH
16. Bitcoin above 1000: 90%
17. Bitcoin above 3000: 50%
18. Bitcoin above 5000: 20%
19. Bitcoin above Ethereum: 95%
20. Dow above current value of 25000: 80%
~~21. SpaceX successfully launches and returns crewed spacecraft: 90%~~
~~22. SpaceX Starship reaches orbit: 10%~~
23. No city where a member of the general public can ride self-driving car without attendant: 90%
~~24. I can buy an Impossible Burger at a grocery store within a 30 minute walk from my house: 70%~~
25. Pregabalin successfully goes generic and costs less than $100/month on GoodRx.com: 50%
26. No further CRISPR-edited babies born: 80%

The first question I always wonder when I see predictions about Bitcoin is whether the prediction implies a buy or implies a sale.

At the time of these predictions, Bitcoin was trading at roughly $3,500.

Scott thought Bitcoin was a SCREAMING BUY.

The reason this represents a screaming buy is that Scott has Bitcoin almost 50% to be trading higher versus lower. But if Bitcoin is higher, often it is double its current price or higher, which in fact happened. You have a long tail in one direction only. Even in Scott’s numbers, the asymmetry between 20% to be above 5000 and 10% to be below 1000 points towards this.

Was that right, given what he knew? I… think so? Probably? I was already sufficiently synthetically long that I didn’t buy (if you’re founding a company that builds on blockchain, investing more in blockchains is much less necessary), but I did think that the mean value of Bitcoin a year later was probably substantially higher than its $3,500 price.

What is clearly wrong is expecting so little variance in the price of Bitcoin. We have Bitcoin more likely to be in the 3000-5000 range, or the 1000-3000 range, than to be above 5000 or below 1000. That doesn’t seem remotely reasonable to me, and I thought so at the time. That’s the thing about Bitcoin. It’s a wild ride. To think you shouldn’t be on the ride at all, given the upside available, you have to think the ride likely ends in a crash.
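
To make the buckets concrete, here is a small sketch that turns Scott's three thresholds into implied probability mass per price range, then computes a rough expected price. The representative prices per bucket, including the tail price, are hypothetical placeholders chosen for the example; they are not Scott's numbers or mine from the time.

```python
# Scott's stated probabilities that Bitcoin ends the year above each threshold.
p_above = {1000: 0.90, 3000: 0.50, 5000: 0.20}

# Implied probability mass in each price bucket.
buckets = {
    "below 1000": 1.0 - p_above[1000],
    "1000-3000": p_above[1000] - p_above[3000],
    "3000-5000": p_above[3000] - p_above[5000],
    "above 5000": p_above[5000],
}
for name, p in buckets.items():
    print(f"{name:>11}: {p:.0%}")

def rough_expected_price(tail_price):
    """Expected price using one hypothetical representative price per bucket.
    All four representative prices are illustrative assumptions."""
    representative = {
        "below 1000": 500,
        "1000-3000": 2000,
        "3000-5000": 4000,
        "above 5000": tail_price,
    }
    return sum(buckets[name] * representative[name] for name in buckets)

for tail in (8000, 12000):
    print(f"tail price ${tail:,}: rough expected price ${rough_expected_price(tail):,.0f}")
```

The middle two buckets hold 70% of the mass, which is the low-variance picture being criticized. The expected price is driven mostly by what you assume about the top bucket, so the size of the "buy" depends heavily on how fat you think that right tail is.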

Bitcoin above Ethereum at 95% depends on how seriously you treat the correlation. At the time Ethereum was roughly $120 per coin, or about 3% of a Bitcoin. Most of Ethereum’s variance for years has been Bitcoin’s variance, and they’ve been highly correlated.

Note that this isn’t ETH market cap above BTC market cap, it’s ETH above BTC, which requires an extra doubling.

If we think about three scenarios – BTC up a lot, BTC down a lot, BTC mostly unchanged – we see that ETH going up 3000% more than BTC seems like a very crazy outcome in at least two of those scenarios. Given how little variance we’ve put into BTC, giving ETH that much variance in the upside or mostly unchanged scenarios doesn’t make sense.

So the 5% probability is mostly coming from a BTC collapse that ETH survives. BTC being below 1000 is only 10% in this model. Of that 10%, most of the time this is a general blockchain collapse, and ETH does as badly or worse. So again, aside from general model uncertainty and ‘5% of the time strange things happen’ 5% seems super high for the full flippening to have happened, and felt so at the time.

And of course, again, if ETH is 5% to be above BTC and costs 3% of BTC, then ETH is super cheap relative to BTC! It’s worth more just based on this scenario sometimes happening! Anyone who holds BTC is a complete fool given this other opportunity, unless they are really into balancing a portfolio.
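
The "worth more just based on this scenario" claim can be checked with a deliberately pessimistic lower bound, measured in units of BTC. The sketch assumes ETH is worth nothing relative to BTC in every world where the flip does not happen, which is an illustrative worst case rather than a forecast.

```python
p_flip = 0.05         # implied by Scott's 95% on "Bitcoin above Ethereum"
current_ratio = 0.03  # ETH price as a fraction of BTC price, per the text

# In the flip scenario one ETH is worth at least one BTC (ratio >= 1).
# Pessimistic assumption: in all other scenarios ETH is worth zero BTC.
expected_ratio_lower_bound = p_flip * 1.0 + (1.0 - p_flip) * 0.0

print(f"lower bound on expected ETH/BTC ratio: {expected_ratio_lower_bound:.2f}")
print(f"current ETH/BTC ratio:                 {current_ratio:.2f}")
print(f"expected value vs. price: at least {expected_ratio_lower_bound / current_ratio:.2f}x")
```

Even with ETH zeroed out in 95% of worlds, the flip scenario alone puts the expected ratio above the current one, so taking the 5% at face value implies ETH was badly underpriced against BTC.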

It’s important to note when predictions are making super bold claims, especially when the claims do not look that bold.

The Dow being 80% to be above its current value, by contrast, is a very safe and reasonable estimate, since crashes down tend to be large and we expect the market on average to have positive returns. Given rounding, can’t argue with that, and wouldn’t regardless of the outcome unless there was a known factor about to crash it (e.g. something analogous to covid-19 that was knowable at the time).

On to SpaceX. Being 90% confident of anything being accomplished in space travel for the first time by a new institution within a given year seems like a mistake given what I know about space travel. But I have not been following developments, so perhaps this was reasonable (e.g. they had multiple opportunities and well-planned-out missions to do this, and it took a lot to make it not happen). Others can fill this in better than I can. I have no idea how to evaluate their chances of reaching orbit, since that depends on the plausibility of the schedule in question, and how much they would care about the milestone for various reasons.

The self-driving car prediction depends on exactly what would have counted. If this would have to have been on the level of ‘hail a driverless cab to and from a large portion of a real city’ then 10% seems very reasonable. If it would have been sufficient to have some (much lesser) way in which a member of the public could ride a driverless car, I think that wasn’t that far away from happening and this would have been too low.

I am very surprised that Scott couldn’t at the time buy an Impossible Burger within a 30 minute walk from his house. I know where his house is. I can buy one now, within a 30 minute walk from my house (modulo my complete unwillingness to set foot in a grocery store, and also my unwillingness to buy an Impossible Burger), and in fact have even passed “meat” sections that were sold out except for Impossible Burgers. Major fast food chains sell them. Of course, they had a very good year, almost certainly much better than expected. So 70% seems fine here, to me, with the 30% largely being that Impossible Burgers don’t do as well as they did, and only a small portion of it being that Scott’s area mysteriously doesn’t carry them. Seriously, this is weird.

The prediction on Pregabalin I have no way to evaluate.

The question of CRISPR-edited babies should have been worded ‘are known to have been born’ or something similar, to make this something we can evaluate. Beyond that, it’s a hard one to think about.

WORLD
~~27. Britain out of EU: 60%~~
~~28. Britain holds second Brexit referendum: 20%~~
29. No other EU country announces plan to leave: 80%
30. China does not manage to avert economic crisis (subjective): 50%
31. Xi still in power: 95%
32. MbS still in power: 95%
~~33. May still in power: 70%~~
34. Nothing more embarassing than Vigano memo happens to Pope Francis: 80%

Once again the 95% numbers seem too high even when I can’t think of an exact scenario where they lose power, but again it’s not a major mistake.

The Vigano memo seems unusually embarrassing as a thing that happens to the Pope relative to the average year, thinking historically. Most years nothing terribly embarrassing happens to Popes, the continuing abuse scandal seems like the only plausible source for embarrassing things, and Francis seems if anything less likely than par to generate embarrassing things. So if anything 80% seems low, unless I’m forgetting other events.

The China prediction is subjective, and I don’t think I would have ruled it the same way Scott did, so it’s really tough to judge. But in general 50% chance of economic crisis within one year is a very bold prediction, so I’d want to know what made that year so different and whether it proved important.

Now it’s time to talk about the EU, and what happens after you vote for Brexit. It’s definitely been a chaotic series of events. It definitely could have gone differently at various points. Sometimes I wonder what would have happened if Boris Johnson had liked his Remain speech rather than his Leave speech.

I like 60% as a reasonable number for Britain out of EU in 2019. There were a lot of forces pushing Britain to leave given the vote. There were also practical reasons why it was not going to be easy, and overwhelming support for remaining in the EU in parliament if members got to vote their own opinions. Lots of votes throughout the year seemed in doubt several times over, with May and others making questionable tactical decisions that backfired and missing opportunities all the time. The EU itself could have reacted in several different ways. Even now we can see a lot of ways this could have gone.

How about 20% for a second referendum? We can consider two classes of referendum, related but to me they seem importantly distinct.

There’s the class where Her Majesty’s Government decides to do what the EU often does, which is have the voters keep voting until they get the right result. Given the vote was very close, and that leaving turned out to not look like voters were promised, the only thing preventing this from working was some sort of mystical ‘the tribe has spoken’ vibe that took over the country.

Then there’s the class where the EU won’t play ball, or the UK politicians want to vomit when they see the kind of ball the EU was always prepared to play. They’re looking at a full Hard Brexit, and want to put the decision of whether or not to accept that onto the people.

Thus it’s not obvious in hindsight whether the referendum was more likely in the “Britain leaves” world or the “Britain stays” world, given that was already up in the air. Certainly it feels like something unlikely would have had to happen, so we’re well under 50%, but that it wasn’t that far from happening, so it was probably more than 10%. 20% seems fine.

May being 70% to stay in power, however, feels too high. May was clearly facing an impossible problem, while being committed to a horrible path, in a world where prime ministers are expected to resign if they don’t get their way. How often would Britain still be in the EU at the end of the year while May survived? That seems pretty unlikely to me, especially in hindsight, whereas Britain leaving without May seems at least as likely. So May at 70% and leaving at 60% doesn’t seem right.
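
The tension between the two numbers shows up if you fill in the full two-by-two table. With both marginals fixed, choosing one cell determines the other three. The sketch below works in whole percentage points to keep the arithmetic exact; the function name `joint_cells` and the trial values are made up for the example.

```python
P_MAY = 70  # Scott: May still in power, in percentage points
P_OUT = 60  # Scott: Britain out of EU, in percentage points

def joint_cells(may_and_in):
    """Fill in the 2x2 table from the two marginals and one chosen cell
    (the chance May stays while Britain is still in the EU)."""
    may_and_out = P_MAY - may_and_in
    no_may_and_out = P_OUT - may_and_out
    no_may_and_in = 100 - P_MAY - no_may_and_out
    return {
        "May stays, Britain in": may_and_in,
        "May stays, Britain out": may_and_out,
        "May gone, Britain out": no_may_and_out,
        "May gone, Britain in": no_may_and_in,
    }

# Only values from 10 to 40 keep every cell non-negative.
for may_and_in in (10, 20, 30, 40):
    cells = joint_cells(may_and_in)
    print(", ".join(f"{name}: {p}%" for name, p in cells.items()))
```

Whatever value you pick, "May gone, Britain out" comes out exactly 10 points below "May stays, Britain in", and the latter cannot go below 10%. So if May surviving inside the EU is unlikely, Scott's two numbers leave almost no room for Britain leaving without her.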

SURVEY
35. …finds birth order effect is significantly affected by age gap: 40%
36. …finds fluoxetine has significantly less discontinuation issues than average: 60%
37. …finds STEM jobs do not have significantly more perceived gender bias than non-STEM: 60%

(#38 got thrown out as confusing and I don’t know how to evaluate it anyway)

I would have been more confident on the merits in 35 and 37. Birth order effects have to come from somewhere, and the ‘affected’ side gets both directions. And the STEM prediction lets you have both about as much perceived bias and less bias, and I had no particular reason to believe it would come out bigger or smaller.

What’s more interesting, although obviously from a small sample size, is that all three proved true. So Scott’s hunches worked out. Should we suspect Scott was underconfident here?

This could be a case of Unknown Knowns. Scott has good reason to believe in these results, the survey has enough power to find results if they’re there, but Scott’s brain refuses to be that confident in a scientific hypothesis without seeing the data from a well-run randomized controlled trial.

I kid, but also there’s almost certainly a modesty issue happening here. I would predict that Scott would be reliably under-confident in his hunches that he thought enough of to include in his survey.

I started to go over Scott’s personal predictions, but found it mostly not to be a useful exercise. I don’t have the context.

There is of course one obvious thing to note.

PERSONAL – PROJECTS
~~63. I finish at least 10% more of [redacted]: 20%~~
~~64. I completely finish [redacted]: 10%~~
~~65. I finish and post [redacted]: 5%~~
~~66. I write at least ten pages of something I intend to turn into a full-length book this year: 20%~~
~~67. I practice calligraphy at least seven days in the last quarter of 2019: 40%~~
~~68. I finish at least one page of the [redacted] calligraphy project this year: 30%~~
~~69. I finish the entire [redacted] calligraphy project this year: 10%~~
~~70. I finish some other at-least-one-page calligraphy project this year: 80%~~

PERSONAL – PROFESSIONAL
71. I attend the APA Meeting: 80%
~~72. [redacted]: 50%~~
73. [redacted]: 40%
74. I still work in SF with no plans to leave it: 60%
75. I still only do telepsychiatry one day with no plans to increase it: 60%
76. I still work the current number of hours per week: 60%
77. I have not started (= formally see first patient) my own practice: 80%
78. I lease another version of the same car I have now: 90%

None of the personal projects happened. Almost all the professional predictions happened, most of which predict the continued status quo. That all seems highly linked, more like two big predictions than lots of different predictions. One would want to ask what the actual relevant predictions were.

Overall, clearly this person is trying. And there’s clearly a tension between getting 95% of 95% predictions right, and having most of them actually be 95% likely. Occasionally you screw up big and your 95% is actually 50%, and that can often be the bulk of the times such things fail. Or some of them are 85%, but again that can easily be the bulk of the failures. So it’s not entirely fair to complain about a 95% that should be 99% unless standards are super high.
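
A small sketch of how a few badly mispriced predictions can account for the bulk of the failures while the overall hit rate still looks calibrated. The mix below (92 predictions that are really 99%, 8 that are really 50%, all stated at 95%) is a hypothetical chosen for illustration, not data from anyone's actual predictions.

```python
# Hypothetical mix of predictions that were all *stated* at 95%.
# Each entry is (number of predictions, true probability of being right).
groups = {
    "really 99%": (92, 0.99),
    "really 50%": (8, 0.50),
}

total = sum(n for n, _ in groups.values())
expected_misses = {name: n * (1.0 - p) for name, (n, p) in groups.items()}
all_misses = sum(expected_misses.values())

hit_rate = 1.0 - all_misses / total
share_from_blunders = expected_misses["really 50%"] / all_misses

print(f"overall expected hit rate: {hit_rate:.1%}")
print(f"share of misses from the 'really 50%' group: {share_from_blunders:.0%}")
```

In this made-up mix the overall hit rate lands almost exactly on 95%, yet about four in five of the expected misses come from the eight predictions that were never close to 95% in the first place.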

Mostly, I’d like to encourage looking back more in this type of way when possible, in addition to any use of numeric metrics.

I also should look at my own predictions, but also want to make that a distinct post, because its subject matter will have a different appeal on its own merits.

I hope this was helpful, fun, interesting or some combination of all three. I don’t intend it to be perfectly thought out. Rather, I thought it was a useful thing for those interested, so I’d write it quickly, but not let it take too much time/effort away from other higher priority things.

This entry was posted in Death by Metrics, Guide, Rationality.

22 Responses to Evaluating Predictions in Hindsight

Kenny says:

April 16, 2020 at 9:43 pm

I found this helpful, fun, AND interesting – thanks!

hnau says:

April 16, 2020 at 10:01 pm

This was really interesting and useful, thanks!

The section about the election outcomes in particular was a lot of food for thought. Given the median voter theorem we should naively expect election results to be coin flips, right? Deviations from that should indicate factors that can’t or won’t keep up with voter preferences– lock-in of candidates, partisan ideological precommitments, delays and uncertainties in messaging / polling, and so forth. And admittedly those can be big factors. What’s interesting is that in these two cases they apparently either canceled out quite neatly (to 1 part in 10 or better, say) or weren’t nearly as big as one might expect. Which suggests the provocative (to me at least) idea that Brexit was more of an election than a referendum– voters were choosing between broad and somewhat flexible political camps, not just Leave/Remain.

Because of that– and also because in hindsight the mainstream 25%-ish estimates seem subject to biases / mistaken assumptions about popular sentiment– I’d be inclined to peg the correct prediction for both elections as somewhere more in the 40-60% range. What factors am I missing?

- TheZvi says:
  
  April 16, 2020 at 11:44 pm
  
  I think that’s a very outside-view perspective and also one trained on recent data where elections are close and upsets likely. Often that’s not true, and MVT doesn’t really have that much power because neither side is choosing optimally. Certainly in USA we didn’t see something like MVT in play, instead we saw two partisan camps and enthusiasm mattered and both sides tried to use negative arguments to persuade/enthuse/etc.
  
  With leave vs. remain, it certainly doesn’t seem like the two were choosing optimal points according to MVT. If anything it was weirdly *not* a normal election, e.g. Labour was somehow neutral on the biggest question in 50 years. And on a single issue, both sides can choose their details but I don’t think those details mattered much, it was just rhetoric that was better or worse. And the reason the vote happened was largely because the remain side assumed it would win (I think).
  
  - TomGrey says:
    
    April 18, 2020 at 8:14 pm
    
    Tory leader & PM David Cameron promised a referendum. So he gave one. He, like most elite establishment, was a Remainer (continue supporting More Elite Power!). He was fairly sure all the “right people” were Remainers, and so, would win.
    
    Good time to mention that there have been many EU elections where the elites failed — and then ran the election again, and won. Lots of folk actually do care about this.
    
sniffnoy says:

April 17, 2020 at 2:23 am

You mention the logarithmic scoring rule (note: it’s not really log likelihood, there’s no likelihood ratios involved), but worth noting that there are any number of proper scoring rules one could use in place (literally infinitely many since all that really matters is the convexity). Brier score (quadratic) is a commonly used one if you’re worried that log is too harsh.

Calibration to me seems like it may be measuring something different than how good your predictions are. Not to say that calibration is unimportant, it just, like, seems to be something different. IDK. I think people in the LW diaspora, or at least certain prominent people, sometimes focus a bit too much on calibration; this leads to like Scott saying on SSC things like “Yes I know 50% statements are meaningless”. No, they’re obviously not; it’s just measuring calibration of them that’s meaningless. But that’s what he does each year, is graph his calibration. But I’m a little uncertain that calibration really belongs here.

- TheZvi says:
  
  April 17, 2020 at 10:53 am
  
  I think of calibration as a subset of good predicting. If you’re badly calibrated you’re definitely bad, but if you’re well-calibrated you could still be bad. It’s definitely frustrating to deal with the “50% predictions don’t matter” crowd all the time. Obviously there are many ways out of that (e.g. reformulate thresholds so they’re not 50% if you don’t have any alternative evaluation methods).
  
  Other rules didn’t get listed because I haven’t had much use out of them. Quadratic scores I’ve never found useful when evaluating against binary/discrete outcomes, but they’re good when you’re doing estimations of continuous outcomes. I have used them in contexts where it’s important to not miss large (e.g. your actual concern is opening the market close to the market price and you get punished roughly quadratically when you mess up).
  
  - sniffnoy says:
    
    April 17, 2020 at 2:45 pm
    
    > I think of calibration as a subset of good predicting. If you’re badly calibrated you’re definitely bad, but if you’re well-calibrated you could still be bad.
    
    Yeah, this seems like a good way of putting it.
    
Doug S. says:

April 17, 2020 at 2:56 am

Is there a way to bet that in the long run Bitcoin is a bubble and is going to be worthless? Like, if I think Bitcoin is going to be worth $1 or less in 2030 or so but might go up a lot before the bubble bursts?

- TheZvi says:
  
  April 17, 2020 at 10:56 am
  
  No good or practical way.
  
  I mean, you’re *probably* right that BTC in 2030 is worth a lot less than BTC in 2020 in dollar terms. If that wasn’t true BTC would be stupidly cheap (I think it likely stabilizes in those worlds at more like $100 than <$1, because collector/historical value remains, unless the security is broken, but it doesn't matter). The problem is that any 'short' on BTC can break you in the meantime unless it's a bet or binary option. So it has to be that (e.g. a contract that pays $100 if BTC<$100 on January 1, 2030, $0 otherwise). If you can find it, you can do it.
  
  (Note that this contract is SUPER WEIRD in a lot of ways that I won't get into here!)
  
  - TomGrey says:
    
    April 18, 2020 at 8:20 pm
    
    I’d guess Bitcoin continues to slowly increase in value — because of the real economy deflation/ productivity increases combined with global monetary inflation so the price levels of food & clothes remain flat. Lots of the $2 trillion deficit spending will end up inflating asset prices, including speculative crypto assets, like Bitcoin.
    
    As rich commies try to flee China, Bitcoin and other cryptos are likely to become more popular havens. Also more popular hacking targets, so few will be so trusting as to put most of their financial eggs in any one crypto basket.
    
  - Doug S. says:
    
    April 22, 2020 at 5:19 am
    
    Ideally to make this bet I’d want to buy out-of-the-money put options that expire way in the future, (Bitcoin goes down to $1 and I sell them to the counterparty for $500 or whatever) but IIRC most financial institutions won’t sell options with an expiration date more than six months to a year.
    
notpeerreviewed says:

April 17, 2020 at 5:09 pm

There’s definitely a Burger King within a 30 minute walk from his house. Do all Burger Kings have Impossible Burgers? The closest place I know for certain has them is Umami Burger, which is more than 30 minutes from Scott.

- TheZvi says:
  
  April 17, 2020 at 7:49 pm
  
  Prediction explicitly says “grocery store” so that doesn’t count. Plus, who would want to.
  
  - notpeerreviewed says:
    
    April 17, 2020 at 8:36 pm
    
    Whoops, my bad for not reading more carefully.
    
sniffnoy says:

April 26, 2020 at 8:26 pm

Kind of tangential, but Eric Neyman recently put up an interesting post on tailoring scoring rules to incentivize precision: https://ericneyman.wordpress.com/2020/04/24/scoring-rules-part-3-incentivizing-precision/

Pingback: SlateStarCodex 2020 Predictions: Buy, Sell, Hold | Don't Worry About the Vase
PDV says:

May 5, 2020 at 5:05 pm

> also my unwillingness to buy an Impossible Burger

I’m curious where this unwillingness comes from. The Impossible Burger is genuinely very good. Most of the burgers I eat recently are from The Melt, an upscale fast-casual chain local to SF; they offer Impossibles for an extra $3 and I basically always get it because it’s just as good. From a much foodier person than me:

> For years [Michael Symon has] shrugged off what I presume are some pretty financially-incentivized taste tests, refusing to serve veggie burgers at his hamburger chain The B Spot.
> Except he started serving The Impossible Burger last month.
> … My wife and I made a date.
> … We took a bite. Chewed, puzzled. Then we took a bite of the meat-burger. “…the real meat is better,” we said, unsurprised, but […] we could not figure out what the difference was between the two burgers.
> Then it came to us what the difference was: We’d stopped handicapping the veggie burger.

Source: https://www.theferrett.com/2017/10/09/i-tried-the-vegetable-blood-burger-substitute-heres-what-i-thought/

- TheZvi says:
  
  May 5, 2020 at 5:23 pm
  
  Because I can buy and eat an actual hamburger that’s better for me and tastes better and is cheaper, even in the IB’s best case? Doesn’t have to be complicated.
  
Pingback: 2020 Election: Prediction Markets versus Polling/Modeling Assessment and Postmortem | Don't Worry About the Vase
Pingback: Judging Our April 2020 Covid-19 Predictions | Don't Worry About the Vase
Pingback: Omicron Post #11 | Don't Worry About the Vase
Pingback: Evaluating 2021 ACX Predictions | Don't Worry About the Vase
