Twitter Polls: Evidence is Evidence

Follow-up to: Law of No Evidence

Recently, there was some debate about a few Twitter polls, which led into a dispute over the usefulness of Twitter polls in general and how to deal with biased and potentially misleading evidence.

Agnes Callard is explicitly asking the same question I asked, which is the opposite of ignoring sample bias: What is accounting for the difference?

Sample selection is definitely one of the explanations here. One can also point to several other key differences.

  1. My poll asks about you, Patrick asks about how others seem.
  2. My poll asks about struggle, Patrick asks about stability.
  3. My poll asks about a year versus a point in time, a potential flaw.
  4. My poll asks about now, Patrick asks about since pandemic onset.

None of this is well-controlled or ‘scientific’ in the Science™ sense. No one is saying any of this is conclusive or precise.

What is ‘bad’ evidence if it isn’t weak evidence? Adam’s theory here is that it is misleading evidence. That makes sense as a potential distinction. Under this model:

  1. Weak evidence induces a small Bayesian update in the correct direction.
  2. Bad evidence can induce an update in the wrong direction.

Usually, people with such taxonomies will also think that strong evidence by default trumps weak evidence, allowing you to entirely ignore it. That is not how that works. Either something has a likelihood ratio, or it doesn’t.
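
To make the likelihood ratio point concrete, here is a minimal sketch of a Bayesian update in odds form. The specific numbers (a 50% prior, a ratio of 10 for the strong evidence, 1.2 for the weak) are assumptions chosen purely for illustration, and `update` is a hypothetical helper name.

```python
def update(prior: float, likelihood_ratio: float) -> float:
    """Return the posterior probability after one piece of evidence.

    likelihood_ratio = P(evidence | hypothesis) / P(evidence | not hypothesis)
    """
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


prior = 0.5        # illustrative assumption
strong_lr = 10.0   # hypothetical strong evidence
weak_lr = 1.2      # hypothetical weak evidence pointing the same way

after_strong = update(prior, strong_lr)
after_both = update(after_strong, weak_lr)

# The order of the two updates does not matter; the ratios simply multiply.
other_order = update(update(prior, weak_lr), strong_lr)

print(f"strong only:     {after_strong:.3f}")
print(f"strong + weak:   {after_both:.3f}")
print(f"weak then strong:{other_order:.3f}")
```

Under these assumptions the weak evidence still moves the estimate after the strong evidence has been counted. It moves it less, but it does not get trumped.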

The question is, what to do about the danger that someone might misinterpret the data and update ‘wrong’?

I love that the account is called ‘Deconstruction Guide.’ Thanks, kind sir.

Whether or not this ‘depends on the poll’ depends on what level of technically correct we are on, and one can go back and forth on that several times. The fully correct answer is: Yes, some info. You always know that the person chose to make the poll, and how many people chose to respond given the level of exposure, and the responses always tell you something, even if the choices were ‘Grune’ and ‘Mlue,’ ‘Yes’ and ‘Absolutely,’ or ‘Maybe’ and ‘Maybe Not.’

Remember that if any other result would have told you something, then this result also tells you something, because it means the result that would have told you something did not happen. That doesn’t mean it helps you with any particular question.
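
A small numerical check of that principle, as a sketch. The hypothesis, the prior of 0.3, and the two conditional probabilities are all made-up numbers for illustration; `posterior` is a hypothetical helper.

```python
def posterior(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Bayes' rule for a binary hypothesis and one observed outcome."""
    joint_h = prior * p_e_given_h
    joint_not_h = (1.0 - prior) * p_e_given_not_h
    return joint_h / (joint_h + joint_not_h)


prior = 0.3               # illustrative assumption
p_yes_given_h = 0.7       # chance the poll comes out 'Yes' if the hypothesis holds
p_yes_given_not_h = 0.4   # chance it comes out 'Yes' if it does not

# Overall chance of seeing 'Yes' before looking.
p_yes = prior * p_yes_given_h + (1.0 - prior) * p_yes_given_not_h

post_if_yes = posterior(prior, p_yes_given_h, p_yes_given_not_h)
post_if_no = posterior(prior, 1.0 - p_yes_given_h, 1.0 - p_yes_given_not_h)

# Weighting each possible posterior by how likely that outcome is
# recovers the prior: you cannot expect to be moved in a known direction.
expected_posterior = p_yes * post_if_yes + (1.0 - p_yes) * post_if_no

print(f"posterior if 'Yes': {post_if_yes:.3f}")
print(f"posterior if 'No':  {post_if_no:.3f}")
print(f"expected posterior: {expected_posterior:.3f} (prior {prior})")
```

If 'Yes' would have raised your probability, then 'No' has to lower it, which is why the boring result still counts.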

Anyway, back to main thread.

Getting into a Socratic dialog with a Socratic philosopher, and letting them play the role of Socrates. Classic blunder.

I certainly want to know the extent to which the world is full of lunatics.

Adam Gurri’s new claim has now narrowed to something more reasonable, that citing a Twitter poll as representative even of some subgroup marks you as foolish.

We can agree that taking a Twitter poll, not adjusting for sample bias, and drawing conclusions is foolish. Saying it equates to a subgroup that is similar to the group polled still requires dealing with response bias and all that, but mostly seems fine. Adjusting for the nature of your sample should render the whole thing fine in any case.

You can also find good information in a Twitter poll by comparing its results to another Twitter poll using the same account (and same retweets, ideally). The difference between the two is meaningful. This can be a difference between questions or wordings, or a difference over time, or something else.

Rules of Evidence

Aristotle is indeed wise. He points to the important distinction between evidence, as in Bayesian evidence or a reason one might change one’s mind or one’s probabilities, and the rules of evidence in a given format of debate or discourse. In a court of law, some forms of Bayesian evidence are considered irrelevant or, even more extremely, prejudicial, exactly because they should cause one to update their probabilities and the law wants the jury not to do that.

Which is sometimes the right thing to do. Still, you have to admit it is kind of weird.

I think a lot of the reason it is so often right is that we use very strange standards of evidence and burdens of proof in other places, forcing corrections. Also, of course, juries are random people, so they have a lot of biases and we worry about overadjustments. Then there are the cases where we think the jury would reach exactly the right conclusion, but we think that’s bad, actually.

Anyway.

In the formal rules for public discourse, how should we consider Twitter polls?

A Twitter poll without proper context should be fully inadmissible here.

What about with the proper context? That gets trickier.

I consider what I do on my blog a form of public discourse, and I notice that in whatever thing that it is I am doing in most posts, a Twitter poll with context is obviously admissible. That is because ‘the thing I am doing’ is attempting to reason in public and establish a model of the world, how it works and what it is going to do. I am not trying to persuade anyone as such.

That’s a different department.

We should strive to minimize our visits to that department, whenever possible.

Exactly. Keep your evidential requirements as low as possible. But no lower.

I do occasionally, and likely will more often in the future, visit the other department. In those situations, I am more careful about using such evidence. I know it is by its nature unpersuasive to most, and a point of vulnerability, and requires a certain level of epistemic trust. Thus, in these situations, I try even more than usual to rely on it and other similar facts at most for loose bounds and non-binding intuitions – by default, it’s not admissible.

Crux One

And now, at least I hope, a crux.

Yes, exactly. Everything is evidence. You should update on almost anything. That is indeed how probability and knowledge work.

To state the obvious, if evidence does not cause one to be more likely to be led to the correct conclusion, you are doing evidence wrong, bro do you even Bayes?

My first response would be to attempt to fix it. If I couldn’t, then yes, I would consider not seeking out, or even actively avoiding, such information.

The tricky case is when you are being shown evidence that is selected to attempt to change your mind. Which is the basis of most ‘public discourse,’ especially that which is going to engage with someone (in any direction) with a publication called Liberal Currents. In such situations, you need to ask what actual evidence you are getting when you are given evidence. Often this is mainly comparing the quality and strength of the evidence you got to the quality and strength you would expect. If the evidence is weaker than you expected, you should update in the opposite direction on the information that this was the best this source could do.
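
Here is one way to sketch that last step, under a deliberately crude model. Assume an advocate always shows their best available evidence, and that strong evidence is more likely to exist when the claim is true. The probabilities 0.8 and 0.3 are invented for illustration.

```python
# Hypothetical model: chance that the advocate's BEST available
# evidence is strong, depending on whether the claim is true.
P_STRONG_IF_TRUE = 0.8
P_STRONG_IF_FALSE = 0.3


def posterior_given_shown(prior: float, shown_strong: bool) -> float:
    """Update on the strength of the best evidence the advocate produced."""
    like_true = P_STRONG_IF_TRUE if shown_strong else 1.0 - P_STRONG_IF_TRUE
    like_false = P_STRONG_IF_FALSE if shown_strong else 1.0 - P_STRONG_IF_FALSE
    numerator = prior * like_true
    return numerator / (numerator + (1.0 - prior) * like_false)


prior = 0.5  # illustrative assumption
after_weak = posterior_given_shown(prior, shown_strong=False)
after_strong = posterior_given_shown(prior, shown_strong=True)

print(f"best case shown was weak:   {after_weak:.3f}")
print(f"best case shown was strong: {after_strong:.3f}")
```

In this toy model, a weak best case from someone trying hard to persuade you lowers your probability, even though the evidence itself nominally points their way.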

I do not understand the claim that ‘we have statistics’ on the Twitter poll question. Is Adam suggesting someone ran a Proper Scientific Study on people’s updates from looking at Twitter polls? Which seems very hard to do usefully, and I assume is not it. Instead, I am assuming he means ‘we have statistical tools for evaluating samples and they say that your samples are worthless.’

I think this claim is simply doing statistics wrong. The samples are quite big enough. All you have to do is understand the nature of the samples. Or, use the poll to get insight into the sample. Which, then, you can, among other things, poll again later.
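
For a rough sense of what 'big enough' means, this sketch computes the textbook 95% margin of error for a proportion. It assumes a simple random sample from whatever population the poll actually reaches, so it measures sampling noise only and says nothing about who that population is.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n from the polled population;
    p = 0.5 is the worst case. Selection bias is NOT captured here.
    """
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (100, 300, 1000, 3000):
    print(f"n={n:>5}: +/- {margin_of_error(n):.1%}")
```

At a few hundred responses the noise is already down to roughly five or six percentage points. More votes shrink that number further but do nothing about the composition of the sample, which is the part you have to reason about.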

Whenever I read a scientific paper, there is about a 50/50 chance I conclude that they have buried the lead, often entirely missing the lead, even if I also agree with their main claim. They do not realize what they have learned. They do the equivalent of concluding that the key thing in life is herring sandwiches, instead of realizing it is boredom.

Instead of looking for something specific, look for anything at all. Much better odds.

Crux Two

Thus:

Tiago nails it. Knowing that different samples and differently worded questions and answers explain the answer is better than not knowing that. One should not mistake it either for Deep Wisdom, or for the main thing available to be learned. It is a way to avoid learning what there is to learn, by figuring out which differences did it. There is a surprising result. It has a cause, and the details there are often going to be interesting. Using ‘there is a cause one could find’ as a semantic stop sign will not help you.

Indeed, I realized I could Do Science to the situation. Was it primarily the different samples, or was it primarily the different wording? There’s a way to find out!

I grabbed the results here because someone new retweeted the poll, potentially corrupting the comparison after that, and any sample >300 is fine here. Here is the larger sample, which converged some towards Patrick’s results.

That is exactly Patrick’s wording. Does it match Patrick’s poll?

Mostly it does. The difference is that my sample includes more ‘about the same’ and fewer at the extremes, which likely reflects cultural differences in what counts as about the same. I’m also guessing my audience has a lower-than-usual Lizardman Constant, and that together they explain the whole difference.

Thus, we have learned that, at least in this context, no, the samples are very similar. Mostly the difference is the wordings. If Patrick were to do my exact poll For Science, I expect him to get roughly my result with a bit more noise.
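
If you want to check 'very similar' more formally, one option is a chi-square test of homogeneity on the two sets of vote counts. This is a sketch only: the counts are HYPOTHETICAL placeholders, not the real poll results, and the closed-form p-value used here is only valid for exactly three answer options (two degrees of freedom).

```python
import math


def chi_square_homogeneity(counts_a, counts_b):
    """Pearson chi-square statistic and degrees of freedom for two polls
    that share the same answer options."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    grand_total = total_a + total_b
    stat = 0.0
    for a, b in zip(counts_a, counts_b):
        column = a + b
        expected_a = total_a * column / grand_total
        expected_b = total_b * column / grand_total
        stat += (a - expected_a) ** 2 / expected_a
        stat += (b - expected_b) ** 2 / expected_b
    return stat, len(counts_a) - 1


def p_value_df2(stat: float) -> float:
    """Chi-square tail probability; exact only for 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


# HYPOTHETICAL counts: [more stable, about the same, less stable]
my_poll = [60, 150, 90]
other_poll = [160, 290, 250]

stat, dof = chi_square_homogeneity(my_poll, other_poll)
assert dof == 2, "p_value_df2 assumes exactly three answer options"
p_value = p_value_df2(stat)
print(f"chi-square = {stat:.2f}, dof = {dof}, p = {p_value:.3f}")
```

A small p-value would say the two samples answered differently beyond noise. It would not say the difference is large enough to matter, so the raw percentage gap is still worth looking at.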

The next step, if one wanted to continue learning, would be to change individual components and see if anything more changed – e.g. do Patrick’s wording with respect to yourself only.

Does this represent people having a more optimistic view of themselves than they do of others? Or is this people correctly doing aggregation, since 10% of people becoming less stable makes people overall less stable and larger groups have less variance? My presumption is this is a mix.

This still does leave stability down versus the original finding of struggling also down. That too is logically compatible but on its own implausible, so there is more here to explain. One could continue. For now, I will stop there.

Conclusions

The original thread finished up with Agnes using the Robin Hanson signal to attempt to put a bet together, which did not work as there was nothing close to a meeting of the minds on what was in dispute. Adam’s final position seemed to be that as long as Twitter polls did not match national polls as accurately as other national polls matched each other, then they were useless. It was unclear whether you would be allowed to correct for bias before checking. That seems important given that most national polls are doing various bias-correcting things under the hood.

Adam’s whole position here, to me, is rather silly, even if we limit ourselves to use cases where the Twitter poll is being used only to try and extrapolate towards national sentiment. Of course when we are trying to measure the output of process X we will get a less accurate measure by using process Y than by repeating process X. That is true even if X is not doing as good a job as Y of measuring underlying value V. We still might gain insight into V. We especially might gain insight into V if X costs hundreds or thousands of dollars per use while Y falls under the slogan ‘this website is free.’

The principle mirrors the question about to what extent Proper Scientific Studies are the only form of evidence, making it legitimate to say No Evidence of X whenever there is no Proper Scientific Study claiming X, no matter what your lying eyes think or how many times your lying ears hear “Look! It’s an X!”

Takeaways

  1. All evidence is evidence. All evidence is net useful if well-handled.
  2. Those who deny this are likely epistemically hostile and/or operate in a highly hostile epistemic environment. Treat accordingly.
  3. Do your best to stay out of such places and discussions, when you can.
  4. Biased or misleading evidence is evidence, often of many things.
  5. One must preserve Conservation of Expected Evidence.
  6. Mostly compare information from hostile or biased sources to expectations.
  7. See what is there to be learned, being curious and exploring.
  8. Look for comparisons that let you control for bias. Often quite straightforward.
  9. Never get into a Socratic dialog where a Socratic philosopher gets to ask the questions when death is on the line. Or you want to ‘win.’ Otherwise, sure.
  10. Twitter polls are neat and chances are you are not doing enough of them.

14 Responses to Twitter Polls: Evidence is Evidence

  1. bugsbycarlin says:

    “Yes, exactly. Everything is evidence. You should update on almost anything. That is indeed how probability and knowledge work.

    To state the obvious, if evidence does not cause one to be more likely to be led to the correct conclusion, you are doing evidence wrong, bro do you even Bayes?”

    Discriminate. Discriminate often. Your conscious brain does not update on the grey cloud, and the grey cloud, and the grey cloud, and the grey cloud, and the grey cloud, and the grey cloud, and the grey cloud, and the green leaf, and the green leaf, and the other green leaf, and that other green leaf. You do not play chess like an AI from 1954, processing the entire search space and comparing scores. You discriminate, you filter huge chunks of information out as low quality information, and you only consider a very small number of potentially high yield pieces of evidence with multiple pass systems like consciousness or rational thinking.

    Speaking from experience writing one, even actual Bayesian filters (from which we get this silly idea to call ourselves Bayesian) are most effective when they pre-process information and when they winsorize. It is a *fact* that some systems *gain* decision power when they throw out a certain amount of weak evidence.

    That’s the whole point of a p-value, by the way. It’s not a magic number under which God rules that a thing is so. It’s an arbitrary choice to ignore weak evidence, and it has been established both mathematically and evidentially that applying this choice consistently yields more accurate decisions in the aggregate.

    • bugsbycarlin says:

      Clarification: I should have said “consistently applying this choice”, that is, one must use the same cutoff value at all times to get the desired effect, not that the desired effect is itself consistent.

    • bugsbycarlin says:

      Additional clarification: I think asking people stuff by twitter polls is fine. It’s no weaker or stronger than asking your friends a question and then reporting the answer, something we all do.

  2. scmccarthy says:

    If you think people are out to get you, I would expect it to make sense to have a policy of not updating on evidence that doesn’t meet certain standards.

    It also makes sense to have a policy of treating arbitrary sources of social media as being out to get you until they have been vetted to certain standards. It’s a common dynamic there.

    So I have sympathy for the position that you should not take twitter polls seriously. I translate that position as: “Most entities doing informal polls on social media are trying to manipulate people rather than seek the truth. Thus, we should all agree to not pay attention to twitter polls in order to counter that behavior.”

    In this case, I agree there’s information to be gained. And it seemed obvious to me from the start that the distinction between self evaluation and evaluating other people should have a big impact. I am not very surprised by the results.

    • greg kai says:

      I agree, but I also think that “Most entities are trying to manipulate people rather than seek the truth,” regardless of whether they cite social media polls, classic polls, or even non-poll-based evidence. And yes, that includes science, in more and more cases. Basically, as soon as there is a proposed solution or policy, it’s influencing, regardless of what kind of evidence is mentioned. There is information to be gained, but it’s useful to remember that any fact presented is ammunition to advance the goal first, maybe information second. Those “facts” are to be treated accordingly (they may be lies, and they sure are handpicked).

  3. Since you presented vote counts and percentages for your poll and Patrick’s, I couldn’t resist doing the statistical significance test (multinomial difference by chi-square test, with a few mutterings about the Bayesian version with a Dirichlet conjugate distribution) to assess the reproducibility.

    Summary:

    The difference turns out to be statistically significant, even after subsampling to get more balanced counts: Patrick found fewer “about the same” than you did.

    However, the effect size is pretty small, i.e., about 5% of votes, so it is unlikely to be a meaningful difference.

  4. A1987dM says:

    “Bad evidence can induce an update in the wrong direction.” — or in the right direction but by the wrong amount (like, orders of magnitude more than you should)

    “All evidence is net useful if well-handled.” — not necessarily, if you count the cost of interpreting it. Is the fact that it’s cloudy in Turin today evidence for or against Ukraine retaking the Donbas before the end of the year? Well, I’d guess the log-likelihood ratio isn’t *exactly* zero, but it’s definitely way too close to zero for it to be worth the effort of figuring out its sign.

  5. magic9mushroom says:

    >Usually, people with such taxonomies will also think that strong evidence by default trumps weak evidence, allowing you to entirely ignore it. That is not how that works. Either something has a likelihood ratio, or it doesn’t.

    It does in the limiting case of “A is impossible if X, so if A happens P(X) = 0 regardless of any other evidence”, and that limiting case, while technically never reached, is approached closely enough often enough for the heuristic to do work.

    Scientific studies rarely warrant that heuristic’s application, though.

  6. Willa says:

    What does “epistemically hostile” mean here? There’s someone actively trying to cause you to believe false statements? I don’t remember covering that in my undergrad epistemology class. 🤔

    • TheZvi says:

      Basically, yes, although ‘statements without regard to their truth value’ rather than false statements.

      • You guys might have a lot of fun with this paper:

        A Kovalczyk & O Chapelle, “An Analysis of the Anti-learning Phenomenon for the Class Symmetric Polyhedron”, Intl Conf on Alg Learning 2005, 78-91.

        Basically, there are datasets where ML algorithms work well on the training set, but worse than random on test sets, under crossvalidation. (It’s a property of datasets with a certain structure, not related to over-training.)

        So an “epistemically hostile” actor could supply you with data that has this property, and trick you into learning something orthogonal to reality.

        I once thought it would be fun to design a game where you’re trying to predict a Cauchy-distributed variable, and seduce players into thinking they could take the average of observations. Cauchy-distributed variables have the property that the Central Limit Theorem doesn’t apply (because the mean and variance are undefined). In fact, the average of N Cauchy observations is itself Cauchy-distributed with the same spread as a single observation, so its error does not shrink like 1/sqrt(N) as the CLT says (usually) happens.

        That’s a (weak!) example of anti-learning; Kovalczyk has gnarlier versions that don’t appeal to tricking people into using the wrong summary statistic.
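
A quick simulation of the averaging trap, as a sketch. It assumes standard Cauchy draws generated by the inverse-CDF trick, compares them with standard normal draws, and measures the spread of the sample mean by its interquartile range. The seed and trial count are arbitrary choices.

```python
import math
import random


def cauchy_draw(rng: random.Random) -> float:
    """One standard Cauchy draw via the inverse CDF."""
    return math.tan(math.pi * (rng.random() - 0.5))


def gauss_draw(rng: random.Random) -> float:
    """One standard normal draw, for comparison."""
    return rng.gauss(0.0, 1.0)


def iqr_of_means(draw, n: int, trials: int, rng: random.Random) -> float:
    """Interquartile range of the sample mean across many repeated samples."""
    means = sorted(sum(draw(rng) for _ in range(n)) / n for _ in range(trials))
    return means[(3 * trials) // 4] - means[trials // 4]


rng = random.Random(0)  # fixed seed so reruns match
TRIALS = 2000
results = {}
for n in (1, 10, 100):
    results[n] = (
        iqr_of_means(cauchy_draw, n, TRIALS, rng),
        iqr_of_means(gauss_draw, n, TRIALS, rng),
    )
    print(f"n={n:>3}: Cauchy IQR {results[n][0]:.2f}, normal IQR {results[n][1]:.2f}")
```

The normal column should shrink by roughly a factor of sqrt(N), while the Cauchy column should hover around the same value no matter how many observations go into each average.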

        • TheZvi says:

          Don’t have time right now to read the text, but you’re saying that there exists a situation where if I learn a random subset of the data then I will do worse than random on the remaining data under standard ML techniques?

          Bonus question, then, is whether HUMANS will also do worse than random in this spot…
