I've long argued that the biggest problem with orthodox NHST for A/B testing is that you actually don't care about 'significance of effect' as much as you do 'magnitude of effect'. Furthermore, p-values tell you nothing about the range of possible improvements (or lack thereof) you're facing. Maybe you are willing to risk potential losses for potentially huge gains, or maybe you can't afford to lose a single customer and would rather exchange time for certainty.
I've outlined my favored approach here[0], where the problem is treated as one of Bayesian parameter estimation (a rough sketch follows below). Benefits include:
1. The output is a range of possible improvements, so you can reason about risk/reward when calling a test early.
2. It allows the use of prior information to prevent very early stopping and to provide better estimates early on.
3. Every piece of the testing setup is, imho, easy to understand (ignore this benefit if you can comfortably derive Student's t-distribution from first principles).

[0] https://www.countbayesie.com/blog/2015/4/25/bayesian-ab-test...
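For concreteness, here is a minimal sketch of that kind of analysis, assuming a simple Beta-Binomial conjugate model; the conversion counts and the Beta(3, 100) prior are invented for illustration and are not taken from the linked post:

    # Bayesian A/B test as parameter estimation (Beta-Binomial model).
    # All counts and the prior are made up for illustration.
    import numpy as np

    rng = np.random.default_rng(42)

    # Observed data: conversions / visitors for each variant (hypothetical)
    conv_a, n_a = 120, 4000   # control
    conv_b, n_b = 145, 4000   # treatment

    # Weakly informative prior on the conversion rate; it also guards
    # against calling the test on the first handful of visitors.
    alpha0, beta0 = 3, 100

    # Posterior for each variant is again a Beta distribution.
    post_a = rng.beta(alpha0 + conv_a, beta0 + n_a - conv_a, size=100_000)
    post_b = rng.beta(alpha0 + conv_b, beta0 + n_b - conv_b, size=100_000)

    # The output is a whole distribution of relative improvement,
    # not a single p-value.
    lift = (post_b - post_a) / post_a
    print("P(B beats A):", (lift > 0).mean())
    print("95% credible interval for relative lift:",
          np.percentile(lift, [2.5, 97.5]))

The credible interval is what lets you reason about risk/reward: you can see not just whether B is likely better, but how much you stand to gain or lose if you call the test now.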
> The type of analysis being banned is often called a frequentist analysis
I find there is a trend of associating "bad statistics" with "frequentist statistics", which isn't really fair. If you found a statistician trained only in frequentist methods and asked their opinion on experiment design in psychological research, they would likely be just as appalled as any Bayesian.
I'm a big fan of Bayesian methods, but the solution of "we'll fix the problem of misunderstanding p-values by removing them!" is itself a symptom of misunderstanding p-values. The misunderstanding is the issue, not the p-value.
I think what this journal is doing is probably a good thing, but only as the lesser of several evils. The truth is something more like: it is easier for the soft sciences to abuse frequentist statistics than Bayesian. Both approaches undeniably have merit, but it is simply easier to produce meaningless conclusions with frequentist statistics done wrong.
The situation is so bad that it merits banning frequentist methods in this journal, and I think that's reasonable. This doesn't mean that every journal in every field should follow suit, but perhaps it will be a useful temporary measure to improve quality.
The problem is that p-values are begging to be misunderstood, and in fact you cannot use them as a decision-making procedure without "misinterpreting" them: after all, you're deciding whether to accept the alternative hypothesis, a judgment about P(H_A|D), on the basis of 1 - P(D|H_0), on the grounds that, while they're not the same, they're roughly proportional. (In that sense the p-value is the poor man's likelihood ratio.) There's nothing wrong with p-values as a concept, but there's everything wrong with p-values in hypothesis testing. The misunderstanding is baked in.
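To make concrete why treating 1 - P(D|H_0) as a stand-in for P(H_A|D) goes wrong, here is a small simulation with entirely made-up numbers (base rate of real effects, effect size, sample size), asking how often a "significant" result comes from a true null:

    # Simulate many experiments where only a minority of tested
    # hypotheses are real, then check P(H0 | p < 0.05).
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_experiments, n_per_group = 5000, 50
    prior_true_effect = 0.1   # only 10% of tested hypotheses are real
    effect_size = 0.3

    false_sig = true_sig = 0
    for _ in range(n_experiments):
        real = rng.random() < prior_true_effect
        a = rng.normal(0, 1, n_per_group)
        b = rng.normal(effect_size if real else 0.0, 1, n_per_group)
        if ttest_ind(a, b).pvalue < 0.05:
            if real:
                true_sig += 1
            else:
                false_sig += 1

    # Fraction of "significant" results that came from a true null.
    print("P(H0 | p < 0.05) ~", false_sig / (false_sig + true_sig))

With these made-up numbers, over half of the p < 0.05 results typically come from true nulls, even though every individual test was computed correctly; that is exactly the gap between P(D|H_0) and P(H_0|D).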
You can update your posterior based on the p-values yourself though. "Well those eggheads may have disproved X, but X is just common sense, so I'm gonna keep believing it anyway. U-until I see more studies confirming the finding I mean."
I think the problem is not "if you find a frequentist (as opposed to Bayesian) statistician", but "if you find a frequentist (as opposed to Bayesian) biologist", say.
Non-statisticians have been trained using bad frequentist methods, and one way of forcing them to retrain is to make them learn new statistical tools in order to get published.
For anyone interested in Bayesian statistics, it is worth noting that E. T. Jaynes (imho the arch-Bayesian) disagreed with Kahneman's idea that "people reason in a basically irrational way". Jaynes died long before "Thinking, Fast and Slow" was published, so his critique is based on Kahneman and Tversky's early work on the subject.
Kahneman and Tversky's critique of Bayesian analysis is basically: if more data should override a prior belief, then why is it that, as more data comes in, people's opinions increasingly diverge? For example, we have 24-hour news media throwing information at us, and people only seem to be more divided politically. If we reasoned in a Bayesian manner, our opinions would converge, which they clearly do not.
Jaynes' answer is really fascinating and is covered in the chapter "Queer Uses for Probability Theory" in 'Probability Theory: The Logic of Science'. Basically, Jaynes argues that we are never really testing just one hypothesis. He gives the example of an experiment designed to prove ESP, and points out that no matter how low a p-value the experiment reports, if you have a strong prior belief that ESP does not exist, the evidence won't convince you. He argues this is because you actually hold other hypotheses with their own priors: the subject is tricking the experimenters, there is an error in the experiment's design, the people running the experiment are intentionally being deceptive, and so on.
He then shows that if your prior belief in ESP is sufficiently lower than your prior belief in these alternative hypotheses, not only will further evidence fail to convince you of ESP, it will actually increase your belief that you are being lied to in some way. So while Jaynes agrees that these priors may be irrational, our reasoning given new information is completely rational.
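A toy numerical version of that argument, with priors and likelihoods invented purely for illustration (they are not Jaynes' numbers), looks like this:

    # Three competing explanations for a striking result D
    # ("subject scores far above chance"). Numbers are made up.
    priors = {
        "ESP is real": 1e-12,
        "deception or flawed experiment": 1e-3,
        "pure chance": 1 - 1e-3 - 1e-12,
    }
    # How probable the reported data D is under each explanation.
    likelihoods = {
        "ESP is real": 0.5,
        "deception or flawed experiment": 0.5,
        "pure chance": 1e-8,   # roughly the reported p-value
    }

    evidence = sum(priors[h] * likelihoods[h] for h in priors)
    posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}

    for h, post in posteriors.items():
        print(f"{h}: prior {priors[h]:.1e} -> posterior {post:.1e}")

The same data that was meant to prove ESP ends up piling essentially all the posterior mass onto the "I'm being fooled" hypothesis, while the posterior on ESP stays negligible.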
This is interesting, but it seems like a pretty minor thread in Kahneman and Tversky's body of research. The review highlights a number of experiments that show how actual human reasoning differs from maximizing utility. The critique of Bayesian analysis isn't mentioned in the article, and I don't recall it being in the book.
If you have any interest in how people make decisions then "Thinking, Fast and Slow" is worth reading.
> The review highlights a number of experiments that show how actual human reasoning differs from maximizing utility.
The conventional definition of utility is pretty strict: for example, it must be a function of the final outcome only, and it assumes away model uncertainty, which goes under the title of ambiguity aversion instead. So maybe it's not that surprising that such a specific, narrow concept doesn't describe all of human behaviour and needs to be extended. But since you can extend it enough to describe some interesting behaviours, is it really necessary to focus specifically on the utility function, rather than the other things that people might be maximizing?
Sorry, I don't really understand your comment. Assuming that people maximize utility is a useful model for certain tasks. Kahneman's work shows that people's decisions differ systematically from any kind of rational maximization, and are explained better when you allow for biases such as anchoring, loss aversion, and substituting a hard question for a related easy question.
My point is that utility is a rather narrowly defined concept, so if you find a situation where people don't seem to be maximizing any utility function, one possibility is that the concept of a utility function is too narrowly defined. Things like anchoring, loss aversion, and ambiguity aversion can all be modelled; the only thing you lose is the name "utility function". Maybe the utility function needs to depend on the entire history of states (for loss aversion), or maybe the question being asked is subject to uncertainty, or there is a fundamental amount of model uncertainty. All of those can be modelled in probabilistic terms.
So if rational maximization means that people have a utility function that they maximize, then yes, rational maximization is not what people do. But that is partly the fault of how the definition of utility was chosen.
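As one illustration of the "loss aversion can be modelled" point above, here is a minimal sketch of a reference-dependent value function in the spirit of Kahneman and Tversky's prospect theory; the parameter values (0.88 and 2.25) are commonly cited estimates, used here purely for illustration:

    # Reference-dependent value function: losses loom larger than gains.
    # Exponents and loss-aversion coefficient are the commonly cited
    # estimates, used only for illustration.
    def value(outcome, reference=0.0, alpha=0.88, beta=0.88, lam=2.25):
        x = outcome - reference
        if x >= 0:
            return x ** alpha
        return -lam * ((-x) ** beta)

    # The same $50 swing is felt roughly twice as strongly as a loss
    # than as a gain.
    print(value(50), value(-50))   # ~31.3 and ~-70.4

Whether something like this still deserves the name "utility function" is exactly the naming question at issue.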
Von Neumann and Morgenstern showed that, as long as people can order choices consistently (ordinal utility, satisfying their consistency axioms), there is a utility function (cardinal utility) whose expectation they are maximizing.
What Kahneman and Tversky observed is that people don't even choose consistently; it depends on how the choices are presented, for instance on whether the subject frames an outcome as a loss or as a smaller-than-expected gain. No matter how you define a utility function, it will not always be maximized. So it's not a question of defining the function less narrowly: you can present two games with mathematically identical sets of outcomes, and people will consistently rank them differently.
Anyway, it's a very good and important book, and doesn't have much to do with Bayesian statistics.
[Ninja-edited since HN doesn't let me respond further below...if you can show the outcomes K & T observed are in fact consistent with a more broadly defined utility function, then you too can win a Nobel prize!]
> What Kahneman and Tversky observed is that people don't even choose consistently. It depends on how the choices are presented. For instance, whether the subject frames a choice as a loss or a smaller-than-expected gain. So, no matter how you define a utility function, it will not always be maximized.
I disagree; I think you are assuming that people accept questions at face value and unfailingly trust the experimenter. Under that assumption, equivalent but differently stated problems would indeed be equivalent, and you would reach that conclusion.
But when people use heuristics, those heuristics are grounded in their experience, and are like a prior on the meaning of the question. Stating the same question in two different ways and getting different answers means either that there is no utility function, or that the "utility function" depends (through model uncertainty, for example) on the exact phrasing of the question.
My point is that these discussions are very closely tied to the kinds of assumptions you make about how people reason, what is rational, and what inputs the utility function has. Kahneman and Tversky got around this problem, I think, by doing something eminently reasonable: postulating a clear and unambiguous definition of a utility function. But the concept of "rationality" is richer than that, so the conversation should not stop there.
> postulating a clear and unambiguous definition of a utility function. But the concept of "rationality" is richer than that
The word "rationality" may be ambiguous, as most words describing anything complex are, but the authors attempted to provide a clear model and work within those bounds. When we begin discussing the ideas informally, and using terms in a broader and more colloquial sense, then we're at fault if the results have become muddied.
The authors demonstrated a reasonable utility function, one which most people upon reflection would agree is logical, and demonstrated that people do not consistently act in a way that maximizes that function.
We can always move the goal post, and claim that if people appear to be acting irrationally it's because we simply don't understand their concept of rationality (or the more complex function they're maximizing). But that seems rather circular; it would be nice hear examples of a richer concept of rationality, in the context of the author's experiments, that might explain seemingly inconsistent behavior.
> Kahneman and Tversky's critique of Bayesian analysis is basically: if more data should override a prior belief, then why is it that, as more data comes in, people's opinions increasingly diverge? For example, we have 24-hour news media throwing information at us, and people only seem to be more divided politically.
Does that example really fit their critique? We sometimes pretend we can compress a person's beliefs down to a single bit -- red vs blue -- and when you consider how many bits you'd expect your own political views to require, it's unbelievable that anyone could offer a single bit as a meaningful summary of them. The fact that anyone could even dream that a single bit might be a meaningful lossy compression of my views suggests there is not much meaningful diversity in the population.
Yes! Sorry for not being more clear in my comment.
And just to be clear, Jaynes is not explicitly arguing for or against the idea of ESP; though he himself does not believe in it, he points out that there are many prior beliefs widely held in science that change dramatically over time. What he is saying is that if your prior belief in ESP is dramatically lower than your belief that people trying to prove ESP to you would deceive you in some way, the "ESP is real" hypothesis will never gain enough evidence to overcome the "I'm being tricked" hypothesis.
I wasn't familiar with that area of Jaynes' work. Your description of it is fascinating, especially given the modern context of scientists trying to convince climate-change deniers, or doctors talking to anti-vaccine groups.
That book, and that chapter specifically, are absolutely fascinating. I have read the book three times and still don't feel I have a firm understanding of the content (I try to read it once a year). I also like his example of polarized political views.
"Thinking, Fast and Slow" is also an amazing book, and I recommend it to almost everyone I talk to about books. It is not a quick read (it's almost 500 pages, I believe), but you will never think about advertising or your own brain the same way after reading about anchoring, substitution, and framing.