
Can someone explain why the "no free lunch theorem" does not cause problems here?

https://en.wikipedia.org/wiki/No_free_lunch_theorem



Two explanations

First: Prophet is not actually "one model"; it's closer to a non-parametric approach than a single fixed model type, which gives it a lot of flexibility in the class of problems it can handle. That said, Prophet is "flexible", not "universal". A time series of entirely random integers selected from range(0,10) will be handled quite poorly, but fortunately nobody cares about modeling this case.
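
A rough sketch of that failure mode (assuming the standard prophet Python package and its documented ds/y column convention; the predict-the-mean baseline is just something I picked for contrast):

    # Illustrative sketch: Prophet fit to pure noise forecasts no better than
    # predicting the training mean. Assumes the `prophet` package is installed.
    import numpy as np
    import pandas as pd
    from prophet import Prophet

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "ds": pd.date_range("2020-01-01", periods=365, freq="D"),
        "y": rng.integers(0, 10, size=365),  # entirely random integers from range(0, 10)
    })

    train, holdout = df.iloc[:-30], df.iloc[-30:]
    m = Prophet()
    m.fit(train)
    forecast = m.predict(m.make_future_dataframe(periods=30))

    model_mae = np.mean(np.abs(forecast["yhat"].iloc[-30:].values - holdout["y"].values))
    naive_mae = np.mean(np.abs(train["y"].mean() - holdout["y"].values))
    print(f"Prophet MAE: {model_mae:.2f}  vs  predict-the-mean MAE: {naive_mae:.2f}")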

Second: the same reason that only a small handful of possible stats/ML models get used on virtually all problems. Most problems people solve with stats/ML share a number of common features which make it appropriate to use the same model on them (the model's "assumptions"). Applications which don't have these features get treated as edge cases and ignored, or you write a paper introducing a new type of model to handle them. Consider any ARIMA-type time series model. These are used all the time across many different problem spaces, and will do reasonably well on "most" "common" stochastic processes you encounter in "nature", because they're constructed to resemble many types of natural processes. It's possible (trivial, even) to conceive of a stochastic process which ARIMA can't really handle (a process with abrupt regime changes, or non-stationarity that differencing can't remove, will do), but in practice most things ARIMA utterly fails on are either not very interesting to model or are better served by other models.
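
As a sketch of that kind of failure, using statsmodels' ARIMA (the specific order and the simulated series here are purely illustrative):

    # Illustrative: the same ARIMA(1,0,0) recipe does fine on a well-behaved AR(1)
    # process but degrades on a series whose dynamics change mid-stream.
    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(1)

    def simulate_ar1(n, phi=0.8):
        y = np.zeros(n)
        for t in range(1, n):
            y[t] = phi * y[t - 1] + rng.normal()
        return y

    stationary = simulate_ar1(500)                   # the "natural" case ARIMA expects
    broken = np.concatenate([simulate_ar1(400),      # abrupt regime change partway through
                             10 + 5 * rng.normal(size=100)])

    for name, y in [("stationary AR(1)", stationary), ("regime change", broken)]:
        train, test = y[:-50], y[-50:]
        fit = ARIMA(train, order=(1, 0, 0)).fit()
        mae = np.mean(np.abs(fit.forecast(steps=50) - test))
        print(f"{name:>17}: holdout MAE = {mae:.2f}")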


These insights are really awesome! It reminds me of the common aphorism in statistics: 'All models are wrong, but some are useful.' They really are a wake-up call, thank you!


Disclaimer: I haven't looked at the linked library at all, but this is a theoretical discussion which applies to any task of signal prediction.

Out of all possible inputs, there are some that the model works well on and others that it doesn't work well on. The trick is devising an algorithm which works well on the inputs that it will actually encounter in practice.

At the obvious extremes: this library can probably do a great job at predicting linear growth, but there's no way it will ever be better than chance at predicting the output of /dev/random. And in fact, it probably does worse than a constant-zero predictor when applied to a random unbiased input signal.

Except that it's also usually possible to detect such trivially unpredictable signals (obvious way: run the prediction model on all but the last N samples and see how it does at predicting the final N), and fall back to a simpler predictor (like "the next value is always zero" or "the next value is always the same as the previous one") in such cases.
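
A minimal sketch of that fallback logic (everything here is hypothetical scaffolding; `fancy_forecast` stands in for whatever model you actually use):

    # Sketch of "check the model on a recent holdout, fall back to a naive
    # predictor if it can't beat persistence". `fancy_forecast` is hypothetical.
    import numpy as np

    def persistence_forecast(history, horizon):
        # Naive fallback: the next value is always the same as the previous one.
        return np.full(horizon, history[-1])

    def choose_predictor(series, fancy_forecast, holdout=20):
        train, test = series[:-holdout], series[-holdout:]
        fancy_err = np.mean(np.abs(fancy_forecast(train, holdout) - test))
        naive_err = np.mean(np.abs(persistence_forecast(train, holdout) - test))
        # If the model can't beat persistence on the recent past, don't trust
        # it for the genuinely unknown future.
        return fancy_forecast if fancy_err < naive_err else persistence_forecast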

But that algorithm also fails on some class of inputs, like "the signal is perfectly predictable before time T and then becomes random noise". The core insight of the "No Free Lunch" theorem is that, summed across all possible input sequences, no algorithm works any better than any other; the crucial point is that in practice you never apply signal predictors to all possible inputs.

Another place this pops up is in data compression. Many (arguably all) compressors work by having a prediction or probability distribution over possible next values, plus a compact way of encoding which of those values was picked. Proving that it's impossible to predict all possible input signals correctly is equivalent to proving that it's impossible to compress all possible inputs.
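
As a sketch of the connection: under an idealised arithmetic coder, a symbol the model predicted with probability p costs roughly -log2(p) bits, so a uniform prediction over K symbols never does better than the plain log2(K)-bit encoding (the two toy predictors below are just for illustration):

    # Idealised predictive coding: encoding a symbol the model gave probability p
    # costs about -log2(p) bits (ignoring the coder's small constant overhead).
    import math

    def code_length(sequence, predict):
        bits = 0.0
        for i, symbol in enumerate(sequence):
            bits += -math.log2(predict(sequence[:i])[symbol])
        return bits

    uniform = lambda prefix: {s: 0.25 for s in "ABCD"}                 # 2 bits per symbol, always
    biased = lambda prefix: {"A": 0.7, "B": 0.1, "C": 0.1, "D": 0.1}   # bets on lots of A's

    seq = "AAABACAADA"
    print(code_length(seq, uniform))  # exactly 20.0 bits for 10 symbols: no compression
    print(code_length(seq, biased))   # ~13.6 bits, because the prediction fits this data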

Another way of thinking about this: imagine that you're the prediction algorithm. You receive the previous N datapoints as input and are asked for a probability distribution over possible next values. Averaged over all possible inputs, every possible next value is equally likely, so you could output a uniform distribution, but that provides no compression or useful prediction. Your probabilities have to sum to 1, so the only way you can increase the probability assigned to symbol A is to decrease the weight of symbol B by an equal amount. If the next symbol is A then congratulations, you've successfully done your job! But if the next symbol is actually B, then you've done worse (by any reasonable error metric) than the dumb uniform distribution. If your performance is evaluated over all possible inputs, the wins and losses balance out and you've done exactly as well as the uniform prediction would have.
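
You can brute-force that cancellation for short binary strings, scoring a predictor by the probability it assigns to the symbol that actually comes next (the `sticky` predictor below is just an arbitrary non-uniform example):

    # Brute-force check over ALL binary strings of length n: score a predictor by
    # the probability it assigns to the symbol that actually occurs next. The
    # total is identical for every predictor, because p("0") + p("1") = 1.
    import itertools

    def total_credit(predict, n=8):
        total = 0.0
        for seq in itertools.product("01", repeat=n):
            for i, symbol in enumerate(seq):
                total += predict(seq[:i])[symbol]
        return total

    def uniform(prefix):
        return {"0": 0.5, "1": 0.5}

    def sticky(prefix):
        # Arbitrary non-uniform predictor: bet heavily that the last symbol repeats.
        if not prefix:
            return {"0": 0.5, "1": 0.5}
        return {"0": 0.9, "1": 0.1} if prefix[-1] == "0" else {"0": 0.1, "1": 0.9}

    print(total_credit(uniform))  # 1024.0 (8 positions * 256 strings * 0.5)
    print(total_credit(sticky))   # 1024.0 as well: wins on "sticky" strings are
                                  # exactly paid for by losses on the rest

(With a log-loss score the bookkeeping is slightly different, but the conclusion is the same: nothing beats the uniform prediction once you average over every possible input.)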



