Markov chain Monte Carlo without the bullshit (2015) (jeremykun.com)
233 points by larve on Oct 25, 2022 | 53 comments


In the 21st century, statistics has had the odd distinction of being thrust into the spotlight in a way that it never has been before in its centuries of existence. Before now, practitioners have mostly been on an island...or more like several islands that are multi-day journeys from each other by rowboat. This has created weird terminology, and even the initiated don't all use the same jargon. I have a degree in statistics, and one thing I learned in school is that if you pick up a textbook, the first thing you have to do is figure out its internal terminology. Only the most basic concepts, or ones named after people, tended to use the same names everywhere.

I think this is why computer science has been so successful at co-opting a lot of statistics' thunder. It's reorganizing a lot of concepts, and because everyone is interested in the results (image generation, computer vision, etc.) it's getting a lot of adoption.

Amusingly, I thought that the blurb on MCMC the author quoted was pretty clear. That doesn't happen to me often.


> I think this is why computer science has been so successful at co-opting a lot of statistics' thunder. It's reorganizing a lot of concepts, and because everyone is interested in the results (image generation, computer vision, etc.) it's getting a lot of adoption.

I think it's also that "computer science" can be perceived as having succeeded where statistics failed. And the terminology is frankly a lot more appealing: would you rather "train an AI algorithm", or "fit a model"?

Even the recasting of "model" as "algorithm" has a marketing benefit: models have uncertainty and uncertainty is scary, whereas algorithms are perceived as precise and correct.


Where are you viewing statistics as having failed, precisely?

Lots of well-documented success in traditional manufacturing and logistics type things. The "AI winter" wasn't a statistical thing at all - rather, stats succeeded where logic programming had failed. Branded as "machine learning" or "data science," sure, but I don't recall a time of handwringing over stats failing in between.

(And I've heard "train a model" a thousand times more than "train an algorithm" from data folks in industry...)


> Where are you viewing statistics as having failed, precisely? <etc>

I don't think it's failed. But it certainly failed to capture the public imagination. Some people I talk to (fewer now than in the past) seem to think that statistics is old and irrelevant, and that machine learning is new and changing the world. Those people also don't usually have any idea of what they're talking about, but the idea is out there.

> "train a model"

But how often do mass media or industry publications talk about "models" as some kind of new hotness? It's all about "AI algorithms" now.

The terminology usage is especially surreal in the "digital insurance" space [0], where content marketing writers churn out breathless articles about how AI is the next big thing, as if insurance execs didn't understand what a model was!

[0]: https://www.acko.com/digital-insurance-trends-and-benefits/


It is my understanding that AI is being used in insurance not to supplant the risk models, but to automate human-driven tasks, such as reviewing claims or documents, or to augment risk models with new data.


That's correct. But the difference between an AI claim processing algorithm and the risk model lies more in how the model is used than in how it actually works.

Of course, the technical details of a text transformer differ significantly from the technical details of a decision tree or logistic regression. But when you zoom out a little, it's clear that many of the same core principles and techniques are used in both cases.

So the content marketers are right in that the next big thing is automating tasks like claim processing. But I find it at least a little bit silly because, at its core, an AI claim processing algorithm is not so conceptually different from a pricing model, and it's well within the realm of understanding for many people on the quantitative side of the insurance business.

I also want to avoid making particular claims or judgments about what AI "is". I will leave that to the futurists, philosophers, and AI researchers.


Usually I hear “training a model”


When I hear "training a model" I translate that to y=mx+c


I think of Y = f(X). f can be anything from a tree-based model to a neural network to a nonparametric model.

y = mx + c is simple and common but by no means the only kind of model.


I think what the poster was implying was that most people create simple models, because they have a linear thought process, but they call it a model to sound sophisticated.


I was going for: they call it a model to sound sophisticated - they call it "training a model" when they are really doing a linear regression.

Technically, doing a pair of LINEST fits in Excel and a Goal Seek to weight the output of each is AI.

Sometimes the groundbreaking AI startup that just raised a gazillion dollars in their 10th up round is just a plain old basic high school statistical model.


Linear regression is a legitimate model, and fitting its parameters is definitionally machine learning.
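A minimal sketch in Python (plain numpy, made-up data) of what "training" means here, namely solving for m and c by least squares:

    import numpy as np

    # Hypothetical data: a noisy line y = 2x + 1
    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 50)
    y = 2 * x + 1 + rng.normal(scale=0.5, size=x.size)

    # "Training" the model = fitting m and c by least squares
    X = np.column_stack([x, np.ones_like(x)])        # design matrix [x, 1]
    (m, c), *_ = np.linalg.lstsq(X, y, rcond=None)

    print(f"fitted model: y = {m:.2f}x + {c:.2f}")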


> in a way that it never has before in its centuries of existence

I'm not surprised by this at all. There has been an explosion in statistical research really since Fisher. While Bayesian statistics was invented in the late 1700s, it wasn't really widely used until the 1950s (Fisher coined the term, btw). The explosion in computing resources also caused an explosion in statistics, and especially in Bayesian statistics. It isn't computing leveraging statistics but a symbiosis. Causal statistics was pretty much nonexistent prior to the 21st century, and major players like Judea Pearl didn't produce their groundbreaking works until then either.

It should be absolutely no surprise that statistics wasn't in the spotlight before. It was harder to do without computers. Obtaining vast amounts of data was substantially more difficult. Especially with respect to accuracy. Determining causal relationships was essentially a crap shoot.

But this is also a great example of why your theoretical work may not be useful now but can be the cornerstone of modern science a century later.


This. Statistics is a very young field. The entire idea it rests on[1] is highly unintuitive, so it's no wonder it takes time.

----

[1]: Exchangeability, the fact that you can make useful progress in analysing a case by ignoring most of the information about that specific case and instead lumping it into a reference class of "similar" cases, where similar ultimately rests on subjective judgment, other than in a few special cases.


I definitely agree that statistics is a young field, but I would not agree with your definition. It is naive, and statistics is much more than that. As HN readers who know me will attest, I frequently rant about how low-order approximations can lead to inaccurate results, or even results that point in the wrong direction. Statistics actually allows us to see this! There are many famous paradoxes that really depend on this. For simplicity's sake I'll reference Simpson's[0] and Berkson's[1], which happen because of inappropriate aggregation (inappropriate aggregation is one of the most common errors in statistical modeling, but not always easy to notice).

Rather, what statistics is about is modeling things that have imperfect information and/or are probabilistic in nature (yes, there are real probabilities). The point isn't to ignore information (we actually want as much as we can get, which is why we're in the information age and why statistics has exploded with innovation) but to recognize that with observations and sampling we can extract patterns that still allow us to make predictions without all the information. For example, statistics is commonly used in thermodynamics because there is imperfect information[2], in that we can't measure where each individual particle is at a given point in time. Instead we can aggregate information about the particles and then make aggregated predictions about future states. It is important to note that the predictions are also probabilities.

[0] https://en.wikipedia.org/wiki/Simpson%27s_paradox

[1] https://en.wikipedia.org/wiki/Berkson%27s_paradox

[2] thermodynamics also involves true probabilities but I think the case is still motivating. We can also use statistics with purely deterministic processes with perfect information, but it is frequently computationally inefficient.


I agree with everything you say! But I would still argue that the things you describe (prediction of the partially known based on limited information) are built on probability theory (sort of applying it in reverse), which in turn is based on assigning cases to reference classes and treating them as exchangeable in the ways that matter for the analysis at hand.


To be fair, naming things is also one of the hard problems in computer science.

Graphs may have vertices or nodes. Traversals of a graph may be called paths and walks or simple paths and paths. The length of a path may be equal to the number of nodes or edges in it. The height of a tree may be equal to the depth of the deepest node, or it may be off by one in either direction.

In my subfield, some people talk about strings and substrings while others talk about words and factors. A Lempel–Ziv compressor may factorize the input or parse it into phrases. The Burrows–Wheeler transform, the FM-index, and the compressed suffix array may all refer to the same thing, or there may be 2 or 3 distinct meanings.

I don't think I've ever written a paper without naming at least one new concept and renaming at least one old concept.


The quote from the article in the Encyclopedia of Biostatistics is awash in undefined terminology, sometimes about peripheral issues.

A clean, logical treatment, all terms well defined and explained, with plenty of advanced content, is in (with some markup using TeX)

Erhan \c Cinlar, {\it Introduction to Stochastic Processes,\/} ISBN 0-13-498089-1, Prentice-Hall, Englewood Cliffs, NJ, 1975.\ \

The author was long at Princeton. He is a high quality guy.

As I was working my way through grad school in a company working on US national security, a question came up about the survivability of the US SSBN fleet under a special scenario of global nuclear war but limited to sea. Results were wanted in two weeks. So, I drew from Çinlar's book, postulated a Markov process subordinated to a Poisson process, typed some code into a text editor, called a random number generator I'd written in assembler based on the recurrence

X(n+1) = (5^15 * X(n) + 1) mod 2^47

and was done on time.
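For the curious, that recurrence is a plain linear congruential generator; the original was assembler, but a sketch of the same idea in Python (seed chosen arbitrarily here) looks like:

    # Linear congruential generator: X(n+1) = (5^15 * X(n) + 1) mod 2^47
    MULTIPLIER = 5 ** 15
    MODULUS = 2 ** 47

    def lcg(seed=12345):
        """Yield an endless stream of approximately uniform(0, 1) variates."""
        x = seed
        while True:
            x = (MULTIPLIER * x + 1) % MODULUS
            yield x / MODULUS

    gen = lcg()
    print([round(next(gen), 4) for _ in range(5)])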

A famous probabilist was assigned to review my work. His first remark was that there was no way for my software to "fathom" the enormous "state space". I responded: at each time t, the number of SSBNs left is a random variable, finite, with an expectation. So, I generate 500 sample paths, take their average, use the strong law of large numbers, and get an estimate of that expected value within a "gnat's ass" nearly all the time. "The Monte Carlo puts the effort where the action is."

The probabilist's remark was "That is a good way to think of it."

Need to do some work with Markov chains, simulation, etc.? Right, just read some Çinlar, not much in prerequisites (he omitted measure theory), get clear explanations, no undefined terminology, from first principles to some relatively advanced material, and be successful with your project.


Thank you for the recommendation. The book looks serious to me - it reminds me of my math lectures during my chemistry studies at university. How much time do you think it takes to work through it and get some value/insight? Is this even the right material for such an endeavour?


For background, need only some routine calculus and the most elementary probability theory.

The chapter on the Poisson process, early in the book, is nice to amazing, e.g., his axiomatic derivation. You should be able to get the main ideas in 1-2 evenings of fun, pleasant reading.

Science is awash in Poisson processes; it is nice to understand more about them. That they have no memory seems to me like a thin thread one could pull to say some things about what is going on in the internals of some of those cases of Poisson processes. I.e., you are looking at some system with some unknown internals, some working mechanism, but the evidence is that it has no memory. The system has some internal state but no memory? How can that be? So, we have some possibly severe limitations on how the internals work. Now I'm concentrating mostly on my startup and deliberately setting aside pursuing such questions -- maybe later. But you, maybe in chemistry, quantum lifetimes, time until a nucleus splits, ..., glitches in complicated computer systems, ....

Similarly for the other chapters.

If you take the book fully seriously, e.g., carefully follow all the proofs and work most of the exercises, it's a good one semester course.


> Science is awash in Poisson processes; it is nice to understand more about them (...)

Agreed :) Although I never dared to explore them in the depth you are describing in your text.

> If you take the book fully seriously, e.g., carefully follow all the proofs and work most of the exercises, it's a good one semester course.

Thank you for an approximate time quantification.

As a side question - judging from your comments, you seem to be the right person to ask: What kind of approach would you recommend to get to a level in statistics strong enough, in theoretical and practical aspects, for industrial application? Would you be so kind as to recommend any courses or books?

> Now I'm concentrating mostly on my startup

Out of curiosity: What area of industry are you operating in?

Cheers!


Thank you for the book recommendation. I have gained a lot from reading your math study recommendations over the past few years. I wish I had more time/motivation to fully follow-through with them.


As in the book, a Poisson process is an arrival process -- omitting the math, some of which is amazing, the vanilla example is Geiger counter clicks. One of the special things, actually an easy calculus exercise, is: Suppose the clicks come on average once a second. Suppose we have not gotten a click for time t >= 0. E.g., maybe t is 2 seconds. Now, what is the average time to the next click? Less than 1 second? Nope, the average now (a statement in conditional probability) is STILL 1 second.
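A quick empirical check of that claim (a sketch in Python with numpy; the rate and the 2-second wait are just the numbers from the example above):

    import numpy as np

    rng = np.random.default_rng(1)

    # Inter-click times for a Poisson process at one click/second are Exponential(1)
    waits = rng.exponential(scale=1.0, size=1_000_000)

    # Condition on having already waited t = 2 seconds with no click
    t = 2.0
    survivors = waits[waits > t]

    print("unconditional mean wait:        ", waits.mean())           # ~1.0
    print("mean additional wait given t=2: ", (survivors - t).mean()) # still ~1.0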

For a Markov chain, that is a discrete time, discrete state space Markov process. Halt: In a Markov process, at any time t, the past and future are conditionally independent given the present. E.g., for predicting the future, if we know the present, then also knowing the past does not help (or hurt, really). So, Markov is an enormous simplification. In particular, a Poisson process is a Markov process.

For a discrete time, discrete state space Markov process, i.e., a Markov chain, subordinated to a Poisson process, the jumps occur at the arrival times of a Poisson process, so the time to the next change (jump) is exponentially distributed. A discrete time, discrete state space Markov process does not have to have anything to do with a Poisson process and, instead, can have, say, one jump an hour on the hour.

But when subordinated to a Poisson process, the expected time to the next jump (arrival), i.e., the parameter of the process, can depend on the state.

E.g., in the war at sea problem, as the ships sink, the rate at which they sink can slow down.

There was a WWII calculation by Koopmans that gave the encounter rate of multiple ships at sea. The formula used the area of the sea, the number of ships, etc. So, that is where I got the encounter rates.

Then the main input was the number of Red ships, the number of Blue ships, the speed of each ship, and for any Red-Blue encounter the probability the Red dies, the Blue dies, both die, or neither die.

That was enough data to generate sample paths (war simulations) of the assumed scenario. Silly, sure, but okay for one grad student in a rush in just two weeks!
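The real inputs aren't public here, so the following is only a toy sketch of the shape of such a simulation: exponential times between encounters (the subordinating Poisson process, with a rate that depends on how many ships are left), a random outcome for each encounter, and an average over 500 sample paths. All the numbers are made up.

    import random

    def one_sample_path(red=10, blue=10, rate=0.05,
                        p_red_dies=0.3, p_blue_dies=0.3, horizon=30.0):
        """One simulated war at sea; returns the Blue ships left at the horizon."""
        t = 0.0
        while red > 0 and blue > 0:
            # Time to the next encounter: exponential, with rate depending on the state
            t += random.expovariate(rate * red * blue)
            if t > horizon:
                break
            u = random.random()
            if u < p_red_dies:
                red -= 1
            elif u < p_red_dies + p_blue_dies:
                blue -= 1
            # else: neither side loses a ship in this encounter
        return blue

    paths = [one_sample_path() for _ in range(500)]
    print("estimated expected Blue survivors:", sum(paths) / len(paths))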

The above may be enough for you to read and understand the details in the book quickly. But for more the book also covers limiting distributions (as time evolves, the probability distribution of the states approaches a limit), Kolmogorov equations, branching processes, etc. An example of branching processes is some of the simple cases of nuclear fission.

Now you have saved quite a lot of time and gotten ready for a fairly easy time in the book!


Thank you for that explanation!


Recently watched this on Hamiltonian Monte Carlo, and it was a fantastic primer on the basic concepts and motivations of MCMC, and later HMC: https://youtu.be/pHsuIaPbNbY


As someone who took statistics 30 years ago and promptly forgot most of it, I followed everything except "I want to efficiently draw a name from this distribution". What makes a drawing efficient?


Look, if you want to draw from a unit circle, you could enclose the unit circle inside a square. If the center of the unit circle is the origin, it should be clear from some middle school geometry that such a square has upper left cartesian coordinates at (-1,1), bottom right at (1,-1). So it's a square of side two.

Now you can sample from this square by calling the rand() function twice in any programming language, scaling & translating. So def foo() { -1 + 2*rand() } will do the trick in c/c++/scala/python/whatnot (I've scaled by two and translated by minus one). So you have your two random variables. Pair them up & that's your tuple (x,y).

Now if you make 100 such tuples, not all the tuples will lie inside the unit circle. So you have to toss out the ones that don't. So your drawing isn't 100% efficient. How efficient is it? Well, if you toss out say 21 of those 100, your sampler is 79% efficient. Now where the fuck does the 21 come from? Well, if you use some high school geometry, the unit circle has area pi and that enclosing square has area 4, so pi/4 is approximately 79%, so 100-79 is 21 and so on...

So one can construct more efficient samplers for the unit circle by not being so foolish. We should stop enclosing circles in squares & listen to Marsaglia. He died a decade back but before his death he solved the above problem, among others, so we don't waste 21% of our energy.

That said, most programs I've seen in banking, data science etc are written by programmers, not statisticians. So they happily use an if statement & reject x% of the samples, so they are super-inefficient. Drawing can be efficient if the statistician codes it up. But that fucker wants to use R, so given the choice between some diehard R fucker & a python programmer who can mess around with kubernetes & terraform in their spare time, the hapless manager will pick the python programmer every time, so that's what makes the drawing inefficient. /s tag, but not really. Just speaking from bitter personal experience :)
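For what it's worth, that rejection sampler is only a few lines of Python, and you can see the ~21% waste directly (a sketch, nothing clever):

    import random

    def sample_unit_disk():
        """Rejection sampling: draw from the enclosing square until we land in the disk."""
        tries = 0
        while True:
            tries += 1
            x, y = -1 + 2 * random.random(), -1 + 2 * random.random()
            if x * x + y * y <= 1:
                return (x, y), tries

    n = 100_000
    total_tries = sum(sample_unit_disk()[1] for _ in range(n))
    print("acceptance rate:", n / total_tries)   # ~0.785 = pi/4, i.e. ~21.5% of draws wasted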


I have to agree, we could save the world a lot of wasted energy if there were a way to get statisticians off of R/matlab and into more 'portable' spaces.


R contains so many statistical functions in its base function set. I can understand why they do not want to leave.


Isn't the obvious solution to sample in polar coordinates instead? 0 to 1 for the radial coordinate, and (0 to 1) * 2pi for the angle.


If you do this you won't get a uniform distribution on the circle; the points will be most dense at the center and get less dense as you go towards the edge. To make the points uniform you need to use inverse transform sampling[0], which gives the formula r*sqrt(rand()) for the radial polar coordinate, where r is the radius of the circle and rand() returns a uniform random number from the interval 0 to 1.

[0]: https://en.wikipedia.org/wiki/Inverse_transform_sampling
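A sketch of both versions in Python (unit circle, so the formula reduces to sqrt(rand())), with a crude check that the corrected one is actually uniform:

    import math
    import random

    def polar_naive():
        """Biased: points pile up near the center."""
        r, theta = random.random(), 2 * math.pi * random.random()
        return r * math.cos(theta), r * math.sin(theta)

    def polar_uniform():
        """Uniform on the unit disk: inverse transform gives r = sqrt(u)."""
        r, theta = math.sqrt(random.random()), 2 * math.pi * random.random()
        return r * math.cos(theta), r * math.sin(theta)

    # Fraction of points with r < 0.5 should be 0.25 for a uniform sample
    n = 100_000
    inner = sum(1 for _ in range(n) if math.hypot(*polar_uniform()) < 0.5)
    print(inner / n)   # ~0.25 (the naive sampler gives ~0.5 here)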


Yes, though you have to take the square root of the sampled radius for the resulting distribution to be uniform on the unit circle. (The area of the donut with r>0.5 is greater than the area of the circle with r<0.5, but the naive implementation would sample from each of those with probability 0.5.)

It's still a useful illustration, though, since MCMC samplers used in practice do end up throwing away lots of proposed points based on predefined acceptance criteria.


Computational efficiency is a major consideration in sampling from a high dimensional distribution. https://en.wikipedia.org/wiki/Rejection_sampling#Drawbacks


If you want to select an integer uniformly at random from 0...n-1, you need an expected log n mutually independent random bits. What if you don't want it to be uniformly random, but some other distribution instead? That's where Markov chains help; they use random bits efficiently to draw from an interesting distribution.
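A sketch of the simplest version of that in Python: build the integer from fair bits and reject out-of-range values. (This plain rejection scheme uses somewhat more than the log2(n) lower bound per draw; cleverer schemes get closer.)

    import random

    def uniform_int(n):
        """Uniform integer in [0, n) built from fair coin flips, with rejection."""
        bits = max(1, (n - 1).bit_length())    # ceil(log2(n)) bits per attempt
        used = 0
        while True:
            x = 0
            for _ in range(bits):
                x = (x << 1) | random.getrandbits(1)
                used += 1
            if x < n:                          # reject values >= n and retry
                return x, used

    draws = [uniform_int(10) for _ in range(100_000)]
    print("average bits per draw:", sum(u for _, u in draws) / len(draws))  # ~6.4 for n=10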


Yeah the article really disappoints right away by saying something seemingly arbitrary in the first paragraph


One possible cost measure is the number of evaluations of the probability function.


A bit weird to develop everything in terms of a finite probability space. I guess it helps avoid fiddly technical issues of measurability. But we don't want the reader to believe that MCMC only works for finite sets. I think the most famous applications are over uncountable sets. Of course everything is technically finite due to floating-point math, but graph walks don't really capture the spirit of MCMC over finite-precision approximations of continuous probability spaces.


The article says that we sample random values by walking on a grid. But doesn't that mean that the values we draw will be close to each other?

For example, if we have a 2-dimensional grid of size 1000x1000 and make 10 steps, all the values will be close to each other. That doesn't look like a good random sample.

Knowledgeable people, please explain.


I don't think that this exposition is any more or less clear than any number of similar ones in any number of books. Sure, if you just read a 100-word summary, it sounds like the quoted example of "BS", but any reasonable text also provides an explanation that's essentially the same thing as the extended explanation given by the author.


I like the article. That said, if you are expecting biostatisticians to give the best explanation of the biggest hammer in the Bayesian toolbox, you may be looking in the wrong place.

Folks in spatial stats, machine learning, and physics have some really nice introductory material.


I'm not so sure about that; Richard McElreath's Statistical Rethinking is an amazing resource.


I don’t think Richard would ever describe himself as a statistician? He seems just as frustrated by the bad terminology and convoluted explanations as anyone.


Statistical Rethinking is probably the best textbook that I've ever read. It basically jump-started my career in data!


Would you suggest some?


Excellent piece.

I strongly agree that most of the existing literature overcomplicates the subject. It's likely that most technical authors are excellent at the subject they write about, but not good at writing in general.

Authors should put themselves in the reader's shoes more often.

And no, I'm not saying things should always be simplified. When you write a book you do define an MQR (minimum qualified reader) and write for them. The problem I'm talking about here is that most of the aforementioned content hasn't even defined an MQR or has defined one poorly and is not taking it into account.


As I'm about to write a technical book (my first was pretty bad, but sold reasonably well), I found "Write Useful Books" by Rob Fitzpatrick, which approaches writing a "useful" book (i.e., something that really teaches you something) as product design: basically product-designing your book through multiple beta-reader iterations, really drilling down on making it useful for the readers. Part of that is cutting down on all the things that impede making progress from applicable piece of knowledge to applicable piece of knowledge, such as the preface, introduction, long theoretical underpinnings, etc.

I think that most textbooks ought to be “useful” books first, rather than a mix of reference material and pedagogy, tediously plodding through things in a sequence that doesn’t make didactic sense.


> It's likely that most technical authors are excellent at the subject they write about, but not good at writing in general.

Could be that, or it could be that they're keeping things deliberately complex so they have a moat around their own area of specialization.


I quite enjoyed McElreath's Statistical Rethinking including on this topic.

https://www.youtube.com/watch?v=Qqz5AJjyugM


I love Markov chains! But, I too don't like "terminology, notation, and style of writing in statistics"... so, I built a simple "Rosetta Stone" (Python <> Swift) library implementation of Markov chains here: https://github.com/maxhumber/marc


This is so useful.


Be careful - removing the fancy words from "Markov Chain Monte Carlo Simulations" and translating it into English can have a negative effect professionally.


Here's my attempt to explain MCMC in a few paragraphs.

Drop a droplet of ink in a glass of water. In time the ink will spread out, and after a few minutes it will be homogeneously distributed throughout the whole glass. Why is that so? When it gets to the steady state, you can take any cubic millimeter of water, and in any given second there will be as many particles of ink leaving that cube as entering the cube. In other words, each cubic millimeter of water is in balance.

What's that got to do with Monte Carlo? Many problems can be reduced to calculating the expected value of a function of a multi-dimensional random variable. All you need to do is take many samples of that variable, apply the function, and then take the average. Simple, right? Yes, fairly simple, but how do you draw samples from a multi-dimensional distribution?

You pretend you are letting a drop of ink diffuse. You need to make it so that in the steady state, the ink's concentration is the pdf of the distribution you want to draw from. You want to make use of the balance: in each tiny little cube, as many particles of ink enter as leave.

Nobody knows how to enforce this balance. It's too hard: a particle can leave in many ways, and can enter from many places.

But when something is hard, sometimes it's simpler to solve a more difficult problem. In this case, you enforce a more stringent condition, called "the detailed balance". You make sure that for any two little cubes, as many particles migrate from one to the other as migrate the opposite way.

So, let's say you take point A and point B. The pdf function at the two points has values P(A) and P(B). You have the freedom to choose the transition probabilities P(A->B) and P(B->A). How do you choose them? Well, you enforce the "detailed balance": how many "particles of ink" go from A to B? The number is proportional to the probability of a particle being at A to begin with, and then transitioning from A to B. So, it's P(A)P(A->B).

Great. The detailed balance condition is P(A)P(A->B) = P(B)P(B->A).

Is there a simple way to get these transition probabilities? Here's a simple example: let's say P(A) is twice as high as P(B). We need to make P(A->B) be just half of P(B->A). We do it in two steps: first we make P(A->B) and P(B->A) equal (for example we draw from the uniform distribution for both) and then we flip a coin, and if it's heads we accept the move from A to B and if it's tails we stay in A. This recipe (called Metropolis) does not work just for P(B)/P(A) = 1/2, but for any ratio (of course you need to use a biased coin).

So, that's MCMC. You start from an arbitrary point, propose a move (draw from the uniform distribution), and check the ratio of the probabilities of the end and start points. If the ratio is at least one, you accept the move; if it is less than one, you flip a biased coin and accept the move with probability equal to that ratio, otherwise you stay put. Rinse, repeat. After many steps you end up with a point that's randomly distributed with the target probability.

Oh, and there's a bonus. Since the acceptance ratio only depends on the ratio of the pdf at two points, you don't actually need to know the pdf itself; it's OK if you know the pdf only up to a normalization constant. Which is exactly the situation when you do Bayesian estimation. That normalization constant is very difficult to calculate (it is itself a multi-dimensional integral). This stroke of luck is what allowed Bayesian estimation to exist. Without it, the whole field of Bayesian estimation would be just idle speculation.
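To make the recipe concrete, here's a minimal sketch in Python of a 1-D Metropolis sampler for a target known only up to a constant (an unnormalized Gaussian as a stand-in), with the uniform proposal and biased-coin acceptance described above:

    import math
    import random

    def unnormalized_pdf(x):
        """Target known only up to a normalization constant."""
        return math.exp(-0.5 * x * x)

    def metropolis(n_samples, step=1.0, x0=0.0):
        samples, x = [], x0
        for _ in range(n_samples):
            # Symmetric proposal: a uniform step around the current point
            proposal = x + random.uniform(-step, step)
            ratio = unnormalized_pdf(proposal) / unnormalized_pdf(x)
            # Accept with probability min(1, ratio); otherwise stay put
            if random.random() < ratio:
                x = proposal
            samples.append(x)
        return samples

    draws = metropolis(100_000)
    kept = draws[10_000:]                       # discard a burn-in period
    print("mean (should be ~0):", sum(kept) / len(kept))
    print("second moment (should be ~1):", sum(v * v for v in kept) / len(kept))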


Full list of primers from Jeremy Kun (of which the submitted page is one):

https://jeremykun.com/primers/



