I appreciate the lack of math notation. For many people with a poor mathematics background, notation feels like a huge wall in the way of getting into interesting and useful theories.
Bayes' theorem is very well suited to this. Frankly, it's one of those rare cases where those without much math might find it easier to read the original paper than many of the introductions...
When I finally got Bayes' theorem, I thought it said something obvious in unfamiliar terms.
What made it click for me was realizing that Bayesian networks are a mini-language and that Bayes' theorem is much more easily explained visually than with formulas. I think teachers should start by explaining the correspondence between that mini-language and probability terminology.
Here's how I would explain it.
----
In a Bayesian network, nodes are events and arrows are probabilities.
When you traverse a path made of successive arrows you multiply the probabilities of the arrows you encounter along the path.
When there is more than one path to get from A to B and you want to know the probability of getting from the former to the latter, you sum the probabilities obtained from the various paths.
When you say "probability of A" it's like saying: sum of the paths that get to A.
When you say "probability of A and B" it's like saying: sum of the paths that include both A and B.
When you say "conditional probability of B given A" it's like saying: starting from A, sum of the paths that lead to B.
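The "multiply along a path, sum across paths" rules can be sketched in a few lines of Python. This is only an illustration: the function name `path_probability`, the dictionary layout, and the numbers in the small A-to-B network are all made up for the example.

```python
def path_probability(arrows, start, goal):
    """Sum, over every path from start to goal, the product of its arrows."""
    if start == goal:
        return 1.0  # empty path: nothing left to multiply
    total = 0.0
    for next_node, p in arrows.get(start, {}).items():
        # Multiply along a path, add across alternative paths.
        total += p * path_probability(arrows, next_node, goal)
    return total


# Hypothetical network with two routes from A to B (A->X->B and A->Y->B).
arrows = {
    "A": {"X": 0.5, "Y": 0.5},
    "X": {"B": 0.25},
    "Y": {"B": 0.75},
}

p_a_to_b = path_probability(arrows, "A", "B")
print(p_a_to_b)  # 0.5 * 0.25 + 0.5 * 0.75
```

The recursion assumes the network has no cycles, which holds for the trees discussed here.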
----
Let's do a simple application. This is a tree that doctors should find familiar, and it's the one from which I understood the theorem.
           /-- T+
     /-- D+
    /      \-- T-
root
    \      /-- T+
     \-- D-
           \-- T-
Starting from the root, the first bifurcation gives the probability of having the disease or not. The second bifurcation gives the probability that a diagnostic test says "positive" or "negative".
Usually doctors can estimate the values of the single arrows of this tree.
Let's say I asked you: what's the conditional probability of a positive test given that the patient has the disease? Given what we said, you just put your pencil on D+ and follow the path to T+: just 1 arrow, no need to multiply (it's called the "sensitivity" of the test).
What's the probability of a positive test for a person picked at random from the population? Since we don't start with a patient who is known to have or not have the disease, we put our pencil on the root. There are 2 ways of getting to a T+: root-->D+-->T+ and root-->D- -->T+. As we said above, while following each path we multiply the arrows we encounter, and then we sum the results of the 2 paths.
And finally: what's the probability of our patient having the disease given that the test says "positive"?
We said we have 2 ways to get a positive test, but in only one of them does our patient really have the disease. So we just divide the probability of the only path that contains both D+ and T+ by the probability of all the paths that lead to T+. We are just saying that true positives are a fraction of all positives (seems obvious to me?). The numerator is the only "test is positive and it's true" path. The denominator is the sum of all "test is positive" paths.
Well, guess what we just did:
P(D+|T+) = ( P(T+|D+) P(D+) ) / P(T+)
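To make the pencil-on-the-tree procedure concrete, here is a minimal numeric sketch. The three arrow values (prevalence, sensitivity, false positive rate) are invented for illustration and do not come from any real test.

```python
# Assumed, illustrative arrow values (not from any real diagnostic test).
prevalence = 0.01           # P(D+): root --> D+
sensitivity = 0.90          # P(T+|D+): D+ --> T+
false_positive_rate = 0.05  # P(T+|D-): D- --> T+

# Multiply the arrows along each path that ends in T+.
true_positive_path = prevalence * sensitivity
false_positive_path = (1 - prevalence) * false_positive_rate

# P(T+): sum of all paths that lead to T+.
p_positive = true_positive_path + false_positive_path

# P(D+|T+): the one "positive and really sick" path over all positive paths.
p_disease_given_positive = true_positive_path / p_positive

print(f"P(T+)    = {p_positive:.4f}")
print(f"P(D+|T+) = {p_disease_given_positive:.4f}")
```

With these made-up numbers, only about 15% of positive tests belong to patients who actually have the disease, because the D- branch is so much more heavily travelled than the D+ branch.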
(Additional intuition: another way to see it is that what we did corresponds to mapping the tree we started from to a flipped one in which the first bifurcation is T+/T- and the second one is D+/D-)
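The flipped-tree view can also be checked numerically. The sketch below assumes illustrative arrow values for the original tree, builds the arrows of the flipped tree from them, and confirms that both trees give every leaf the same probability. All names and numbers are hypothetical.

```python
# Original tree: first bifurcation D+/D-, second bifurcation T+/T-.
# Arrow values are assumed for illustration only.
p_d = {"D+": 0.01, "D-": 0.99}
p_t_given_d = {
    "D+": {"T+": 0.90, "T-": 0.10},
    "D-": {"T+": 0.05, "T-": 0.95},
}
tests = ("T+", "T-")

# Probability of each leaf: product of the arrows on its root-to-leaf path.
leaf = {(d, t): p_d[d] * p_t_given_d[d][t] for d in p_d for t in tests}

# Flipped tree: first bifurcation T+/T-, second bifurcation D+/D-.
p_t = {t: sum(leaf[(d, t)] for d in p_d) for t in tests}
p_d_given_t = {t: {d: leaf[(d, t)] / p_t[t] for d in p_d} for t in tests}

# Leaves of the flipped tree, again as products along each path.
flipped_leaf = {(d, t): p_t[t] * p_d_given_t[t][d] for d in p_d for t in tests}

for key in leaf:
    print(key, round(leaf[key], 6), round(flipped_leaf[key], 6))
```

The second-level arrows of the flipped tree are exactly the quantities Bayes' theorem gives you, for example `p_d_given_t["T+"]["D+"]` is P(D+|T+).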
Oh yeah, and the first actually usable form of Bayes' theorem would be probabilistic graphical models with the max-sum algorithm. Good luck mastering that quickly, or at all!
That is far from the first usable form of Bayes. I have no idea what point you are making.
Bayes' theorem is easily derived algebraically using conditional probability and the chain rule. You can also derive it easily with a Venn diagram. There is barely any notation needed at all here to understand it.
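For reference, the algebraic derivation mentioned here fits in two lines. The symbols A and B are generic events, and the only assumption is that P(B) > 0 so the division is defined.

```latex
% Chain rule: write the joint probability of A and B in two ways.
P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A)

% Divide both sides by P(B), assuming P(B) > 0.
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```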
If you're struggling with things at that level, it is more likely due to your own laziness, not because the math is hard. Because it is very easy to reason about.