
>> How was early Machine Learning different from statistics?

Some of the very early work in machine learning, in the 1950s and '60s, was not statistical. The first "artificial neuron", the McCulloch & Pitts neuron from 1943, was a propositional logic circuit. Arthur Samuel's checkers-playing programs, begun in 1952, used a classical minimax search with alpha-beta pruning.
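
To give a flavour of what "propositional logic circuit" means here, a minimal sketch of a McCulloch-Pitts unit in Python (the weights and thresholds are hand-set for illustration; nothing is learned or estimated):

    def mp_neuron(inputs, weights, threshold):
        # The unit "fires" (outputs 1) iff the weighted sum of its binary
        # inputs reaches its threshold; no probabilities, no learning rule.
        return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

    # AND gate: both inputs must be on to reach the threshold of 2.
    assert mp_neuron([1, 1], [1, 1], threshold=2) == 1
    assert mp_neuron([1, 0], [1, 1], threshold=2) == 0

    # OR gate: any single input reaches the threshold of 1.
    assert mp_neuron([0, 1], [1, 1], threshold=1) == 1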

Machine learning in the '70s and '80s was for the most part not statistical but logic-based, in keeping with the then-current trend towards logic-based AI. Early algorithms did not use gradient descent or other statistical methods, and the models they learned were sets of logic rules rather than the parameters of continuous functions.

For instance, a lot of work from that time focused on learning decision lists and decision trees, the latter of which are best remembered today. The focus on rules probably followed from the realisation of the knowledge-acquisition bottleneck in expert systems, which were the first big success of AI.
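
To make that concrete: the output of such a learner is symbolic. A toy sketch of a decision list in Python (attribute names and rules invented here for illustration), where the model is an ordered set of if-then rules tried first-match-first, not a vector of real-valued parameters:

    # Toy decision list: rules are tried in order, first match wins.
    decision_list = [
        (lambda e: e["outlook"] == "overcast", "play"),
        (lambda e: e["outlook"] == "sunny" and e["humidity"] == "high", "don't play"),
        (lambda e: e["outlook"] == "rain" and e["wind"] == "strong", "don't play"),
        (lambda e: True, "play"),  # default rule
    ]

    def classify(example):
        for condition, label in decision_list:
            if condition(example):
                return label

    print(classify({"outlook": "sunny", "humidity": "high", "wind": "weak"}))  # don't play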

You can find examples of machine learning research from those times in the work of researchers like Ryszard Michalski, Ross Quinlan (known for ID3 and C4.5, and for the first-order inductive learner FOIL), (the) Stuart Russell, Tom Mitchell, and others.
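
For the flavour of ID3/C4.5 specifically: attributes are chosen greedily by information gain, an entropy calculation over discrete counts rather than gradient descent over a loss surface. A minimal sketch of the gain computation (not Quinlan's full algorithm; the toy data below is made up):

    from collections import Counter
    from math import log2

    def entropy(labels):
        # Shannon entropy of a list of class labels.
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def information_gain(examples, labels, attribute):
        # Gain = entropy of the whole set minus the weighted entropy of the
        # subsets produced by splitting on the attribute; ID3 picks the
        # attribute with the highest gain at each node.
        base = entropy(labels)
        remainder = 0.0
        for value in set(e[attribute] for e in examples):
            subset = [l for e, l in zip(examples, labels) if e[attribute] == value]
            remainder += (len(subset) / len(labels)) * entropy(subset)
        return base - remainder

    examples = [{"outlook": "sunny"}, {"outlook": "sunny"},
                {"outlook": "overcast"}, {"outlook": "rain"}]
    labels = ["no", "no", "yes", "yes"]
    print(information_gain(examples, labels, "outlook"))  # 1.0: the split separates the classes perfectly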


