Online graduate-level machine learning course from CMU's Tom Mitchell

monk_the_dog · on Nov 5, 2011

I'm enrolled in the online Applied ML class from Stanford, and I've also been watching this course from CMU (I'm up to the Graphical Model 4 lecture - almost the midterm). If you've taken at least one stats class you'll get much more out of CMU's class.

BTW, here are some good online resources for machine learning:

* The Elements of Statistical Learning (free pdf book): http://www-stat.stanford.edu/~tibs/ElemStatLearn/

* Information Theory, Inference, and Learning Algorithms (free pdf book): http://www.inference.phy.cam.ac.uk/mackay/itila/

* Videos from Autumn School 2006: Machine Learning over Text and Images: http://videolectures.net/mlas06_pittsburgh/

* Bonus link. An Empirical Comparison of Supervised Learning Algorithms (pdf paper): http://www.cs.cornell.edu/~caruana/ctp/ct.papers/caruana.icm... (Note the top 3 are tree ensembles, then SVM, ANN, KNN. Yes, I know there is no 'best' classifier.)

zeratul · on Nov 5, 2011

About the bonus link:

It does not make sense to compare ensemble methods (bagging & boosting) with single-instance classifiers. In practice, you try all the classifiers and then use the best of them to create an ensemble. The paper leaves me unsatisfied, thinking that bagging or boosting SVMs would probably give the best results.

lliiffee · on Nov 6, 2011

I don't see why not. Different classifiers have different bias/variance characteristics. If you want to increase variance and decrease bias, then boost your classifier. (This is why boosting is usually applied to simple classifiers.) But whether that will actually help depends on the characteristics of the problem and the classifier used.

I guess bagging is a different story. So far as I know bagging usually decreases variance with no bias penalty, so it is more a trade-off between variance and speed.
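
A rough sketch of the bagging claim, using only the Python standard library. Everything here is an assumption made for illustration: the target function sin(x), the noise level, the query point, and the choice of 1-nearest-neighbour regression as the high-variance base learner. It estimates the variance of the prediction at one query point over many freshly drawn training sets, once for a single 1-NN model and once for an average over bootstrap resamples.

```python
import math
import random
import statistics

def make_training_set(rng, n=30, noise=0.5):
    """Draw n noisy samples of y = sin(x) with x uniform on [0, 3]."""
    xs = [rng.uniform(0.0, 3.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, noise)) for x in xs]

def predict_1nn(train, x0):
    """1-nearest-neighbour regression: return the y of the closest training x."""
    return min(train, key=lambda p: abs(p[0] - x0))[1]

def predict_bagged(train, x0, rng, n_models=25):
    """Average the 1-NN prediction over bootstrap resamples of the training set."""
    preds = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in train]  # sample with replacement
        preds.append(predict_1nn(boot, x0))
    return sum(preds) / n_models

def compare(x0=1.5, trials=300, seed=0):
    """Estimate mean and variance of both predictors over many training sets."""
    rng = random.Random(seed)
    single, bagged = [], []
    for _ in range(trials):
        train = make_training_set(rng)
        single.append(predict_1nn(train, x0))
        bagged.append(predict_bagged(train, x0, rng))
    return {
        "single_var": statistics.pvariance(single),
        "bagged_var": statistics.pvariance(bagged),
        "single_mean": statistics.mean(single),
        "bagged_mean": statistics.mean(bagged),
    }

result = compare()
print(result)
```

In this setup the bagged predictor should show a clearly smaller variance while both means stay close to sin(1.5). The price is speed: each bagged prediction costs n_models times as much as a single one.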

law · on Nov 5, 2011

It's actually fine to compare an ensemble method (using weak base learners) to a single instance strong learner. In this way, you compare the benefits of combining the weak learners with the benefits of using a single classifier. I see where you're going with that, but comparing ensemble methods with a single classifier is often a useful measurement.
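
To make the "weak base learners" point concrete, here is a minimal AdaBoost sketch in plain Python. It is an illustration under assumptions, not anything from the paper: the 1-D toy data, the decision-stump weak learner, and all function names are made up for this example. The labels are +1 on the outer thirds and -1 in the middle, so no single threshold can separate them.

```python
import math

# Hypothetical 1-D toy data: +1 on the outer thirds, -1 in the middle third.
XS = list(range(30))
YS = [1 if (x < 10 or x >= 20) else -1 for x in XS]

def stump_predict(stump, x):
    """A stump is (threshold, polarity): predict polarity left of the threshold."""
    threshold, polarity = stump
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(xs)]
    for t in thresholds:
        for polarity in (1, -1):
            stump = (t, polarity)
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(stump, x) != y)
            if err < best_err:
                best, best_err = stump, err
    return best, best_err

def adaboost(xs, ys, rounds):
    """Return a list of (alpha, stump) pairs fitted by AdaBoost."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight the points this stump got wrong, then renormalize.
        weights = [w * math.exp(-alpha * y * stump_predict(stump, x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

single, _ = best_stump(XS, YS, [1.0 / len(XS)] * len(XS))
boosted = adaboost(XS, YS, rounds=3)
single_acc = accuracy(lambda x: stump_predict(single, x), XS, YS)
boosted_acc = accuracy(lambda x: ensemble_predict(boosted, x), XS, YS)
print(single_acc, boosted_acc)
```

On this toy data the best single stump gets two thirds of the points right, while three boosted stumps classify all of them. That is training accuracy only, so it shows the bias reduction and says nothing about variance or generalization.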

zeratul · on Nov 5, 2011

That's true when the experiments are designed to show a gain in performance due to some aggregation technique. The article in question achieved that only for DT, and the body of the article doesn't seem to focus on the effects of ensemble methods.

monk_the_dog · on Nov 5, 2011

You make a good point. Ensemble methods seem to outperform single classifiers. There's no reason you can't have an ensemble of SVMs. The paper should have included something other than an ensemble of trees.

I tried to find a paper comparing an ensemble of SVMs to an ensemble of trees and I came up empty (after a quick search). I did find papers showing ensembles of SVMs outperforming a single SVM. I also found a comment on a paper claiming an ensemble of trees outperformed a "Parallel Mixture of SVM" (see here: http://www.mitpressjournals.org/doi/abs/10.1162/089976604323...). Of course, that's not a great source.

I absolutely agree they should have included ensembles other than trees. I don't necessarily agree an ensemble of SVMs would have beaten an ensemble of trees. It would have been interesting to see.

zeratul · on Nov 5, 2011

There was a suicide note emotion classification challenge:

http://computationalmedicine.org/home-0

Very noisy and sparse data. 25 teams. 22 system description papers. The winner used an SVM ensemble.

monk_the_dog · on Nov 5, 2011

Zeratul, you're obviously into ML. Would you mind if I asked what your application is? I'm just curious.

I work in computer vision. When I do a machine learning problem, I spend most of my time brainstorming and implementing good features. I'm getting deeper into ML (and loving it). I'm always curious what other people are doing with ML.

zeratul · on Nov 5, 2011

Medical language processing, information extraction from patient data, text classification, and clustering.

Yes, it would be great to get a list of hackers that do ML and the domain that they are working with.

aperrien · on Nov 6, 2011

I work with ML in the casino industry. I use multiple forms of classification and forecasting.

monk_the_dog · on Nov 6, 2011

Once upon a time I thought about using ML/vision in slot machines. I would try to read the gambler's emotions/age/sex, and the slot machine would change its stimulation (music, lights, etc.; not mess with the odds) to try to keep them at the machine longer.

I thought it was a good idea until I actually visited a casino. People sit at the slots in what looks like a hypnotic state. The emotions don't change much. I don't think I could have made a measurable difference.

I'm not surprised the gambling industry is using ML, but it's cool to hear about it. Thanks.

bhickey · on Nov 5, 2011

To your list, I'd like to add Jaynes's 'Probability Theory'. A few chapters are freely available here: www-stat.wharton.upenn.edu/~steele/Publications/PDF/PT.pdf

(The publisher asked the book's editor to stop distributing the whole PDF.)

shriphani · on Nov 5, 2011

I made the mistake of enrolling in a graduate level ML class without a strong foundation in statistics - my transcript is now going to be defaced permanently. But thanks for the inference text - is there an OCW version of an inference course?

danso · on Nov 5, 2011

I love it when people link to freely available academic texts, thank you.

Here's another one from Stanford: Mining of Massive Datasets http://infolab.stanford.edu/~ullman/mmds.html

monk_the_dog · on Nov 5, 2011

I just took a quick look at the chapter on clustering. Looks good! I'll put it on the ever-growing stack. Thanks!

drats · on Nov 5, 2011

Silverlight? Are these people serious? Whether you are an educational institution or a for-profit media company, you are trying to get to the largest number of people and cause them the fewest problems. Silverlight fails spectacularly at both those objectives.

Edit: I know there seems to be a Flash player component as well, but it's failing for me and I can't get to the .mp4. Which doesn't speak well of the joker who cobbled the site together either.

SkyMarshal · on Nov 5, 2011

Especially when the target audience for such a class is likely to include an outsized portion of *nix users.

zeratul · on Nov 5, 2011

Stanford also uses Silverlight and Flash:

http://171.64.93.201/ClassX/system/users/web/pg/view_subject...

Maybe it's now considered a "distance learning standard"?

amirmc · on Nov 5, 2011

"To view a video you will have to login with your CMU Andrew username and password, ..."

Also, it requires Silverlight (which I don't fancy installing).

Edit: This is the Tom Mitchell that Andrew Ng refers to early on in the Stanford ML lectures (when defining Machine Learning)

ya3r · on Nov 5, 2011

You don't have to login to watch videos.

He is the author of one of the most used texts on machine learning: "Machine Learning, Tom Mitchell, McGraw Hill, 1997."

Maven911 · on Nov 5, 2011

I hope this question doesn't come off as too naive, but given the number of links on the front page about ML: what is so fascinating about ML? Why is there not the same level of interest (or number of links) for topics such as cryptology, graphics, circuits, or computer architecture?

law · on Nov 5, 2011

There's this enormous focus on 'web scale' technologies. This focus necessarily involves visualizing and making sense of terabytes and eventually even petabytes of data; conventional approaches would take thousands or millions of man-hours to accomplish the same level of analysis that computers can perform in hours or days.

Tom Mitchell's definition of machine learning algorithms as those that improve their performance at some task with experience is precisely the way in which humans go about learning what's necessary to perform the same tasks that formerly took thousands or millions of hours.

For high-dimensional problems, such as text classification (e.g., spam detection) or image classification (e.g., face detection), it's almost impossible to hard-code an algorithm to accomplish its goal without using machine learning. It's much easier to use a binary spam/not spam or face/not face labeling system that, given the attributes of the example, can learn which attributes beget that specific label. In other words, it's much easier for a learning system to determine what variables are important in the ultimate classification than trying to model the "true" function that gives rise to the labeling.
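
As a small sketch of the spam/not spam idea: a naive Bayes classifier in plain Python that learns from labeled examples which words point toward each label. The six training messages are invented for illustration, the function names are hypothetical, and naive Bayes is just one convenient choice of learner here, not the only way to do this.

```python
import math
from collections import Counter

# Hypothetical toy messages, invented for illustration only.
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("claim your free prize", "spam"),
    ("lunch meeting at noon", "ham"),
    ("project notes for the meeting", "ham"),
    ("see you at lunch", "ham"),
]

def train_naive_bayes(examples):
    """Count words per label; the counts are the whole model."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab

def log_odds_spam(text, model):
    """Log odds of spam vs ham with add-one (Laplace) smoothing."""
    word_counts, doc_counts, vocab = model
    spam_total = sum(word_counts["spam"].values()) + len(vocab)
    ham_total = sum(word_counts["ham"].values()) + len(vocab)
    score = math.log(doc_counts["spam"] / doc_counts["ham"])
    for word in text.split():
        if word not in vocab:
            continue  # ignore words never seen in training
        p_spam = (word_counts["spam"][word] + 1) / spam_total
        p_ham = (word_counts["ham"][word] + 1) / ham_total
        score += math.log(p_spam / p_ham)
    return score

def classify(text, model):
    return "spam" if log_odds_spam(text, model) > 0 else "ham"

MODEL = train_naive_bayes(TRAIN)
print(classify("free cash prize", MODEL))
print(classify("meeting notes at lunch", MODEL))
```

Nobody wrote a rule saying "prize" is suspicious; the per-word log ratios fall out of the labeled examples, which is the point of the comment above.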

tapertaper · on Nov 6, 2011

Great comment.

Probably also worth speculating on why this is happening NOW. Why is this breaking out of CS departments in 2011 and not 2002?

The datasets are new.

Bandwidth? Storage capacity? Computing power? All of the above?

law · on Nov 6, 2011

Actually, this has been actively researched since ICs started gaining widespread usage in the 1970s! Even before that there were plenty of journal papers produced that deal with the basics of ML and AI.

It wasn't until the 1990s, when computers started becoming reasonably priced and more accessible to researchers and hobbyists, that we began seeing exponential growth in the amount of research output. In many ways, one could argue that the proliferation and development of AI has very much followed Moore's law, since these are extremely complex and costly calculations.

Bandwidth increases have certainly increased the availability of data sets (Google has its entire ngrams data set fully available, and it's multiple terabytes in size), but storage capacity (hard disk, RAM, and CPU cache) and computing power have really formed the bottleneck. It's not just storage capacity, either: I/O read/write times are also immensely important. It's all just a huge balancing act right now.

zeratul · on Nov 5, 2011

There was a comment here from SandB0x saying that ML has great potential for startups. ML algorithms have many practical applications and avenues that are commercializable.

gms · on Nov 5, 2011

Because web companies are starting to use ML nowadays, and most of HN's users are associated with the web-company crowd.

kky · on Nov 5, 2011

I love that open source mentality (sharing and collaborating for the love of the work, community, and result) is reaching higher ed. I can't wait for it to reach lower ed! If kids start seeing this model at a young age...

ya3r · on Nov 5, 2011

As Tom Mitchell says in the first video, this course is recommended for PhD students.

igrekel · on Nov 5, 2011

Cool. I'm disappointed that there isn't a video for hidden Markov models and other models for time series though, just slides. The schedule says that session is in March, so maybe by then there will be a video online.

zeratul · on Nov 5, 2011

The three most important issues in ML are missing from this course:

* Feature selection, Overfitting, Bias-Variance tradeoff

Maybe one of Prof. Mitchell's students can make the missing slides available online?

law · on Nov 6, 2011

If I'm not mistaken, that was just a recitation that replaced the regular Thursday class. It was one of the TAs covering that stuff briefly. All three topics were covered by Tom Mitchell in previous classes.