I love HTML scraping. But Javascript???...The juiciest data sets these days are ...

freshhawk · on Dec 9, 2012

PhantomJS and CasperJS.

If you can't get the data from the endpoints the JavaScript hits, then write your scraper in JavaScript and have it run in a headless browser. It's the WebKit engine, so most sites are tested heavily against it.

Either pull the data out of the javascript objects or trigger your extraction from the html by attaching to the events in the javascript.
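
To illustrate the "pull the data out of the javascript objects" route without a browser: a minimal sketch, assuming the page embeds its data as an object literal that also happens to be valid JSON. The page source, the `pageData` variable and the `extract_js_object` helper are all made up for illustration. Literals with unquoted keys or function calls would still need a real JavaScript engine.

```python
import json
import re

# Hypothetical page source: the data sits in an inline script as a
# JavaScript object literal that is also valid JSON (an assumption).
PAGE = """
<html><body>
<script>
  var pageData = {"items": [{"name": "widget", "price": 9.5},
                            {"name": "gadget", "price": 12}]};
  render(pageData);
</script>
</body></html>
"""

def extract_js_object(html, var_name):
    """Return the JSON-compatible value assigned to `var_name`, or None."""
    match = re.search(r"\bvar\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    # raw_decode parses exactly one JSON value starting at the given index
    # and ignores what follows, so the trailing ';' does not matter.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj

data = extract_js_object(PAGE, "pageData")
names = [item["name"] for item in data["items"]]
print(names)
```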

frabcus · on Dec 9, 2012

I've started playing with zombie.js recently as well. It's much lighter and faster than the ones that instrument a complete browser, but it still has a full JavaScript engine.

jonpaul · on Dec 9, 2012

zombie.js is not a full browser. It's a poor emulation built on top of jsdom: http://zombie.labnotes.org/guts Beware: for some applications, jsdom is super buggy.

freshhawk · on Dec 9, 2012

That's really interesting, thanks.

I worry that it's not going to replicate a real browser accurately enough, but I'm excited to try it out a bit.

jonpaul · on Dec 9, 2012

Your worry is correct. http://news.ycombinator.com/item?id=4896054 I've tried scraping with it, and it failed miserably on some sites.

frabcus · on Dec 10, 2012

Yeah, it's not mature enough yet.

We're also trying it for integration tests, as it is much quicker than Phantom or Selenium. Even there, where we control the standards-compliant site, it isn't quite good enough yet.

Would love to see more people helping make it so, though!

suldan34 · on Dec 9, 2012

Upvote for CasperJS - it's definitely the best system I've come across for scraping JavaScript / AJAX content.

level09 · on Dec 9, 2012

That is actually very simple, and you can even use a headless browser to execute javascript:

First install Xvfb and pyvirtualdisplay, then try this snippet: https://gist.github.com/4243582

Selenium is great; it can even wait for AJAX requests to finish (see WebDriverWait).
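
The idea behind an explicit wait is just polling a condition until it holds or a timeout expires. Here is a rough sketch of that idea in plain Python. It is not Selenium's implementation; `wait_until` and `FakePage` are hypothetical names, and with Selenium you would use WebDriverWait itself rather than this helper.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)

class FakePage:
    """Stand-in for a page whose AJAX content shows up after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def find_rows(self):
        self.polls += 1
        return ["row 1", "row 2"] if self.polls >= self.ready_after else []

page = FakePage(ready_after=3)
rows = wait_until(page.find_rows, timeout=2.0, poll_interval=0.01)
print(rows, "after", page.polls, "polls")
```

Compared with fixed sleeps, polling returns as soon as the content is there and fails loudly when it never arrives.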

kanzure · on Dec 9, 2012

> first install Xvfb and pyvirtualdisplay

You really don't need xvfb anymore. Use xserver-xorg-video-dummy.

RaSoJo · on Dec 9, 2012

Yippie...that worked. Thanks a ton for sharing that code snippet. Am unfortunately more of an analyst. Have only just started picking up coding for data analysis. JS scraping was one area that I always had difficulty with. Not anymore :D

bdcravens · on Dec 9, 2012

See my other comment :-)

I've been doing this on a site that is 100% Javascript-driven for over a year, very successfully.

It's really no different from hitting a static site with Selenium. Figuring out the proper XPath to use is often the biggest challenge; Chrome Developer Tools help immensely. Also, you need to watch for delays in JS rendering, so put a lot of pauses in your scripts.
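
For a feel of what those XPath expressions look like, here is a small sketch using only the standard library. The markup and class names are invented, and ElementTree handles only a subset of XPath on well-formed input, so treat it as a way to try expressions out rather than a replacement for the browser's own lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for rendered markup.
RENDERED = """
<div id="results">
  <div class="product"><span class="name">widget</span><span class="price">9.50</span></div>
  <div class="product"><span class="name">gadget</span><span class="price">12.00</span></div>
  <div class="ad"><span class="name">sponsored</span></div>
</div>
""".strip()

root = ET.fromstring(RENDERED)

products = []
# Attribute predicates such as [@class='product'] are part of the
# XPath subset that ElementTree supports.
for node in root.findall(".//div[@class='product']"):
    name = node.find("span[@class='name']").text
    price = float(node.find("span[@class='price']").text)
    products.append((name, price))

print(products)
```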

It's of course slow, so if you want to distribute it across several machines, use Selenium Grid or a queue system (SQS, Resque, etc). Set up Xvfb to run on headless Linux instances.
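
The queue pattern can be sketched in a single process, with threads standing in for machines and `queue.Queue` standing in for SQS or Resque. Everything here (`scrape`, `worker`, `run`, the URLs) is a placeholder for illustration, not anyone's production setup.

```python
import queue
import threading

def scrape(url):
    """Placeholder for the slow, browser-driven scrape of one page."""
    return {"url": url, "length": len(url)}

def worker(jobs, results):
    """Pull URLs off the job queue until a None sentinel arrives."""
    while True:
        url = jobs.get()
        if url is None:
            break
        try:
            results.put(scrape(url))
        except Exception as exc:
            # Record the failure instead of killing the worker.
            results.put({"url": url, "error": str(exc)})

def run(urls, num_workers=3):
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        jobs.put(url)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have finished, so qsize() is stable here.
    return [results.get() for _ in range(results.qsize())]

urls = ["http://example.com/page/%d" % i for i in range(5)]
scraped = run(urls)
print(sorted(r["url"] for r in scraped))
```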

NoahSussman · on Dec 9, 2012

In general, any Web functional testing tool can be used as a scraper. Scraping and testing are extremely similar. In both cases, one uses XPath or (hopefully) CSS to locate an element and examine certain aspects of that element's state. A scraper is only different from a functional test in that a scraper is focused only on the state of nodes (potentially) containing human-readable content. That, and a scraper saves the data it collects rather than discarding all data at the end of a test run.

Here's a very old Selenium 1.0 example that scrapes the full, rendered HTML of a page: http://snipplr.com/view/7906/rendered-wget-with-selenium/ After performing a scrape like this, I would then feed the HTML into a parser such as Nokogiri.
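
The "feed the rendered HTML into a parser" step might look like this in Python, using the standard library's `html.parser` in place of Nokogiri. This is a sketch; the `LinkCollector` class and the sample markup are made up.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while inside an <a> that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical rendered HTML, as it might come back from the browser.
RENDERED_HTML = """
<ul id="stories">
  <li><a href="/item?id=1">First <b>story</b></a></li>
  <li><a href="/item?id=2">Second story</a>
  <li><a name="anchor-only">no href</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(RENDERED_HTML)
print(collector.links)
```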

kybernetyk · on Dec 9, 2012

Here's a little experiment with Reddit-automation using Selenium: https://github.com/jsz/reddit_voting

RaSoJo · on Dec 9, 2012

awesome...thanks :)

alexmic · on Dec 9, 2012

If you are using Python, you can also use PyV8 to evaluate JavaScript code.

chewxy · on Dec 9, 2012

There is also Ghost.py[0]

[0] http://jeanphix.me/Ghost.py/

If you are planning to use phantomjs, import sh and it's commandline all the way to payday :D
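
A rough sketch of the command-line route, using the standard library's `subprocess` here rather than `sh`. It assumes a `phantomjs` binary on your PATH; `scrape.js` is a hypothetical script name, and both helper functions are made up for illustration.

```python
import subprocess

def phantomjs_command(script, *args, binary="phantomjs"):
    """Build the argument list for running `script` under PhantomJS."""
    return [binary, script] + [str(a) for a in args]

def run_phantomjs(script, *args, timeout=60):
    """Run the script and return whatever it printed to stdout.

    Needs a `phantomjs` binary on PATH; raises CalledProcessError on a
    non-zero exit status.
    """
    completed = subprocess.run(
        phantomjs_command(script, *args),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        timeout=timeout,
        check=True,
    )
    return completed.stdout

# Only the command is built here; nothing is executed.
cmd = phantomjs_command("scrape.js", "http://example.com/", 3)
print(cmd)
```

If the script prints JSON to stdout, the string returned by `run_phantomjs` can go straight into `json.loads`.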

kanzure · on Dec 9, 2012

Yes, but if you want the DOM you would have to use something like WebKit. So something like pyphantomjs might hit the right spot. It's a Python re-implementation of PhantomJS.

https://github.com/kanzure/pyphantomjs

RaSoJo · on Dec 9, 2012

indeed i do use Python. thanks for sharing...it appears most interesting. Have started playing with it...