Hacker News | livando's comments

I caught that too. tip of the cap to the owner of that one.


love this post, congrats.


this is a f'ing great idea.


"Multi-threading is a must, when scraping at scale."

I disagree on this point. Starting with a single-threaded model allowed my team to scale quickly and with little additional overhead. What we lost in performance we gained in simplicity and developer productivity. That said, tuning and porting portions of the app to a multi-threaded system is slotted to take place within the next year.

Start with single threaded and simple, move to multi-threaded scrapers when the juice is worth the squeeze.


Or use a language where fully utilizing all CPU cores is transparent, like Elixir? There's zero complexity, you basically add 4-5 lines of code and that's it. Honestly, not exaggerating.

I've done several very amateur scrapers in the last several years, I am never going back to languages with a global interpreter lock, ever.


I'm assuming you're talking about Python, which is also "4-5 lines" to use multithreading or multiprocessing. Can you explain what's wrong with GIL languages?

Now that I think about it, it's even less than 4 lines:

from multiprocessing.pool import Pool  # or ThreadPool

pool = Pool()

pool.map(scrape, urls)
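
A runnable version of that idea might look like the sketch below. `scrape`, `scrape_all`, and the example URLs are placeholders of mine, not anything from a real scraper; `ThreadPool` is used because fetching pages is I/O bound.

```python
from multiprocessing.pool import ThreadPool


def scrape(url):
    # Hypothetical placeholder: a real scraper would fetch and parse the page here.
    return (url, len(url))


def scrape_all(urls, workers=4):
    # map() blocks until every URL is processed and preserves input order.
    with ThreadPool(workers) as pool:
        return pool.map(scrape, urls)


urls = ["https://example.com/a", "https://example.com/bb"]
results = scrape_all(urls)
```

Swapping `ThreadPool` for `Pool` gives one process per worker instead; on platforms that spawn rather than fork, that variant needs to run under an `if __name__ == "__main__":` guard.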


When the pooled functions are I/O bound then the GIL is not a problem. Any GIL language will do.

However, when generating reports, for example, try using the same tool to serialize 4 pages of DB records into 4 pieces of a big CSV file, each working on a single CPU core. There the languages without a GIL truly shine, and languages like Python and Ruby struggle unless their GIL implementations compromise and yield without waiting for an I/O operation to complete.


I'm not sure you understand how the GIL works in Python. If you're using multiprocessing, there's no locking across the code executing on each core. Also, if you're writing to the same file from four processes, you're going to need locking.
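
One way to sidestep the shared-file locking is to have each worker write its own piece and stitch the pieces together afterwards. A minimal sketch, assuming the rows are already in memory; `write_part`, `export_csv`, and the part-file naming are hypothetical names I made up for illustration.

```python
import csv
import os
import tempfile
from concurrent.futures import ProcessPoolExecutor


def write_part(job):
    # Each worker writes only its own file, so no cross-process lock is needed.
    path, rows = job
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path


def export_csv(rows, out_path, parts=4, executor_cls=ProcessPoolExecutor):
    # Split rows into up to `parts` contiguous chunks (hypothetical paging scheme).
    size = max(1, -(-len(rows) // parts))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with tempfile.TemporaryDirectory() as tmp:
        jobs = [(os.path.join(tmp, f"part{i}.csv"), c) for i, c in enumerate(chunks)]
        with executor_cls(max_workers=parts) as ex:
            part_paths = list(ex.map(write_part, jobs))
        # Concatenate the pieces in order from a single process.
        with open(out_path, "w", newline="") as out:
            for p in part_paths:
                with open(p, newline="") as f:
                    out.write(f.read())
```

With the default `ProcessPoolExecutor`, each chunk is serialized in a separate interpreter with its own GIL (call it under an `if __name__ == "__main__":` guard). Passing `ThreadPoolExecutor` keeps the same structure, but the CPU-bound serialization then contends for a single GIL.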


Last I knew, GIL languages work well in multicore scenarios as long as all N tasks have I/O calls that serve as yielding points for the interpreter, and they do not use preemptive scheduling like the BEAM VM (Erlang, Elixir, LFE, Alpaca) does.

Am I mistaken?


As far as Python goes, yes. Multicore implies multiple processes, which means that each process will have its own Python interpreter, each with its own GIL.

If you were to use multithreading instead, you would generally have a problem if you were doing non-I/O work.


Then I think we have a misunderstanding of terms. To me "multicore" == "single process, many threads". Apologies for the confusion.

It seems that now we are both on the same page. Single process & many threads are problematic for GIL languages and that's why I gave up using Ruby for scrapers. GIL languages can work very well for the URL downloading part though.


Any further information on this? Last I looked (which was a while ago), the infrastructure like HTML parsers seemed surprisingly tricky in Elixir.


The only complication is if you want to use Meeseeks (https://github.com/mischov/meeseeks), which requires the Rust compiler and runtime to be installed because it has native bindings. Meeseeks is useful because it's a bit faster than the default Floki (https://github.com/philss/floki) and because it can handle very malformed HTML.

As for Elixir itself, here's a quick example:

```

# Assume this contains 1000 URLs

urls = [....]

# This runs up to 100 tasks concurrently (lightweight BEAM processes, not OS threads); if max_concurrency is omitted, it defaults to the number of online schedulers, which normally equals the CPU core count. For I/O bound tasks however it's pretty safe to use much more.

results = Task.async_stream(urls, &YourScrapingModule.your_scraping_function/1, max_concurrency: 100) |> Enum.to_list()

```

It's honestly that simple in Elixir. For finer grained control the line count is a little bigger, but only a little. Not hundreds of lines for sure.


Meeseeks's speed difference with Floki is not that significant, and my initial findings are they've leveled out even more with OTP 21, sometimes even swinging in favor of Floki.

The better handling of malformed HTML by default is the much bigger deal.


Thank you man (I know you are the author of Meeseeks), I didn't know that. I had always understood that Meeseeks was faster than Floki, but it seems that OTP 21 largely eliminated that, as you said.

Valuable info, thanks!


It was pretty interesting to see Floki get a lot faster and Meeseeks actually get a little slower with OTP 21. I'll enjoy figuring out why. I hope to get a chance to work on the OTP 21 performance of Meeseeks before too long.

On the plus side there were some nice memory improvements for Meeseeks in OTP 21.


(off-topic alert)

Don't let this sound patronizing because it's not -- but have you looked at how many times the boundary between the BEAM and the Rust code is crossed? I haven't inspected Meeseeks' code so can't talk, just wildly guessing.

My ancient experience with Java <-> C++ bridges has taught me that if your higher-level language calls the lower-level language very often then the gains of using the lower-level language almost disappear due to the high overhead of constantly serializing data back and forth.

Anyhow, we should probably take this discussion to ElixirForum and not here. :)

(I am @dimitarvp there and almost everywhere else on the net, HN is one of the very few exceptions of inconsistent username for me).


It makes me happy to see .net developers get nice things.


Ruby 1.9.3-p545 was also released today with the following:

"This release is dedicated to the memory of our best comrade, Jim Weirich. Thank you, Jim. Rest in peace."


awesome, great way to spike out real time behavior in minutes.


With this article I was really hoping for some counterpoints showing how JEE is doing some new and exciting things. Unfortunately, it looks like the only innovation was to throw some of the dead weight off the back of the truck. I can understand how corporate developers (i.e. can't use open source and it needs a support plan) would be happy about the lighter alternative approaches. I just don't see how, given the whole technical landscape, JEE is still a viable front runner.


What? JEE is an open standard, and there are open source and closed source implementations of that standard. The idea is that applications are portable between different vendors. If anything it's even more "open source" than most open source frameworks.


Fair review, but I'm still pumped for mine to arrive. I don't mind supporting smart people trying to innovate, and I'm betting the app will be getting better all the time.


I totally agree. WakeMate is way more cost-effective a solution than any of its competitors... And the 'personal sleep analytics' (or whatever) could really help busy people make better sense of the practical ways to improve their sleep habits.

As with all cool things -- esp. cool startups -- a little patience can help. Greg and the WakeMate crew will surely continue to innovate and improve.

One so-so review is exactly that: one review. Plenty of other reviews to come.


This is great news because I was sick of all my development tools being free and good.


I would venture a guess that this is to make the Android platform more accessible to existing ActionScript developers. I can't imagine many Android devs would pay thousands to buy and learn Adobe tools when they can build the same software with the tools they know.


Or not pay thousands and just compile for free using a handful of open source tools that work just as well, perhaps even better. (e.g. http://www.adobe.com/products/flex/)


I don't think Adobe's Builder [tool] is free?


No, Flex Builder isn't free. But the Flex SDK is open source.


Flex SDK is open source. You can find it on that page.

