
You're probably using some extension and not a "strict" regular expression engine. Regular expressions describe regular languages, which are Type 3 in the Chomsky hierarchy (https://en.wikipedia.org/wiki/Chomsky_hierarchy), and they formally, provably, do not have the expressive power to describe HTML. This has already been posted in this thread, but make sure to read Larry Wall's quote in the second answer: http://stackoverflow.com/questions/6751105/why-its-not-possi...
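To make the nesting problem concrete, here is a minimal sketch in Python. It illustrates the limitation rather than proving it, and the patterns and sample strings are made up for this example. Python's `re` is itself an extended engine, but these patterns stick to plain regular features (no backreferences, recursion, or lookaround).

```python
import re

# Two attempts at "the body of a <div>": one lazy, one greedy.
LAZY = re.compile(r"<div>(.*?)</div>", re.DOTALL)
GREEDY = re.compile(r"<div>(.*)</div>", re.DOTALL)

def div_body(pattern, html):
    """Return the captured body of the first match, or None."""
    match = pattern.search(html)
    return match.group(1) if match else None

nested = "<div>outer<div>inner</div>tail</div>"
siblings = "<div>a</div><div>b</div>"

# Lazy stops at the first </div>, cutting the outer element short.
print(div_body(LAZY, nested))      # outer<div>inner
# Greedy runs to the last </div>, merging two sibling elements.
print(div_body(GREEDY, siblings))  # a</div><div>b
```

Neither quantifier can be right for both inputs, because pairing each opening tag with its own closing tag takes a counter or a stack, and a finite automaton has neither.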


Most (all?) common languages ship extended regex engines rather than strict ones.

It is really weird seeing smart people in the "no regex works" camp on this issue.

I think it's probably because they have never had to grab a few specific data points from a website.


Possibly there is some confusion about the word "parsing". If it is just a question of scanning for some substring or pattern on a webpage, sure, you can do that with a regex. But that is not what is usually considered parsing.
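A rough sketch of the scanning case in Python, assuming a hypothetical page fragment (the markup, the class name, and the `grab_price` helper are all invented for illustration):

```python
import re

# Hypothetical fragment; a real page would be fetched and be much larger.
PAGE = '<div class="item"><span class="price">$19.99</span></div>'

def grab_price(html):
    """Scan for one known pattern. No document tree is built."""
    match = re.search(r'class="price">\$([0-9]+\.[0-9]{2})<', html)
    return match.group(1) if match else None

print(grab_price(PAGE))  # 19.99
```

This pulls out one data point without knowing anything about the structure around it, which is why it works and also why it is not parsing: it silently returns nothing if the markup changes.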


Right. And that confusion is probably fueled by the existence of smart (invalid-HTML-swallowing) parsers mainly used in scraping, like Beautiful Soup and Nokogiri.
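For a feel of what "swallowing" invalid HTML looks like, here is a small sketch using `html.parser` from Python's standard library as a stand-in. It is not Beautiful Soup or Nokogiri, and the `TagLogger` class and `scan` helper are hypothetical names for this example.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record start tags and text without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)

def scan(html):
    parser = TagLogger()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.tags, "".join(parser.text)

# None of these elements is ever closed, yet no error is raised.
tags, text = scan("<p>one<p>two <b>bold")
print(tags)  # ['p', 'p', 'b']
print(text)  # onetwo bold
```

The tokenizer just reports what it sees. Libraries like the ones mentioned above go further and repair the tree, which is what makes them convenient for scraping.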



