
Regular expressions match regular languages (hence the name). If your language involves matched pairs of things (e.g. HTML), it's not regular. Perl hacked in support for this via backreferences and other extensions, but these are slow and illegible. Use a proper context-free grammar parser if you need to parse a context-free grammar, you know?

More broadly, people fear and misunderstand regexes because they have no idea how they work. It becomes much easier if you understand how they map to deterministic finite state machines. Recommended reading: https://swtch.com/~rsc/regexp/regexp1.html

Once you understand how they work, you can basically read a regex left to right and intuitively know all the strings it would match. There is no such thing as an unmaintainable/illegible basic regex - they're just words with some placeholders in them - it's when you cram in extended functionality (which is basically a programming language where all the keywords are single characters) that shit hits the fan.
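
To make the state-machine view concrete, here is a hand-written DFA for the regex a(b|c)*d, sketched in Python (illustrative only - real engines construct this table automatically):

    # One state per "position" in the pattern; transitions keyed on input.
    DFA = {
        0: {"a": 1},                   # start: must see 'a'
        1: {"b": 1, "c": 1, "d": 2},   # loop on b/c, finish on d
        2: {},                         # accepting: no further input allowed
    }
    ACCEPTING = {2}

    def matches(s):
        state = 0
        for ch in s:
            state = DFA[state].get(ch)
            if state is None:
                return False           # dead end: no transition for ch
        return state in ACCEPTING

    assert matches("abcbd") and matches("ad") and not matches("abx")

Reading the regex left to right is exactly walking this table.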



>Use a proper context-free grammar parser if you need to parse a context free grammar, you know?

The thing is that regular expressions are supported as a language feature or in the standard library of pretty much every language. If you want to build a proper parser, you have to jump through a lot more hoops. For instance, for C++ I've tried tons of different lexer and parser generators and they all suck for various reasons (verbose syntax, global variables, C only, lexer not compatible with parser and vice versa, ...). Most people seem to end up writing their own parsers from scratch.

The only time I've ever seen parsing done right is with Parsec for Haskell.
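
For anyone who hasn't seen the combinator style Parsec popularized: each parser is a function from input to (result, remaining input), and combinators glue small parsers into bigger ones. A rough Python sketch of the idea (Parsec itself is far more polished):

    def char(c):
        def p(s):
            if s.startswith(c):
                return c, s[1:]
            raise ValueError("expected %r" % c)
        return p

    def many(p):
        def q(s):
            results = []
            while True:
                try:
                    r, s = p(s)
                    results.append(r)
                except ValueError:
                    return results, s
        return q

    print(many(char("a"))("aaab"))  # (['a', 'a', 'a'], 'b')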


Obligatory "You can't parse HTML with Regex" link: http://stackoverflow.com/a/1732454/1935918


I really hate that link. Not only is it uninformative, I disagree with it. If you know the shape of your html data, parsing it with regular expressions can be much easier than with a dedicated parser.


> If you know the shape of your html data, parsing it with regular expressions can be much easier than with a dedicated parser.

If you know the shape of your HTML in advance, you're hardly "parsing" it. :)


You might be interested in OMeta, whose goal is to make language experimentation quicker and more accessible by providing a kind of interface layer between a declarative grammar and a host language. I'm still reading the paper, so I can't vouch for it yet. But it's from VPRI and has Alan Kay's backing.[0]

[0] http://www.tinlizzie.org/ometa/


I really enjoyed learning OMeta, and I'd recommend playing with it to others. However, the performance of the JavaScript implementation is really bad. It uses exceptions as a primary mechanism for control flow, which is not generally well optimized in JS. I observed a toy calculator grammar parsing a string about 20 characters long throw and catch hundreds of exceptions.

I've had good success with Jison as a JS parser generator that is performant enough to feel good about using in production.


It's a mistake to write your own parser, because parsers get complex very quickly even for basic things, and you'll lose sight of the original goal of the project. See Prolog - a whole programming language built around the idea of parsing language (albeit NLP)!

I suggest you take a look at ANTLR 4 for a powerful but easy-to-use parser-plus-lexer combo.


I tried Antlr, even bought the book, and at least for C++ I found it pretty unworkable. Even with the book the documentation felt very incomplete. Maybe things work better on the Java side.


Antlr is shit; it will waste your time for anything real. Bison+flex (once you figure out the quirks, undocumented assumptions and flags) and treetop are usable.

https://www.gnu.org/software/bison/manual/

http://flex.sourceforge.net/manual/

http://treetop.rubyforge.org

Disclaimer: A long time ago in a galaxy far, far away, I wrote an optimizing Java-to-MIPS compiler (sans GC, so a leaky heap) in C++ using Flex/Bison, and again in Java using JavaCC.


Most apps link against PCRE.


Except nobody uses strict regexes anymore. In particular, without the minimal match (*?) operator, they're of limited utility in real-world situations. It's easy to tell people to use a context-free grammar parser, but the ones that I know of require a lot more setup than a one-liner regex. There's probably an opportunity for someone to really improve the practice of programming by coming up with a compact language to specify and match structures in CFGs, or some subset of them.
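
For reference, here is the difference minimal match makes, in Python (the semantics are the same in Perl/PCRE):

    import re

    html = "<b>one</b> and <b>two</b>"
    print(re.findall(r"<b>(.*)</b>", html))   # greedy:  ['one</b> and <b>two']
    print(re.findall(r"<b>(.*?)</b>", html))  # minimal: ['one', 'two']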


JSON-Schema and XML-Schema provide the "specify and match" functionality, in the sense of deciding whether or not a given instance of a document conforms to a defined standard. For simple regex-style queries there are XPath and JSONPath expressions.

If you want to actually touch them as plain objects it's a very straightforward task of implementing a data provider to marshal the objects. In Java this is provided by libraries such as Jackson (JSON) and JAXB (XML). These basically work just like Hibernate.

JSON and XML cover many of the real-world use cases of parsing such CFGs. If you make your data fit into one of those boxes it's very straightforward to validate or marshal them according to those schema protocols. They obviously do have a greater complexity than just writing a regex, but that's kind of the nature of using a more expressive language, and it's by no means an insurmountable increase.
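
A minimal sketch of the schema-validation side in Python, assuming the third-party jsonschema package:

    import jsonschema

    schema = {
        "type": "object",
        "properties": {"id": {"type": "integer"}},
        "required": ["id"],
    }
    jsonschema.validate({"id": 42}, schema)      # passes silently
    # jsonschema.validate({"id": "42"}, schema)  # would raise ValidationError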


Minimal match is still OK: it only affects disambiguation and won't allow you to write a pattern that does not denote a regular language.


"There is no such thing as an unmaintainable/illegible basic regex..."

To someone who knows BRE. I am one of those people. It's ERE and PCRE I do not understand very well.

Sharing solutions to common problems using BRE on HN always seems to trigger (unwarranted) criticism using either of the exact words you mention, or synonyms for them. "Unmaintainable" (by whom?). "Illegible" (to whom?).

I "maintain" 100's of BRE scripts. They are perfecty legible to me. None of them are so complex I cannot re-write them in a short time. It is the structure of the input that is complex and which takes time to recall.

I also use lex, a common utility found on almost all UNIX-derived OSes; this article seems to ignore that option. I like to think it's faster than Perl or Python, but I cannot say for sure.


> I like to think it's faster than Perl or Python, but I cannot say for sure.

It is. I implemented an assembly language parser with pyparsing. It worked okay but the function call overhead with a combinator-based parser handling both the lexing and grammar was murder. I replaced it with regexes and got a 6x speedup. Not something I would do with a complex grammar though. Native code would obviously blow this away in speed but it is fast enough now.
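
The shape of the replacement, roughly (a toy sketch, not the real grammar):

    import re

    # One compiled pattern does both the lexing and the "parsing" of a line.
    INSN = re.compile(r"""
        \s*(?P<label>\w+:)?         # optional label
        \s*(?P<mnemonic>[a-z]+)     # opcode
        \s+(?P<args>[\w#,\s]+)      # operand list, split further if needed
    """, re.VERBOSE)

    m = INSN.match("loop: addi r1, r1, #4")
    print(m.group("label", "mnemonic", "args"))
    # ('loop:', 'addi', 'r1, r1, #4')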



I use regexes all the time for parsing data on a variety of 3rd-party websites. Just because they aren't perfect at matching every potential theoretical situation doesn't mean they shouldn't be used. In practice, regexes can be a simple and reliable way to grab data out of HTML. Don't dismiss them out of hand!

Another point, generally overlooked by the theoretical purists, is that HTML in the wild is rarely correct, and your perfect HTML parser will barf when trying to process it. Regexes, on the other hand, don't have to care about exact syntax and can cope with horribly mangled data.


An ordinary regex obviously can't parse html, because html is not regular (given nested elements and the pumping lemma). But what you can easily do with a regex is to tokenize html - extract anything that looks like a start tag, for example. The simple approach will obviously get some things wrong - a tag inside a comment is meaningless, for example - but for a lot of uses, this difference simply isn't important.

If the job is to extract all links from a webpage, regexes will do just fine (see the sketch at the end of this comment), and will probably be easier to write and understand than alternate approaches. (This is absolutely not a fair comparison, of course; you could compare writing a regex engine to writing an html parser. But I digress.)

If the job is to determine whether a given webpage is a member of the set that includes all valid html documents, then a regex is not sufficient.

If the job is to extract a list of syntax tokens from a webpage, a regex is likely fine.

If the job is to assign semantic meaning to every token in that list, a regex just won't work.

Either way, the point is to know what you're doing. Much "parsing" of webpages is not parsing in the formal language sense, and who cares that it isn't, because it doesn't need to be.
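
For the link-extraction case, the whole job is a few lines of Python (a sketch - note how the commented-out tag sneaks in, per the caveat above):

    import re

    page = '<p><a href="/one">1</a> <!-- <a href="/x">x</a> --></p>'
    links = re.findall(r'<a\s[^>]*href=["\']([^"\']+)["\']', page, re.I)
    print(links)  # ['/one', '/x'] - the link inside the comment matches too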


No, tasks like "extracting all links from a webpage" are absolutely trivial using an HTML parser. Run the following XPath query on an HTML object:

  //a[@class='specified_string']/@href 

Yes, you have to understand the XPath syntax to write such expressions, just like you have to know the language of regexes. Or at least be able to google them.
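
The same query from Python via lxml, for instance (a sketch; it assumes the page really uses that class name):

    from lxml import html

    doc = html.fromstring(
        '<div><a class="specified_string" href="/x">x</a></div>')
    print(doc.xpath("//a[@class='specified_string']/@href"))  # ['/x']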

The answer to "who cares" is "you", because you're the one who's going to catch hell when your regex failed to capture some hyperlink that utilized some feature of XML that exceeded your test-cases. The one-liner above is guaranteed to Just Work on all valid XML documents, so why even create such a monstrosity?

Everyone knows that Regexes Cannot Parse HTML, and yet people still try it because they think they're smarter than Noam Chomsky. The real truth is that everything looks like a nail to these people, because all they have is a hammer.


Because valid documents are rarer than documents where regex parsing is good enough.


Then you're not talking about parsing HTML/XML, are you? How could you possibly know which links or syntax tokens are actually going to be displayed on a page if you feed the browser's parser an invalid document?

There are fault-tolerant HTML parsers like TagSoup that are specifically designed to handle dirty HTML and spit out a valid document object. If you have sources that are malformed badly enough that it's still not working, you can define custom SAX properties to handle them. But a task like that is certainly a best-effort affair, and the interpretation of such a library is no more or less valid than the interpretation of the browser's parser. It's not a valid document to start with, and nothing can make it so.

If you are only parsing values out of a single specific data template, you know it's not going to parse as HTML or XML, you know that it's never going to contain weird values, and you know it's never going to change - then go hog wild. But it's fundamentally a brittle approach that only holds as long as those assumptions do. I've made the mistake of believing some of those about my data and it's bitten me before. And I really question the implicit assertion that "most html parsing" would fall into that exceedingly narrow category. Especially after a couple years of feature creep.

Just keep your logic general and normalize your data. Offer a failover to a fault-tolerant parser in your data layer with a logged warning. This is much more durable and doesn't silently generate invalid tokens or silently fail to capture valid tokens. Regexes simply cannot offer the capability to fail loudly. So once you are no longer actively babysitting your custom regex parser it could have started failing at any time - how would you even know?


Good HTML parsers typically use browser engines and will actually match websites very well. They let you use selectors instead of regexes and are a lot more accurate.

It also won't break if the website adds a single attribute or a quirky value.


Nor will regexes...


It is not a question of "theoretical purity"; it is a fact that it is simply not possible in the real world to parse an html document using only a regex. But this is often misunderstood as saying that regexes shouldn't be used in parsing. Regexes are quite useful as part of a parsing process, e.g. to locate the tags and attributes and separate them from the text. This is the scanning or tokenization step.

After this, though, you need to build a hierarchical data model corresponding to the DOM, and you need to do that outside of a regex. A regex outputs only a string or a flat list of strings (in all regex libraries I know of), not a hierarchical data structure, so it follows that you need something more than a single regex to parse a hierarchical structure like a programming language or an html document.
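
To illustrate: in the sketch below the regex does the tokenization, while the stack - ordinary program logic outside the regex - supplies the hierarchy:

    import re

    TOKEN = re.compile(r"<(/?)(\w+)[^>]*>|([^<]+)")

    def parse(src):
        root = {"tag": None, "children": []}
        stack = [root]
        for closing, name, text in TOKEN.findall(src):
            if text:
                stack[-1]["children"].append(text)
            elif closing:
                stack.pop()   # naive: assumes well-formed, non-void tags
            else:
                node = {"tag": name, "children": []}
                stack[-1]["children"].append(node)
                stack.append(node)
        return root

    print(parse("<p>hi <i>there</i></p>"))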

The only exception would be if you need to extract very specific data items which are not hierarchical in nature - e.g., you want to extract only the specified character encoding or something like that.

Your point about the "perfect HTML parser" is kind of missing the point, since regexes do not magically solve the problem of imperfect HTML either. A perfect HTML parser would be one which implements the HTML spec fully, including the error recovery rules. These are quite complex: e.g., if you reach a <p> then an open <h1> is implicitly closed, but an open <i> is not. Try implementing this logic in a single regex! ("Regexes don't have to care about exact syntax" - what does that even mean? A regex matches exactly what you tell it to match, just like a parser parses what you program it to parse.)


I guess I must be doing something impossible then. I've been happily screen-scraping many different websites with regexes over the last ten years. The data I gather is critical to my work. And yet, regexes are fine.

Your viewpoint is the exact problem I'm complaining about. You think what I am doing is impossible. How can you be so certain? That's the sign of someone who is too tied up in getting something perfect.

In fact, while I've hit many problems in gathering data over those years, I can't think of any problem that has been because of regex limitations. And none of my regexes are overly complicated or rely on obscure extensions. Some sites get redesigned and I have to adjust my code, but those redesigns would break any kind of parsing as the pages got completely redesigned (and the URLs and site structure often change as part of this...)

If you need to gather data from a website, your real problems will be in the networking and reliability side of things.


It's possible to a certain degree, but it's just way, way harder than it would be with a proper parser. For example, let's say I want to get all the recent question links off http://stackoverflow.com/questions/new . With a regexp, I can write a clever pattern like `<a href="(.*?)" class="question-link"` and _hope_ that the attribute order never changes, that the class never comes before the href, that they don't change it in a redesign or on other pages, and that the href does not contain `class="question-link` inside it - and to a degree that's valid.

The thing is - the alternative is to write a query selector, which is another, more suitable domain-specific language for making selections, only on the DOM instead of text. I'd just write `$(".question-link").get().map(x => x.href)` to get the hrefs, and I know it's __always perfectly safe__. Now that example is trivial; if I only want links where the questions are tagged with C#, I get a much harder problem with regex, but with query selectors it's still mostly trivial.

So, it's not that it's particularly hard to use regular expressions to solve it; it's just a lot harder than the alternative, which is super simple and obviously correct.
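
The same idea outside the browser, sketched with Python's BeautifulSoup (assumes the beautifulsoup4 package):

    from bs4 import BeautifulSoup

    page = '<a class="question-link" href="/questions/1">Q1</a>'
    soup = BeautifulSoup(page, "html.parser")
    print([a["href"] for a in soup.select("a.question-link")])
    # ['/questions/1']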


Your method relies on the class names not being renamed (e.g. I see "question-hyperlink" on /questions, not "question-link"). I'd skip the class entirely and match on the URL, since I doubt stackoverflow want their URLs to break:

  /<a[^>]*href="(\/questions\/\d+[^"]+)"/i

But... we can go back and forth posting examples, finding fault in any regex that I post or any selector that you post. It's missing the point. Both methods are at the mercy of web page redesigns. Both methods can be made more robust against certain changes, but cannot survive other changes. You are trying to say that regexes won't work. I am saying that both methods work.


Look, you have essentially two options here: 1) add a full-featured DOM parser to your program - do that if you have to understand the DOM; 2) write a regex - do that if you don't care about the DOM, you know how the server formats the data you need, and it does so in a reproducible way.

> (...) and that they don't change that in a design or in other pages (...)

Well if they change the design of the pages then you will have to rewrite your regex accordingly in order to find the data you need. But if that happens, odds are high that your program, which uses a full featured DOM parser, will have to be rewritten as well in order to handle the modified output of the DOM parser...


In fact, in practice, I find that scrapers using regexp matchers are more robust against the kinds of template changes that people make than scrapers based on the html tree structure.


You're probably using some extension and not a "strict" regular expression engine. Regular expressions describe regular languages, which are Type 3 in the Chomsky hierarchy (https://en.wikipedia.org/wiki/Chomsky_hierarchy), and they formally, provably, do not have the expressive power to describe HTML. This has already been posted in this thread, but make sure to read Larry Wall's quote in the second answer: http://stackoverflow.com/questions/6751105/why-its-not-possi...


Most (all?) common languages use extended-variety regex engines.

It is really weird seeing smart people in the "no regex works" camp talking about this issue.

I think it's probably because they haven't ever had to grab a few specific data points from websites.


Possibly there is some confusion about the word "parsing". If it is just a question of scanning for some substring or pattern on a webpage, sure, you can do that with a regex. But this is not what is usually considered parsing.


Right. And that confusion is probably fueled by the existence of smart (invalid html swallowing) parsers mainly used in scraping, like beautiful soup and nokogiri.


Maybe you are indeed doing the impossible, but I consider it more likely that you are either (a) using some parsing logic (like a stack, recursion, or some such) intermingled with the pure regexes, or (b) not extracting hierarchically structured data from the html.


Could you please stop insisting that I'm doing something impossible? Do you think I am imagining my work? I assure you that no backtracking or recursion are involved. Stacks? Well, sometimes I extract bits of a page with one regex and process it more with further regexes in another function, so I guess you could argue that the language uses a stack for a function call, but still... please stop grasping.

I'm very sure that you could invent some website that relied on some hideous deeply recursive complicated structure, and it would be painful to grab data from it. In my experience, those cases are extremely unlikely. If you let these extremely unlikely situations stop you from picking a simple and useful solution for all the other cases, you are indeed getting tied up in trying to be perfect.


I'm not suggesting that your code doesn't work, if that is what you're thinking! It's just a terminology thing.

What you describe ("sometimes I extract bits of a page with one regex and process it more with further regexes in another function") sounds like a recursive descent parser, which is a very common way to implement a simple and fast parser.

I'm not suggesting you can't parse html with the help of regexes; I'm just stating that you cannot parse html with only a regex: you cannot apply a single regex to the html and get something similar to a DOM out. You need some kind of stack or recursive programming logic in order to extract a recursive or hierarchical data structure from a string.
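
As a toy illustration of that pattern - the regex finds the tokens, the recursion supplies the hierarchy - for a grammar like expr := NUM ('+' expr)?:

    import re

    TOKENS = re.compile(r"\d+|\+")

    def parse(src):
        toks = TOKENS.findall(src)
        pos = 0
        def expr():
            nonlocal pos
            node = int(toks[pos]); pos += 1        # NUM
            if pos < len(toks) and toks[pos] == "+":
                pos += 1
                node = ("+", node, expr())         # recurse on the rest
            return node
        return expr()

    print(parse("1+2+3"))  # ('+', 1, ('+', 2, 3))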


But what's the comparative advantage over a faster, easier to comprehend, and more correct html parser?


I think the word parsing is being used in some cases that are better described as web scraping.

If you are just scraping a little data, a regex can be a quick and easy way to get what you need and get on with your life.

If you actually care about the html structure, regexes are not sufficient. They are still useful for matching tokens though.


Not sure I understand the relevance of this comment to the post in question. Unless your objection is that using capture groups in this way is an example of 'cramming in extended functionality'?

On the other hand, this post has spawned the usual regex thread including a zalgo link and a pointer to a regex that finds prime numbers, so I guess you've done your part to perpetuate regex mythology.


That's too optimistic. Here's a challenge for you: write a (grep) regular expression that checks a decimal number for divisibility by 7.

That's totally a regular language, but I doubt you'll find a legible way to express it.


If the number N can be converted to a string of length N, this regex checks if it's divisible by 7:

    ^(.......)+$
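
Quick check in Python:

    import re

    for n in (14, 20, 21):
        print(n, bool(re.fullmatch(r"(.......)+", "." * n)))
    # 14 True / 20 False / 21 True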


That's not a solution. I asked for something that checks decimal numbers. A real solution looks uglier: check out https://github.com/matthiasgoergens/Div7, which starts

((((((6(2|9)8|6(2|9)1|5)(5(2|9)8|5(2|9)1|4)(5(2|9)7 [...]

(Unless you can convert from decimal to unary in regex, too. I doubt you can; regular languages are too weak for that.)
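
The underlying automaton is small - seven states, one per remainder, with a transition per digit; the ugliness comes from flattening those states into a single expression. A sketch of the machine the regex encodes:

    def divisible_by_7(decimal):
        state = 0                          # current remainder mod 7
        for digit in decimal:
            state = (state * 10 + int(digit)) % 7
        return state == 0

    assert divisible_by_7("343") and not divisible_by_7("100")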


That would work - use a base-1 representation.



That's not a regular language, as the grandfather comment by wonnage talked about.


What does that have to do with the language syntax..


The regexes we use are an extended form of so-called "regular expressions", which describe regular languages and finite automata; in that part of maths, the language of binary numbers divisible by N is a very common toy problem.


I agree with the general gist of your post, but I should point out that all finite languages are regular. So while potentially infinite HTML documents cannot be parsed by regular expressions, they don't turn up very often.


You're missing the distinction between a finite language, and an arbitrarily large finite production of an infinite language.

No-one cares about "infinite HTML documents". I don't even think the Chomskyan hierarchy concerns itself with languages with "infinite" productions. All you have to worry about is infinite languages -- i.e., languages with arbitrarily large productions.

There's a key difference between "infinite" and "arbitrarily large": the latter is quantifiable. While indeed you can build a regular expression to match any finite subset of HTML, it can only match HTML documents up to some fixed size. I can always give you a (finite!) document that is one tag deeper, which your regex will choke on.

"But recursive parsers have the same issue!" you say. "Their stack will run out of memory at some point!" Yes, but they have a key difference: the amount of stack (memory) they require is bounded by the size of the document. This is not true for a regular expression! In fact, not only would a regular expression to match a given subset of HTML require memory exponentially proportional† to the size of the document, the automata itself would be similarly massive!

I really wish someone came up with and promulgated a concise, handy, built-in, ubiquitous equivalent of regular expressions for, say, PEGs. The closest I've seen are DCGs in Prolog. It would make so many parsing problems easier to do correctly!

† It's possible I'm wrong about this, since it's early morning and I'm basing this off my intuition rather than a proof. The part about the automaton itself being exponential w/r/t the size of the document is definitely true though.


You can't always give me a larger HTML document - eventually you will run out of usable universe. Yes, I should have said 'bounded by some finite maximum size' rather than just finite, but it's not a great leap to see that in reality all documents will be bounded by such a maximum.

Finite state automata do indeed need an exponentially larger number of states compared to a pushdown automaton; I made no claim as to efficiency. The point remains - for all practical purposes, you can consider all languages to be regular, and using a stack is merely an optimisation.


For practical purposes, expressing your language using an exponentially large number of states is, for the very reasons you state, untenable.


Sure, but that doesn't change the fact that you cannot implement anything at all that accepts more than a regular language. In practice, however you implement it, you will have only implemented something equivalent to some finite state machine (due to the finite memory available to you).


> The point remains - for all practical purposes, you can consider all languages to be regular and using a stack is merely an optimisation.

This is thoroughly wrong. For “all practical purposes”, you won’t expand a non-regular language into a giant regular one with an emulated state.


You are confusing regular language with finite state machine. I don't know why there is so much resistance to this. All real machines have bounded memory available to them, thus they can only accept regular languages. Therefore, regular expressions are as powerful as any machine in existence.


Thinking about this some more, there is no reason that a large number of states has to have a proportional amount of memory. If the states are represented in binary notation (thus needing logarithmic number of bits) and the state transition function is represented as a binary decision diagram then this could be quite compressed indeed.


Interesting idea! Wait, here’s another: maybe you could take your regular grammar that accepts some large number of things that a context-free grammar it’s emulating would accept, and implement it using a stack or something. That would save space and remove the arbitrary restriction.


Well you can keep picking nits about implementation choices, but unless your stack can grow unbounded (because you have unlimited resources) then you haven't implemented a PDA and haven't removed any restriction at all. You've just optimised space usage.

If you really believe that regular expressions cannot parse any HTML document in reality (eg given that all web browsers in practice limit the nesting depth of HTML) then please present some evidence.


> If you really believe that regular expressions cannot parse any HTML document in reality (eg given that all web browsers in practice limit the nesting depth of HTML) then please present some evidence.

You can build a regular expression to match any HTML document to any fixed depth. Set that to whatever you think “all web browsers” limit HTML nesting to “in practice” – citation very much needed, I don’t believe they do – and voilà! You have produced something absolutely useless and probably several million characters long.

I don’t know what you’re arguing. I don’t think you know what you’re arguing either. It’s pointless to continue talking.


For example, see this old commit setting the WebKit maximum nesting depth to 2048. I am told it is 512 by default today. https://www.mail-archive.com/[email protected]... I'm fairly certain all browsers will impose such a limit or risk blowing their stack and crashing. Turns out there are advantages to setting upper bounds after all. Who'da thunk?

Regarding the BDD approach, it looks like somebody already implements it with excellent performance and memory characteristics: http://www.cs.rutgers.edu/~vinodg/papers/raid2010/raid2010_s...

Of course, the point (if you read back) was not to say that you should use REs for all parsing, merely to correct a commonly repeated mistake that 'REs cannot parse HTML'. They can do so just fine.


Ah, okay. That’s much more straightforward. The answer, though, remains that regular expressions that are actually regular cannot parse HTML that is actually HTML. Regular expressions can parse HTML up to a certain depth specified in advance – which is not the definition of HTML.


Now we are definitely going in circles.


IOW, study the implementation to understand the interface.


More like, know what the notation is a notation of to understand the notation.


When quoting the article, please put quotes around the text, or use the `>` character for each line you quote. Thank you.



