but, to be fair, you've obviously had lots of experience, and understand a great deal more about formal languages and how they translate to something that is trivially parsed, than the majority of people who will read this article. should they attempt a language, they'll likely fail to produce a working parser long before they spend the rest of the hours on it.
If you can not trivially implement a parser, you probably shouldn't be implementing your own language. The problems only get worse from there.
I'm egalitarian inasmuch as I believe every serious programmer ought to implement some sort of toy language at some point, but I'm not so stupid as to think that this is a good idea at all phases of a programmer's development. Beginners should concentrate on other basic tasks, even low-intermediate really should too. I wouldn't reserve this task for "experts" though, because this is one of the big steps in moving from intermediate to expert. (Anyone who has assembled the skill set to implement a toy-but-nontrivial language has assembled the skill set to accomplish a very wide variety of programming tasks. If I were interviewing someone and they could demonstrate this, I would almost entirely cease to care what actual languages or frameworks they may have worked in.)
I agree in principle, but that won't stop someone from fighting with shift-reduce conflicts or dangling elses for hours on end before giving up. Constructing an unambiguous CFG isn't trivial without experience, and it's tough to get that experience without bashing your head a few times.
> someone from fighting with shift-reduce conflicts or dangling elses for hours on end before giving up.
Right, but if you just hand-write a recursive descent parser, you won't have to deal with shift-reduce conflicts. Dangling elses are trivial to solve. The nice thing about hand-writing a parser is that it lets you learn one new thing (how to implement a parser for a grammar) while taking advantage of what you already know (how to write, run, test, and debug code in some existing language).
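To make that concrete, here is a minimal sketch (a hypothetical toy grammar, not anything from this thread) of a hand-written recursive descent parser in which the dangling else resolves itself: after parsing the `then` branch you greedily consume an `else` if one is present, so the else always binds to the nearest unmatched `if` — no conflict tables, no grammar rewriting.

```python
# Toy statement grammar (hypothetical, for illustration):
#   stmt := "if" IDENT "then" stmt ["else" stmt] | IDENT
# In a recursive descent parser the dangling else is resolved by plain
# greediness: the innermost parse_stmt call claims the "else" first.

def parse_stmt(toks, pos=0):
    """Parse one statement from a token list; return (ast, next_pos)."""
    if toks[pos] == "if":
        cond = toks[pos + 1]                 # IDENT condition
        assert toks[pos + 2] == "then"
        then_branch, pos = parse_stmt(toks, pos + 3)
        else_branch = None
        if pos < len(toks) and toks[pos] == "else":
            # Greedy: attach the else to this (nearest) if.
            else_branch, pos = parse_stmt(toks, pos + 1)
        return ("if", cond, then_branch, else_branch), pos
    return ("exec", toks[pos]), pos + 1

# "if a then if b then x else y": the else attaches to the inner if.
ast, _ = parse_stmt("if a then if b then x else y".split())
```

Running it on the classic ambiguous input shows the else nested under the inner `if`, which is exactly the resolution most languages specify.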
Throwing a parser generator at someone means they end up learning the weird vagaries of that generator instead of focusing on their own language. Meanwhile, the resulting generated code is nigh-unreadable, so all of those debugging skills and nice IDE they have go to waste.
Good point -- also, dealing with corner cases, writing tests, and maintaining the parser as the language evolves are a few more things that can take a lot more work.
Do you know of any strategies for error reporting, or of tools that implement them? I'm always on the lookout for cleaner approaches to this.
> dealing with corner cases, writing tests, and maintaining the parser as the language evolves are a few more things that can take a lot more work.
Not really.
Seriously, if you're going to get bogged down trying to get the lexer/parser to work, you're not ready to work on a full-blown language/compiler. Lexing/parsing is the EASY part; it's a minute, insignificant fraction of the time you'll invest in the project. I really do mean that.
Wait a second. Please don't misrepresent what I'm saying. I never said anything about "get[ting] bogged down trying to get the lexer/parser to work". I never mentioned a full-blown language/compiler.
I'm genuinely interested by your comment. If I'm interpreting it correctly, you seem to be claiming that ensuring corner cases are handled correctly, testing, and maintaining a parser require a trivially small amount of work. What do you use to build your parsers?
Using a coverage analyzer is adequate for evaluating the thoroughness of the tests.
Yes, it all is a trivially small amount of work compared to the rest of a language project (and even compared with the rest of the compiler). You'll spend much more time just trying to figure out how to convert floating point values to strings.
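He's not exaggerating about float formatting, by the way. A quick illustration (mine, in Python) of why it's hard: the goal is the *shortest* decimal string that parses back to the exact same bits. Python's `repr()` does this for you; implementing it yourself in C means something like the Grisu or Ryū algorithms, which is a real project.

```python
# Shortest round-tripping float formatting, demonstrated via Python's
# built-in repr(), which implements it. The naive fixed-precision
# format produces a much longer string for the same value.

x = 1 / 3
short = repr(x)          # shortest string that round-trips
long_form = f"{x:.20f}"  # more digits, no extra information

print(short, "vs", long_form)

# The round-trip property: parsing the short form recovers the
# exact same double.
assert float(short) == x
```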
Just to clarify: I'm asking what strategy/approach you use -- hand-written vs. generator, bottom-up vs. top-down, backtracking, ambiguous, context-sensitive, semantic actions, etc. Sorry for the confusion.
I might actually do that. I see that people are persistent in believing it's hard, maybe after making these claims I have an obligation to back them up.
I think there are a few things worth mentioning here.
One is that there is a big difference between writing a parser for a language you are inventing and writing a parser that is attempting to implement an existing language. Languages have lots of corner cases; if you are inventing the language then every quirk of your parser is correct by definition. You might not even be aware of some of the subtle choices that your hand-written parser is making.
As an example of this, it was not discovered that ALGOL 60 had a "dangling else" ambiguity in its grammar until after it had been implemented, used extensively, and even published in a technical report. It was essentially an accident of the implementation that it resolved the ambiguity in the way that it did. So while it might not be too much work to "get the lexer/parser to work", it doesn't follow that all of the issues around parsing are trivial. There is still a lot of complexity and subtlety around parsing if you're trying to design something that could reasonably have multiple interoperating implementations.
Secondly, there is a very very seriously wide variation of lexical/syntactic complexity between languages. You can pretty easily write a 100% correct JSON parser in an hour or less (possibly much less, depending on what language you choose to write it in). On the other hand, it takes man-years to write a 100% correct C++ parser (not least because C++ tightly couples parsing and semantic analysis). Now I know this article is more talking about designing your own language, and no language will start out as syntactically complicated as C++, but empirically most of the languages we actually use have a fair bit of complexity to them, so delivering the lesson that lexers/parsers are easy in general is, I think, the wrong message to be sending.
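To back the "an hour or less" end of that spectrum, here is a deliberately minimal recursive descent JSON parser sketched in Python. It is not 100% correct — string escapes and strict number syntax are omitted, as noted in the comments — but it shows how little machinery the easy end of the range requires.

```python
import re

def parse_json(s):
    """Parse a JSON document. Minimal sketch: no string escapes,
    lenient number syntax, trailing garbage ignored."""
    val, _ = parse_value(s, skip_ws(s, 0))
    return val

def skip_ws(s, i):
    while i < len(s) and s[i] in " \t\r\n":
        i += 1
    return i

def parse_value(s, i):
    """Parse one value starting at index i; return (value, next_index)."""
    if s[i] == "{":
        obj, i = {}, skip_ws(s, i + 1)
        while s[i] != "}":
            key, i = parse_value(s, i)            # keys are strings
            i = skip_ws(s, i)
            assert s[i] == ":"
            val, i = parse_value(s, skip_ws(s, i + 1))
            obj[key] = val
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return obj, i + 1
    if s[i] == "[":
        arr, i = [], skip_ws(s, i + 1)
        while s[i] != "]":
            val, i = parse_value(s, i)
            arr.append(val)
            i = skip_ws(s, i)
            if s[i] == ",":
                i = skip_ws(s, i + 1)
        return arr, i + 1
    if s[i] == '"':
        end = s.index('"', i + 1)                 # no escape handling
        return s[i + 1:end], end + 1
    for lit, val in (("true", True), ("false", False), ("null", None)):
        if s.startswith(lit, i):
            return val, i + len(lit)
    m = re.match(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?", s[i:])
    text = m.group()
    num = float(text) if any(c in text for c in ".eE") else int(text)
    return num, i + len(text)
```

Getting from this sketch to full correctness (escapes, Unicode, rejecting malformed input with good error messages) is where the remaining time goes — which is rather the point of the comment above.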
> I had two options: incremental parsing, or asynchronous parsing. Clearly, since I'm a badass programmer who can't recognize my own incompetence, I chose to do incremental parsing. I mentioned this plan a few months ago to Brendan Eich, who said: "Let me know how the incremental parsing goes." Brendan is an amazingly polite guy, so at the time I didn't realize this was a code-phrase for: "Let me know when you give up on it, loser."
>
> The basic idea behind incremental parsing (at least, my version of it) was that I already have these little functions that know how to parse functions, statements, try-statements, for-statements, expressions, plus-expressions, and so on down the line. That's how a recursive-descent parser works. So I figured I'd use heuristics to back up to some coarse level of granularity (say, the current enclosing function) and parse exactly one function. Then I'd splice the generated syntax tree fragment into my main AST, and go through all the function's siblings and update their start-positions.
>
> Seems easy enough, right? Especially since I wasn't doing full-blown incremental parsing: I was just doing it at the function level. Well, it's not easy. It's "nontrivial", a word they use in academia whenever they're talking about the Halting Problem or problems of equivalent decidability. Actually it's quite doable, but it's a huge amount of work that I finally gave up on after a couple of weeks of effort. There are just too many edge-cases to worry about. And I had this nagging fear that even if I got it working, it would totally break down if you had a 5,000 line function, so I was kinda wasting my time anyway.
All of this is to say: I can't argue with your basic point that "getting lexing/parsing to work" for a language you are inventing isn't terribly difficult. But I disagree with your larger (somewhat implied) point that parsers as a whole are easy.
A couple of points: 1. The dangling else problem is trivial to deal with. 2. I've written a C++ parser; I know the issues involved, and I know that a parser generator isn't going to fix them.
> delivering the lesson that lexers/parsers are easy in general is, I think, the wrong message to be sending.
I stand by the message :-) in the sense that if a person finds lexing/parsing to be hard, they're likely to find the semantic/optimization/codegen parts of the compiler to be insurmountable.
I've written compilers for numerous languages, including C++, including 2 languages I invented, and lexing & parsing is just not that hard relative to the rest of a compiler.
Sure, but you argue that problems of the sort you mention come from ambiguous grammars.
So, hand-written parsers may not flag ambiguous grammars, while automatically generated parsers might (I've only done hand-written parsers, so I don't know).
And the Steve Yegge quote just shows how much abstract languages are something you need to learn rather than something you can power your way through. Plenty of good programmers can power their way through almost any other kind of programming challenge, so someone who seems very smart doing a very dumb thing in parsing doesn't surprise me (I've tried that myself).
Really, they are easy. They are literally insignificant when you factor in all the hours you'll work on a language.
I wrote another one recently for a small side project. It took more time to write the unit tests for it. The parser practically wrote itself.