ibains made a specific point about language processing in IDEs being different from that in (traditional) compilers. Presumably there's a state of the art there, just as there is for conventional compilers, but academia doesn't give it as much attention.
2. Cross-sectional variation. There's no such thing as "what industry wants". Every place wants something different, with only a small subset common to 80% of employers.
I don't think that really holds up. Ultimately you could say this about any small mismatch between desired skill-sets and those available, but at some point we call it hair-splitting.
If you need someone to build a parser for your IDE, you're presumably better off with an expert in traditional compilers, than with someone who knows nothing at all of language processing. You'd be better off still with someone experienced in language processing within IDEs. Less mismatch is better, even if it's impractical to insist on no mismatch at all.
Your IDE parser will be unusable if it goes bananas while you're typing the characters needed to get from one fully, correctly parseable state to the next.
It needs to be able to handle:
    printf("hello");
and also:
    prin
and also:
    printf("He
It also needs to be able to autocomplete function signatures that exist below the current line being edited, so the parser can't simply bail out as soon as it reaches the first incomplete or incorrect line.
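To make that concrete, here's a minimal, hypothetical sketch (not how any real IDE does it) of a toy parser for statements of the form IDENT '(' [STRING] ')' ';'. When a statement is broken it reports it as incomplete and resynchronizes at the next ';' or end of line instead of giving up, so the lines below the error are still parsed and stay visible to things like autocomplete:

    /*
     * Toy sketch only: report incomplete statements and resynchronize,
     * don't abort at the first error.
     */
    #include <ctype.h>
    #include <stdio.h>

    static const char *src;
    static int pos;

    static void skip_ws(void) {
        while (src[pos] == ' ' || src[pos] == '\t' || src[pos] == '\n') pos++;
    }

    /* Panic-mode recovery: skip to the next ';' or newline and resume there. */
    static void recover(void) {
        while (src[pos] && src[pos] != ';' && src[pos] != '\n') pos++;
        if (src[pos]) pos++;
    }

    static void parse_statement(void) {
        skip_ws();
        if (!src[pos]) return;
        int start = pos;
        while (isalpha((unsigned char)src[pos]) || src[pos] == '_') pos++;
        printf("identifier '%.*s'", pos - start, src + start);
        if (src[pos] != '(') { printf(" -- incomplete, no '('\n"); recover(); return; }
        pos++;
        if (src[pos] == '"') {                       /* optional string argument */
            pos++;
            while (src[pos] && src[pos] != '"' && src[pos] != '\n') pos++;
            if (src[pos] != '"') { printf(" -- incomplete, unterminated string\n"); recover(); return; }
            pos++;
        }
        if (src[pos] != ')') { printf(" -- incomplete, no ')'\n"); recover(); return; }
        pos++;
        if (src[pos] != ';') { printf(" -- incomplete, no ';'\n"); recover(); return; }
        pos++;
        printf(" -- complete statement\n");
    }

    int main(void) {
        src = "printf(\"hello\");\n"
              "prin\n"
              "printf(\"He\n"
              "puts(\"still visible below the error\");\n";
        pos = 0;
        while (src[pos]) parse_statement();
        return 0;
    }

Run against the three examples above plus one more line, it flags the second and third statements as incomplete but still reaches the last one.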
> Good error recovery / line blaming is still an active field of development.
True. But let's get the terminology straight: that's not compiler science, that's parsing science. And it's no more compiler science than parsing a natural language is.
What terminology are you talking about? Neither "compiler science" nor "parsing science" are terms I used, or that the industry or academia use.
Parsing - formal theory like taxonomies of grammars, and practical concerns like speed and error recovery - remains a core part of compiler design both inside and outside of academia.
How can you be sure that that } is the end of a particular block? This most importantly affects scoping, and in many cases it's ambiguous. IDEs do have rich metadata beyond the source code itself, but then the parser has to be aware of it.
Maybe my wording wasn't accurate; imagine the following (not necessarily idiomatic) C code:
    int main() {
        int x;
        {
            int x[];
            // <-- caret here
        x += 42;
    }
This code doesn't compile, so the IDE tries to produce a partial AST. A naive approach will match the first } with the second {, so `x += 42;` will cause a type error. But as is noticeable from the indentation, it's more plausible that there was, or will be, a } matching the second { at the caret position, and that `x += 42;` refers to the outer scope.
Yes, of course parsers can account for the indentation in this case. But more generally, this kind of parsing is sensitive to the sequence of edits, not just the current code. That makes incremental parsing a very different problem from ordinary parsing, and it's likely also why ibains and folks use packrat parsing (which can easily be made incremental).
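For what it's worth, the indentation heuristic alone can be sketched in a few lines. This is a hypothetical toy, not a description of what any particular IDE does: every '{' is pushed with the indentation of its line, and a '}' first pops any deeper open block as "assumed closed", which is roughly the guess a human makes from the layout.

    /* Toy indentation-guided brace recovery, for illustration only. */
    #include <stdio.h>

    struct open_brace { int line; int indent; };

    int main(void) {
        const char *lines[] = {
            "int main() {",
            "    int x;",
            "    {",
            "        int x[];",
            "        // <-- caret here",
            "    x += 42;",
            "}",
        };
        struct open_brace stack[32];
        int depth = 0;

        for (int i = 0; i < (int)(sizeof lines / sizeof lines[0]); i++) {
            const char *line = lines[i];
            int indent = 0;
            while (line[indent] == ' ') indent++;

            for (const char *p = line; *p; p++) {
                if (*p == '{' && depth < 32) {
                    stack[depth].line = i + 1;        /* remember where and how */
                    stack[depth].indent = indent;     /* deeply it was opened   */
                    depth++;
                } else if (*p == '}') {
                    /* Pop blocks opened at deeper indentation than this '}':
                     * assume they were meant to be closed earlier. */
                    while (depth > 1 && stack[depth - 1].indent > indent) {
                        depth--;
                        printf("assuming '{' from line %d closes before line %d\n",
                               stack[depth].line, i + 1);
                    }
                    if (depth > 0) {
                        depth--;
                        printf("'}' on line %d matches '{' from line %d\n",
                               i + 1, stack[depth].line);
                    }
                }
            }
        }
        return 0;
    }

On the snippet above it reports the inner '{' as implicitly closed and matches the final '}' with main's opening brace, which is the reading the indentation suggests.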
Partial parse state and recovery are critical. You don't want the entire bottom half of a file to lose semantic analysis while the programmer figures out what to put before a closing ).
Packrat parsers are notably faster than recursive descent parsers (also critical for IDE use) and by turning them "inside out" (replacing their memoization with dynamic programming) you get a pika parser which has very good recovery ability.
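To show where the memoization pays off (a toy PEG sketch, not tied to any real implementation): with the grammar S <- A 'd' / A 'c' and A <- 'a' A / 'a', a plain backtracking parser re-derives A from scratch when S's first alternative fails, whereas the packrat memo table turns that retry into a lookup.

    /* Toy packrat sketch: memoize each rule at each input position. */
    #include <stdio.h>
    #include <string.h>

    #define MAXLEN 64
    #define MISS   (-2)   /* not computed yet */
    #define FAIL   (-1)   /* rule fails at this position */

    static const char *input;
    static int memo_A[MAXLEN];   /* memo_A[pos] = end position of A, or FAIL */
    static int evals_A;          /* how many times A was actually evaluated  */

    /* A <- 'a' A / 'a' -- returns end position or FAIL */
    static int parse_A(int pos) {
        if (memo_A[pos] != MISS) return memo_A[pos];   /* packrat lookup */
        evals_A++;
        int result = FAIL;
        if (input[pos] == 'a') {
            int rest = parse_A(pos + 1);
            result = (rest != FAIL) ? rest : pos + 1;  /* 'a' A, else just 'a' */
        }
        return memo_A[pos] = result;
    }

    /* S <- A 'd' / A 'c' */
    static int parse_S(int pos) {
        int after = parse_A(pos);
        if (after != FAIL && input[after] == 'd') return after + 1;
        after = parse_A(pos);                          /* memo hit: no re-derivation */
        if (after != FAIL && input[after] == 'c') return after + 1;
        return FAIL;
    }

    int main(void) {
        input = "aaaac";
        for (int i = 0; i < MAXLEN; i++) memo_A[i] = MISS;
        int end = parse_S(0);
        printf("parse %s; A evaluated at %d positions, retried via memo\n",
               end == (int)strlen(input) ? "succeeded" : "failed", evals_A);
        return 0;
    }

Whether that buys you anything over a deterministic parser with little or no backtracking is exactly the question raised further down the thread.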
There are various techniques to improve error recovery for all forms of parsing, but hacked-up recursive descent (certainly the most common kind of parser I still write for my hacked-up DSLs!) has poor error recovery unless you put in the work. Most LR parsers are also awful by default.
When I was in university, most of the focus was on LL and LR parsers, with no discussion of error recovery and more attention to memory usage/bounds than speed. I also have no idea how common it is to teach combinator-based parser grammars these days; stuff like ANTLR and yacc dominated during my studies. This would add another level of unfamiliarity for students going to work on a "real compiler".
> Packrat parsers are notably faster than recursive descent parsers
I think this needs to be qualified. I don't think a packrat parser is going to beat a deterministic top-down parser. Maybe the packrat parser will win if the recursive descent parser is backtracking.
True - the performance of a top-down parser depends on whether and how often it backtracks and how much lookahead it requires. But that requires control over your grammar, which you might not have, e.g. when you're parsing a standardized programming language. Practically speaking, unless you want two separate parsing engines in your IDE, the fact that Lisps are actually LL(1) and Python and Java are "almost LL(1)" doesn't get you any closer to parsing C++.
Packrat parsers are equivalent to arbitrary lookahead in either LL or LR grammars.
Tree-sitter is based on LR parsing (see 23:30 in the above video), extended to GLR parsing (see 38:30).
I've had enough of fools on HN posting unverified crap to make themselves feel cool and knowledgeable (and don't kid yourself that you helped me find the right answer by posting the wrong one). Time I withdrew. Goodbye HN.