Most of the time I try to avoid reinventing the wheel. And most of the time I fail and am forced to do so anyway. That’s what happened when I decided to write my own textual modeling framework. I’ve used (and am currently using) Xtext in several projects and have tried EMFText once. Both are great projects with great capabilities, but I came across a problem that required a different approach: I needed a more dynamic toolkit, where the language is extensible and easier to maintain.
Xtext is a true leviathan when it comes to features: it works out of the box with a few clicks and gives you a really feature-rich, customizable editor for your DSL. However, if you want something lightweight, it quickly becomes a dependency hell. Even if you don’t use Xbase, you can’t leave JDT out of your project. It’s also a pain in the ass to configure a headless parser. And on top of this, newer versions of Xtext are often not compatible with code generated by older versions, and vice versa. One of the main motivations of EMFText is to address these problems: its generated code is truly standalone, depending only on a common Antlr plugin.
But both technologies come with a bunch of generated code to keep up to date and maintain. Of course, the generated code can be removed from version control and the generation itself moved into a build script to run on CI, but some parts are generated once and then extended by hand (e.g. scoping in Xtext). So I wondered: do we really need all this code generation? Couldn’t a grammar model be used as-is to parse text? I gave it a shot and found out that it’s possible.
To eliminate code generation, I had to drop Antlr and every other parser generator toolkit. The parsing algorithm has to be independent of the grammar model in use, so I decided to use an Earley parser. The obvious downside of this approach is performance, but that’s the point: trading speed for flexibility.
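To make the idea concrete, here is a minimal, hypothetical sketch of an Earley recognizer in Python (the actual project is Java/EMF and its grammar model is richer; here a grammar is just a dict mapping each nonterminal to its alternative right-hand sides, with single-character terminals):

```python
def earley_recognize(grammar, start, text):
    """Return True if `text` is derivable from `start` in `grammar`.

    An item is a tuple (lhs, rhs, dot, origin): a grammar rule with a
    current position (dot) and the input position where it started.
    """
    chart = [set() for _ in range(len(text) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for i in range(len(text) + 1):
        changed = True
        while changed:  # run predict/complete to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:
                    # Predict: expand the nonterminal after the dot
                    for alt in grammar[rhs[dot]]:
                        item = (rhs[dot], alt, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); changed = True
                elif dot == len(rhs):
                    # Complete: advance items waiting on this nonterminal
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            item = (plhs, prhs, pdot + 1, porig)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
        if i < len(text):
            # Scan: advance items whose dot sits before the next character
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == text[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))

    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(text)])
```

The chart-based structure is exactly what makes the approach grammar-independent: the same function accepts any grammar handed to it at runtime, with no generated code in between.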
A simple example
As a quick demonstration of the features, I’ve created a simple example grammar. First, as expected from any textual modeling toolkit, there is a grammar definition for the grammar model itself, which makes it possible to edit grammars in a convenient syntax-highlighting editor:
To make the above grammar work, a few hand-written parts are necessary:
- A resource factory implementation, to register the file type
- A resource implementation based on AbstractTextualResource, which connects the grammar to the resource type and delegates feature resolution to Java code.
- Extensions to register the grammar file, the resource factory and the editor
If everything goes as expected, you can try out your new language in a convenient syntax-highlighting editor:
The parsed model and the abstract syntax tree (useful for debugging the grammar) are shown in the outline view of the editor. The view gets the icons and label decorations of the elements from the generated EMF adapter factories:
Textualmodeler on GitHub: https://github.com/balazsgrill/textualmodeler/
Interesting proposal. I do agree that Xtext is a bit heavy on dependencies and their compatibility is not the best; however, EMFText sadly feels somewhat less fine-tuned to me.
The removal of code generation is not that big an issue for me: although it is possible to interpret the grammar at runtime, I believe in the performance gains achieved by optimizing the parser at generation time. Additionally, it is easier to extend generated code (if well designed, such as Xtext’s) than a generic framework (without any kind of typesafe wrapper). And I believe the Antlr guys write a better parser than I would… 🙂
What would be interesting to know is how well your approach handles bad input. That is one point where I really like Antlr/Xtext: out of the box it is handled quite well, and it can be extended with model-specific parts.
As always, you managed to point out the weak spot of an approach at first glance 🙂
The Earley algorithm cannot produce a parse if there is a syntax error in the input; currently my best approach is to determine the last character of the input that could still be parsed. However, I have an idea for modifying the Earley algorithm so that it can determine the minimal changes to the input (terminal additions or removals) that would make it valid. This would enable higher-level features like content assist.
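The “last parseable character” idea falls out of the Earley chart naturally: the furthest chart column that still contains items marks the longest prefix consistent with the grammar. A hypothetical Python sketch (not the project’s actual Java code; a grammar is a dict of nonterminals to alternative right-hand sides, terminals are single characters):

```python
def longest_viable_prefix(grammar, start, text):
    """Return the index just past the last character that could be parsed."""
    # Standard Earley chart construction: one column of items per position.
    chart = [set() for _ in range(len(text) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(text) + 1):
        changed = True
        while changed:  # predict/complete to a fixed point
            changed = False
            for lhs, rhs, dot, origin in list(chart[i]):
                if dot < len(rhs) and rhs[dot] in grammar:      # predict
                    for alt in grammar[rhs[dot]]:
                        item = (rhs[dot], alt, 0, i)
                        if item not in chart[i]:
                            chart[i].add(item); changed = True
                elif dot == len(rhs):                           # complete
                    for plhs, prhs, pdot, porig in list(chart[origin]):
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            item = (plhs, prhs, pdot + 1, porig)
                            if item not in chart[i]:
                                chart[i].add(item); changed = True
        if i < len(text):                                       # scan
            for lhs, rhs, dot, origin in chart[i]:
                if dot < len(rhs) and rhs[dot] not in grammar and rhs[dot] == text[i]:
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
    # The furthest non-empty column is where parsing could no longer continue.
    return max(i for i in range(len(text) + 1) if chart[i])
```

A column going empty means no rule could consume the next character, so the returned index is a natural place to report a syntax error to the user.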
That first glance is cheating: I teach how important it is to handle incorrect input, and I regularly test students’ code in this regard. 🙂
This minimal-changes approach sounds interesting; however, I do not know the internals well enough to judge how well it will work.
But again, it is an interesting experiment; I would be glad to hear more about this.