Parsing textual notation has a long history in computer science. Over the decades, many great and efficient parsing algorithms have been developed and optimized to the extreme by compilers for different programming languages. In recent years, however, the development of integrated development environments has accelerated, adding more and more services on top of previously simple features like code editing and automatic background compilation. Together with the rise of domain-specific language frameworks, which let the user define an arbitrary language declaratively, this means that present-day computers still struggle under the resource requirements of these technologies.
While waiting in front of a text editor, I started thinking about why the editor needs to re-run the parsing algorithm every time I hit a key (of course, I'm aware that the main cause of the sluggish response is probably not the parsing algorithm itself). A fully incremental approach could make these tools more responsive, which would remove a lot of pain for their users. As I am already familiar with an incremental query engine (IncQuery), the thought of using it to parse text couldn't leave my mind.
The following article presents the experiment I've done as a proof of concept. It is far from being useful and reveals more problems than it solves: it does not produce an AST, it just accepts or rejects the input string. However, it may be interesting as an approach, and maybe in the future it will become more than just a silly idea.
The input string as a model
To allow the IncQuery engine to work on the input string, the string has to be represented as an EMF model. Because the interesting part is the parsing itself, a tokenizer scans the input and produces a list of tokens. This approach allows a simpler parser, as it does not handle the actual token values; it only needs the token types.
For technological reasons, the token list is represented as a linked list instead of an indexed list. This caused some problems (I will describe them later), but currently it was the only way I could make it work.
My tokenizer is NOT incremental. There may be a way to create one, but it would likely depend on how the terminals are defined, and I did not want to spend too much time on it. However, my tokenizer DOES produce incremental results. When the input string changes, the tokenizer rescans it, compares the result to the previous token list, then updates the token list in a minimal set of steps.
The parser is a set of query patterns. Basically, a pattern can be defined for every non-terminal in the grammar (in Chomsky Normal Form or equivalent). Creating a pattern based on a rule is pretty straightforward thanks to the declarative nature of CNF.
Every rule pattern has the same signature: ‘pattern NonTerminal(first: Token, next: Token)’. The pattern matches a Token pair if the rule can generate the string represented by the consecutive part of the token list starting at first (inclusive) and ending at next (exclusive). The parser accepts the input string if the pattern of the starting rule matches the (FirstToken, EOF) pair, where EOF is a special token appended to the end of the token list to mark the last Token.
The body of the pattern can be created by following these simple rules:
- For every alternative rule body, a pattern body shall be created.
- For every non-terminal in the rule body, the pattern body contains a pattern call to the pattern of the referenced rule.
- For every terminal in the rule body, the pattern tries to match for the terminal in the token list.
- Consecutive constraints must be forced to match consecutive tokens by daisy-chaining their parameters, so that the first constraint starts with ‘first’ and the last one ends with ‘next’.
Example: simple infix expressions
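The example below is a sketch of how such patterns could look. It assumes a Token EClass with a ‘next’ reference pointing to the following token and a ‘type’ enum attribute (these names, like the grammar itself, are my assumptions, not the exact code of the experiment), for the toy grammar Expr → Term | Term '+' Expr and Term → NUMBER | '(' Expr ')':

```
pattern Expr(first: Token, next: Token) {
    find Term(first, next);            // Expr → Term
} or {
    find Term(first, plus);            // Expr → Term '+' Expr
    Token.type(plus, ::PLUS);          // the '+' terminal...
    Token.next(plus, rest);            // ...consumes exactly one token
    find Expr(rest, next);
}

pattern Term(first: Token, next: Token) {
    Token.type(first, ::NUMBER);       // Term → NUMBER
    Token.next(first, next);
} or {
    Token.type(first, ::LPAREN);       // Term → '(' Expr ')'
    Token.next(first, inner);
    find Expr(inner, rparen);
    Token.type(rparen, ::RPAREN);
    Token.next(rparen, next);
}
```

Note how the intermediate variables (plus, rest, inner, rparen) daisy-chain the constraints: each constraint's end token is the next constraint's start token. The whole input is accepted exactly when Expr(firstToken, eof) has a match.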
Special cases: optional and multiplied rule body parts
Although grammars containing optional and multiplied rule body parts can be rephrased in Chomsky Normal Form, such parts can also be represented directly as queries.
- For an empty rule body, the query shall contain a single line: ‘first==next;’
- For an optional part, a sub-rule can be created with two bodies: an empty rule (see above) and a body with the proper constraints:
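For instance, an optional sign in front of a number (a made-up rule, Signed → SIGN? NUMBER; the token type names are again assumptions) could look like this:

```
pattern OptionalSign(first: Token, next: Token) {
    first == next;                 // empty body: the sign is absent
} or {
    Token.type(first, ::SIGN);     // the sign is present: consume one token
    Token.next(first, next);
}

pattern Signed(first: Token, next: Token) {
    find OptionalSign(first, num); // SIGN?
    Token.type(num, ::NUMBER);     // NUMBER
    Token.next(num, next);
}
```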
- For multiplied rule parts, recursion can be used:
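For example, a rule matching one or more NUMBER tokens (NumberList → NUMBER+, a made-up rule) can call itself recursively:

```
pattern NumberList(first: Token, next: Token) {
    Token.type(first, ::NUMBER);   // exactly one NUMBER
    Token.next(first, next);
} or {
    Token.type(first, ::NUMBER);   // a NUMBER followed by more NUMBERs
    Token.next(first, rest);
    find NumberList(rest, next);
}
```

A zero-or-more variant (NUMBER*) would simply replace the first body with the empty body ‘first == next;’.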
The experiment revealed the following problems with the approach:
- Tokenizer: a truly incremental tokenizer should be created. Currently, tokenizing has O(n*l) cost (where n is the number of terminals in the grammar and l is the length of the input). Additionally, comparing the new token list to the old one has O(m*m) cost (m: size of the token list). The cost of applying the changes to the token list is proportional to the number of changes, so it is not significant. The summarized cost of updating the token list, however, is a significant bottleneck.
- Token list as a linked list: a single insert into the linked list consists of removing the tail of the list, adding the new element, then re-attaching the tail. During this process, the part of the token list after the inserted token is removed from the resource, causing IncQuery to unload it and then reload it when it is re-added (practically killing every gain IncQuery provides). An indexed list would be much better, as it has atomic insert and remove operations, but it could not be used for the following reasons:
- EMF (2.9) generates unique lists for cross-references regardless of the isUnique attribute in the ecore file. This may be resolved by using integer identifiers to denote token types instead of referencing the terminals defined by the grammar.
- IncQuery currently cannot match against the index of an element in a list; only constants can be used to index lists.
- Producing an AST: it is possible to extract which rules match at a certain position, and thus to recreate the abstract syntax tree from a particular state of the query engine, but doing this incrementally is a much more complex issue.
In conclusion, the experiment was more of a failure than a success. But that's what experiments are for: the experience and the conclusions can still be useful.