parser – cubussapiens.hu

Parsing textual input with IncQuery

Parsing textual notation has a long history in computer science. Over that time, many efficient parsing algorithms have been produced, and compilers for various programming languages have optimized them to the extreme. In recent years, however, the development of integrated development environments has accelerated, adding more and more services on top of previously simple features such as code editors and automatic background compilation. Together with the rise of domain-specific language frameworks, which allow the user to define an arbitrary language declaratively, this means that present-day computers still struggle under the resource requirements of these technologies.

While waiting in front of a textual editor, I started wondering why the editor needs to re-run the parsing algorithm every time I hit a key (of course, I am aware that the main cause of the poor responsiveness is probably not the parsing algorithm itself). A fully incremental approach could make these tools more responsive, which would remove a lot of pain for their users. As I am already familiar with an incremental query engine (IncQuery), the thought of using it to parse text would not leave my mind.

The following article presents the experiment I did as a proof of concept. It is far from being useful and reveals more problems than it solves: it does not produce an AST, it just accepts or rejects the input string. However, it may be interesting as an approach, and maybe in the future it will be more than just a stupid idea.

Yet another textual modeling framework

Most of the time I try to avoid reinventing the wheel. And most of the time I fail and am forced to do so anyway. That is what happened when I decided to write my own textual modeling framework. I have used (and am still using) Xtext in several projects and tried EMFText once. Both are great projects with great possibilities, but I came across a problem that required a different approach. I needed a more dynamic toolkit, where the language is extensible and easier to maintain.

Enabling documentation comments in LPG parsers

Sometimes I feel that using LPG parsers is like walking the fine line between total madness and fine, easy results. The results are (often) fine, as the generator produces a good enough parser with error recovery mechanisms and the like (I don’t want to compare it with other generators), but sometimes the lack of documentation can be prohibitive.

The problem: I wanted to use documentation comments and multiline comments in a Java-like syntax (so /* comment */ describes a comment, while /** comment */ describes a documentation comment). I tried to find examples or documentation about solving this problem, but couldn’t find any (at least until recently).

The solution consists of two parts: the lexer should be capable of differentiating between the two comment types (it should be able to emit the corresponding tokens), and then either the parser or the resolver should be able to read the comments.

Lexing issues

The first problem is hard, as we have to create the rules describing the comment tokens in a constructive way: there is no negation, and it is better if the two rules do not conflict. The basic solution works as follows: a multiline comment starts with two characters, '/' and '*', followed by a non-star character, and then by a sequence of characters that does not contain the closing '*' '/' pair.

So far it sounds easy, but the grammar language of the LPG generator does not support negation. Luckily we have a finite alphabet, so the number of two-character sequences is also finite, which means it is possible to write the rules described above in a constructive way:

doc ::= '/' '*' '*' CommentBody Stars '/'
mlc ::= '/' '*' NotStar CommentBody Stars '/'
CommentBody$$CommentBody ::= CommentBody Stars NotSlashOrStar |
CommentBody '/' | CommentBody NotSlashOrStar |
Stars NotSlashOrStar | '/' | NotSlashOrStar

NotSlashOrStar ::= letter | digit | whiteChar | AfterASCII |
'+' | '-' | '(' | ')' | '"' | '!' | '@' | '`' | '~' | '.' |
'%' | '&' | '^' | ':' | ';' | "'" | '\' | '|' | '{' | '}' |
'[' | ']' | '?' | ',' | '<' | '>' | '=' | '#' | '$' | '_'

Yes, it is quite hard to write, and even harder to understand at first, but at least it works.
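The long NotSlashOrStar alternation is the "constructive complement" described above: every symbol of the finite alphabet is listed except the excluded ones. A minimal Java sketch of that idea follows. The ComplementRule class and its build method are hypothetical helpers, not part of LPG, and the sketch only handles single characters, while the real rule also refers to nonterminals such as letter and digit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of LPG): spells out a "not X" rule
// by listing every alphabet character that is not excluded.
class ComplementRule {

    static String build(String ruleName, String alphabet, String excluded) {
        List<String> alternatives = new ArrayList<>();
        for (char c : alphabet.toCharArray()) {
            if (excluded.indexOf(c) >= 0) {
                continue; // this character belongs to the excluded set
            }
            // A single quote is wrapped in double quotes, as in the grammar above;
            // every other character is wrapped in single quotes.
            alternatives.add(c == '\'' ? "\"'\"" : "'" + c + "'");
        }
        return ruleName + " ::= " + String.join(" | ", alternatives);
    }

    public static void main(String[] args) {
        // Tiny alphabet for illustration; a real lexer alphabet is much larger.
        System.out.println(build("NotSlashOrStar", "+-*/'", "*/"));
    }
}
```

Generating the rule text this way does not make the grammar any shorter, but it avoids forgetting a character when the alphabet changes.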

Parsing the comments

Once the lexer understands the comments, the parser also needs to be aware of them. A (seemingly) easy way to handle this is the following: the multiline comment is exported from the lexer as a comment token, while the documentation comment is exported as a regular token. The parser can then reference documentation comments in its rules and handle them similarly to all other tokens.

On the other hand, this approach has a serious drawback: the documentation comment is no longer a comment, so it cannot be placed anywhere, only where the grammar allows it. At first this does not look like a big problem, but it causes several issues. First, if you are enhancing an existing language, this could break some existing texts (possibly created by other users), and the default error message is hard to understand. Second, the system disallows long lines of stars at the beginning of simple multiline comments, as in the following snippet:

/*********************
*                    *
*  Comment in a box  *
*********************/

The reason: a simple multiline comment may not start with two or more star characters, so this block is interpreted as a documentation comment.
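The distinction the lexer has to make can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions, not the LPG lexer: the CommentClassifier class is hypothetical, and it follows the Java convention that "/**" opens a documentation comment while the empty comment "/**/" counts as a plain one. Under that convention, a boxed comment that begins with a run of stars lands on the documentation side.

```java
// Hypothetical illustration (not LPG code): classifies a comment by its opening characters.
class CommentClassifier {

    enum Kind { DOC, BLOCK, NONE }

    static Kind classify(String text) {
        if (!text.startsWith("/*")) {
            return Kind.NONE; // not a multiline comment at all
        }
        int end = text.indexOf("*/", 2);
        if (end < 0) {
            return Kind.NONE; // unterminated comment
        }
        // "/**" opens a documentation comment, unless that second star is already
        // part of the closing "*/" (the empty comment "/**/").
        boolean opensWithDocMarker = text.charAt(2) == '*' && end > 2;
        return opensWithDocMarker ? Kind.DOC : Kind.BLOCK;
    }

    public static void main(String[] args) {
        String box = "/*****\n* box *\n*****/";
        System.out.println(classify("/* plain */"));
        System.out.println(classify("/** documented */"));
        System.out.println(classify(box));
    }
}
```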

If these problems cannot be ignored, both the comments and the documentation comments have to be marked as comment tokens, which are swallowed by the lexer, so the parser does not see any comment tokens (as it should be). The drawback of this approach is that the position of the documentation comments cannot be controlled in the parser, so the handling code has to ignore documentation comments at invalid locations.

A final step is needed to read these documentation comments: they have to be read, but they are not present in the AST, at least not directly. However, following a tip from the EclipseCon 2009 tutorial “PIMP Your Eclipse: Building an IDE using IMP”, the preceding adjuncts of an AST node can be read. This terminology was not clear to me, and the missing Javadoc did not help either, so the tutorial was a great help.

To tell the truth, I would still like to know the exact definition of adjuncts in LPG, but as far as I know, the comment tokens are included among them in their raw form. Raw form means that every character is included, including whitespace and the starting and ending characters, so they might need another parsing step with a different grammar.

As a quick solution, the following code can be used to read the first documentation comment before an element:

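// Returns the raw text of the first documentation comment preceding the node,
// or null if there is none.
public String parseDocComments(ASTNode node) {
  // The preceding adjuncts include the comment tokens in front of the node.
  IToken[] adjuncts = node.getPrecedingAdjuncts();
  for (IToken adjunct : adjuncts) {
    String adjString = adjunct.toString();
    // Only documentation comments start with "/**"; plain comments are skipped.
    if (adjString.startsWith("/**")) {
      return adjString;
    }
  }
  return null;
}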
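The string returned this way is still in raw form, with the delimiters and the decoration stars included. Short of a second grammar, a minimal clean-up could look like the sketch below. The DocCommentCleaner class is a hypothetical helper, not part of LPG or IMP, and it assumes the common Javadoc layout where each line may start with stars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: strips the delimiters and leading stars from a raw doc comment.
class DocCommentCleaner {

    static String clean(String raw) {
        String body = raw.trim();
        // Remove the opening "/**" and the closing "*/" when both are present.
        if (body.length() >= 5 && body.startsWith("/**") && body.endsWith("*/")) {
            body = body.substring(3, body.length() - 2);
        }
        List<String> lines = new ArrayList<>();
        for (String line : body.split("\r?\n")) {
            String text = line.trim();
            // Drop the decoration stars at the start of the line.
            while (text.startsWith("*")) {
                text = text.substring(1);
            }
            text = text.trim();
            if (!text.isEmpty()) {
                lines.add(text);
            }
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        System.out.println(clean("/**\n * Hello\n * world\n */"));
    }
}
```

This only recovers the plain text. Anything structured inside the comment (tags, references) would still need the separate parsing step mentioned above.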

Conclusion

Allowing documentation comments can be quite a challenge, especially getting the syntax right without any conflict between rules, but it is certainly possible.

On the other hand, the language of the documentation comments has to be defined separately, and not even the lexer can be reused from the original grammar, as it uses different terminal rules (e.g. the documentation comment must not be a comment token in this language). Even worse, having two different grammars makes it harder to provide correct coding help in the IDE (e.g. content assist, source coloring, etc.). These directions need further experimenting with the tools, but at least the solution works right now.

Generating LPG 1.0 parsers on OSX using Eclipse

In fall I began maintaining the parser of the VIATRA2 framework. Funny.

Mostly because it uses the LPG parser generator framework and, to make things worse, a very old version (v1.0) of it. A new 2.0 version has been available since 2008, but the two are not compatible at all, e.g. they define different packages in the LPG runtime. As the release was near, there was no chance of upgrading the parser, so we were stuck with version 1.0.

The problem with the old version is that, although it is written in C++, even its makefile explicitly uses the Visual C++ compiler, so simply compiling it for OSX is not possible. That means that every time I change the grammar file, I have to start a Windows binary. And I would like to do it from Eclipse.

My two options were Wine and VMware (not Parallels, because I don’t have a licence for it 🙂 ). The latter is too heavy on resources and much harder to integrate with my Eclipse on OSX, so the first choice was Wine. Luckily the Wine developers did quality work, so the LPG generator binary can be run with Wine.

The Eclipse integration is not too hard (at least in a basic form that works for a while), as there is support for running External Tools using the appropriate icon on the toolbar (or from the Run menu).

Such an External tool can be parameterized using various variables of Eclipse, of which two are needed:

${resource_loc}: the file system path (not the workspace-relative path) of the selected resource
${container_loc}: the file system location of the container (folder or directory) that holds the selected resource

The tool to launch is the Wine installation: it executes the lpg.exe binary, which it receives as a runtime parameter. Consequently both the location of the lpg.exe binary and the LPG parameters have to be written into the arguments section of the tool. Note that the location of the lpg binary can be given as an OSX path; there is no need to translate it into a Wine path, as Wine can handle native OSX paths.

LPG uses a working folder, where it puts the generated parser and AST classes. This is set using the ${container_loc} variable.

LPG needs three kinds of input: the grammar file (which can be given as a parameter to LPG; we will use the ${resource_loc} variable), an includes directory (for grammar snippets) and a templates directory (for parser and lexer templates).

The directories can either be found in the working directory (this is needed for local templates), given as parameters or set as environment variables. I chose the third option, as it seemed the most maintainable solution.

For this reason the LPG_INCLUDE and LPG_TEMPLATE environment variables have to be set on the Environment tab to the includes and templates directories, respectively.
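For readers who prefer code to dialog boxes, the same launch can be sketched with Java's ProcessBuilder. This is an illustration of the settings above, not what Eclipse does internally. The LpgWineLauncher class and all paths passed to it are hypothetical, and the sketch assumes a wine executable on the PATH and passes only the grammar file to lpg.exe, with no further options.

```java
import java.io.File;
import java.util.Map;

// Hypothetical sketch of the External Tools configuration as a plain Java launcher.
class LpgWineLauncher {

    static ProcessBuilder configure(String lpgExe, File grammar,
                                    String includeDir, String templateDir) {
        // Tool: wine; arguments: the lpg.exe location and the selected grammar file.
        ProcessBuilder builder =
                new ProcessBuilder("wine", lpgExe, grammar.getAbsolutePath());
        // Working directory: the folder containing the grammar file,
        // which is where LPG puts the generated classes.
        builder.directory(grammar.getAbsoluteFile().getParentFile());
        // Environment tab: where LPG looks for include and template files.
        Map<String, String> environment = builder.environment();
        environment.put("LPG_INCLUDE", includeDir);
        environment.put("LPG_TEMPLATE", templateDir);
        // Show the generator output on the console of the calling process.
        builder.inheritIO();
        return builder;
    }
}
```

Calling start() on the returned builder would launch the generator. The sketch keeps configuration and launching apart so the settings can be inspected without Wine installed.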

The described settings (except the environment variables) are shown in the following screenshot:

[Screenshot: Running LPG with Wine on the current selection]

After these settings are done, selecting the parser.g file makes it possible to run this new tool, which generates the various parser-related Java classes.

After running the tool, the console output of the LPG generator is shown, where all paths are listed beginning with Y:\, although the files appear in the folder structure of the Eclipse workspace.

There are some minor shortcomings of this integration. First, I cannot use the pop-up menu to execute this tool, as external tools are not listed there. Another annoyance is that the file has to be selected in the Navigator view; having it open in the editor is not enough.

This means I first have to select the file in the Project Navigator (or Package Explorer, etc.), then run the tool manually from the Run configuration menu. Quite annoying, but the grammar does not need to be changed too often.

Another problem is that the error output of the generator is not back-annotated as Eclipse errors (problem markers); only the console output is available. For a brand new grammar this would not be the best solution, but for maintenance it is enough.

The LPG IDE of the IMP (IDE Metatooling Platform) project overcomes this challenge by using a newer version of LPG that is written in cross-platform C (or C++), and by using a builder that automatically calls the LPG binary when the grammar files change; the builder results are shown as proper error messages.

This means the future of LPG development in Eclipse is the LPG IDE, but it cannot be used for legacy projects. In those cases my solution can be a good alternative.

Generating LPG parsers on OSX from Eclipse

I have won the task of updating and maintaining a parser. Yes, that sounds great. And it is.

The parser was built with the LPG parser generator, specifically with its 1.0 version. There is now a 2.0 as well, which is of course not compatible with the old one (certainly not at the level of the generated code; among other things, it uses a different Java package). Since I got the task shortly before a release, upgrading is definitely not an option right now (although it will probably be needed later anyway).

So, where was I: version 1. Everything is very nice, everything is very good, I am satisfied with everything, so I modified the grammar file. Time to regenerate the code. And here comes the catch: the old version of the LPG parser generator only comes with a Windows exe file, and that is the only way to run it. The source code is available too, of course, but even the makefile is tailored to the Visual C++ compiler. So compiling it is a hassle.

I will not even start on that, because presumably only a temporary solution is needed (1-2 years at most 😀 ). At the same time, the goal is to integrate the solution into Eclipse, that is, to start the program with a few clicks. My options: Wine or VMware.

I would not like the latter, because it eats a relatively large amount of resources, and on top of that it could only be connected easily to an Eclipse instance running inside VMware, whose keyboard shortcuts are completely different from those of the native Mac instance.

So all I could do was hope that the Wine people had done a good job. And I am lucky, because lpg.exe runs with it without any problems.

Now only the Eclipse integration is left. A great tool for this is External Tools in the Run menu (note: even on native Windows this is the only way to run lpg.exe from Eclipse; there is no better support).

We can create our own tool and parameterize it using various Eclipse variables. We need two of them:

${resource_loc}: the location of the currently selected resource in the file system (not workspace-relative!)
${container_loc}: the location of the folder containing the current resource (also not workspace-relative)

Setting the working directory is important for the LPG parser generator, as this is where it generates the files. Filling in the remaining fields is self-explanatory, so I only include a screenshot of it.

LPG needs three things to run: the grammar file or files, the include files and the template files. These can all be in the working directory (this is the case when using custom templates), or in folders designated by environment variables, or they can be passed as parameters.

In my opinion using environment variables is the cleanest option, so on the Environment tab I added the LPG_INCLUDE and LPG_TEMPLATE environment variables and pointed them at the appropriate folders.

Then, after clicking the run button, came the magic: Wine translates the OSX paths into a format the Windows program understands (note the paths starting with Y:\ in the textual output, which of course shows up in the Eclipse Console view), and the files produced in that same format appear in the OSX folder. The same even holds for the environment variables. Very cool.

I have two minor problems with this technique: I cannot run LPG from the right-click menu of the .g file (there is no option there for running an external tool), and the solution does not work if the Navigator view is not the active one when Run external tool is used (nor, of course, if the .g file is not the selected one, but that is natural :D). Does anyone have any ideas for these?