Lexing Engine

Audience

This page is intended for language developers who want to understand mu's lexing engine. Readers should have a solid understanding of formal grammars and lexing.

Summary

The purpose of the lexing engine is to translate a plain text file into a sequence of tokens suitable for parsing.

The primary design goals of the lexer are:

Efficiency - Parsing text is a serial task, so the less time spent lexing the better.
Genericity - The file should be lexed in such a way that a parser designed for any purpose can use the same lexer.
Usability - Provide the user with a simple and intuitive grammar, taking into consideration user-facing issues such as keystrokes, keyboard configurations, localization, and input of arbitrary text.

The lexer outputs the following classes of tokens:

Identifiers - UTF-32 strings with no restrictions on characters
Begin/end grouping - Tokens that delineate a group of tokens
Terminator - Token that signifies the end of a list of tokens
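
The three token classes above could be modeled roughly as follows. This is a minimal Python sketch, not mu's actual implementation: the names TokenKind and Token are hypothetical, and the begin and end grouping tokens are represented as two separate kinds for convenience.

```python
from dataclasses import dataclass
from enum import Enum, auto


class TokenKind(Enum):
    IDENTIFIER = auto()   # arbitrary string of Unicode code points
    BEGIN_GROUP = auto()  # opens a group of tokens
    END_GROUP = auto()    # closes a group of tokens
    TERMINATOR = auto()   # ends a list of tokens


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    text: str = ""  # only meaningful for IDENTIFIER tokens


# Assumed token stream for the input "[a b;]" under the US localization.
stream = [
    Token(TokenKind.BEGIN_GROUP),
    Token(TokenKind.IDENTIFIER, "a"),
    Token(TokenKind.IDENTIFIER, "b"),
    Token(TokenKind.TERMINATOR),
    Token(TokenKind.END_GROUP),
]
```

Keeping the token set this small is what lets a parser written for any purpose consume the same stream.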

US localization special characters:

'[' Begin and ']' end grouping
';' terminator
':' Control character, followed by one of the characters below to specify the exact control character type:
  'a' Followed by two hexadecimal characters representing an ASCII character input
  'u' Followed by eight hexadecimal characters representing a UTF-32 character input
  '(' Begin and ')' end of comment
  '-' Comment until end of line
  '[' ']' ';' ':' Input of special characters
  (whitespace) Input of whitespace characters
'{' Begin and '}' end of complex identifier specification
(whitespace) Ends identifiers or else ignored
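
The character-input forms of the ':' control character can be illustrated with a small decoder. This is a sketch under assumptions: the function name decode_escape is hypothetical, it handles only the character-input forms (not comments), and it does not check that the two-digit 'a' form stays within the 7-bit ASCII range.

```python
import string


def decode_escape(text, i):
    """Decode the character-input control sequence that starts at text[i].

    Returns (decoded_character, index_just_past_the_sequence).
    """
    if i + 1 >= len(text) or text[i] != ":":
        raise ValueError("expected ':' followed by a control type")
    kind = text[i + 1]
    # ':[' ':]' ':;' '::' and ':<whitespace>' input the character itself.
    if kind in "[];:" or kind.isspace():
        return kind, i + 2
    # ':a' takes two hex digits, ':u' takes eight.
    widths = {"a": 2, "u": 8}
    if kind not in widths:
        raise ValueError("not a character-input sequence: :" + kind)
    end = i + 2 + widths[kind]
    digits = text[i + 2:end]
    if len(digits) != widths[kind] or any(c not in string.hexdigits for c in digits):
        raise ValueError("expected %d hexadecimal digits" % widths[kind])
    return chr(int(digits, 16)), end
```

For example, decode_escape(":a41", 0) yields the character "A" and the index 4.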

Identifiers

A simple identifier is any sequence of non-whitespace characters that contains no special characters.

counter  
1function  
emergency!  
valid!@#$%^&*()identifier
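
The rule for simple identifiers can be sketched as a scan that splits on whitespace and special characters. This is an illustrative Python sketch, not the lexer itself: the function name is hypothetical, the special set is assumed to be the top-level US localization characters, and special characters are simply dropped here rather than turned into tokens or control sequences.

```python
# Assumed set of top-level special characters for the US localization.
SPECIAL = set("[];:{}")


def simple_identifiers(text):
    """Collect runs of characters that are neither whitespace nor special."""
    result, current = [], []
    for ch in text:
        if ch.isspace() or ch in SPECIAL:
            # A whitespace or special character ends the current identifier.
            if current:
                result.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        result.append("".join(current))
    return result
```

Note that '(' and ')' are not in the special set on their own; they only act as comment delimiters when preceded by ':', which is why the last example above is a single identifier.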

Complex identifier

When an identifier needs to include special characters or whitespace, a complex identifier can be used. So that the program text looks like the resulting identifier rather than being filled with escape characters, a complex identifier lets the user specify an arbitrary termination sequence. A complex identifier starts with { } and the characters between the braces define the termination sequence.

The identifier begins immediately after } and continues until the termination sequence is reached. Everything between { and }, exclusive, is the termination sequence; this includes whitespace, newlines, and reserved tokens. The identifier is everything between } and the termination sequence, exclusive.

identifier
{%}identifier%
{%}identifier with spaces%
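
Reading a complex identifier might look like the following Python sketch. The function name is hypothetical, and rejecting an empty terminator is an assumption of this sketch, since the behavior of { } with nothing between the braces is not specified here.

```python
def read_complex_identifier(text, i=0):
    """Read a complex identifier that starts with '{' at text[i].

    Returns (identifier, index_just_past_the_terminator).
    """
    if not text.startswith("{", i):
        raise ValueError("expected '{'")
    close = text.find("}", i + 1)
    if close == -1:
        raise ValueError("missing '}' after terminator specification")
    # Everything between '{' and '}', exclusive, is the terminator.
    terminator = text[i + 1:close]
    if not terminator:
        raise ValueError("empty terminator (treated as an error in this sketch)")
    start = close + 1
    end = text.find(terminator, start)
    if end == -1:
        raise ValueError("terminator not found")
    # The identifier is everything between '}' and the terminator, exclusive.
    return text[start:end], end + len(terminator)
```

A longer terminator lets the identifier itself contain special characters: read_complex_identifier("{END}a [b]; c:END") returns the identifier "a [b]; c:".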

Single line comment

A single line comment begins with the token :- and continues until an end of line character is reached. Everything from the token up to, but not including, the end of line is ignored.

:- Comment to the end of the line
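
Skipping a single line comment amounts to advancing to the next end of line character. A minimal Python sketch, assuming a hypothetical function name and treating both "\n" and "\r" as end of line characters:

```python
def skip_line_comment(text, i):
    """Given text[i:] starting with ':-', return the index where lexing resumes.

    The returned index points at the end of line character (which is not part
    of the comment), or at len(text) if the comment runs to the end of input.
    """
    if not text.startswith(":-", i):
        raise ValueError("expected ':-'")
    j = i + 2
    while j < len(text) and text[j] not in "\r\n":
        j += 1
    return j
```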

Scoped comment

A scoped comment begins with the token :( and continues until the matching token :) is reached. Scoped comments can nest. Everything between the tokens, including newlines, is ignored.

identifier :(comment here:)
identi:(Comment here:)fier
identifier :(comment :(here:):)
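
Nesting can be handled with a depth counter, as in the following Python sketch. The function name is hypothetical, and raising an error on an unterminated comment is an assumption of this sketch.

```python
def skip_scoped_comment(text, i):
    """Given text[i:] starting with ':(', return the index just past the matching ':)'."""
    if not text.startswith(":(", i):
        raise ValueError("expected ':('")
    depth = 0
    j = i
    while j < len(text):
        if text.startswith(":(", j):
            depth += 1  # entering a (possibly nested) comment
            j += 2
        elif text.startswith(":)", j):
            depth -= 1  # leaving one level of comment
            j += 2
            if depth == 0:
                return j
        else:
            j += 1  # comment content, including newlines, is ignored
    raise ValueError("unterminated scoped comment")
```

A counter is enough here because only the nesting depth matters, not what the comments contain.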

Using the resulting token stream of identifiers, groupings, and terminators, language designers can create new parsers to satisfy their needs.
