- Update lexgen's `rustc-hash` dependency from 1.1.0 to 2.0.0, and lexgen_util's `unicode-width` dependency from 0.1.10 to 0.2.0.
- Lexers can now use `pub(crate)` visibility, and other visibilities supported by Rust and the `syn` crate. Previously only `pub` was supported.
- Eliminate redundant `backtrack` calls in generated code, improving code size and runtime performance. Runtime performance improved 13% in a benchmark. (#69)
- Lexer type declarations can now have outer attributes other than just `#[derive(Clone)]`. Example:

      lexer! {
          /// A lexer for Rust.
          #[derive(Debug, Clone)]
          pub RustLexer(LexerState) -> RustToken;

          ...
      }

  The attributes are directly copied to the generated `struct`. In the example, the documentation and `derive` attribute will be copied to the generated `struct`:

      /// A lexer for Rust.
      #[derive(Debug, Clone)]
      pub struct RustLexer<...>(...);
- `lexgen_util::Lexer` type now derives `Debug` (in addition to `Clone`). This makes it possible to derive `Debug` in generated lexers.
- `syn` dependency updated to version 2.
- Breaking change: Rules without a right-hand side (e.g. `$$whitespace,`) now always reset the current match. Previously such rules would only reset the current match in `Init`. (#12)

  To migrate, add a semantic action to your rule that just calls `continue_()` on the lexer. For example, if you have `$$whitespace,`, replace it with:

      $$whitespace => |lexer| lexer.continue_(),
- `clippy::manual_is_ascii_check` violations are now ignored in generated code.
- Fix more `manual_range_contains` lints in generated code.
- `let` bindings can now appear inside `rule`s. Previously `let`s were only allowed at the top-level. (#28)
- You can now add `#[derive(Clone)]` before the lexer type name to implement `Clone` for the lexer type. This can be used to implement backtracking parsers. Example:

      lexer! {
          #[derive(Clone)]
          pub Lexer -> Token;

          // The struct `Lexer` will implement `Clone`

          ...
      }
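As a rough illustration of the backtracking use case mentioned for `#[derive(Clone)]`, the sketch below uses a small hand-written lexer in place of a lexgen-generated one. `ToyLexer`, `Token`, `try_sum`, and `parse_sum_or_num` are hypothetical names invented for this example. The only property it relies on is that the lexer type is `Clone`, so a parser can save a checkpoint and restore it when an alternative fails.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
enum Token {
    Num(u32),
    Plus,
}

/// Hand-written stand-in for a generated lexer (an assumption for this sketch).
#[derive(Debug, Clone)]
struct ToyLexer<'input> {
    input: &'input str,
    pos: usize,
}

impl<'input> ToyLexer<'input> {
    fn new(input: &'input str) -> Self {
        ToyLexer { input, pos: 0 }
    }
}

impl<'input> Iterator for ToyLexer<'input> {
    type Item = Token;

    fn next(&mut self) -> Option<Token> {
        // Skip leading whitespace.
        let rest = &self.input[self.pos..];
        let trimmed = rest.trim_start();
        self.pos += rest.len() - trimmed.len();

        match trimmed.chars().next()? {
            '+' => {
                self.pos += 1;
                Some(Token::Plus)
            }
            c if c.is_ascii_digit() => {
                let len = trimmed.chars().take_while(|c| c.is_ascii_digit()).count();
                let value = trimmed[..len].parse::<u32>().ok()?;
                self.pos += len;
                Some(Token::Num(value))
            }
            _ => None,
        }
    }
}

/// Tries the alternative `Num '+' Num`. May leave the lexer partially consumed.
fn try_sum(lexer: &mut ToyLexer<'_>) -> Option<u32> {
    match (lexer.next()?, lexer.next()?, lexer.next()?) {
        (Token::Num(a), Token::Plus, Token::Num(b)) => Some(a + b),
        _ => None,
    }
}

/// Parses `Num '+' Num`, falling back to a single `Num`.
fn parse_sum_or_num(lexer: &mut ToyLexer<'_>) -> Option<u32> {
    let checkpoint = lexer.clone(); // save the lexer state
    if let Some(sum) = try_sum(lexer) {
        return Some(sum);
    }
    *lexer = checkpoint; // backtrack to the saved state
    match lexer.next()? {
        Token::Num(n) => Some(n),
        _ => None,
    }
}
```

With the input `"7 8"`, the first alternative consumes both numbers before failing, and restoring the clone lets the fallback see `7` again.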
- Fix `double_comparison`, `manual_range_contains` lints in generated code. (0ecb0b1)
- Lexer constructors `new_with_state` and `new_from_iter_with_state` no longer require user state to implement `Default`. (#54)
- User state can now have lifetime parameters other than `'input`. (#53)
- Lexer state is now reset on failure. (#48)
- Generated lexers now have two new constructors:

  - `new_from_iter<I: Iterator<Item = char> + Clone>(iter: I) -> Self`
  - `new_from_iter_with_state<I: Iterator<Item = char> + Clone, S>(iter: I, user_state: S) -> Self`

  These constructors allow running a lexer on a character iterator instead of a string slice. Generated lexers work exactly the same way, except the `match_` method panics when called.

  Locations of matches can be obtained with the `match_loc(&self) -> (Loc, Loc)` method.

  These constructors are useful when the input is not a flat unicode string, but something like a rope, gap array, zipper, etc. (#41)
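To make the `Iterator<Item = char> + Clone` bound on these constructors more concrete, here is a minimal std-only sketch. `rope_chars` and `take_digits` are hypothetical helpers, not part of lexgen. The "rope" is just a slice of string chunks. Using `Clone` for lookahead is an assumption about why the bound is useful; the changelog does not state the reason.

```rust
/// Turns a chunked, rope-like input into one cloneable character iterator.
/// `slice::Iter`, `Chars`, and the non-capturing closure are all `Clone`,
/// so the resulting `FlatMap` is `Clone` too.
fn rope_chars(chunks: &[String]) -> impl Iterator<Item = char> + Clone + '_ {
    chunks.iter().flat_map(|chunk| chunk.chars())
}

/// Consumes leading ASCII digits and returns them together with an iterator
/// positioned at the first non-digit character. Cloning the iterator gives
/// one character of lookahead without consuming input.
fn take_digits<I: Iterator<Item = char> + Clone>(mut iter: I) -> (String, I) {
    let mut digits = String::new();
    loop {
        let mut lookahead = iter.clone();
        match lookahead.next() {
            Some(c) if c.is_ascii_digit() => {
                digits.push(c);
                iter = lookahead; // commit the consumed character
            }
            _ => return (digits, iter), // keep the un-advanced iterator
        }
    }
}
```

An iterator built this way satisfies the bound written in the constructor signatures above, so a value like `rope_chars(&chunks)` is the kind of argument `new_from_iter` is declared to accept.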
- `lexgen_util::Loc` now implements `Default`. This makes it easier to use lexgen with LALRPOP. (#44)
- New regex syntax `#` added for character set difference, e.g. `re1 # re2` matches characters in `re1` that are not in `re2`. `re1` and `re2` need to be "character sets", i.e. `*`, `+`, `?`, `"..."`, `$`, and concatenation are not allowed.
- Breaking change: `LexerError` type is refactored to add location information to all errors, not just `InvalidToken`. Previously the type was:

      #[derive(Debug, Clone, PartialEq, Eq)]
      pub enum LexerError<E> {
          InvalidToken {
              location: Loc,
          },

          /// Custom error, raised by a semantic action
          Custom(E),
      }

  With this change, it is now:

      #[derive(Debug, Clone, PartialEq, Eq)]
      pub struct LexerError<E> {
          pub location: Loc,
          pub kind: LexerErrorKind<E>,
      }

      #[derive(Debug, Clone, PartialEq, Eq)]
      pub enum LexerErrorKind<E> {
          /// Lexer error, raised by lexgen-generated code
          InvalidToken,

          /// Custom error, raised by a semantic action
          Custom(E),
      }
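For code migrating to the refactored error type, a handler might look roughly like the sketch below. The `LexerError` and `LexerErrorKind` definitions are copied from this entry and re-declared locally so the example stands alone; real code would import them from `lexgen_util`. The derives on `Loc` and the `describe` function are assumptions made for this illustration.

```rust
use std::fmt::Display;

// Local stand-in for `lexgen_util::Loc`; the derive list is an assumption.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Loc {
    pub line: u32,
    pub col: u32,
    pub byte_idx: usize,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LexerError<E> {
    pub location: Loc,
    pub kind: LexerErrorKind<E>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum LexerErrorKind<E> {
    /// Lexer error, raised by lexgen-generated code
    InvalidToken,

    /// Custom error, raised by a semantic action
    Custom(E),
}

/// Hypothetical helper: formats any lexer error with its location.
/// Line and column are printed exactly as stored, with no offset applied.
fn describe<E: Display>(err: &LexerError<E>) -> String {
    let Loc { line, col, .. } = err.location;
    match &err.kind {
        LexerErrorKind::InvalidToken => format!("{}:{}: invalid token", line, col),
        LexerErrorKind::Custom(e) => format!("{}:{}: {}", line, col, e),
    }
}
```

Because the location now lives on the outer struct, the handler reads it once and only branches on `kind`, instead of extracting a location from a single variant as the old enum required.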
- New syntax added for right contexts. A right context is basically lookahead, but it can only be used in rules and cannot be nested inside regexes. See README for details. (#29)
New version published to fix broken README pages for lexgen and lexgen_util on crates.io.
- Breaking change: Starting with this release, lexgen-generated lexers now depend on the `lexgen_util` package of the same version. If you are using lexgen version 0.8 or newer, make sure to add `lexgen_util = "..."` to your `Cargo.toml`, using the same version number as `lexgen`.
- Common code in generated code is moved to a new crate `lexgen_util`. lexgen-generated lexers now depend on `lexgen_util`.
- Breaking change: Line and column tracking implemented. Iterator implementation now yields `(Loc, Token, Loc)`, where `Loc` is defined in `lexgen_util` as `struct Loc { line: u32, col: u32, byte_idx: usize }`.
- Fixed a bug when initial state of a rule does not have any transitions (rule is empty). (#27, 001ea51)
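To give a feel for the `(Loc, Token, Loc)` triples introduced with line and column tracking, the following std-only sketch produces triples of the same shape for whitespace-separated words. It is not lexgen's implementation. `advance` and `spanned_words` are hypothetical, and the sketch assumes zero-based lines and columns, one column per character, and an exclusive end location, none of which the changelog confirms.

```rust
// Local copy of the `Loc` layout given in the changelog; derives are assumed.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct Loc {
    line: u32,
    col: u32,
    byte_idx: usize,
}

/// Moves a location past one character. Assumption: '\n' starts a new line
/// and every other character is one column wide.
fn advance(loc: Loc, c: char) -> Loc {
    let byte_idx = loc.byte_idx + c.len_utf8();
    if c == '\n' {
        Loc { line: loc.line + 1, col: 0, byte_idx }
    } else {
        Loc { line: loc.line, col: loc.col + 1, byte_idx }
    }
}

/// Splits `input` into words, each with its start and (exclusive) end location.
fn spanned_words(input: &str) -> Vec<(Loc, &str, Loc)> {
    let mut out = Vec::new();
    let mut loc = Loc::default();
    let mut start: Option<Loc> = None;
    for c in input.chars() {
        if c.is_whitespace() {
            // A word ends just before this whitespace character.
            if let Some(s) = start.take() {
                out.push((s, &input[s.byte_idx..loc.byte_idx], loc));
            }
        } else if start.is_none() {
            start = Some(loc);
        }
        loc = advance(loc, c);
    }
    // Flush a word that runs to the end of the input.
    if let Some(s) = start {
        out.push((s, &input[s.byte_idx..loc.byte_idx], loc));
    }
    out
}
```

Keeping `byte_idx` next to `line` and `col` lets a consumer both slice the original input and report human-readable positions from the same value.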
- Fixed a bug in codegen that caused accidental backtracking in some cases. (#27, 001ea51)
- Fixed a bug that caused incorrect lexing when a lexer state has both range and any (`_`) transitions. (#31)
- Regex syntax updated to include "any character" (`_`) and "end of input" (`$`).

  Previously "any character" (`_`) could be used as a rule left-hand side, but was not allowed in regexes.
- Semantic action functions that use user state (`state` method of the lexer handle) no longer need the `mut` modifier in the handle argument.

  This will generate warnings in old code with semantic actions that take a `mut` argument.
- New lexer method `reset_match` implemented to reset the current match.
- Fixed precedences of concatenation (juxtaposition) and alternation (`|`).
- Fixed lexing in lexers that require backtracking to implement longest match rule. (#16)
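The longest match rule with backtracking can be sketched without lexgen as follows. This is an illustrative model, not the generated code: `longest_match` and `tokenize` are hypothetical functions that treat a list of literal patterns as the token set and scan one character at a time, remembering the last position where a complete pattern was seen.

```rust
/// Returns the longest pattern that is a prefix of `input`.
/// Scanning continues while some pattern could still match, and the result
/// is the last accepting position, which may be earlier than where we stopped.
fn longest_match<'a>(patterns: &[&str], input: &'a str) -> Option<&'a str> {
    let mut last_accept: Option<usize> = None;
    for (idx, ch) in input.char_indices() {
        let end = idx + ch.len_utf8();
        let prefix = &input[..end];
        if !patterns.iter().any(|p| p.starts_with(prefix)) {
            break; // stuck: no pattern continues with this character
        }
        if patterns.contains(&prefix) {
            last_accept = Some(end); // remember the accepting position
        }
    }
    // Backtrack to the last accepting position, if there was one.
    last_accept.map(|end| &input[..end])
}

/// Splits `input` into tokens, or returns the byte offset where lexing failed.
fn tokenize<'a>(patterns: &[&str], input: &'a str) -> Result<Vec<&'a str>, usize> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        match longest_match(patterns, &input[pos..]) {
            Some(tok) => {
                tokens.push(tok);
                pos += tok.len();
            }
            None => return Err(pos),
        }
    }
    Ok(tokens)
}
```

With patterns `a`, `abc`, and `b`, the input `abd` shows the backtracking case: the scanner reads `ab` hoping for `abc`, gets stuck on `d`, and falls back to the shorter match `a`.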
- Accepting states without transitions are now simplified at compile time, and semantic actions of such states are inlined in the states that make a transition to such accepting states. In the Lua 5.1 lexer this reduces a benchmark's runtime by 14.9%. (#7)

  Note that this potentially duplicates a lot of code in the generated code when some states have large semantic actions and lots of incoming edges in the DFA. However, in practice I haven't observed this yet. (#8)
- DFA states with one predecessor are now inlined in the predecessor states. This reduces code size and improves runtime performance. (33547ec)
- We now reset the current match after returning a token (with `return_` and `switch_and_return`). (#11)
- lexgen now comes with a set of built-in regular expressions for matching Unicode alphanumerics, uppercases, whitespaces etc. See README for details.
- Fixed a few issues with end-of-stream handling. (cbaabe2)
- Fixed handling of overlapping ranges in a single NFA/DFA state. (#3)
`LexerError` type now implements `Clone` and `Copy`.
- Fixed various bugs in `_` pattern handling.
- It is now possible to use the special lifetime `'input` in your token types to borrow from the input string. Example:

      enum Token<'input> {
          Id(&'input str),
      }

      lexer! {
          Lexer -> Token<'input>;

          rule Init {
              [' ' '\t' '\n']; // skip whitespace

              ['a'-'z']+ => |lexer| {
                  let match_ = lexer.match_();
                  lexer.return_(Token::Id(match_))
              },
          }
      }

  See also the Lua 5.1 lexer example, which is updated to use this feature.
- The `rule Init { ... }` syntax can now be omitted when you don't need named rule sets. For example, the example in the previous changelog entry can be simplified as:

      lexer! {
          Lexer -> Token<'input>;

          [' ' '\t' '\n'], // skip whitespace

          ['a'-'z']+ => |lexer| {
              let match_ = lexer.match_();
              lexer.return_(Token::Id(match_))
          },
      }
- `pub` keyword before a lexer name now generates the type as `pub`. Useful for using the generated lexer in other modules. Example:

      lexer! {
          pub Lexer -> Token;

          ...
      }
- Two new action kinds: "fallible" and "simple" added. The old ones defined with `=>` are now called "infallible".

  - "fallible" actions are defined with `=?` instead of `=>`. The difference from infallible actions is that the return type is `Result<Token, UserError>` instead of `Token`, where `UserError` is defined using `type Error = ...;` syntax. LHS can have a `<'input>` lifetime parameter when borrowing from the user input in the error values. When a user error type is defined, the lexer error struct becomes an enum, with two variants:

        enum LexerError {
            LexerError { char_idx: usize },
            UserError(UserError),
        }

  - "simple" actions are defined with `=` instead of `=>`. The RHS needs to be a value for a token, instead of a closure for a lexer action. This rule kind is useful when matching keywords and other simple tokens in a language. Example:

        lexer! {
            Lexer -> Token;

            '(' = Token::LParen,
            ')' = Token::RParen,
            ...
        }

    The syntax `<regex> = <expr>` is syntactic sugar for `<regex> => |lexer| lexer.return_(<expr>)`.
- Initial release