RunMat Lexer
This crate tokenizes MATLAB/Octave source code into a stream of tokens for the parser. It uses the logos library to define a fast, zero-copy DFA with a small amount of context via LexerExtras to handle MATLAB-specific ambiguities.
Design goals
- Correct tokenization for the full MATLAB language surface
- Minimal, explicit state for disambiguation (apostrophe transpose vs string, section markers, etc.)
- Compatibility with the rest of the toolchain (parser, HIR, interpreter, JIT)
- Predictable tokens: avoid over-encoding semantics at the lexing stage
Context-aware lexing
We track two pieces of context in LexerExtras:
- last_was_value: bool. True if the previous emitted token forms a value. Used to disambiguate ' as transpose vs string start.
- line_start: bool. True if we are at the beginning of a logical line. Used for %% section markers.
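As a rough illustration of how these two flags might be maintained, here is a minimal standalone sketch. It does not use logos, and the names Kind, Context, and after are hypothetical, not the crate's actual API. It assumes that only identifiers, numbers, and closing delimiters count as values; the real lexer may treat more token kinds as values.

```rust
/// Hypothetical token kinds, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Kind {
    Ident,
    Number,
    RParen,
    RBracket,
    RBrace,
    Plus,
    Assign,
    Newline,
}

/// Stand-in for the two flags described above (assumed layout).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Context {
    pub last_was_value: bool,
    pub line_start: bool,
}

impl Context {
    /// At the start of input no value has been seen and we are at a line start.
    pub fn new() -> Self {
        Context { last_was_value: false, line_start: true }
    }

    /// Update both flags after a token of the given kind has been emitted.
    pub fn after(&mut self, kind: Kind) {
        // Assumption: these are the only kinds that form a value.
        self.last_was_value = matches!(
            kind,
            Kind::Ident | Kind::Number | Kind::RParen | Kind::RBracket | Kind::RBrace
        );
        // A newline puts us back at the start of a line; anything else leaves it.
        self.line_start = kind == Kind::Newline;
    }
}
```

Keeping the state to two booleans means every token callback can update it with a single call, and no lookahead is needed to decide what an apostrophe or a percent sign means.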
Tokens overview
- Keywords: function if elseif else for while break continue return end
- Additional keywords: switch case otherwise try catch global persistent true false
- OOP keywords: classdef properties methods events enumeration arguments
- Import: import
- Identifiers: [A-Za-z_][A-Za-z0-9_]*
- Numbers: integers and floats with optional exponents
- Strings:
  - Single-quoted character arrays: '...' with doubled quotes '' inside
  - Double-quoted string scalars: "..." with doubled quotes "" inside
- Operators and punctuation:
  - Arithmetic: + - * / \ ^
  - Element-wise: .* ./ .\ .^
  - Relational: == ~= < <= > >=
  - Logical: && || & | ~
  - Transpose: ' (contextual)
  - Colon: :
  - Dotted member access: .
  - Function handle/anonymous: @
  - Meta-class query: ? (e.g., ?MyClass)
  - Assignment and separators: = , ;
  - Grouping and containers: () [] {}
- Comments & layout:
  - Line comment: % to end of line
  - Section marker: %% at start of line
  - Block comment: %{ ... %} (non-nesting)
  - Line continuation: ... (skips remainder of physical line)
  - Newlines reset line_start
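The doubled-quote convention for the two string forms listed above can be sketched as a small scanner. This is a hand-written illustration, not the regex the crate uses; quoted_len and unescape are hypothetical helper names, and the sketch assumes a literal may not span a newline.

```rust
/// Given text that starts at an opening `quote`, return the byte length of the
/// whole literal (both delimiters included), or None if it is unterminated on
/// this line. A doubled quote inside the literal is treated as an escaped quote.
pub fn quoted_len(src: &str, quote: char) -> Option<usize> {
    let mut chars = src.char_indices().peekable();
    match chars.next() {
        Some((_, c)) if c == quote => {}
        _ => return None, // does not start with the expected delimiter
    }
    while let Some((i, c)) = chars.next() {
        if c == '\n' {
            return None; // assumption: literals do not cross line breaks
        }
        if c == quote {
            if let Some(&(_, next)) = chars.peek() {
                if next == quote {
                    chars.next(); // doubled quote: consume and keep scanning
                    continue;
                }
            }
            return Some(i + quote.len_utf8()); // closing delimiter
        }
    }
    None
}

/// Strip the delimiters from a complete literal and collapse doubled quotes.
/// Expects the exact slice that `quoted_len` measured.
pub fn unescape(literal: &str, quote: char) -> String {
    let q = quote.len_utf8();
    let inner = &literal[q..literal.len() - q];
    let doubled: String = [quote, quote].iter().collect();
    inner.replace(doubled.as_str(), quote.to_string().as_str())
}
```

For example, scanning `'it''s' + x` with the single-quote delimiter measures the 7-byte literal `'it''s'`, which unescapes to `it's`.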
Notable disambiguations
- Apostrophe ':
  - If previous token was a value (identifier, number, ) ] }), emit Transpose
  - Otherwise, let the string regex capture a full single-quoted character array
- Section %%:
  - Only emitted when line_start == true; otherwise % starts a normal line comment
- Line continuation ...:
  - Emits Ellipsis and consumes the remainder of the physical line, including any % comment following it
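The three rules above reduce to small decisions on the context flags. The following is a standalone sketch with hypothetical names (classify_apostrophe, classify_percent, continuation_len); in the crate itself these decisions would live in Logos callbacks. Whether the newline that ends a continued line is also consumed is not stated here, so the sketch stops just before it.

```rust
#[derive(Debug, PartialEq)]
pub enum Apostrophe {
    Transpose,
    StringStart,
}

/// An apostrophe after a value is a transpose; otherwise it opens a char array.
pub fn classify_apostrophe(last_was_value: bool) -> Apostrophe {
    if last_was_value {
        Apostrophe::Transpose
    } else {
        Apostrophe::StringStart
    }
}

#[derive(Debug, PartialEq)]
pub enum Percent {
    SectionMarker,
    LineComment,
}

/// `rest` is the input starting at a `%`. A section marker needs both `%%`
/// and a line start; everything else is an ordinary line comment.
pub fn classify_percent(rest: &str, line_start: bool) -> Percent {
    if line_start && rest.starts_with("%%") {
        Percent::SectionMarker
    } else {
        Percent::LineComment
    }
}

/// `rest` is the input starting at `...`. Returns how many bytes the
/// continuation swallows: everything up to (not including) the newline,
/// which covers any trailing `%` comment as well.
pub fn continuation_len(rest: &str) -> usize {
    debug_assert!(rest.starts_with("..."));
    rest.find('\n').unwrap_or(rest.len())
}
```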
Non-goals at lexing time
The lexer purposefully does not encode high-level semantics:
- Integer class names like int8/uint64 are identifiers
- Special variables like varargin/varargout/ans are identifiers
- OOP features (handle inheritance, method attributes) are parsed/handled later
- Command/function syntax duality is resolved in parsing/semantic phases
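To make the boundary concrete, a word classifier along these lines would reserve only the keywords listed in the tokens overview and leave everything else as an identifier. This is an illustrative sketch; Word, KEYWORDS, and classify_word are hypothetical names, and the crate expresses the same distinction through its Logos token definitions.

```rust
#[derive(Debug, PartialEq)]
pub enum Word<'a> {
    Keyword(&'a str),
    Ident(&'a str),
}

/// Keywords taken from the tokens overview in this document.
const KEYWORDS: &[&str] = &[
    "function", "if", "elseif", "else", "for", "while", "break", "continue",
    "return", "end", "switch", "case", "otherwise", "try", "catch", "global",
    "persistent", "true", "false", "classdef", "properties", "methods",
    "events", "enumeration", "arguments", "import",
];

/// Class names (int8, uint64) and special variables (varargin, ans) are not
/// in the keyword list, so they fall through to Ident and are given meaning
/// by later stages.
pub fn classify_word(word: &str) -> Word<'_> {
    if KEYWORDS.iter().any(|k| *k == word) {
        Word::Keyword(word)
    } else {
        Word::Ident(word)
    }
}
```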
Tests
See tests/ for comprehensive coverage, organized by topic:
- lexer.rs: core tokens, operators, single-quoted strings, comments, ellipsis
- transpose.rs: detailed diagnostics and assertions for apostrophe (') transpose cases
- comments_continuation.rs: % line comments, %{...%} block comments, %% section markers, ... continuation
- operators.rs: logical and element-wise operators (e.g., .* ./ .\ .^ && || & | ~)
- namespaces.rs: import paths (including wildcard) and metaclass ?ClassName
- oop_tokens.rs: OOP keywords (classdef, properties, methods, events, enumeration, arguments) and function handles @
- strings_chars.rs: double-quoted string scalars and apostrophe disambiguation exercises
- tokens_basic.rs: identifiers, numbers, separators (; ,), and simple keyword smoke tests
All lexer tests pass when running the crate tests on their own.
Guidelines for extending the lexer
- Prefer adding new tokens only when lexical distinctions are required.
- When in doubt, keep ambiguous terms as identifiers and resolve in the parser.
- If you need context to disambiguate, add a boolean/flag in LexerExtras and use a Logos callback to Emit or Skip appropriately.
- Keep regular expressions simple (no look-around) and rely on token priority and callbacks for precedence and control.
Known compatibility notes
- Non-conjugate transpose .' is tokenized as Dot then Transpose. The parser should interpret this pair as the non-conjugating transpose.
- Block comments %{...%} are treated as non-nesting by design.
- Error-recovery is implemented to keep producing useful tokens after invalid input; in recovery mode double-quoted strings are recognized as a single Str token, while malformed single-quoted sequences may be split to allow downstream error reporting.
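On the parser side, recognizing the Dot then Transpose pair might look like the sketch below. Only the token names Dot and Transpose come from this document; Tok, Postfix, and postfix_at are hypothetical, and the actual runmat-parser may structure this differently.

```rust
/// Minimal token type for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Tok {
    Ident(String),
    Dot,
    Transpose,
}

#[derive(Debug, PartialEq)]
pub enum Postfix {
    /// A lone `'`
    ConjugateTranspose,
    /// `.` immediately followed by `'`
    NonConjugateTranspose,
}

/// Inspect the tokens at `pos` and report which postfix transpose starts
/// there, together with the number of tokens it spans.
pub fn postfix_at(tokens: &[Tok], pos: usize) -> Option<(Postfix, usize)> {
    match tokens.get(pos..)? {
        // Check the two-token form first so `.'` is never split.
        [Tok::Dot, Tok::Transpose, ..] => Some((Postfix::NonConjugateTranspose, 2)),
        [Tok::Transpose, ..] => Some((Postfix::ConjugateTranspose, 1)),
        _ => None,
    }
}
```

Matching the pair in the parser keeps the lexer free of a dedicated token for .' while still letting member access (a plain Dot followed by an identifier) fall through untouched.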
Remaining edges
- Apostrophe vs string: extreme adjacency cases across ... continuation and % comments are covered by tests; a few rare permutations may still be added as seeds (parser semantics unaffected).
- Block comments are intentionally non-nesting; any future change would be a parser/runtime decision, not lexing.
- Command-form is resolved in the parser; the lexer's role is complete for this milestone.
Crate integration
- This crate only produces tokens; it does not attempt to validate grammar.
- Downstream crates (runmat-parser, runmat-hir, runmat-ignition, runmat-turbine) are responsible for structure and semantics.