RunMat Lexer
This crate tokenizes MATLAB/Octave source code into a stream of tokens for the parser.
It uses the logos library to define a fast, zero-copy DFA with a small amount of
context via LexerExtras to handle MATLAB-specific ambiguities.
Design goals
- Correct tokenization for the full MATLAB language surface
- Minimal, explicit state for disambiguation (apostrophe transpose vs string, section markers, etc.)
- Compatibility with the rest of the toolchain (parser, HIR, interpreter, JIT)
- Predictable tokens: avoid over-encoding semantics at the lexing stage
Context-aware lexing
We track two pieces of context in LexerExtras:
last_was_value: bool— true if the previous emitted token forms a value. Used to disambiguate'as transpose vs string start.line_start: bool— true if we are at the beginning of a logical line. Used for%%section markers.
Tokens overview
- Keywords:
function if elseif else for while break continue return end - Additional keywords:
switch case otherwise try catch global persistent true false - OOP keywords:
classdef properties methods events enumeration arguments - Import:
import - Identifiers:
[A-Za-z_][A-Za-z0-9_]* - Numbers: integers and floats with optional exponents
- Strings:
- Single-quoted character arrays:
'...'with doubled quotes''inside - Double-quoted string scalars:
"..."with doubled quotes""inside
- Single-quoted character arrays:
- Operators and punctuation:
- Arithmetic:
+ - * / \ ^ - Element-wise:
.* ./ .\ .^ - Relational:
== ~= < <= > >= - Logical:
&& || & | ~ - Transpose:
'(contextual) - Colon:
: - Dotted member access:
. - Function handle/anonymous:
@ - Meta-class query:
?(e.g.,?MyClass) - Assignment and separators:
= , ; - Grouping and containers:
() [] {}
- Arithmetic:
- Comments & layout:
- Line comment:
%to end of line - Section marker:
%%at start of line - Block comment:
%{ ... %}(non-nesting) - Line continuation:
...(skips remainder of physical line) - Newlines reset
line_start
- Line comment:
Notable disambiguations
- Apostrophe
':- If previous token was a value (identifier, number,
) ] }), emitTranspose - Otherwise, let the string regex capture a full single-quoted character array
- If previous token was a value (identifier, number,
- Section
%%:- Only emitted when
line_start == true; otherwise%starts a normal line comment
- Only emitted when
- Line continuation
...:- Emits
Ellipsisand consumes the remainder of the physical line, including any%comment following it
- Emits
Non-goals at lexing time
The lexer purposefully does not encode high-level semantics:
- Integer class names like
int8/uint64are identifiers - Special variables like
varargin/varargout/ansare identifiers - OOP features (
handleinheritance, method attributes) are parsed/handled later - Command/function syntax duality is resolved in parsing/semantic phases
Tests
See tests/ for comprehensive coverage, organized by topic:
lexer.rs: core tokens, operators, single-quoted strings, comments, ellipsistranspose.rs: detailed diagnostics and assertions for apostrophe (') transpose casescomments_continuation.rs:%line comments,%{...%}block comments,%%section markers,...continuationoperators.rs: logical and element-wise operators (e.g.,.* ./ .\ .^ && || & | ~)namespaces.rs:importpaths (including wildcard) and metaclass?ClassNameoop_tokens.rs: OOP keywords (classdef,properties,methods,events,enumeration,arguments) and function handles@strings_chars.rs: double-quoted string scalars and apostrophe disambiguation exercisestokens_basic.rs: identifiers, numbers, separators (; ,), and simple keyword smoke tests
All lexer tests pass when running the crate tests on their own.
Guidelines for extending the lexer
- Prefer adding new tokens only when lexical distinctions are required.
- When in doubt, keep ambiguous terms as identifiers and resolve in the parser.
- If you need context to disambiguate, add a boolean/flag in
LexerExtrasand use a Logos callback toEmitorSkipappropriately. - Keep regular expressions simple (no look-around) and rely on token priority and callbacks for precedence and control.
Known compatibility notes
- Non-conjugate transpose
.'is tokenized asDotthenTranspose. The parser should interpret this pair as the non-conjugating transpose. - Block comments
%{...%}are treated as non-nesting by design. - Error-recovery is implemented to keep producing useful tokens after invalid input; in recovery mode
double-quoted strings are recognized as a single
Strtoken, while malformed single-quoted sequences may be split to allow downstream error reporting.
Remaining edges
- Apostrophe vs string: extreme adjacency cases across
...continuation and%comments are covered by tests; a few rare permutations may still be added as seeds (parser semantics unaffected). - Block comments are intentionally non-nesting; any future change would be a parser/runtime decision, not lexing.
- Command-form is resolved in the parser; lexer's role is complete for milestone.
Crate integration
- This crate only produces tokens; it does not attempt to validate grammar.
- Downstream crates (
runmat-parser,runmat-hir,runmat-ignition,runmat-turbine) are responsible for structure and semantics.