RunMat Lexer

This crate tokenizes MATLAB/Octave source code into a stream of tokens for the parser. It uses the logos library to define a fast, zero-copy DFA with a small amount of context via LexerExtras to handle MATLAB-specific ambiguities.

Design goals

Correct tokenization for the full MATLAB language surface
Minimal, explicit state for disambiguation (apostrophe transpose vs string, section markers, etc.)
Compatibility with the rest of the toolchain (parser, HIR, interpreter, JIT)
Predictable tokens: avoid over-encoding semantics at the lexing stage

Context-aware lexing

We track two pieces of context in LexerExtras:

last_was_value: bool — true if the previous emitted token forms a value. Used to disambiguate ' as transpose vs string start.
line_start: bool — true if we are at the beginning of a logical line. Used for %% section markers.

Tokens overview

Keywords: function if elseif else for while break continue return end
Additional keywords: switch case otherwise try catch global persistent true false
OOP keywords: classdef properties methods events enumeration arguments
Import: import
Identifiers: [A-Za-z_][A-Za-z0-9_]*
Numbers: integers and floats with optional exponents
Strings:
- Single-quoted character arrays: '...' with doubled quotes '' inside
- Double-quoted string scalars: "..." with doubled quotes "" inside
Operators and punctuation:
- Arithmetic: + - * / \ ^
- Element-wise: .* ./ .\ .^
- Relational: == ~= < <= > >=
- Logical: && || & | ~
- Transpose: ' (contextual)
- Colon: :
- Dotted member access: .
- Function handle/anonymous: @
- Meta-class query: ? (e.g., ?MyClass)
- Assignment and separators: = , ;
- Grouping and containers: () [] {}
Comments & layout:
- Line comment: % to end of line
- Section marker: %% at start of line
- Block comment: %{ ... %} (non-nesting)
- Line continuation: ... (skips remainder of physical line)
- Newlines reset line_start

Notable disambiguations

Apostrophe ':
- If previous token was a value (identifier, number, ) ] }), emit Transpose
- Otherwise, let the string regex capture a full single-quoted character array
Section %%:
- Only emitted when line_start == true; otherwise % starts a normal line comment
Line continuation ...:
- Emits Ellipsis and consumes the remainder of the physical line, including any % comment following it

Non-goals at lexing time

The lexer purposefully does not encode high-level semantics:

Integer class names like int8/uint64 are identifiers
Special variables like varargin/varargout/ans are identifiers
OOP features (handle inheritance, method attributes) are parsed/handled later
Command/function syntax duality is resolved in parsing/semantic phases

Tests

See tests/ for comprehensive coverage, organized by topic:

lexer.rs: core tokens, operators, single-quoted strings, comments, ellipsis
transpose.rs: detailed diagnostics and assertions for apostrophe (') transpose cases
comments_continuation.rs: % line comments, %{...%} block comments, %% section markers, ... continuation
operators.rs: logical and element-wise operators (e.g., .* ./ .\ .^ && || & | ~)
namespaces.rs: import paths (including wildcard) and metaclass ?ClassName
oop_tokens.rs: OOP keywords (classdef, properties, methods, events, enumeration, arguments) and function handles @
strings_chars.rs: double-quoted string scalars and apostrophe disambiguation exercises
tokens_basic.rs: identifiers, numbers, separators (; ,), and simple keyword smoke tests

All lexer tests pass when running the crate tests on their own.

Guidelines for extending the lexer

Prefer adding new tokens only when lexical distinctions are required.
When in doubt, keep ambiguous terms as identifiers and resolve in the parser.
If you need context to disambiguate, add a boolean/flag in LexerExtras and use a Logos callback to Emit or Skip appropriately.
Keep regular expressions simple (no look-around) and rely on token priority and callbacks for precedence and control.

Known compatibility notes

Non-conjugate transpose .' is tokenized as Dot then Transpose. The parser should interpret this pair as the non-conjugating transpose.
Block comments %{...%} are treated as non-nesting by design.
Error-recovery is implemented to keep producing useful tokens after invalid input; in recovery mode double-quoted strings are recognized as a single Str token, while malformed single-quoted sequences may be split to allow downstream error reporting.

Remaining edges

Apostrophe vs string: extreme adjacency cases across ... continuation and % comments are covered by tests; a few rare permutations may still be added as seeds (parser semantics unaffected).
Block comments are intentionally non-nesting; any future change would be a parser/runtime decision, not lexing.
Command-form is resolved in the parser; lexer's role is complete for milestone.

Crate integration

This crate only produces tokens; it does not attempt to validate grammar.
Downstream crates (runmat-parser, runmat-hir, runmat-ignition, runmat-turbine) are responsible for structure and semantics.