7.3. Regular expression library

The REGEX module implements regular expression matching and searching. It provides regex_compile for building patterns, regex_match for full-string matching, regex_search for finding the first match anywhere, regex_foreach for iterating all matches, regex_replace for substitution (both block-based and template-string forms), regex_split for splitting strings, regex_match_all for collecting all match ranges, regex_group for retrieving captured groups by index, and regex_group_by_name for named group lookup.

See Regular Expressions for a hands-on tutorial.

Supported syntax:

  • . — any character except newline (use dot_all=true to also match \n)

  • ^ — beginning of string (or offset position)

  • $ — end of string

  • + — one or more (greedy)

  • * — zero or more (greedy)

  • ? — zero or one (greedy)

  • +? — one or more (lazy)

  • *? — zero or more (lazy)

  • ?? — zero or one (lazy)

  • {n} — exactly n repetitions

  • {n,} — n or more (greedy)

  • {n,m} — between n and m (greedy)

  • {n}? {n,}? {n,m}? — counted repetitions (lazy)

  • (...) — capturing group

  • (?:...) — non-capturing group

  • (?P<name>...) — named capturing group

  • (?=...) — positive lookahead assertion

  • (?!...) — negative lookahead assertion

  • | — alternation

  • [abc], [a-z], [^abc] — character sets (negated with ^)

  • \w \W — word / non-word characters

  • \d \D — digit / non-digit characters

  • \s \S — whitespace / non-whitespace characters

  • \b \B — word boundary / non-boundary assertions

  • \t \n \r \f \v — whitespace escapes

  • \xHH — hexadecimal character escape

  • \. \+ \* \( \) \[ \] \| \\ \^ \{ \} — escaped metacharacters

Flags:

  • case_insensitive=true — ASCII case-insensitive matching (pass to regex_compile)

  • dot_all=true — . also matches \n (pass to regex_compile)

Template-string replacement:

regex_replace(re, str, replacement) replaces matches using a template string. Supported references: $0 or $& for the whole match, $1 through $9 for numbered groups, ${name} for named groups, $$ for a literal $.
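As a rough illustration of the template form, the sketch below swaps two numbered groups. The pattern and input string are made up for the example, and the effect noted in the comment is what the description above implies rather than verified output.

```das
require daslib/regex

[export]
def main() {
    // two numbered groups separated by a dash (illustrative pattern)
    var re <- regex_compile("([0-9]+)-([0-9]+)")
    // $2 and $1 refer to the second and first captured groups
    let swapped = regex_replace(re, "10-20 and 30-40", "$2-$1")
    // intended effect: every "a-b" pair becomes "b-a"
    print("{swapped}\n")
}
```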

The engine is ASCII-only (256-bit CharSet). Matching is anchored — regex_match tests from position 0 (or the given offset) and does NOT search; use regex_search to find the first occurrence, or regex_foreach / regex_match_all to find all occurrences.
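The difference between anchored matching and searching can be sketched as follows. This is a minimal example with an invented input string; the comments restate the behavior described above and are not captured output.

```das
require daslib/regex

[export]
def main() {
    var re <- regex_compile("[0-9]+")
    let text = "abc123"
    // anchored at position 0: "abc123" does not start with a digit,
    // so per the description this should report failure (-1)
    let anchored = regex_match(re, text)
    // searching scans forward and should locate the digits
    let found = regex_search(re, text)
    print("regex_match  = {anchored}\n")
    print("regex_search = {found.x}..{found.y}\n")
}
```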

See also Boost package for REGEX for compile-time regex construction via the %regex~ reader macro.

All functions and symbols are in the “regex” module; use require to get access to it.

require daslib/regex

Example:

    require daslib/regex
    require strings

    [export]
    def main() {
        var re <- regex_compile("[0-9]+")
        let m = regex_match(re, "123abc")
        print("match length = {m}\n")
        let text = "age 25, height 180"
        regex_foreach(re, text) $(r) {
            print("found: {slice(text, r.x, r.y)}\n")
            return true
        }
    }
    // output:
    // match length = 3
    // found: 25
    // found: 180

7.3.1. Type aliases

regex::CharSet = uint[8]

Bitfield character set used internally by the regex engine.

regex::ReGenRandom = iterator<uint>

Random number generator callback used by re_gen for regex-based string generation.

regex::variant MaybeReNode

Regex node or nothing.

Variants
  • value : ReNode? - Node.

  • nothing : void? - Nothing.

7.3.2. Enumerations

regex::ReOp

Type of regular expression operation.

Values
  • Char = 0 - Matching a character

  • Set = 1 - Matching a character set

  • Any = 2 - Matches any character

  • Eos = 3 - Matches end of string

  • Bos = 4 - Matches beginning of string

  • Group = 5 - Matching a group

  • Plus = 6 - Repetition: one or more

  • Star = 7 - Repetition: zero or more

  • Question = 8 - Repetition: zero or one

  • Concat = 9 - First followed by second

  • Union = 10 - Either first or second

  • Repeat = 11 - Counted repetition: {n}, {n,}, {n,m}

  • WordBoundary = 12 - Matches at a word boundary

  • NonWordBoundary = 13 - Matches at a non-word boundary

  • Lookahead = 14 - Positive lookahead assertion (?=…)

  • NegativeLookahead = 15 - Negative lookahead assertion (?!…)

7.3.3. Structures

regex::ReNode

Regular expression node.

Fields
  • op : ReOp - Regex operation

  • id : int - Unique node identifier

  • fun2 : function<(regex: Regex;node: ReNode?;str:uint8?):uint8?> - Matching function

  • gen2 : function<(node: ReNode?;rnd: ReGenRandom;str: StringBuilderWriter):void> - Generator function

  • at : range - Source range

  • text : string - Text fragment

  • textLen : int - Length of text fragment

  • all : array< ReNode?> - All child nodes

  • left : ReNode? - Left child node

  • right : ReNode? - Right child node

  • subexpr : ReNode? - Subexpression node

  • next : ReNode? - Next node in the list

  • cset : CharSet - Character set for character class matching

  • index : int - Index for character class matching

  • min_rep : int - Minimum repetition count for counted quantifiers

  • max_rep : int - Maximum repetition count for counted quantifiers (-1 means unlimited)

  • lazy : bool - Whether this quantifier uses lazy matching (*?, +?, ??, {n,m}?)

  • tail : uint8? - Tail of the string

regex::Regex

Regular expression structure.

Fields
  • root : ReNode? - Root node of the regex.

  • match : uint8? - Original source text.

  • groups : array<tuple<range;string>> - Captured groups.

  • earlyOut : CharSet - Character set for early out optimization.

  • canEarlyOut : bool - Whether early out optimization is enabled.

  • caseInsensitive : bool - When true, matching is case-insensitive (ASCII only).

  • dotAll : bool - When true, . matches newline characters as well.

7.3.4. Compilation and validation

regex::debug_set(cset: CharSet)

Prints all characters contained in a CharSet for debugging purposes.

Arguments
  • cset : CharSet

regex::is_valid(re: Regex) : bool()

Returns true if the compiled regex is valid and ready for matching.

Arguments
  • re : Regex

7.3.4.1. regex_compile

regex::regex_compile(expr: string; case_insensitive: bool = false; dot_all: bool = false) : Regex()

Compiles a regular expression pattern string into a Regex object. Panics if the pattern is invalid. An overload taking a var re : Regex out-parameter returns bool instead of panicking. Optional flags: case_insensitive=true for ASCII case-insensitive matching, dot_all=true for . to also match newline characters.

Arguments
  • expr : string

  • case_insensitive : bool

  • dot_all : bool

regex::regex_compile(re: Regex; expr: string; case_insensitive: bool = false; dot_all: bool = false) : bool()
regex::regex_compile(re: Regex) : Regex()
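A minimal sketch of the two compile styles, assuming the flags can be passed positionally in the order shown in the signature. The patterns are illustrative only, and no particular output is claimed.

```das
require daslib/regex

[export]
def main() {
    // panicking form; second argument is case_insensitive
    var ci <- regex_compile("hello", true)
    let m = regex_match(ci, "HeLLo")
    print("case-insensitive result = {m}\n")

    // non-panicking form: compile into an existing Regex and check the bool
    var re : Regex
    if (regex_compile(re, "[a-z]+[0-9]*")) {
        print("compiled ok\n")
    } else {
        print("pattern rejected\n")
    }
}
```

The bool-returning overload is probably the better fit when the pattern comes from user input, since a malformed pattern can then be reported instead of aborting the program.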

regex::regex_debug(regex: Regex)

Prints the internal structure of a compiled regex for debugging purposes.

Arguments
  • regex : Regex

regex::visit_top_down(node: ReNode?; blk: block<(var n:ReNode?):void>)

Visits all nodes of a compiled regex tree in top-down order, invoking a callback for each node.

Arguments
  • node : ReNode?

  • blk : block<(var n:ReNode?):void>

7.3.5. Access

7.3.5.1. Regex[]

regex::Regex[](regex: Regex; index: int) : range()

Returns the match range for the given group index.

Arguments
  • regex : Regex

  • index : int

regex::Regex[](regex: Regex; name: string) : range()
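The two index operators might be used as in the sketch below. It assumes that group numbering starts at 1 for the first capturing group and that a named group can also be addressed by its number; neither detail is stated in this section, so treat both as assumptions.

```das
require daslib/regex

[export]
def main() {
    // first group is named "key"; pattern and input are illustrative
    var re <- regex_compile("(?P<key>[a-z]+)=([0-9]+)")
    let text = "width=640"
    if (regex_match(re, text) != -1) {
        let by_index = re[1]      // assumption: 1 addresses the first group
        let by_name = re["key"]   // lookup by group name
        print("by index: {by_index.x}..{by_index.y}\n")
        print("by name:  {by_name.x}..{by_name.y}\n")
    }
}
```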

regex::regex_foreach(regex: Regex; str: string; blk: block<(at:range):bool>)

Iterates over all non-overlapping matches of a regex in a string, invoking a block for each match.

Arguments
  • regex : Regex

  • str : string

  • blk : block<(at:range):bool>

regex::regex_group(regex: Regex; index: int; match: string) : string()

Returns the substring captured by the specified group index after a successful match.

Arguments
  • regex : Regex

  • index : int

  • match : string
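A short sketch of reading captured groups after a successful match. It assumes index 1 refers to the first capturing group (mirroring $1 in template replacement); the pattern and input are invented for the example.

```das
require daslib/regex

[export]
def main() {
    var re <- regex_compile("([a-z]+)=([0-9]+)")
    let text = "width=640"
    // groups are only meaningful after a successful match
    if (regex_match(re, text) != -1) {
        let key = regex_group(re, 1, text)     // assumption: first group
        let value = regex_group(re, 2, text)   // assumption: second group
        print("{key} -> {value}\n")
    }
}
```

Note that the same string that was matched is passed again as the third argument, as the signature requires.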

regex::regex_group_by_name(regex: Regex; name: string; str: string) : string()

Returns the matched substring for the named capturing group (?P<name>...). Returns empty string if the group name is not found.

Arguments
  • regex : Regex

  • name : string

  • str : string
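Named lookup can be sketched the same way. The group names year and month are hypothetical, chosen only for this example.

```das
require daslib/regex

[export]
def main() {
    var re <- regex_compile("(?P<year>[0-9]+)-(?P<month>[0-9]+)")
    let text = "2024-06"
    if (regex_match(re, text) != -1) {
        let year = regex_group_by_name(re, "year", text)
        let month = regex_group_by_name(re, "month", text)
        // an unknown name would yield an empty string per the description
        let missing = regex_group_by_name(re, "day", text)
        print("year={year} month={month} day=[{missing}]\n")
    }
}
```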

7.3.6. Match & replace

regex::regex_match(regex: Regex; str: string; offset: int = 0) : int()

Matches a compiled regex against a string and returns the end position of the match, or -1 on failure.

Arguments
  • regex : Regex

  • str : string

  • offset : int
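Because regex_match reports where the match ended rather than a plain yes/no, a caller that wants to validate an entire string may need to compare the result against the string length. The sketch below shows one way to do that with offset 0; the input is illustrative and the variable name stop is arbitrary.

```das
require daslib/regex
require strings

[export]
def main() {
    var re <- regex_compile("[0-9]+")
    let text = "2024x"
    let stop = regex_match(re, text)
    if (stop == -1) {
        print("no match at the start of the string\n")
    } else {
        if (stop == length(text)) {
            print("the whole string matched\n")
        } else {
            print("only a prefix matched, ending at {stop}\n")
        }
    }
}
```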

regex::regex_match_all(regex: Regex; str: string) : array<range>()

Returns an array of all non-overlapping match ranges for the regular expression in str.

Arguments
  • regex : Regex

  • str : string
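Collecting every match range and slicing the text afterwards might look like this. It reuses slice from the strings module in the same way as the example at the top of this page; the input string is made up.

```das
require daslib/regex
require strings

[export]
def main() {
    var re <- regex_compile("[0-9]+")
    let text = "a1 b22 c333"
    // the returned array is moved, not copied
    var matches <- regex_match_all(re, text)
    for (r in matches) {
        // r.x is the start and r.y the end of each match
        print("{r.x}..{r.y}: {slice(text, r.x, r.y)}\n")
    }
}
```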

7.3.6.1. regex_replace

regex::regex_replace(regex: Regex; str: string; blk: block<(at:string):string>) : string()

Replaces each substring matched by the regex with the result returned by the provided block. An overload accepting a template string is also available, supporting $0/$& for the whole match, $1 through $9 for numbered groups, ${name} for named groups, and $$ for a literal $.

Arguments
  • regex : Regex

  • str : string

  • blk : block<(at:string):string>

regex::regex_replace(regex: Regex; str: string; replacement: string) : string()
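The block-based form could be used as sketched below, where each matched substring is wrapped in angle brackets. The trailing-block call style is borrowed from the regex_foreach example earlier on this page; using it on the right-hand side of let is an assumption about the syntax.

```das
require daslib/regex

[export]
def main() {
    var re <- regex_compile("[0-9]+")
    let text = "age 25, height 180"
    // the block receives the matched text and returns its replacement
    let marked = regex_replace(re, text) $(at) {
        return "<{at}>"
    }
    print("{marked}\n")
}
```

The block form is the more flexible of the two overloads, since the replacement can be computed from the matched text; the template form is shorter when a fixed rearrangement of groups is enough.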

regex::regex_search(regex: Regex; str: string; offset: int = 0) : int2()

Searches for the first occurrence of the regular expression anywhere in str, starting from offset. Returns int2(start, end) on success, or int2(-1, -1) if not found. Unlike regex_match, this function scans the entire string.

Arguments
  • regex : Regex

  • str : string

  • offset : int
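A minimal sketch of handling both outcomes of regex_search, using the documented int2(-1, -1) result as the not-found signal. The input string is invented for the example.

```das
require daslib/regex
require strings

[export]
def main() {
    var re <- regex_compile("[0-9]+")
    let text = "order 4521 shipped"
    let hit = regex_search(re, text)
    if (hit.x >= 0) {
        // hit.x is the start and hit.y the end of the first match
        print("first number: {slice(text, hit.x, hit.y)}\n")
    } else {
        print("no digits found\n")
    }
}
```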

regex::regex_split(regex: Regex; str: string) : array<string>()

Splits str by all non-overlapping matches of the regular expression. Returns an array of substrings between matches.

Arguments
  • regex : Regex

  • str : string
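Splitting on a separator pattern might look like the sketch below. The separator set and the input are illustrative, and how empty pieces are handled at the edges of the string is not covered by the description above.

```das
require daslib/regex

[export]
def main() {
    // one or more commas or semicolons act as a single separator
    var re <- regex_compile("[,;]+")
    var parts <- regex_split(re, "a,b;;c")
    for (p in parts) {
        print("[{p}]\n")
    }
}
```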

7.3.7. Generation

regex::re_gen(re: Regex; rnd: ReGenRandom) : string()

Generates a random string that matches the given compiled regex.

Arguments
  • re : Regex

  • rnd : ReGenRandom

regex::re_gen_get_rep_limit() : uint()

Returns the maximum repetition limit used by regex quantifiers during string generation.