5.1.31. Regular Expressions

This tutorial covers daslib/regex and daslib/regex_boost — compiling, matching, and replacing text with regular expressions in daslang.

regex provides the core compiler, matcher, and iterator APIs. regex_boost adds the %regex~ reader macro for compile-time patterns.

5.1.31.1. Compiling and matching

regex_compile creates a Regex from a pattern string. regex_match returns the end position of the match (from position 0), or -1 on failure:

var re <- regex_compile("hello")
let pos = regex_match(re, "hello world")   // 5
let no  = regex_match(re, "goodbye")       // -1

An optional third argument specifies a starting offset:

var re2 <- regex_compile("world")
let pos2 = regex_match(re2, "hello world", 6)   // 11

5.1.31.2. Character classes

Built-in shorthand classes match common character categories:

Escape

Meaning

\w

Word chars [a-zA-Z0-9_]

\W

Non-word chars

\d

Digits [0-9]

\D

Non-digits

\s

Whitespace (space, tab, newline, CR, form-feed, vertical-tab)

\S

Non-whitespace

\t

Tab

\n

Newline

\r

Carriage return

Example:

var re_num <- regex_compile("\\d+")
regex_match(re_num, "12345")    // 5

var re_ws <- regex_compile("\\s+")
regex_match(re_ws, "   x")     // 3

5.1.31.3. Anchors

^ anchors the match to the beginning of the string (or offset position). $ anchors to the end:

var re_start <- regex_compile("^hello")
regex_match(re_start, "hello")       // 5
regex_match(re_start, "say hello")   // -1

var re_full <- regex_compile("^abc$")
regex_match(re_full, "abc")          // 3
regex_match(re_full, "abcd")         // -1

5.1.31.4. Quantifiers

Syntax

Meaning

+

One or more (greedy)

*

Zero or more (greedy)

?

Zero or one

{n}

Exactly n repetitions

{n,}

n or more repetitions (greedy)

{n,m}

Between n and m repetitions (greedy)

var re_plus <- regex_compile("a+")
regex_match(re_plus, "aaa")       // 3

var re_q <- regex_compile("colou?r")
regex_match(re_q, "color")        // 5
regex_match(re_q, "colour")       // 6

Counted quantifiers use braces (escaped as \{ in daslang strings):

var re_exact <- regex_compile("\\d\{4}")
regex_match(re_exact, "1234")     // 4

var re_range <- regex_compile("a\{2,4}")
regex_match(re_range, "a")        // -1
regex_match(re_range, "aaa")      // 3
regex_match(re_range, "aaaaa")    // 4

5.1.31.5. Groups and alternation

Parentheses create capturing groups. | separates alternatives:

var re_alt <- regex_compile("cat|dog")
regex_match(re_alt, "cat")    // 3
regex_match(re_alt, "dog")    // 3

regex_group retrieves group captures after a successful match:

var re_grp <- regex_compile("(\\w+)@(\\w+)")
let inp = "user@host"
regex_match(re_grp, inp)                  // 9
print("{regex_group(re_grp, 1, inp)}\n")  // user
print("{regex_group(re_grp, 2, inp)}\n")  // host

5.1.31.6. Character sets

Square brackets define a set of characters to match:

  • [abc] — matches a, b, or c

  • [a-z] — matches a range

  • [^abc] — negated set (matches anything NOT listed)

  • [\d_] — shorthand classes work inside sets

var re_vowel <- regex_compile("[aeiou]+")
regex_match(re_vowel, "aeiou")    // 5

var re_neg <- regex_compile("[^0-9]+")
regex_match(re_neg, "abc")        // 3

5.1.31.7. Word boundaries

\b matches at a word boundary — the transition between \w and \W characters, or at the start/end of the string. \B matches at a non-boundary position:

var re_bnd <- regex_compile("\\bhello\\b")
regex_match(re_bnd, "hello")       // 5

var re_nb <- regex_compile("\\Bell")
regex_match(re_nb, "hello", 1)     // 4

5.1.31.8. Foreach and replace

regex_foreach iterates over all non-overlapping matches, passing each match range (as int2) to a block. Return true to continue:

var re_num <- regex_compile("\\d+")
regex_foreach(re_num, "a12b34c56") $(at) {
    print("[{at.x},{at.y}] ")   // [1,3] [4,6] [7,9]
    return true
}

regex_replace replaces every match using a block that receives the matched substring and returns the replacement:

let result = regex_replace(re_num, "a12b34c56") $(match_str) {
    return "X"
}
print("{result}\n")   // aXbXcX

5.1.31.9. Escaped metacharacters

Backslash escapes literal metacharacters: \. \+ \* \( \) \[ \] \| \\ \^ \{ \}:

var re_dot <- regex_compile("\\d+\\.\\d+")
regex_match(re_dot, "3.14")       // 4

var re_parens <- regex_compile("\\(\\w+\\)")
regex_match(re_parens, "(hello)")   // 7

5.1.31.10. Hex escapes

\xHH matches a character by its hexadecimal code:

var re_hex <- regex_compile("\\x41")
regex_match(re_hex, "A")    // 1  (0x41 = 'A')

5.1.31.11. Reader macro

regex_boost provides the %regex~ reader macro which compiles a pattern at compile time. No double-escaping is needed — backslashes are literal in the macro body:

require daslib/regex_boost

var re <- %regex~\d+%%
regex_match(re, "42abc")    // 2

var re2 <- %regex~[a-z]+%%
regex_match(re2, "hello")   // 5

5.1.31.12. Search, split, match_all

regex_search finds the first match anywhere in the string (unlike regex_match which only matches at position 0). Returns int2(start, end) or int2(-1, -1):

var re_num <- regex_compile("\\d+")
let pos = regex_search(re_num, "abc 123 def")   // int2(4, 7)

regex_split splits a string by pattern matches:

var re_comma <- regex_compile(",\\s*")
var parts <- regex_split(re_comma, "a, b,c, d")
// parts == ["a", "b", "c", "d"]

regex_match_all collects all match ranges:

var re_word <- regex_compile("\\w+")
var matches <- regex_match_all(re_word, "foo bar baz")
// length(matches) == 3

5.1.31.13. Non-capturing groups

(?:...) groups without creating a capture. Useful for applying quantifiers or alternation without increasing the group count:

var re <- regex_compile("(?:cat|dog)fish")
regex_match(re, "catfish")     // 7
length(re.groups)              // 1 (only group 0)

var re2 <- regex_compile("(?:ab)\{3}")
regex_match(re2, "ababab")     // 6

5.1.31.14. Named groups

(?P<name>...) creates a named capturing group, accessible via regex_group_by_name:

var re <- regex_compile("(?P<user>\\w+)@(?P<host>\\w+)")
let email = "alice@example"
regex_match(re, email)
regex_group_by_name(re, "user", email)   // "alice"
regex_group_by_name(re, "host", email)   // "example"

Named groups are also accessible by their numeric index (1, 2, …) via regex_group.

5.1.31.15. Lazy quantifiers

Greedy quantifiers (+, *, ?, {n,m}) match as much as possible. Appending ? makes them lazy — matching as little as possible:

Greedy

Lazy

Meaning

+

+?

One or more (prefer fewer)

*

*?

Zero or more (prefer fewer)

?

??

Zero or one (prefer zero)

{n,m}

{n,m}?

Counted (prefer n)

// lazy +? takes the shortest match
var re <- regex_compile("<.+?>")
let pos = regex_search(re, "<b>bold</b>")
// pos == int2(0, 3) — matches "<b>" not "<b>bold</b>"

// greedy vs lazy at end of pattern
regex_match(regex_compile("a+"),  "aaa")   // 3 (greedy: all)
regex_match(regex_compile("a+?"), "aaa")   // 1 (lazy: one)

5.1.31.16. Practical examples

The %regex~ reader macro avoids double-escaping, making real-world patterns much more readable.

Phone number validation:

var re_phone <- %regex~^\d{3}-\d{4}$%%
regex_match(re_phone, "555-1234") != -1   // true
regex_match(re_phone, "55-1234")  != -1   // false

Strip non-word characters:

var re_strip <- %regex~[^\w]+%%
let cleaned = regex_replace(re_strip, "he!l@l#o") $(m) {
    return ""
}
// cleaned == "hello"

Extract email parts:

var re_email <- %regex~([\w.]+)@([\w.]+)%%
let email = "user@example.com"
regex_match(re_email, email)
regex_group(re_email, 1, email)   // "user"
regex_group(re_email, 2, email)   // "example.com"

IP address pattern:

var re_ip <- %regex~\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}%%
regex_match(re_ip, "192.168.1.1")   // 11

5.1.31.17. Case-insensitive matching

Pass case_insensitive=true to regex_compile for ASCII case-insensitive matching. Character classes and sets are also affected:

// default: case-sensitive
var re <- regex_compile("hello")
regex_match(re, "HELLO")    // -1

// case-insensitive
var re_ci <- regex_compile("hello", [case_insensitive=true])
regex_match(re_ci, "HELLO")    // 5
regex_match(re_ci, "HeLLo")   // 5

// character sets are also case-insensitive
var re_set <- regex_compile("[a-z]+", [case_insensitive=true])
regex_match(re_set, "AbCdE")   // 5

5.1.31.18. Dot and newline

By default, . matches any character except newline (\n). Pass dot_all=true to regex_compile to make . match newlines too:

// default: '.' does NOT match newline
var re <- regex_compile(".+")
regex_match(re, "ab\nc")     // 2

// dot_all=true: '.' also matches newline
var re_all <- regex_compile(".+", [dot_all=true])
regex_match(re_all, "ab\nc")   // 4

This is useful for multi-line content extraction:

var re <- regex_compile("START(.+?)END", [dot_all=true])
let text = "START\nhello\nEND"
regex_match(re, text)
regex_group(re, 1, text)   // "\nhello\n"

5.1.31.19. Lookahead assertions

Lookahead assertions check what follows the current position without consuming any input.

(?=...) is a positive lookahead — the overall match succeeds only if the lookahead pattern matches:

// "foo" only if followed by "bar"
var re <- regex_compile("foo(?=bar)")
regex_match(re, "foobar")   // 3 (matches "foo", not "foobar")
regex_match(re, "foobaz")   // -1

// extract digits before " dollars"
var re2 <- regex_compile("\\d+(?= dollars)")
let pos = regex_search(re2, "100 dollars")
// pos == int2(0, 3) — matches "100"

(?!...) is a negative lookahead — the match succeeds only if the lookahead pattern does NOT match:

// "foo" only if NOT followed by "bar"
var re <- regex_compile("foo(?!bar)")
regex_match(re, "foobar")   // -1
regex_match(re, "foobaz")   // 3

// single char NOT followed by "!"
var re2 <- regex_compile("\\w(?!!)")
regex_match(re2, "a!")   // -1
regex_match(re2, "a.")   // 1

5.1.31.20. Template-string replace

regex_replace also accepts a replacement template string instead of a block. The template supports group references:

Reference

Meaning

$0

Whole match

$&

Whole match (alternative syntax)

$1$9

Numbered capturing groups

${name}

Named capturing group

$$

Literal $ character

// swap first and last name
var re <- regex_compile("(\\w+) (\\w+)")
regex_replace(re, "John Smith", "$2 $1")     // "Smith John"

// wrap each word in brackets
var re2 <- regex_compile("\\w+")
regex_replace(re2, "hello world", "[$0]")    // "[hello] [world]"

Named group references use ${name} syntax:

var re <- regex_compile("(?P<m>\\d+)/(?P<d>\\d+)/(?P<y>\\d+)")
regex_replace(re, "12/25/2024", "${y}-${m}-${d}")   // "2024-12-25"

5.1.31.21. Reader macro flags

Flags can be appended after a second ~ in the %regex~ reader macro:

Syntax

Effect

%regex~pattern~i%%

Case-insensitive matching

%regex~pattern~s%%

Dot-all mode (. matches \n)

%regex~pattern~is%%

Both flags combined

var re_ci <- %regex~hello~i%%
regex_match(re_ci, "HELLO")   // 5

var re_s <- %regex~.+~s%%
regex_match(re_s, "ab\nc")    // 4

var re_is <- %regex~hello.+world~is%%
regex_match(re_is, "Hello\nWorld")   // 11

See also

Full source: tutorials/language/31_regex.das

JSON tutorial (previous tutorial).

Next tutorial: Operator overloading.

Regular expression library — core regex module reference.

Boost package for REGEX — regex boost module reference.