5.1.31. Regular Expressions
This tutorial covers daslib/regex and daslib/regex_boost — compiling,
matching, and replacing text with regular expressions in daslang.
regex provides the core compiler, matcher, and iterator APIs.
regex_boost adds the %regex~ reader macro for compile-time patterns.
5.1.31.1. Compiling and matching
regex_compile creates a Regex from a pattern string.
regex_match tries to match starting at position 0 and returns the end
position of the match, or -1 on failure:
var re <- regex_compile("hello")
let pos = regex_match(re, "hello world") // 5
let no = regex_match(re, "goodbye") // -1
An optional third argument specifies a starting offset:
var re2 <- regex_compile("world")
let pos2 = regex_match(re2, "hello world", 6) // 11
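Because the returned value is an absolute end position, it can be passed back as the offset of the next call. The sketch below illustrates that idea under a few assumptions: parse_sum_end is a hypothetical helper written for this example, and the file uses the gen2 syntax seen in the other snippets.

```das
options gen2
require daslib/regex

// Hypothetical helper: matches <digits>+<digits> at the start of `text` by
// chaining regex_match calls. Returns the final end position, or -1.
def parse_sum_end(text : string) : int {
    var re_num <- regex_compile("\\d+")
    var re_plus <- regex_compile("\\+")
    let after_lhs = regex_match(re_num, text)             // end of the first number
    if (after_lhs == -1) {
        return -1
    }
    let after_op = regex_match(re_plus, text, after_lhs)  // resume where the number ended
    if (after_op == -1) {
        return -1
    }
    return regex_match(re_num, text, after_op)            // end of the second number
}

[export]
def main() {
    let ok = parse_sum_end("12+34")
    let bad = parse_sum_end("12-34")
    print("{ok} {bad}\n")   // 5 -1
}
```

Each call continues from the position the previous call returned, so no substrings need to be created along the way.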
5.1.31.2. Character classes
Built-in shorthand classes match common character categories:
| Escape | Meaning |
|---|---|
| \w | Word chars |
| \W | Non-word chars |
| \d | Digits |
| \D | Non-digits |
| \s | Whitespace (space, tab, newline, CR, form-feed, vertical-tab) |
| \S | Non-whitespace |
| \t | Tab |
| \n | Newline |
| \r | Carriage return |
Example:
var re_num <- regex_compile("\\d+")
regex_match(re_num, "12345") // 5
var re_ws <- regex_compile("\\s+")
regex_match(re_ws, "   x") // 3
5.1.31.3. Anchors
^ anchors the match to the beginning of the string (or offset position).
$ anchors to the end:
var re_start <- regex_compile("^hello")
regex_match(re_start, "hello") // 5
regex_match(re_start, "say hello") // -1
var re_full <- regex_compile("^abc$")
regex_match(re_full, "abc") // 3
regex_match(re_full, "abcd") // -1
5.1.31.4. Quantifiers
| Syntax | Meaning |
|---|---|
| + | One or more (greedy) |
| * | Zero or more (greedy) |
| ? | Zero or one |
| {n} | Exactly n repetitions |
| {n,} | n or more repetitions (greedy) |
| {n,m} | Between n and m repetitions (greedy) |
var re_plus <- regex_compile("a+")
regex_match(re_plus, "aaa") // 3
var re_q <- regex_compile("colou?r")
regex_match(re_q, "color") // 5
regex_match(re_q, "colour") // 6
Counted quantifiers use braces (escaped as \{ in daslang strings):
var re_exact <- regex_compile("\\d\{4}")
regex_match(re_exact, "1234") // 4
var re_range <- regex_compile("a\{2,4}")
regex_match(re_range, "a") // -1
regex_match(re_range, "aaa") // 3
regex_match(re_range, "aaaaa") // 4
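The open-ended form {n,} from the table can be sketched the same way. This is an illustrative example rather than part of the original tutorial source; digits_prefix_end is a hypothetical helper name, and the file assumes gen2 syntax.

```das
options gen2
require daslib/regex

// Hypothetical helper: end position of a leading run of at least two digits,
// or -1 if `text` starts with fewer than two digits.
def digits_prefix_end(text : string) : int {
    var re <- regex_compile("\\d\{2,}")   // \{ keeps the brace out of string interpolation
    return regex_match(re, text)
}

[export]
def main() {
    let too_short = digits_prefix_end("7")    // -1: only one digit
    let all = digits_prefix_end("12345")      // 5: greedy, takes every digit
    print("{too_short} {all}\n")
}
```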
5.1.31.5. Groups and alternation
Parentheses create capturing groups. | separates alternatives:
var re_alt <- regex_compile("cat|dog")
regex_match(re_alt, "cat") // 3
regex_match(re_alt, "dog") // 3
regex_group retrieves group captures after a successful match:
var re_grp <- regex_compile("(\\w+)@(\\w+)")
let inp = "user@host"
regex_match(re_grp, inp) // 9
print("{regex_group(re_grp, 1, inp)}\n") // user
print("{regex_group(re_grp, 2, inp)}\n") // host
5.1.31.6. Character sets
Square brackets define a set of characters to match:
[abc] matches a, b, or c
[a-z] matches a range
[^abc] is a negated set (matches anything NOT listed)
[\d_] shows that shorthand classes work inside sets
var re_vowel <- regex_compile("[aeiou]+")
regex_match(re_vowel, "aeiou") // 5
var re_neg <- regex_compile("[^0-9]+")
regex_match(re_neg, "abc") // 3
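Ranges and shorthand classes inside sets can be combined with anchors for simple validation. The following is a minimal sketch, not taken from the tutorial source: is_tagged_name is a hypothetical helper, and the pattern (lowercase letters followed by optional digits or underscores) is only an illustration.

```das
options gen2
require daslib/regex

// Hypothetical helper: true if `text` is one or more lowercase letters
// followed by zero or more digits or underscores, and nothing else.
def is_tagged_name(text : string) : bool {
    var re <- regex_compile("^[a-z]+[\\d_]*$")
    return regex_match(re, text) != -1
}

[export]
def main() {
    let a = is_tagged_name("item_42")   // true
    let b = is_tagged_name("Item")      // false: 'I' is outside [a-z]
    print("{a} {b}\n")
}
```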
5.1.31.7. Word boundaries
\b matches at a word boundary — the transition between \w and \W
characters, or at the start/end of the string.
\B matches at a non-boundary position:
var re_bnd <- regex_compile("\\bhello\\b")
regex_match(re_bnd, "hello") // 5
var re_nb <- regex_compile("\\Bell")
regex_match(re_nb, "hello", 1) // 4
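A trailing \b is what rejects a longer word that merely starts with the pattern. The sketch below, which assumes gen2 syntax and uses a hypothetical helper named starts_with_word_hello, contrasts the two cases.

```das
options gen2
require daslib/regex

// Hypothetical helper: true if `text` begins with the whole word "hello".
def starts_with_word_hello(text : string) : bool {
    var re <- regex_compile("\\bhello\\b")
    return regex_match(re, text) != -1
}

[export]
def main() {
    let whole = starts_with_word_hello("hello world")   // true: a space follows "hello"
    let glued = starts_with_word_hello("helloworld")    // false: 'w' follows, no boundary
    print("{whole} {glued}\n")
}
```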
5.1.31.8. Foreach and replace
regex_foreach iterates over all non-overlapping matches, passing each
match range (as int2) to a block. Return true to continue:
var re_num <- regex_compile("\\d+")
regex_foreach(re_num, "a12b34c56") $(at) {
print("[{at.x},{at.y}] ") // [1,3] [4,6] [7,9]
return true
}
regex_replace replaces every match using a block that receives the
matched substring and returns the replacement:
let result = regex_replace(re_num, "a12b34c56") $(match_str) {
return "X"
}
print("{result}\n") // aXbXcX
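The block can also build the replacement from the matched text instead of ignoring it. This is a small sketch under the same assumptions as the snippet above (gen2 syntax); bracket_numbers is a hypothetical helper introduced only for illustration.

```das
options gen2
require daslib/regex

// Hypothetical helper: wraps every run of digits in square brackets.
def bracket_numbers(text : string) : string {
    var re_num <- regex_compile("\\d+")
    let result = regex_replace(re_num, text) $(match_str) {
        return "[{match_str}]"   // reuse the matched substring in the replacement
    }
    return result
}

[export]
def main() {
    let s = bracket_numbers("a12b34c56")
    print("{s}\n")   // a[12]b[34]c[56]
}
```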
5.1.31.9. Escaped metacharacters
Escape a metacharacter with a backslash to match it literally
(\. \+ \* \( \) \[ \] \| \\ \^ \{ \}):
var re_dot <- regex_compile("\\d+\\.\\d+")
regex_match(re_dot, "3.14") // 4
var re_parens <- regex_compile("\\(\\w+\\)")
regex_match(re_parens, "(hello)") // 7
5.1.31.10. Hex escapes
\xHH matches a character by its hexadecimal code:
var re_hex <- regex_compile("\\x41")
regex_match(re_hex, "A") // 1 (0x41 = 'A')
5.1.31.11. Reader macro
regex_boost provides the %regex~ reader macro which compiles
a pattern at compile time. No double-escaping is needed — backslashes
are literal in the macro body:
require daslib/regex_boost
var re <- %regex~\d+%%
regex_match(re, "42abc") // 2
var re2 <- %regex~[a-z]+%%
regex_match(re2, "hello") // 5
5.1.31.12. Search, split, match_all
regex_search finds the first match anywhere in the string (unlike
regex_match, which only matches at the starting position). It returns
int2(start, end), or int2(-1, -1) if there is no match:
var re_num <- regex_compile("\\d+")
let pos = regex_search(re_num, "abc 123 def") // int2(4, 7)
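To make the difference from regex_match concrete, the sketch below runs both functions on the same input. It assumes gen2 syntax, and first_number_start is a hypothetical helper, not part of daslib/regex.

```das
options gen2
require daslib/regex

// Hypothetical helper: start index of the first run of digits, or -1.
def first_number_start(text : string) : int {
    var re_num <- regex_compile("\\d+")
    let at = regex_search(re_num, text)   // int2(start, end) or int2(-1, -1)
    return at.x
}

[export]
def main() {
    var re_num <- regex_compile("\\d+")
    let text = "abc 123 def"
    let anchored = regex_match(re_num, text)   // -1: no digits at position 0
    let start = first_number_start(text)       // 4: search scans forward
    print("{anchored} {start}\n")
}
```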
regex_split splits a string by pattern matches:
var re_comma <- regex_compile(",\\s*")
var parts <- regex_split(re_comma, "a, b,c, d")
// parts == ["a", "b", "c", "d"]
regex_match_all collects all match ranges:
var re_word <- regex_compile("\\w+")
var matches <- regex_match_all(re_word, "foo bar baz")
// length(matches) == 3
5.1.31.13. Non-capturing groups
(?:...) groups without creating a capture. Useful for applying
quantifiers or alternation without increasing the group count:
var re <- regex_compile("(?:cat|dog)fish")
regex_match(re, "catfish") // 7
length(re.groups) // 1 (only group 0)
var re2 <- regex_compile("(?:ab)\{3}")
regex_match(re2, "ababab") // 6
5.1.31.14. Named groups
(?P<name>...) creates a named capturing group, accessible via
regex_group_by_name:
var re <- regex_compile("(?P<user>\\w+)@(?P<host>\\w+)")
let email = "alice@example"
regex_match(re, email)
regex_group_by_name(re, "user", email) // "alice"
regex_group_by_name(re, "host", email) // "example"
Named groups are also accessible by their numeric index (1, 2, …)
via regex_group.
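A short sketch of that equivalence, assuming gen2 syntax; user_matches_by_index is a hypothetical helper that compares the two access paths for the first group.

```das
options gen2
require daslib/regex

// Hypothetical helper: true if numeric group 1 and the named group "user"
// yield the same text after a successful match.
def user_matches_by_index(email : string) : bool {
    var re <- regex_compile("(?P<user>\\w+)@(?P<host>\\w+)")
    if (regex_match(re, email) == -1) {
        return false
    }
    return regex_group(re, 1, email) == regex_group_by_name(re, "user", email)
}

[export]
def main() {
    let same = user_matches_by_index("alice@example")
    print("{same}\n")   // true
}
```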
5.1.31.15. Lazy quantifiers
Greedy quantifiers (+, *, ?, {n,m}) match as much as possible.
Appending ? makes them lazy — matching as little as possible:
| Greedy | Lazy | Meaning |
|---|---|---|
| + | +? | One or more (prefer fewer) |
| * | *? | Zero or more (prefer fewer) |
| ? | ?? | Zero or one (prefer zero) |
| {n,m} | {n,m}? | Counted (prefer n) |
// lazy +? takes the shortest match
var re <- regex_compile("<.+?>")
let pos = regex_search(re, "<b>bold</b>")
// pos == int2(0, 3) — matches "<b>" not "<b>bold</b>"
// greedy vs lazy at end of pattern
regex_match(regex_compile("a+"), "aaa") // 3 (greedy: all)
regex_match(regex_compile("a+?"), "aaa") // 1 (lazy: one)
5.1.31.16. Practical examples
The %regex~ reader macro avoids double-escaping, making real-world
patterns much more readable.
Phone number validation:
var re_phone <- %regex~^\d{3}-\d{4}$%%
regex_match(re_phone, "555-1234") != -1 // true
regex_match(re_phone, "55-1234") != -1 // false
Strip non-word characters:
var re_strip <- %regex~[^\w]+%%
let cleaned = regex_replace(re_strip, "he!l@l#o") $(m) {
return ""
}
// cleaned == "hello"
Extract email parts:
var re_email <- %regex~([\w.]+)@([\w.]+)%%
let email = "user@example.com"
regex_match(re_email, email)
regex_group(re_email, 1, email) // "user"
regex_group(re_email, 2, email) // "example.com"
IP address pattern:
var re_ip <- %regex~\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}%%
regex_match(re_ip, "192.168.1.1") // 11
5.1.31.17. Case-insensitive matching
Pass case_insensitive=true to regex_compile for ASCII case-insensitive
matching. Character classes and sets are also affected:
// default: case-sensitive
var re <- regex_compile("hello")
regex_match(re, "HELLO") // -1
// case-insensitive
var re_ci <- regex_compile("hello", [case_insensitive=true])
regex_match(re_ci, "HELLO") // 5
regex_match(re_ci, "HeLLo") // 5
// character sets are also case-insensitive
var re_set <- regex_compile("[a-z]+", [case_insensitive=true])
regex_match(re_set, "AbCdE") // 5
5.1.31.18. Dot and newline
By default, . matches any character except newline (\n).
Pass dot_all=true to regex_compile to make . match newlines too:
// default: '.' does NOT match newline
var re <- regex_compile(".+")
regex_match(re, "ab\nc") // 2
// dot_all=true: '.' also matches newline
var re_all <- regex_compile(".+", [dot_all=true])
regex_match(re_all, "ab\nc") // 4
This is useful for multi-line content extraction:
var re <- regex_compile("START(.+?)END", [dot_all=true])
let text = "START\nhello\nEND"
regex_match(re, text)
regex_group(re, 1, text) // "\nhello\n"
5.1.31.19. Lookahead assertions
Lookahead assertions check what follows the current position without consuming any input.
(?=...) is a positive lookahead — the overall match succeeds only if
the lookahead pattern matches:
// "foo" only if followed by "bar"
var re <- regex_compile("foo(?=bar)")
regex_match(re, "foobar") // 3 (matches "foo", not "foobar")
regex_match(re, "foobaz") // -1
// extract digits before " dollars"
var re2 <- regex_compile("\\d+(?= dollars)")
let pos = regex_search(re2, "100 dollars")
// pos == int2(0, 3) — matches "100"
(?!...) is a negative lookahead — the match succeeds only if the
lookahead pattern does NOT match:
// "foo" only if NOT followed by "bar"
var re <- regex_compile("foo(?!bar)")
regex_match(re, "foobar") // -1
regex_match(re, "foobaz") // 3
// single char NOT followed by "!"
var re2 <- regex_compile("\\w(?!!)")
regex_match(re2, "a!") // -1
regex_match(re2, "a.") // 1
5.1.31.20. Template-string replace
regex_replace also accepts a replacement template string instead of a
block. The template supports group references:
| Reference | Meaning |
|---|---|
| $0 | Whole match |
| $& | Whole match (alternative syntax) |
| $1, $2, … | Numbered capturing groups |
| ${name} | Named capturing group |
| $$ | Literal $ |
// swap first and last name
var re <- regex_compile("(\\w+) (\\w+)")
regex_replace(re, "John Smith", "$2 $1") // "Smith John"
// wrap each word in brackets
var re2 <- regex_compile("\\w+")
regex_replace(re2, "hello world", "[$0]") // "[hello] [world]"
Named group references use ${name} syntax:
var re <- regex_compile("(?P<m>\\d+)/(?P<d>\\d+)/(?P<y>\\d+)")
regex_replace(re, "12/25/2024", "${y}-${m}-${d}") // "2024-12-25"
5.1.31.21. Reader macro flags
Flags can be appended after a second ~ in the %regex~ reader macro:
| Syntax | Effect |
|---|---|
| ~i | Case-insensitive matching |
| ~s | Dot-all mode (. matches newline) |
| ~is | Both flags combined |
var re_ci <- %regex~hello~i%%
regex_match(re_ci, "HELLO") // 5
var re_s <- %regex~.+~s%%
regex_match(re_s, "ab\nc") // 4
var re_is <- %regex~hello.+world~is%%
regex_match(re_is, "Hello\nWorld") // 11
See also
Full source: tutorials/language/31_regex.das
JSON tutorial (previous tutorial).
Next tutorial: Operator overloading.
Regular expression library — core regex module reference.
Boost package for REGEX — regex boost module reference.