3.4. UTF-8 utilities
The UTF8_UTILS module provides Unicode UTF-8 string utilities including character iteration, codepoint extraction, byte length calculation, and validation of UTF-8 encoded text.
All functions and symbols are in the “utf8_utils” module; use require to get access to it.
require daslib/utf8_utils
3.4.1. Constants
- utf8_utils::s_utf8d = fixed_array<uint>
Byte-class and state-transition table for the UTF-8 DFA decoder.
- utf8_utils::UTF8_ACCEPT = 0x0
DFA accept state indicating a valid UTF-8 sequence.
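To illustrate how a decoder driven by an accept state usually works, here is a minimal Python sketch. It is not the module's implementation and does not reproduce the contents of s_utf8d: the transitions are written as plain conditions instead of a lookup table, and the names UTF8_REJECT, decode_step and decode_utf8_sketch are hypothetical. The sketch is deliberately simplified and does not reject overlong three- and four-byte forms, surrogates, or values above 0x10FFFF.

```python
UTF8_ACCEPT = 0   # plays the same role as utf8_utils::UTF8_ACCEPT
UTF8_REJECT = -1  # hypothetical reject state, not taken from the module


def decode_step(state, codepoint, byte):
    """Feed one byte to the state machine; return (new_state, partial_codepoint)."""
    if state == UTF8_ACCEPT:
        if byte < 0x80:                 # single-byte (ASCII) character
            return UTF8_ACCEPT, byte
        if 0xC2 <= byte <= 0xDF:        # lead byte, 1 continuation byte follows
            return 1, byte & 0x1F
        if 0xE0 <= byte <= 0xEF:        # lead byte, 2 continuation bytes follow
            return 2, byte & 0x0F
        if 0xF0 <= byte <= 0xF4:        # lead byte, 3 continuation bytes follow
            return 3, byte & 0x07
        return UTF8_REJECT, 0
    if state > 0 and 0x80 <= byte <= 0xBF:
        # Continuation byte: shift in 6 more bits, one fewer byte remaining.
        return state - 1, (codepoint << 6) | (byte & 0x3F)
    return UTF8_REJECT, 0


def decode_utf8_sketch(data):
    """Decode bytes to a list of codepoints, raising ValueError on bad input."""
    state, codepoint, out = UTF8_ACCEPT, 0, []
    for byte in data:
        state, codepoint = decode_step(state, codepoint, byte)
        if state == UTF8_REJECT:
            raise ValueError("invalid UTF-8 byte")
        if state == UTF8_ACCEPT:
            out.append(codepoint)       # a complete character was decoded
    if state != UTF8_ACCEPT:
        raise ValueError("truncated UTF-8 sequence")
    return out
```

A complete codepoint is available exactly when the state returns to the accept state, which is why the accept constant is exposed alongside the table.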
3.4.2. Encoding and decoding
utf8_decode (var dest_utf32_string: array<uint>; source_utf8_string: string)
utf8_decode (source_utf8_string: array<uint8>) : array<uint>
utf8_decode (var dest_utf32_string: array<uint>; source_utf8_string: array<uint8>)
utf8_encode (var dest_array: array<uint8>; source_utf32_string: array<uint>)
utf8_encode (source_utf32_string: array<uint>) : array<uint8>
- utf8_utils::decode_unicode_escape(str: string) : string()
Decodes Unicode escape sequences (backslash followed by hex digits) in a string to UTF-8.
- Arguments
str : string
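The description above does not spell out the exact escape syntax. The Python sketch below assumes the common form of a backslash, the letter u, and exactly four hex digits; that assumption, and the name decode_escapes_sketch, are illustrative only and may not match what decode_unicode_escape accepts. It does not combine surrogate pairs.

```python
import re

# Assumed escape form: backslash, 'u', four hex digits (e.g. \u00e9).
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_escapes_sketch(text):
    """Replace each assumed \\uXXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Encoding the result with `.encode("utf-8")` yields the UTF-8 bytes for the decoded text.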
- utf8_utils::utf16_to_utf32(high: uint; low: uint) : uint()
Converts a UTF-16 surrogate pair to a single UTF-32 codepoint.
- Arguments
high : uint
low : uint
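The standard surrogate-pair arithmetic can be sketched in Python as follows. This shows the UTF-16 rule itself, not the module's source; the function name and the range checks are assumptions, and utf16_to_utf32 may or may not validate its inputs.

```python
def surrogates_to_codepoint(high, low):
    """Combine a UTF-16 high/low surrogate pair into one codepoint."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("high surrogate out of range")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("low surrogate out of range")
    # Each surrogate carries 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, the pair 0xD83D, 0xDE00 combines to 0x1F600.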
3.4.2.1. utf8_decode
- utf8_utils::utf8_decode(source_utf8_string: string) : array<uint>()
Converts a UTF-8 string to UTF-32 and returns it as an array of codepoints (a UTF-32 string).
- Arguments
source_utf8_string : string
- utf8_utils::utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: string)
- utf8_utils::utf8_decode(source_utf8_string: array<uint8>) : array<uint>()
- utf8_utils::utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: array<uint8>)
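As a point of comparison, the same UTF-8 to codepoint-array conversion can be sketched in Python using its built-in decoder. This only illustrates the expected shape of the result (one integer per character, regardless of how many bytes it occupied); the function name is hypothetical and error handling in utf8_decode may differ.

```python
def utf8_to_codepoints(data):
    """Decode UTF-8 bytes and return the codepoints as a list of ints."""
    return [ord(ch) for ch in data.decode("utf-8")]


sample = "aé€😀".encode("utf-8")     # characters of 1, 2, 3 and 4 bytes
codepoints = utf8_to_codepoints(sample)
```

Here `sample` is 10 bytes long while `codepoints` has 4 entries.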
3.4.2.2. utf8_encode
- utf8_utils::utf8_encode(dest_array: array<uint8>; source_utf32_string: array<uint>)
Converts a UTF-32 string to UTF-8 and appends it to the UTF-8 byte array.
- Arguments
dest_array : array<uint8>
source_utf32_string : array<uint> implicit
- utf8_utils::utf8_encode(dest_array: array<uint8>; ch: uint)
- utf8_utils::utf8_encode(ch: uint) : array<uint8>()
- utf8_utils::utf8_encode(source_utf32_string: array<uint>) : array<uint8>()
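The bit layout behind encoding a single codepoint, and appending a sequence of them to a byte buffer, can be sketched in Python like this. It follows the standard UTF-8 rules rather than the module's source; the function names are hypothetical, and whether utf8_encode rejects surrogates or out-of-range values is not stated above.

```python
def encode_codepoint(cp):
    """Encode one codepoint as 1 to 4 UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def append_encoded(dest, codepoints):
    """Append the UTF-8 encoding of each codepoint to a bytearray."""
    for cp in codepoints:
        dest += encode_codepoint(cp)
```

The two helpers loosely mirror the single-character and array overloads listed above: one produces the bytes for a single codepoint, the other appends to an existing destination buffer.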
3.4.3. Length and measurement
3.4.4. Validation
3.4.4.1. contains_utf8_bom
- utf8_utils::contains_utf8_bom(utf8_string: array<uint8>) : bool()
Returns true if the byte array starts with a UTF-8 BOM (byte order mark).
- Arguments
utf8_string : array<uint8> implicit
- utf8_utils::contains_utf8_bom(utf8_string: string) : bool()
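The UTF-8 BOM is the three-byte sequence EF BB BF (the encoding of U+FEFF). A Python sketch of the check, with a hypothetical function name, could look like this:

```python
import codecs

UTF8_BOM = b"\xef\xbb\xbf"  # same bytes as codecs.BOM_UTF8


def has_utf8_bom(data):
    """Return True if the byte string starts with the UTF-8 BOM."""
    return data[:3] == UTF8_BOM
```

Inputs shorter than three bytes can never match, so the slice comparison also covers empty and truncated data.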
- utf8_utils::is_first_byte_of_utf8_char(ch: uint8) : bool()
Returns true if the given byte is the first byte of a UTF-8 character.
- Arguments
ch : uint8
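In UTF-8, continuation bytes always have the bit pattern 10xxxxxx, so the usual first-byte test is a single mask. The Python sketch below shows that test and one typical use, counting characters in a byte string; the names are hypothetical, and the module's function may treat invalid bytes differently.

```python
def is_first_byte(byte):
    """True unless the byte is a continuation byte (bit pattern 10xxxxxx)."""
    return (byte & 0xC0) != 0x80


def count_characters(data):
    """Count characters in well-formed UTF-8 by counting first bytes."""
    return sum(1 for byte in data if is_first_byte(byte))
```

Counting first bytes gives the character count only for well-formed input; it performs no validation.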