3.4. UTF-8 utilities

The UTF8_UTILS module provides Unicode UTF-8 string utilities including character iteration, codepoint extraction, byte length calculation, and validation of UTF-8 encoded text.

All functions and symbols are in the “utf8_utils” module; use require to get access to it.

require daslib/utf8_utils

3.4.1. Constants

utf8_utils::s_utf8d = fixed_array<uint>

Byte-class and state-transition table for the UTF-8 DFA decoder.

utf8_utils::UTF8_ACCEPT = 0x0

DFA accept state indicating a valid UTF-8 sequence.

3.4.2. Encoding and decoding

utf8_utils::decode_unicode_escape(str: string) : string()

Decodes Unicode escape sequences (backslash followed by hex digits) in a string to UTF-8.

Arguments
  • str : string

utf8_utils::utf16_to_utf32(high: uint; low: uint) : uint()

Converts a UTF-16 surrogate pair to a single UTF-32 codepoint.

Arguments
  • high : uint

  • low : uint
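
A minimal usage sketch, assuming the gen2 (brace) syntax and the built-in print function; the surrogate values are only an illustration. It combines a high and a low surrogate into one codepoint:

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // 0xD83D (high) and 0xDE00 (low) form the UTF-16 surrogate pair for U+1F600.
    let high = uint(0xD83D)
    let low = uint(0xDE00)
    let codepoint = utf16_to_utf32(high, low)
    print("codepoint = {codepoint}\n")
}
```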

3.4.2.1. utf8_decode

utf8_utils::utf8_decode(source_utf8_string: string) : array<uint>()

Converts a UTF-8 string to UTF-32 and returns it as an array of codepoints (a UTF-32 string).

Arguments
  • source_utf8_string : string

utf8_utils::utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: string)
utf8_utils::utf8_decode(source_utf8_string: array<uint8>) : array<uint>()
utf8_utils::utf8_decode(dest_utf32_string: array<uint>; source_utf8_string: array<uint8>)
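
As a hedged illustration of the returning overload, the sketch below decodes a short literal and prints each codepoint. It assumes gen2 syntax and a source file saved as UTF-8; the sample text is arbitrary.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    let text = "héllo"
    // Decode into a new array holding one uint per codepoint.
    var codepoints <- utf8_decode(text)
    for (cp in codepoints) {
        print("{cp}\n")
    }
}
```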

3.4.2.2. utf8_encode

utf8_utils::utf8_encode(dest_array: array<uint8>; source_utf32_string: array<uint>)

Converts a UTF-32 string to UTF-8 and appends the result to the dest_array byte array.

Arguments
  • dest_array : array<uint8>

  • source_utf32_string : array<uint> implicit

utf8_utils::utf8_encode(dest_array: array<uint8>; ch: uint)
utf8_utils::utf8_encode(ch: uint) : array<uint8>()
utf8_utils::utf8_encode(source_utf32_string: array<uint>) : array<uint8>()
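
The following sketch shows one possible round trip: decode a string to codepoints, then encode them into a byte array with the appending overload. It assumes gen2 syntax and the built-in length and print functions; the sample text is arbitrary.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    var codepoints <- utf8_decode("día")
    var bytes : array<uint8>
    // This overload appends the encoded bytes to the existing array.
    utf8_encode(bytes, codepoints)
    print("{length(codepoints)} codepoints, {length(bytes)} bytes\n")
}
```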

3.4.3. Length and measurement

3.4.3.1. utf8_length

utf8_utils::utf8_length(utf8_string: string) : int()

Returns the number of characters in the UTF-8 string.

Arguments
  • utf8_string : string

utf8_utils::utf8_length(utf8_string: array<uint8>) : int()
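
A small sketch, assuming gen2 syntax and that the built-in length of a string reports its size in bytes. It prints both measurements so that they can be compared for text containing a non-ASCII character:

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    let text = "héllo"
    // utf8_length counts characters; the built-in length counts bytes.
    print("characters: {utf8_length(text)}\n")
    print("bytes: {length(text)}\n")
}
```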

3.4.4. Validation

3.4.4.1. contains_utf8_bom

utf8_utils::contains_utf8_bom(utf8_string: array<uint8>) : bool()

Returns true if the byte array starts with a UTF-8 BOM (byte order mark).

Arguments
  • utf8_string : array<uint8> implicit

utf8_utils::contains_utf8_bom(utf8_string: string) : bool()
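
As an illustrative sketch (gen2 syntax assumed), the byte array below is built by hand so that it begins with the three BOM bytes EF BB BF, and is then passed to the array overload:

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    var bytes : array<uint8>
    // EF BB BF is the UTF-8 encoding of the byte order mark U+FEFF.
    bytes |> push(uint8(0xEF))
    bytes |> push(uint8(0xBB))
    bytes |> push(uint8(0xBF))
    bytes |> push(uint8(0x41)) // 'A'
    print("has BOM: {contains_utf8_bom(bytes)}\n")
}
```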

utf8_utils::is_first_byte_of_utf8_char(ch: uint8) : bool()

Returns true if the given byte is the first byte of a UTF-8 character.

Arguments
  • ch : uint8
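
A hedged sketch of how this check might be used to walk a byte array: it encodes one codepoint with utf8_encode and reports, for each resulting byte, whether it starts a character. The gen2 syntax and the choice of codepoint are assumptions made for the example.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    // U+20AC (euro sign) needs more than one byte in UTF-8.
    var bytes <- utf8_encode(uint(0x20AC))
    for (b in bytes) {
        print("{b}: {is_first_byte_of_utf8_char(b)}\n")
    }
}
```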

3.4.4.2. is_utf8_string_valid

utf8_utils::is_utf8_string_valid(utf8_string: array<uint8>) : bool()

Returns true if the byte array contains a valid UTF-8 encoded string.

Arguments
  • utf8_string : array<uint8> implicit

utf8_utils::is_utf8_string_valid(utf8_string: string) : bool()
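
A minimal sketch, assuming gen2 syntax, that validates a byte array before and after an invalid byte is appended. The byte 0xFF is used because it never occurs in well-formed UTF-8.

```das
options gen2
require daslib/utf8_utils

[export]
def main() {
    var bytes : array<uint8>
    bytes |> push(uint8(0x41)) // 'A', plain ASCII
    print("ascii only: {is_utf8_string_valid(bytes)}\n")
    bytes |> push(uint8(0xFF)) // 0xFF never appears in well-formed UTF-8
    print("with 0xFF: {is_utf8_string_valid(bytes)}\n")
}
```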