utf8

This library provides basic support for UTF-8 encoding. It does not provide any support for Unicode beyond the handling of the encoding: any operation that needs the meaning of a character, such as character classification, is outside its scope.

Unless stated otherwise, all functions that expect a byte position as a parameter assume that the given position is either the start of a byte sequence or one plus the length of the subject string. As in the string library, negative indices count from the end of the string.

You can find a large catalog of usable UTF-8 characters here.

Functions

string utf8.char ( Tuple codepoints )
Receives zero or more codepoints as integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
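As a minimal sketch of the behavior described above (the codepoints are arbitrary sample values, not taken from this page), the following builds a string from four codepoints whose encodings take one, two, three, and four bytes:

```lua
-- U+0048 "H", U+00E9 "é", U+20AC "€", U+1F600 (an emoji)
local s = utf8.char(72, 233, 8364, 128512)

-- The byte length is the sum of the encoded lengths: 1 + 2 + 3 + 4.
print(#s)  -- 10

-- With no arguments there is nothing to concatenate.
local empty = utf8.char()
print(#empty)  -- 0
```

Note that `#s` counts bytes, not characters, which is why the result is 10 rather than 4.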

function, string, int utf8.codes ( string str )
Returns an iterator function so that the construction

    for position, codepoint in utf8.codes(str) do
        -- body
    end

will iterate over all codepoints in string `str`, where `position` is the byte position at which each codepoint starts. It raises an error if it meets any invalid byte sequence.
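A short sketch of the iteration, using an arbitrary sample string written with byte escapes so that the encoded lengths are visible. The `pcall` wrapper at the end is one possible way to handle the error raised on invalid input; it is an illustration, not a recommendation from this page.

```lua
-- "a" (1 byte), "é" (2 bytes), "€" (3 bytes)
local str = "a\xC3\xA9\xE2\x82\xAC"

local positions, codepoints = {}, {}
for position, codepoint in utf8.codes(str) do
  positions[#positions + 1] = position
  codepoints[#codepoints + 1] = codepoint
end

print(table.concat(positions, " "))   -- 1 2 4
print(table.concat(codepoints, " "))  -- 97 233 8364

-- \xFF can never appear in valid UTF-8, so the iteration raises an error.
local ok = pcall(function()
  for _ in utf8.codes("a\xFFb") do end
end)
print(ok)  -- false
```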

Tuple utf8.codepoint ( string str, int i = 1, int j = i )
Returns the codepoints (as integers) of all characters in `str` that start between byte positions i and j (both inclusive). The default for i is 1 and for j is i. It raises an error if it meets any invalid byte sequence.
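The following sketch shows the defaults and the multiple return values on a sample string (the string itself is an assumption chosen for illustration):

```lua
-- "héllo": h(byte 1) é(bytes 2-3) l(4) l(5) o(6)
local str = "h\xC3\xA9llo"

-- With the defaults (i = 1, j = i) only the first codepoint is returned.
local first = utf8.codepoint(str)
print(first)  -- 104

-- Characters starting at bytes 1, 2 and 4 fall inside the range 1..4.
local a, b, c = utf8.codepoint(str, 1, 4)
print(a, b, c)  -- 104 233 108

-- Negative indices count from the end of the string.
local last = utf8.codepoint(str, -1)
print(last)  -- 111

-- Invalid input raises an error, which pcall can capture.
local ok = pcall(utf8.codepoint, "\xFF")
print(ok)  -- false
```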

int utf8.len ( string s, int i = 1, int j = -1 )
Returns the number of UTF-8 codepoints in the string s that start between byte positions i and j (both inclusive). The default for i is 1 and for j is -1. If it finds any invalid byte sequence, it returns nil plus the position of the first invalid byte.
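A brief sketch contrasting the codepoint count with the byte length, and showing the two return values on invalid input (sample strings are assumptions for illustration):

```lua
-- "héllo": 5 characters stored in 6 bytes
local s = "h\xC3\xA9llo"

local count = utf8.len(s)
print(count, #s)  -- 5 6

-- Count only the characters that start at byte 4 or later ("l", "l", "o").
local tail = utf8.len(s, 4)
print(tail)  -- 3

-- On invalid input the first result is nil and the second is the
-- position of the offending byte, so no error is raised.
local n, badpos = utf8.len("ab\xFFcd")
print(n, badpos)  -- nil 3
```

Because failure is reported through return values rather than an error, `utf8.len` can also serve as a simple validity check.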

int utf8.offset ( string s, int n, int i = 1 )
Returns the position (in bytes) where the encoding of the n-th codepoint of s (counting from byte position i) starts. A negative n gets characters before position i. The default for i is 1 when n is non-negative and #s + 1 otherwise, so that utf8.offset(s, -n) gets the offset of the n-th character from the end of the string. If the specified character is neither in the subject nor right after its end, the function returns nil.
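The sketch below walks through the cases described above on a sample string, then uses two offsets to extract a single character by its index. The helper pattern at the end is an illustration, not part of the library.

```lua
-- "héllo": h(byte 1) é(bytes 2-3) l(4) l(5) o(6)
local s = "h\xC3\xA9llo"

local third = utf8.offset(s, 3)      -- third character starts at byte 4
local lastpos = utf8.offset(s, -1)   -- last character starts at byte 6
local fourthFromEnd = utf8.offset(s, -4)  -- "é" starts at byte 2
local pastEnd = utf8.offset(s, 6)    -- right after the end: #s + 1
local missing = utf8.offset(s, 7)    -- no such character
print(third, lastpos, fourthFromEnd, pastEnd, missing)  -- 4 6 2 7 nil

-- Extract the second character: from its start to just before the third.
local second = s:sub(utf8.offset(s, 2), utf8.offset(s, 3) - 1)
print(#second)  -- 2
```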

function utf8.graphemes ( string str, number i, number j )
Returns an iterator function so that

    for first, last in utf8.graphemes(str) do
        local grapheme = str:sub(first, last)
        -- body
    end

will iterate over the grapheme clusters of the string.
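A hedged sketch of grapheme iteration follows. `utf8.graphemes` is not part of the standard Lua `utf8` library, so the example guards the call and only runs the loop in runtimes that provide the function. The sample string (an "e" followed by a combining acute accent, then "a") is an assumption chosen to show two codepoints forming one cluster.

```lua
-- "e" + U+0301 (combining acute accent, bytes CC 81) + "a"
local str = "e\xCC\x81a"

local clusters = {}
if utf8.graphemes then
  for first, last in utf8.graphemes(str) do
    clusters[#clusters + 1] = str:sub(first, last)
  end
end

-- Where the function exists, the accent is expected to stay attached
-- to its base letter, giving two clusters for three codepoints.
print(#clusters)
```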

string utf8.nfcnormalize ( string str )
Converts the input string to Normalization Form C (NFC), which tries to convert decomposed characters into composed characters.

string utf8.nfdnormalize ( string str )
Converts the input string to Normalization Form D (NFD), which tries to break up composed characters into decomposed characters.
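To make the difference between the two forms concrete, here is a guarded sketch. The normalization functions are not part of the standard Lua `utf8` library, so the calls only run where they exist. The sample uses "é", which Unicode defines both as the single codepoint U+00E9 and as "e" followed by the combining acute accent U+0301.

```lua
local composed = "\xC3\xA9"     -- U+00E9, one codepoint
local decomposed = "e\xCC\x81"  -- U+0065 U+0301, two codepoints

-- The two spellings render identically but are different byte strings.
print(composed == decomposed)  -- false

local nfc_ok, nfd_ok
if utf8.nfcnormalize and utf8.nfdnormalize then
  -- Normalizing both sides to the same form makes them comparable.
  nfc_ok = utf8.nfcnormalize(decomposed) == composed
  nfd_ok = utf8.nfdnormalize(composed) == decomposed
  print(nfc_ok, nfd_ok)
end
```

Normalizing user input to one form before comparing or storing it is a common reason to reach for these functions.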

Constants

string utf8.charpattern
The pattern `"[%z\x01-\x7F\xC2-\xF4][\x80-\xBF]*"`, which matches exactly one UTF-8 byte sequence, assuming that the subject is a valid UTF-8 string.
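Since the pattern matches one encoded character at a time, it can be combined with `string.gmatch` to split a string into characters. A minimal sketch on a sample string (assumed to be valid UTF-8, as the pattern requires):

```lua
-- "héllo": 5 characters stored in 6 bytes
local s = "h\xC3\xA9llo"

local chars = {}
for ch in s:gmatch(utf8.charpattern) do
  chars[#chars + 1] = ch
end

print(#chars)      -- 5
print(#chars[2])   -- 2 (the two bytes of "é")
```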