How to use UTF8 strings with Luart
By Samir Tine, updated in January 2023
Luart has native support for UTF8 strings, meaning you don't have to bother with specific modules or dependencies: just use accented characters in your Lua strings, and you're done! In fact, things are more complex under the hood...
Lua strings are not strings!
Yes, you read that right. It could have been worthy of a Shakespeare play, but strings in Lua can turn into a nightmare if you don't pay attention to the true definition of what a string is...
Wait, what do you mean by "strings are not strings"?
In Lua, strings could have been called buffers, arrays, or even containers. Indeed, strings are only containers, but they do not contain characters. In fact, Lua has no idea what a character is.
So what does a string contain?
You may then wonder what strings contain: they simply contain bytes. In Lua, strings can therefore hold lots of things: an image, a digitized sound, a database... and characters too.
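As a small illustration of the "strings are byte containers" idea, the sketch below uses only standard Lua (the variable name is chosen just for this example). It stores four arbitrary bytes, including a zero byte, in a string and reads them back as numbers.

```lua
-- A string built from four arbitrary byte values (decimal escapes),
-- none of which needs to be a printable character.
local data = "\0\255\16\32"

-- The # operator counts bytes.
print(#data)                      --> 4

-- string.byte returns the numeric value of each byte in the range.
print(string.byte(data, 1, -1))   --> 0 255 16 32 (tab-separated)
```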
This is where things get complicated: Lua considers that in strings, a single byte corresponds to a single character. This is fine as long as you are using single-byte encoded characters (as with ASCII encoding), which leaves at most 256 possible characters (ASCII itself defines only 128).
But in the age of the Internet, when the whole world communicates in all languages, that seems rather restrictive! Fortunately, encodings other than ASCII exist to extend the number of usable characters: UTF8, UCS-2 LE, UCS-2 BE... They allow a character to be encoded over several bytes.
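To make "several bytes per character" concrete, the following sketch prints the byte length of a few one-character strings. It assumes the script is saved with UTF8 encoding; the sample characters and variable names are only illustrative.

```lua
-- Each entry is a single character; # reports how many bytes UTF8 uses for it.
local samples = { "e", "é", "€", "😀" }
local sizes = {}

for i, ch in ipairs(samples) do
  sizes[i] = #ch
  print(ch, sizes[i])
end
-- e   1   (ASCII letter)
-- é   2
-- €   3
-- 😀  4
```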
Multibyte characters with standard Lua
Multibyte characters can be stored in Lua strings after all, since strings in Lua contain bytes.
Yes, that's right. But it does not mean you can use them! In fact, some functions in Lua's string module won't work, as they expect to work on single-byte characters.
One rule to rule them all
Lua string functionalities (concatenation, length calculation, string.sub...) consider that strings contain only single-byte characters: the same rule again!
Here is an example that illustrates the problem when using standard Lua (the script must be saved with UTF8 encoding, with or without a BOM):
    local summer_infrench = "été"
    -- outputs 5 !?
    print(string.len(summer_infrench))
    -- pos = 3 !?
    local pos = string.find(summer_infrench, "t")
What's going on?
Remember the rule: strings are considered as byte containers. The UTF8 string "été" (which means "summer" in French) is 3 characters long, but occupies 5 bytes in memory:
| é | t | é | = 3 characters |
| C3 A9 | 74 | C3 A9 | = 5 bytes |
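The five bytes behind these three characters can be displayed from Lua itself. This is a minimal sketch using only the standard string library, assuming the script is saved with UTF8 encoding; the `hex` table is just a helper for this example.

```lua
local summer_infrench = "été"

-- Collect each byte of the string as a two-digit hexadecimal value.
local hex = {}
for i = 1, #summer_infrench do
  hex[#hex + 1] = string.format("%02X", string.byte(summer_infrench, i))
end

print(table.concat(hex, " "))  --> C3 A9 74 C3 A9
```

Each "é" takes two bytes (C3 A9), while "t" takes a single one (74).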
That's why the function string.len returns 5 and not 3.
The same goes for string.find: the byte position of the "t" character is 3.
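The same byte-based rule applies to string.sub, which is listed above among the affected functions. The sketch below, again assuming a UTF8-encoded script, suggests what happens when you ask for "the first character": you only get the first byte of "é".

```lua
local summer_infrench = "été"

-- string.sub works on byte positions, so this is only half of "é".
local first_byte = string.sub(summer_infrench, 1, 1)
print(#first_byte)            --> 1
print(first_byte == "é")      --> false

-- "é" occupies bytes 1 and 2, so both are needed to get the whole character.
local first_char = string.sub(summer_infrench, 1, 2)
print(first_char == "é")      --> true
```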
The standard "utf8" module
Fortunately, there is a workaround. Since Lua 5.3, a new utf8 module is available to help developers with UTF8 encoded strings. But it greatly complicates the use of UTF8 strings, as it relies on its own specific functions, a kind of overlay on top of strings. Not very friendly: in other modern programming languages, strings are containers for characters and natively support multibyte encodings.
Here is the previous example using the utf8 module:
    local utf8 = require "utf8"
    local summer_infrench = "été"
    -- yes ! outputs 3 !
    print(utf8.len(summer_infrench))
    -- Still pos = 3, no solution for string.find with UTF8 strings
    local pos = string.find(summer_infrench, "t")
But as you can see, this module is of no help with most of the string pattern matching functions.
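For plain (non-pattern) searches and for substring extraction, byte positions can be translated into character positions with the standard utf8 functions. The helpers below are only a sketch under stated assumptions: `ufind_plain` and `usub` are hypothetical names invented for this example, they require Lua 5.3 or later, and they expect valid UTF8 input and positive indices. They do nothing for pattern matching, which remains the real gap.

```lua
-- Hypothetical helper: plain-text search returning a character index.
local function ufind_plain(s, needle)
  -- The fourth argument (true) disables pattern matching.
  local byte_pos = string.find(s, needle, 1, true)
  if not byte_pos then return nil end
  -- utf8.len counts the characters that start between bytes 1 and byte_pos.
  return utf8.len(s, 1, byte_pos)
end

-- Hypothetical helper: extract characters i..j (positive indices only).
-- Returns nil when the requested range falls outside the string.
local function usub(s, i, j)
  local first = utf8.offset(s, i)       -- byte where character i starts
  local after = utf8.offset(s, j + 1)   -- byte where character j + 1 starts
  if not first or not after then return nil end
  return string.sub(s, first, after - 1)
end

local summer_infrench = "été"
print(ufind_plain(summer_infrench, "t"))  --> 2
print(usub(summer_infrench, 2, 3))        --> té
```

Even with such helpers, every string operation needs its own wrapper, which illustrates how much extra work the overlay approach requires.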
If you want to use UTF8 strings with standard Lua, you will have to depend on a binary module.
This is part of the Lua philosophy: if Lua lacks something, implement it using binary modules or Lua modules. Search the net and you will find some of them. But again, for such a simple functionality, this adds a certain degree of complication, especially for beginners.
Luart and multibyte character strings
Luart implements UTF8 strings natively, without any other dependencies: just use accented characters in your strings. All the modules of the runtime library support UTF8 strings.
The string module in Luart provides specific functions for UTF8 string manipulation, prefixed by u, covering tasks such as finding and extracting substrings, and pattern matching.
Here is the previous example with Luart:
    -- no need for the "utf8" module
    local summer_infrench = "été"
    -- yes ! outputs 3 !
    -- string.ulen() => string.len() for UTF8 strings
    print(string.ulen(summer_infrench))
    -- Yes, pos = 2 !
    -- string.ufind() => string.find() for UTF8 strings
    local pos = string.ufind(summer_infrench, "t")
    -- You can still get string length in bytes with the # operator
    -- bytelen = 5
    local bytelen = #summer_infrench
When you use strings to store binary data, you can use the Buffer object instead.