Unicode Strings

August 18th 2025

Tags: C, JMPL, Programming Language

For the past week I've been working on a refactor to the internals of strings in JMPL. As this hasn't changed much on the front-end, I thought I'd do a more technical report on the refactor for this week's devlog.

What was wrong?

Initially, strings in JMPL were stored as a byte array object. This is fine for ASCII text, but as JMPL documents are encoded with UTF-8, this causes errors for non-ASCII characters. The main issues arise when indexing or getting the size of strings. Take this example:

let str = "I♥C"

This string contains 3 Unicode characters (each identified by a code point, written U+XXXX), but in UTF-8 it is encoded as 5 bytes:

I // Unicode: U+0049, UTF-8: 0x49 or 01001001

♥ // Unicode: U+2665, UTF-8: 0xE299A5 or 11100010 10011001 10100101

C // Unicode: U+0043, UTF-8: 0x43 or 01000011

01001001 11100010 10011001 10100101 01000011

If you were to get the length of this string, you'd get 5, which is the number of bytes rather than the number of characters. Similarly, if you tried to get the character at the third position, you would get an invalid character, because the third byte (0x99) is only the middle of the encoding of ♥.
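To make the mismatch concrete, here is a minimal sketch in C. The helper name utf8_count_code_points is hypothetical (it is not taken from the JMPL source), and it assumes the input is valid, NUL-terminated UTF-8. It counts characters by skipping continuation bytes, which always have the form 10xxxxxx.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the code points in a valid, NUL-terminated
   UTF-8 string by counting every byte that is NOT a continuation byte
   (continuation bytes look like 10xxxxxx). */
static size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For the string above, written as "I\xE2\x99\xA5" "C" in C, strlen reports 5 bytes while this helper reports 3 characters. Counting this way still costs a full scan of the string on every call, which is part of why storing the code points is attractive.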

How was this fixed?

The fix is to store each string's Unicode code points alongside its bytes. This way, you can easily index or get the length of the string using the code point array instead of the byte array. Inspired by Python's Unicode object, I did just that, with a few optimisations.

Most of the time, especially in English-speaking countries, ASCII text is enough. A Unicode code point (written in hexadecimal) can range from U+0000 to U+10FFFF, with ASCII characters taking up the range U+0000 - U+007F. The largest code point needs 21 bits, so a single fixed-width array would need 4 bytes per code point, wasting 3 of them on every character of ASCII text.

The trick here is to store 3 arrays as a union: one for when all characters are in the range U+0000 - U+00FF, which take 1 byte each; one for when all characters are in the range U+0000 - U+FFFF, which take up to 2 bytes each; and one that uses 4 bytes per character to accommodate any code point.
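As a rough sketch of what such a layout could look like in C: the struct, field and function names below are all assumptions made for illustration, not the actual JMPL definitions. A width field records which member of the union is in use.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical string object: the UTF-8 bytes plus one code point array.
   Only one member of the union is valid at a time, chosen by `width`. */
typedef struct {
    char *utf8;          /* original UTF-8 bytes, used for printing */
    size_t byte_length;  /* number of UTF-8 bytes */
    size_t length;       /* number of code points */
    uint8_t width;       /* bytes per stored code point: 1, 2 or 4 */
    union {
        uint8_t *ucs1;   /* all code points <= U+00FF */
        uint16_t *ucs2;  /* all code points <= U+FFFF */
        uint32_t *ucs4;  /* any code point up to U+10FFFF */
    } code_points;
} UnicodeString;

/* Read the code point at index i. The caller must ensure i < s->length. */
static uint32_t unicode_string_char_at(const UnicodeString *s, size_t i)
{
    switch (s->width) {
    case 1:  return s->code_points.ucs1[i];
    case 2:  return s->code_points.ucs2[i];
    default: return s->code_points.ucs4[i];
    }
}
```

With a layout like this, an ASCII-only string pays 1 extra byte per character for its code point array, and only strings containing characters above U+FFFF pay the full 4 bytes.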

Now, to print the strings, we must have the UTF-8 encoded bytes. This is not a problem: the JMPL file containing the string is already UTF-8 encoded, so we can copy the bytes directly into the string object. The real challenge is decoding the UTF-8 bytes into code points.

When decoding the string, we first have to work out its kind so we know which array to store the code points in. The kinds are 1-byte, 2-byte, 4-byte, and a special one for ASCII text. To work out a string's kind, we use the header of each character. In UTF-8, the leading bits of a character's first byte act as a header that tells the decoder how many bytes make up the character. If we scan through the string and read each header, the character with the largest header determines the kind of the string.

In practice, the minimum size of a code point does not always line up with the number of bytes in its UTF-8 encoding, so I had to work out the exact first-byte values at which the code point size changes.

Characters U+0000 - U+007F (ASCII) are encoded with 1 byte and have a 1 byte code point.

The header of all characters in this range is < 0x80.

Characters U+0080 - U+00FF are encoded with 2 bytes and have a 1 byte code point.

The header of all characters in this range is < 0xC4.

Characters U+0100 - U+07FF are encoded with 2 bytes and have a 2 byte code point.

The header of all characters in this range is < 0xE0.

Characters U+0800 - U+FFFF are encoded with 3 bytes and have a 2 byte code point.

The header of all characters in this range is <= 0xEF.

Characters U+10000 - U+10FFFF are encoded with 4 bytes and have a 4 byte code point.

The header of all characters in this range is > 0xEF.
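The thresholds above can be turned into a single scan over the bytes. The following is a minimal sketch under stated assumptions: the StringKind enum and the function utf8_string_kind are hypothetical names, the input is assumed to be valid UTF-8, and continuation bytes (10xxxxxx) are skipped because they are not headers.

```c
#include <stddef.h>

/* Hypothetical kinds, ordered so that a larger value means wider storage. */
typedef enum { KIND_ASCII, KIND_1_BYTE, KIND_2_BYTE, KIND_4_BYTE } StringKind;

/* Scan n bytes of valid UTF-8 and return the widest kind required. */
static StringKind utf8_string_kind(const unsigned char *bytes, size_t n)
{
    StringKind kind = KIND_ASCII;
    for (size_t i = 0; i < n; i++) {
        unsigned char b = bytes[i];
        StringKind k;
        if ((b & 0xC0) == 0x80) {
            continue;               /* continuation byte, not a header */
        }
        if (b < 0x80) {
            k = KIND_ASCII;         /* U+0000 - U+007F */
        } else if (b < 0xC4) {
            k = KIND_1_BYTE;        /* U+0080 - U+00FF */
        } else if (b <= 0xEF) {
            k = KIND_2_BYTE;        /* U+0100 - U+FFFF */
        } else {
            k = KIND_4_BYTE;        /* U+10000 - U+10FFFF */
        }
        if (k > kind) {
            kind = k;
        }
    }
    return kind;
}
```

For "I♥C" the widest header is 0xE2, so the whole string would be stored with 2 bytes per code point. A real implementation could stop scanning as soon as it sees a 4-byte header, since the kind cannot get any wider.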

Using this information, the kind of a code point array can be found. The next step is to translate each UTF-8 encoded character to a code point. As I couldn't find any information on the algorithm to do this, my (imperfect) algorithm can be found here. It uses the notation described by the Wikipedia page on UTF-8 found here. It works by generating each byte of the code point (stored as a uint8_t) individually and OR-ing them together to create the final code point. For smaller sizes of code point, the result can be cast down to a uint8_t or uint16_t.
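For readers who want to see the general idea, here is a sketch of the conventional mask-and-shift way of decoding a single character. It is not the algorithm linked above: the function name utf8_decode is hypothetical, and it assumes the input is valid UTF-8 with no truncated or overlong sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from valid UTF-8 starting at s.
   Stores the code point in *out and returns the number of bytes consumed. */
static size_t utf8_decode(const unsigned char *s, uint32_t *out)
{
    if (s[0] < 0x80) {              /* 0xxxxxxx */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0) {    /* 110xxxxx 10xxxxxx */
        *out = ((uint32_t)(s[0] & 0x1F) << 6)
             |  (uint32_t)(s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {    /* 1110xxxx 10xxxxxx 10xxxxxx */
        *out = ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
        return 3;
    }
    /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    *out = ((uint32_t)(s[0] & 0x07) << 18)
         | ((uint32_t)(s[1] & 0x3F) << 12)
         | ((uint32_t)(s[2] & 0x3F) << 6)
         |  (uint32_t)(s[3] & 0x3F);
    return 4;
}
```

Decoding the three bytes 0xE2 0x99 0xA5 this way gives U+2665, the heart from the earlier example. Calling the function in a loop and advancing by the returned byte count fills the code point array, with each result cast down to uint8_t or uint16_t when the string's kind allows it.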

Now whenever details such as a character at an index or the length of the string need to be found, the code point array can be used.

JMPL Devlog #7
