Advice on Internationalization

Assorted Advice for the FOSS Community, and a Short History Lesson on Languages!

Months ago I tried contributing some traditional Chinese translations to a random project out there. I don't even remember which project it was, to be honest, but I remember that my translations ended up getting rejected because the maintainers decided they would just "convert" simplified Chinese into traditional Chinese directly instead.

Well that clearly shows that 1. they don't actually understand Chinese, and 2. they don't understand Chinese localization.


I kind of assumed everyone reading this either already knows the basics and isn't interested in the history, but just in case:
(the important stuff is right after the next horizontal divider, under the heading "i18n is a Mess!")

What are i18n and l10n?

From Wikipedia:

In computing, internationalization and localization (American) or internationalisation and localisation (British), often abbreviated i18n and l10n respectively, are means of adapting computer software to different languages, regional peculiarities and technical requirements of a target locale.

Internationalization is the process of designing a software application so that it can be adapted to various languages and regions without engineering changes. Localization is the process of adapting internationalized software for a specific region or language by translating text and adding locale-specific components.

Localization (which is potentially performed multiple times, for different locales) uses the infrastructure or flexibility provided by internationalization (which is ideally performed only once before localization, or as an integral part of ongoing development).
💡
We call internationalization and localization by shorter names, i18n and l10n respectively. The numbers represent the number of letters between the first and last letter of the word.

Why Localize?

At Fyra Labs and in the open source community in general, we believe software should be free, free as in libre. This means we believe all people around the globe should be able to use software freely, without discrimination or difficulty, including language barriers. I mean, of course you can't expect a random 90-year-old grandma to be able to use the command line freely, but it's a very sensible expectation that software should be translated into different languages so that it is accessible to everyone.

First: The Obvious Problem with CJKV

Chinese and Japanese are both pretty complex languages, in the sense that they have a huge number of distinct characters that are hard to store and represent digitally, which led Unicode to include around (more than?) 100,000 of these characters (the CJK Unified Ideographs plus the ones that were not unified).

Apparently Japanese was originally a language without any writing system whatsoever, so they ended up borrowing the one from China. Right. So creative. Now they're stuck with this forever. Uhh… right, we are stuck with this forever too. I forgot I'm from Hong Kong for a moment (sigh).

Even though Vietnamese is now written in the Latin script (with diacritics, tone marks, and so on), it was historically written in those characters too, known as 𡨸喃 (Chữ Nôm), so a portion of those 100,000 characters are actually just Chữ Nôm, like 㗂. (碎空吶越!) These characters are… practically never used, but Unicode ended up including them anyway, because sure, why not… "historical reasons"! ;)

The same goes for Korean too. Unfortunately I never dug into this rabbit hole, so no story for ya!

Second: Character Styles

The second obvious problem is that China decided the original Chinese characters they had been using were too hard, so they stopped using them. It kind of makes sense, because the economy was so bad at the time, due to constant war, that most people simply didn't know how to write. Their solution at the time was to start writing Chinese in the Latin alphabet.

No, I'm just kidding. They of course ended up simplifying the commonly-used characters so it's easier for people to learn. And of course they're now stuck with this forever.

Since the other parts of the world that use Chinese (Hong Kong, Macau, Taiwan, etc.) were basically not affected, traditional Chinese still exists as is. So obviously, when Unicode came around, they also had to pick up all these characters and include both sets.

And of course they didn't do this correctly. How dare you expect that!

It turns out a lot of characters are not written the same way across regions. Some of these problematic characters share the same code point despite the regional differences in shape (返, 拔), while some got separate code points (啓/啟). Fonts from different regions will therefore show different styles of these glyphs. The worst part is that sometimes you will encounter a font that doesn't implement a character at all (產/産) because the font comes from a region that doesn't use that written form. That specific character will then be rendered with another font and look off compared to the surrounding characters. Imagine reading an English article set in Times New Roman where only the letter e is displayed in Noto Sans. Horrible!

Well I'm not saying Unicode was wrong, you can't really blame this on Unicode… I think they did their best to be honest. It's complicated.


i18n is a Mess!

I think it's very clear to every one of us that Chinese, Japanese, Korean and Vietnamese are very, very different languages. They are quite literally mutually unintelligible. It's the same situation as with the Latin script: knowing English doesn't mean you just magically know other languages that use the same (or a similar) writing system.

Well, plot twist! The above statement is false, because Chinese is not a language, it is a language family. There are numerous different languages that belong to the Chinese family, all of which are, of course, mutually unintelligible (sigh). The most popular one is Mandarin Chinese. It is the basis of the official written language used across China and most if not all Chinese-speaking regions. It is also the sole official spoken Chinese variety in mainland China, Taiwan (with some regional differences) and Singapore (with, of course, some regional differences).

In Hong Kong, there are 3 official spoken languages, namely Mandarin Chinese, Yue Chinese (one of its varieties being Cantonese), and English. In Macau, they are instead Mandarin Chinese, Yue Chinese and Portuguese. We still write Mandarin-based Chinese in traditional characters, though.

"Regional Differences"

So I'm pretty sure many of you are wondering how large these regional differences are. Let me give you some examples (the mainland column shows simplified/traditional; ← means same as the cell to the left):

English          Mainland     Hong Kong   Taiwan
(Browser) tabs   标签/標籤    分頁        ←
Digital          数字/數字    數碼        數位
Software         软件/軟件    ←           軟體
User account     账户/賬戶    ←           使用者

Sources: Firefox translations and Wikipedia

Well obviously there are some regional differences, but are regional vocabularies understandable for a speaker from another region? Yes, but:

  • They'll know it's not written by a native
  • Some of them will have a hard time understanding it
  • Most of them will think it's weird, and may be somewhat confused and annoyed

Auto-Conversion between Traditional and Simplified

There are programs out there that support conversion between traditional and simplified Chinese characters. These programs can be classified into roughly 3 types:

  1. Character-to-character conversion. This is often the worst, because one simplified character might correspond to multiple traditional ones (and the same goes in reverse). Examples include 钱锺(not 钟)书, 頭髮(not 發), etc. (see the sketch after this list).
  2. Word-to-word conversion. This is a little bit better than character-to-character conversion (#1), but if there's a mistake, you are screwed. This is the conversion used in Mandarin Wikipedia. Unlike #1, this conversion can account for specific regional differences (like the user account example above).
    Example (from Wikipedia): suppose there is a conversion rule (simplified) 内存 ⇒ (traditional) 記憶體 (both meaning RAM). Consider the following sentence in simplified:
    人体*内存*在很多微生物 (inside a human, there exist many microorganisms)
    Since Mandarin doesn't have spaces for splitting words apart, the conversion system might treat 内存 (inside + exist) as a word when in reality it is actually 2 separate words. This might result in gibberish conversion output:
    人體*記憶體*在很多微生物 (human RAM many microorganisms…?)
  3. Smarter and more accurate conversions that also take care of regional differences. Good luck finding any online tools for this. If you want to achieve this level, maybe it's easier to just find a native speaker.
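
To make the problem with approach #1 concrete, here is a tiny sketch in TypeScript. The mapping table is just a toy example, not a real conversion table: a character-level table has no way to know whether simplified 发 should become 發 ("to emit") or 髮 ("hair"), so it inevitably gets one of the two words below wrong.

// Toy character-to-character table (approach #1 above).
// Word-aware converters (approach #2) exist precisely because this cannot work.
const CHAR_TABLE: Record<string, string> = {
  '头': '頭',
  '出': '出',
  '发': '發', // but 发 also corresponds to 髮 (hair)…
};

function naiveConvert(simplified: string): string {
  return [...simplified].map((ch) => CHAR_TABLE[ch] ?? ch).join('');
}

console.log(naiveConvert('出发')); // 出發 — correct ("to set off")
console.log(naiveConvert('头发')); // 頭發 — wrong, should be 頭髮 ("hair")
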
💡
It is usually a good idea to at least split Mandarin Chinese into zh-Hans (simplified Chinese) and zh-Hant (traditional Chinese) when asking for translations on platforms like Weblate; avoid automatically converting from one to another.
But it is even better to account for different regions (zh-Hant-HK, zh-Hans-SG, etc.).

Language Fallbacks

Suppose you are creating a webpage that supports English and different Arabic varieties. You declared en-US, arb (Modern Standard Arabic), and ary (Moroccan Arabic) in your i18n module. When a Moroccan Arabic speaker uses your website, it outputs text in Moroccan Arabic. Great! Now imagine that some strings are untranslated in Moroccan Arabic. Would you prefer falling back to en-US or arb? The obvious answer is arb, but there are many i18n modules out there that:

  • don't do that by default;
  • straight up hard-code the fallback to en-US;
  • don't support other fallbacks; and/or
  • don't support multiple fallback languages based on the language preferences provided by the browser.

It is undesirable to fall back to English when the user wants a variety of Arabic, but at the same time falling back to Arabic for a Mandarin user doesn't make sense either.

In my opinion, this is one of the biggest flaws in many i18n libraries/modules out there. A good i18n library should be able to fall back to the correct language for any given language with (partially) missing translations. Not having this kind of logic built into the library puts more burden on software maintainers and translators.

⚠️
I urge all i18n library/module developers to add support for language fallbacks. When there are missing translations, the fallback for that specific language should be used.

Common fallback chains include:

  • Arabic (all variants → arb → ar-AE)
  • Chinese (zh-Hant-* → zh-Hant → zh-Hant-TW/zh-TW; zh-Hans-SG → zh-Hans-CN/zh-CN; yue-*/other Chinese languages → zh-Hant)
  • English (en-* → en-US)
  • Spanish (es-*/spa-* → es-ES)
    • It might also make sense to fall back to Portuguese (and the other way around)
    • Exception: don't fall back pt-BR to pt-PT, because they are only partly mutually intelligible.
  • etc.

It is usually a good idea for language variants (xx-*) to fall back to the base language code (xx). And finally, all languages should fall back to English as a last resort.
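
To illustrate, here is a minimal sketch in TypeScript of what this kind of fallback resolution could look like. The fallback table, the catalogue shape and the translate function are all made up for this example and deliberately simplistic:

// Illustrative fallback table; a real library would want a much fuller one.
const FALLBACKS: Record<string, string[]> = {
  'ary': ['arb', 'ar'],               // Moroccan Arabic → MSA → Arabic
  'zh-Hant-HK': ['zh-Hant', 'zh-TW'], // traditional Chinese chain
  'zh-Hans-SG': ['zh-Hans', 'zh-CN'],
  'yue': ['zh-Hant'],
  'pt-BR': ['pt'],                    // but not pt-PT, as noted above
};

type Messages = Record<string, string>;

function translate(key: string, locale: string, catalogs: Record<string, Messages>): string {
  // Try the requested locale, its configured fallbacks, the base language
  // code (xx from xx-YY), and finally English as a last resort.
  const base = locale.split('-')[0];
  const chain = [locale, ...(FALLBACKS[locale] ?? []), base, 'en'];
  for (const tag of chain) {
    const message = catalogs[tag]?.[key];
    if (message !== undefined) return message;
  }
  return key; // nothing found anywhere; show the key itself
}

// A Moroccan Arabic user gets Modern Standard Arabic, not English,
// for strings missing from the ary catalogue:
const catalogs = { en: { greeting: 'Hello' }, arb: { greeting: 'مرحبا' }, ary: {} };
console.log(translate('greeting', 'ary', catalogs)); // → 'مرحبا'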

Fonts

N.B.: This section mostly applies to websites.

Fonts are also pretty important, because you want to make sure text is displayed correctly for different regional languages. This is usually a problem with Asian text, which requires fonts that support CJKV characters. For example, zh-CN should use Noto Sans CJK SC, zh-TW should use Noto Sans CJK TW, zh-HK should use Noto Sans CJK HK, and Japanese ja-JP should use Noto Sans CJK JP.

It is also a very good idea to supply a list of fonts instead of just a single one. This can easily be done inside CSS. For example:

body {
  font-family: Arial, 'Noto Sans CJK JP',
               'Noto Sans', HanaMinA, sans-serif;
}

(HanaMinA (Hanazono Mincho) covers all CJK Unified and Compatibility Ideographs up to Extension F and IVD 2017-12-12. This is probably overpowered for most cases.)

Remember to list the fonts that do not cover Asian text first, then gradually include fonts with larger glyph coverage. The renderer walks the list in order for each character, so the fonts at the front get priority for the glyphs they do cover, while the broader CJK fonts only kick in for the characters the earlier fonts lack.

Sometimes you may also need to set the correct locale for the font settings to take effect. For example in LibreOffice, the variant glyphs will only be shown once the user sets the text language to the correct one (zh-Hant-HK), even if the correct font (Noto Sans CJK HK) is applied.

Monospace Fonts

If for some reason you need to show different languages in a monospace font, you will instantly notice that the spacing gets messed up:

一隻字
abcdef

In this case, you probably want to provide the correct monospace fonts for specific languages (e.g. Sarasa Fixed).

For reference, my Neovide font configuration is JetBrainsMono Nerd Font,CaskaydiaCove Nerd Font Mono,Sarasa Mono Slab TC.

For other languages, fonts like Noto Sans Mono should suffice.

Input Fields that Decline IME Inputs

Some websites / applications modify the behaviour of their input fields, which may leave multilingual users and virtual-keyboard users unable to input text (e.g. through their IME, or by applying auto-suggestions).

The solution is to… avoid doing that unless you have a very good reason. ;)

⚠️
Do not modify the text or manipulate the cursor position when dealing with text input. You should only use the read-only methods provided by the toolkit to obtain the text inside an input field.

Some IMEs require the use of Enter, Tab, Backspace, etc. to apply different operations to the pre-edit text. Text fields in websites / applications should therefore never intercept raw keyboard input (e.g. via keydown events) without extra precautions.

If your application requires the use of keydown, you may read more about using keydown without messing up IME composition here.
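
For the web specifically, here is a minimal sketch in TypeScript of one such precaution; the element selector and the submitComment function are hypothetical, made up for this example. The idea is to ignore key events that are part of an ongoing IME composition, so that confirming a candidate with Enter doesn't also trigger the field's own Enter behaviour:

// Hypothetical comment box; the selector is just for this example.
const input = document.querySelector<HTMLInputElement>('#comment')!;

input.addEventListener('keydown', (event: KeyboardEvent) => {
  // While an IME composition is in progress, Enter/Tab/Backspace operate on
  // the pre-edit text, not on the field. `isComposing` is true during
  // composition; keyCode 229 covers some older browser/IME combinations.
  if (event.isComposing || event.keyCode === 229) {
    return; // let the IME handle the key
  }

  if (event.key === 'Enter') {
    event.preventDefault();
    submitComment(input.value); // hypothetical submit handler
  }
});

function submitComment(text: string): void {
  console.log('submitting:', text);
}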

Word Count

Let's say you are making a blogging system and have decided to count the number of words in each post. Well, one way to do that is of course to count the number of spaces, but there are numerous languages (Thai, Chinese, Japanese) that do not use spaces at all.

The "solution" is to of course also include the count for number of characters. For the above languages, word count does not make sense at all, so character count should be used instead.

Basically, comparing character counts between posts in different languages doesn't make sense for numerous reasons (vocabulary lengths differ between Latin-script languages too), so these counts are pretty much only useful for comparisons within the same language.

By the way, if you are trying to limit the number of words / characters in a post, maybe you should consider counting in bytes instead. Just saying.
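
For the web, here is a minimal sketch in TypeScript of what counting could look like, assuming a runtime that ships Intl.Segmenter (modern browsers and recent Node versions); the function is illustrative, not a complete solution:

// Count words, characters (grapheme clusters) and UTF-8 bytes for a post.
function countPost(text: string, locale: string) {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  // For languages without spaces (Thai, Chinese, Japanese, …) the segmenter
  // uses dictionary-based word boundaries, so this is only a best effort.
  const words = [...segmenter.segment(text)].filter((s) => s.isWordLike).length;
  const characters = [...new Intl.Segmenter(locale, { granularity: 'grapheme' }).segment(text)].length;
  const bytes = new TextEncoder().encode(text).length;
  return { words, characters, bytes };
}

console.log(countPost('Hello i18n world!', 'en'));
console.log(countPost('今天天氣真好', 'zh-Hant-HK'));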


Conclusions

There are many things that need to be taken into account when dealing with translations. All languages come with their own corresponding cultures, and there are lots of people on Earth who simply do not speak English. It is therefore crucial to improve the i18n/l10n process in order to make it easier to accommodate everyone.