Prepare your codebase well and prevent localization defects that cost time, money, and credibility

TranslationLoft | Internationalization
Internationalization (i18n)


Be internationalization-ready


Internationalization (i18n) refers to the process of ensuring your codebase is ready for localization, i.e. that it is language- and locale-neutral.
Code-specific implementation varies across the different technology stacks, but the common objective is to externalize, externalize, externalize. There are also a number of other fundamental rules that can help ensure your software is internationalized by design. Read on to find out more.

Separate text from code


As we have seen, the first rule of internationalization is to extract all language assets from the code. All text that is displayed to a user, and that will therefore require translation, should be managed by way of resource files, and should never be referenced literally in your codebase. This is a fundamental tenet of internationalization and ensures that there is a clear distinction between code – which is the developer's domain – and language, which is the technical writer/localizer's domain. Ideally, this will have been the approach from the outset, but in the real world, it is more often the case that code is written without support for language and locale constraints until this becomes a concrete requirement.

The internationalization spec for your particular technology stack/development framework will provide specific implementation details, but general i18n best practice is based on the concept of an external resource file with a key-value pair structure. Each language element, such as a UI label or message, will have an entry in this file. Each entry in turn will have a key, which identifies the element, and a value, which contains the element content. Application code then performs a lookup using a particular key and displays the returned value. The clear advantage is that we can support any language with zero code impact. The code-based call will always remain the same. The resource file/content will change as required.
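The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not any framework's API: the `RESOURCES` dictionary stands in for per-language resource files, and the `translate` helper, the keys, and the fallback-to-English behaviour are all assumptions made for the example.

```python
# Hypothetical in-memory stand-in for per-language resource files.
# In a real project each dictionary would be loaded from its own file.
RESOURCES = {
    "en": {
        "login.button": "Sign in",
        "login.error": "Incorrect user name or password",
    },
    "de": {
        "login.button": "Anmelden",
        # "login.error" is intentionally left untranslated to show the fallback.
    },
}

DEFAULT_LANG = "en"


def translate(key: str, lang: str) -> str:
    """Return the text for `key` in `lang`, falling back to the default language."""
    entries = RESOURCES.get(lang, {})
    if key in entries:
        return entries[key]
    # Fall back to the default language; raises KeyError if the key is unknown.
    return RESOURCES[DEFAULT_LANG][key]


# The calling code is identical for every language; only the data differs.
print(translate("login.button", "en"))
print(translate("login.button", "de"))
print(translate("login.error", "de"))
```

Adding another language in this sketch means adding another dictionary (or file); the `translate` calls in the application stay unchanged.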

Avoid string concatenation


Developers have been known to indulge in concatenation to create text output, and while this practice won't have any negative impact where the language is English, it can lead to defects when content is localized. Consider the following user message and the (intentionally granular) resource file structure:
Select the gear icon to change your language preference settings

      {
        "lang": "EN",
        "keys": [
          {
            "key": "select.gears",
            "value": "Select the gear icon"
          },
          {
            "key": "to.change",
            "value": "to change your"
          },
          {
            "key": "lang.settings",
            "value": "language preference settings"
          }
        ]
      }
...

jsonObj.getString("select.gears") + " " + jsonObj.getString("to.change") + " " + jsonObj.getString("lang.settings");


This implementation assumes language-specific rules such as word order, rules that won't necessarily be shared by all target languages. This could mean that although a resource file has been translated correctly, the corresponding output on the screen is mangled. It also means that the individual resource file entries lack context and are difficult to translate, leading to incoherent text in the target language. It is always cleaner and safer to use complete sentences as implemented below:

      {
        "lang": "EN",
        "keys": [
          {
            "key": "user.settings.change_lang_prefs",
            "value": "Select the gear icon to change your language preference settings"
          }
        ]
      }
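Where a message has to include runtime values, named placeholders inside a complete sentence let each translation position them as its grammar requires. The Python sketch below assumes hypothetical keys and a hypothetical `render` helper; the German string is included only to illustrate a different word order.

```python
# Hypothetical resource entries: each value is a complete sentence
# with named placeholders rather than a set of fragments.
MESSAGES = {
    "en": {"list.remove_item": "Remove {item} from {list_name}?"},
    "de": {"list.remove_item": "{item} aus {list_name} entfernen?"},
}


def render(key: str, lang: str, **values: str) -> str:
    """Look up a full-sentence template and fill in its named placeholders."""
    return MESSAGES[lang][key].format(**values)


print(render("list.remove_item", "en", item="report.pdf", list_name="Favourites"))
print(render("list.remove_item", "de", item="report.pdf", list_name="Favoriten"))
```

The verb moves to the end of the German sentence without any change to the calling code, which concatenated fragments could not accommodate.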


Use UTF-8 encoding


Many languages use characters outside the basic Latin alphabet, such as French accented letters and German umlauts, and Unicode character encoding is a must to ensure that these display correctly. Using UTF-8 consistently throughout your stack avoids most character corruption. Many programming languages, databases, and application servers fall back to their own default encoding, so it's important to set the encoding of client files explicitly, both in HTML/CSS and on the HTTP server.
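To illustrate why relying on a default encoding is risky, the following Python sketch writes and reads a file with UTF-8 stated explicitly, then decodes the same bytes with a different encoding. The file name and sample text are made up for the example.

```python
import tempfile
from pathlib import Path

text = "Zürich café"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "messages_de.txt"  # hypothetical resource file

    # State the encoding explicitly instead of relying on the platform default.
    path.write_text(text, encoding="utf-8")
    round_trip = path.read_text(encoding="utf-8")

    # Decoding the same bytes as Latin-1 yields corrupted characters (mojibake).
    misread = path.read_bytes().decode("latin-1")

print(round_trip == text)  # the explicit UTF-8 round trip preserves the text
print(misread == text)     # the mismatched decoding does not
```

In the misread version each accented letter turns into two unrelated characters, which is the typical symptom of an encoding mismatch somewhere in the stack.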

HTML/CSS
Always declare the encoding of your HTML document using the meta element and charset attribute. The declaration needs to be included in the first 1024 bytes of the file, so it's important to place it immediately after the opening <head> tag. File-level declaration alone is not sufficient to ensure the correct encoding – always remember to actually save the file in UTF-8 format.

<!DOCTYPE html>
<html lang="en"> 
<head>
<meta charset="utf-8"/>
...


HTTP Server
The character encoding declared in the HTTP header takes precedence over the in-document meta declaration, so it is important to verify what encoding is used here, if any. The W3C offers a very useful utility for checking the HTTP header encoding of any document. If you have access to your server settings, you may need to change the encoding for the HTTP header content type so that it matches your file declarations.

Content-Type: text/html; charset=utf-8
...
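One way to confirm that the header and the document agree is to compare the charset in the `Content-Type` value with the one declared in the markup. The sketch below uses Python's standard `email.message.Message` class to parse the header value; the `header_charset` helper and the sample values are illustrative only.

```python
from email.message import Message
from typing import Optional


def header_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # Returns the charset in lower case, or None if the header has none.
    return msg.get_content_charset()


declared_in_html = "utf-8"  # value of <meta charset="..."> in the document

print(header_charset("text/html; charset=utf-8") == declared_in_html)
print(header_charset("text/html; charset=ISO-8859-1"))
print(header_charset("text/html"))
```

A mismatch, or a missing charset in the header, is a signal to review the server configuration.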



Currencies, Symbols, Formats


Localization involves a lot more than just word-for-word translation of individual terms. It also involves the adaptation of locale-specific elements such as date and time, currency, and units of measure, all of which vary across locales. Hard-coded formats will therefore cause problems, which is why as much of this content as possible should be externalized to ensure glitch-free localization. The examples below are just a few of the possible variations for dates, currencies, and quotation marks that have to be catered for during the internationalization process.

en-gb: 26/03/2016 
en-us: 03/26/2016 
de-de: 26.03.2016

en-gb: €100,000.99
fr-fr: 100 000,99 €
de-de: 100.000,99 €

en-gb: "Look Ma - I'm in quotes"
de-de: „Look Ma - quotes are different in German“
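The idea of externalizing formats can be sketched as a lookup table keyed by locale. The Python example below is hand-rolled purely for illustration: the `LOCALE_RULES` table and both helper functions are hypothetical, and a real application would normally take these rules from a library backed by locale data rather than maintain them by hand.

```python
from datetime import date
from decimal import Decimal

# Hypothetical, hand-written locale table for illustration only.
LOCALE_RULES = {
    "en-gb": {"date": "%d/%m/%Y", "group": ",", "decimal": ".", "currency": "{symbol}{amount}"},
    "en-us": {"date": "%m/%d/%Y", "group": ",", "decimal": ".", "currency": "{symbol}{amount}"},
    "de-de": {"date": "%d.%m.%Y", "group": ".", "decimal": ",", "currency": "{amount} {symbol}"},
}


def format_date(value: date, locale: str) -> str:
    """Format a date using the pattern stored for the locale."""
    return value.strftime(LOCALE_RULES[locale]["date"])


def format_currency(amount: Decimal, symbol: str, locale: str) -> str:
    """Format an amount with the locale's separators and symbol position."""
    rules = LOCALE_RULES[locale]
    plain = f"{amount:,.2f}"  # comma for grouping, full stop for decimals
    # Swap both separators in one pass so they cannot overwrite each other.
    swapped = plain.translate(str.maketrans({",": rules["group"], ".": rules["decimal"]}))
    return rules["currency"].format(symbol=symbol, amount=swapped)


day = date(2016, 3, 26)
for locale in LOCALE_RULES:
    print(locale, format_date(day, locale))
```

Because the application code only ever calls `format_date` and `format_currency`, supporting a further locale in this sketch means adding a row to the table rather than touching the call sites.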

Annotate for context


Often, the same word can have different meanings and it's only when context is added that the correct meaning becomes clear. To take a concrete example, consider the word "Contact". In English, "contact" could be a verb or a noun. A CRM application might use the word “contact” as a noun on a UI label to designate a contact person, and the same word as a verb on a UI button to mean “get in touch”. In most languages, these are two distinct words, and given that translators could be restricted to working with a standalone resource file, accurate translation without an indication of context can be difficult. Adding some annotation to the resource file is a simple but very effective way of ensuring that the end translator is crystal clear about the meaning of each string:

      {
        "lang": "EN",
        "keys": [
          {
            "key": "contact",
            "value": "Contact",
            "notes": "Verb/Used as button label"
          }
        ]
      }
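A small check can help keep annotations from being forgotten. The Python sketch below assumes the resource structure shown above, with hypothetical keys that separate the noun and verb senses of "Contact", and a hypothetical `missing_notes` helper that lists entries lacking a translator note.

```python
import json

# Same structure as the resource file above; keys and notes are illustrative.
RESOURCE_JSON = """
{
  "lang": "EN",
  "keys": [
    {"key": "contact.label", "value": "Contact", "notes": "Noun/Used as field label"},
    {"key": "contact.button", "value": "Contact", "notes": "Verb/Used as button label"},
    {"key": "contact.title", "value": "Contacts"}
  ]
}
"""


def missing_notes(resource: dict) -> list:
    """Return the keys of entries that have no translator note."""
    return [e["key"] for e in resource["keys"] if not e.get("notes", "").strip()]


print(missing_notes(json.loads(RESOURCE_JSON)))
```

Giving the noun and the verb separate keys also lets translators supply two different words where the target language requires them.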

Design for greater content length


English is, comparatively speaking, a concise language. Relative to some other languages, its words, phrases, and sentences are short and require fewer characters. The impact of this on localization is that text-based UI components can lengthen substantially and distort the UI layout as a result. To avoid having to redesign or adjust your UI at the localization stage, factor text expansion into the design of your UI and code from the start.
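Text expansion can be tested before any translation exists by padding the source strings, a technique often called pseudo-localization. The Python sketch below is one simple way to do it; the `pseudo_expand` helper and the 40% expansion factor are assumptions chosen for illustration, not figures from a standard.

```python
import math


def pseudo_expand(text: str, factor: float = 0.4) -> str:
    """Pad `text` so it is roughly `factor` longer, wrapped in visible markers."""
    extra = math.ceil(len(text) * factor)
    # Brackets make clipped or truncated strings easy to spot in the UI.
    return "[" + text + "~" * extra + "]"


label = "Save changes"
expanded = pseudo_expand(label)
print(expanded)
print(len(label), len(expanded))
```

Running the UI with every string passed through a helper like this shows quickly which buttons, labels, and table columns cannot accommodate longer text.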

Avoid images with text


If you have images with embedded text in your application such as buttons, consider refactoring to use CSS (Cascading Style Sheets) where possible. The objective of internationalization is to have as much, if not all, of your language content in a single external file. Exceptions to this basic rule, such as images that have to be edited outside the main resource file, are most at risk of slipping through the i18n net.