Locale Fun

Peter and I have been updating the Tandem Exchange site translations over the past few weeks, and we can now announce support for Arabic, Japanese, Korean, Simplified Chinese, and Traditional Chinese as first-class UI languages on the website, in addition to the original German, Spanish, French, Italian, Portuguese, and Russian translations.

There was a moment on Monday when I let slip that the website was starting to take on the feel of a “real” website, and that is true. We now have 12 full translations of the website, which Google can grok for each of its region-specific search indices, and which we can serve to a wide swath of users around the world.

One of the weirder things we learned about multiple-language support along the way is that the HTTP Accept-Language header formats a user’s UI language preference one way, GNU gettext specifies the same information another way, the Unicode Consortium a third way, and Facebook, Google, and Twitter each in their own way. And then there is how Django chooses to use that information when setting the site’s presentation language and looking up the correct gettext catalogs.

HTTP

The HTTP Accept-Language header usually looks something like: “Accept-Language: en-US,en;q=0.8” where the language codes are specified by a massive, long-winded standard called BCP47. The standard defines a language code to look something like en-US or zh-CN or zh-TW, but the standard doesn’t really bother giving a list of common language codes, leaving that instead to the combinatorial explosion that results from mixing together one entry from each of the categories of the IANA Subtag Registry.
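To make the header format concrete, here is a minimal Python sketch that splits an Accept-Language value into language tags and their q-values. The function name parse_accept_language is made up for illustration, and the parsing is deliberately simplified (it only looks at a single q= parameter); Django does its own parsing internally, and this is not that code.

```python
def parse_accept_language(header):
    """Return (language_tag, quality) pairs, highest quality first."""
    entries = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        quality = 1.0  # a tag without an explicit q-value defaults to 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # assumption: treat a malformed q-value as unwanted
        entries.append((tag.strip(), quality))
    # sorted() is stable, so equally weighted tags keep their header order
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


print(parse_accept_language("en-US,en;q=0.8"))
# [('en-US', 1.0), ('en', 0.8)]
```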

Django

In the Django settings.py file, you specify the list of languages that the website is supposed to support, and you do this using language-code strings that look like this, where the first entry in the tuple looks roughly like BCP47:

LANGUAGES = (
    ('ar', 'العربية'),
    ('de', 'Deutsch'),
    ('en', 'English'),
    ('es', 'Español'),
    ('fr', 'Français'),
    ('it', 'italiano'),
    ('ja', '日本語'),
    ('ko', '한국어'),
    ('pt', 'português'),
    ('ru', 'ру́сский'),
    ('zh-cn', '简体中文'),
    ('zh-tw', '繁體中文'),
)

Note, however, that the language-code is strictly lowercase and uses a hyphen as a separator.

gettext

When building the translation catalogs for Django, you have to use gettext’s locale-naming format, which looks something like en_US or zh_CN.

So when you’re looking at your app’s locale/ subfolder, it will end up containing directories looking something like:

$ ls locale
ar  de  en  es  fr  it  ja  ko  pt  ru  zh_CN  zh_TW

Note the difference in the separator (hyphens vs. underscores) and the region code (lower vs. upper case). On a case-sensitive filesystem, you have to get this exactly right.
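Django ships a helper for this conversion, django.utils.translation.to_locale. The standalone Python sketch below only illustrates the rule and is not Django’s implementation: to_gettext_locale is a hypothetical name, and it handles just a bare language code or a language code plus one subtag.

```python
def to_gettext_locale(language_code):
    """Convert a Django-style code such as 'zh-cn' to a gettext name like 'zh_CN'."""
    language, _, subtag = language_code.partition("-")
    if not subtag:
        return language.lower()
    if len(subtag) == 4:
        # Four letters: treat it as a script subtag such as "Hans"
        subtag = subtag.title()
    else:
        # Otherwise treat it as a region code such as "CN"
        subtag = subtag.upper()
    return language.lower() + "_" + subtag


for code in ("de", "zh-cn", "zh-tw"):
    print(code, "->", to_gettext_locale(code))
```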

Unicode Consortium CLDR

But let’s say you’re also using the Unicode Consortium’s Common Locale Data Repository to generate some information instead of having to write it all out yourself? They use a slightly different set of language/region/locale identifiers, which are generally sensible but in the case of the Chinese languages, use zh_Hans and zh_Hant as parts of the filenames and language identifiers:

ee_TG.xml       fr_CF.xml        kw_GB.xml       sah_RU.xml       yo.xml
ee.xml          fr_CG.xml        kw.xml          sah.xml          zh_Hans_CN.xml
el_CY.xml       fr_CH.xml        ky_KG.xml       saq_KE.xml       zh_Hans_HK.xml
el_GR.xml       fr_CI.xml        ky.xml          saq.xml          zh_Hans_MO.xml
el.xml          fr_CM.xml        lag_TZ.xml      sbp_TZ.xml       zh_Hans_SG.xml
en_150.xml      fr_DJ.xml        lag.xml         sbp.xml          zh_Hans.xml
en_AG.xml       fr_DZ.xml        lg_UG.xml       se_FI.xml        zh_Hant_HK.xml
en_AS.xml       fr_FR.xml        lg.xml          seh_MZ.xml       zh_Hant_MO.xml
en_AU.xml       fr_GA.xml        ln_AO.xml       seh.xml          zh_Hant_TW.xml
en_BB.xml       fr_GF.xml        ln_CD.xml       se_NO.xml        zh_Hant.xml
en_BE.xml       fr_GN.xml        ln_CF.xml       ses_ML.xml       zh.xml
en_BM.xml       fr_GP.xml        ln_CG.xml       ses.xml          zh.xml~
en_BS.xml       fr_GQ.xml        ln.xml          se.xml           zu.xml
en_BW.xml       fr_HT.xml        lo_LA.xml       sg_CF.xml        zu_ZA.xml
en_BZ.xml       fr_KM.xml        lo.xml          sg.xml

Facebook

Facebook passes back the locale of their users like so: ar_AR, zh_CN, zh_TW, and there’s a list of their supported languages and locales here. Of the bunch, they’re the most consistent, sticking to a [2-letter language code + 2-letter ISO country name] combo.

Google

Google passes back the locale of their users like so: ar, zh-CN, zh-TW, and there’s a list of their supported languages and locales here. They mix and match pure 2-letter language codes with [2-letter language code + 2-letter ISO country name] and even [2-letter language code + 3-letter ISO region code] combos. Sigh.

Twitter

Twitter generally passes back the locale of their users using 2-letter language codes such as ar, de, en, etc., but for Chinese they pass back zh-cn and zh-tw for Simplified and Traditional Chinese. Of course, this is pure speculation, because the most detailed available info about this comes by reading the source of their Tweet button generator.

So…

Needless to say, it all gets a bit confusing. But the point is this:

  1. In the Django settings.LANGUAGES list, strip your language codes down to two letters where possible, and use all-lowercase codes with a hyphen separating the [language code – region code] pair, or Django will complain.
  2. On disk, make sure your translation files are in directories like locale/en, locale/zh_CN, and so on, with underscores and capital letter region codes where necessary.
  3. And if you’re ever using OAuth to authenticate your incoming users, make sure to process the locale information coming from Facebook, Google, or Twitter, into the lower-case, hyphen-separated form used by Django, before you write it into the user’s profile.
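The third point can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the function normalize_provider_locale, the SUPPORTED set (copied from the LANGUAGES codes above), and the choice of English as the last-resort default.

```python
# Language codes as they appear in settings.LANGUAGES above
SUPPORTED = {"ar", "de", "en", "es", "fr", "it", "ja", "ko", "pt", "ru",
             "zh-cn", "zh-tw"}


def normalize_provider_locale(raw, supported=SUPPORTED, default="en"):
    """Map a provider locale (ar_AR, zh-CN, zh-tw, ...) to a supported Django code."""
    code = raw.strip().replace("_", "-").lower()
    if code in supported:
        return code
    # Fall back to the bare language code, e.g. "ar-ar" -> "ar"
    base = code.split("-")[0]
    if base in supported:
        return base
    return default  # assumption: unknown locales get the site default


for raw in ("ar_AR", "zh_CN", "zh-TW", "en_US"):
    print(raw, "->", normalize_provider_locale(raw))
```

With the set above, Facebook’s ar_AR collapses to ar, while zh_CN and zh-TW come out as zh-cn and zh-tw.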

One Final Thing

It was definitely interesting to see what kind of changes to the styling and layout were necessary to support the Arabic language. One of the tricky things about getting right-to-left mode to style properly is that the CSS float and text-align properties do not flip their meanings when you set the text direction on the body content: left still means left.

So in our case, we had a panel that relied on floats to style properly.

In the left-to-right case, it looks like:

[Image: ltr-panel]

In the right-to-left case, it looks like:

[Image: rtl-panel]

To get it to do this, we had to add a {% if LANGUAGE_BIDI %}rtl{% endif %} rule to each of the CSS classes that needed the explicit float: property change, then specify those styles like so:

And that’s it, for now.

Localization / Translation Using Google Spreadsheets

So I spent a little time over the weekend hacking together a piece of code to help export usable localization / translation files from a simple spreadsheet in Google Docs. With a little extra effort setting up Protected Ranges and giving other users editing permissions, team translation using the same spreadsheet should be pretty easy. Currently, it can export Django-style gettext .po files and jquery.localize-compatible .json translation files.

This was an experiment in learning how to use Google Spreadsheets in combination with Google App Scripts to create a localization table that can be accessed via JSON/JSONP. This is useful because there’s still no useful built-in JSON publishing option for individual spreadsheets.

So here is the localization spreadsheet URL that I’m using as my source (spreadsheet key highlighted):
https://docs.google.com/spreadsheet/ccc?key=0AqrUvD5TZZs3dF9ULUh5X1JlakVJRGFHaWRZQmFuZEE

It looks like this:

And here is the Google App Script URL that will generate localization string tables from that spreadsheet:
https://script.google.com/macros/s/AKfycbxLnEUyElPtL01qHnL7pD2hmTmaO7Tc1yLhjJzQpitpuBfxxBU/exec

If you put the two together, with the querystring sheet_id set to the spreadsheet key and sheet_name set to an appropriate sheet name inside that spreadsheet, you get the following:
https://script.google.com/macros/s/AKfycbxLnEUyElPtL01qHnL7pD2hmTmaO7Tc1yLhjJzQpitpuBfxxBU/exec?sheet_id=0AqrUvD5TZZs3dF9ULUh5X1JlakVJRGFHaWRZQmFuZEE&sheet_name=Main
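If you would rather assemble that link in code, a small Python sketch using the standard library looks like this. The helper name build_export_url is made up; sheet_id and sheet_name are the querystring parameters described above.

```python
from urllib.parse import urlencode

SCRIPT_URL = ("https://script.google.com/macros/s/"
              "AKfycbxLnEUyElPtL01qHnL7pD2hmTmaO7Tc1yLhjJzQpitpuBfxxBU/exec")


def build_export_url(sheet_id, sheet_name, script_url=SCRIPT_URL):
    """Attach the spreadsheet key and sheet name to the export script URL."""
    query = urlencode({"sheet_id": sheet_id, "sheet_name": sheet_name})
    return script_url + "?" + query


print(build_export_url("0AqrUvD5TZZs3dF9ULUh5X1JlakVJRGFHaWRZQmFuZEE", "Main"))
```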

And when you retrieve that link, it generates the following (folded for brevity):

{
    "de": {
        "string_with_quotes": "Mit \"\"", 
        "potato": "Kartoffel", 
        "language_code": "de", 
        "hello": "Guten Tag!", 
        "chanterelle_mushroom": "Pfifferlinge", 
        "how_are_you": "Wie geht's?", 
        "string_with_comma": "Mit, Komma", 
        "language_name": "Deutsch", 
        "string_with_colon": "With:", 
        "text_direction": "ltr", 
        "string_with_newlines": "Mit \n\n"
    }, 
    "zh-Hant": {},
    "zh-Hans": {},
    "de-AT": {
        "string_with_quotes": "Mit \"\"", 
        "potato": "Erdapfel", 
        "language_code": "de-AT", 
        "hello": "Guten Tag!", 
        "chanterelle_mushroom": "Eierschwammerl", 
        "how_are_you": "Wie geht's?", 
        "string_with_comma": "Mit, Komma", 
        "language_name": "Deutsch (Österreich)", 
        "string_with_colon": "With:", 
        "text_direction": "ltr", 
        "string_with_newlines": "Mit \n\n"
    }, 
    "fr": {},
    "en": {
        "string_with_quotes": "With \"\"", 
        "potato": "potato", 
        "language_code": "en", 
        "hello": "Hello!", 
        "chanterelle_mushroom": "chanterelle mushroom", 
        "how_are_you": "How are you?", 
        "string_with_comma": "With, comma", 
        "language_name": "English", 
        "string_with_colon": "With:", 
        "text_direction": "ltr", 
        "string_with_newlines": "With\n\n"
    },
    "ja": {
        "hello": "こんにちは!",
        "language_code": "ja",
        "text_direction": "ltr",
        "string_with_comma": "With, comma",
        "string_with_newlines": "With \n\n",
        "how_are_you": "お元気ですか?",
        "string_with_quotes": "With \"\"",
        "string_with_colon": "With:",
        "potato": "potato",
        "language_name": "日本語",
        "chanterelle_mushroom": "chanterelle mushroom"
    }
}

The spreadsheet format follows a few conventions. The left-most column (Column 0) is the keystring, which you use to access the translation value later. The next column to the right (Column 1) is the source-language or default-language column, in this case English, which should contain all of the original strings you need to localize. The language_code row is the top-most row (Row 0). Note that the export script fills in untranslated strings in a target language in two steps: first with the translations from the closest base-language language_code, so in the case of Austrian German “de-AT”, it pulls in the translations from the generic German “de” language_code column; then, for anything still missing, from the default-language column.
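The fallback order can be sketched in Python as follows. This is an illustration of the rule as described, not the export script itself (which runs as a Google App Script); resolve_translations and the tiny sample table are invented for the example, and an empty cell is assumed to count as untranslated.

```python
def resolve_translations(table, default_language="en"):
    """Fill gaps per language: own string, then base language, then default."""
    default_strings = table[default_language]
    resolved = {}
    for code, strings in table.items():
        base = code.split("-")[0]  # "de-AT" -> "de"
        base_strings = table.get(base, {}) if base != code else {}
        merged = {}
        for key, source_text in default_strings.items():
            # "or" skips both missing keys and empty strings
            merged[key] = strings.get(key) or base_strings.get(key) or source_text
        resolved[code] = merged
    return resolved


sample = {
    "en": {"potato": "potato", "hello": "Hello!"},
    "de": {"potato": "Kartoffel", "hello": "Guten Tag!"},
    "de-AT": {"potato": "Erdapfel"},
    "ja": {"hello": "こんにちは!"},
}
print(resolve_translations(sample)["de-AT"])
```

Here de-AT keeps its own “Erdapfel”, borrows “Guten Tag!” from de, and ja falls all the way back to the English “potato”.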

I’ve set up a github repository to capture further development, and have been looking at the various file formats supported by Transifex, to see if it would be possible to generate output from the spreadsheet for some of them. Unfortunately, Google App Scripts doesn’t let you generate and immediately return a ZIP file, for which I filed a bug. To be exact, you can generate the ZIP file, but Google provides no good way to return the raw bytes and it’s unclear whether they ever will. Nonetheless, there are two Python scripts in the github repo that generate useful gettext and JSON files. I think the most important formats that still need support (which I’ll add eventually unless someone beats me to it) would be Android, Windows, and OS X/iOS string resource files.

Helper Scripts: jquery.localize

Looking at the github repository, there’s one script under the “jquery.localize” folder called “generate-language-pack.py”, which generates files that can be used with the jquery.localize plugin:

translation-de-AT.json
translation-de.json
translation-en.json
translation-fr.json
translation-ja.json

The file contents look like:

{
    "string_with_quotes": "Mit \"\"",
    "potato": "Kartoffel",
    "language_code": "de",
    "hello": "Guten Tag!",
    "chanterelle_mushroom": "Pfifferlinge",
    "how_are_you": "Wie geht's?",
    "string_with_comma": "Mit, Komma",
    "language_name": "Deutsch",
    "string_with_colon": "Mit:",
    "text_direction": "ltr",
    "string_with_newlines": "Mit \n\n"
}
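Conceptually, producing these files means splitting the combined table into one JSON document per language. The Python sketch below shows that idea only; it is not the contents of generate-language-pack.py, and language_pack_files is a hypothetical name. It returns the file contents instead of writing to disk so it is easy to try out.

```python
import json


def language_pack_files(table, prefix="translation"):
    """Map each output filename to the JSON text for one language."""
    files = {}
    for code, strings in table.items():
        filename = "{}-{}.json".format(prefix, code)
        # ensure_ascii=False keeps characters like "Ö" readable in the output
        files[filename] = json.dumps(strings, ensure_ascii=False, indent=4)
    return files


table = {
    "de": {"potato": "Kartoffel", "language_name": "Deutsch"},
    "de-AT": {"potato": "Erdapfel", "language_name": "Deutsch (Österreich)"},
}
for name, text in language_pack_files(table).items():
    print(name)
    print(text)
```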

I’ve set up an example website using jquery.localize and a simple $.click() handler, to show the translations in action.

Helper Scripts: gettext

Under the “gettext” subdirectory in the github repository, there’s a file called “localize-django-app.py”, which, when run inside a Django app directory, prints out the following:

Markdown unavailable.
Creating locale/de/LC_MESSAGES/django.po
Creating locale/de_AT/LC_MESSAGES/django.po
Creating locale/en/LC_MESSAGES/django.po
Creating locale/fr/LC_MESSAGES/django.po
Creating locale/ja/LC_MESSAGES/django.po
Creating locale/zh_Hans/LC_MESSAGES/django.po
Creating locale/zh_Hant/LC_MESSAGES/django.po

And generates gettext catalogs that look like:

If you have Markdown installed, the script will run markdown.markdown() on the msgstr translation values before outputting them.
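For orientation, a gettext catalog is essentially a sequence of msgid/msgstr pairs with backslash-escaped quotes and newlines. The Python sketch below shows how one entry and the catalog path could be produced from a spreadsheet language code. The names po_escape, po_entry, and catalog_path are invented here, and this is not the code in localize-django-app.py; a real catalog also needs a header entry, which the sketch leaves out.

```python
def po_escape(text):
    """Escape backslashes, double quotes, and newlines for a PO string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def po_entry(msgid, msgstr):
    """Format a single msgid/msgstr pair."""
    return 'msgid "{}"\nmsgstr "{}"\n'.format(po_escape(msgid), po_escape(msgstr))


def catalog_path(language_code, domain="django"):
    """'de-AT' -> 'locale/de_AT/LC_MESSAGES/django.po'"""
    locale_name = language_code.replace("-", "_")
    return "locale/{}/LC_MESSAGES/{}.po".format(locale_name, domain)


print(catalog_path("de-AT"))
print(po_entry("how_are_you", "Wie geht's?"))
```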

Google App Script Source

I’ve also set the export script to be viewable via the following link: https://script.google.com/d/167d-d6YtX74ZOZ0auMlPdd6emqusmWy5wUqhruKo9uu8AQoaoc3yvZsP/edit?usp=sharing, but I think you might have to log in to view it. It doesn’t seem like you can just see the script w/o having a Google Account. If you’d like to collaborate on making this better, let me know and I can grant edit access.

Damn You, Emacs Autoinsert

Emacs can also be annoying when it autoinserts text without checking to see if it’s already there. Case in point, gettext .po files under Emacs 23.1. For whatever reason, my copy kept inserting this header whenever I opened a translation file:

If this header sneaks into a .po file, when you try to compile the file with Django’s compilemessages command, you get the following error:

What seems to happen is that this header is only inserted when you use C-x C-f to open a file. When opening a file directly from the command line, this corruption does not seem to occur.

If you look at the PO Group configuration options, this text is listed there as the default PO file header. I haven’t validated this as a solution, but you should be able to switch this value to be the comment character “#”, and that should take care of the problem.

Also, why the hell does the PO major mode not allow you to destroy entire msgid’s and their translations? This is really annoying. Yes, it’s nice sometimes when programs limit your options, but in this case, Emacs was messing up by inserting the bad header, then refusing me the option of removing it.

Language Names, In Those Languages

While looking around for a decent spreadsheet containing a map between the ISO 639-1 two-letter language codes and localized versions of language names, I could not find a straightforward version of this information published in a sensible form like CSV or as a JSON object.

What I’m looking for is something like:

This way, when I want to display the name of a language in a particular locale, I can just do a simple lookup: languageNameMap[localeCode][languageCode].

So here is my attempt at putting together something like this, using reference material from the Unicode Common Locale Data Repository.

Inside of the core.zip file, a number of locale definition files are located under common/main/*.xml:

Each of these files contains a list of the world’s languages, as they would be named in that locale.

For example, in the German language locale definition file “de.xml” and many of the other files, there’s a “languages” list that looks like:

Now, let’s say we want the names of the German language and the English language, in those languages, respectively. The output should be a grid of 2 x 2 language name pairs.
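As a rough illustration of that grid, here is a Python sketch that reads language names out of CLDR-style XML, where each name is a language element with a type attribute under localeDisplayNames/languages. The two XML strings are trimmed-down stand-ins for de.xml and en.xml, and build_language_name_map is a hypothetical helper, so treat it as a sketch under those assumptions.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-ins for CLDR's common/main/de.xml and en.xml
SAMPLE_LDML = {
    "de": """<ldml><localeDisplayNames><languages>
        <language type="de">Deutsch</language>
        <language type="en">Englisch</language>
        <language type="fr">Französisch</language>
    </languages></localeDisplayNames></ldml>""",
    "en": """<ldml><localeDisplayNames><languages>
        <language type="de">German</language>
        <language type="en">English</language>
        <language type="fr">French</language>
    </languages></localeDisplayNames></ldml>""",
}


def build_language_name_map(ldml_by_locale, wanted):
    """Return {locale_code: {language_code: name}} for the wanted languages."""
    name_map = {}
    for locale_code, xml_text in ldml_by_locale.items():
        root = ET.fromstring(xml_text)
        names = {}
        for node in root.findall("./localeDisplayNames/languages/language"):
            code = node.get("type")
            # Skip alternate forms, which carry an "alt" attribute
            if code in wanted and node.get("alt") is None:
                names[code] = node.text
        name_map[locale_code] = names
    return name_map


language_name_map = build_language_name_map(SAMPLE_LDML, wanted={"de", "en"})
print(language_name_map["de"]["en"])  # Englisch
```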

I’ve written a PHP script to parse the necessary locale definition files and to create a JSON object containing this information:

When it is run from the command line (and with a little help from Python, since OS X doesn’t by default ship w/a pretty-printing PHP) the following should pop out:

So if you want to use this later to display the name of the English language in German, you just do something like languageNameMap['de']['en']. (I realize it might even be easier to rewrite the script so it’s LNM['en']['de'] instead, but I’ll leave that as an exercise for the reader.)
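For what it’s worth, flipping the index order doesn’t require touching the parser; the finished map can simply be transposed. A minimal Python sketch, with transpose_name_map as a made-up name and a hand-written 2 x 2 sample map:

```python
def transpose_name_map(name_map):
    """Turn map[locale][language] into map[language][locale]."""
    transposed = {}
    for locale_code, names in name_map.items():
        for language_code, name in names.items():
            transposed.setdefault(language_code, {})[locale_code] = name
    return transposed


language_name_map = {
    "de": {"de": "Deutsch", "en": "Englisch"},
    "en": {"de": "German", "en": "English"},
}
lnm = transpose_name_map(language_name_map)
print(lnm["en"]["de"])  # Englisch: the English language, named in German
```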