Language Names, In Those Languages

While looking around for a decent table mapping the ISO 639-1 two-letter language codes to localized versions of language names, I could not find this information published in a sensible form like CSV or JSON.

What I'm looking for is something like:

var languageNameMap = {
	'de': { 'de': 'Deutsch',  'en': 'Englisch', 'fr': 'Französisch' },
	'en': { 'de': 'German',   'en': 'English',  'fr': 'French' },
	'fr': { 'de': 'Allemand', 'en': 'Anglais',  'fr': 'Français' }
};

This way, when I want to display the name of a language in a particular locale, I can just do a simple lookup: languageNameMap[localeCode][languageCode].
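The bare lookup throws a TypeError if the locale is missing from the map, so a small guard can help. This is only a sketch: getLanguageName is a hypothetical helper that is not part of the original script, and falling back to the bare language code is an assumption about what you would want to display.

```javascript
// Same shape as the map above: languageNameMap[localeCode][languageCode].
var languageNameMap = {
	'de': { 'de': 'Deutsch',  'en': 'Englisch', 'fr': 'Französisch' },
	'en': { 'de': 'German',   'en': 'English',  'fr': 'French' },
	'fr': { 'de': 'Allemand', 'en': 'Anglais',  'fr': 'Français' }
};

// Hypothetical helper: returns the name of languageCode as written in
// localeCode, or the bare language code if either key is missing.
function getLanguageName(map, localeCode, languageCode) {
	var names = map[localeCode];
	if (names && Object.prototype.hasOwnProperty.call(names, languageCode)) {
		return names[languageCode];
	}
	return languageCode;
}

console.log(getLanguageName(languageNameMap, 'fr', 'de')); // Allemand
console.log(getLanguageName(languageNameMap, 'es', 'de')); // de (no 'es' locale in the map)
```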

So here is my attempt at putting together something like this, using reference material from the Unicode Common Locale Data Repository (CLDR).

Inside of the core.zip file, a number of locale definition files are located under common/main/*.xml:

-rw-r--r--@ 1 user  staff    8202 Aug  1 22:53 aa.xml
-rw-r--r--@ 1 user  staff     702 Apr 27  2011 aa_DJ.xml
-rw-r--r--@ 1 user  staff   99651 Oct 11 12:59 af.xml
-rw-r--r--@ 1 user  staff    2033 Sep 23 15:02 af_NA.xml
-rw-r--r--@ 1 user  staff     297 May  5  2009 af_ZA.xml
-rw-r--r--@ 1 user  staff   27386 Sep 23 15:02 agq.xml
-rw-r--r--@ 1 user  staff     298 Aug  1 22:53 agq_CM.xml
-rw-r--r--@ 1 user  staff   24593 Oct 11 03:06 ak.xml
[...]

Many of these files contain a list of the world's languages, as they would be named in that locale.

For example, in the German language locale definition file "de.xml" and many of the other files, there's a "languages" list that looks like:

<languages>
	<language type="aa">Afar</language>
	<language type="ab">Abchasisch</language>
	<language type="ace">Aceh-Sprache</language>
	<language type="ach">Acholi-Sprache</language>
	<language type="ada">Adangme</language>
	[...]
</languages>

Now, let's say we want the names of the German language and the English language, each written in both of those languages. The output should be a 2 x 2 grid of language names.

I've written a PHP script to parse the necessary locale definition files and to create a JSON object containing this information:

<?php

// Usage:
// create_language_map.php CLDR-main-dir "2 letter language codes separated by spaces"

$cldrDir   = $argv[1];
$which     = $argv[2];
$whichList = explode(' ', $which);

// Collected names, keyed as $output[locale][language].
$output = array();

foreach($whichList as $languageA)
{
	$localeFile = $cldrDir . "/" . "$languageA.xml";

	// Load the locale definition file.
	if ($data = simplexml_load_file($localeFile))
	{
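		/*
		Each xpath() call below returns an array like this one
		(a var_dump of the 'en' entry, here with an Arabic name):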
		array(1) {
		  [0]=>
		  object(SimpleXMLElement)#11 (2) {
			["@attributes"]=>
			array(1) {
			  ["type"]=>
			  string(2) "en"
			}
			[0]=>
			string(20) "الإنجليزية"
		  }
		}
		*/

		// Loop over the language codes and get the names we want.
		foreach ($whichList as $languageB)
		{
			$L = $data->xpath("//languages/language[@type='$languageB']");
			if (is_array($L) && 1 == count($L))
			{
				// Coerce to string.
				$output[$languageA][$languageB] = "" . $L[0];
			}
		}
	}
}

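// defined() expects the constant's name as a string.
// JSON_PRETTY_PRINT only exists in PHP 5.4 and later.
if (defined('JSON_PRETTY_PRINT'))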
	print json_encode($output, JSON_PRETTY_PRINT);
else
	print json_encode($output);

?>

When it is run from the command line (with a little help from Python, since OS X doesn't ship with a pretty-printing PHP by default), the following should pop out:

$ php create_language_map.php ~/Downloads/core/common/main "de en" | python -m json.tool
{
    "de": {
        "de": "Deutsch", 
        "en": "Englisch"
    }, 
    "en": {
        "de": "German", 
        "en": "English"
    }
}

So if you want to use this later to display the name of the English language in German, you just do something like languageNameMap['de']['en']. (I realize it might even be easier to rewrite the script so it's LNM['en']['de'] instead, but I'll leave that as an exercise for the reader.)
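If you'd rather not touch the PHP script, one way to get the LNM['en']['de'] ordering is to swap the two key levels after loading the JSON. This is a rough sketch, not the only way to do it: transposeLanguageMap and byLocale are hypothetical names, and the sample data is just the two-language output shown above.

```javascript
// Output of the PHP script, keyed as byLocale[localeCode][languageCode].
var byLocale = {
	'de': { 'de': 'Deutsch', 'en': 'Englisch' },
	'en': { 'de': 'German',  'en': 'English' }
};

// Hypothetical helper: swaps the two key levels so the result is keyed
// as result[languageCode][localeCode].
function transposeLanguageMap(map) {
	var result = {};
	Object.keys(map).forEach(function (localeCode) {
		Object.keys(map[localeCode]).forEach(function (languageCode) {
			if (!result[languageCode]) {
				result[languageCode] = {};
			}
			result[languageCode][localeCode] = map[localeCode][languageCode];
		});
	});
	return result;
}

var LNM = transposeLanguageMap(byLocale);
console.log(LNM['en']['de']); // Englisch (English, as named in German)
```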