Internationalization and Localization

Googling "i18n" returns sites where we can read:

"Internationalization refers to the process of developing programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In other words, internationalization refers to the availability and use of interfaces that let programs modify their behavior at run time for operation in a specific language environment. The abbreviation I18N is often used to stand for internationalization, as there are 18 characters between the beginning "I" and the ending "N" of that word."

"A concept related to internationalization is localization (L10N), which refers to the process of establishing information within a computer system for each combination of native language, cultural data, and coded character set (codeset). A locale is a database that provides information for a unique combination of these three components. However, locales do not solve all of the problems that localization must address. Many native languages require additional support in the form of language-specific print filters, fonts, codeset converters, character input methods, and other kinds of specialized software."

The Goals

Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each covering a few languages, and only one language could be installed. XP was sold in different languages, and only one language was bought at a time; this was not a problem, since we could change the GUI, Time Zone and Region, but the admin account and user folder names like "Desktop" remained unchanged. Changing user folder names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2 and Thunar, were not yet XDG compliant, users didn't understand the system's behaviour. Information about XDG was then found only in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and also in Xdg-Utils, a project from http://portland.freedesktop.org.

XDG saves the default English paths for "Desktop" or "Downloads" in a file named "user-dirs.defaults", while a "locale" file saves the default language, i.e., the language selected during system installation or later in System > Language Support. Finally, at login time, the system translates the paths to the user's language and saves them in "user-dirs.dirs" in the user's home folder. This customizes the language and the paths of each user's personal folders.
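For example, after a Spanish install the generated file typically looks like this (the exact translated names depend on the language pack; these paths are illustrative):

 # ~/.config/user-dirs.dirs (generated at login by xdg-user-dirs)
 XDG_DESKTOP_DIR="$HOME/Escritorio"
 XDG_DOWNLOAD_DIR="$HOME/Descargas"
 XDG_DOCUMENTS_DIR="$HOME/Documentos"

XDG-compliant applications read these variables instead of assuming a folder literally named "Desktop".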

Then, the problem was using a language other than English with applications that didn't support XDG. Firefox 2 and Thunar assumed that the desktop folder was called "Desktop", which is true only in English; in Spanish, "Desktop" is just one more ordinary folder, even though Thunar's glyph for it was the desktop's own icon.


The System · Codesets

We all know that a codeset or charset is a binary representation of a set of characters, and we also know the 7-bit ASCII and that "A" is 65 decimal. Before ASCII, each system had its own codeset; then ASCII came for information interchange. This must be understood as system information interchange, i.e., it was enough for systems to exchange source code and English text, because when we think about letters, the English alphabet (the modern basic Latin) has around 26 of them, with no accents and the like. But we soon realize it is not enough for multi-language text, with its letter combinations, acutes, umlauts, cedillas and so on. Then "ANSI" came to make use of the most significant 8th bit, which in ASCII could be used for parity or custom characters; in ANSI, an 8-bit code, it is used to add 128 more characters.

But the languages of the planet are far too rich to fit in 8 bits; hence, ISO 8859 came to embrace languages other than English. The ISO charsets, like ANSI, are ASCII based, i.e., the same first 128 characters remain for compatibility. Characters are not letters but graphical representations, i.e., glyphs, and even the fifteen ISO 8859 parts that exist for several groups of languages are not enough for all languages, nor even for some individual ones. So the first criterion was to design ISO for information, not typography, which left some letters without a character of their own, such as the French "œ" ligature, the Dutch "ij" or the Spanish "ch" and "ll".
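To see how the same upper-128 byte changes meaning with the charset, here is a minimal PHP sketch (assuming the mbstring extension is available):

 <?php
 // The byte 0xE9 has no fixed meaning until a charset is declared:
 $byte = "\xE9";
 echo mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1'); // é (Latin-1)
 echo mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-7'); // ι (Greek iota)

The same stored byte is "é" to a Latin-1 reader and "ι" to a Greek one.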

For all languages to be represented, this means that a character from the upper 128 positions is undefined until the charset is declared, and if the charset changes, the text and its relations change with it. Moreover, since some languages can use several ISO charsets, declaring the language is not enough. And we must also take into account that the same language is not used the same way in different regions. For these reasons, the expression that defines a locale is the following: [language][_territory][.charset][@modifier]
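In PHP, for instance, such a locale string is what setlocale() expects (a sketch; which locale names actually work depends on what is installed on the system):

 <?php
 // Spanish as spoken in Spain, encoded in Latin-1, with the euro modifier:
 setlocale(LC_ALL, 'es_ES.ISO-8859-1@euro');
 // The same language as used in Argentina, in UTF-8:
 setlocale(LC_ALL, 'es_AR.UTF-8');
 // setlocale() returns false if the requested locale is not installed.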

Then came both Unicode and the ISO 10646 Universal Character Set, with different flavors: the 16-bit UTF-16, the multi-byte UTF-8, etc. Unicode takes account of all characters in all languages, and UTF-8 comes to help solve the multi-language problem, while introducing some issues of its own: because it is a variable-length multi-byte encoding, it is more complex, less supported, and data grows in size. For Spanish, "a" and "á" are the same letter, but for the system they are different characters; recently, we have "Á" too, since the rule that uppercase letters carried no acute was deprecated. But obviously, none of "a", "á" and "Á" is coded as 65 decimal (though at least they are case-related). There are interesting discussions about the German ß and other cases. We all know this is a divine punishment for trying to build the Tower of Babel.
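A quick PHP sketch makes the point (assuming a UTF-8 encoded script and the mbstring extension):

 <?php
 echo ord('a');                    // 97: one byte, plain ASCII
 echo strlen('á');                 // 2: in UTF-8, "á" takes two bytes
 echo strtoupper('á');             // "á" unchanged: byte functions don't know the letter
 echo mb_strtoupper('á', 'UTF-8'); // "Á": the multi-byte function does

strlen counts bytes, not letters, which is exactly the kind of issue UTF-8 introduces.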

I consider English the system's native language and Spanish the locale to change to, and this is a good starting point: a case that seems to be in the easy group, I mean, as far as languages go, those in the ISO-8859-1, a.k.a. Latin-1, charset. The important part is that I don't need to change the codeset, and this is an advantage, because systems may use every ISO charset as well as UTF-8, but typically only UTF-8 and one ISO locale are installed; i.e., we must install additional ISO locales to change between them, and doing so will make some characters in the upper 128 change.

The Issues

About the last part, "locales do not solve ...", I agree. We remember the time when everything was written in uppercase; here in Spain the news media were worried about the "ñ". Nowadays the issue continues: what about spaces in a URI? Or, now that it is possible to register such a domain name, the goal is to use it. Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN), i.e., a UTF-8 domain name. Human beings may guess what was written: we can read it, and we can transcode it back to the original UTF-8 because we know the issue is that it is being displayed as Latin-1, but my system can't guess that. Just think of Arabic right-to-left writing. But we don't need to go so far. I live in the Canary Islands, which belong to Spain, but our time is one hour behind. When a timestamp is required for authentication, some systems don't like that. Here in the Canaries it would be desirable to have our own locale or, at least, that Windows had our time zone like Linux has. This has nothing to do with charsets, but with locales.

(Phil: You can modify the timezone system parameter in config.php - the installer also allows this to be set as of 3.11 - independent of the locale selected. In webERP only LC_MESSAGES is affected by the selection of the locale - i.e. the words/characters in the interface)
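As an aside, PHP itself can pin the time zone independently of the locale; 'Atlantic/Canary' is the real tz database name for the islands:

 <?php
 date_default_timezone_set('Atlantic/Canary');
 echo date('H:i T'); // local Canary time, one hour behind mainland Spain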

Representing characters is not the only goal: we want lists to be sorted, which needs character comparison functions like "a < b", and we also want to use other string functions like length, position and case functions. I don't know Turkish, and a few days ago I discovered something the Turkish have always known: a Turkish letter is not compatible with the usual case functions. It is difficult to explain such a simple thing on a Latin-1 site like this one, only because it cannot display these Turkish letters or the French Œ ligature; the code points that correspond to them will be displayed following Latin-1, and moreover, there are no HTML entities for the Turkish letters. But the point is that "I = uppercase(i)" returns false when ISO-8859-9 is used.
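The Turkish case problem in a nutshell (a PHP sketch, assuming a UTF-8 script; the code points are given in the comments):

 <?php
 // Turkish has two distinct letters i, dotted and dotless:
 // i (U+0069) uppercases to İ (U+0130), and I (U+0049) lowercases to ı (U+0131).
 var_dump(strtoupper('i') === 'I'); // bool(true): right in English, wrong in Turkish
 echo mb_strtoupper('i', 'UTF-8');  // "I": still the English mapping
 // Locale-aware Turkish casing needs extra support, e.g. the intl/ICU extension.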

We don't have to travel to exotic countries to see issues when we want a list sorted in alphabetical order. ASCII binary order matches the English alphabet order; ISO and UTF-8 orders don't, it's a compromise. French and Spanish have letters like the English w, I mean a sort of dual letter (think of w as a double v or double u). We don't have the w in Spanish, but we use it anyway, because rae.es gives güisqui as the translation of whisky. But we have "ch" and "ll", and French has "œ"; these are worse, because they are letters represented by two characters, and double letters will always need special treatment when ordering lists. In Spain we are now used to seeing "ch" between "cg" and "ci" instead of after all the "c" combinations. MySQL does a great job of ordering a list if you select the collation for the language, but this is not a run-time multi-language property and we shouldn't switch it dynamically.
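For example, MySQL ships both Spanish collations (a sketch; the connection details, table and column names are hypothetical, and the COLLATE clauses assume a utf8 column):

 <?php
 // Hypothetical connection details:
 $user = 'weberp'; $pass = 'secret';
 $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', $user, $pass);
 // utf8_spanish2_ci is "traditional Spanish": ch sorts after cz, ll after lz.
 $traditional = $pdo->query("SELECT Name FROM Cities ORDER BY Name COLLATE utf8_spanish2_ci");
 // utf8_spanish_ci is "modern Spanish": ch and ll are not special, but ñ still sorts after n.
 $modern = $pdo->query("SELECT Name FROM Cities ORDER BY Name COLLATE utf8_spanish_ci");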

Text Files

As we have learned, there is no plain text outside the boundaries of ASCII, i.e., outside English, and, as frequently happens, there is now more than one standard, so the system needs to know the encoding first. Plain text files have no header to tell which charset was used to represent the text in binary. We have several kinds of text files, since HTML, PHP and .po files are all just text files, but at least HTML and .po files have headers to tell the system the codeset they were written in; PHP files don't. When a file has a header declaring the codeset, the declaration must match the codeset the file was actually saved with. When there is no header, it is better to use UTF-8 without BOM (Byte Order Mark). Using an ISO charset other than Latin-1 makes the file local and reduces interchangeability, since the codeset must be communicated separately. As in C or ANSI C, the code must stay within the codeset boundaries to be read by the interpreter or the compiler, and this must not change. The text editor I use in Linux is gedit; gedit makes a backup copy of edited files and remembers the codeset. This way it knows which codeset to use when opening a file again, but if I delete a file and copy another with the same name and a different codeset, gedit opens the new file using the old codeset.
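These are the kinds of headers meant here. A .po file declares its charset in the header entry, and an HTML page in a meta tag (both examples assume UTF-8):

 # gettext .po header
 msgid ""
 msgstr ""
 "Content-Type: text/plain; charset=UTF-8\n"

 <!-- (X)HTML header -->
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

A PHP source file has no such slot, so its encoding must simply be agreed upon.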

webERP

Up to here, this seems introduction enough to the very hard goal of achieving multi-language software, even just English-French. And we have not mentioned PDF yet. It's time to get on with webERP. webERP has very good multi-language support based on gettext.

If we look at the webERP translations, we see a lot of confusion among translators, such as the file's own charset declaration or the use of HTML entities in .po files. I avoid entities, hard-coded strings and escapes whenever I can.
 About entities: why should I write "Los &Aacute;ngeles" when it's possible to write "Los Ángeles"?
 About hard code, this is an example (Delphi-style string concatenation): S := 'MyFieldName = ''' + MyForm.MyEditControl.Text + '''';


(Phil: We do have some html entities in webERP and I would like to understand more about the best approach here too)

Again, it's difficult to show the example here, since hard code uses the same character combinations as the wiki markup. We see it all the time: special characters. They act like control signals, not data, and a special mechanism must be used to show them as data.
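In PHP the usual mechanisms are entity escaping for output and parameter binding for SQL (a sketch; $pdo is a connection like the one sketched above, and the table is hypothetical):

 <?php
 // Output: let the escaping function turn control characters into data ...
 echo htmlspecialchars("5 < 6 & 'quotes'", ENT_QUOTES, 'UTF-8');
 // ... instead of hand-writing &lt;, &amp; and friends in the strings.
 // SQL: bind the value so quotes inside it are data, not syntax.
 $stmt = $pdo->prepare('SELECT * FROM Items WHERE Name = ?');
 $stmt->execute(["O'Brien"]);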

webERP's default language is en_GB, charset ISO-8859-1, which supports a well-known, good number of languages, some of them with well-known issues, but the program works perfectly well within the boundaries of ISO-8859-1. This means multi-language support for all those languages. But there are many more languages or, to be more precise, a few more ISO charsets.

Some people may want wE to work with a language or languages outside the Latin-1 charset. Here we may ask ourselves whether those languages belong to a single ISO charset or need more than one. The second case means very dynamic charset switching, and it would probably be better to change to UTF-8, since this is exactly what UTF-8 was designed for.

Because wE is written in English, it has very good, although not total, compatibility with systems in other languages; but it often happens that this compatibility leads English-speaking developers to bypass certain checks. Outside ISO-8859-1, or the almost identical ISO-8859-15, wE still needs some tracing and debugging. A lot of work is being done, and some other charsets have been proven to work, but not all. Some minor arrangements can be made by users, at least as far as PHP, HTML, Apache, MySQL and gettext are concerned. The most difficult part is PDF, and because a reporting tool is an essential part of a database application, not solving this issue makes the application almost unusable for a limited number of people whose language is, from the system's point of view, very special. (Phil: Yes indeed the major stumbling block!!)

Most wE files are saved UTF-8 without BOM; all of them should be. You can check this and correct it if necessary; this is free software. This includes all files. Some PHP must be edited to work with Turkish and others. SQL scripts and MySQL must be checked, and PHP configured along these lines to support UTF-8:
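A minimal sketch of the kind of php.ini settings involved (an assumption on my part; directive availability varies by PHP version, and MySQL additionally needs the connection charset set, e.g. with SET NAMES utf8):

 ; php.ini, candidate UTF-8 related settings
 default_charset = "UTF-8"
 mbstring.language = Neutral
 mbstring.internal_encoding = UTF-8
 mbstring.http_output = UTF-8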

To be continued ...


Javier De Lorenzo-Cáceres Cruz
aese, aplicaciones software, s.l.
Sta. Cruz de Tenerife, Canarias.

(Phil: A superb start on the multi-language issues Javier)