Revision history for Internationalization


Revision [1640]

Last edited on 2009-10-13 06:28:15 by JavierDeLorenzo
Additions:
Only a few programs can change their language at runtime. When I bought Acrobat Standard it came with two CDs, each covering a few languages, and only one language could be installed; XP was sold in different languages, and only one language was bought at a time. This was not a problem, since we could change the GUI, Time Zone and Region, but the admin account and user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2 and Thunar, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then only in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
Deletions:
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, only one language was bought at time, this was not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux Apps. Until march 2008 this was a big surprise for many users, since many apps like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG was then only in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a project from http://portland.freedesktop.org.


Revision [1639]

Edited on 2009-10-13 06:22:08 by JavierDeLorenzo
Additions:
Most wE files are saved as UTF-8 without BOM; all should be. You can check this and correct it if necessary; this is free software. This includes all files. Some PHP must be edited to work with Turkish and other languages. SQL scripts and MySQL must be checked, and PHP configured as follows, to support UTF-8:
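The configuration the colon introduces is not reproduced in this revision. As a rough sketch only (the database name is a placeholder, and these are generic MySQL statements, not webERP's shipped settings), the MySQL side of UTF-8 support typically looks like:

```sql
-- Generic sketch, not webERP's actual configuration.
SET NAMES 'utf8';   -- make client, connection and results charsets utf8
-- Ensure the schema itself defaults to utf8 ('mydb' is a placeholder name):
ALTER DATABASE mydb CHARACTER SET utf8 COLLATE utf8_general_ci;
```

On the PHP side, issuing SET NAMES right after connecting (or setting the connection charset through the client API, e.g. mysqli_set_charset) keeps both ends in agreement.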
Deletions:
Most wE files are saved utf-8 without BOM coded, all should be, you can check it and correct it if necesary, this is free software. This includes all files. Some php must be edited to work with turkish and others. SQL scripts and MySQL must be checked and php configured this way to support utf-8:


Revision [1638]

Edited on 2009-10-13 03:57:37 by JavierDeLorenzo
Additions:
Representing characters is not the only goal: we want lists to be sorted, which needs character comparison functions like "a < b", and we also want to use other string functions like length, position and case functions. I don't know Turkish, and a few days ago I discovered something Turkish speakers have always known: a Turkish letter is not compatible with the usual case functions. It is difficult to explain such a simple thing on a latin1 site like this one, only because it cannot display those Turkish letters, or the French OE ligature; the code points that correspond to them will be displayed following latin-1. Moreover, there are no HTML entities for the Turkish letters. But the case is that "I = uppercase(i)" returns false when iso-8859-9 is used.
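The Turkish case problem can be made concrete. A minimal Python sketch (Python rather than the project's PHP, purely for illustration; the mapping table is my own, not part of webERP): the default uppercase function maps "i" to "I", which is wrong in Turkish, where dotted "i" uppercases to "İ" (U+0130) and dotless "ı" (U+0131) uppercases to "I".

```python
# Default str.upper() is not locale-aware: it maps "i" -> "I",
# which is the wrong answer for Turkish text.
TURKISH_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})

def turkish_upper(s):
    """Uppercase s using the Turkish i/ı mapping first, default rules after."""
    return s.translate(TURKISH_UPPER).upper()

print("istanbul".upper())          # ISTANBUL  (wrong in Turkish)
print(turkish_upper("istanbul"))   # İSTANBUL  (correct in Turkish)
```

This is exactly why "I = uppercase(i)" is false under iso-8859-9: the Turkish mapping pairs I with ı, not with i.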
We don't have to travel to exotic countries to see issues when we want a list sorted in alphabetical order. ASCII binary order matches alphabet order; iso and utf-8 don't, it's a compromise. French and Spanish have letters like the English w, I mean a sort of dual letter (thinking of w as a double v, or double u). We don't have the w in Spanish, but we use it, because rae.es gives güisqui as the translation of whisky. But we have "ch" and "ll", and French has OE; these are worse, because they are single letters represented by two characters, and double letters will always need special treatment when ordering lists. In Spain we are used to seeing "ch" between "cg" and "ci", rather than after all the other "c" combinations. MySQL does a great job of ordering a list if you select the collation language, but this is not a runtime multi-language property and we shouldn't use it dynamically.
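The collation problem can be seen without MySQL. A short Python sketch (illustration only; es_key is a rough hypothetical helper, not how real collations work): plain code-point comparison puts "ñ" (U+00F1) after "z" (U+007A), so a binary sort misplaces Spanish words.

```python
import unicodedata

words = ["zorro", "ñandú", "nube"]
# Plain sorted() compares code points, so "ñandú" lands after "zorro",
# even though Spanish alphabetical order puts ñ just after n.
print(sorted(words))  # ['nube', 'zorro', 'ñandú']

def es_key(w):
    """Crude Spanish sort key: slot ñ between n and o, strip accents via NFD."""
    w = w.replace("ñ", "n\uffff")  # make ñ compare just after plain n
    return unicodedata.normalize("NFD", w)

print(sorted(words, key=es_key))  # ['nube', 'ñandú', 'zorro']
```

MySQL solves this properly through collations; the traditional Spanish collations even treat "ch" and "ll" as single letters, which no simple key function like this one reproduces.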
As we have learned, there is no plain text outside the boundaries of ASCII, i.e., outside English, and, as frequently happens, there is now more than one standard, so the system needs to know the code first. Plain text files don't have a header to say which charset was used to represent the script in binary. We have several kinds of text files, since html, php and .po files are just text files; but at least html and .po have headers to tell the system the codeset they were written with, while php doesn't. When a file has a header declaring the codeset, the declaration must match the codeset used when the file was saved. When there is no header, it's better to use UTF-8 without BOM (Byte Order Mark). Using an ISO set other than latin1 makes the file local and reduces interchangeability, since the codeset must be communicated. As in C or ANSI C, the code must stay within the codeset boundaries to be read by the interpreter or the compiler, and this must not change. The text editor I use in Linux is "gedit"; gedit makes a backup copy of edited files and remembers the codeset. This way it knows what codeset to use when opening a file; but if I delete a file and copy another with the same name and a different codeset, gedit opens the new file using the old codeset.
Deletions:
Representing characters it's not the only goal, we want lists to be sorted what needs the use of character comparison functions like "a < b" and we also want to make use of other string functions like lenght, pos and case functions. I don't know turkish and a few days ago I discovered something that turkish have always known, a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a latin1 site like this, only because it cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be displayed following latin-1, moreover, there is no HTML entities for turkish, but the case is that "I = uppercase(i)" returns false when iso-8859-9 is used.
We don't have to travel to exotic countries to see some issues when we want a list to be sorted in alphabetical order. ASCII binary order matches alphabet order, iso and utf-8 don't, it's a compromise. French and spanish languages have letters alike the english w, I mean a sort of dual letter (thinking on w like a double v or double u). We don't have the w in spanish but we use it, because rae.es gives güisqui for the translation of whisky. But we have "ch" and "ll" and french has OE, which is worse because it's a letter represented by two; double letters will always need a special treatment when ordering lists. In spain we use to see "ch" between "cg" and "ci" instead of being after all "c". MySQL does a great job when it comes to order a list if you select the collation language, but this is not a runtime multi-language property we should use all the time.
As we have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first. Text files don't have a header to tell which charset was used to binary represent the writing. We have several kinds of text files since html, php and .po are also text files, but at least html and .po have headers to declare to the system the codeset it's written, php don't. When the file has a header to tell the codeset, the header declaration must match with the codeset used when the file was saved. When not, it's better to use utf-8 without BOM (Byte Order Mark). Using another iso than latin1 makes the file local, reducing interchange, since the codeset must be advised. As in C or ANSI C, the code must stay in the codeset boundaries to be read by the interpreter or compiler and this must be not changed.


Revision [1637]

Edited on 2009-10-12 14:23:05 by JavierDeLorenzo
Additions:
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each covering a few languages, and only one language could be installed; XP was sold in different languages, and only one language was bought at a time. This was not a problem, since we could change the GUI, Time Zone and Region, but the admin account and user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2 and Thunar, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then only in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
The problem, then, was using a language other than English with applications that didn't support XDG. Firefox 2 and Thunar assumed the desktop folder was called "Desktop", which was true only in English; in Spanish, "Desktop" is just one more ordinary folder, even though Thunar's glyph for it was the desktop's own icon.
We all know that a codeset, or charset, is the binary representation of a set of characters; we also know the 7-bit ASCII and that "A" is decimal 65. Before ASCII, each system had its own codeset; then ASCII came for information interchange. This must be understood as system information interchange, i.e., it was enough for systems to exchange source code and English text, because, when we think about letters, the English alphabet (the modern basic Latin one) has about 26, with no accents and the like. But we soon realise it's not enough for multi-language characters, with their letter combinations, acutes, umlauts, cedillas and so on. Then ANSI came to make use of the most significant 8th bit, which in ASCII could be used for parity or custom characters; in ANSI, an 8-bit code, it is used to add 128 more characters. But the languages of the planet are far too rich to fit in 8 bits; hence ISO-8859 came to embrace languages other than English. ISO charsets, like ANSI, are ASCII based, i.e., the same first 128 characters remain for compatibility. Characters are not letters but graphical representations, i.e., glyphs, and even the 16 iso sets that exist for several groups of languages are not enough for all languages, nor even for some individual ones, so the first criterion was to design ISO for information, not typography, which could leave some letters without a character of their own, such as the French "oe", the Dutch "ij" or the Spanish "ch" and "ll".
For all languages to be represented, a character from the upper 128 must remain undefined until the charset is declared; and if the charset changes, the writing and its relations change with it. Moreover, since some languages can use several ISO charsets, declaring the language is not enough. We must also take into account that the same language is not used the same way in different regions. For these reasons, the expression that defines a locale is the following: [language][_territory][.charset][@modifier]
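The locale expression just quoted can be taken apart mechanically. A hedged Python sketch (the regular expression is my own approximation of the pattern above, not a complete POSIX locale grammar):

```python
import re

# Approximate grammar for [language][_territory][.charset][@modifier],
# e.g. "es_ES.UTF-8@euro".
LOCALE_RE = re.compile(
    r"^(?P<language>[a-z]+)"
    r"(?:_(?P<territory>[A-Z]+))?"
    r"(?:\.(?P<charset>[\w-]+))?"
    r"(?:@(?P<modifier>\w+))?$"
)

def parse_locale(s):
    """Split a locale string into its four optional parts (None if absent)."""
    m = LOCALE_RE.match(s)
    return m.groupdict() if m else None

print(parse_locale("es_ES.UTF-8@euro"))
# {'language': 'es', 'territory': 'ES', 'charset': 'UTF-8', 'modifier': 'euro'}
```

Only the language part is mandatory, which is exactly why declaring the language alone underspecifies the charset.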
Then came both Unicode and the ISO-10646 Universal Character Set, in different flavors: two-byte Unicode (UTF-16), multibyte UTF-8, etc. Unicode accounts for all characters in all languages; UTF-8 helps solve the multi-language problem while introducing some issues, because it is variable-length multi-byte, more complex, less supported, and data grows in size. In Spanish, "a" and "á" are the same letter, but for the system they are different characters; recently we got "Á" too, since the rule that uppercase letters carry no acute was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65 (though at least they are case related). There are interesting discussions about the German ss and other cases. We all know this is a divine punishment for trying to build the Tower of Babel.
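The claims in this paragraph are easy to verify in any Unicode-aware language; a short Python check, for illustration only:

```python
# "A" is 65 in every ASCII-based set, including UTF-8.
assert ord("A") == 65
# "a", "á" and "Á" are three distinct code points...
print([ord(c) for c in "aáÁ"])   # [97, 225, 193]
# ...but they remain case related:
assert "á".upper() == "Á"
# UTF-8 is variable-length: ASCII stays one byte, "á" takes two,
# which is how "data grows in size".
print(len("A".encode("utf-8")), len("á".encode("utf-8")))  # 1 2
# The same "á" is the single byte 0xE1 in latin1: the raw bytes are
# ambiguous until the charset is declared.
assert "á".encode("latin-1") == b"\xe1"
assert "á".encode("utf-8") == b"\xc3\xa1"
# The German ss discussion in one line: ß uppercases to two letters.
print("ß".upper())  # SS
```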
About the last part, "locales do not solve ...", I agree. We remember the time when everything was written in uppercase; here in Spain the news media were worried about the "ñ". Nowadays the issue continues: what about spaces in URIs? Or: it is now possible to register such a domain name, but the goal is to use it. Here at this wiki we have an example, the page "TraductionEnFrançais"; it is a link to an International Domain Name (IDN), i.e., a utf-8 domain name. Human beings may guess what was written; we can read it and transcode it back to the original utf-8, because we know the issue is that it is being displayed as latin1, but my system can't guess it. Just think of Arabic right-to-left writing. But we don't need to go so far: I live in the Canary Islands, which belong to Spain, but the time is one hour earlier, and when a timestamp is required for authentication some systems don't like that. Here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone as Linux has. This has nothing to do with charsets, only locales.
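An IDN like the one behind "TraductionEnFrançais" has a defined ASCII wire form. A small Python sketch ("français.example" is a hypothetical domain, not the wiki's real address), using the stdlib idna codec:

```python
# On the wire, IDNs travel as Punycode-encoded ASCII labels ("xn--...").
ascii_form = "français.example".encode("idna")
print(ascii_form)                      # the ASCII-compatible ("xn--...") form
assert ascii_form.startswith(b"xn--")
# and the encoding round-trips back to the Unicode spelling:
assert ascii_form.decode("idna") == "français.example"
```

This is the transcoding a browser performs silently; a latin1 page showing the raw Unicode name mangles it, but the ASCII form survives anywhere.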
Deletions:
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, only one language was bought at time, this was not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux Apps. Until march 2008 this was a big surprise for many users, since many apps like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
El problema se presenta en las aplicaciones no Xdg y sólo en idiomas distintos del inglés. Firefox2 y Thunar creen que la carpeta que corresponde al escritorio es Desktop, lo cual es cierto sólo en inglés. En español, Desktop es una carpeta más, aunque Thunar la represente con el icono característico del escritorio.
We all know that a codeset or charset is the binary representation of a set of characters and we also know the 7 bits lenght ASCII and that "A" is "65 decimal". Before ASCII, each system had its own codeset; then ASCII came for information interchange, this was enough when we think about letters, since english alphabet is around 26 and it has no accents and the like, but we soon realize it's not enough for multi-language characters, with their letter combinations and posibilities; then ISO came for other languages than english, ISO charsets are ASCII based, the most significant bit that in ASCII we could use for parity, here in ISO, an 8 bits lenght code, it's used for adding 128 more characters. Because characters are not letters but graphical representations, this is not enough for all languages, so 16 iso sets exist for several groups of languages, all with the same first 128. For all languages to be represented, this means that a character from the 128 upper part is undefined until the charset is declared, and if the charset changes, the writing and its relations changes with it.
Then came Unicode with different flavors, two bytes unicode 16, multibyte utf-8, etc. Unicode takes account of all characters in all languages, UTF-8 comes in help to solve multi-language, while introducing some issues, because it's multi-byte, is more complex, less supported, and data grows in size. For spanish, "a" and "á" are the same letter but for the system they are different characters; recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded (but at least they are case related).
About the last part "locales do not solve ...", I agree. We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, what about spaces in uri? or, now it's possible to register such domain name but the goal is to use it, here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8 because we know the issue is that it's displayed as latin1, but my system can't guess it. Just need to think on arabic right to left writing. But we don't need to go so far. I live in the Canary Islands which belong to Spain but time is one hour less. When timestamp is required for authentication some systems don't like it. But here in the Canaries it would be desirable to have our own locale, or at least, that Windows had our Time Zone like Linux has. This has nothing to do with charsets but locales.


Revision [1577]

Edited on 2009-08-04 00:37:01 by PhilDaintree
Additions:
Then came Unicode, in different flavors: two-byte Unicode (UTF-16), multibyte utf-8, etc. Unicode accounts for all characters in all languages; UTF-8 helps solve the multi-language problem while introducing some issues, because it is multi-byte, more complex, less supported, and data grows in size. In Spanish, "a" and "á" are the same letter, but for the system they are different characters; recently we got "Á" too, since the rule that uppercase letters carry no accent was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65 (though at least they are case related).
I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, when it comes to languages, those in the iso-8859-1 (a.k.a. latin-1) charset. The important part is that I don't need to change codeset, and this is an advantage: systems may use every iso and also utf-8, but only utf-8 and one iso are installed, i.e., we must install an additional iso to switch between them, and doing so will make some characters in the upper 128 change.
(Phil: You can modify the timezone system parameter in config.php - the installer also allows this to be set as of 3.11 - independent of the locale selected. In webERP only LC_MESSAGES is affected by the selection of the locale - i.e. the words/characters in the interface)
Representing characters is not the only goal: we want lists to be sorted, which needs character comparison functions like "a < b", and we also want to use other string functions like length, position and case functions. I don't know Turkish, and a few days ago I discovered something Turkish speakers have always known: a Turkish letter is not compatible with the usual case functions. It is difficult to explain such a simple thing on a latin1 site like this one, only because it cannot display those Turkish letters, or the French OE ligature; the code points that correspond to them will be displayed following latin-1. Moreover, there are no HTML entities for the Turkish letters. But the case is that "I = uppercase(i)" returns false when iso-8859-9 is used.
(Phil: We do have some html entities in webERP and I would like to understand more about the best approach here too)
Because wE is written in English, it has very good, though not total, compatibility with systems in other languages; but this compatibility often leads English-speaking developers to bypass certain checks. Outside iso-8859-1, or the almost identical iso-8859-15, wE still needs some tracing and debugging. A lot of work is being done, and some other charsets have been proved to work, but not all. Some minor arrangements can be made by users, at least as far as php, html, Apache, MySQL and gettext are concerned. The most difficult part is pdf, and because a reporting tool is an essential part of a database application, not solving this issue makes the application almost unusable for a limited number of people whose language is very special from the system's point of view. (Phil: Yes indeed the major stumbling block!!)
Most wE files are saved as UTF-8 without BOM; all should be. You can check this and correct it if necessary; this is free software. This includes all files. Some PHP must be edited to work with Turkish and other languages. SQL scripts and MySQL must be checked, and PHP configured as follows, to support UTF-8:
(Phil: A superb start on the multi-language issues Javier)
Deletions:
Then came Unicode with different flavors, two bytes unicode 16, multibyte utf-8, etc. Unicode takes account of all characters in all languages, UTF-8 comes in help to solve multi-language, while introducing some issues, because it's multi-byte, is more complex, less supported, and data grows in size. For spanish, "a" and "á" are the same letter but for the system they are different characters; recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded (but at least they are case related). There are interesting discussions about the german ss and other cases. We all know this is a divine punishment for trying to make the Tower of Babel.
I consider english the system native language and spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, when it comes to languages, those in the iso-8859-1 a.k.a. latin-1 charset. The important part is that I don't need to change codeset and this is an advantage because systems may use every iso and also utf-8, but only utf-8 and one iso are installed; i.e., we must install aditional iso to change between them, and doing so will make some characters in the upper 128 to change.
Representing characters it's not the only goal, we want lists to be sortered what needs the use of character comparison functions like "a < b" and we also want to make use of other string functions like lenght, pos and case functions. I don't know turkish and a few days ago I discovered something that turkish have always known, a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a latin1 site like this, only because it cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be displayed following latin-1, moreover, there is no HTML entities for turkish, but the case is that "I = uppercase(i)" returns false when iso-8859-9 is used.
Because wE is written in english, it has a very good, although not total, compatibility with systems in other languages, but often, it happens that this compatibility makes english developers to bypass certain checks. Outside iso-8859-1 or the almost identical iso-8859-15, wE needs some tracing and debugging yet. A lot of work is being made, and some others charset have been proved to work, but not all. Some minor arrangements can be done by users, at least what php, html, Apache, MySQL and gettext is concerned to. The most difficult part is pdf, and because a reporting tool is an essential part of a database application, not solving this issue makes the application almost unusable for a limited number of people with a very special language from the system point of view.
Most wE files are saved utf-8 without BOM coded, all should be, you can check it and correct it if neccesary, this is free software. This includes all files. Some php must be edited to work with turkish and others. SQL scripts and MySQL must be checked and php configured this way to support utf-8:


Revision [1576]

Edited on 2009-07-31 06:32:52 by JavierDeLorenzo
Additions:
If we look at webERP translations we see a lot of confusion among translators, such as the file's own charset declaration or the use of html entities in .po files. I avoid entities, hard-coding and escapes every time I can. %% About entities: why should I write "Los &Aacute;ngeles" when it's possible to write "Los Ángeles"? %% %% About hard code, this is an example: S := 'MyFieldName = ''' + MyForm.MyEditControl.Text + ''''; %%
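The entity question is hard to show on a latin1 wiki, because the entity spelling and the literal spelling render identically; in code the equivalence is explicit. A short Python sketch (illustrative only; webERP itself is PHP, where htmlspecialchars plays the escape role):

```python
import html

# The HTML entity form and the literal form name the same character:
assert html.unescape("Los &Aacute;ngeles") == "Los Ángeles"
# Escaping the other way only touches markup-significant characters,
# so the accented letter passes through untouched:
print(html.escape("Los Ángeles & más"))  # Los Ángeles &amp; más
```

This is the argument for avoiding entities in .po files: once the file is honestly UTF-8, the literal character is both shorter and unambiguous.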
Deletions:
If we look at webERP translations we see a lot of confusion in translators like file itself charset declaration or the use of html entities in .po files. I avoid entities, hard code and escapes everytime I can. %% About entities, why should I write "Los Ángeles" when it's possible to write "Los Ángeles". %% %% About hard code, this is an example, S:= 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%


Revision [1575]

Edited on 2009-07-31 06:31:19 by JavierDeLorenzo
Additions:
If we look at webERP translations we see a lot of confusion among translators, such as the file's own charset declaration or the use of html entities in .po files. I avoid entities, hard-coding and escapes every time I can. %% About entities: why should I write "Los &Aacute;ngeles" when it's possible to write "Los Ángeles"? %% %% About hard code, this is an example: S := 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%
Deletions:
If we look at webERP translations we see a lot of confusion in translators like file itself charset declaration or the use of html entities in .po files. I avoid entities, hard code and escapes everytime I can. %% About entities, why should I write "Los Ángeles" when it's possible to write "Los Ángeles". About hard code, this is an example, S:= 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%


Revision [1574]

Edited on 2009-07-31 06:28:26 by JavierDeLorenzo
Additions:
**Internationalization and Localization**
Deletions:
lization and Localization**


Revision [1573]

Edited on 2009-07-31 06:27:04 by JavierDeLorenzo
Additions:
lization and Localization**
If we look at webERP translations we see a lot of confusion among translators, such as the file's own charset declaration or the use of html entities in .po files. I avoid entities, hard-coding and escapes every time I can. %% About entities: why should I write "Los &Aacute;ngeles" when it's possible to write "Los Ángeles"? About hard code, this is an example: S := 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%
Deletions:
**Internationalization and Localization**
If we look at webERP translations we see a lot of confusion in translators like file itself charset declaration or the use of html entities in .po files. I avoid entities, hard code and escapes everytime I can. %% About entities, why should I write %% "Los Ángeles" when it's possible to write "Los Ángeles". About hard code, this is an example %% S:= 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%


Revision [1572]

Edited on 2009-07-31 06:23:31 by JavierDeLorenzo
Additions:
If we look at webERP translations we see a lot of confusion among translators, such as the file's own charset declaration or the use of html entities in .po files. I avoid entities, hard-coding and escapes every time I can. %% About entities: why should I write %% "Los &Aacute;ngeles" when it's possible to write "Los Ángeles"? About hard code, this is an example %% S := 'My FieldName = ''' + MyForm.MyEditControl.Text + ''''; %%
Again, it's difficult to show the example, since hard code uses the same character combinations as the wiki. We see it all the time: special characters. They act like control signals, not data, and a special method is needed to show them as data.
Most wE files are saved as UTF-8 without BOM; all should be. You can check this and correct it if necessary; this is free software. This includes all files. Some PHP must be edited to work with Turkish and other languages. SQL scripts and MySQL must be checked, and PHP configured as follows, to support UTF-8:
To be continued ...
Deletions:
If we look at webERP translations we see a lot of confusion in translators like file itself charset declaration or the use of html entities in .po files. I avoid entities, hard code and escapes everytime I can. Why should I write %% "Los Ángeles" %% when it's possible to write %% "Los Ángeles" %%. About hard code, this is an example %% S:= 'EXPEDIENTE=''' + Form12.Edit1.Text + ''''; %%
again, it's difficult to show an example, since hard code uses the same character combination than wiki. We see it all the time, special characters. They act like control signals not data. It should be a convenient way to show them like data.
Most wE files are saved utf-8 without BOM coded, all should be, you can check it and correct it if neccesary, this is free software. This includes all files. Some php must be edited to work with turkish and maybe others. SQL scripts and MySQL must be checked and php configured this way to support utf-8:

MySQL
To be continued ...


Revision [1571]

Edited on 2009-07-31 06:14:33 by JavierDeLorenzo
Additions:
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each covering a few languages, and only one language could be installed; XP was sold in different languages, and only one language was bought at a time. This was not a problem, since we could change the GUI, Time Zone and Region, but the admin account and user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2 and Thunar, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
**The System · Codesets**
We all know that a codeset, or charset, is the binary representation of a set of characters; we also know the 7-bit ASCII and that "A" is decimal 65. Before ASCII, each system had its own codeset; then ASCII came for information interchange. This was enough when we think about letters, since the English alphabet has about 26 and no accents and the like, but we soon realise it's not enough for multi-language characters, with their letter combinations and possibilities. Then ISO came for languages other than English. ISO charsets are ASCII based: the most significant bit, which in ASCII we could use for parity, is used in ISO, an 8-bit code, to add 128 more characters. Because characters are not letters but graphical representations, this is not enough for all languages, so 16 iso sets exist for several groups of languages, all with the same first 128. For all languages to be represented, this means a character from the upper 128 is undefined until the charset is declared, and if the charset changes, the writing and its relations change with it.
Then came Unicode, in different flavors: two-byte Unicode (UTF-16), multibyte utf-8, etc. Unicode accounts for all characters in all languages; UTF-8 helps solve the multi-language problem while introducing some issues, because it is multi-byte, more complex, less supported, and data grows in size. In Spanish, "a" and "á" are the same letter, but for the system they are different characters; recently we got "Á" too, since the rule that uppercase letters carry no accent was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65 (though at least they are case related). There are interesting discussions about the German ss and other cases. We all know this is a divine punishment for trying to build the Tower of Babel.
I consider English the system's native language and Spanish the locale to change to; this is a good starting point, a case that seems to be in the easy group, I mean, among languages in the iso-8859-1 (a.k.a. latin-1) charset. The important part is that I don't need to change codesets, and this is an advantage: systems may use every ISO set and also UTF-8, but only UTF-8 and one ISO set are installed; i.e., we must install additional ISO sets to switch between them, and doing so will make some characters in the upper 128 change.
About the last part, "locales do not solve ...", I agree. We remember the time when everything was written in uppercase; here in Spain the media were worried about the "ñ". Nowadays the issue continues: what about spaces in URIs? Or, now it is possible to register such a domain name, but the goal is to use it. Here at this wiki we have an example, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN), i.e., a UTF-8 domain name. Human beings may guess what was written: we can read it and we can transcode it back to the original UTF-8, because we know the issue is that it is displayed as latin-1, but my system can't guess it. Just think of Arabic right-to-left writing. But we don't need to go so far. I live in the Canary Islands, which belong to Spain but are one hour behind. When a timestamp is required for authentication, some systems don't like that. Here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone like Linux has. This has nothing to do with charsets, but with locales.
Representing characters is not the only goal: we want lists to be sorted, which needs character comparison functions like "a < b", and we also want to use other string functions like length, pos and the case functions. I don't know Turkish, and a few days ago I discovered something Turkish speakers have always known: a Turkish letter that is not compatible with most case functions. It is difficult to explain such a simple thing in a latin-1 site like this one, only because it cannot display these Turkish letters or the French OE ligature: the code points that correspond to them will be displayed following latin-1, and moreover there are no HTML entities for the Turkish letters. But the case is that "I = uppercase(i)" returns false when iso-8859-9 is used.
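The Turkish case can be sketched in Python, whose default case mapping is locale-independent, so it follows the English rule rather than the Turkish one where the uppercase of "i" is the dotted "İ":

```python
# Default (locale-independent) casing: fine for English, wrong for Turkish.
print("i".upper())         # "I": the English result
print("i".upper() == "İ")  # False: the Turkish expectation fails
# The dedicated Turkish letters are separate code points:
print(hex(ord("ı")), hex(ord("İ")))  # dotless i (0x131) and dotted I (0x130)
```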
We don't have to travel to exotic countries to see issues when we want a list sorted in alphabetical order. ASCII binary order matches alphabet order; ISO and UTF-8 don't, it's a compromise. French and Spanish have letters like the English w, I mean a sort of dual letter (thinking of w as a double v or double u). We don't have the w in Spanish but we use it, because rae.es gives güisqui as the translation of whisky. But we have "ch" and "ll", and French has OE, which is worse because it is one letter represented by two; double letters will always need special treatment when ordering lists. In Spain we usually see "ch" between "cg" and "ci" instead of after all the "c" words. MySQL does a great job of ordering a list if you select the collation language, but this is not a runtime multi-language property we should use all the time.
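The sorting compromise shows up without any exotic language: by raw code-point order, an accented Spanish word falls after "z". A short Python sketch:

```python
words = ["zumo", "ábaco", "banana"]
print(sorted(words))  # ['banana', 'zumo', 'ábaco']: "á" (U+00E1) sorts after "z"
# Locale-aware collation (e.g. locale.strxfrm with an installed es_ES locale)
# would put "ábaco" first, which is what a MySQL collation does server-side.
```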
As we have learned, there is no plain text outside the boundaries of ASCII; as always happens, there is now more than one standard, and the system needs to know the code first. Plain text files have no header to tell which charset was used to represent the writing in binary. We have several kinds of text files, since HTML, PHP and .po files are also text files; but at least HTML and .po files have headers declaring to the system the codeset they are written in, while PHP files don't. When the file has a header declaring the codeset, the declaration must match the codeset used when the file was saved. When there is no header, it is better to use UTF-8 without BOM (Byte Order Mark). Using an ISO set other than latin-1 makes the file local and reduces interchange, since the codeset must be communicated along with the file. As in C or ANSI C, the code must stay within the codeset boundaries to be read by the interpreter or compiler, and this must not be changed.
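Since a plain text file carries no charset header, the closest thing to one is the optional UTF-8 BOM, which is exactly what the advice "UTF-8 without BOM" refers to. A minimal Python check (the function name and file paths are just for illustration):

```python
def has_utf8_bom(path):
    """Return True if the file starts with the UTF-8 Byte Order Mark."""
    with open(path, "rb") as f:
        return f.read(3) == b"\xef\xbb\xbf"
```

Editors that save "UTF-8 without BOM" simply omit those three leading bytes.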
So far, this seems enough introduction to the very hard goal of producing multi-language software, even just an English-French one. And we have not mentioned PDF yet. It's time to look at webERP. webERP has very good multi-language support based on gettext.
If we look at webERP translations we see a lot of confusion among translators, such as the file's own charset declaration or the use of HTML entities in .po files. I avoid entities, hard-coding and escapes every time I can. Why should I write %% "Los &Aacute;ngeles" %% when it is possible to write %% "Los Ángeles" %%? About hard-coding, this is an example: %% S:= 'EXPEDIENTE=''' + Form12.Edit1.Text + ''''; %%
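The entity problem can be shown in Python: gettext delivers .po strings literally, so an entity only turns back into the real letter where something interprets HTML, which is why writing the character itself is safer:

```python
import html

msg = "Los &Aacute;ngeles"   # what a confused translator might put in a .po file
print(msg)                   # shown literally outside HTML contexts
print(html.unescape(msg))    # "Los Ángeles": only after HTML unescaping
```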
Again, it is difficult to show an example, since hard-coded strings use the same character combinations as the wiki markup. We see it all the time: special characters acting like control signals, not data. There should be a convenient way to show them as data.
webERP's default language is en_GB, charset iso-8859-1, which supports a well-known good number of languages, some of them with well-known issues, but the program works perfectly well within the boundaries of iso-8859-1. This means multi-language support for all those languages. But there are many more languages or, to be more precise, a few more ISO sets.
Some people may want wE to work with a language or languages not in the latin-1 charset. Here we may ask ourselves whether those languages fit in one ISO charset or need more than one. The second case means very dynamic charset switching, and it is probably better to change to UTF-8, since this is what UTF-8 was designed for.
Because wE is written in English, it has very good, although not total, compatibility with systems in other languages; but often this compatibility leads English-speaking developers to bypass certain checks. Outside iso-8859-1, or the almost identical iso-8859-15, wE still needs some tracing and debugging. A lot of work is being done, and some other charsets have been proven to work, but not all. Some minor arrangements can be made by users, at least as far as PHP, HTML, Apache, MySQL and gettext are concerned. The most difficult part is PDF, and because a reporting tool is an essential part of a database application, not solving this issue makes the application almost unusable for a limited number of people with a very special language from the system's point of view.
Most wE files are saved UTF-8 without BOM; all should be. You can check and correct this if necessary, since this is free software. This includes all files. Some PHP must be edited to work with Turkish and maybe other languages. SQL scripts and MySQL must be checked, and PHP configured to support UTF-8 this way:
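As a hedged sketch of the kind of settings involved (exact names and values depend on your MySQL and PHP versions, and the database name here is hypothetical), the MySQL side usually means a UTF-8 connection and a UTF-8 schema:

```sql
-- Sketch: make client, connection and results use utf8 for this session,
-- and create the schema with a utf8 character set from the start.
SET NAMES 'utf8';
CREATE DATABASE weberp CHARACTER SET utf8 COLLATE utf8_general_ci;
```

On the PHP side the matching setting would be `default_charset = "UTF-8"` in php.ini, with `SET NAMES 'utf8'` sent right after connecting.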
Deletions:
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, one is bought, this is not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux Apps. Until march 2008 this was a big surprise for many users as many apps, like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
**The System**
We all know that a charset is the binary representation of a set of characters and we also know the 7 bits lenght ASCII and that "A" is "65 decimal". Characters are not letters but graphical representations; for spanish, "a" and "á" are the same letter. Recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded, but at least they are case related. I consider english the system native language and spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, when it comes to languages, those in the iso-8859-1 a.k.a. latin-1 charset. UTF-8 comes in help to solve some issues, while introducing others. Because it's multi-byte, is more complex, less supported, and data grows in size. Some things remain the same, like sorting lists.
About the last part "locales do not solve ...", I agree. We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, now it's possible to register such domain name but the goal is to use it; and what about spaces in uri. Just need to think on arabic right to left writing. But we don't need to go so far. I live in the Canary Islands which belong to Spain but time is one hour less. When timestamp is required for authentication some systems don't like it. But here in the Canaries it would be desirable to have our own locale, or at least, that Windows had our Time Zone like Linux has.
A few days ago I discovered a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a site like this, only because this page is latin-1 coded which cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be decoded following latin-1, so another letter would be displayed, moreover, there is no HTML entities for turkish. ISO charsets are ASCII based, the most significant bit that in ASCII we could use for parity, here in ISO, an 8 bits lenght code, it's used for adding 128 more characters. This seems enough when we think about letters, since english alphabet is around 26 and they have no accents and the like. But we soon realize it's not enough for multi-language characters, with their combinations and posibilities. There are interesting discussions about the german ss and other cases. We all know this is a divine punishment for trying to make the Tower of Babel.
When we want a list to be sorted in alphabetical order, again, we don't have to travel to exotic countries to see some issues. French and spanish languages have letters alike the english w, I mean a sort of dual letter (thinking on w like a double v or double u). We don't have the w in spanish but we use it, because rae.es gives something alike güisqui for the translation of whisky. But we have "ch" and "ll" and french has OE, which is worse because it's a letter represented by two. MySQL does a great job when it comes to order a list if you select the collation language, but this is not a runtime multi-language property we should use all the time.
Here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first. Text files don't have a header to tell which charset was used to binary represent the writing.
More examples. If we look at webERP translations we see a lot of confusion in translators like file itself charset declaration or the use of html entities in .po files. I avoid entities, hard code and escapes everytime I can. Why should I write ""Los &_Aacute;ngeles"" (forget the underscore) when it's possible to write "Los Ángeles". About hard code, again, it's difficult to show an example, since hard code uses the same character combination than wiki. We see it all the time, special characters. They act like control signals not data and should be a way to show them like data.
To here, this seems enough introduction to the very hard goal of accomplishing a multi-language software, even an english-french. And we did not mention pdf yet. It's time to go with webERP. webERP has a very good multi-language support based on gettext. webERP default language is en_GB charset iso-9985-1 what supports a well-known good amount of languages, some of them with well-known issues, but there are much more languages. Because wE is written in english, it has a very good compatibility with systems in other languages, but it often happens that this compatibility makes english developers to bypass certain checks, as the first 128 characters are the same for both utf-8 and the 16 iso sets. I wrote above that text files don't have a header to tell the code used to binary represent the writing, the same occurs with php files; this means that a character between the 128 from the upper part is undefined indeed until the charset is declared, and if the charset changes, the writing and its relations changes with it. A relation between characters may be (a < b) or I=uppercase(i), the second is false when iso-8859-9 is used. HTML and .po files do have a header to tell the codeset, in this case, the header must match with the codeset used when the file was saved. As in C or ANSI C, the code must stay in the codeset boundaries to be read by the interpreter or compiler and this must be not changed. Then, for a multi-language software this should be utf-8, since english in latin-1 has no total compatibility. I.e, all wE files should be coded and saved as utf-8 like most are. This includes txt, html, php, and .po files. Then, html and .po headers must match, declaring utf-8 as well. Now, MySQL, php, Apache and pdf must be properly configured.


Revision [1568]

Edited on 2009-07-31 03:14:45 by JavierDeLorenzo
Additions:
**Internationalization and Localization**
**The Goals**
Only a few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, one is bought, this is not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux Apps. Until march 2008 this was a big surprise for many users as many apps, like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
**The System**
We all know that a charset is the binary representation of a set of characters and we also know the 7 bits lenght ASCII and that "A" is "65 decimal". Characters are not letters but graphical representations; for spanish, "a" and "á" are the same letter. Recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded, but at least they are case related. I consider english the system native language and spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, when it comes to languages, those in the iso-8859-1 a.k.a. latin-1 charset. UTF-8 comes in help to solve some issues, while introducing others. Because it's multi-byte, is more complex, less supported, and data grows in size. Some things remain the same, like sorting lists.
**The Issues**
About the last part "locales do not solve ...", I agree. We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, now it's possible to register such domain name but the goal is to use it; and what about spaces in uri. Just need to think on arabic right to left writing. But we don't need to go so far. I live in the Canary Islands which belong to Spain but time is one hour less. When timestamp is required for authentication some systems don't like it. But here in the Canaries it would be desirable to have our own locale, or at least, that Windows had our Time Zone like Linux has.
A few days ago I discovered a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a site like this, only because this page is latin-1 coded which cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be decoded following latin-1, so another letter would be displayed, moreover, there is no HTML entities for turkish. ISO charsets are ASCII based, the most significant bit that in ASCII we could use for parity, here in ISO, an 8 bits lenght code, it's used for adding 128 more characters. This seems enough when we think about letters, since english alphabet is around 26 and they have no accents and the like. But we soon realize it's not enough for multi-language characters, with their combinations and posibilities. There are interesting discussions about the german ss and other cases. We all know this is a divine punishment for trying to make the Tower of Babel.
When we want a list to be sorted in alphabetical order, again, we don't have to travel to exotic countries to see some issues. French and spanish languages have letters alike the english w, I mean a sort of dual letter (thinking on w like a double v or double u). We don't have the w in spanish but we use it, because rae.es gives something alike güisqui for the translation of whisky. But we have "ch" and "ll" and french has OE, which is worse because it's a letter represented by two. MySQL does a great job when it comes to order a list if you select the collation language, but this is not a runtime multi-language property we should use all the time.
**Text Files**
Here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first. Text files don't have a header to tell which charset was used to binary represent the writing.
**webERP**
Javier De Lorenzo-Cáceres Cruz
Deletions:
Few software can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, one is bought, this is not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for Linux and Linux apps i18n. Until march 2008 this was a big surprise for many users as many apps, like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
About the last part "locales do not solve ...", I agree. Just need to think on arabic right to left writing. But we don't need to go so far. I live in the Canary Islands which belong to Spain but time is one hour less. When timestamp is required for authentication some systems don't like it. I consider english the system native language and spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, when it comes to languages, those in the iso-8859-1 a.k.a. latin-1 charset. But here in the Canaries it would be desirable to have our own locale, or at least, that Windows had our Time Zone like Linux has.
We all know that a charset is the binary representation of a set of characters and also know the 7 bits lenght ASCII and that "A" is "65 decimal". We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, now it's possible to register such domain name but the goal is to use it; and what about spaces in uri. Here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first. Text files don't have a header to tell which charset was used to binary represent the writing.
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded, but al least they are case related. A few days ago I discovered a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a site like this, only because this page is latin-1 coded which cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be decoded following latin-1, so another letter would be displayed, moreover, there is no HTML entities for turkish. ISO charsets are ASCII based, the most significant bit that in ASCII we could use for parity, here in ISO, an 8 bits lenght code, it's used for adding 128 more characters. This seems enough when we think about letters, since english alphabet is around 26 and they have no accents and the like. But we soon realize it's not enough for multi-language characters, with their combinations and posibilities. There are interesting discussions about the german ss and other cases. We all know this is a divine punishment for trying to make the Tower of Babel.
UTF-8 comes in help to solve some issues, while introducing others. Because it's multi-byte, is more complex, less supported, and data grows in size. Some things remain the same. When we want a list to be sorted in alphabetical order, again, we don't have to travel to exotic countries to see some issues. French and spanish languages have letters alike the english w, I mean a sort of dual letter (thinking on w like a double v or double u). We don't have the w in spanish but we use it, because rae.es gives something alike güisqui for the translation of whisky. But we have "ch" and "ll" and french has OE, which is worse because it's a letter represented by two. MySQL does a great job when it comes to order a list if you select the collation language, but this is not a runtime multi-language property we should use all the time.
Javvier De Lorenzo-Cáceres Cruz


Revision [1567]

Edited on 2009-07-31 00:36:23 by JavierDeLorenzo
Additions:
Few software can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, one is bought, this is not a problem since we may change GUI, Time Zone and Region but admin account and users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for Linux and Linux apps i18n. Until march 2008 this was a big surprise for many users as many apps, like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
XDG saves defaults paths in english for "Desktop" or "Downloads" in a file named "usr-dirs.defaults" for each user, while in "locale" saves the default language, i.e., the language selected during system installation or later in System > Language Support, at last, at login time, the system translates paths to the user's language and save them in "user-dirs.dirs" in the user's home folder. This customizes language and path for personal folders.
We all know that a charset is the binary representation of a set of characters and also know the 7 bits lenght ASCII and that "A" is "65 decimal". We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, now it's possible to register such domain name but the goal is to use it; and what about spaces in uri. Here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first. Text files don't have a header to tell which charset was used to binary represent the writing.
To here, this seems enough introduction to the very hard goal of accomplishing a multi-language software, even an english-french. And we did not mention pdf yet. It's time to go with webERP. webERP has a very good multi-language support based on gettext. webERP default language is en_GB charset iso-9985-1 what supports a well-known good amount of languages, some of them with well-known issues, but there are much more languages. Because wE is written in english, it has a very good compatibility with systems in other languages, but it often happens that this compatibility makes english developers to bypass certain checks, as the first 128 characters are the same for both utf-8 and the 16 iso sets. I wrote above that text files don't have a header to tell the code used to binary represent the writing, the same occurs with php files; this means that a character between the 128 from the upper part is undefined indeed until the charset is declared, and if the charset changes, the writing and its relations changes with it. A relation between characters may be (a < b) or I=uppercase(i), the second is false when iso-8859-9 is used. HTML and .po files do have a header to tell the codeset, in this case, the header must match with the codeset used when the file was saved. As in C or ANSI C, the code must stay in the codeset boundaries to be read by the interpreter or compiler and this must be not changed. Then, for a multi-language software this should be utf-8, since english in latin-1 has no total compatibility. I.e, all wE files should be coded and saved as utf-8 like most are. This includes txt, html, php, and .po files. Then, html and .po headers must match, declaring utf-8 as well. Now, MySQL, php, Apache and pdf must be properly configured.
MySQL
Deletions:
Few software can change their language at run time. When I bought Acrobat Standard it came with two CDs, each one for a few languages, only one language may be installed; XP was sold in different languages, one is bought, this is not a problem since we may change GUI, Time Zone and Region but users folders names like "Desktop" remain unchanged. Changing users folders names and more advanced things is what XDG does in Linux. XDG is the technology that is going to be the standard for Linux and Linux apps i18n. Until march 2008 this was a big surprise for many users since many apps, like Firefox 2 and Thunar, were not XDG compliant yet, users didn't understand the system behaviour. Information about XDG were then in pages like basedir-spec-0.6.html, in folder basedir-spec, in http://standards.freedesktop.org/ and also Xdg-Utils, a proyect from http://portland.freedesktop.org.
XDG saves defaults paths in english for "Desktop" or "Downloads" in a file named "usr-dirs.defaults" for each user, while in "locale" saves the default language, i.e., the language selected during system installation or later in System > Language Support, at last, at login time, the system translates paths to the user's language and save them in "user-dirs.dirs" for that user in his home folder. This customizes language and path for personal folders.
We all know that a charset is the binary representation of a set of characters and also know the 7 bits lenght ASCII and that "A" is "65 decimal". We remember the time when everything was written uppercase; here in spain media news were worried about the "ñ". Nowadays the issue continues, I'm not sure if it's possible to register such domain name already and to use it, and what about spaces in uri. Here at this wiki we have an example about it, the page "TraductionEnFrançais"; this is a link to an International Domain Name (IDN), i.e., an utf-8 domain name; human beings may guess what it was written, we can read it and we can transcode it to the original utf-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII, as always happens, now there are more than only one standard, and system needs to know the code first.
To here, this seems enough introduction to the very hard goal of accomplishing a multi-language software, even an english-french. And we did not mention pdf yet. It's time to go with webERP. webERP has a very good multi-language support based on gettext. webERP default language is en_GB charset is iso-9985-1 a.k.a. latin-1 what supports a well-known good amount of languages, some of them with well-known issues. Because wE is written in english, it has a very good compatibility with systems in other languages, but it happens often that this compatibility makes english developers to bypass certain checks. All wE files should be utf-8 and most are, but not all. You, the reader, probably are missing somethig here, aren't you? I'll explain it in detail. A developer write code, this writing is then binary represented and saved in a file; this file is then read by an interpreter or a compiler. The file should be utf-8 coded. Now, inside the file, the developer wrote someting like charset=iso-8859-1


Revision [1566]

Edited on 2009-07-30 23:29:19 by JavierDeLorenzo
Additions:
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently, we have "Á" too, since the rule that uppercase don't get accent was deprecated. But obviously, none of "a", "á" and "Á" are "65 decimal" coded, but al least they are case related. A few days ago I discovered a turkish letter to be not compatible with major case functions. It is difficult to explain such easy thing in a site like this, only because this page is latin-1 coded which cannot display these turkish letters or the french OE ligature. The "code point" that correspond to them will be decoded following latin-1, so another letter would be displayed, moreover, there is no HTML entities for turkish. ISO charsets are ASCII based, the most significant bit that in ASCII we could use for parity, here in ISO, an 8 bits lenght code, it's used for adding 128 more characters. This seems enough when we think about letters, since english alphabet is around 26 and they have no accents and the like. But we soon realize it's not enough for multi-language characters, with their combinations and posibilities. There are interesting discussions about the german ss and other cases. We all know this is a divine punishment for trying to make the Tower of Babel.
So far, this seems introduction enough to the very hard goal of building multi-language software, even just English-French. And we have not mentioned PDF yet. It's time to move on to webERP. webERP has very good multi-language support based on gettext. webERP's default language is en_GB and its charset is iso-8859-1, a.k.a. latin-1, which supports a good number of languages, some of them with well-known issues. Because webERP is written in English, it is very compatible with systems in other languages, but this compatibility often leads English-speaking developers to bypass certain checks. All webERP files should be UTF-8, and most are, but not all. You, the reader, are probably missing something here, aren't you? I'll explain it in detail. A developer writes code; this writing is represented in binary and saved in a file; the file is then read by an interpreter or a compiler. The file should be UTF-8 encoded. And yet, inside that file, the developer wrote something like charset=iso-8859-1.
Deletions:
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take accents was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65, though at least they are case related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is latin-1 encoded, which cannot display those Turkish letters. The code points that correspond to them would be decoded as latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII based: the most significant bit, which in ASCII we could use for parity, is used in ISO's 8-bit codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has around 26, with no accents and the like. But we soon realize it is not enough for multi-language characters, with their combinations and possibilities. There are interesting discussions about the German ss and other cases. We all know this is a divine punishment for trying to build the Tower of Babel.
So far, this seems introduction enough to the very hard goal of building multi-language software, even just English-French. And we have not mentioned PDF yet.


Revision [1565]

Edited on 2009-07-30 22:48:04 by JavierDeLorenzo
Additions:
Few programs can change their language at run time. When I bought Acrobat Standard it came with two CDs, each for a few languages, and only one language could be installed. XP was sold in different languages, and you bought one; this was not a problem, since we could change the GUI, Time Zone and Region, but user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2 and Thunar, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
XDG saves default paths in English, for "Desktop" or "Downloads", in a file named "user-dirs.defaults" for each user, while "locale" saves the default language, i.e. the language selected during system installation or later in System > Language Support. Finally, at login time, the system translates the paths into the user's language and saves them in "user-dirs.dirs" for that user in his home folder. This customizes the language and path for personal folders.
About the last part, "locales do not solve ...": I agree. One only needs to think of Arabic's right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but are one hour behind. When a timestamp is required for authentication, some systems don't like that. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, among languages, those in the iso-8859-1 a.k.a. latin-1 charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone, as Linux does.
We all know that a charset is the binary representation of a set of characters; we also know the 7-bit ASCII code and that "A" is decimal 65. We remember the time when everything was written in uppercase; here in Spain the media were worried about the "ñ". Nowadays the issue continues: I'm not sure whether it is already possible to register and use such a domain name, and what about spaces in URIs? Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN), i.e. a UTF-8 domain name. Human beings can guess what was written, we can read it and transcode it back to the original UTF-8, but my system can't. As I have learned, there is no plain text outside the boundaries of ASCII; as always happens, there is now more than one standard, and the system needs to know the encoding first.
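The transcoding step behind an IDN can be shown with Python's standard idna codec: the Unicode label is what humans read, while DNS actually resolves an ASCII "xn--" punycode form. The label below is an illustrative word, not a registered domain:

```python
label = "français"                  # what a human reads in the address bar

# DNS only carries ASCII, so the label travels in punycode form.
ascii_form = label.encode("idna")
assert ascii_form.startswith(b"xn--")

# The transcoding is lossless: the original label comes back.
assert ascii_form.decode("idna") == label
```

This is exactly the "we can transcode it, but my system can't" situation: a browser or resolver without IDN support sees only the unreadable xn-- form.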
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take accents was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65, though at least they are case related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is latin-1 encoded, which cannot display those Turkish letters. The code points that correspond to them would be decoded as latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII based: the most significant bit, which in ASCII we could use for parity, is used in ISO's 8-bit codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has around 26, with no accents and the like. But we soon realize it is not enough for multi-language characters, with their combinations and possibilities. There are interesting discussions about the German ss and other cases. We all know this is a divine punishment for trying to build the Tower of Babel.
More examples. If we look at webERP translations we see a lot of confusion among translators, like the file's own charset declaration or the use of HTML entities in .po files. I avoid entities, hard-coding and escapes every time I can. Why should I write ""Los &_Aacute;ngeles"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters. They act like control signals, not data, and there should be a way to show them as data.
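The entity-versus-literal point can be checked directly: once the file is UTF-8, the entity and the literal spell the same string, so the readable literal wins. A quick check with Python's stdlib html module:

```python
import html

# The entity form and the literal form denote exactly the same text.
assert html.unescape("Los &Aacute;ngeles") == "Los Ángeles"

# Escaping in the other direction only touches &, <, > and quotes;
# named entities like &Aacute; come from old tools, not from any need.
assert html.escape("Los Ángeles") == "Los Ángeles"
```

In other words, nothing in HTML forces a translator to type &Aacute;; it only makes the .po file harder to read and to proofread.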
UTF-8 comes to help solve some issues, while introducing others. Because it is multi-byte, it is more complex and less widely supported, and data grows in size. Some things remain the same. When we want a list sorted in alphabetical order, again, we don't have to travel to exotic countries to see problems. French and Spanish have letters like the English w, I mean a sort of dual letter (thinking of w as a double v or double u). We don't have the w in Spanish, but we use it, because rae.es gives something like "güisqui" as the translation of whisky. But we have "ch" and "ll", and French has OE, which is worse because it is one letter represented by two. MySQL does a great job of ordering a list if you select the collation language, but this is not a run-time multi-language property we should use all the time.
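The collation problem is easy to demonstrate: a plain sort compares code points, not alphabets. A Spanish sketch (locale-aware sorting would need locale.strxfrm with an installed es_ES locale, or a library such as PyICU, so that part is only shown in a comment):

```python
words = ["zorro", "ñu", "coche"]

# A plain sort compares code points: ñ is U+00F1, z is U+007A,
# so "ñu" lands after "zorro" - wrong for any Spanish reader.
assert sorted(words) == ["coche", "zorro", "ñu"]

# A locale-aware key fixes it, *if* the locale is installed:
#   import locale
#   locale.setlocale(locale.LC_COLLATE, "es_ES.UTF-8")
#   sorted(words, key=locale.strxfrm)   # "ñu" before "zorro"
```

This is the same job MySQL's collations do server-side; the point above is that picking one collation is a configuration choice, not a run-time multi-language feature.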
Javier De Lorenzo-Cáceres Cruz
aese, aplicaciones software, s.l.
Sta. Cruz de Tenerife, Canarias.
Deletions:
There are few apps capable of changing their language at run time. When I bought Acrobat Standard it came with two CDs, each for a few languages, and only one language could be installed. XP was sold in different languages, and you bought one; this was not a problem, since we could change the GUI, Time Zone and Region, but user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
XDG saves default paths in English, for "Desktop" or "Downloads", in a file named "user-dirs.defaults" for each user, while "locale" saves the default language, the one selected during system installation or later in System > Language Support. Finally, at login, the system translates the paths into the user's language and saves them in "user-dirs.dirs" for that user. This customizes the language and path for personal folders.
Javier De Lorenzo-Cáceres Cruz aese, aplicaciones software, s.l. Sta. Cruz de Tenerife, Canarias.
------------------
Let's begin with the last part, "locales do not solve ...": I agree. One only needs to think of Arabic's right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but are one hour behind. When a timestamp is required for authentication, some systems don't like that. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, among languages, those in the iso-8859-1 a.k.a. latin-1 charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone, as Linux does.
We all know that a charset is the binary representation of a set of characters; we also know the 7-bit ASCII code and that "A" is decimal 65. We remember the time when everything was written in uppercase; here in Spain the news were worried about the "ñ". Nowadays the issue continues: I'm not sure whether it is already possible to register and use such a domain name, and what about spaces in URIs? Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN). Human beings can guess what was written, we can read it and transcode it back to the original UTF-8, but my system can't. As always happens, there is more than one standard.
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take accents was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65, though at least they are case related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is latin-1 encoded, which cannot display those Turkish letters. The code points that correspond to them would be decoded as latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII based: the most significant bit, which in ASCII we could use for parity, is used in ISO's 8-bit codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26, with no accents and the like. But we soon realize it is not enough for multi-language characters, with their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los &_Aacute;ngeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.
UTF-8 comes to help solve some issues, while introducing others. Some things remain the same. When we want a list sorted in alphabetical order, again, we don't have to travel to exotic countries. French and Spanish have letters like the English w, I mean a sort of dual letter (thinking of w as a double v or double u). We don't have the w in Spanish, but we use it, because rae.es gives something like "güisqui" as the translation of whisky. But we have "ch" and "ll", and French has OE, which is worse. MySQL does a great job of ordering a list if you select the collation language, but this is not a run-time multi-language property we should use all the time.
But what about the first part: what is i18n, really? What are the major software companies doing?


Revision [1564]

Edited on 2009-07-30 22:19:24 by JavierDeLorenzo
Additions:
There are few apps capable of changing their language at run time. When I bought Acrobat Standard it came with two CDs, each for a few languages, and only one language could be installed. XP was sold in different languages, and you bought one; this was not a problem, since we could change the GUI, Time Zone and Region, but user folder names like "Desktop" remained unchanged. Changing user folder names, and more advanced things, is what XDG does in Linux. XDG is the technology that is going to be the standard for i18n in Linux and Linux apps. Until March 2008 this was a big surprise for many users: since many apps, like Firefox 2, were not yet XDG compliant, users didn't understand the system behaviour. Information about XDG was then in pages like basedir-spec-0.6.html, in the basedir-spec folder at http://standards.freedesktop.org/, and in Xdg-Utils, a project from http://portland.freedesktop.org.
XDG saves default paths in English, for "Desktop" or "Downloads", in a file named "user-dirs.defaults" for each user, while "locale" saves the default language, the one selected during system installation or later in System > Language Support. Finally, at login, the system translates the paths into the user's language and saves them in "user-dirs.dirs" for that user. This customizes the language and path for personal folders.
The problem appears in non-XDG applications, and only in languages other than English. Firefox 2 and Thunar believe that the folder corresponding to the desktop is Desktop, which is true only in English. In Spanish, Desktop is just another folder, even though Thunar shows it with the characteristic desktop icon.
Javier De Lorenzo-Cáceres Cruz aese, aplicaciones software, s.l. Sta. Cruz de Tenerife, Canarias.
------------------
But what about the first part: what is i18n, really? What are the major software companies doing?
Deletions:
But what about the first part: what is i18n, really? What are the major software companies doing? When I bought Acrobat Standard it came with two CDs, each for a few languages, and only one could be installed; XP was sold in different languages, one is bought, this is not


Revision [1563]

Edited on 2009-07-30 21:35:57 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los &_Aacute;ngeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los Ángeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.


Revision [1562]

Edited on 2009-07-30 13:15:40 by JavierDeLorenzo
Additions:
Let's begin with the last part, "locales do not solve ...": I agree. One only needs to think of Arabic's right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but are one hour behind. When a timestamp is required for authentication, some systems don't like that. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, among languages, those in the iso-8859-1 a.k.a. latin-1 charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone, as Linux does.
So far, this seems introduction enough to the very hard goal of building multi-language software, even just English-French. And we have not mentioned PDF yet.
But what about the first part: what is i18n, really? What are the major software companies doing? When I bought Acrobat Standard it came with two CDs, each for a few languages, and only one could be installed; XP was sold in different languages, one is bought, this is not
Deletions:
I agree. One only needs to think of Arabic's right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but are one hour behind. When a timestamp is required for authentication, some systems don't like that. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group, I mean, among languages, those in the iso-8859-1 a.k.a. latin-1 charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our Time Zone, as Linux does.
So far, this seems introduction enough to the very hard goal of building multi-language software, even just English-French. And we have not mentioned PDF yet. To fight the problem, al-Khwarizmi shows us the way.


Revision [1561]

Edited on 2009-07-30 11:44:55 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los Ángeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los &_Aacute;ngeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.


Revision [1560]

Edited on 2009-07-30 11:44:31 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ""<pre>Los &_Aacute;ngeles</pre>"" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los &_Aacute;ngeles" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.


Revision [1559]

Edited on 2009-07-30 07:05:55 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los &_Aacute;ngeles" (forget the underscore) when it's possible to write "Los Ángeles"? About hard-coding, again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki. We see it all the time: special characters.
UTF-8 comes to help solve some issues, while introducing others. Some things remain the same. When we want a list sorted in alphabetical order, again, we don't have to travel to exotic countries. French and Spanish have letters like the English w, I mean a sort of dual letter (thinking of w as a double v or double u). We don't have the w in Spanish, but we use it, because rae.es gives something like "güisqui" as the translation of whisky. But we have "ch" and "ll", and French has OE, which is worse. MySQL does a great job of ordering a list if you select the collation language, but this is not a run-time multi-language property we should use all the time.
So far, this seems introduction enough to the very hard goal of building multi-language software, even just English-French. And we have not mentioned PDF yet. To fight the problem, al-Khwarizmi shows us the way.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los &_Aacute;ngeles" (forget the underscore) when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.


Revision [1558]

Edited on 2009-07-30 06:38:19 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los &_Aacute;ngeles" (forget the underscore) when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ##"Los Ángeles"## when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.


Revision [1557]

Edited on 2009-07-30 06:36:51 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write ##"Los Ángeles"## when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.


Revision [1556]

Edited on 2009-07-30 06:36:13 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? Again, it's difficult to show an example, since hard-coded strings use the same character combinations as the wiki.
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ## S:= 'MyFieldName=''' + MyForm.MyEdit.Text + ''''; ## after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg! It happened again


Revision [1555]

Edited on 2009-07-30 06:32:33 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ## S:= 'MyFieldName=''' + MyForm.MyEdit.Text + ''''; ## after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg! It happened again
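The quote-doubling in the Pascal snippet above is exactly the kind of hard-coded escaping that parameterized queries make unnecessary. A sketch with Python's stdlib sqlite3 module; the table and field names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (MyFieldName TEXT)")

# The driver handles the apostrophe in the data; no '''' doubling needed.
conn.execute("INSERT INTO t VALUES (?)", ("O'Brien",))

user_input = "O'Brien"   # the string the user entered, quotes and all
rows = conn.execute(
    "SELECT * FROM t WHERE MyFieldName = ?", (user_input,)
).fetchall()
assert rows == [("O'Brien",)]
```

With placeholders, the SQL text never contains user data, so there is nothing to escape and nothing for a wiki, a charset, or an attacker to mangle.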
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: <<##S:= 'MyFieldName=''' + MyForm.MyEdit.Text + '''';##<< after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg! It happened again


Revision [1554]

Edited on 2009-07-30 06:31:42 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: <<##S:= 'MyFieldName=''' + MyForm.MyEdit.Text + '''';##<< after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg! It happened again
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ##S:= 'MyFieldName=''' + MyForm.MyEdit.Text + '''';## after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg!


Revision [1553]

Edited on 2009-07-30 06:25:29 by JavierDeLorenzo
Additions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ##S:= 'MyFieldName=''' + MyForm.MyEdit.Text + '''';## after being parsed, the filter string looks normal to the DB engine: ##MyFieldName = 'The string that user entered'##. aggg!
Deletions:
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ## S:= 'MyFieldName=''' + MyForm.MyEdit.Text + ''''; ## after being parsed, the filter string looks normal to the DB engine: ## MyFieldName = 'The string that user entered' ##. aggg!


Revision [1552]

Edited on 2009-07-30 06:24:33 by JavierDeLorenzo
Additions:
We all know that a charset is the binary representation of a set of characters; we also know the 7-bit ASCII code and that "A" is decimal 65. We remember the time when everything was written in uppercase; here in Spain the news were worried about the "ñ". Nowadays the issue continues: I'm not sure whether it is already possible to register and use such a domain name, and what about spaces in URIs? Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN). Human beings can guess what was written, we can read it and transcode it back to the original UTF-8, but my system can't. As always happens, there is more than one standard.
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take accents was deprecated. But obviously none of "a", "á" and "Á" is coded as decimal 65, though at least they are case related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is latin-1 encoded, which cannot display those Turkish letters. The code points that correspond to them would be decoded as latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII based: the most significant bit, which in ASCII we could use for parity, is used in ISO's 8-bit codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26, with no accents and the like. But we soon realize it is not enough for multi-language characters, with their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
More examples. If we look at webERP translations we see a lot of confusion, like the file's own charset declaration or the use of HTML entities. I avoid entities, hard-coding and escapes every time I can. Why should I write "Los Ángeles" when it's possible to write "Los Ángeles"? To show an example of hard-coding, here is a very simple one in Pascal: a string S is given a value to be used later as a local DB filter: ## S:= 'MyFieldName=''' + MyForm.MyEdit.Text + ''''; ## after being parsed, the filter string looks normal to the DB engine: ## MyFieldName = 'The string that user entered' ##. aggg!
Deletions:
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register and use such a domain name, and what about spaces in a URI? Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN). Human beings may guess what was written, we can read it and transcode it back to the original UTF-8, but my system can't.
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated. Obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.


Revision [1551]

Edited on 2009-07-30 05:47:05 by JavierDeLorenzo
Additions:
I agree. One just needs to think of Arabic right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but where the time is one hour earlier. When a timestamp is required for authentication, some systems don't like it. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group: I mean, when it comes to languages, those in the ISO-8859-1 (a.k.a. Latin-1) charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our time zone like Linux has.
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register and use such a domain name, and what about spaces in a URI? Here at this wiki we have an example of it, the page "TraductionEnFrançais"; this is a link to an Internationalized Domain Name (IDN). Human beings may guess what was written, we can read it and transcode it back to the original UTF-8, but my system can't.
Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated. Obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
Deletions:
I agree. One just needs to think of Arabic right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but where the time is one hour earlier. When a timestamp is required for authentication, some systems don't like it. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group: I mean, when it comes to languages, those in the ISO-8859-1 (a.k.a. Latin-1) charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our time zone like Linux does.
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page TraductionEnFrançais∞. Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.


Revision [1550]

Edited on 2009-07-30 05:31:06 by JavierDeLorenzo
Additions:
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page TraductionEnFrançais∞. Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
Deletions:
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page TraductionEnFrançais. Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.


Revision [1549]

Edited on 2009-07-30 05:30:20 by JavierDeLorenzo
Additions:
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page TraductionEnFrançais. Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
Deletions:
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page "TraductionEnFran...". Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.


Revision [1548]

Edited on 2009-07-30 05:28:06 by JavierDeLorenzo
Additions:
"A concept related to internationalization is localization (L10N), which refers to the process of establishing information within a computer system for each combination of native language, cultural data, and coded character set (codeset). A locale is a database that provides information for a unique combination of these three components. However, locales do not solve all of the problems that localization must address. Many native languages require additional support in the form of language-specific print filters, fonts, codeset converters, character input methods, and other kinds of specialized software."
Deletions:
"A concept related to internationalization is localization (L10N), which refers to the process of establishing information within a computer system for each combination of native language, cultural data, and coded character set (codeset). A locale is a database that provides information for a unique combination of these three components. However, locales do not solve all of the problems that localization must address. Many native languages require additional support in the form of language-specific print filters, fonts, codeset converters, character input methods, and other kinds of specialized software." ---
---
---
---


Revision [1547]

Edited on 2009-07-30 05:26:51 by JavierDeLorenzo

No Differences

Revision [1546]

Edited on 2009-07-30 05:26:06 by JavierDeLorenzo
Additions:
"Internationalization refers to the process of developing programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In other words, internationalization refers to the availability and use of interfaces that let programs modify their behavior at run time for operation in a specific language environment. The abbreviation I18N is often used to stand for internationalization, as there are 18 characters between the beginning "I" and the ending "N" of that word."
I agree. One just needs to think of Arabic right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but where the time is one hour earlier. When a timestamp is required for authentication, some systems don't like it. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group: I mean, when it comes to languages, those in the ISO-8859-1 (a.k.a. Latin-1) charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our time zone like Linux does.
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page "TraductionEnFran...". Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel.
Deletions:
"Internationalization refers to the process of developing programs without prior knowledge of the language, cultural data, or character-encoding schemes that the programs are expected to handle. In other words, internationalization refers to the availability and use of interfaces that let programs modify their behavior at run time for operation in a specific language environment. The abbreviation I18N is often used to stand for internationalization, as there are 18 characters between the beginning "I" and the ending "N" of that word." ---
I agree. One just needs to think of Arabic right-to-left writing. But we don't need to go that far. I live in the Canary Islands, which belong to Spain but where the time is one hour earlier. When a timestamp is required for authentication, some systems don't like it. I consider English the system's native language and Spanish the locale to change to; and this is a good starting point, a case that seems to be in the easy group: I mean, when it comes to languages, those in the ISO-8859-1 (a.k.a. Latin-1) charset. But here in the Canaries it would be desirable to have our own locale, or at least for Windows to have our time zone like Linux does. ---
We all know that a charset is the binary representation of a set of characters; we also know 7-bit ASCII and that "A" is 65 decimal. We remember the time when everything was written in uppercase; here in Spain the news was worried about the "ñ". Nowadays the issue continues: I'm not sure whether it's already possible to register such a domain name, and what about spaces in a URI? A closer example: please look at this wiki, the page "TraductionEnFran...". Characters are not letters but graphical representations; for us, "a" and "á" are the same letter. Recently we have "Á" too, since the rule that uppercase letters don't take an accent was deprecated; but obviously, none of "a", "á" and "Á" is coded as 65 decimal, but at least they are case-related. A few days ago I discovered a Turkish letter that is not compatible with the usual case-conversion functions. It is difficult to explain such a simple thing on a site like this, only because this page is Latin-1 encoded and cannot display those Turkish letters: the code points that correspond to them would be decoded as Latin-1, so other letters would be displayed; moreover, there are no HTML entities for Turkish. ISO charsets are ASCII-based: the most significant bit, which in ASCII could be used for parity, is used in the 8-bit ISO codes to add 128 more characters. This seems enough when we think about letters, since the English alphabet has about 26 of them, with no accents and the like. But we soon realize it's not enough for multi-language characters, with all their combinations and possibilities. We all know this is a divine punishment for trying to build the Tower of Babel. ---


Revision [1545]

The oldest known version of this page was created on 2009-07-30 05:23:00 by JavierDeLorenzo