Character set

This section describes character set and encoding issues relating to Transbuild.

A character set is a mapping that defines which number represents which character. Many different character sets are in use today, so you need to know which character set is being used to know which character a given number represents. The ASCII codes are common to most character sets. However, ASCII only defines characters up to 127 and lacks the characters needed for many non-English languages. Numbers above 127 often represent different characters in different character sets.
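To make this concrete, the following Python sketch interprets the same number, 233, under three ISO 8859 character sets. It relies only on Python's built-in codecs, not on Transbuild or iconv, and the choice of number and character sets is purely illustrative.

```python
import unicodedata

# The same number interpreted under three different character sets.
raw = bytes([233])  # the single number 233 (0xE9), above the ASCII range

as_latin1 = raw.decode("iso-8859-1")    # Western European
as_cyrillic = raw.decode("iso-8859-5")  # Cyrillic
as_greek = raw.decode("iso-8859-7")     # Greek

for charset, ch in [("iso-8859-1", as_latin1),
                    ("iso-8859-5", as_cyrillic),
                    ("iso-8859-7", as_greek)]:
    # Print the Unicode code point and name rather than the character
    # itself, so the output is safe on any terminal.
    print(f"{charset}: U+{ord(ch):04X} {unicodedata.name(ch)}")

# Numbers up to 127 agree across all three, because they share ASCII.
ascii_agrees = all(bytes([65]).decode(cs) == "A"
                   for cs in ("iso-8859-1", "iso-8859-5", "iso-8859-7"))
print("65 is 'A' everywhere:", ascii_agrees)
```

The three results are different characters, even though the stored number is identical in each case.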

A character encoding determines how that number is represented or stored in the computer. Nearly all modern computers use an 8-bit byte as their basic unit of storage, so character sets with numbers from 0 to 255 are encoded as one byte per character. Larger character sets need numbers beyond 255, and there are several different ways to encode those numbers in one or more bytes. Unicode is a character set with far more than 256 characters.
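As an illustration of the difference between a character set and an encoding, this Python sketch takes one Unicode character whose number is well above 255 and stores it using three different encodings. The character chosen is arbitrary, and the code uses Python's built-in codecs rather than anything specific to Transbuild.

```python
# One character (one number in the Unicode character set), several encodings.
ch = "\u4e2d"  # a Chinese character, Unicode number 20013 (U+4E2D)

utf8 = ch.encode("utf-8")        # variable length: three bytes here
utf16 = ch.encode("utf-16-be")   # two bytes, big-endian
utf32 = ch.encode("utf-32-be")   # four bytes, big-endian

for encoding, data in [("utf-8", utf8),
                       ("utf-16-be", utf16),
                       ("utf-32-be", utf32)]:
    print(f"{encoding}: {data.hex()} ({len(data)} bytes)")

# A character set that fits in 0-255 needs only one byte per character.
one_byte = "\u00e9".encode("iso-8859-1")
print(f"iso-8859-1: {one_byte.hex()} ({len(one_byte)} byte)")
```

The number being represented never changes; only the bytes used to store it do.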

Transbuild operates in two character set environments: the XML environment and the operating system environment. XML uses Unicode characters, although a document can be stored in a number of different encodings. Thankfully, the XML parser in Transbuild handles these transparently and converts them internally into Unicode.

The character set used by the operating system environment differs between operating systems and locales. The character set and encoding affect many things, including how file and directory names are interpreted and what output is printed to the display.

Transbuild will run without any character set or encoding problems until information is passed between the two environments. For example, this occurs when a filename needs to be placed into XML (e.g. when the annotation source attribute is being created), or when the name of an XML element needs to be printed out (e.g. in an error message). Some of these translations must be exact (e.g. with the filename), while others can be an approximate representation (e.g. with the error message).
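The distinction between exact and approximate translation can be sketched in Python as follows. This is not Transbuild's implementation: the function names to_exact and to_approximate are hypothetical, and Python's codec error handlers stand in for whatever iconv does internally.

```python
def to_exact(text, charset):
    """Translate text that must survive unchanged (e.g. a filename).

    Raises UnicodeEncodeError if any character cannot be represented.
    """
    return text.encode(charset, errors="strict")


def to_approximate(text, charset):
    """Translate text for display only (e.g. an error message).

    Characters that cannot be represented are replaced with '?'.
    """
    return text.encode(charset, errors="replace")


name = "caf\u00e9.xml"  # contains one non-ASCII character

print(to_exact(name, "iso-8859-1"))    # representable, so this succeeds
print(to_approximate(name, "ascii"))   # lossy, but still readable

try:
    to_exact(name, "ascii")
    exact_failed = False
except UnicodeEncodeError as err:
    exact_failed = True
    print("exact translation failed:", err.reason)
```

An approximate result is acceptable in a message shown to the user, but using it as a filename would refer to a different file.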

Firstly, Transbuild needs to know what character set and encoding the operating system environment is using. When it starts up, it determines this by examining the locale. If it cannot determine them, an error will be raised. You can explicitly indicate which character set and encoding are being used with the --option option. For example:

$ transbuild --option charset=iso-8859-1
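The startup behaviour described above (use an explicit setting if one is given, otherwise examine the locale, and fail if the result is not recognised) can be sketched in Python. This is only an analogy under stated assumptions: resolve_charset is a hypothetical name, and Python's locale and codecs modules stand in for Transbuild's own locale handling and for iconv.

```python
import codecs
import locale


def resolve_charset(override=None):
    """Return the character set name to use for the operating system side.

    override stands in for an explicit user setting. If it is None, the
    current locale is consulted instead. Raises LookupError if the
    resulting name is not one that Python recognises.
    """
    name = override or locale.getpreferredencoding(False)
    # codecs.lookup raises LookupError for unknown names and returns
    # a normalised name for known ones.
    return codecs.lookup(name).name


print("from locale:  ", resolve_charset())
print("from override:", resolve_charset("iso-8859-1"))

try:
    resolve_charset("no-such-charset")
    unknown_rejected = False
except LookupError:
    unknown_rejected = True
    print("unknown character set name rejected")
```

Validating the name at startup means a misspelled character set is reported immediately rather than at the first translation.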

The first difficulty is knowing what values are recognised. There is no standard set of names (one system might accept iso-8859-1, another 8859-1). Transbuild uses the iconv library for character set and encoding translation. With newer versions of iconv, run the iconv command with the -l option to list the available encodings. With older versions, the manual pages for iconv and iconv_open might provide clues to the encoding names.

Secondly, Transbuild needs a mechanism to translate from one character set and encoding to another. Sometimes iconv might not implement the required conversion. At other times, the translation is impossible without losing information (e.g. translating Chinese characters into the ASCII character set). Transbuild will raise an encoding error if it encounters either of these problems. You will need to remove, rename or change the offending text.

A more subtle problem arises when the wrong character set or encoding has been specified. An error might not be generated, but the information will be silently processed incorrectly. Make sure you specify the correct character set if you are setting it explicitly.
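The following Python sketch shows how such a silent failure can happen. Text stored as UTF-8 is read back as if it were ISO 8859-1; no error is raised, but the result is not the original text. The sample string is illustrative and the code does not involve Transbuild itself.

```python
original = "caf\u00e9"             # four characters
stored = original.encode("utf-8")  # the bytes actually written: five bytes

misread = stored.decode("iso-8859-1")  # wrong character set, yet no error
correct = stored.decode("utf-8")       # right character set

# ascii() is used so the output is safe on any terminal.
print("misread:", ascii(misread), "length", len(misread))
print("correct:", ascii(correct), "length", len(correct))
print("misread matches original:", misread == original)
print("correct matches original:", correct == original)
```

Every byte is a valid ISO 8859-1 character, so the decoder cannot detect the mistake; only a comparison against the intended text reveals it.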

There is no easy solution when dealing with mixed character sets and encodings. One approach is to restrict yourself to ASCII characters in both XML documents and filenames. Alternatively, some modern operating systems support Unicode natively (a good guide to this is the UTF-8 and Unicode FAQ for Linux/Unix).