Internationalisation and my PHP Development Infrastructure

Introduction

The term "internationalisation" is sometimes referred to as "globalisation" or "localisation", but what does it actually mean? The following description is taken from java.sun.com:

Internationalization is the process of designing an application so that it can be adapted to various languages and regions without engineering changes. Sometimes the term internationalization is abbreviated as i18n, because there are 18 letters between the first "i" and the last "n."

An internationalized program has the following characteristics:

  • With the addition of localization data, the same executable can run worldwide.
  • Textual elements, such as status messages and the GUI component labels, are not hard coded in the program. Instead they are stored outside the source code and retrieved dynamically.
  • Support for new languages does not require recompilation.
  • Culturally-dependent data, such as dates and currencies, appear in formats that conform to the end user’s region and language.
  • It can be localized quickly.

The process of internationalisation can be as simple as replacing a string of text in one language with a string of text in another language, or it can be much more complicated, involving the use of different character sets, as explained in Notes on Internationalization. In the interests of simplicity I shall limit myself to the straightforward replacement of text, as this is "good enough" in most circumstances I shall encounter. As for different character sets, I have found that by changing default_charset in my php.ini file from ‘iso-8859-1’ to ‘UTF-8’, and also setting ‘content-type:text/html; charset=UTF-8’ for my HTML output, I can cover the most common eventualities.
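
As a minimal sketch of the two character-set settings just described, the same effect can be obtained at run time instead of in php.ini. The helper name utf8ContentType() is hypothetical and exists only for this illustration:

```php
<?php
// Hypothetical helper: builds the Content-Type header line for UTF-8 HTML.
function utf8ContentType()
{
    return 'Content-Type: text/html; charset=UTF-8';
}

// Run-time equivalent of setting default_charset = "UTF-8" in php.ini.
ini_set('default_charset', 'UTF-8');

// Headers can only be sent before any output has been written.
if (!headers_sent()) {
    header(utf8ContentType());
}
```

Setting the value in php.ini remains the simpler choice when every script on the server should use UTF-8.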

Possible Methods

There are several ways in which text in language ‘A’ can be replaced with text in language ‘B’. Before a solution can be designed it is necessary to examine the range of possibilities and weigh up the pros and cons of each one.

  1. Are you going to run strings of text through a general-purpose translator, or replace one identifiable string with another?
  2. Are you going to perform the translation/substitution as early as possible (i.e. as soon as you know what text needs to be output), or as late as possible (i.e. just before it is presented to the user)?
  3. Are you going to put text into the output area and then translate it, or translate it first and then put it into the output area?
  4. Are you going to identify each piece of text as a complete string, or give each one a smaller identity code?
  5. Are you going to store the language variations in a database or in text files?
  6. Are you going to put all the language variations into a single file, or have a separate file for each language?
  7. If you use XML and XSL to produce all HTML output (as I do in my development infrastructure) could you perform all the translation during the XSL transformation?

Design Decisions

While reviewing the possible options I made the following decisions:

  • The idea of running text, whether whole pages or small fragments, through a general-purpose translator was rejected as I have never seen a translation service that produces perfect results. I will therefore stick to the replacement of one string with another as it uses techniques that are tried and tested as well as being fast and accurate.
  • The idea of performing the translation as late as possible, which also implies working on the entire page rather than isolated fragments, was rejected because you cannot know in advance what the page may contain, so you would have to cycle through all the possible substitutions. This would be slow, and prone to error if a single word to be substituted also appeared in a group of words which required a different substitution. I have prior experience of performing the substitution as early as possible (i.e. as soon as it is known what text needs to be output) and on small fragments, which is both fast and accurate, so I shall stick to the same technique.
  • Next comes the choice between giving each string to be replaced a unique identity or code, or identifying the string by its complete contents. Experience has shown that a mixed approach gives the best of both worlds:
    • Where the string is short, such as 1 or 2 words only, as used in button text or field labels, then it is easier to ignore the use of a separate code and use the whole string as its identifier. For example:
      • The SUBMIT button has an identity of ‘submit’.
      • The ‘Person Name’ field label has an identity of ‘Person Name’.
    • Where the string is long, such as in error messages and form titles, then it is easier to use a separate short identifier or code. For example:
      • The form title for each script will use the script name as its identifier.
      • Each error message will be given its own unique identifier, such as ‘e1234’, and will include the ability to insert runtime values into the output text.
  • As for the choice between the database or non-database files for storing the replacement text, although modern databases can be extremely fast and efficient I also need to be able to give copies of the text in one language to a translation service so that they can be converted into a different language. With a database I would need to have additional procedures to export the data to a medium suitable for transportation to the translation service (which may be situated at a distant location), then import the converted text back into the database. By using non-database files from the start I can eliminate the need for these export and import procedures. Experiments have shown that the use of flat files for this text does not have a significant impact on performance, so it would appear to be an acceptable solution.
  • Shall I put all the language variations in a single file, or have a separate file for each language? Having previously worked on a system where all the language variations were put into a single file, which resulted in a very large and unwieldy file, I decided to have a separate file for each language.
  • Shall I perform the translation within the XSL stylesheet? Although this is a possibility I dismissed it for the following reasons:
    • It would mean performing the translation at a late stage in the proceedings, and I have already opted for an earlier stage.
    • I have no experience of performing these translations with XSL stylesheets whereas I have years of experience of performing them within my application code.
    • I am unsure of the performance impact with XSL translations, whereas my existing methods have no noticeable impact.
    • At some point in the future I may want to produce the output without using XSL, in which case I would require a non-XSL translation facility anyway.

My Implementation – Directory Structure

My main development infrastructure consists of a series of discrete subsystems, each of which has its own directory in the file system. (Note that my sample application consists of just a single subsystem.) Each of these directories will contain the following new subdirectories:

  • ‘help’ – for all help text. This is a departure from my existing method whereby help text is stored inside the database, but after a brief moment of reflection I decided it would be easier in the long term to maintain all translatable text in non-database files.
  • ‘text’ – for all field label, form title, error message and other text.

Each of these new directories will be further broken down into subdirectories where the subdirectory name matches a language code, such as:

  • ‘en’ – for English
  • ‘en-us’ – for English (United States)
  • ‘fr’ – for French
  • ‘fr-ca’ – for French (Canada)

All subdirectories except ‘en’ English (my native language) are optional. The files in the ‘en’ subdirectories identify every piece of text which can be translated, so these should be used as the patterns for any non-English translations.

Each language subdirectory will contain a copy of the file(s) containing the text for that particular language code. If there is no text for a particular language code then there should be no subdirectory for that language code – in other words, there should not be any language subdirectories which are empty.

I also decided to split the existing ‘screens’ directory (for the screen structure files) into language subdirectories with the intention to use these files to contain the translated field label text, but when I realised that most of the labels were duplicated I decided that it would be easier to define each field label just once in the file which contains all the other translatable text. However, I decided to keep the facility for separate versions of the screen structure files for different languages just in case it should be necessary to adjust more than the label text to suit the needs of a particular language. If no adjustments need to be made to any of the screen structure files then no additional language subdirectories will be necessary as everything will still work successfully with the original files in the ‘en’ subdirectory – the structures will be identical, but the label text will still be translated.

My Implementation – File Names

Within the ‘help’ directory there will be a separate file for each PHP script with the suffix ‘.help.txt’. For example, the script ‘person_list.php’ will have ‘person_list.php.help.txt’. In my full development infrastructure I will use the task_id instead of the script_id.

Within the ‘screens’ subdirectory there will be a separate file for each screen structure with the suffix ‘.screen.inc’.

Within the ‘text’ subdirectory there will be the following files:

  1. A file called ‘language_text.inc’ that will contain all the translations for that application in that language.
  2. A file called ‘language_array.inc’ that will contain all the arrays of values that are normally used in picklists (dropdown lists or radio groups) where the key (as referenced internally) remains consistent, but where the value (as seen by the user) may be expressed in different languages.

For each installation there will also be a ‘sys.language_text.inc’ and ‘sys.language_array.inc’ to contain all the translatable text required by the system libraries. In the sample application these will reside in the ‘sample/text/’ directory, but in my full infrastructure these will reside in the ‘menu/text/’ directory. These ‘system’ files contain the text that may be used by any of the system libraries (controller scripts, validation class, generic table class and DML class) regardless of the application to which any particular component belongs.

This means that:

  • Each application subsystem will have its own version of ‘language_text.inc’ and ‘language_array.inc’ to contain the language text for that application. It is not necessary (nor even advisable) to have all the text for all languages for all application subsystems in a single file.
  • Each installation, which may consist of any number of application subsystems, will have a single copy of ‘sys.language_text.inc’ and ‘sys.language_array.inc’. That is because there is only a single copy of all the system libraries.
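
The loading functions shown later assign the result of including each file to a variable, which suggests that each file builds an array and returns it. Under that assumption, a ‘language_text.inc’ file for ‘en’ might look something like the sketch below. Every entry here (including the code ‘e0001’ and the title text) is invented for illustration, and the array is wrapped in a hypothetical function only so that the sketch runs on its own; a real file would build and return the array at file level.

```php
<?php
// Hypothetical stand-in for the contents of text/en/language_text.inc.
function sampleLanguageText()
{
    $array = array();

    // Short strings use the text itself as the identifier.
    $array['submit']      = 'Submit';
    $array['Person Name'] = 'Person Name';

    // Form titles are keyed by the script name.
    $array['person_list.php'] = 'List Person';

    // Longer strings use a short code; the placeholders are filled at run time.
    $array['e0001'] = '%1$s cannot be greater than %2$s characters';

    return $array;
}

$text = sampleLanguageText();
$message = sprintf($text['e0001'], 'Person Name', 40);
```

A translator would receive a copy of this file, replace the right-hand values and leave the keys untouched.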

My Implementation – Determine User Language

The first step is to determine the user’s language as provided by the client browser (user agent) in the $_SERVER["HTTP_ACCEPT_LANGUAGE"] variable. For this I am using the User Agent Language Detection script provided by http://techpatterns.com/. The code I use is as follows:

if (!isset($_SESSION['user_language_array'])) {
    // get language codes from HTTP header
    require 'language_detection.inc';
    $_SESSION['user_language_array'] = get_languages();
} // if

This returns an array of entries, one for each of the languages that the user may have set in his/her browser. Each language entry is another array of 4 entries as follows:

  1. Full language abbreviation, such as: ‘en-gb’ or ‘en-us’
  2. Primary language, such as: ‘en’
  3. Full language string, such as: ‘English (United Kingdom)’ or ‘English (United States)’
  4. Primary language string, such as: ‘English’
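
To make the shape of this array concrete, here is a deliberately simplified stand-in for get_languages(). It is not the techpatterns.com script: the function name parseAcceptLanguage() is hypothetical, it ignores the q-value weightings and simply keeps the order in which the browser lists the languages, and it leaves the two descriptive strings empty.

```php
<?php
// Hypothetical, simplified parser for an Accept-Language header value.
function parseAcceptLanguage($header)
{
    $languages = array();
    foreach (explode(',', $header) as $item) {
        // strip any ";q=0.8" weighting and normalise to lower case
        $parts = explode(';', $item);
        $code  = strtolower(trim($parts[0]));
        if ($code === '' || $code === '*') {
            continue;
        }
        // the primary language is the part before the first hyphen
        $sub = explode('-', $code);
        // entries 2 and 3 (the descriptive strings) are left empty here
        $languages[] = array($code, $sub[0], '', '');
    }
    return $languages;
}

$result = parseAcceptLanguage('fr-CA,fr;q=0.8,en-US;q=0.5,en;q=0.3');
// $result[0] is array('fr-ca', 'fr', '', '')
```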

When looking for the subdirectory containing the language text file the array of language codes provided by the User Agent (client browser) will be examined first to last. If no subdirectory exists with the same name as the language subtype (such as ‘en-us’ or ‘fr-ca’) the software will look for a subdirectory with the same name as the primary language (such as ‘en’ or ‘fr’). If nothing can be found for the first entry in the User Agent array the software will move on to the next entry.

If no subdirectory is found for any entry in the User Agent array the software will drop back to the installation’s default language. In my sample application this is hard coded as ‘en’, but in my full development environment this is configurable using the Update Control Data screen.

Within my full development environment it is also possible for each user to define his/her preferred language code in the Update User screen. This overrides both the User Agent array and the installation default language.
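
The order of precedence described above can be sketched as a single function. The name chooseLanguage() and its parameters are hypothetical, and the check that a user's preferred language actually has a subdirectory is my assumption rather than something stated above.

```php
<?php
// Hypothetical sketch: $available lists the language subdirectories that exist.
function chooseLanguage(array $available, array $userLanguages, $userPreference = null, $default = 'en')
{
    // 1. a preference stored against the user overrides everything else
    if ($userPreference !== null && in_array($userPreference, $available, true)) {
        return $userPreference;
    }
    // 2. browser languages, first to last: subtype first, then primary language
    foreach ($userLanguages as $language) {
        if (in_array($language[0], $available, true)) {
            return $language[0];
        }
        if (in_array($language[1], $available, true)) {
            return $language[1];
        }
    }
    // 3. nothing matched, so use the installation default
    return $default;
}

$browser = array(array('fr-ca', 'fr', '', ''), array('en-us', 'en', '', ''));
$chosen  = chooseLanguage(array('en', 'fr'), $browser);
// 'fr-ca' has no subdirectory, so the primary language 'fr' is chosen
```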

My Implementation – Locate Language Subdirectory

Before loading the contents of a particular file it is first necessary to locate a valid subdirectory. This is done with the following function:

function getLanguageSubDir ($path)
// get subdirectory which corresponds with user's language code.
{
    // build an array of subdirectory names for specified $path
    $found = array();
    if (is_dir($path)) {
        $dir = dir($path);
        while (false !== ($entry = $dir->read())) {
            if ($entry == '.' or $entry == '..') {
                // ignore
            } else {
                if (is_dir("$path/$entry")) {
                    $found[] = $entry;
                } // if
            } // if
        } // while
        $dir->close();
    } // if

    if (!empty($found)) {
        if (isset($_SESSION['user_language_array'])) {
            // scan $user_language_array looking for a matching entry
            foreach ($_SESSION['user_language_array'] as $language) {
                // look for language subtype
                if (in_array($language[0], $found)) {
                    return "$path/$language[0]";
                } // if
                // look for primary language
                if (in_array($language[1], $found)) {
                    return "$path/$language[1]";
                } // if
            } // foreach
        } // if
    } // if

    // no language specified, so default to English
    return $path .'/en';

} // getLanguageSubDir

Note here that as an absolute minimum there MUST be a subdirectory for the default language, and this subdirectory MUST contain a full set of the expected files.

My Implementation – Load Screen Structure file

This uses the getLanguageSubDir() function to locate the relevant language subdirectory for the ./screens path before loading in the specified screen structure file.

function setScreenStructure ($xml_doc, $root, $screen, $xsl_file)
// extract screen structure from named file and insert details into XML document.
{
    // get subdirectory which matches user's language code
    $subdir = getLanguageSubDir ('./screens');

    $screen = "$subdir/$screen"; // look in subdirectory for this screen name
    if (!file_exists($screen)) {
        // 'File $screen cannot be found'
        trigger_error(getLanguageText('sys0056', $screen), E_USER_ERROR);
    } // if

    require $screen; // import contents of disk file
    .....

Note that it does not matter if the only subdirectory that exists for screen structure files is in the default language as all the field labels will be translated into the chosen language at a later stage.

My Implementation – Get Language Text

Individual pieces of translated text will be extracted from the relevant ‘language_text.inc’ or ‘sys.language_text.inc’ files using the following function:

function getLanguageText ($id, $arg1=null, $arg2=null, $arg3=null, $arg4=null, $arg5=null)
// get text from the language file and include up to 5 arguments.
{
    static $array1;
    static $array2;

    if (!is_array($array1)) {
        $array1 = array();
        // include standard system text from current directory
        $subdir = getLanguageSubDir ('./text');
        $fname = "$subdir/sys.language_text.inc";
        if (!file_exists($fname)) {
            // filename does not exist
            trigger_error(getLanguageText('sys0057', $fname), E_USER_ERROR);
        } // if
        $array1 = require_once $fname;
        unset ($array);
    } // if

    if (!is_array($array2)) {
        $array2 = array();
        // include application text from current directory
        $subdir = getLanguageSubDir ('./text');
        $fname = "$subdir/language_text.inc";
        if (!file_exists($fname)) {
            // filename does not exist
            trigger_error(getLanguageText('sys0057', $fname), E_USER_ERROR);
        } // if
        $array2 = require_once $fname;
        unset ($array);
        // use this language code for the HTML output
        $pos = strrpos($subdir, '/');
        $GLOBALS['language'] = substr($subdir, $pos +1);
    } // if

    // perform lookup for specified $id ($array2 first, then $array1)
    if (isset($array2[$id])) {
        $string = $array2[$id];
    } elseif (isset($array1[$id])) {
        $string = $array1[$id];
    } else {
        // nothing found, so return original input
        return $id;
    } // if

    $string = convertEncoding($string, 'latin1', 'UTF-8');

    if (!is_null($arg1)) {
        // insert argument(s) into string
        $string = sprintf($string, $arg1, $arg2, $arg3, $arg4, $arg5);
    } // if

    return $string;

} // getLanguageText

Please note the following:

  • The contents of ‘language_text.inc’ will be searched first for the relevant $id. If it is not found then the contents of ‘sys.language_text.inc’ will be searched.
  • The files are only read in once per HTTP request, during which the contents are loaded into memory. All subsequent accesses will read from memory without reading in the disk file again.
  • The reason that I convert the input from ‘latin1’ to ‘UTF-8’ is that during testing I discovered that accented characters needed to be converted (such as ‘é’, a single byte in latin1, into its two-byte UTF-8 sequence), otherwise they appeared corrupted in the HTML output after passing through the XML file and XSL transformation.
  • As well as the $id up to 5 optional arguments can be supplied. This can be used for error messages which may need to include values which are only available at run time.
  • The language found for the application text will be used as the language for the resulting output. For example, the HTML output file will contain the following line:
    <html xml:lang="??" lang="??">
    

This new function has been inserted into the following places:

  1. Inside addParams2XMLdoc() to load script titles:
    $xsl_params['title'] = getLanguageText($task_id);
    
  2. Inside setActBar() to load action buttons:
    $label = getLanguageText($label);
    
  3. Inside setMenuBar() to load menu buttons:
    $button['button_text'] = getLanguageText($button['button_text']);
    
  4. Inside setNavBar() to load navigation buttons:
    $button['button_text'] = getLanguageText($button['button_text']);
    
  5. Inside setScreenStructure() to load field labels:
    $fieldlabel = getLanguageText($fieldlabel);
    
  6. In various places for all error messages, such as:
    if (strlen($fieldvalue) > $size) {
        // '$fieldname cannot be > $size characters'
        $this->errors[$fieldname] = getLanguageText('sys0021', $fieldname, $size);
    } // if
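
Because getLanguageText() passes its arguments through sprintf(), a translated message can use numbered placeholders to put the run-time values in whatever order suits the language. The two message texts below are invented for illustration and are not taken from my language files:

```php
<?php
// Hypothetical texts for the same message identifier in two languages.
$english = '%1$s cannot be > %2$s characters';
$german  = 'Maximal %2$s Zeichen sind für %1$s erlaubt';

// Mirrors the sprintf() call in getLanguageText(): unused trailing
// arguments are passed as null and are simply ignored.
$messageEn = sprintf($english, 'Person Name', 40, null, null, null);
$messageDe = sprintf($german, 'Person Name', 40, null, null, null);
```

The calling code supplies the field name and size in the same order each time; only the text in the language file decides where they appear.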

My Implementation – Get Language Array

Individual arrays of translated text will be extracted from the relevant ‘language_array.inc’ or ‘sys.language_array.inc’ files using the following function:

function getLanguageArray ($id)
// get named array from the language file.
{
    static $array1;
    static $array2;

    if (!is_array($array1)) {
        $array1 = array();
        // include standard system text from current subdirectory
        $subdir = getLanguageSubDir ('./text');
        $fname = "$subdir/sys.language_array.inc";
        if (!file_exists($fname)) {
            // filename does not exist
            trigger_error(getLanguageText('sys0057', $fname), E_USER_ERROR);
        } // if
        $array1 = require_once $fname;
        unset ($array);
    } // if

    if (!is_array($array2)) {
        $array2 = array();
        // include application text from current directory
        $subdir = getLanguageSubDir ('./text');
        $fname = "$subdir/language_array.inc";
        if (!file_exists($fname)) {
            // filename does not exist
            trigger_error(getLanguageText('sys0057', $fname), E_USER_ERROR);
        } // if
        $array2 = require_once $fname;
        unset ($array);
    } // if

    // perform lookup for specified $id ($array2 first, then $array1)
    if (isset($array2[$id])) {
        $result = $array2[$id];
    } elseif (isset($array1[$id])) {
        $result = $array1[$id];
    } else {
        // nothing found, so return original input as an array
        $result = array($id => $id);
    } // if

    foreach ($result as $key => $value) {
        $result[$key] = convertEncoding($value, 'latin1', 'UTF-8');
    } // foreach

    return $result;

} // getLanguageArray

Please note the following:

  • The contents of ‘language_array.inc’ will be searched first for the relevant $id. If it is not found then the contents of ‘sys.language_array.inc’ will be searched.
  • The files are only read in once per HTTP request, during which the contents are loaded into memory. All subsequent accesses will read from memory without reading in the disk file again.

This new function should be used to obtain any array where the values will be displayed to the user. For example, instead of:

$languages = array('en' => 'English',
                   'es' => 'Spanish',
                   'fr' => 'French');

you should use the following:

$languages =  getLanguageArray('languages');

My Implementation – Handling Dates

I already use a standardised class to handle all my date validation (refer to A class for validating and formatting dates) so all I needed to do was change this:

    $this->monthalpha = array(1 => 'Jan','Feb','Mar','Apr','May','Jun',
                                   'Jul','Aug','Sep','Oct','Nov','Dec');

or

    $this->monthalpha = array(1 => 'Janv', 'Févr', 'Mars', 'Avr', 'Mai', 'Juin',
                                   'Juil', 'Août', 'Sept', 'Oct', 'Nov', 'Déc');

to this:

    $this->monthalpha = getLanguageArray('month_names_short');

As my default output format for dates is ‘dd Mmm yyyy’ this means that the ‘Mmm’ portion will use whatever language text is available.

Please note the following:

  • The month names should be no longer than 3 characters each otherwise the maximum length of all date and datetime fields will have to be increased to compensate.
  • The use of non-alpha characters (such as ‘.’) in the month names should be avoided as these will be treated as possible separators and not part of the name.
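
As an illustration of how a translated month array feeds the ‘dd Mmm yyyy’ format, here is a minimal sketch. The function formatDateShort() is hypothetical and is not part of my date class; the arrays stand in for what getLanguageArray('month_names_short') might return for English and Spanish.

```php
<?php
// Hypothetical helper: formats an internal 'yyyy-mm-dd' date as 'dd Mmm yyyy'.
function formatDateShort($isoDate, array $monthNames)
{
    list($year, $month, $day) = explode('-', $isoDate);
    // cast the month to an integer so that '08' finds key 8
    return sprintf('%02d %s %04d', $day, $monthNames[(int) $month], $year);
}

// Stand-ins for the translated arrays; keys start at 1 as in the class above.
$english = array(1 => 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                      'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec');
$spanish = array(1 => 'Ene', 'Feb', 'Mar', 'Abr', 'May', 'Jun',
                      'Jul', 'Ago', 'Sep', 'Oct', 'Nov', 'Dic');

$dateEn = formatDateShort('2005-08-03', $english); // 03 Aug 2005
$dateEs = formatDateShort('2005-08-03', $spanish); // 03 Ago 2005
```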

My Implementation – Handling Numbers

Although the English notation for numbers is to use ‘.’ (period) for the decimal separator and ‘,’ (comma) for the thousands separator, there are some countries which use a different notation. Some have the two separators completely reversed, and some use ‘ ’ (space) as the thousands separator. Regardless of any national conventions all numbers are processed within the program code, and stored within the database, in a common format. That is, the decimal point is a ‘.’ (period) and there are no thousands separators.

This means that all decimal values must be formatted before they can be output to the user, and any user input must be unformatted before it can be handled by the program.

Convert to external (user’s) format

A very important step in this process is therefore to identify all the decimal format conventions expected by the user. Fortunately all the relevant information can be provided by the localeconv() function. Unfortunately this requires the user’s actual locale to be identified first with the setlocale() function. I say ‘unfortunately’ because the input to this function is the user’s current locale or location whereas the only information available at present is the user’s preferred language as supplied in the HTTP variables. I have got round this minor annoyance by modifying the ‘languages’ array used to determine the user’s language to include a locale in the full language string, as in the following examples:

  • English (United Kingdom) [ENG]
  • English (United States) [USA]
  • French (Canada) [CAN]
  • French (France) [FRA]
  • German (Germany) [DEU]
  • Spanish (Spain) [ESP]

This means that I can now set the user’s locale using code similar to the following:

    // get full language string from first entry in user_language_array
    $country = $_SESSION['user_language_array'][0][2];
    // extract locale which is enclosed in '[' and ']'
    if (!preg_match('/\[[^\[\]]+\]/', $country, $regs)) {
        // 'Locale is not defined in string'
        trigger_error(getLanguageText('sys0078', $country), E_USER_ERROR);
    } // if
    $locale = trim($regs[0], '[]');
    // find out if this is a valid locale (setlocale() returns false on failure)
    $result = setlocale(LC_ALL, $locale);
    if ($result === false) {
        // 'Cannot set locale' (report the name that was requested)
        trigger_error(getLanguageText('sys0079', $locale), E_USER_ERROR);
    } // if

Having set the locale it is then a relatively simple exercise to convert any decimal number from internal to external format using code similar to the following:

    $decimal_places = $this->fieldspec[$fieldname]['scale'];
    $locale = localeconv();
    $decimal_point = $locale['decimal_point'];
    $thousands_sep = $locale['thousands_sep'];
    if ($thousands_sep == chr(160)) {
        // change non-breaking space into ordinary space
        $thousands_sep = chr(32);
    } // if
    $fieldvalue = number_format($fieldvalue,
                                $decimal_places,
                                $decimal_point,
                                $thousands_sep);
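
The effect of the different separators can be seen without depending on which locales are installed on the server. In this sketch the separators are passed to number_format() directly, in place of the values that localeconv() would supply; the variable names are only labels for the three conventions mentioned earlier.

```php
<?php
$value = 1234567.891;

// period for decimals, comma for thousands
$english  = number_format($value, 2, '.', ',');
// the two separators reversed
$reversed = number_format($value, 2, ',', '.');
// comma for decimals, space for thousands
$spaced   = number_format($value, 2, ',', ' ');
```

Here $english holds ‘1,234,567.89’, $reversed holds ‘1.234.567,89’ and $spaced holds ‘1 234 567,89’.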

Convert to internal format

When the user presses the SUBMIT button any numbers that have been input will need to be converted back into internal format before they can be processed. This is done with code similar to the following:

function number_unformat ($input)
// convert input string into a number using settings from localeconv()
{
    $locale = localeconv();
    $decimal_point = $locale['decimal_point'];
    $thousands_sep = $locale['thousands_sep'];
    if ($thousands_sep == chr(160)) {
        // change non-breaking space into ordinary space
        $thousands_sep = chr(32);
    } // if

    if (substr_count($input, $decimal_point) > 1) {
        // more than one decimal point, so leave the input unchanged
        return $input;
    } // if

    // split number into 2 distinct parts (the fraction may be absent)
    $parts = explode($decimal_point, $input, 2);
    $integer = $parts[0];
    $fraction = isset($parts[1]) ? $parts[1] : '';

    // remove thousands separator
    $integer = str_replace($thousands_sep, '', $integer);

    // join the two parts back together again
    $number = ($fraction === '') ? $integer : $integer .'.' .$fraction;

    return $number;

} // number_unformat
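
To try the conversion without changing the server's locale, the same logic can be written with the separators passed in as parameters. The function unformatWith() below is a hypothetical variant for experimentation only, not a replacement for number_unformat():

```php
<?php
// Hypothetical variant of number_unformat() with explicit separators.
function unformatWith($input, $decimal_point, $thousands_sep)
{
    // more than one decimal point: return the input unchanged
    if (substr_count($input, $decimal_point) > 1) {
        return $input;
    }
    // split into integer and (optional) fraction parts
    $parts    = explode($decimal_point, $input, 2);
    $integer  = str_replace($thousands_sep, '', $parts[0]);
    $fraction = isset($parts[1]) ? $parts[1] : '';

    // rejoin using the internal '.' decimal point
    return ($fraction === '') ? $integer : $integer . '.' . $fraction;
}

$fromReversed = unformatWith('1.234.567,89', ',', '.'); // 1234567.89
$fromSpaced   = unformatWith('1 234,5', ',', ' ');      // 1234.5
$fromEnglish  = unformatWith('1,234', '.', ',');        // 1234
```

Input with more than one decimal point is returned untouched, so that the normal field validation can reject it with a proper error message.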

Conclusion

Now that the software is in place it should (in theory) be simple to cater for a new language: create a new subdirectory for the language, then drop in a set of files which contain the translated text. This system may not be able to cater for every language or locale that exists, but it will deal with the most common ones.

My sample application has been updated with all this code, so feel free to download it and try it out. Contributions of translated files will be most welcome.

References

  • W3C Internationalization (I18N) Activity
  • Common Locale Data Repository (CLDR) Project
  • Notes on Internationalization
  • The Free Standards Group Open Internationalization Initiative
  • opentag.com – a place for localization tools and technologies
  • internationalisation from FOLDOC
  • Internationalization and localization from WIKIPEDIA

July 17th, 2005 | PHP (timestamp: 2010-05-25T23:12:03+00:00)

    About the Author:

    I have been a software engineer, both designing and developing, since 1977. I have worked with a variety of 2nd, 3rd and 4th generation languages on a mixture of mainframes, mini- and micro-computers. I have worked with flat files, indexed files, hierarchical databases, network databases and relational databases. The user interfaces have included punched card, paper tape, teletype, block mode, CHUI, GUI and web. I have written code which has been procedural, model-driven, event-driven, component-based and object oriented. I have built software using the 1-tier, 2-tier, 3-tier and Model-View-Controller (MVC) architectures. After working with COBOL for 16 years I switched to UNIFACE in 1993, starting with version 5, then progressing through version 6 to version 7. In the middle of 2002 I decided to teach myself to develop web applications using PHP and MySQL.
