Character Sets & Drupal

When you get down to it, computers both store data and execute operations on that data using a simple language: 1’s and 0’s. If you’re doing arithmetic or logic, this works well, but a higher level of abstraction is needed to efficiently accomplish things like text manipulation. Character sets do this by assigning a value to each character so that users can see an “A” rather than a number or a series of 1’s and 0’s.

Modern computers and programs do a good job handling character sets behind the scenes, but it is beneficial to understand their inner workings so related issues are less mysterious and easier to address. This is a discussion of charsets: issues with content management and migration, path handling, file handling, and language support.

In the 1960’s, the American Standards Association established the ASCII character set, which contained encodings for the English alphabet in upper and lower case, numbers, punctuation, and some other commonly used characters. ASCII used 7 bits of data to describe each character, meaning there were 128 total characters (values 0 through 127) that could be represented. In the 70’s, new systems were produced which could handle 8 bits. This meant that values 128 through 255 were available to represent additional characters. No standard was created for this upper range, so different countries encoded non-English characters in conflicting ways.
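As a small illustration of the 7-bit range described above, the following Python sketch looks up the number behind a character and checks whether a string fits in ASCII. The helper name `is_ascii` is made up for this example.

```python
def is_ascii(text):
    """Return True if every character fits in 7 bits (values 0-127)."""
    return all(ord(ch) < 2 ** 7 for ch in text)


# ord() gives the number assigned to a character, chr() goes the other way.
print(ord("A"))                  # 65
print(format(ord("A"), "07b"))   # 1000001 (the same value as 7 bits)
print(chr(97))                   # a

print(is_ascii("Drupal"))        # True
print(is_ascii("café"))          # False: "é" falls outside the 7-bit range
```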

[Tables not reproduced: 7-bit characters (0-127) and 8-bit characters (128-255).]

Starting in the late 1980’s, a group of 8-bit character sets called “ISO/IEC 8859” was created to standardize non-English characters. ISO-8859-1, also known as Latin-1, is a slight extension of ASCII and can still be seen in modern databases. However, you had to know which of these sets to use in order to view a particular document, and it was not possible to include content spanning multiple charsets in the same document.
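To make the "you had to know which set to use" problem concrete, here is a minimal Python sketch that decodes one and the same byte with three members of the ISO-8859 family. The byte value 0xE9 was chosen only for illustration.

```python
# One byte from the upper (128-255) range.
raw = bytes([0xE9])

# The same byte means something different in each 8-bit charset.
for charset in ("iso-8859-1", "iso-8859-5", "iso-8859-7"):
    print(charset, "->", raw.decode(charset))

# In ISO-8859-1 (Latin-1) the byte is "é"; the Cyrillic and Greek
# sets map it to entirely different letters.
decoded = {cs: raw.decode(cs) for cs in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
```

Nothing in the byte itself says which interpretation is right, which is why a document's charset had to be communicated separately.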

Eventually, a character set called Unicode was established. It assigns a unique number (a code point) to characters across all languages, with room for over a million code points so additional characters can be added as needed. Unicode provides some backwards compatibility since its first 128 code points match ASCII. Although Unicode solved the problem of not being able to account for characters across all languages, web standards and practices had already been established around 8-bit data. UTF-8 was established as an encoding that stores Unicode code points as sequences of 8-bit bytes. It is backwards compatible with ASCII and works by using one to four bytes per character:

  1. ASCII characters (code points 0-127) take a single byte that is identical to their ASCII encoding.
  2. Every other character takes two to four bytes; the leading bits of the first byte indicate how many bytes are in the sequence.
  3. Each remaining byte is a continuation byte that begins with the bits 10, so it can never be mistaken for an ASCII character or the start of a new sequence.
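UTF-8 is a variable-width encoding, and a short Python sketch can show how many bytes different characters actually take. The sample characters and the helper names `utf8_width` and `utf8_bits` are chosen only for this example.

```python
def utf8_width(ch):
    """Number of bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


def utf8_bits(ch):
    """The UTF-8 bytes of a character, written out as bit strings."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


for ch in ["A", "é", "幸", "😀"]:
    print("U+%04X" % ord(ch), utf8_width(ch), "byte(s):", utf8_bits(ch))

# ASCII characters keep their single-byte encoding.
print("A".encode("utf-8") == "A".encode("ascii"))  # True
```

In the printed bit strings, every byte after the first in a multi-byte sequence starts with `10`, which is what marks it as a continuation byte.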

Due to its wide adoption, support across browsers, and MySQL support, UTF-8 is usually a good choice for Drupal.

Database handling of charsets

When importing data into a DB, the backwards compatibility of having the first 128 characters match across multiple sets can be a curse rather than a blessing: content that only uses ASCII characters looks correct no matter which charset the DB uses, so a mismatch can go unnoticed until non-ASCII content appears. Let’s say you’re doing an import and view a page whose content contains none of the ASCII-compatible characters, like "幸せな魚". If the page shows something along the lines of "|.^&Q!" instead, then most likely your UTF-8 content was imported into an ISO-8859-1 DB. This problem is usually pretty obvious since the characters look nothing alike. If you do run into this problem, the easiest solution is to change the DB’s charset and re-run the import. If a complete re-import isn't an option, you can dump the corrupted data, re-import it into a UTF-8 DB, and then correct bad characters manually as they’re found or use a script to identify and replace bad entries.
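The kind of repair script mentioned above might look roughly like the Python sketch below. It assumes the damage is exactly "UTF-8 bytes read as ISO-8859-1" and that nothing else altered the data; a real database may use a slightly different 8-bit charset, so treat the function names and the approach as illustrative only.

```python
def simulate_bad_import(text):
    """Encode as UTF-8, then misread those bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(text):
    """Undo the misread; return the text unchanged if it does not apply."""
    try:
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside ISO-8859-1 (it was
        # probably stored correctly) or the bytes are not valid UTF-8.
        return text


original = "幸せな魚"
broken = simulate_bad_import(original)

print(len(original), len(broken))   # 4 12 (one garbage character per byte)
print(repair(broken) == original)   # True
print(repair(original) == original) # True: correct text is left alone
```

Because a mis-decoded string can only be detected heuristically, it is safer to run such a script against a copy of the data and review the changes before applying them.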

Charsets in URLs

Characters are everywhere we go, including browser URLs. While you might not normally think of it, URLs are in fact made up of English letters. The specification for URLs is very limited in that they must consist only of English letters, numbers, and these: $-_.+!*'() (the characters &$+,/:;=?@ are reserved and have special meaning).

So what about people that want to use URLs that aren't in the English character set? For page/asset names, reserved and non-ASCII characters must be percent-encoded in order to be included within a URL. Each byte is written as a “%” followed by its two-digit hexadecimal value; non-ASCII characters are normally converted to their UTF-8 bytes first. For example "幸せな魚" would be "%E5%B9%B8%E3%81%9B%E3%81%AA%E9%AD%9A". Be careful that the slashes used to define the path are not also encoded, although Apache can be configured to handle this with the “AllowEncodedSlashes” directive if you have access to modify your server configuration.
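Percent-encoding can be tried out with Python's standard `urllib.parse` module. This is only a sketch; the `/blog/` path is a made-up example.

```python
from urllib.parse import quote, unquote

# quote() converts the text to UTF-8 bytes and percent-encodes each byte.
encoded = quote("幸せな魚")
print(encoded)  # %E5%B9%B8%E3%81%9B%E3%81%AA%E9%AD%9A

# By default "/" is treated as safe, so path separators stay intact.
path = quote("/blog/幸せな魚")
print(path)     # /blog/%E5%B9%B8%E3%81%9B%E3%81%AA%E9%AD%9A

# Passing safe="" encodes slashes too, which is the case to watch out for.
print(quote("a/b", safe=""))  # a%2Fb

# unquote() reverses the process.
print(unquote(encoded))       # 幸せな魚
```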

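Internationalized Domain Names use a somewhat different set of rules: non-ASCII labels in a domain name are converted to an ASCII form known as Punycode, which is prefixed with "xn--".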
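For comparison with percent-encoding, the sketch below converts a domain name with Python's built-in `idna` codec, which implements an older revision of the IDN rules. The domain `bücher.example` is a placeholder.

```python
# Each non-ASCII label is converted to an ASCII-only "xn--" form.
ascii_domain = "bücher.example".encode("idna")
print(ascii_domain)                  # b'xn--bcher-kva.example'

# Decoding restores the readable form.
print(ascii_domain.decode("idna"))   # bücher.example
```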

Charset issues in browser content

It's common these days for sites to just mark all of their content as UTF-8 and not really have to worry about it. What happens if you are maintaining a site that specifies a different character set? In this case you'll want to check both the HTTP headers being sent and the meta tags in your HTML, since these can differ; when they do, browsers give the HTTP header precedence. Here is more information on HTML character encoding.
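One way to compare the two declarations is sketched below in Python, using only the standard library. The header value and HTML snippet are invented sample inputs, and the helper names (`header_charset`, `meta_charset`, `charsets_agree`) are hypothetical; in practice you would feed in the real response header and page source.

```python
from email.message import Message
from html.parser import HTMLParser


def header_charset(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


class MetaCharsetParser(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # HTML5 style: <meta charset="...">
            self.charset = attrs["charset"].lower()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Older style: <meta http-equiv="Content-Type" content="...">
            self.charset = header_charset(attrs.get("content") or "")


def meta_charset(html):
    parser = MetaCharsetParser()
    parser.feed(html)
    return parser.charset


def charsets_agree(content_type, html):
    return header_charset(content_type) == meta_charset(html)


sample_header = "text/html; charset=ISO-8859-1"
sample_html = '<html><head><meta charset="UTF-8"></head><body></body></html>'

print(header_charset(sample_header))               # iso-8859-1
print(meta_charset(sample_html))                   # utf-8
print(charsets_agree(sample_header, sample_html))  # False
```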

Drupal Language support

We've gone over databases and HTML, but we usually use a content management system or a web framework to pull these two items together. Since we are a Drupal shop we'll use Drupal 7 for our example. Drupal has multilingual support out of the box using the Locale module as well as a host of supporting third-party modules. It allows you to have multiple translations of a single piece of content available at the same time. For content type-specific control, the Entity Translation module can be used, which introduces a "Translate" tab.

Some key points to remember when building multilingual sites with Drupal:

  • Presenting translated content can be context-based or users can be provided with an option for viewing content in a different language.
  • The Location module can be used to determine a user’s geographic location to auto-select or suggest a language.
  • Content can be translated manually or by a computer. Manually translated content tends to have better accuracy but costs more.
  • The TranslateThis Button module uses JavaScript to do automated translations and supports many languages.
  • Drupal allows translated content to either share the same path or to have completely different URLs.
  • Use the Transliteration module to replace any non-ASCII characters in the names of uploaded files.
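To give a feel for what transliterating a file name involves, here is a rough Python sketch. It is not how the Transliteration module works internally; it only strips accents using Unicode decomposition, and the function name `ascii_filename` is made up for this example.

```python
import re
import unicodedata


def ascii_filename(name):
    """Reduce a file name to ASCII letters, digits, dots, dashes, underscores."""
    # Split accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything that is not ASCII (including the combining marks).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace runs of remaining unsafe characters, such as spaces.
    return re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)


print(ascii_filename("Café Menü.pdf"))  # Cafe_Menu.pdf
```

Characters with no ASCII base letter, such as Japanese text, are simply dropped by this approach, which is why a real transliteration tool relies on per-language mapping tables instead.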

Ultimately, character sets, although somewhat complex, make life easier in a shrinking global community. We’ve gotten to the point where you can have a system that automatically handles these issues so users can focus on content. Still, having a basic understanding can help save troubleshooting time when something goes wrong.



