Character Sets & Drupal

by Mediacurrent Team
September 24, 2012

When you get down to it, computers both store data and execute operations on that data using a simple language: 1’s and 0’s. For arithmetic or logic that is sufficient, but a higher level of abstraction is needed to efficiently accomplish things like text manipulation. Character sets provide it by assigning a numeric value to each character, so that users can see an “A” rather than a number or a series of 1’s and 0’s.

Modern computers and programs do a good job handling character sets behind the scenes, but it is beneficial to understand their inner workings so related issues are less mysterious and easier to address. This is a discussion of charsets: issues with content management and migration, path handling, file handling, and language support.

In the 1960’s, the American Standards Association established the ASCII character set, which contains encodings for the English alphabet in upper and lower case, digits, punctuation, and some other commonly used characters. ASCII uses 7 bits to describe each character, meaning 128 characters in total (values 0 through 127) can be represented. In the 70’s, new systems were produced which could handle 8 bits. This meant that values 128 through 255 were available to represent additional characters. No standard was created for that range, so different countries ended up encoding non-English characters in conflicting ways.
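As a rough illustration of the idea that a character is just a number underneath, the sketch below prints the decimal, octal, hexadecimal, and binary forms of a character. The helper name `describe` is made up for this example.

```python
def describe(ch):
    """Return (decimal, octal, hex, binary) for a single character."""
    code = ord(ch)  # the numeric value assigned to the character
    return code, format(code, "03o"), format(code, "02X"), format(code, "08b")


# 7 bits can hold 2**7 distinct values, numbered 0 through 127.
print(2 ** 7)
print(describe("A"))
print(describe("~"))
```

The four values returned for each character correspond to the DEC, OCT, HEX, and BIN columns of an ASCII chart.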

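Here is a table showing the 7-bit ASCII characters along with a common 8-bit extension (values 128 through 255 as assigned by Windows-1252, which matches ISO-8859-1 except that it adds printable characters in the 128-159 range):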

DEC | OCT | HEX | BIN | Symbol | HTML Number | HTML Name | Description
7-bit Characters (0-127):
0 | 000 | 00 | 00000000 | NUL | &#0; | | Null char
1 | 001 | 01 | 00000001 | SOH | &#1; | | Start of Heading
2 | 002 | 02 | 00000010 | STX | &#2; | | Start of Text
3 | 003 | 03 | 00000011 | ETX | &#3; | | End of Text
4 | 004 | 04 | 00000100 | EOT | &#4; | | End of Transmission
5 | 005 | 05 | 00000101 | ENQ | &#5; | | Enquiry
6 | 006 | 06 | 00000110 | ACK | &#6; | | Acknowledgment
7 | 007 | 07 | 00000111 | BEL | &#7; | | Bell
8 | 010 | 08 | 00001000 | BS | &#8; | | Back Space
9 | 011 | 09 | 00001001 | HT | &#9; | | Horizontal Tab
10 | 012 | 0A | 00001010 | LF | &#10; | | Line Feed
11 | 013 | 0B | 00001011 | VT | &#11; | | Vertical Tab
12 | 014 | 0C | 00001100 | FF | &#12; | | Form Feed
13 | 015 | 0D | 00001101 | CR | &#13; | | Carriage Return
14 | 016 | 0E | 00001110 | SO | &#14; | | Shift Out / X-On
15 | 017 | 0F | 00001111 | SI | &#15; | | Shift In / X-Off
16 | 020 | 10 | 00010000 | DLE | &#16; | | Data Line Escape
17 | 021 | 11 | 00010001 | DC1 | &#17; | | Device Control 1 (oft. XON)
18 | 022 | 12 | 00010010 | DC2 | &#18; | | Device Control 2
19 | 023 | 13 | 00010011 | DC3 | &#19; | | Device Control 3 (oft. XOFF)
20 | 024 | 14 | 00010100 | DC4 | &#20; | | Device Control 4
21 | 025 | 15 | 00010101 | NAK | &#21; | | Negative Acknowledgement
22 | 026 | 16 | 00010110 | SYN | &#22; | | Synchronous Idle
23 | 027 | 17 | 00010111 | ETB | &#23; | | End of Transmit Block
24 | 030 | 18 | 00011000 | CAN | &#24; | | Cancel
25 | 031 | 19 | 00011001 | EM | &#25; | | End of Medium
26 | 032 | 1A | 00011010 | SUB | &#26; | | Substitute
27 | 033 | 1B | 00011011 | ESC | &#27; | | Escape
28 | 034 | 1C | 00011100 | FS | &#28; | | File Separator
29 | 035 | 1D | 00011101 | GS | &#29; | | Group Separator
30 | 036 | 1E | 00011110 | RS | &#30; | | Record Separator
31 | 037 | 1F | 00011111 | US | &#31; | | Unit Separator
32 | 040 | 20 | 00100000 | | &#32; | | Space
33 | 041 | 21 | 00100001 | ! | &#33; | | Exclamation mark
34 | 042 | 22 | 00100010 | " | &#34; | &quot; | Double quotes (or speech marks)
35 | 043 | 23 | 00100011 | # | &#35; | | Number
36 | 044 | 24 | 00100100 | $ | &#36; | | Dollar
37 | 045 | 25 | 00100101 | % | &#37; | | Percent sign
38 | 046 | 26 | 00100110 | & | &#38; | &amp; | Ampersand
39 | 047 | 27 | 00100111 | ' | &#39; | | Single quote
40 | 050 | 28 | 00101000 | ( | &#40; | | Open parenthesis (or open bracket)
41 | 051 | 29 | 00101001 | ) | &#41; | | Close parenthesis (or close bracket)
42 | 052 | 2A | 00101010 | * | &#42; | | Asterisk
43 | 053 | 2B | 00101011 | + | &#43; | | Plus
44 | 054 | 2C | 00101100 | , | &#44; | | Comma
45 | 055 | 2D | 00101101 | - | &#45; | | Hyphen
46 | 056 | 2E | 00101110 | . | &#46; | | Period, dot or full stop
47 | 057 | 2F | 00101111 | / | &#47; | | Slash or divide
48 | 060 | 30 | 00110000 | 0 | &#48; | | Zero
49 | 061 | 31 | 00110001 | 1 | &#49; | | One
50 | 062 | 32 | 00110010 | 2 | &#50; | | Two
51 | 063 | 33 | 00110011 | 3 | &#51; | | Three
52 | 064 | 34 | 00110100 | 4 | &#52; | | Four
53 | 065 | 35 | 00110101 | 5 | &#53; | | Five
54 | 066 | 36 | 00110110 | 6 | &#54; | | Six
55 | 067 | 37 | 00110111 | 7 | &#55; | | Seven
56 | 070 | 38 | 00111000 | 8 | &#56; | | Eight
57 | 071 | 39 | 00111001 | 9 | &#57; | | Nine
58 | 072 | 3A | 00111010 | : | &#58; | | Colon
59 | 073 | 3B | 00111011 | ; | &#59; | | Semicolon
60 | 074 | 3C | 00111100 | < | &#60; | &lt; | Less than (or open angled bracket)
61 | 075 | 3D | 00111101 | = | &#61; | | Equals
62 | 076 | 3E | 00111110 | > | &#62; | &gt; | Greater than (or close angled bracket)
63 | 077 | 3F | 00111111 | ? | &#63; | | Question mark
64 | 100 | 40 | 01000000 | @ | &#64; | | At symbol
65 | 101 | 41 | 01000001 | A | &#65; | | Uppercase A
66 | 102 | 42 | 01000010 | B | &#66; | | Uppercase B
67 | 103 | 43 | 01000011 | C | &#67; | | Uppercase C
68 | 104 | 44 | 01000100 | D | &#68; | | Uppercase D
69 | 105 | 45 | 01000101 | E | &#69; | | Uppercase E
70 | 106 | 46 | 01000110 | F | &#70; | | Uppercase F
71 | 107 | 47 | 01000111 | G | &#71; | | Uppercase G
72 | 110 | 48 | 01001000 | H | &#72; | | Uppercase H
73 | 111 | 49 | 01001001 | I | &#73; | | Uppercase I
74 | 112 | 4A | 01001010 | J | &#74; | | Uppercase J
75 | 113 | 4B | 01001011 | K | &#75; | | Uppercase K
76 | 114 | 4C | 01001100 | L | &#76; | | Uppercase L
77 | 115 | 4D | 01001101 | M | &#77; | | Uppercase M
78 | 116 | 4E | 01001110 | N | &#78; | | Uppercase N
79 | 117 | 4F | 01001111 | O | &#79; | | Uppercase O
80 | 120 | 50 | 01010000 | P | &#80; | | Uppercase P
81 | 121 | 51 | 01010001 | Q | &#81; | | Uppercase Q
82 | 122 | 52 | 01010010 | R | &#82; | | Uppercase R
83 | 123 | 53 | 01010011 | S | &#83; | | Uppercase S
84 | 124 | 54 | 01010100 | T | &#84; | | Uppercase T
85 | 125 | 55 | 01010101 | U | &#85; | | Uppercase U
86 | 126 | 56 | 01010110 | V | &#86; | | Uppercase V
87 | 127 | 57 | 01010111 | W | &#87; | | Uppercase W
88 | 130 | 58 | 01011000 | X | &#88; | | Uppercase X
89 | 131 | 59 | 01011001 | Y | &#89; | | Uppercase Y
90 | 132 | 5A | 01011010 | Z | &#90; | | Uppercase Z
91 | 133 | 5B | 01011011 | [ | &#91; | | Opening bracket
92 | 134 | 5C | 01011100 | \ | &#92; | | Backslash
93 | 135 | 5D | 01011101 | ] | &#93; | | Closing bracket
94 | 136 | 5E | 01011110 | ^ | &#94; | | Caret - circumflex
95 | 137 | 5F | 01011111 | _ | &#95; | | Underscore
96 | 140 | 60 | 01100000 | ` | &#96; | | Grave accent
97 | 141 | 61 | 01100001 | a | &#97; | | Lowercase a
98 | 142 | 62 | 01100010 | b | &#98; | | Lowercase b
99 | 143 | 63 | 01100011 | c | &#99; | | Lowercase c
100 | 144 | 64 | 01100100 | d | &#100; | | Lowercase d
101 | 145 | 65 | 01100101 | e | &#101; | | Lowercase e
102 | 146 | 66 | 01100110 | f | &#102; | | Lowercase f
103 | 147 | 67 | 01100111 | g | &#103; | | Lowercase g
104 | 150 | 68 | 01101000 | h | &#104; | | Lowercase h
105 | 151 | 69 | 01101001 | i | &#105; | | Lowercase i
106 | 152 | 6A | 01101010 | j | &#106; | | Lowercase j
107 | 153 | 6B | 01101011 | k | &#107; | | Lowercase k
108 | 154 | 6C | 01101100 | l | &#108; | | Lowercase l
109 | 155 | 6D | 01101101 | m | &#109; | | Lowercase m
110 | 156 | 6E | 01101110 | n | &#110; | | Lowercase n
111 | 157 | 6F | 01101111 | o | &#111; | | Lowercase o
112 | 160 | 70 | 01110000 | p | &#112; | | Lowercase p
113 | 161 | 71 | 01110001 | q | &#113; | | Lowercase q
114 | 162 | 72 | 01110010 | r | &#114; | | Lowercase r
115 | 163 | 73 | 01110011 | s | &#115; | | Lowercase s
116 | 164 | 74 | 01110100 | t | &#116; | | Lowercase t
117 | 165 | 75 | 01110101 | u | &#117; | | Lowercase u
118 | 166 | 76 | 01110110 | v | &#118; | | Lowercase v
119 | 167 | 77 | 01110111 | w | &#119; | | Lowercase w
120 | 170 | 78 | 01111000 | x | &#120; | | Lowercase x
121 | 171 | 79 | 01111001 | y | &#121; | | Lowercase y
122 | 172 | 7A | 01111010 | z | &#122; | | Lowercase z
123 | 173 | 7B | 01111011 | { | &#123; | | Opening brace
124 | 174 | 7C | 01111100 | | | &#124; | | Vertical bar
125 | 175 | 7D | 01111101 | } | &#125; | | Closing brace
126 | 176 | 7E | 01111110 | ~ | &#126; | | Equivalency sign - tilde
127 | 177 | 7F | 01111111 | DEL | &#127; | | Delete
8-bit Characters (128-255):
128 | 200 | 80 | 10000000 | € | &#128; | &euro; | Euro sign
129 | 201 | 81 | 10000001 | | | |
130 | 202 | 82 | 10000010 | ‚ | &#130; | &sbquo; | Single low-9 quotation mark
131 | 203 | 83 | 10000011 | ƒ | &#131; | &fnof; | Latin small letter f with hook
132 | 204 | 84 | 10000100 | „ | &#132; | &bdquo; | Double low-9 quotation mark
133 | 205 | 85 | 10000101 | … | &#133; | &hellip; | Horizontal ellipsis
134 | 206 | 86 | 10000110 | † | &#134; | &dagger; | Dagger
135 | 207 | 87 | 10000111 | ‡ | &#135; | &Dagger; | Double dagger
136 | 210 | 88 | 10001000 | ˆ | &#136; | &circ; | Modifier letter circumflex accent
137 | 211 | 89 | 10001001 | ‰ | &#137; | &permil; | Per mille sign
138 | 212 | 8A | 10001010 | Š | &#138; | &Scaron; | Latin capital letter S with caron
139 | 213 | 8B | 10001011 | ‹ | &#139; | &lsaquo; | Single left-pointing angle quotation mark
140 | 214 | 8C | 10001100 | Œ | &#140; | &OElig; | Latin capital ligature OE
141 | 215 | 8D | 10001101 | | | |
142 | 216 | 8E | 10001110 | Ž | &#142; | | Latin capital letter Z with caron
143 | 217 | 8F | 10001111 | | | |
144 | 220 | 90 | 10010000 | | | |
145 | 221 | 91 | 10010001 | ‘ | &#145; | &lsquo; | Left single quotation mark
146 | 222 | 92 | 10010010 | ’ | &#146; | &rsquo; | Right single quotation mark
147 | 223 | 93 | 10010011 | “ | &#147; | &ldquo; | Left double quotation mark
148 | 224 | 94 | 10010100 | ” | &#148; | &rdquo; | Right double quotation mark
149 | 225 | 95 | 10010101 | • | &#149; | &bull; | Bullet
150 | 226 | 96 | 10010110 | – | &#150; | &ndash; | En dash
151 | 227 | 97 | 10010111 | — | &#151; | &mdash; | Em dash
152 | 230 | 98 | 10011000 | ˜ | &#152; | &tilde; | Small tilde
153 | 231 | 99 | 10011001 | ™ | &#153; | &trade; | Trade mark sign
154 | 232 | 9A | 10011010 | š | &#154; | &scaron; | Latin small letter s with caron
155 | 233 | 9B | 10011011 | › | &#155; | &rsaquo; | Single right-pointing angle quotation mark
156 | 234 | 9C | 10011100 | œ | &#156; | &oelig; | Latin small ligature oe
157 | 235 | 9D | 10011101 | | | |
158 | 236 | 9E | 10011110 | ž | &#158; | | Latin small letter z with caron
159 | 237 | 9F | 10011111 | Ÿ | &#159; | &Yuml; | Latin capital letter Y with diaeresis
160 | 240 | A0 | 10100000 | | &#160; | &nbsp; | Non-breaking space
161 | 241 | A1 | 10100001 | ¡ | &#161; | &iexcl; | Inverted exclamation mark
162 | 242 | A2 | 10100010 | ¢ | &#162; | &cent; | Cent sign
163 | 243 | A3 | 10100011 | £ | &#163; | &pound; | Pound sign
164 | 244 | A4 | 10100100 | ¤ | &#164; | &curren; | Currency sign
165 | 245 | A5 | 10100101 | ¥ | &#165; | &yen; | Yen sign
166 | 246 | A6 | 10100110 | ¦ | &#166; | &brvbar; | Pipe, Broken vertical bar
167 | 247 | A7 | 10100111 | § | &#167; | &sect; | Section sign
168 | 250 | A8 | 10101000 | ¨ | &#168; | &uml; | Spacing diaeresis - umlaut
169 | 251 | A9 | 10101001 | © | &#169; | &copy; | Copyright sign
170 | 252 | AA | 10101010 | ª | &#170; | &ordf; | Feminine ordinal indicator
171 | 253 | AB | 10101011 | « | &#171; | &laquo; | Left double angle quotes
172 | 254 | AC | 10101100 | ¬ | &#172; | &not; | Not sign
173 | 255 | AD | 10101101 | | &#173; | &shy; | Soft hyphen
174 | 256 | AE | 10101110 | ® | &#174; | &reg; | Registered trade mark sign
175 | 257 | AF | 10101111 | ¯ | &#175; | &macr; | Spacing macron - overline
176 | 260 | B0 | 10110000 | ° | &#176; | &deg; | Degree sign
177 | 261 | B1 | 10110001 | ± | &#177; | &plusmn; | Plus-or-minus sign
178 | 262 | B2 | 10110010 | ² | &#178; | &sup2; | Superscript two - squared
179 | 263 | B3 | 10110011 | ³ | &#179; | &sup3; | Superscript three - cubed
180 | 264 | B4 | 10110100 | ´ | &#180; | &acute; | Acute accent - spacing acute
181 | 265 | B5 | 10110101 | µ | &#181; | &micro; | Micro sign
182 | 266 | B6 | 10110110 | ¶ | &#182; | &para; | Pilcrow sign - paragraph sign
183 | 267 | B7 | 10110111 | · | &#183; | &middot; | Middle dot - Georgian comma
184 | 270 | B8 | 10111000 | ¸ | &#184; | &cedil; | Spacing cedilla
185 | 271 | B9 | 10111001 | ¹ | &#185; | &sup1; | Superscript one
186 | 272 | BA | 10111010 | º | &#186; | &ordm; | Masculine ordinal indicator
187 | 273 | BB | 10111011 | » | &#187; | &raquo; | Right double angle quotes
188 | 274 | BC | 10111100 | ¼ | &#188; | &frac14; | Fraction one quarter
189 | 275 | BD | 10111101 | ½ | &#189; | &frac12; | Fraction one half
190 | 276 | BE | 10111110 | ¾ | &#190; | &frac34; | Fraction three quarters
191 | 277 | BF | 10111111 | ¿ | &#191; | &iquest; | Inverted question mark
192 | 300 | C0 | 11000000 | À | &#192; | &Agrave; | Latin capital letter A with grave
193 | 301 | C1 | 11000001 | Á | &#193; | &Aacute; | Latin capital letter A with acute
194 | 302 | C2 | 11000010 | Â | &#194; | &Acirc; | Latin capital letter A with circumflex
195 | 303 | C3 | 11000011 | Ã | &#195; | &Atilde; | Latin capital letter A with tilde
196 | 304 | C4 | 11000100 | Ä | &#196; | &Auml; | Latin capital letter A with diaeresis
197 | 305 | C5 | 11000101 | Å | &#197; | &Aring; | Latin capital letter A with ring above
198 | 306 | C6 | 11000110 | Æ | &#198; | &AElig; | Latin capital letter AE
199 | 307 | C7 | 11000111 | Ç | &#199; | &Ccedil; | Latin capital letter C with cedilla
200 | 310 | C8 | 11001000 | È | &#200; | &Egrave; | Latin capital letter E with grave
201 | 311 | C9 | 11001001 | É | &#201; | &Eacute; | Latin capital letter E with acute
202 | 312 | CA | 11001010 | Ê | &#202; | &Ecirc; | Latin capital letter E with circumflex
203 | 313 | CB | 11001011 | Ë | &#203; | &Euml; | Latin capital letter E with diaeresis
204 | 314 | CC | 11001100 | Ì | &#204; | &Igrave; | Latin capital letter I with grave
205 | 315 | CD | 11001101 | Í | &#205; | &Iacute; | Latin capital letter I with acute
206 | 316 | CE | 11001110 | Î | &#206; | &Icirc; | Latin capital letter I with circumflex
207 | 317 | CF | 11001111 | Ï | &#207; | &Iuml; | Latin capital letter I with diaeresis
208 | 320 | D0 | 11010000 | Ð | &#208; | &ETH; | Latin capital letter ETH
209 | 321 | D1 | 11010001 | Ñ | &#209; | &Ntilde; | Latin capital letter N with tilde
210 | 322 | D2 | 11010010 | Ò | &#210; | &Ograve; | Latin capital letter O with grave
211 | 323 | D3 | 11010011 | Ó | &#211; | &Oacute; | Latin capital letter O with acute
212 | 324 | D4 | 11010100 | Ô | &#212; | &Ocirc; | Latin capital letter O with circumflex
213 | 325 | D5 | 11010101 | Õ | &#213; | &Otilde; | Latin capital letter O with tilde
214 | 326 | D6 | 11010110 | Ö | &#214; | &Ouml; | Latin capital letter O with diaeresis
215 | 327 | D7 | 11010111 | × | &#215; | &times; | Multiplication sign
216 | 330 | D8 | 11011000 | Ø | &#216; | &Oslash; | Latin capital letter O with slash
217 | 331 | D9 | 11011001 | Ù | &#217; | &Ugrave; | Latin capital letter U with grave
218 | 332 | DA | 11011010 | Ú | &#218; | &Uacute; | Latin capital letter U with acute
219 | 333 | DB | 11011011 | Û | &#219; | &Ucirc; | Latin capital letter U with circumflex
220 | 334 | DC | 11011100 | Ü | &#220; | &Uuml; | Latin capital letter U with diaeresis
221 | 335 | DD | 11011101 | Ý | &#221; | &Yacute; | Latin capital letter Y with acute
222 | 336 | DE | 11011110 | Þ | &#222; | &THORN; | Latin capital letter THORN
223 | 337 | DF | 11011111 | ß | &#223; | &szlig; | Latin small letter sharp s - ess-zed
224 | 340 | E0 | 11100000 | à | &#224; | &agrave; | Latin small letter a with grave
225 | 341 | E1 | 11100001 | á | &#225; | &aacute; | Latin small letter a with acute
226 | 342 | E2 | 11100010 | â | &#226; | &acirc; | Latin small letter a with circumflex
227 | 343 | E3 | 11100011 | ã | &#227; | &atilde; | Latin small letter a with tilde
228 | 344 | E4 | 11100100 | ä | &#228; | &auml; | Latin small letter a with diaeresis
229 | 345 | E5 | 11100101 | å | &#229; | &aring; | Latin small letter a with ring above
230 | 346 | E6 | 11100110 | æ | &#230; | &aelig; | Latin small letter ae
231 | 347 | E7 | 11100111 | ç | &#231; | &ccedil; | Latin small letter c with cedilla
232 | 350 | E8 | 11101000 | è | &#232; | &egrave; | Latin small letter e with grave
233 | 351 | E9 | 11101001 | é | &#233; | &eacute; | Latin small letter e with acute
234 | 352 | EA | 11101010 | ê | &#234; | &ecirc; | Latin small letter e with circumflex
235 | 353 | EB | 11101011 | ë | &#235; | &euml; | Latin small letter e with diaeresis
236 | 354 | EC | 11101100 | ì | &#236; | &igrave; | Latin small letter i with grave
237 | 355 | ED | 11101101 | í | &#237; | &iacute; | Latin small letter i with acute
238 | 356 | EE | 11101110 | î | &#238; | &icirc; | Latin small letter i with circumflex
239 | 357 | EF | 11101111 | ï | &#239; | &iuml; | Latin small letter i with diaeresis
240 | 360 | F0 | 11110000 | ð | &#240; | &eth; | Latin small letter eth
241 | 361 | F1 | 11110001 | ñ | &#241; | &ntilde; | Latin small letter n with tilde
242 | 362 | F2 | 11110010 | ò | &#242; | &ograve; | Latin small letter o with grave
243 | 363 | F3 | 11110011 | ó | &#243; | &oacute; | Latin small letter o with acute
244 | 364 | F4 | 11110100 | ô | &#244; | &ocirc; | Latin small letter o with circumflex
245 | 365 | F5 | 11110101 | õ | &#245; | &otilde; | Latin small letter o with tilde
246 | 366 | F6 | 11110110 | ö | &#246; | &ouml; | Latin small letter o with diaeresis
247 | 367 | F7 | 11110111 | ÷ | &#247; | &divide; | Division sign
248 | 370 | F8 | 11111000 | ø | &#248; | &oslash; | Latin small letter o with slash
249 | 371 | F9 | 11111001 | ù | &#249; | &ugrave; | Latin small letter u with grave
250 | 372 | FA | 11111010 | ú | &#250; | &uacute; | Latin small letter u with acute
251 | 373 | FB | 11111011 | û | &#251; | &ucirc; | Latin small letter u with circumflex
252 | 374 | FC | 11111100 | ü | &#252; | &uuml; | Latin small letter u with diaeresis
253 | 375 | FD | 11111101 | ý | &#253; | &yacute; | Latin small letter y with acute
254 | 376 | FE | 11111110 | þ | &#254; | &thorn; | Latin small letter thorn
255 | 377 | FF | 11111111 | ÿ | &#255; | &yuml; | Latin small letter y with diaeresis

 

Starting in the late 1980’s, a group of 8-bit character sets called “ISO/IEC 8859” was created to standardize non-English characters. ISO-8859-1, also known as Latin-1, is an extension of ASCII and can still be seen in modern databases. However, you had to know which of these sets to use in order to view a particular document, and it was not possible to include content spanning multiple charsets in the same document.
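To make the "you had to know which set to use" problem concrete, here is a small sketch that decodes the same single byte with three members of the ISO-8859 family. The choice of byte (0xE9) and of these three charsets is only for illustration.

```python
raw = bytes([0xE9])  # one byte with a value above the ASCII range

# The same byte maps to a different character in each 8-bit charset.
decodings = {name: raw.decode(name)
             for name in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}

for name, ch in decodings.items():
    # Print the charset, the resulting character, and its Unicode code point.
    print(name, ch, "U+{:04X}".format(ord(ch)))
```

The byte comes out as a Latin letter under ISO-8859-1, a Cyrillic letter under ISO-8859-5, and a Greek letter under ISO-8859-7, so a document without a declared charset cannot be read reliably.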

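Eventually, a character set called Unicode was established. It is not limited to 8 bits: it defines more than a million code points (U+0000 through U+10FFFF), enough to encode characters across all languages with plenty of space for additional characters to be added as needed. Unicode provides some backwards compatibility, since its first 128 code points match ASCII. Although Unicode solved the problem of not being able to account for characters across all languages, a great deal of existing software and many protocols had already been built around 8-bit bytes and ASCII-compatible text. So UTF-8 was established as an encoding of Unicode that works within those limits. It is backwards compatible with ASCII and uses between one and four bytes per character, depending on the code point:

  1. U+0000 to U+007F (the ASCII range) takes one byte, identical to ASCII.
  2. U+0080 to U+07FF (accented Latin letters, Greek, Cyrillic, Hebrew, Arabic, and others) takes two bytes.
  3. U+0800 to U+FFFF (including most Chinese, Japanese, and Korean characters) takes three bytes.
  4. U+10000 to U+10FFFF (such as emoji and rarer scripts) takes four bytes.

In a multi-byte sequence, the leading bits of the first byte indicate how many bytes the sequence contains, and every following byte starts with the bits 10.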
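A quick way to check how many bytes UTF-8 actually spends on a character is to encode a few samples and look at the bits. This is a minimal sketch; the helper name `utf8_bytes` and the sample characters are chosen only for illustration.

```python
def utf8_bytes(ch):
    """Return the UTF-8 encoding of ch as a list of 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


samples = ["A", "é", "幸", "\U0001F600"]  # the last one is an emoji (U+1F600)

for ch in samples:
    # Print the code point, the byte count, and the bit pattern of each byte.
    print("U+{:04X}".format(ord(ch)), len(ch.encode("utf-8")), utf8_bytes(ch))
```

A plain ASCII letter produces a single byte starting with 0, while the other samples produce two, three, and four bytes whose first byte starts with 110, 1110, and 11110 respectively.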

Due to its wide adoption, support across browsers, and MySQL support, UTF-8 is usually a good choice for Drupal.

Database handling of charsets

When importing data into a DB, the backwards compatibility of having the first 128 characters match across multiple sets can be a curse rather than a blessing: content that happens to contain only ASCII characters looks correct under either charset, so a mismatch can go unnoticed until the first non-ASCII character appears. Let’s say you’re doing an import and view a page whose content contains no ASCII-compatible characters at all, like "幸せな魚". If the page shows something beginning with "å¹¸ã" instead of "幸せな魚", then most likely your UTF-8 content was imported into an ISO-8859-1 DB. This problem is usually pretty obvious since the characters look nothing alike. If you do run into it, the easiest solution is to change the DB’s charset and re-run the import. If a complete re-import isn't an option, you can dump the corrupted data, re-import it into a UTF-8 DB, and then either correct bad characters manually as they’re found or use a script to identify and replace bad entries.
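The mismatch can be reproduced outside of any database. The sketch below encodes a string as UTF-8, misreads the bytes as ISO-8859-1 (Python's `latin-1` codec), and then reverses the mistake. It assumes every byte survived the round trip intact, which is not guaranteed for data that has already passed through a real database or export tool.

```python
original = "幸せな魚"

# What gets written: the UTF-8 bytes of the text.
stored = original.encode("utf-8")

# What a Latin-1 reader sees: one character per byte, so 12 characters.
garbled = stored.decode("latin-1")

# Reversing the mistake: turn the characters back into bytes, then decode
# those bytes with the charset they were really written in.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled))
print(repaired == original)
```

A repair script for corrupted rows would apply the same two-step reversal to each affected field, after confirming on a sample that the result is valid text.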

Charsets in URLs

Characters are everywhere we go, including browser URLs. While you might not normally think of it, URLs like http://mediacurrent.com are in fact made of English letters. The specification for URLs is very limited: they may consist only of English letters, digits, and a small set of punctuation such as $-_.+!*'(). Some characters, including &$+,/:;=?@, are reserved and have special meaning.

So what about people who want to use URLs that aren't in the English character set? For page/asset names, reserved and non-ASCII characters must be encoded in order to be included within a URL. This is done by writing each byte of the character as a “%” followed by two hexadecimal digits; for non-ASCII characters, the bytes are normally those of the UTF-8 encoding. For example, "幸せな魚" would be "%E5%B9%B8%E3%81%9B%E3%81%AA%E9%AD%9A". Be careful that slashes used to define the path are not also encoded, although Apache can be configured to handle this with the “AllowEncodedSlashes” directive if you have access to modify your server configuration.
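As a sketch of how this encoding looks in practice, Python's standard `urllib.parse.quote` percent-encodes a string using UTF-8 by default and leaves "/" alone unless told otherwise. The "/blog/" path is a made-up example.

```python
from urllib.parse import quote, unquote

name = "幸せな魚"

# Each UTF-8 byte of the name becomes "%" plus two hex digits.
encoded = quote(name)

# By default "/" is treated as safe, so path separators are preserved.
path = quote("/blog/" + name)

# With safe="" the slashes are encoded too, which changes the path's meaning.
everything = quote("/blog/" + name, safe="")

print(encoded)
print(path)
print(everything)
print(unquote(encoded) == name)
```

Encoding each path segment separately, or keeping "/" in the safe set as above, avoids accidentally turning path separators into "%2F".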

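Internationalized Domain Names use a somewhat different set of rules: instead of being percent-encoded, the host name is converted to an ASCII form using Punycode, producing labels that begin with “xn--”.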

Charset issues in browser content

It's common these days for sites to just mark all of their content as UTF-8 and not really have to worry about it. What happens if you are maintaining a site that specifies a different character set? In this case you'll want to check both the HTTP headers being sent and the meta tags in your HTML, since these can differ. When they do, browsers give the charset in the HTTP Content-Type header precedence over the meta tag. Here is more information on HTML character encoding.
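One way to spot a disagreement between the two declarations is to extract the charset from each and compare them. The sketch below works on plain strings so it can be run without a network connection. The function names and the sample header and markup are invented for illustration, and the regular expression is a simplification rather than a full HTML parser.

```python
import re
from email.message import Message


def charset_from_header(content_type):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


def charset_from_meta(html):
    """Find the charset declared in a <meta> tag, if any (simplified)."""
    match = re.search(r'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', html,
                      re.IGNORECASE)
    return match.group(1).lower() if match else None


def declared_charsets_differ(content_type, html):
    """True only when both declarations exist and disagree."""
    header, meta = charset_from_header(content_type), charset_from_meta(html)
    return header is not None and meta is not None and header != meta


header_value = "text/html; charset=ISO-8859-1"
page = '<html><head><meta charset="utf-8"></head><body></body></html>'
print(charset_from_header(header_value), charset_from_meta(page))
print(declared_charsets_differ(header_value, page))
```

In real use the header value would come from the server's response and the markup from the response body.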

Drupal Language support

We've gone over databases and HTML, but we usually use a content management system or a web framework to pull these two items together. Since we are a Drupal shop, we'll use Drupal 7 for our example. Drupal has multilingual support out of the box using the Locale module as well as a host of supporting third-party modules. It allows you to have multiple translations of a single piece of content available at the same time. For content type-specific control, the Entity Translation module can be used, which introduces a "Translate" tab.

Some key points to remember when using Drupal for multilingual sites:

  • Presenting translated content can be context-based or users can be provided with an option for viewing content in a different language.
  • The Location module can be used to determine a user’s geographic location to auto-select or suggest a language.
  • Content can be translated manually or by a computer. Manually translated content tends to have better accuracy but costs more.
  • The TranslateThis Button module uses JavaScript to do automated translations and supports many languages.
  • Drupal allows translated content to either share the same path or to have completely different URLs.
  • Use the Transliteration module to replace non-ASCII characters in the names of uploaded files.
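To give a feel for what transliterating a file name involves, here is a rough sketch in Python. It is not the Transliteration module's algorithm: it only strips accents through Unicode decomposition and replaces whatever is left over, and the function name `ascii_filename` is invented for this example.

```python
import re
import unicodedata


def ascii_filename(name):
    """Approximate an ASCII-only file name (illustrative sketch only)."""
    # Split accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop everything that is not ASCII, including the combining marks.
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Replace runs of remaining unsafe characters (spaces etc.) with "_".
    return re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_only)


print(ascii_filename("Crème brûlée.pdf"))
print(ascii_filename("幸せな魚.txt"))
```

Characters with no ASCII decomposition, such as the Japanese example, are simply dropped by this approach, which is why a dedicated transliteration tool with per-language mapping tables is a better fit for real sites.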

Ultimately, character sets, although somewhat complex, make life easier in a shrinking global community. We’ve gotten to the point where you can have a system that automatically handles these issues so users can focus on content. Still, having a basic understanding can help save troubleshooting time when something goes wrong.

 

Sources:

http://www.ascii-code.com
http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-ch…
http://www.bluebox.net/news/2009/07/mysql_encoding
http://stackoverflow.com/questions/1344692/i-need-help-fixing-broken-ut…
http://www.phpwact.org/php/i18n/charsets
http://www.sthlmconnection.se/sv/blog/languages-and-drupal-7-what-you-n…
http://evolvingweb.ca/story/content-translation-drupal-7