Technical reference manual

 
Introduction

Passing text, both as html and as plain text, can be a challenge when diacritics (like à, é, ï, ô, ñ) or other non-ASCII symbols (e.g. £, €, ‑, §, ©, ®) are involved. This section shows that common problems can be avoided by applying a few simple principles.

The following paragraphs will cover:

  • Encoding
  • Variable-width encoding
  • Encoding html
  • Encoding and xml
  • Lessons learned

If anything on this page needs further clarification, don't hesitate to use the Feedback button at the top. We take your comments seriously and will respond within 24 hours.

Encoding

The characters most commonly used in English text can be represented using a value that fits in one byte, 7 bits even. For decades this collection of characters and their corresponding values has been known (and standardised) as the ASCII character set.

Over the years, growing internationalisation increased the need for more symbols. First attempts, using the 8th bit, led to a confusing number of interpretations (see this Wikipedia article) that shared a severe limitation: 8 bits can only describe 256 symbols. A standardised multi-byte approach was called for.

It would have been possible to assign a fixed number of bytes to every character: 2 bytes can describe up to 65,536 characters, 3 bytes up to 16,777,216. This is an easy to understand and transparent approach; examples are the UCS-2 and (to some extent) UTF-16 encodings. The result, however, would not only introduce yet another arbitrary limitation but also waste a tremendous amount of space: the most frequently used characters would take up 2 or 3 bytes where one would suffice.

Welcome to the world of Unicode and variable-width encoding. Here, the first bits of a sequence indicate how many bytes (or more accurately, bits) make up the identification of a character or symbol, and how they should be interpreted.
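As an illustration of how the first bits announce the length of a sequence, here is a minimal Python sketch for UTF-8. The function name utf8_sequence_length is our own and not part of any library; the bit patterns are those defined by UTF-8.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return the number of bytes in a UTF-8 sequence, judged by its first byte."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:  # 0xxxxxxx: plain ASCII
        return 1
    if (lead_byte & 0b1110_0000) == 0b1100_0000:  # 110xxxxx: 2-byte sequence
        return 2
    if (lead_byte & 0b1111_0000) == 0b1110_0000:  # 1110xxxx: 3-byte sequence
        return 3
    if (lead_byte & 0b1111_1000) == 0b1111_0000:  # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and can never start a sequence
    raise ValueError(f"0x{lead_byte:02X} cannot be the first byte of a sequence")


for char in "a£€":
    encoded = char.encode("utf-8")
    print(char, [f"0x{b:02X}" for b in encoded], utf8_sequence_length(encoded[0]))
```

Every byte after the first one starts with the bits 10, which is how a decoder can tell a continuation byte from the start of a new character.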


Variable-width encoding

Let us consider the following text: £ € § © ® à é ï ô ñ. The following table shows how each of these characters translates to byte sequences in two very common encodings: UTF-8 and ISO-8859-1, sometimes referred to as Latin-1.

symbol   description                              latin-1   utf-8
£        pound sign                               0xA3      0xC2 0xA3
€        euro sign                                (none)    0xE2 0x82 0xAC
§        section sign                             0xA7      0xC2 0xA7
©        copyright sign                           0xA9      0xC2 0xA9
®        registered sign                          0xAE      0xC2 0xAE
à        latin small letter a with grave          0xE0      0xC3 0xA0
é        latin small letter e with acute          0xE9      0xC3 0xA9
ï        latin small letter i with diaeresis      0xEF      0xC3 0xAF
ô        latin small letter o with circumflex     0xF4      0xC3 0xB4
ñ        latin small letter n with tilde          0xF1      0xC3 0xB1
(space)  space                                    0x20      0x20

Note that latin-1 uses 1 byte per character (but lacks a definition of the euro symbol), whereas utf-8 takes up 1, 2 or 3 bytes for the characters shown here (and up to 4 bytes for other characters).
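The byte sequences listed above can be reproduced with a few lines of Python. This is only a sketch; the helper names hex_bytes and encode_or_none are our own.

```python
TEXT = "£ € § © ® à é ï ô ñ"


def hex_bytes(data: bytes) -> str:
    """Format bytes as space-separated hexadecimal values, e.g. '0xC2 0xA3'."""
    return " ".join(f"0x{b:02X}" for b in data)


def encode_or_none(char: str, encoding: str):
    """Encode a character, or return None when the encoding cannot represent it."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


for char in TEXT.split():
    latin1 = encode_or_none(char, "iso-8859-1")
    utf8 = char.encode("utf-8")
    latin1_column = hex_bytes(latin1) if latin1 is not None else "(none)"
    print(f"{char}  {latin1_column:<8}  {hex_bytes(utf8)}")
```

The euro sign is the only character in this text for which the ISO-8859-1 column stays empty.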

The buttons below allow you to download an xml file, either in UTF-8 or ISO-8859-1 encoding. Using a hex editor, you can inspect both downloads; concentrate on the <programDescriptionText> element where you will find the byte sequences described on this page.

Download the ISO-8859-1 version Download the UTF-8 version

Encoding html

When generating html output, encoding issues can usually be avoided by using html entities. For a large majority of diacritics and symbols, entities have been defined that take the form of &name; or &#number;, such as &copy;, &#x000A9; (hexadecimal) or &#169; (decimal) for the © symbol. Most programming languages targeting html or xml provide convenient tools to generate these codes; PHP, for instance, offers the htmlentities() function. There is one pitfall, though.

When such a function is used to convert a complete block of content, and this content contains html tags (<p>, <strong>, <em>), the opening and closing brackets will be converted as well, into &lt; and &gt; respectively. It is considered best practice in xml to present text and html in CDATA sections, but in some circumstances the combination of CDATA and encoded html can lead to unexpected results.

Suppose, e.g., a targeted system receives such an xml document, notices a CDATA-section and expects its content to be unencoded html. Now the system wishes to transfer the content to a subsystem after encoding; the subsystem will receive doubly encoded html such as &amp;lt;. A browser will have difficulty, to say the least, rendering this content properly.
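The pitfall can be reproduced in a few lines. The sketch below uses Python's standard html module rather than PHP's htmlentities(); the helper to_char_refs and the sample fragment are our own and serve only as an illustration.

```python
import html


def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character by a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


fragment = "<p>£ € ©</p>"

# Only the symbols are converted; the tags survive.
print(to_char_refs(fragment))   # <p>&#163; &#8364; &#169;</p>

# Escaping the complete block converts the brackets as well.
once = html.escape(fragment)
print(once)                     # &lt;p&gt;£ € ©&lt;/p&gt;

# A second system that escapes again produces doubly encoded html.
twice = html.escape(once)
print(twice)                    # &amp;lt;p&amp;gt;£ € ©&amp;lt;/p&amp;gt;

# Decoding once only gets back to the singly encoded form.
print(html.unescape(twice) == once)
```

A receiving system therefore needs to know how many times the content was encoded, which is why it helps to agree on a single convention for content inside CDATA sections.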

Slightly off-topic: exercise restraint when using html tags. Sometimes they rely on information not guaranteed to be available on the client's LMS, as in <div class="yourBrilliantCss"> or <a href="yourLocalResource">. The recommended subset to be used in EDU-DEX is <p>, <br>, <b>, <i>, <ul>, <ol>, <li>, <a>, <h1> and <h2>.


Encoding and xml

In xml, two encoding-related contexts exist: the xml header (or preamble), which contains an encoding name, and the actual encoding of the xml feed.

The xml header

In this header, the first line of your code, you indicate the encoding of the current document. It might look like <?xml version="1.0" encoding="UTF-8"?>. The consumer of the xml feed uses this information to correctly translate the byte sequences into characters and symbols.

Xml resource

This is the actual xml document, usually generated by server side applications written in php, .net, java, etcetera. The application will be configured to produce the xml using a well defined encoding. To demonstrate the interdependency between header and actual encoding, you can study 4 scenarios, each attempting to pass the symbols £ € § © ® à é ï ô and ñ in, say, a programDescriptionText element. In two of them (iso-iso and utf-utf) the encoding indicated in the header matches the actual encoding, in the other two (iso-utf and utf-iso) there is a discrepancy. The first part of each example's name relates to the header, the second part to the actual encoding. Let's have a look under the bonnet.

For each scenario, the first item shows the bytes (hexadecimal values) sent to the browser for the element programDescriptionText. The next item shows how these byte sequences are displayed in your browser. The last item provides an explanation of what is happening.

Scenario 1: utf-utf

Byte array sent to browser (30 bytes)
0xC2 0xA3 0x20 0xE2 0x82 0xAC 0x20 0xC2 0xA7 0x20 0xC2 0xA9 0x20 0xC2 0xAE 0x20 0xC3 0xA0 0x20 0xC3 0xA9 0x20 0xC3 0xAF 0x20 0xC3 0xB4 0x20 0xC3 0xB1

Displayed in browser as
£ € § © ® à é ï ô ñ

Encodings match, the parser doesn't complain, and the characters display fine.

Scenario 2: iso-iso

Byte array sent to browser (34 bytes)
0xC2 0xA3 0x20 0x26 0x23 0x38 0x33 0x36 0x34 0x3B 0x20 0xC2 0xA7 0x20 0xC2 0xA9 0x20 0xC2 0xAE 0x20 0xC3 0xA0 0x20 0xC3 0xA9 0x20 0xC3 0xAF 0x20 0xC3 0xB4 0x20 0xC3 0xB1

Displayed in browser as
£ € § © ® à é ï ô ñ

Again, the encodings match and the parser is happy. But the number of bytes is larger than in the previous scenario. As so often, the source of the xml is a database table. The column used here is latin1 encoded. In this encoding, however, no position is reserved for the Euro sign. Instead, this symbol was stored as a decimal character reference (&#8364;), taking up 7 bytes instead of the 3 bytes of scenario 1.

Scenario 3: utf-iso

Byte array sent to browser
Parsing fails

Displayed in browser as
Parsing fails

What happens here is a bit more complex. The xml header announces UTF-8, but the first "interesting" byte the parser encounters has the hexadecimal value 0xA3, the ISO code for the £ character. In UTF-8, 0xA3 is only valid as a continuation byte following a lead byte, so the parser cannot interpret it, raises an exception and stops working.

Scenario 4: iso-utf

Byte array sent to browser (51 bytes)
0xC3 0x82 0xC2 0xA3 0x20 0xC3 0xA2 0xC2 0x82 0xC2 0xAC 0x20 0xC3 0x82 0xC2 0xA7 0x20 0xC3 0x82 0xC2 0xA9 0x20 0xC3 0x82 0xC2 0xAE 0x20 0xC3 0x83 0xC2 0xA0 0x20 0xC3 0x83 0xC2 0xA9 0x20 0xC3 0x83 0xC2 0xAF 0x20 0xC3 0x83 0xC2 0xB4 0x20 0xC3 0x83 0xC2 0xB1

Displayed in browser as
Â£ â¬ Â§ Â© Â® Ã  Ã© Ã¯ Ã´ Ã±

This fourth scenario leads to a less dramatic, but equally undesirable result. The parser treats every single byte of a utf sequence as a separate iso character (0xC2 0xA3, for instance, becomes Â followed by £), leading to a valid, but garbled text.
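The two mismatch scenarios can be imitated outside a browser. The following Python sketch decodes bytes with the "wrong" codec; it does not involve an xml parser, so it only approximates what happens in scenarios 3 and 4. The function names are our own.

```python
TEXT = "£ € § © ® à é ï ô ñ"


def decode_iso_bytes_as_utf8(text: str) -> str:
    """Scenario 3 (utf-iso): ISO-8859-1 bytes announced as UTF-8."""
    # The euro sign is left out because ISO-8859-1 cannot represent it.
    iso_bytes = text.replace("€ ", "").encode("iso-8859-1")
    return iso_bytes.decode("utf-8")  # raises UnicodeDecodeError at byte 0xA3


def decode_utf8_bytes_as_iso(text: str) -> str:
    """Scenario 4 (iso-utf): UTF-8 bytes announced as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


try:
    decode_iso_bytes_as_utf8(TEXT)
except UnicodeDecodeError as error:
    print("Scenario 3 fails:", error)

garbled = decode_utf8_bytes_as_iso(TEXT)
print("Scenario 4 displays:", garbled)

# Sending the garbled text on as UTF-8 yields the 51 bytes shown above.
print(len(garbled.encode("utf-8")))

# As long as no bytes were lost, the damage can be undone by reversing the steps.
print(garbled.encode("iso-8859-1").decode("utf-8") == TEXT)
```

Note that the repair in the last line only works when the garbled text is still complete; it is no substitute for declaring the correct encoding in the first place.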


Lessons learned

As we have seen in the previous pages, the leading advice is to keep track of encoding issues from start to finish: all mismatches between actual content and indicated or implied encodings can lead to unexpected results and even severe failures.

  • The database

    Usually, the source of information will be a database object. Ascertain that the object's encoding supports the information being stored. Failure to do so will very often hide potential problems, as was demonstrated on the previous page where a Euro sign was silently translated to a numeric character reference. When targeting a website, this representation will probably display fine, but when the targeted system expects non-html text, prepare for bizarre output.

  • Generating xml

    A mismatch between the actual encoding (the byte sequences generated by the xml generator) and the encoding declared in the xml header is bound to introduce effects ranging from garbled output to catastrophic failures; see the utf-iso scenario on the previous page.
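One way to keep header and actual encoding in step is to let a single library call produce both. The sketch below uses Python's xml.etree.ElementTree (the xml_declaration argument requires Python 3.8 or later) and assumes a feed consisting of one programDescriptionText element; the function build_feed is our own.

```python
import xml.etree.ElementTree as ET

TEXT = "£ € § © ® à é ï ô ñ"


def build_feed(encoding: str) -> bytes:
    """Serialise one element; the header and the bytes share the same encoding."""
    root = ET.Element("programDescriptionText")
    root.text = TEXT
    return ET.tostring(root, encoding=encoding, xml_declaration=True)


for encoding in ("UTF-8", "ISO-8859-1"):
    feed = build_feed(encoding)
    # The parser reads the encoding from the header and decodes accordingly.
    parsed = ET.fromstring(feed)
    print(encoding, len(feed), "bytes, round trip ok:", parsed.text == TEXT)

# ISO-8859-1 has no euro sign, so the serialiser falls back to a character reference.
print(b"&#8364;" in build_feed("ISO-8859-1"))
```

Because the encoding name is passed only once, the declaration cannot drift away from the bytes that are actually written.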

