URI: http://www.j-a-b.net/web/char/char-codepage
last updated: 2009-12-04
© 2002-2009 Contact

8bit-Charsets — Character Encoding

Extended ASCII Charset - Codepages

The original ASCII charset [*] is a 7-bit encoding of characters. Computers now usually work with 8-bit (1-byte) information units. With ASCII, the spare eighth bit was often used as a parity bit for error detection. As the need to encode more and more characters grew throughout the sixties, this extra bit could be put to better use by doubling the number of possible characters in a charset from 128 to 256.
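The arithmetic behind this doubling, and the earlier use of the eighth bit for error detection, can be sketched in a few lines of Python. The helper name `with_even_parity` is made up for this illustration, and even parity is only one of the conventions that were in use.

```python
# Number of code points available with 7 and with 8 bits.
SEVEN_BIT = 2 ** 7   # 128
EIGHT_BIT = 2 ** 8   # 256


def with_even_parity(code: int) -> int:
    """Put an even-parity bit on top of a 7-bit ASCII code.

    Hypothetical helper, for illustration only.
    """
    if not 0 <= code < SEVEN_BIT:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity_bit = ones % 2            # 1 if the number of 1-bits is odd
    return code | (parity_bit << 7)  # store the parity bit in bit 8


print(SEVEN_BIT, EIGHT_BIT)
print(format(with_even_parity(ord("A")), "08b"))  # 'A' = 1000001, two 1-bits
print(format(with_even_parity(ord("C")), "08b"))  # 'C' = 1000011, three 1-bits
```

Once the eighth bit carries character data instead, the values 128 to 255 become available for additional characters.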

In this way, a myriad of extended character sets was developed. To contain the looming chaos, a variety of standards were proposed. ANSI (formerly ASA) developed the Extended ASCII Character Set, and the ISO published a series of character sets covering the European languages as well as Arabic, Greek, Hebrew, Cyrillic and Thai in the well-known ISO-8859 standard.

In contrast to these standardization efforts, IBM and, later, Microsoft published a multitude of character sets of their own, usually known as codepages. Unfortunately, Microsoft ignored existing standards by assigning printable characters to the range 128 to 159, which ISO-8859 reserves for control characters. This practice has led, and still leads, to incompatibilities and confusion.
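The incompatibility is easy to reproduce. The following sketch uses Python's built-in codecs and takes Windows-1252 (`cp1252`) as one example of such a codepage; it decodes the same three bytes from the range 128 to 159 once as Windows-1252 and once as ISO-8859-1.

```python
# Bytes 0x80-0x9F: printable characters in Windows-1252,
# control characters in ISO-8859-1.
raw = bytes([0x80, 0x93, 0x94])

as_cp1252 = raw.decode("cp1252")      # euro sign, left and right curly quotes
as_latin1 = raw.decode("iso-8859-1")  # three invisible control characters

print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])

# Outside that range the two encodings agree, e.g. for byte 0xE4.
print(bytes([0xE4]).decode("cp1252") == bytes([0xE4]).decode("iso-8859-1"))
```

A page written with curly quotes in Windows-1252 but declared as ISO-8859-1 therefore ends up with control characters where the quotes should be.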

Character Encoding in HTML

Webpages often contain characters outside the range of the ASCII charset. For a browser to parse and render such pages properly, the character encoding used throughout the document has to be declared. There are three ways to declare the character encoding of a webpage:

  1. with a <meta /> element inside the <head /> of the document
  2. in the HTTP header sent by the server
  3. with an encoding attribute in the XML prolog

The most commonly used method is a <meta /> element, e.g. <meta http-equiv="content-type" content="text/html; charset=UTF-8"> This example tells the browser that the document is a UTF-8 encoded HTML document.

If you can influence the server's response, for example via server-side scripting (PHP, Perl, ...), you should consider setting the content type in the HTTP header, e.g. Content-Type: application/xhtml+xml; charset=utf-8
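As an illustration of the header approach, here is a minimal sketch of a Python WSGI application that sends the header shown above. The names `PAGE` and `app` and the sample page are assumptions made for this example; the same idea applies to PHP, Perl or any other server-side language.

```python
# Sample XHTML page (hypothetical content for this sketch).
PAGE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Umlaut test: \u00e4\u00f6\u00fc</title></head>"
    "<body><p>\u00e4\u00f6\u00fc</p></body></html>"
)


def app(environ, start_response):
    """WSGI application that declares its encoding in the HTTP header."""
    body = PAGE.encode("utf-8")  # the bytes must really be UTF-8
    headers = [
        ("Content-Type", "application/xhtml+xml; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, the application can be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` from the Python standard library. Note that the declared charset and the encoding actually used for the body have to match.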

Last but not least, if you write your pages in XHTML/XML, you can declare the character encoding in an XML prolog, e.g. <?xml version="1.0" encoding="utf-8"?>

When the declarations contradict each other, the charset specified in the HTTP header has the highest priority, followed by the encoding given in an XML prolog; a declaration inside a <meta /> element has the lowest priority. When setting the character encoding through the HTTP header, keep in mind that people viewing a saved page offline still need encoding information. You should therefore always declare the character encoding of a page in the XML prolog or in a <meta /> element as well, depending on the content type (text/html, text/xml or application/xhtml+xml).
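The priority order described here can be sketched as a small Python function. All function names are hypothetical, and the regular expressions are deliberately simplistic; this is not the full detection algorithm a real browser uses, only the three sources and their order.

```python
import re
from typing import Optional


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)"?', content_type, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_xml_prolog(head: bytes) -> Optional[str]:
    """Read the encoding attribute of an XML prolog at the very start."""
    m = re.match(rb'<\?xml[^>]*?encoding\s*=\s*["\']([\w.:-]+)["\']', head)
    return m.group(1).decode("ascii").lower() if m else None


def charset_from_meta(head: bytes) -> Optional[str]:
    """Find a charset declared in a <meta> element."""
    m = re.search(rb'<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)', head, re.IGNORECASE)
    return m.group(1).decode("ascii").lower() if m else None


def declared_encoding(content_type: Optional[str], document: bytes) -> Optional[str]:
    """Apply the priority: HTTP header, then XML prolog, then <meta>."""
    head = document[:1024]  # declarations are expected near the top
    return (
        charset_from_header(content_type)
        or charset_from_xml_prolog(head)
        or charset_from_meta(head)
    )


doc = (
    b'<?xml version="1.0" encoding="iso-8859-1"?>\n'
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=UTF-8" /></head></html>'
)
print(declared_encoding("text/html; charset=koi8-r", doc))  # header wins
print(declared_encoding("text/html", doc))                  # prolog wins
print(declared_encoding(None, doc[doc.index(b"<html"):]))   # only <meta> left
```

For a page opened from disk there is no HTTP header, which corresponds to passing `None` as the first argument; in that case only the prolog or the <meta /> element can supply the encoding.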

Character Encoding in CSS

Like webpages, stylesheets should carry a character encoding declaration. Why? Although the syntax of a stylesheet uses only ASCII characters, property values and comments may include characters outside the ASCII range. A missing encoding declaration may therefore lead to parsing errors.

This actually happened some time ago when visiting Opera's pages with the Mozilla browser. Opera's stylesheet contained some comments with Scandinavian characters, but because no encoding information was given, parser errors made the stylesheet unreadable.

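The character encoding of a stylesheet is set with an at-rule, e.g. @charset "ISO-8859-1"; Note that the encoding name is written in double quotes and that there is no colon after @charset. This at-rule must be the very first thing in the stylesheet, preceding any other rules or comments.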
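How strict this placement rule is can be shown with a short Python sketch. It assumes the form defined in the CSS specification, `@charset "ISO-8859-1";`, starting at the first byte of the file; the function name `css_declared_charset` is made up for this example.

```python
import re
from typing import Optional

# The rule has to start at byte 0 and use double quotes, without a colon.
_CHARSET_RULE = re.compile(rb'@charset "([\w.:+-]+)";')


def css_declared_charset(stylesheet: bytes) -> Optional[str]:
    """Return the encoding named by a leading @charset rule, if any."""
    m = _CHARSET_RULE.match(stylesheet)  # match() only looks at the start
    return m.group(1).decode("ascii") if m else None


good = b'@charset "ISO-8859-1";\nbody { color: black }'
late = b'/* comment first */\n@charset "ISO-8859-1";'

print(css_declared_charset(good))
print(css_declared_charset(late))
```

In the second stylesheet the rule comes after a comment, so the sketch finds no declaration at all, which mirrors how a parser would ignore a misplaced @charset rule.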

ISO-8859

The following chart lists the ranges of the ISO-8859 charsets. You can view the characters of each charset on the following page by setting a character encoding of your choice.

Family of ISO-8859 Character Encodings

ISO-Standard   Alias[1]   Description
ISO 8859-1     Latin-1    Western: western European characters, the most commonly used ISO charset
ISO 8859-2     Latin-2    Central: central and east European characters
ISO 8859-3     Latin-3    Southern: southern European and Turkish characters
ISO 8859-4     Latin-4    Northern: Baltic (and Scandinavian) characters
ISO 8859-5     -          Cyrillic
ISO 8859-6     -          Arabic
ISO 8859-7     -          Greek
ISO 8859-8     -          Hebrew
ISO 8859-9     Latin-5    Turkish: mostly the same as Latin-1
ISO 8859-10    Latin-6    Northern: a variation of Latin-4
ISO 8859-11    -          Thai
ISO 8859-12    -          unused (is/was under way to cover Gaelic, Welsh and Irish)
ISO 8859-13    Latin-7    Baltic
ISO 8859-14    Latin-8    Celtic
ISO 8859-15    Latin-9    Western: a slight variant of Latin-1
ISO 8859-16    Latin-10   Central: a variation (combination) of Latin-2 and Latin-1

[1] For more aliases see the Charset Reference.
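To see how much the parts differ, one and the same byte can be decoded with each of them. The sketch below relies on the codecs shipped with Python (`iso8859_1` to `iso8859_16`, with no codec for the unused part 12); the function name `decode_everywhere` is an assumption made for this example.

```python
# Part numbers of the ISO-8859 family; part 12 does not exist.
PARTS = [n for n in range(1, 17) if n != 12]


def decode_everywhere(byte_value: int) -> dict:
    """Map 'ISO 8859-n' to the character the byte stands for in that part."""
    result = {}
    for n in PARTS:
        try:
            result[f"ISO 8859-{n}"] = bytes([byte_value]).decode(f"iso8859_{n}")
        except UnicodeDecodeError:
            result[f"ISO 8859-{n}"] = None  # byte is unassigned in this part
    return result


for name, char in decode_everywhere(0xE4).items():
    print(name, repr(char))
```

Byte 0xE4, for instance, is a small a with diaeresis in Latin-1, a Cyrillic letter in ISO 8859-5 and a Greek letter in ISO 8859-7. Byte 0xA4 shows the "slight variant" mentioned for Latin-9: it is the generic currency sign in Latin-1 and the euro sign in Latin-9.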

A note on the term charset:
The W3C recommends using the term character encoding instead of the terms charset and character set:
Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.

This page is licensed under a Creative Commons License.