URI: http://www.j-a-b.net/web/char/char-general
last updated: 2009-12-04
© 2002-2009 Contact

Character Encoding and Fonts

Displaying characters correctly in browsers matters for both readability and internationalization (I18N). The following pages outline the options for character encoding and clarify terms that are often confused.

In the early days of computing, texts were mostly written in English using a character set containing the uppercase and lowercase letters of the Latin alphabet plus the digits 0 to 9, punctuation characters, and a few special characters. Depending on the machines in use, either the EBCDIC or the ASCII character encoding was used. Nowadays EBCDIC has largely fallen out of use outside IBM mainframe environments, but ASCII is still widely used in computing and has become the basis of extended character sets.

Beginning in the 1960s and continuing well into the 1980s, more and more character sets and code pages were developed to cover languages with non-Latin alphabets. This led to growing incompatibilities in text processing, and the need for a unified, standardized character encoding became increasingly urgent. Thus the Unicode project was born around 1988. Since then it has grown into the organization responsible for developing a standardized character encoding. At the time of writing, the Unicode standard is available in its fourth edition.
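The incompatibility described above can be illustrated with a minimal sketch. It assumes three legacy encodings chosen purely for illustration (ISO 8859-1, Windows-1251 and ISO 8859-7, which the original text does not name) and uses Python's built-in codecs to decode one and the same byte under each of them.

```python
# One byte value, interpreted under three legacy code pages.
# The choice of code pages is an assumption made for this illustration.
raw = bytes([0xE4])

codepages = ["latin-1", "cp1251", "iso8859_7"]

# Decode the identical byte with each code page.
decoded = {name: raw.decode(name) for name in codepages}

for name, text in decoded.items():
    print(f"{name:10} -> {text}")
```

The byte is the same in every case, yet each code page yields a different letter (Latin, Cyrillic, Greek). Text exchanged without knowledge of the code page in use therefore cannot be read reliably.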

A character set is a collection of abstract characters. These abstract characters can be defined by their encoding or by a formal description. The character ARABIC LETTER GAF WITH THREE DOTS ABOVE (ڴ), for example, is an abstract entity of the Arabic character set. In Unicode this entity is encoded as 1716 (decimal) or 06B4 (hexadecimal). A computer can then map this code to a table of glyphs. This relation between a character encoding and glyphs is called a font. Thus an abstract character entity, e.g. the character Ä (U+00C4), may be presented as many different glyphs, depending on the font in use. Unfortunately, many fonts cover only a very limited part of the full Unicode repertoire.
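The step from an abstract character to its code can be sketched with Python's standard `unicodedata` module. This is only an illustration of the character-to-code relation for the Arabic letter named above. The additional serialization to UTF-8 and UTF-16 is an assumption added here to show that one code can have several byte forms. The mapping from code to glyph is done by the font and is not shown.

```python
import unicodedata

# Look up the abstract character by its formal Unicode name.
char = unicodedata.lookup("ARABIC LETTER GAF WITH THREE DOTS ABOVE")

# The numeric code that Unicode assigns to this character.
code_point = ord(char)
print(code_point)            # decimal form
print(f"{code_point:04X}")   # hexadecimal form

# The formal name can be recovered from the character again.
print(unicodedata.name(char))

# The same code serialized under two different character encodings.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")
print(utf8_bytes.hex(), utf16_bytes.hex())
```

Which glyph finally appears on screen depends on the font selected by the browser or operating system. If no available font covers the code, a replacement symbol is typically shown instead.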

A note on the term charset:
The W3C recommends using the term character encoding instead of the terms charset and character set:
"Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED."
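The one place where the term charset remains appropriate, the MIME charset parameter, can be sketched as follows. The header value is a made-up example, and Python's standard `email.message` module is used only as a convenient header parser. The parameter's value names a character encoding, which is then used to encode and decode text.

```python
from email.message import Message

# A hypothetical Content-Type header, as a server might send it.
header = Message()
header["Content-Type"] = "application/xhtml+xml; charset=UTF-8"

# The MIME "charset" parameter actually names a character encoding.
media_type = header.get_content_type()
encoding_name = header.get_content_charset()
print(media_type, encoding_name)

# Use that encoding to turn text into bytes and back again.
payload = "Ä".encode(encoding_name)
text = payload.decode(encoding_name)
print(payload.hex(), text)
```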

This page is licensed under a Creative Commons License.