URI: http://www.j-a-b.net/web/char/char-general
last updated: 2009-12-04
© 2002-2009
The correct display of characters in browsers matters for both readability and internationalization (I18N). The following pages outline the options for character encoding and clarify terms that are often confused.
In the early days of computing, texts were mostly written in English, using a character set that contained the uppercase and lowercase letters of the Latin alphabet plus the digits 0 to 9, punctuation characters, and a few special characters. Depending on the machine in use, either the EBCDIC or the ASCII character encoding was used. Nowadays EBCDIC has become largely obsolete outside IBM mainframe systems, but ASCII is still widely used in computing and has become the basis of extended character sets.
From the 1960s well into the 1980s, more and more character sets and code pages were developed to cover languages with non-Latin alphabets. This led to growing incompatibilities in text processing, and the need for a unified, standardized character encoding became increasingly urgent. In response, the Unicode project was born around 1988, and the Unicode Consortium that grew out of it has since become the organization responsible for developing a standardized character encoding. The Unicode Standard is currently available in its fourth edition.
A character set is a collection of abstract characters. These abstract characters
can be defined by their encoding or by a formal description. The character
ARABIC LETTER GAF WITH THREE DOTS ABOVE (ڴ), for example, is an abstract entity of
the Arabic character set. In Unicode this entity is encoded as
1716 in decimal notation,
or 06B4 in hexadecimal notation (U+06B4).
A computer can then map this code to a table of glyphs. This mapping between
a character encoding and glyphs is called a font.
Thus an abstract character, e.g. the character Ä (LATIN CAPITAL LETTER A WITH DIAERESIS),
may be rendered as many different glyphs,
depending on the font in use. Unfortunately, many fonts cover only a very small part
of the Unicode character repertoire.
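The relation between an abstract character, its code point, and its encoded form can be sketched in a few lines of Python. This is only an illustration using the standard library: the variable names are chosen here, and the HTML numeric character references and the UTF-8 bytes are shown as two examples of how the same code point may be written down, not as the only possibilities.

```python
import html
import unicodedata

# Look up the abstract character by its formal Unicode name.
ch = unicodedata.lookup("ARABIC LETTER GAF WITH THREE DOTS ABOVE")

# The code point is the number Unicode assigns to this character.
code_point = ord(ch)

# The same number written as HTML numeric character references.
decimal_ref = f"&#{code_point};"
hex_ref = f"&#x{code_point:04X};"

# One possible byte-level encoding of that code point (UTF-8).
utf8_bytes = ch.encode("utf-8")

print(code_point, f"U+{code_point:04X}")
print(decimal_ref, hex_ref)
print(utf8_bytes)

# Both references resolve back to the same abstract character.
print(html.unescape(decimal_ref) == ch == html.unescape(hex_ref))
```

Which glyph is finally drawn for this character is not decided here; that depends on the font, as described above.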
A note on the term charset:
The W3C recommends using the term character encoding instead of the terms charset and character set:
Specifications SHOULD avoid using the terms 'character set' and 'charset'
to refer to a character encoding, except when the latter is used to refer to the
MIME charset parameter
or its IANA-registered
values. The term 'character encoding', or in specific cases the terms
'character encoding form' or 'character encoding scheme', are RECOMMENDED.
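The exception named in the quotation, the MIME charset parameter, can be illustrated with a minimal Python sketch. The Content-Type value below is a made-up example, not taken from any real response, and the standard-library email.message.Message class is used here only as a convenient header parser. The point is that the value of the charset parameter names a character encoding, and that the same abstract character yields different bytes under different encodings.

```python
from email.message import Message

# Hypothetical header value, chosen for illustration only.
msg = Message()
msg["Content-Type"] = "text/html; charset=ISO-8859-1"

# The MIME charset parameter names a character encoding.
# get_content_charset() returns the value in lower case.
encoding = msg.get_content_charset()

# The abstract character Ä (U+00C4) as bytes in two character encodings.
a_umlaut = "\u00c4"
latin1_bytes = a_umlaut.encode("iso-8859-1")
utf8_bytes = a_umlaut.encode("utf-8")

# Decoding with the encoding announced in the header recovers the character.
decoded = latin1_bytes.decode(encoding)

print(encoding)
print(latin1_bytes, utf8_bytes)
print(decoded == a_umlaut)
```

If the bytes were decoded with an encoding other than the one announced in the header, a different character or an error would result, which is why the declared character encoding has to match the one actually used.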