last updated: 1970-01-01
© 2002-2009 Contact

The Unicode Charset — Introduction

The Unicode character encoding scheme can be understood as a standardized extension of the simple 7-bit ASCII charset. While ASCII encodes just 95 printable characters (128 code values less 33 control characters), the aim of Unicode is to encode virtually all characters: not only those of today's languages, but also those of dead languages, together with general symbols, typographical symbols (dingbats), diacritical marks, and mathematical symbols. The Unicode code space currently comprises 1,114,112 code points (U+0000 to U+10FFFF).
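The figures above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are chosen for illustration.

```python
import sys

# Printable ASCII runs from code 32 (space) to code 126 (tilde).
printable_ascii = [chr(cp) for cp in range(32, 127)]

# The control characters are codes 0-31 plus 127 (DEL).
control_ascii = [cp for cp in range(128) if cp < 32 or cp == 127]

# The Unicode code space consists of 17 planes of 65,536 code points each.
unicode_code_points = 17 * 65536

print(len(printable_ascii))   # 95
print(len(control_ascii))     # 33
print(unicode_code_points)    # 1114112

# Python exposes the highest code point (0x10FFFF) as sys.maxunicode.
print(sys.maxunicode + 1 == unicode_code_points)  # True
```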

What Unicode does not do is describe what a character should look like; that is, it does not define any glyph. Unicode just defines an abstract character and allocates a code point to it. One such definition is LATIN CAPITAL LETTER A WITH DIAERESIS; the corresponding code point is 196 in decimal or 00C4 in hexadecimal. How this character is rendered on screen or printed on paper depends on the font used, as the following illustration shows.
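The relation between a character name and its code point can be inspected with Python's standard unicodedata module. The following is a minimal sketch; it says nothing about glyphs, which remain the font's business.

```python
import unicodedata

# Look up the abstract character by its Unicode name.
ch = unicodedata.lookup("LATIN CAPITAL LETTER A WITH DIAERESIS")

# ord() returns the code point as an integer.
code_point = ord(ch)

print(code_point)            # 196
print(f"{code_point:04X}")   # 00C4

# The reverse direction: from code point back to the character name.
name = unicodedata.name(chr(0x00C4))
print(name)                  # LATIN CAPITAL LETTER A WITH DIAERESIS
```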

The character 'LATIN CAPITAL LETTER A WITH DIAERESIS' rendered in a variety of fonts

The extended ASCII character encodings and code pages use one byte to encode each character. Using two bytes instead of one leads to 65,536 possible code points. This range is called the Basic Multilingual Plane, BMP, in Unicode. To be able to encode more characters, three or even four bytes are necessary; four bytes would allow a theoretical 2^32 = 4,294,967,296 encodable characters.
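The byte arithmetic can be sketched as follows. The helper names values_for_bytes and in_bmp are made up for this illustration and are not part of any library.

```python
def values_for_bytes(n_bytes: int) -> int:
    """Number of distinct values that fit into n_bytes bytes (8 bits each)."""
    return 2 ** (8 * n_bytes)


def in_bmp(code_point: int) -> bool:
    """True if the code point lies in the Basic Multilingual Plane."""
    return 0x0000 <= code_point <= 0xFFFF


print(values_for_bytes(1))   # 256
print(values_for_bytes(2))   # 65536
print(values_for_bytes(4))   # 4294967296

print(in_bmp(0x00C4))        # True  (LATIN CAPITAL LETTER A WITH DIAERESIS)
print(in_bmp(0x1F600))       # False (a code point beyond the BMP)
```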

Unicode is transformed into a byte sequence using a Unicode Transformation Format, UTF. The byte sequence is then passed to programs for further processing. Several UTF formats exist: UTF-5, UTF-7, UTF-8, UTF-16 and UTF-32. The most important of these is UTF-8, which is understood by most browsers, word processors, etc. The term UTF-8 is also used in character encoding declarations to denote the use of Unicode character encoding.
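To see how the same character turns into different byte sequences, the sketch below encodes a single character with the codecs that ship with Python. UTF-5 is left out because Python's standard library has no codec for it, and the big-endian variants of UTF-16 and UTF-32 are chosen here only to avoid a byte order mark in the output.

```python
text = "\u00c4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

encoded = {}
for encoding in ("utf-7", "utf-8", "utf-16-be", "utf-32-be"):
    data = text.encode(encoding)
    encoded[encoding] = data
    # Show the bytes in hexadecimal together with their count.
    print(f"{encoding:10} {data.hex(' '):12} {len(data)} byte(s)")
    # Decoding with the same format restores the original character.
    assert data.decode(encoding) == text
```

For this character UTF-8 produces the two bytes C3 84, UTF-16 the two bytes 00 C4 and UTF-32 the four bytes 00 00 00 C4.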

As mentioned before, a font is necessary to transform the abstract characters into glyphs. A font maps each code point to a glyph, which means that a fully Unicode-supporting font should be able to map more than 60,000 characters. However, most fonts cover only a very limited range, mainly code points 32 to 255. The following chart lists fonts which are able to display a wider range of characters. Font examples are shown when clicking on a font's name.
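The idea of font coverage can be illustrated with a small sketch. It assumes that the set of code points a font supports is already known; the function coverage and the set latin_only_font are hypothetical and do not read any real font file.

```python
def coverage(supported: set, first: int, last: int) -> float:
    """Fraction of the code points first..last for which the font has a glyph."""
    wanted = range(first, last + 1)
    hits = sum(1 for code_point in wanted if code_point in supported)
    return hits / len(wanted)


# Hypothetical font that only maps code points 32 to 255 to glyphs.
latin_only_font = set(range(32, 256))

# Full coverage of its own limited range ...
print(coverage(latin_only_font, 32, 255))           # 1.0
# ... but no glyphs at all for the Cyrillic block U+0400..U+04FF.
print(coverage(latin_only_font, 0x0400, 0x04FF))    # 0.0
# Share of the Basic Multilingual Plane that this font covers.
print(coverage(latin_only_font, 0x0000, 0xFFFF))
```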

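Unicode Fonts & Sources

| Font | Number of Characters | Generic Font Family | Homepage/Information | Size | Download [1] |
|---|---|---|---|---|---|
| Bitstream CyberBase | 1249 | serif | Netscape - Public FTP Server | 171kB | download |
| Bitstream CyberBit | 29934 | serif | Netscape - Public FTP Server | 6227kB | download |
| APL Unicode Font Extended - SImPL | ~1000 | monospace | - APL-Fonts | 209kB | download |
| Code2000 | 34810 | serif | | 1219kB | download |
| Fixedsys Excelsior | >4100 | monospace | | 233kB | download |
| Gentium | 1387 | serif | SIL International - NRSI - Gentium | 601kB | download |
| Lucida Sans Unicode | 1776 | sans-serif | Dept. of Phonetics & Linguistics University College London | 298kB | download |
| Titus Cyberbit Basic | 9779 | serif | TITUS | 808kB | download |
| Arial | 1320 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 542kB | download |
| Times New Roman | 1320 | serif | SourceForge-Net: Smart package of Microsoft's core fonts | 647kB | download |
| Courier New | 1318 | monospace | SourceForge-Net: Smart package of Microsoft's core fonts | 632kB | download |
| Verdana | 893 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 344kB | download |
| Andale Mono | 659 | monospace | SourceForge-Net: Smart package of Microsoft's core fonts | 194kB | download |
| Arial Black | 669 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 165kB | download |
| Comic Sans MS | 574 | cursive | SourceForge-Net: Smart package of Microsoft's core fonts | 241kB | download |
| Georgia | 585 | serif | SourceForge-Net: Smart package of Microsoft's core fonts | 384kB | download |
| Impact | 661 | fantasy | SourceForge-Net: Smart package of Microsoft's core fonts | 170kB | download |
| Trebuchet MS | 576 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 349kB | download |
| Arial Unicode MS | 51180 | sans-serif | distributed with MS Office 2000, Office XP and Publisher 2002 | | |

[1] The fonts listed in this chart are, with the exception of Code2000, free of charge for private use. You should consult the homepage of the providers when considering commercial usage of these fonts.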

A note on the term charset:
The W3C recommends using the term character encoding instead of the terms charset and character set:
Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED.
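
The MIME charset parameter mentioned in this recommendation is what a program reads to learn which character encoding a document uses. A minimal sketch using Python's standard email package follows; the Content-Type value is an invented example, not taken from any real server.

```python
from email.message import Message

# Hypothetical header as a web server might send it.
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"

# get_content_charset() returns the charset parameter in lower case.
charset = msg.get_content_charset()
print(charset)  # utf-8

# The declared character encoding is then used to decode the document bytes.
body = b"\xc3\x84"
decoded = body.decode(charset)
print(decoded == "\u00c4")  # True
```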

This page is licensed under a Creative Commons License.