URI: http://www.j-a-b.net/web/char/char-unicode
last updated: 2009-12-04
© 2002-2009
Unicode can be understood as a standardized extension of the simple 7-bit ASCII charset. While ASCII encodes just 95 printable characters (128 codes less 33 control characters), the aim of Unicode is to encode virtually all characters: not only those of today's languages, but also those of dead languages, as well as general symbols, typographical symbols (dingbats), diacritical marks, and mathematical symbols. As of today Unicode provides 1,114,112 code points (U+0000 to U+10FFFF), although not all of them are assigned to characters.
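The figures in this paragraph can be checked with a few lines of Python. This is only a small sketch: the variable names are chosen for illustration, and `sys.maxunicode` is Python's own constant for the highest code point it supports.

```python
import sys

# ASCII defines 128 codes (0..127); codes 0..31 and 127 are control characters.
ascii_codes = range(128)
control_codes = [c for c in ascii_codes if c < 32 or c == 127]
printable_codes = [c for c in ascii_codes if c not in control_codes]

# Unicode code points run from U+0000 to U+10FFFF inclusive.
unicode_code_points = sys.maxunicode + 1

print(len(control_codes), len(printable_codes), unicode_code_points)
```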
What Unicode does not do is describe what a character should look like;
that is, it does not define any glyph. Unicode just defines an abstract character
and allocates a code point to this character. One such definition is
LATIN CAPITAL LETTER A WITH DIAERESIS; the corresponding code point
is 196 in decimal or 00C4 in hexadecimal (written U+00C4). How this
character is rendered on screen or printed on paper depends on the font used,
as the following illustration shows.
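The relation between an abstract character, its name, and its code point can be looked up with Python's standard `unicodedata` module. The snippet is a minimal sketch; the variable names are illustrative only.

```python
import unicodedata

ch = "\u00C4"                    # the abstract character, given by its code point
code_point = ord(ch)             # code point as a decimal integer
name = unicodedata.name(ch)      # official Unicode character name

# prints: 196 U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS
print(code_point, f"U+{code_point:04X}", name)
```

Nothing in this lookup says anything about the glyph; which shape appears on screen is still decided by the font.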
The extended ASCII character encoding schemes and code pages use one byte to encode each character, which allows 256 code points. Using two bytes instead of one leads to 65,536 possible code points. In Unicode this range is called the Basic Multilingual Plane (BMP). To be able to encode more characters, three or even four bytes are necessary; four bytes would lead to a theoretical 2^32 = 4,294,967,296 encodable characters.
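The byte arithmetic above can be sketched as follows. The helper names `code_points_for` and `in_bmp` are hypothetical and exist only for this example.

```python
def code_points_for(num_bytes: int) -> int:
    """Number of distinct values that fit into num_bytes bytes."""
    return 256 ** num_bytes

def in_bmp(ch: str) -> bool:
    """True if the character's code point lies in the two-byte range 0..65535."""
    return ord(ch) <= 0xFFFF

for n in (1, 2, 3, 4):
    print(n, "byte(s):", code_points_for(n))

print(in_bmp("\u00C4"))       # a character inside the BMP
print(in_bmp("\U0001F600"))   # a character beyond the BMP
```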
Unicode code points are transformed into a byte sequence using a Unicode Transformation Format,
UTF. The byte sequence is then
passed to programs for further processing. Several UTF formats exist:
UTF-5, UTF-7, UTF-8, UTF-16 and UTF-32. The most important of these is UTF-8, which
is understood by most browsers, word processors etc.
The term UTF-8 is also used in character encoding declarations to denote
the use of Unicode character encoding.
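To see how the same code point turns into different byte sequences, the character U+00C4 can be encoded with several of the formats named above. This is a sketch using the codecs shipped with Python; the big-endian variants are chosen here so that no byte order mark is added, and UTF-5 is left out because Python has no codec for it.

```python
text = "\u00C4"  # LATIN CAPITAL LETTER A WITH DIAERESIS

encoded = {}
for encoding in ("utf-7", "utf-8", "utf-16-be", "utf-32-be"):
    encoded[encoding] = text.encode(encoding)
    # Show each byte sequence as hexadecimal digits.
    print(f"{encoding:10} {encoded[encoding].hex(' ')}")

# Decoding with the matching format restores the original character.
restored = {enc: data.decode(enc) for enc, data in encoded.items()}
```

UTF-8 needs two bytes for this character, UTF-16 two, and UTF-32 always four.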
As mentioned before, a font is necessary to transform the abstract characters into glyphs. A font maps each code point to a glyph, which means that a fully Unicode-supporting font should be able to map more than 60,000 characters. However, most fonts only cover a very limited range, mainly code points 32 to 255. The following chart lists fonts which are able to display a wider range of characters. Font examples are shown when clicking on a font's name.
| Font [1] | Number of Characters | Generic Font Family | Homepage/Information | Size |
|---|---|---|---|---|
| Bitstream CyberBase | 1249 | serif | Netscape - Public FTP Server | 171 kB |
| Bitstream CyberBit | 29934 | serif | Netscape - Public FTP Server | 6227 kB |
| APL Unicode Font Extended - SImPL | ~1000 | monospace | Vector.org - APL-Fonts | 209 kB |
| Code2000 | 34810 | serif | http://home.att.net/~jameskass/ (shareware) | 1219 kB |
| Fixedsys Excelsior | >4100 | monospace | http://www.fixedsys.org/ | 233 kB |
| Gentium | 1387 | serif | SIL International - NRSI - Gentium | 601 kB |
| Lucida Sans Unicode | 1776 | sans-serif | Dept. of Phonetics & Linguistics, University College London | 298 kB |
| Titus Cyberbit Basic | 9779 | serif | TITUS | 808 kB |
| Arial | 1320 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 542 kB |
| Times New Roman | 1320 | serif | SourceForge-Net: Smart package of Microsoft's core fonts | 647 kB |
| Courier New | 1318 | monospace | SourceForge-Net: Smart package of Microsoft's core fonts | 632 kB |
| Verdana | 893 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 344 kB |
| Andale Mono | 659 | monospace | SourceForge-Net: Smart package of Microsoft's core fonts | 194 kB |
| Arial Black | 669 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 165 kB |
| Comic Sans MS | 574 | cursive | SourceForge-Net: Smart package of Microsoft's core fonts | 241 kB |
| Georgia | 585 | serif | SourceForge-Net: Smart package of Microsoft's core fonts | 384 kB |
| Impact | 661 | fantasy | SourceForge-Net: Smart package of Microsoft's core fonts | 170 kB |
| Trebuchet MS | 576 | sans-serif | SourceForge-Net: Smart package of Microsoft's core fonts | 349 kB |
| Arial Unicode MS | 51180 | sans-serif | distributed with MS Office 2000, Office XP and Publisher 2002 | |

[1] The fonts listed in this chart are, with the exception of Code2000, free of charge for private use. You should consult the homepage of the providers when considering commercial usage of these fonts.
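The idea that a font maps code points to glyphs, and that most fonts cover only a limited range, can be illustrated with a toy model. Everything here is hypothetical: `TOY_FONT`, `glyph_for` and `coverage` are invented for this sketch and do not read a real font file. The only borrowed convention is the name `.notdef`, which fonts commonly use for the placeholder glyph shown when a character is missing.

```python
# Toy "font": covers only code points 32..255, each mapped to a made-up glyph name.
TOY_FONT = {cp: f"glyph{cp:04X}" for cp in range(32, 256)}

def glyph_for(font: dict, ch: str) -> str:
    """Return the glyph name for a character, or '.notdef' if the font lacks it."""
    return font.get(ord(ch), ".notdef")

def coverage(font: dict, text: str) -> float:
    """Fraction of the distinct characters in text that the font can display."""
    chars = set(text)
    if not chars:
        return 1.0
    return sum(1 for ch in chars if ord(ch) in font) / len(chars)

print(glyph_for(TOY_FONT, "\u00C4"))   # covered: Latin-1 range
print(glyph_for(TOY_FONT, "\u03A9"))   # not covered: GREEK CAPITAL LETTER OMEGA
print(coverage(TOY_FONT, "A\u03A9"))
```

A font with wider coverage, such as those in the chart, would simply have many more entries in its mapping.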
A note on the term charset:
The W3C recommends using the term character encoding instead of the terms charset and character set:
"Specifications SHOULD avoid using the terms 'character set' and 'charset' to refer to a character encoding, except when the latter is used to refer to the MIME charset parameter or its IANA-registered values. The term 'character encoding', or in specific cases the terms 'character encoding form' or 'character encoding scheme', are RECOMMENDED."
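The MIME charset parameter mentioned in the quotation is the place where the name of a character encoding, such as UTF-8, is declared. A minimal sketch with Python's standard `email` package follows; the header value and the two-byte payload are made up for the example.

```python
from email.message import Message

# A Content-Type header carrying a MIME charset parameter (example value).
msg = Message()
msg["Content-Type"] = "text/html; charset=UTF-8"

# get_content_charset() returns the parameter value in lower case.
declared = msg.get_content_charset()

# The declared name is then used as the character encoding for decoding bytes.
payload = b"\xc3\x84"
decoded = payload.decode(declared)

print(declared, decoded)
```

Although the parameter is called charset, its value names a character encoding, which is exactly the distinction the W3C wording draws.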