URI: http://www.j-a-b.net/web/char/char-specials
last updated: 2009-12-04
© 2002-2009
At times one needs to insert whitespace into a string of characters, e.g. when
writing code or tabular data. While no character lets you set tab stops in
HTML, there are quite a few other whitespace characters. When you write a
sentence you separate words by hitting the [SPACE] bar of your keyboard. Every
time you do this, a whitespace character is inserted into your text.
Its Unicode code point (CP) is 32 decimal (abbreviation: SP).
You may even insert dozens of SPACE characters consecutively, but what you see
when the document is rendered as HTML is just one single space between two
other characters: the SPACE character collapses. Note, however, that the
<pre> element preserves the whitespace produced by the SPACE character.
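The collapsing behaviour can be approximated outside a browser. The following Python sketch is only an illustration: the function name collapse_whitespace is hypothetical, and the regular expression is a simplification of what a browser does with normal text (i.e. text outside a <pre> element).

```python
import re

# Characters treated here as collapsible ASCII whitespace:
# space, tab, line feed, form feed, carriage return.
_COLLAPSIBLE = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text):
    """Approximate how runs of whitespace are rendered in normal HTML text."""
    return _COLLAPSIBLE.sub(" ", text)


source = "one" + " " * 12 + "two"
print(ord(" "))                           # code point of SPACE
print(repr(collapse_whitespace(source)))  # the twelve spaces become one
```

Characters outside that small set, such as the one with code point 160, are left untouched by this sketch.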
Now if you really want to separate characters by more than one space, you need
a character which does not collapse. This character is the NON BREAKING SPACE
(Unicode name: NO-BREAK SPACE, abbreviation: NBSP); its Unicode code point is
160 decimal and its named entity is &nbsp;. Older HTML editors applied this
character very often to generate whitespace, which is nowadays better achieved
using CSS. Apart from the NON BREAKING SPACE, other whitespace characters exist
that do not collapse, namely
EN SPACE (nut, abbreviation: ENSP),
EM SPACE (mutton, abbreviation: EMSP)
and
THIN SPACE (abbreviation: THSP),
which are wider or thinner than the normal space. Unlike the NON BREAKING
SPACE, these three still allow a line break. You may refer to the
Whitespace and Formatting Characters Chart for an example of how these characters are rendered.
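To check code points and official Unicode names for these spaces, a short Python sketch using only the standard library may help. It assumes the HTML named entities &nbsp;, &ensp;, &emsp; and &thinsp;; the helper name describe is hypothetical.

```python
import html
import unicodedata

ENTITIES = ["&nbsp;", "&ensp;", "&emsp;", "&thinsp;"]


def describe(entity):
    """Return (decimal code point, Unicode name) for a named entity."""
    char = html.unescape(entity)
    return ord(char), unicodedata.name(char)


for entity in ENTITIES:
    code_point, name = describe(entity)
    print(f"{entity:10} {code_point:5d}  {name}")
```

Note that the Unicode standard lists code point 160 under the name NO-BREAK SPACE.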
The SOFT HYPHEN (abbreviation: SHY), Unicode code point 173 decimal or named
entity &shy;, is a character denoting a possible syllable break inside a word.
At the time of writing, this character is not well supported, e.g. by Gecko
browsers. Depending on the language and the actual word, another problem may
occur which can be termed unintelligent hyphenation: in some languages,
hyphenation changes the spelling of a word. For example, the German word
Zucker (sugar) can be split into two syllables, Zuc|ker, which in traditional
German orthography are spelled Zuk-ker when separated. As the SOFT HYPHEN knows
nothing about such a change in spelling, the word would be rendered as Zuc-ker
when broken at a SOFT HYPHEN. Thus, when using this character, one has to be
careful to insert it only where breaking the word cannot cause a spelling error.
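The limitation described above can be made concrete with a small Python sketch. The helper names soft_hyphenate and to_entities are hypothetical, and the syllable split is supplied by hand rather than computed.

```python
SOFT_HYPHEN = "\u00ad"  # Unicode code point 173


def soft_hyphenate(syllables):
    """Join syllables with SOFT HYPHEN so a renderer may break between them."""
    return SOFT_HYPHEN.join(syllables)


def to_entities(text):
    """Make the invisible character visible as the named entity &shy;."""
    return text.replace(SOFT_HYPHEN, "&shy;")


word = soft_hyphenate(["Zuc", "ker"])
print(to_entities(word))                # shows where a break is allowed
print(word.replace(SOFT_HYPHEN, ""))    # the letters themselves are unchanged
```

The character only marks a position. It cannot replace letters, so a break at that position can only ever show the existing letters followed by a hyphen.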
The following example shows how the SOFT HYPHEN works. If your browser
does not support this character, or if you want to compare rendering in
different browsers, have a look at an illustration.
By resizing the example box you will notice how separation of words changes with
box size. The current default width of the box is 75%.
Far out in the unchartered backwaters of the unfashionable end of the western spiral arm of the Galaxy lies a small unregarded yellow sun. Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue-green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.
Douglas Adams — The Hitchhiker's Guide to the Galaxy
The source code for the above example, with each SOFT HYPHEN written as the
named entity &shy;:
Far out in the un&shy;char&shy;tered back&shy;wa&shy;ters
of the un&shy;fash&shy;ion&shy;a&shy;ble end of the
west&shy;ern spi&shy;ral arm of the Gal&shy;axy lies a small
un&shy;re&shy;garded yel&shy;low sun.
Or&shy;bit&shy;ing this at a dis&shy;tance of roughly
ninety-&shy;two mil&shy;lion miles is an ut&shy;ter&shy;ly
in&shy;sig&shy;nif&shy;i&shy;cant lit&shy;tle blue-&shy;green
plan&shy;et whose ape-&shy;de&shy;scended life forms are so
a&shy;maz&shy;ing&shy;ly prim&shy;i&shy;tive that they still think
dig&shy;i&shy;tal watch&shy;es are a pret&shy;ty neat idea.
Many languages contain characters which were originally composed of two distinct characters. Common ligatures are an example of such joining: in Scandinavian languages, for instance, the letters a and e can be joined, resulting in æ. Joining of characters is especially frequent in Arabic.
To accomplish this you may use the character ZERO WIDTH JOINER
(named entity &zwj;, Unicode CP 8205 decimal, abbreviation: ZWJ).
If, on the other hand, you do not want two adjacent characters capable of
forming some sort of ligature to be joined, the character
ZERO WIDTH NON-JOINER
(named entity &zwnj;, Unicode CP 8204 decimal, abbreviation: ZWNJ) is used.
As an example of how characters are joined using ZWJ, have a look at my name, Jens, written in Arabic,
where the distinct characters
ARABIC LETTER JEEM ج [Unicode CP 1580 decimal],
ARABIC KASRA ِ [Unicode CP 1616 decimal],
ARABIC LETTER NOON ن [Unicode CP 1606 decimal], and
ARABIC LETTER SEEN س [Unicode CP 1587 decimal]
are joined together.
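The four code points listed above can be assembled and inspected with a few lines of Python. This is only a sketch: whether the letters are actually displayed joined depends on the renderer, and the unjoined variant with ZERO WIDTH NON-JOINER is a hypothetical illustration, not something taken from the chart.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, code point 8204

# Decimal code points as listed in the text: JEEM, KASRA, NOON, SEEN.
CODE_POINTS = [1580, 1616, 1606, 1587]

name = "".join(chr(cp) for cp in CODE_POINTS)

for char in name:
    print(f"{ord(char):5d}  U+{ord(char):04X}  {unicodedata.name(char)}")

# Hypothetical variant: a ZWNJ before NOON and before SEEN asks the
# renderer not to join each of them with the preceding letter.
unjoined = name[:2] + ZWNJ + name[2] + ZWNJ + name[3]
print(len(name), len(unjoined))
```

The variant contains the same letters in the same order. Only the two invisible ZWNJ characters are added.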
It should be noted that the Opera browser, as opposed to Gecko browsers and IE,
does not join characters together when these characters are written inside
<pre> elements.
Some languages, like Arabic or Hebrew, are not written left-to-right but right-to-left.
Changing the direction inside a string of text can be accomplished using the characters
LEFT-TO-RIGHT MARK, Unicode CP 8206 decimal
(abbreviation: LRM) and
RIGHT-TO-LEFT MARK, Unicode CP 8207 decimal
(abbreviation: RLM).
The correct usage of these characters is quite complicated, as one has to take
into account the directionality of the involved characters as well. One might
be better off achieving similar results with the
dir attribute.
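The directionality of individual characters can be looked up in the Unicode character database. The following Python sketch uses the standard unicodedata module; the helper name bidi_classes and the sample string are assumptions made for illustration.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, code point 8206
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, code point 8207


def bidi_classes(text):
    """Return the Unicode bidirectional class of every character."""
    return [unicodedata.bidirectional(char) for char in text]


# A Latin letter, an Arabic letter, a punctuation mark, then both marks.
sample = "a" + "\u062c" + "!" + LRM + RLM
print(bidi_classes(sample))
```

The exclamation mark is reported as ON (other neutral), meaning it has no direction of its own, while LRM and RLM are reported as L and R. They are invisible characters with a strong direction, which is what lets them influence neighbouring neutral characters.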
Some possible variations and the resulting interaction between the dir
attribute, the CSS unicode-bidi
property, the directionality of normal characters, and the LRM and RLM characters
are listed
subsequent to the main chart. The CSS unicode-bidi property can
have the values {normal|embed|bidi-override|inherit}.
These values denote how the inherent directionality of a character shall be treated.
The chart compares three of these values: normal; embed, which
means that the inherent directionality is retained; and bidi-override,
which completely ignores inherent directionality in favour of the dir
attribute.
Due to the complicated nature of the bidi algorithm, browsers differ considerably in which set of rules they apply and in what order.