URI: http://www.j-a-b.net/web/char/char-specials
last updated: 2009-12-04
© 2002-2009 Contact

Whitespace and Formatting Characters

Whitespace

At times one needs to insert whitespace into a string of characters, e.g. when writing code or tabular data. While no character can set tab stops in HTML, there are quite a few other whitespace characters. When you write a sentence, you separate words by hitting the [SPACE] bar of your keyboard. Every time you do this, a whitespace character is inserted into your text. Its Unicode code point (CP) is 32dec (name SPACE, abbreviation: SP).

You may even insert dozens of SPACE characters consecutively, but when the document is rendered as HTML you see just a single space between the two neighbouring characters: consecutive SPACE characters collapse. It should however be noted that the <pre /> element preserves the whitespace produced by the SPACE character.

Now if you really want to separate characters by more than one space, you need a character which does not collapse. This character is the NON BREAKING SPACE (abbreviation: NBSP, called NO-BREAK SPACE in Unicode), its Unicode code point being 160dec and its named entity being &nbsp;. Older HTML editors applied this character very often to generate whitespace, which is nowadays better achieved using CSS. Apart from the NON BREAKING SPACE, other whitespace characters exist which do not collapse, namely EN SPACE (nut, abbreviation: ENSP), EM SPACE (mutton, abbreviation: EMSP) and THIN SPACE (abbreviation: THSP), which are wider or thinner than the normal space. Unlike the NON BREAKING SPACE, these characters do allow a line break. You may refer to the Whitespace and Formatting Characters Chart for an example of how these characters are rendered.
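The collapsing behaviour described above can be imitated in a few lines of Python. This is only a rough sketch under stated assumptions: the function name collapse_spaces is hypothetical, and the rule it applies (squeeze runs of SPACE, tab, carriage return and line feed into one SPACE) is a simplification, not a browser's actual algorithm.

```python
import re
import unicodedata

SPACE = chr(32)          # U+0020 SPACE
NBSP = chr(160)          # U+00A0, named NO-BREAK SPACE in Unicode
EN_SPACE = chr(0x2002)
EM_SPACE = chr(0x2003)
THIN_SPACE = chr(0x2009)


def collapse_spaces(text: str) -> str:
    """Squeeze each run of SPACE, tab, CR or LF into a single SPACE.

    The character class is spelled out on purpose: Python's \\s would
    also match NBSP and EM SPACE, which must survive untouched here.
    """
    return re.sub(r"[ \t\r\n]+", " ", text)


# Five SPACEs, three NBSPs and two EM SPACEs between letters.
sample = "a" + SPACE * 5 + "b" + NBSP * 3 + "c" + EM_SPACE * 2 + "d"
collapsed = collapse_spaces(sample)

# List the code point and Unicode name of each character discussed.
for ch in (SPACE, NBSP, EN_SPACE, EM_SPACE, THIN_SPACE):
    print(ord(ch), unicodedata.name(ch))

# Only the run of ordinary SPACE characters has shrunk.
print(repr(collapsed))
```

In this sketch the five SPACE characters shrink to one, while the NBSP and EM SPACE runs keep their full length.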

Hyphenation

The SOFT HYPHEN (abbreviation: SHY), Unicode code point 173dec or named entity &shy;, is a character denoting a possible syllable break inside a word. This character is not well supported, e.g. by Gecko browsers. Depending on the language and the actual word, another problem may occur which can be termed unintelligent hyphenation: in some languages, hyphenation changes the spelling of a word. For example, the German word Zucker (sugar) can be split into the two syllables Zuc|ker, which in traditional German orthography are spelled Zuk-ker when separated. As the SOFT HYPHEN knows nothing about such changes in spelling, the word would become Zuc-ker when separated using the SOFT HYPHEN. Thus, when using this character, one has to be careful to insert it only where no spelling errors will result.

The following example shows how the SOFT HYPHEN works. If your browser does not support this character, or if you want to compare rendering in different browsers, have a look at an illustration. By resizing the example box you will notice how the separation of words changes with box size. The current default width of the box is 75%.

Far out in the un­char­tered back­wa­ters of the un­fash­ion­a­ble end of the west­ern spi­ral arm of the Gal­axy lies a small un­re­garded yel­low sun. Or­bit­ing this at a dis­tance of roughly ninety-­two mil­lion miles is an ut­ter­ly in­sig­nif­i­cant lit­tle blue-­green plan­et whose ape-­de­scended life forms are so a­maz­ing­ly prim­i­tive that they still think dig­i­tal watch­es are a pret­ty neat idea.
Douglas Adams — The Hitchhiker's Guide to the Galaxy

The source code for the above example, with each SOFT HYPHEN written as the numeric character reference &#173;:

Far out in the un&#173;char&#173;tered back&#173;wa&#173;ters of the un&#173;fash&#173;ion&#173;a&#173;ble end of the west&#173;ern spi&#173;ral arm of the Gal&#173;axy lies a small un&#173;re&#173;garded yel&#173;low sun. Or&#173;bit&#173;ing this at a dis&#173;tance of roughly ninety-&#173;two mil&#173;lion miles is an ut&#173;ter&#173;ly in&#173;sig&#173;nif&#173;i&#173;cant lit&#173;tle blue-&#173;green plan&#173;et whose ape-&#173;de&#173;scended life forms are so a&#173;maz&#173;ing&#173;ly prim&#173;i&#173;tive that they still think dig&#173;i&#173;tal watch&#173;es are a pret&#173;ty neat idea.
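Markup like this is tedious to type by hand. One possible approach, sketched here in Python, is to write the words with a syllable marker and convert the markers afterwards. The helper names soften and strip_soft_hyphens and the use of "|" as the marker are assumptions made for this illustration; the break positions still have to be chosen by a human, which is exactly the spelling caveat raised for Zucker.

```python
SHY_ENTITY = "&#173;"   # numeric character reference for SOFT HYPHEN
SHY_CHAR = chr(173)     # the SOFT HYPHEN character itself (U+00AD)


def soften(marked: str, as_entity: bool = True) -> str:
    """Replace every '|' syllable marker with a SOFT HYPHEN.

    With as_entity=True the result is HTML source text; otherwise the
    raw character is inserted.
    """
    return marked.replace("|", SHY_ENTITY if as_entity else SHY_CHAR)


def strip_soft_hyphens(text: str) -> str:
    """Remove raw SOFT HYPHEN characters, e.g. before comparing strings."""
    return text.replace(SHY_CHAR, "")


# HTML source form, as used in the example above.
html = soften("un|fash|ion|a|ble")
print(html)

# Raw form: the two extra characters are invisible unless a line breaks there.
raw = soften("back|wa|ters", as_entity=False)
print(len(raw), len(strip_soft_hyphens(raw)))
```

Stripping the raw characters again is useful when comparing or searching text, because a string containing SOFT HYPHENs is not equal to the plain word.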

Letter Joining

Many languages contain characters which were originally composed of two distinct characters. Common ligatures are an example of such joining, e.g. in Scandinavian languages the letters a and e can be joined, resulting in æ. But such joining of characters is found especially often in Arabic.

To accomplish this you may use the character ZERO WIDTH JOINER &#8205; (abbreviation: ZWJ). If, on the other hand, you do not want two adjacent characters capable of forming some sort of ligature to be joined, the character ZERO WIDTH NON-JOINER &#8204; (abbreviation: ZWNJ) is used.

As an example of how characters are joined using ZWJ, have a look at my name, Jens, written in Arabic, where the distinct characters ARABIC LETTER JEEM ج [Unicode CP 1580dec], ARABIC KASRA ِ [Unicode CP 1616dec], ARABIC LETTER NOON ن [Unicode CP 1606dec], and ARABIC LETTER SEEN س [Unicode CP 1587dec] are joined together.

[Figure: The four distinct characters used to form the word Jens]
[Figure: Joined characters forming the word Jens (or, to be more correct, jins)]
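The code points listed above can be checked and assembled programmatically. The following Python sketch builds the word from its decimal code points and then a second variant in which ZWNJ is placed between the letters. The helper name clusters is hypothetical, and whether the two strings actually look different depends on the renderer and font; it is an assumption here that a renderer which shapes Arabic text joins adjacent letters unless ZWNJ intervenes.

```python
import unicodedata

ZWJ = chr(8205)    # ZERO WIDTH JOINER
ZWNJ = chr(8204)   # ZERO WIDTH NON-JOINER

# Decimal code points from the text: jeem, kasra, noon, seen.
code_points = [1580, 1616, 1606, 1587]
letters = [chr(cp) for cp in code_points]
names = [unicodedata.name(ch) for ch in letters]

# The word in logical (typing) order.
joined_word = "".join(letters)


def clusters(text: str) -> list[str]:
    """Group each base letter with the combining marks that follow it."""
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # e.g. KASRA stays attached to JEEM
        else:
            groups.append(ch)
    return groups


# Same letters, but with ZWNJ between clusters to request unjoined forms.
separated_word = ZWNJ.join(clusters(joined_word))

for cp, name in zip(code_points, names):
    print(cp, name)
print(len(joined_word), len(separated_word))
```

Grouping the KASRA with the preceding JEEM keeps the vowel mark attached to its base letter, so the ZWNJ is only ever inserted between letters.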

It should be noted that the Opera browser, as opposed to Gecko browsers and IE, does not join characters together when they are written inside <pre /> elements.

Direction

Some languages, like Arabic or Hebrew, are not written left-to-right but right-to-left. Changing the direction inside a string of text can be accomplished using the characters LEFT-TO-RIGHT MARK, Unicode CP 8206dec (abbreviation: LRM), and RIGHT-TO-LEFT MARK, Unicode CP 8207dec (abbreviation: RLM). The correct usage of these characters is quite complicated, as one has to take into account the directionality of the involved characters as well. One might be better off achieving similar results with the dir attribute.
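The directionality of the involved characters mentioned above can be inspected with Python's standard unicodedata module. This is a small sketch only; the sample characters and the dictionary labels are chosen for illustration and are not taken from the chart.

```python
import unicodedata

LRM = chr(8206)   # LEFT-TO-RIGHT MARK
RLM = chr(8207)   # RIGHT-TO-LEFT MARK

# A few characters with different inherent directionality.
samples = {
    "LATIN A": "A",
    "HEBREW ALEF": chr(0x05D0),
    "ARABIC JEEM": chr(1580),
    "DIGIT ONE": "1",
    "SPACE": " ",
    "LRM": LRM,
    "RLM": RLM,
}

# unicodedata.bidirectional() returns the Unicode bidi class as a string.
bidi_classes = {
    label: unicodedata.bidirectional(ch) for label, ch in samples.items()
}

for label, bidi_class in bidi_classes.items():
    print(f"{label:12} {bidi_class}")
```

LRM and RLM are invisible but carry a strong direction (classes L and R), whereas the SPACE is neutral (class WS) and takes its direction from its neighbours. That is why placing a mark next to a neutral character can change where it ends up.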

Some possible variations and the resulting interaction between the dir attribute, the CSS unicode-bidi property, the directionality of normal characters, and the LRM and RLM characters are listed after the main chart. The CSS unicode-bidi property can have the values {normal|embed|bidi-override|inherit}.

These values denote how the inherent directionality of a character shall be treated. The chart compares three values: normal, embed (the inherent directionality is retained) and bidi-override (the inherent directionality is ignored completely in favour of the dir attribute).

Due to the complicated nature of the bidi algorithm, browsers differ considerably in which set of rules they apply and in what order.

This page is licensed under a Creative Commons License.