Each character in the Unicode standard is assigned a property known as its Bidirectional Category, which is used to determine the left to right or right to left positioning of text. When a run of text of one directionality is nested within a run of text of the other directionality (sometimes several levels deep), things can become quite complicated. See Bidirectional Behavior and Unicode Standard Annex #9, The Bidirectional Algorithm.

The Bidirectional Categories are partitioned into three groups: Strong, Weak and Neutral.

The Strong categories are

L
Left to Right
Includes most alphabetic, syllabic and Han ideographic characters, digits that are neither European nor Arabic, the LRM character U+200E left to right mark, and all unassigned characters except those in the ranges (U+0590 to U+05FF and U+FB1D to U+FB4F) and (U+0600 to U+07BF, U+FB50 to U+FDFF and U+FE70 to U+FEFF)
LRE
Left to Right Embedding
Includes only the LRE character U+202A left to right embedding
LRO
Left to Right Override
Includes only the LRO character U+202D left to right override
R
Right to Left
Includes the Hebrew alphabet, most punctuation specific to that script, the RLM character U+200F right to left mark, and all unassigned characters in the ranges (U+0590 to U+05FF and U+FB1D to U+FB4F)
AL
Right to Left Arabic
Includes the Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts and all unassigned characters in the ranges (U+0600 to U+07BF, U+FB50 to U+FDFF and U+FE70 to U+FEFF)
RLE
Right to Left Embedding
Includes only the RLE character U+202B right to left embedding
RLO
Right to Left Override
Includes only the RLO character U+202E right to left override

The Weak categories are

PDF
Pop Directional Format
Includes only the PDF character U+202C pop directional formatting  
EN
European Number
Includes European digits, Eastern Arabic-Indic digits, ...
ES
European Number Separator
Includes only U+002F  /  solidus
ET
European Number Terminator
Includes Plus Sign, Minus Sign, Degree, Currency symbols, ...
AN
Arabic Number
Includes Arabic-Indic digits, Arabic decimal & thousands separators, ...
CS
Common Number Separator
Includes Colon, Comma, Full Stop (Period), Non breaking space, ...
NSM
Non Spacing Mark
Includes characters with General category Mn (Non Spacing Mark) and Me (Enclosing Mark).
BN
Boundary Neutral
Includes formatting and control characters other than those explicitly given types above

The Neutral categories are

B
Paragraph Separator
U+2029 paragraph separator, appropriate Newline Functions, higher-protocol paragraph determination.
S
Segment Separator
Includes only U+0009 character tabulation
WS
Whitespace
Space, Figure Space, Line Separator, Form Feed, General Punctuation Spaces, ...
ON
Other Neutrals
All other characters, including U+FFFC object replacement character
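
The categories listed above can be looked up programmatically. The following is a minimal sketch using Python's standard `unicodedata.bidirectional` function; the `bidi_group` helper and the three sets are hypothetical names introduced here, and the grouping follows the partition given in this writeup. The values come from whichever version of the Unicode database ships with your Python, so some characters may report a different category than the lists above suggest, and newer categories fall through to "Unknown".

```python
import unicodedata

# Partition as given in this writeup (an assumption: newer revisions of
# UAX #9 group some of these differently and define additional categories).
STRONG = {"L", "LRE", "LRO", "R", "AL", "RLE", "RLO"}
WEAK = {"PDF", "EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_group(ch):
    """Return (category, group) for a single character."""
    category = unicodedata.bidirectional(ch)
    if category in STRONG:
        group = "Strong"
    elif category in WEAK:
        group = "Weak"
    elif category in NEUTRAL:
        group = "Neutral"
    else:
        # Empty string (no data for this code point) or a category
        # that is not part of the lists in this writeup.
        group = "Unknown"
    return category, group


# Latin A, Hebrew alef, Arabic alef, RLM, a European digit,
# an Arabic-Indic digit, tab, space, object replacement character.
samples = ["A", "\u05d0", "\u0627", "\u200f", "7", "\u0667", "\t", " ", "\ufffc"]
for ch in samples:
    category, group = bidi_group(ch)
    print(f"U+{ord(ch):04X} {category:4} {group}")
```

The last sample, U+FFFC, is the character that inline objects are treated as for the purpose of the algorithm.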

The term European digits is used to refer to decimal forms common in Europe and elsewhere, and Arabic-Indic digits to refer to the native Arabic forms.

Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change.

Private use characters can be assigned different values by a conformant implementation.

For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are U+FFFC object replacement character.

In excruciatingly incomplete summary, the Bidirectional Algorithm runs as follows:

  1. At a paragraph break, reset everything.
  2. Looking only at characters with a strong category, determine basic nesting levels.
  3. Looking only at characters with a weak category, let them inherit a nearby directionality, or change their type to a neutral one.
  4. Looking only at characters with a neutral category, let them inherit a nearby directionality.
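
Step 4 can be illustrated with a deliberately tiny sketch. It is not the real algorithm: it assumes the text has already been reduced to a list of `'L'`, `'R'` and `'N'` (neutral) markers, that there is a single nesting level, and that the paragraph's base direction is supplied by the caller. The function name `resolve_neutrals` is hypothetical. The rule it applies is that a run of neutrals sitting between two strong characters of the same direction takes that direction, and any other run of neutrals takes the base direction.

```python
def resolve_neutrals(types, base):
    """Resolve 'N' entries in a list of 'L'/'R'/'N' markers.

    types: list of 'L', 'R' or 'N'; base: 'L' or 'R'.
    Returns a new list containing only 'L' and 'R'.
    """
    resolved = list(types)
    n = len(resolved)
    i = 0
    while i < n:
        if resolved[i] != "N":
            i += 1
            continue
        # Find the end of this run of neutrals.
        j = i
        while j < n and resolved[j] == "N":
            j += 1
        # The edges of the paragraph count as the base direction.
        before = resolved[i - 1] if i > 0 else base
        after = resolved[j] if j < n else base
        # Same direction on both sides: inherit it. Otherwise: use the base.
        direction = before if before == after else base
        for k in range(i, j):
            resolved[k] = direction
        i = j
    return resolved


print(resolve_neutrals(list("RNRNL"), "L"))
```

In the example call, the first neutral sits between two right to left characters and becomes `'R'`, while the second sits between an `'R'` and an `'L'` and so falls back to the base direction `'L'`.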

http://unicode.org
