Each character in the Unicode standard is assigned a property known as its Bidirectional Category, which is used to determine the left-to-right or right-to-left positioning of text. When a run of text of one directionality is nested within a run of text of the other directionality (sometimes several levels deep), things can become quite complicated. See Bidirectional Behavior and Unicode Standard Annex #9, The Bidirectional Algorithm.
The Bidirectional Categories are partitioned into three groups: Strong, Weak, and Neutral.
The Strong categories are:
- L
- Left to Right
- Includes most alphabetic, syllabic, and Han ideographic characters, digits that are neither European nor Arabic, the LRM character U+200E left to right mark, and all unassigned characters except those in the ranges U+0590 to U+05FF, U+FB1D to U+FB4F, U+0600 to U+07BF, U+FB50 to U+FDFF, and U+FE70 to U+FEFF
- LRE
- Left to Right Embedding
- Includes only the LRE character U+202A left to right embedding
- LRO
- Left to Right Override
- Includes only the LRO character U+202D left to right override
- R
- Right to Left
- Includes the Hebrew alphabet, most punctuation specific to that script, the RLM character U+200F right to left mark, and all unassigned characters in the ranges U+0590 to U+05FF and U+FB1D to U+FB4F
- AL
- Right to Left Arabic
- Includes the Arabic, Thaana, and Syriac alphabets, most punctuation specific to those scripts, and all unassigned characters in the ranges U+0600 to U+07BF, U+FB50 to U+FDFF, and U+FE70 to U+FEFF
- RLE
- Right to Left Embedding
- Includes only the RLE character U+202B right to left embedding
- RLO
- Right to Left Override
- Includes only the RLO character U+202E right to left override
The Weak categories are:
- PDF
- Pop Directional Format
- Includes only the PDF character U+202C pop directional formatting
- EN
- European Number
- Includes European digits, Eastern Arabic-Indic digits, ...
- ES
- European Number Separator
- Includes only U+002F / solidus
- ET
- European Number Terminator
- Includes Plus Sign, Minus Sign, Degree, Currency symbols, ...
- AN
- Arabic Number
- Includes Arabic-Indic digits, Arabic decimal & thousands separators, ...
- CS
- Common Number Separator
- Includes Colon, Comma, Full Stop (Period), Non breaking space, ...
- NSM
- Non Spacing Mark
- Includes characters with General Category Mn (Non Spacing Mark) and Me (Enclosing Mark)
- BN
- Boundary Neutral
- Formatting and control characters, other than those explicitly given types above.
The Neutral categories are:
- B
- Paragraph Separator
- U+2029 paragraph separator, appropriate Newline Functions, higher-protocol paragraph determination.
- S
- Segment Separator
- Includes only U+0009 character tabulation
- WS
- Whitespace
- Space, Figure Space, Line Separator, Form Feed, General Punctuation Spaces, ...
- ON
- Other Neutrals
- All other characters, including U+FFFC object replacement character
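As a rough illustration of the grouping above, the following Python sketch looks up a character's category with the standard library function `unicodedata.bidirectional` and maps it to the Strong, Weak, or Neutral group. The set names and the `bidi_group` helper are invented for this example. The values reported depend on the Unicode version bundled with the Python interpreter, which may classify some characters differently from the lists above and which defines further categories that are reported here as "Other".

```python
import unicodedata

# Partition exactly as listed in this text. Categories that do not appear
# in these lists fall through to "Other".
STRONG = {"L", "LRE", "LRO", "R", "AL", "RLE", "RLO"}
WEAK = {"PDF", "EN", "ES", "ET", "AN", "CS", "NSM", "BN"}
NEUTRAL = {"B", "S", "WS", "ON"}


def bidi_group(ch):
    """Return (category, group) for a single character."""
    category = unicodedata.bidirectional(ch)
    if category in STRONG:
        return category, "Strong"
    if category in WEAK:
        return category, "Weak"
    if category in NEUTRAL:
        return category, "Neutral"
    return category, "Other"


# Latin A, Hebrew alef, Arabic alef, European digit, Arabic-Indic digit, space
for ch in ["A", "\u05d0", "\u0627", "1", "\u0661", " "]:
    print(f"U+{ord(ch):04X}", *bidi_group(ch))
```

Running the loop prints one line per sample character, giving its code point, its category, and the group that category belongs to.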
The term "European digits" refers to the decimal forms common in Europe and elsewhere, and "Arabic-Indic digits" refers to the native Arabic forms.
Unassigned characters are given strong types in the algorithm. This is an explicit exception to the general Unicode conformance requirements with respect to unassigned characters. As characters become assigned in the future, these bidirectional types may change.
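The default strong types for unassigned characters can be sketched as a simple range lookup. This is a minimal sketch that uses only the ranges quoted in the category lists in this text; later versions of the standard may define different or additional ranges. The helper names `default_category` and `effective_category` are hypothetical. The second helper relies on `unicodedata.bidirectional` returning an empty string when no value is defined for a character.

```python
import unicodedata

# Inclusive code point ranges quoted in this text for unassigned characters.
DEFAULT_R_RANGES = [(0x0590, 0x05FF), (0xFB1D, 0xFB4F)]
DEFAULT_AL_RANGES = [(0x0600, 0x07BF), (0xFB50, 0xFDFF), (0xFE70, 0xFEFF)]


def default_category(code_point):
    """Strong type assumed for an unassigned code point (sketch only)."""
    if any(lo <= code_point <= hi for lo, hi in DEFAULT_R_RANGES):
        return "R"
    if any(lo <= code_point <= hi for lo, hi in DEFAULT_AL_RANGES):
        return "AL"
    return "L"


def effective_category(ch):
    """Category from the character database, else the range-based default."""
    # An empty string means the database has no value for this character.
    return unicodedata.bidirectional(ch) or default_category(ord(ch))


print(default_category(0x05FF))  # inside the first R range
print(default_category(0x0378))  # outside every listed range
```

Keeping the range lookup separate from the database lookup makes it easy to replace the tables when the ranges change in a later version of the standard.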
Private use characters can be assigned different values by a conformant implementation.
For the purpose of the bidirectional algorithm, inline objects (such as graphics) are treated as if they are U+FFFC object replacement character.
In excruciatingly incomplete summary, the Bidirectional Algorithm runs as follows:
- At a paragraph break, reset everything.
- Looking only at characters with a strong category, determine basic nesting levels.
- Looking only at characters with a weak category, let them inherit a nearby directionality, or change their type to a neutral one.
- Looking only at characters with a neutral category, let them inherit a nearby directionality.
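The last step of this summary can be sketched in Python as a toy. This is not a conformant implementation of the algorithm: it ignores embeddings, overrides, and nesting levels, leaves weak characters unresolved, and treats R and AL alike. The function names and the `base` parameter (the assumed overall direction of the paragraph) are invented for this example. A neutral character takes the direction shared by the nearest non-neutral characters on both sides, and falls back to `base` when they disagree or have no direction of their own.

```python
import unicodedata

NEUTRAL = {"B", "S", "WS", "ON"}


def strong_direction(ch):
    """Return 'L' or 'R' for L, R, and AL characters, else None."""
    category = unicodedata.bidirectional(ch)
    if category == "L":
        return "L"
    if category in ("R", "AL"):
        return "R"
    return None  # weak, neutral, or explicit formatting character


def is_neutral(ch):
    return unicodedata.bidirectional(ch) in NEUTRAL


def neighbour_direction(text, indices, base):
    """Direction of the first non-neutral character along indices."""
    for j in indices:
        if not is_neutral(text[j]):
            return strong_direction(text[j])
    return base  # ran off the end of the text


def resolve_neutrals(text, base="L"):
    """Toy version of the neutral step; weak characters stay None."""
    resolved = [strong_direction(ch) for ch in text]
    for i, ch in enumerate(text):
        if not is_neutral(ch):
            continue
        before = neighbour_direction(text, range(i - 1, -1, -1), base)
        after = neighbour_direction(text, range(i + 1, len(text)), base)
        agree = before is not None and before == after
        resolved[i] = before if agree else base
    return resolved


# A space between a Latin letter and a Hebrew letter follows the base direction.
print(resolve_neutrals("a \u05d0", base="L"))
print(resolve_neutrals("a \u05d0", base="R"))
```

In the two calls, only the space changes: it sits between an L character and an R character, so neither neighbour wins and the assumed base direction decides.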
http://unicode.org