MSW:Text Encoding

8. Text Encoding
A text encoding is machine readable. It should make explicit features of the text that can be automatically analyzed. Text encoding exists solely to support the various processes that act upon text and should therefore make it possible to resolve hard problems in a principled and efficient manner.

8.A. Preprocessed Maximum Coordinate
The maximum coordinate is the bottom-right coordinate of the signbox. While the minimum coordinate can be determined from the symbol positions alone, the maximum coordinate requires adding the respective width and height of each symbol to its position. At the time of data entry, the maximum coordinate can be preprocessed and explicitly defined for the signbox. This preprocessed value makes it possible to perform script layout without accessing an outside data source. When present, the maximum coordinate is added to the signbox definition directly after the signbox or lane marker.
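The preprocessing step can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name signbox_bounds and the (x, y, width, height) tuple layout are assumptions made for the example, and in practice the widths and heights would come from a symbol size data source.

```python
from typing import Iterable, Tuple

# Hypothetical symbol record: (x, y, width, height), with (x, y) as the top-left corner.
Symbol = Tuple[int, int, int, int]


def signbox_bounds(symbols: Iterable[Symbol]) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((min_x, min_y), (max_x, max_y)) for the symbols of one signbox."""
    symbols = list(symbols)
    if not symbols:
        raise ValueError("a signbox needs at least one symbol")
    # The minimum coordinate needs only the symbol positions.
    min_x = min(x for x, y, w, h in symbols)
    min_y = min(y for x, y, w, h in symbols)
    # The maximum coordinate also needs each symbol's width and height.
    max_x = max(x + w for x, y, w, h in symbols)
    max_y = max(y + h for x, y, w, h in symbols)
    return (min_x, min_y), (max_x, max_y)
```

Once the maximum coordinate is stored with the signbox, a layout routine can read it directly instead of repeating this calculation with symbol size lookups.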

8.B. Repertoire and Coded Character Set
The repertoire of the Modern SignWriting encoding consists of 5 markers, 6 columns, 16 rows, 652 grid pages, and 500 numbers. These abstract characters can be mapped to codepoints in the 12-bit coded character set of x-Binary-SignWriting or to codepoints in the Unicode private use area.

The BSW string has been designed for an easy conversion to Unicode. Each x-Binary-SignWriting codepoint can be shifted by the hex value of FD700 to determine the Unicode PUA codepoint.
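The shift can be illustrated with a short sketch. The function names bsw_to_pua and pua_to_bsw are hypothetical, and representing a BSW string as a list of integers is an assumption made for the example; only the 12-bit range and the FD700 offset come from the text above.

```python
PUA_SHIFT = 0xFD700  # offset from a BSW codepoint to its Unicode PUA codepoint
BSW_MAX = 0xFFF      # largest value in a 12-bit coded character set


def bsw_to_pua(codepoints):
    """Map a sequence of 12-bit BSW codepoints to a string of PUA characters."""
    out = []
    for cp in codepoints:
        if not 0 <= cp <= BSW_MAX:
            raise ValueError(f"not a 12-bit codepoint: {cp:#x}")
        out.append(chr(cp + PUA_SHIFT))
    return "".join(out)


def pua_to_bsw(text):
    """Reverse the shift, returning the list of BSW codepoints."""
    codepoints = []
    for ch in text:
        cp = ord(ch) - PUA_SHIFT
        if not 0 <= cp <= BSW_MAX:
            raise ValueError(f"character outside the shifted range: U+{ord(ch):X}")
        codepoints.append(cp)
    return codepoints
```

With this offset, the 12-bit range maps to U+FD700 through U+FE6FF, which lies inside the plane 15 private use area.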

8.C. Tokens and Patterns
There are 11 tokens used with Modern SignWriting. They can be grouped in 4 layers: the 5 structural markers (A, B, L, M, R), the 3 ranges of base symbols (w, s, P), the 2 modifiers (i, o), and the numbers (n).

A string of Modern SignWriting characters (either BSW or Unicode PUA) can be visualized as tokens rather than characters. A tokenized view replaces each character with 1 of the 11 token values. The use of tokens clarifies structures and simplifies regular expressions. See section 4.C.1. for Regular Expression Basics.

8.D. Lite Markup
Instead of binary character data or full XML, it has proven beneficial to use a human-readable lite markup of ASCII words separated by spaces. Each word represents either a signbox or a punctuation mark. The lite markup has the advantage of a small size without requiring special Unicode or XML functions. Simple regular expressions can quickly and efficiently process the lite markup.

8.D.1. Structural Markers
In the lite markup, the structural markers use the token values instead of BSW or Unicode PUA.

8.D.2. Symbol Keys
In the lite markup, symbols are referenced by symbol keys: the letter 'S' followed by 5 hexadecimal digits.
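A loose pattern for locating symbol keys might look like the sketch below. The function name find_symbol_keys is hypothetical, and accepting both uppercase and lowercase hexadecimal digits is an assumption; the pattern checks only the shape of a key, not whether it names a symbol that actually exists.

```python
import re

# Loose shape check: the letter 'S' followed by exactly 5 hexadecimal digits.
# Accepting either letter case is an assumption made for this example.
SYMBOL_KEY = re.compile(r"S[0-9a-fA-F]{5}")


def find_symbol_keys(text):
    """Return every substring of text that has the shape of a symbol key."""
    return SYMBOL_KEY.findall(text)
```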

8.D.3. Coordinates
In the lite markup, there are 2 types of coordinates: regular fixed-width coordinates and irregular variable-width coordinates. Both types of coordinates contain 2 numbers separated by the letter 'x'.

Regular Coordinates. In the lite markup, regular coordinates are always 7 ASCII characters long: 3 digits followed by the letter 'x' followed by 3 more digits. The numbers range from 250 to 749, with 500 representing the center point of zero. So for regular coordinates, the string “250” is equal to the number value of -250 and “749” is equal to the number value of 249. The loose definition of regular coordinates matches numbers with 3 digits without specifying the number range. It has a regular expression of /[0-9]{3}x[0-9]{3}/. The strict definition of regular coordinates only matches numbers in the range from 250 to 749. It has a more verbose regular expression of /(2[5-9][0-9]|[3-6][0-9]{2}|7[0-4][0-9])x(2[5-9][0-9]|[3-6][0-9]{2}|7[0-4][0-9])/.
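The offset of 500 can be shown with a small sketch. The function names decode_regular and encode_regular are hypothetical, and the strict pattern here limits both numbers to the stated range of 250 to 749.

```python
import re

# Strict pattern: both numbers must fall in the range 250 to 749.
_NUM = r"(2[5-9][0-9]|[3-6][0-9]{2}|7[0-4][0-9])"
STRICT_COORD = re.compile(_NUM + "x" + _NUM)


def decode_regular(coord):
    """Convert a regular coordinate such as '250x749' to number values."""
    m = STRICT_COORD.fullmatch(coord)
    if m is None:
        raise ValueError(f"not a strict regular coordinate: {coord!r}")
    # Subtract the center value of 500 to recover the signed numbers.
    return int(m.group(1)) - 500, int(m.group(2)) - 500


def encode_regular(x, y):
    """Convert number values in the range -250 to 249 to a regular coordinate."""
    if not (-250 <= x <= 249 and -250 <= y <= 249):
        raise ValueError(f"out of range for regular coordinates: ({x}, {y})")
    # Adding 500 always yields a 3-digit number between 250 and 749.
    return f"{x + 500}x{y + 500}"
```

Because every encoded number has exactly 3 digits, the result is always 7 characters long, which is what keeps the regular expressions for the lite markup simple.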

Irregular Coordinates. In the lite markup, irregular coordinates are variable width. The numbers can be positive or negative. For negative numbers, the '-' minus sign is replaced with the letter 'n'. The two numbers in the coordinate are separated by the letter 'x'. The center coordinate of (0,0) is represented by the string '0x0'. The coordinate (-250,-250) is represented by the string 'n250xn250'.
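A conversion in both directions can be sketched as follows. The function names and the regular expression are assumptions made for illustration; the text above defines the notation but does not give a pattern for it.

```python
import re

# Assumed pattern: an optional 'n' (negative) and digits, 'x', then the same again.
IRREGULAR_COORD = re.compile(r"(n?[0-9]+)x(n?[0-9]+)")


def encode_irregular(x, y):
    """Write a coordinate with 'n' in place of the minus sign."""
    def num(value):
        return ("n" if value < 0 else "") + str(abs(value))
    return f"{num(x)}x{num(y)}"


def decode_irregular(coord):
    """Read an irregular coordinate such as 'n250xn250' back into numbers."""
    m = IRREGULAR_COORD.fullmatch(coord)
    if m is None:
        raise ValueError(f"not an irregular coordinate: {coord!r}")

    def num(text):
        return -int(text[1:]) if text.startswith("n") else int(text)

    return num(m.group(1)), num(m.group(2))
```

No range check is applied here, so the same functions handle values well outside -250 to 249.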

Although signs have a coordinate number limit of -250 to 249, irregular coordinates are unbounded when used for display with compounds of multiple signs and punctuation.