Unicode

From PuddleNet

Unicode compatibility is a major consideration for any character set. SignWriting imposes unique requirements on the encoding model because the script is spatial: symbols are positioned in two dimensions. Unicode, by contrast, uses a sequential character model. Unicode text sometimes uses subscripts and superscripts to give the illusion of a spatial writing system, but the underlying data is still a sequential list of characters with no coordinate data.


Case Study

One case study to consider is the inclusion of cuneiform in Unicode. The spatial nature of cuneiform was ignored and a sequential character encoding was used. Only a portion of the cuneiform lexicon was analyzed. Each word was treated as a pictogram, or a sequential combination of pictograms. The encoding is complex and incomplete.

The lexicon of SignWriting cannot be reduced or analyzed in this way, because it is used for the world's sign languages, which are constantly in flux within communities and between generations. The only way to encode SignWriting is as a spatial script.

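The most popular encoding form of Unicode is UTF-8, whose design is a remnant of ASCII compatibility. UTF-8 is a variable-width encoding: ASCII characters take a single byte, and every other character is represented by a sequence of two, three, or four bytes. Many of the more recently encoded scripts lie outside the Basic Multilingual Plane and take four bytes. These byte sequences are complex, and they still represent sequential character data.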

This problem can be understood as the difference between Unicode strings and binary data. An example can be taken from the Requests for new languages/Wikipedia American Sign Language 2. It was recently set to verified as eligible by Jon Harald Søby. The problem becomes evident with his last name, which appears as S%C3%B8by when its UTF-8 bytes are percent-encoded in a URL: the single character ø becomes the two bytes C3 B8.
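
The gap between the character and its bytes can be seen in a short PHP sketch. It uses only built-in functions (`strlen`, `bin2hex`, `rawurlencode`); the variable names are illustrative, and the `\u{...}` escape assumes PHP 7 or later.

```php
<?php
// "Søby" written with an escape so the example does not depend on the file's encoding.
$name = "S\u{00F8}by";

// strlen() counts bytes, not characters: the 4 characters occupy 5 bytes in UTF-8.
$byteCount = strlen($name);

// The single character ø is stored as the two bytes c3 b8.
$oslashHex = bin2hex("\u{00F8}");

// Percent-encoding writes each non-ASCII byte as %XX, the form seen in URLs.
$encoded = rawurlencode($name);

echo $byteCount, " ", $oslashHex, " ", $encoded, "\n";
```

The string a reader sees and the bytes an application handles are two different views of the same data.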

The Unicode string itself is not viewed directly; readers see a more pleasing visual rendering of it. Font rendering is done on the client side, and modules and packages can be loaded to view uncommon scripts.

Any encoding of sign language data must have a UTF-8 representation. However, as various programming languages have taught us, Unicode strings and binary data are different things. Internally, any application that uses sign language data will parse and analyze it as binary data. The Unicode string is only for transferring data.


The Plane 4 Solution

The plane 4 solution makes it possible to convert sign language data encoded in the x-iswa-2008 coded character set into a Unicode string. View the Hello world page for an example of a real implementation.

The plane 4 solution uses an entire 16-bit Unicode plane for character mapping, specifically plane 4 (U+40000 through U+4FFFF), which is the same size as the x-iswa-2008. The plane 4 solution was created after the sign language data model, and every consideration was given to fulfilling the requirements of sign language data: using the ISWA 2008 with spatial information. The solution is as simple as possible and does not take Unicode's definitions and restrictions into account. There may be more complex solutions that fit Unicode's paradigm better; however, any solution must be able to represent the entirety of the ISWA 2008 along with spatial information. Even better would be a solution that can encode and decode the current sign language data.

Currently, sign language data uses a coordinate-based writing system. You can see this in action with the SignWriting Image Server and with the encoding of Binary SignWriting.

Just because we can represent sign language data with UTF-8 doesn't mean that we use Unicode internally. The binary data must fulfill the model and have an equivalent Unicode representation. Binary SignWriting is a robust encoding model that satisfies the requirements and is Unicode compatible.


UTF-8

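The first symbol is code 256, hex 0100, which is 4 bytes in UTF-8: f1 80 84 80.

The first character of Unicode Plane 4 is f1 80 80 80. Each byte after the leading f1 starts at 80 in hex and can go up to bf, a range of 64 values, so each one carries one base-64 digit of the code. Within plane 4 the second byte only reaches 8f, because the code is limited to 16 bits.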

Code   Hex    UTF-8
0      0000   f1 80 80 80
63     003f   f1 80 80 bf
64     0040   f1 80 81 80
256    0100   f1 80 84 80
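
The rows of this table can be reproduced with the standard four-byte UTF-8 bit layout. The sketch below assumes that code n maps to code point U+40000 + n, which is what the table implies; the helper name `plane4Utf8Hex` is hypothetical and is not part of any SignWriting library.

```php
<?php
// Hypothetical helper: UTF-8 bytes (as spaced hex) for a 16-bit code placed in plane 4.
function plane4Utf8Hex(int $code): string
{
    // Assumption: code n maps to code point U+40000 + n, as the table suggests.
    $cp = 0x40000 + $code;

    // Four-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    $bytes = [
        0xF0 | ($cp >> 18),
        0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F),
        0x80 | ($cp & 0x3F),
    ];

    // Format each byte as two lowercase hex digits.
    return implode(" ", array_map(function ($b) {
        return sprintf("%02x", $b);
    }, $bytes));
}

// Print the same four rows as the table.
foreach ([0, 63, 64, 256] as $code) {
    printf("%-6d %04x   %s\n", $code, $code, plane4Utf8Hex($code));
}
```

Each continuation byte holds 6 bits, which is why the last byte rolls over from bf back to 80 between codes 63 and 64.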

The encoding and decoding of the x-iswa-2008 coded character set can be defined from this convention.

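 function iswa2unicode($char){
   // $char is an x-iswa-2008 code as a hex string, e.g. "0100"
   $code = hexdec($char);
   // split the code into base-64 digits: $a is the lowest, $c the highest
   $a = $code % 64;
   $b = (int) floor($code / 64);
   $c = (int) floor($b / 64);
   $b -= $c * 64;
 
   // f1 is the lead byte for plane 4; each following byte is 0x80 plus one digit
   $utf8 = array("f1");
   $utf8[] = dechex($c + 128);
   $utf8[] = dechex($b + 128);
   $utf8[] = dechex($a + 128);
   // percent-encoded form, e.g. "%f1%80%84%80"
   return "%" . implode("%", $utf8);
 }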
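
The function above covers encoding only. Decoding could follow the same convention in reverse, as in the sketch below. The name `unicode2iswa` is hypothetical, and the function assumes its input is in the percent-encoded form that `iswa2unicode` returns. It checks only the byte count and the lead byte, not the range of the remaining bytes.

```php
<?php
// Hypothetical inverse of iswa2unicode(): "%f1%80%84%80" -> "0100".
function unicode2iswa(string $encoded): string
{
    // Split "%f1%80%84%80" into its four byte values.
    $bytes = array_map('hexdec', explode("%", ltrim($encoded, "%")));
    if (count($bytes) !== 4 || $bytes[0] !== 0xF1) {
        throw new InvalidArgumentException("not a plane 4 sequence: $encoded");
    }

    // Undo the +128 offset and reassemble the three base-64 digits.
    $code = (($bytes[1] - 128) * 64 + ($bytes[2] - 128)) * 64 + ($bytes[3] - 128);

    // Return a four-digit hex string, the same form iswa2unicode() accepts.
    return sprintf("%04x", $code);
}

echo unicode2iswa("%f1%80%84%80"), "\n";
```

A round trip through both functions should return the original code, which makes a convenient check for either implementation.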

Looking toward the future

The integration of a spatial script within all of the existing technologies that use Unicode is a daunting task that will take time.

Besides SignWriting, there are several scripts that are spatial in nature. These other scripts will benefit tremendously if and when the spatial script model has been integrated with Unicode.
