New Frontiers in C


Encoding Japanese Characters

Glenn Searfoss


Glen Searfoss is the marketing/technical production manager of Data Transforms, Inc., and has been involved with graphics and typesetting software development, custom bitmapped font creation, and font consulting services since 1984. You may reach him at Data Transforms, Inc., 616 Washington St., Denver, CO 80203.

The growth of the electronic global community has called for effective communication between diverse cultures. Understanding the various communication standards used is essential when occupational requirements include the international arena. The most widely used standard, the 94 printable ASCII characters, is unworkable for countries like Japan where communication requirements exceed 94 characters. These countries require special standards.

The Japanese, in an effort to simplify their language and maintain efficient communication, set an industrial standard of 6,877 characters. They developed special ASCII coding methods to organize and access these characters. To comprehend these coding schemes, you should first understand the history of this language standardization.

A Brief History

Standardization of the written Japanese language began in the ninth century with the development of the Kana syllabary. During this period, Chinese characters were simplified and used to represent a phonetic alphabet for the complete Japanese language. Kana comprises two subsets: Katakana (angular shaped characters representing foreign names and words) and Hiragana (a cursive style representing all other Japanese characters). Due to Western influences, these scripts were further standardized in the 19th century. See Figure 1.

The official (non-phonetic) Japanese language, Kanji, comprises thousands of ideograms. To simplify the language, the Japanese Ministry of Education set the basic literacy requirements for schools to 1,850 of the most commonly used characters. In 1951, the Ministry added 92 characters to the list for use in personal names. These 92 characters were traditional name characters that had been omitted from the original standard list. Many parents had used these traditional name characters to name their children. Their refusal to exchange traditional name characters for approved characters left many young Japanese officially nameless under the new order. The Ministry of Education ended the bureaucratic confusion and civil unrest by adding the supplementary name list.

Rapid industrial growth and a more technically educated population presented a new dilemma for the Japanese. The basic literacy level, which was fine for everyday communication, was no longer adequate. The language needed more characters for specialized fields, which required precise communication both nationally and internationally. The language needed a higher standard — hence the Japanese Industrial Standard (JIS).

The Japanese Industrial Standard

The Japanese Industrial Standard code includes 6,877 actual charcters: 6,353 Kanji in two levels. Kanji level one has 2,965 characters; Kanji level two has 3,388 characters, 86 Katakana, 83 Hiragana, 10 numerals, 62 English Alphanumerics, 148 symbols, 66 Cyrillic characters, 48 Greek characters, and 32 line elements.

Kanji and non-Kanji characters are grouped, with blank characters, into sets of 94 characters. Each group receives an ASCII assignment, and each character within a group receives an individual ASCII assignment. Where known, this information is provided. See Table 1.

JIS Codes

Japan has developed two major code types (7-bit and 8-bit) to use the industrial standard and to employ 2-byte sequences to represent Japanese characters. A 2-byte-per-character coding system uses two ASCII characters to represent one Japanese character. It can encode up to 16,384 characters (128x128), but because the Japanese use only 94 printable ASCII codes, encoding is limited to a maximum of 8,836 characters (94x94).

Although the 7-bit and 8-bit codes share a 2-byte character coding scheme, their methods of shifting to a 1-byte (ASCII) character sequence are different. Seven-bit codes use escape sequences to signal a shift between 1-byte and 2-byte character sequences. There are three major escape sequence codes: NEW-JIS, OLD-JIS, and NEC CODE.

Eight-bit codes do not require escape sequences. They use the high-bit of the first byte to distinguish between 1-byte and 2-byte character sequences. These codes are called SHIFT-JIS and Extended UNIX Code (EUC).

Relative Popularity

The 2-byte, 7-bit code is the character coding commonly used for electronic communication in Japan. It can be used reliably in the United States. Seven-bit code preference in Japan is ranked in the following order: NEW-JIS (est. 1983), OLD-JIS (est. 1978), and NEC CODE. Japan does not use the 2-byte, 8-bit code as extensively as the 7-bit format. The former cannot be used reliably in the United States because 7-bit paths will strip off the eighth bit, leaving garbage. This code may be used more often in the future, but for now it is only important to recognize its existance.

Eight-Bit Japanese Codes

There are two major 8-bit codes: SHIFT-JIS (or MS-KANJI) and EUC (Extended UNIX Code or AT&T-JIS). Although their Kanji/Non-Kanji character assignments differ, these codes are implemented similarly.

The 2-byte-per-character (Kanji character) mode is initiated when the first character from the ASCII 8-bit extension is any printable ASCII character above HEX 80 (DEC 128). In this instance, the character will also be treated as the first byte of an expected 2-byte sequence. The next character, which may be any ASCII character, is treated as the second byte of the 2-byte sequence. The first byte determines the appropriate 94-character group and the second byte determines a specific character within that group.

The 1-byte-per-character (ASCII character) mode is initiated when the first character from the ASCII 8-bit extension is any printable ASCII character below HEX 80. In this case the character is then treated as a standard ASCII character. See Table 2.

Seven-Bit Japanese Codes

Implementation of the 7-bit Japanese code is different than that for the 8-bit code. The 2-byte, 7-bit Japanese code uses KANJI-IN or KANJI-OUT escape sequences to shift between the 2-byte and 1-byte-per-character mode. The 8-bit code does not.

All 7-bit Japanese codes share a common character coding system, but their KANJI-IN (KI) and KANJI-OUT (KO) escape sequences differ. Names for the codes are NEW-JIS (or JIS), NEC CODE (or NEC-KANJI), and OLD-JIS. These three codes with their escape sequences are shown in Table 3.

The KI escape sequence directs the information that follows to be treated as two bytes per character. The first byte determines the character group, and the second byte determines the character within that group. KO directs the information that follows to be treated as one byte per character (JIS-ROMAN or ASCII). These escape sequences allow on-the-fly shifting between Kanji and Roman (Romaji) or ASCII during data transmission. See Table 4.

For the purposes of this article, assume that the difference between KO (JIS-ROMAN or ROMAJI) and KO (ASCII) is very minor, and the difference between NEW-JIS and OLD-JIS is not significant.

Typefaces

The two standard typeface styles used in Japan are Minchou and Gothic. The Minchou typeface gives a hand-lettered appearance to Japanese characters. It is popular among workstations in Japan. The Gothic typeface gives block print appearance to the Japanese characters. See Figure 2.

JIS-Kanji Access Options

You can access the full JIS-Kanji using ROM Chips (hardware) or Kanji font sets (software). The hardware option requires several ROM chips to cover the complete JIS standard. If the chip footprints match those on the target hardware, it should require little or no hardware redesign. If this is not the case, major redesign will be needed.

The software option requires 77 font sets of 94 characters per set to be coresident in memory. This may require some form of memory upgrade. (The full JIS-Kanji standard of 7,238 characters with character cell sizes of 24x24 would comprise 77 sets of 94 characters per set, including blank cells. The following amount of free memory would be required to have 77 co-resident font sets: Assuming a header of 512 bytes per font set, the total number of bytes for one font are ((3 x 24) x 94) + 512 = 7,280. The total number of bytes for 77 fonts would be 7,280 x 77 = 560,560.)

Whichever JIS-Kanji access method is selected, you will need software capable of interpreting the escape sequences of Japanese 7-bit or the high-bit shift of 8-bit incoming code. This article provides the basic usage and coding background to address this task.

Bibliography

A Guide To Reading And Writing In Japanese. Charles E. Tuttle, Inc. of Rutland, Vermont and Tokyo, Japan, 1959.

Data Transforms' JIS-Kanji Font Sets And Technical Support Notes. Data Transforms, Inc., Denver, CO (303) 832-1501.

"Electronic Transfer Of Japanese Using VMS VAX." Ken R. Lunde, University of Wisconsin-Madison, August 15, 1989.

Fujitsu Mb831000 And Mb834200a Kanji ROM Specifications, Fujitsu America.

The Japanese. Edwin O. Reischauser, Charles E. Tuttle, Inc. of Rutland, Vermont and Tokyo, Japan (by special arrangement with Harvard University Press, Cambridge, MA), 1977.