Truncating Japanese Text

CS-Web: A Lightweight Summarizer for HTML
The Perl Journal, Winter 1999

Canon is a Japanese company, with Japanese text on many of its web pages. Japanese text is usually encoded in one of several possible multibyte encoding schemes, and some of these schemes use variable numbers of bytes to represent single Japanese characters, or intermingle Japanese and regular ASCII characters. This was a problem.

The summaries generated by HTML::Summary are truncated at a fixed length, and this length is specified in bytes, not characters. If Japanese text is truncated at an arbitrary byte length, the cut can fall in the middle of a character.

Worse, our page abstracts can appear in result listings for keyword searches. If a page summary broken mid-character is inserted into running text, the byte immediately following the summary could be interpreted as the next byte of the previously uncompleted Japanese character, upsetting the character boundaries for the rest of the text.
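
The cascade described above can be reproduced with a few raw bytes. The following is a minimal sketch, not part of HTML::Summary: it assumes Shift-JIS input, reuses the Shift-JIS patterns listed later in this article, and the variable names ($sjis_char, $summary, $page) are ours, chosen for illustration. A two-character summary is cut after three bytes and then followed by plain ASCII text.

```perl
use strict;
use warnings;

# One Shift-JIS character: two-byte, half-width katakana, or ASCII.
# (Patterns taken from the %sjis_code_set hash in this article.)
my $sjis_char = qr/[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]|[\xa1-\xdf]|[\x21-\x7e]/;

# Two two-byte characters (hiragana "a" and "i"), written as raw bytes.
my $summary = "\x82\xa0\x82\xa2";

# A byte-oriented cut at 3 keeps only the lead byte of the second character.
my $cut = substr( $summary, 0, 3 );

# Insert the broken summary into running ASCII text.
my $page = $cut . "ABC";

# Split the result into characters the way a Shift-JIS reader would.
my @chars = $page =~ /\G($sjis_char)/g;

# Print each character as hex so the byte pairing is visible.
printf "%d characters: %s\n", scalar @chars,
    join( ' ', map { unpack 'H*', $_ } @chars );
```

The dangling lead byte 0x82 pairs up with the "A" that follows it, so the second character is read as the byte pair 82 41 and the "A" never appears as ASCII.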

HTML::Summary includes another supporting module, Lingua::JA::Jtruncate, which addresses this problem. Lingua::JA::Jtruncate contains just one subroutine, jtruncate(), used as follows:


    use Lingua::JA::Jtruncate qw( jtruncate );
    $truncated_jtext = jtruncate( $jtext, $length );

where $jtext is some Japanese text that you want to truncate, $length is the maximum length in bytes, and $truncated_jtext is the result. Here’s how it works.

First, some regexes are defined that match characters in each of the three main Japanese coding schemes: EUC, Shift-JIS, and JIS.

    %euc_code_set = (
        ASCII_JIS_ROMAN     => '[\x00-\x7f]',
        JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
        HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
        JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
    );

    %sjis_code_set = (
        ASCII_JIS_ROMAN     => '[\x21-\x7e]',
        HALF_WIDTH_KATAKANA => '[\xa1-\xdf]',
        TWO_BYTE_CHAR       => 
                          '[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]',
    );

    %jis_code_set = (
        TWO_BYTE_ESC        => 
            '(?:' .
            join( '|',
                '\x1b\x24\x40',
                '\x1b\x24\x42',
                '\x1b\x26\x40\x1b\x24\x42',
                '\x1b\x24\x28\x44',
            ) .
            ')'
        ,
        TWO_BYTE_CHAR => '(?:[\x21-\x7e][\x21-\x7e])',
        ONE_BYTE_ESC => '(?:\x1b\x28[\x4a\x48\x42\x49])',
        ONE_BYTE_CHAR       =>
            '(?:' .
            join( '|',
                '[\x21-\x5f]',             # JIS7 Half width katakana
                '\x0f[\xa1-\xdf]*\x0e',    # JIS8 Half width katakana
                '[\x21-\x7e]',             # ASCII / JIS-Roman
            ) .
            ')'
    );

    %char_re = (
        'euc' => '(?:' . join( '|', values %euc_code_set ) . ')',
        'sjis' => '(?:' . join( '|', values %sjis_code_set ) . ')',
        'jis' => '(?:' . join( '|', values %jis_code_set ) . ')',
    );

Each regex in %char_re matches a single character in the encoding named by its hash key.
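
To see what "matches one character" means in practice, here is a small sketch that walks a mixed ASCII and EUC string one character at a time. It is an illustration under our own assumptions, not code from Lingua::JA::Jtruncate: the pattern is copied from %euc_code_set, and the names $euc_char, @chars and @sizes are hypothetical.

```perl
use strict;
use warnings;

# The four EUC alternatives from %euc_code_set, joined by hand.
my $euc_char = qr/[\x00-\x7f]|[\xa1-\xfe][\xa1-\xfe]|\x8e[\xa0-\xdf]|\x8f[\xa1-\xfe][\xa1-\xfe]/;

# "ab", then two two-byte characters, then one half-width katakana.
my $text = "ab\xa4\xa2\xa4\xa4\x8e\xb1";

# \G anchors each match where the previous one ended, so the string is
# consumed character by character with no bytes skipped.
my @chars = $text =~ /\G($euc_char)/g;
my @sizes = map { length } @chars;

printf "%d bytes, %d characters, sizes: %s\n",
    length $text, scalar @chars, join( ',', @sizes );
```

The eight bytes split into five characters of sizes 1,1,2,2,2. For EUC the four alternatives start with disjoint first bytes, so the unspecified order in which values returns the hash entries does not change which alternative matches.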

Now for the definition of the jtruncate() subroutine; first, some fairly obvious sanity checks:


    sub jtruncate{
        my $text            = shift;
        my $length          = shift;

        # sanity checks

        # test for undef first, so an undefined $length is not treated as 0
        return undef if not defined $length;
        return undef if $length < 0;
        return '' if $length == 0;
        return $text if length( $text ) <= $length;

Now we save the original text; this is used later if the truncation process fails for some reason.

        my $orig_text = $text;

Now we use Lingua::JA::Jcode::getcode() to detect the character encoding. Lingua::JA::Jcode::getcode() is a simple wrapper around the jcode.pl Perl library for Japanese character code conversion. Kazumasa Utashiro kindly agreed to let us distribute the code with HTML::Summary.

        my $encoding = Lingua::JA::Jcode::getcode( \$text );

If getcode returns undef, or a value other than euc, sjis, or jis, then it has either failed to detect the encoding, or detected that it is not one of those that we are interested in. We then take the brute force approach, using substr.


        if ( not defined $encoding 
                     or $encoding !~ /^(?:euc|s?jis)$/ ){
            return substr( $text, 0, $length );
        }

The actual truncation of the string is done in chop_jchars(), which is described later.

        $text = chop_jchars($text, $length, $encoding );

chop_jchars() returns undef on failure. If we fail to truncate the Japanese text cleanly, we fall back on substr again. We had to choose between meeting the $length constraint, at the risk of returning a Japanese string with a broken character at the end, and keeping every character intact while exceeding $length. We chose to meet the length constraint:

        return substr( $orig_text, 0, $length ) 
                                    unless defined $text;

Next, a special case: JIS encoding uses escape sequences to shift in and out of single-byte and multi-byte modes. If the truncation process leaves the text ending in multi-byte mode, we need to add the single-byte escape sequence. Therefore, we truncate (at least) three more bytes from the JIS-encoded string, so that we have room to add the three-byte single-byte escape sequence without going over the $length limit.

        if ( $encoding eq 'jis' and 
            $text =~ /$jis_code_set{ TWO_BYTE_CHAR }$/) {
            $text = chop_jchars( $text, $length - 3, 
                                             $encoding );
            return substr( $orig_text, 0, $length ) 
                                     unless defined $text;
            $text .= "\x1b\x28\x42";
        }

And we’re done!


        return $text;
    }
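
The three-byte reservation can be checked with a little arithmetic on a concrete string. The sketch below is an illustration only and deliberately uses a different test from jtruncate(): instead of matching the tail of the text against TWO_BYTE_CHAR, a hypothetical helper, jis_mode_at_end(), remembers the last escape sequence seen. The sample string and all variable names are our own assumptions.

```perl
use strict;
use warnings;

my $TWO_BYTE_ESC = "\x1b\x24\x42";    # ESC $ B: switch to two-byte mode
my $ONE_BYTE_ESC = "\x1b\x28\x42";    # ESC ( B: switch back to ASCII

# Hypothetical helper: report the mode in force at the end of a JIS
# string by remembering the last escape sequence seen.
sub jis_mode_at_end {
    my $text = shift;
    my $mode = 'one-byte';
    while ( $text =~ /\x1b(\$[\@B]|\$\(D|\([JHBI])/g ) {
        $mode = substr( $1, 0, 1 ) eq '$' ? 'two-byte' : 'one-byte';
    }
    return $mode;
}

# Four two-byte characters wrapped in escapes: 3 + 8 + 3 = 14 bytes.
my $jis    = $TWO_BYTE_ESC . "\x24\x22\x24\x24\x24\x26\x24\x28" . $ONE_BYTE_ESC;
my $length = 10;

# Reserve room for the closing escape, then keep the opening escape
# plus as many whole two-byte characters as fit in what is left.
my $budget     = $length - length($ONE_BYTE_ESC);
my $body_bytes = 2 * int( ( $budget - length($TWO_BYTE_ESC) ) / 2 );
my $cut        = substr( $jis, 0, length($TWO_BYTE_ESC) + $body_bytes );

# The cut text still ends in two-byte mode, so close it off.
$cut .= $ONE_BYTE_ESC if jis_mode_at_end($cut) eq 'two-byte';

printf "%d bytes, ends in %s mode\n", length $cut, jis_mode_at_end($cut);
```

With a limit of 10, only two of the four characters survive: 3 bytes of opening escape, 4 bytes of text, and 3 bytes of closing escape. Truncating to the full 10 bytes first would have left no room to return to single-byte mode.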

Now for chop_jchars(), which simply lops off Japanese characters from the end of the string until it is no longer than the requested length. It’s pretty ugly, and slow for large strings truncated to small values, but it does the job!

    sub chop_jchars
    {
        my $text = shift;
        my $length = shift;
        my $encoding = shift;

        while( length( $text ) > $length ) {
            # no /o modifier: the pattern depends on $encoding,
            # which can differ from one call to the next
            return undef 
            unless $text =~ s!$char_re{ $encoding }$!!;
        }

        return $text;
    }
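
For long strings cut to short lengths, one possible alternative is to scan forward from the start and stop as soon as the next character would not fit, so the work depends on $length rather than on the size of the input. The sketch below swaps in that different technique. It is not how chop_jchars() works, it handles EUC only, and truncate_forward() and $euc_char are hypothetical names introduced for illustration.

```perl
use strict;
use warnings;

# The four EUC alternatives from %euc_code_set, joined by hand.
my $euc_char = qr/[\x00-\x7f]|[\xa1-\xfe][\xa1-\xfe]|\x8e[\xa0-\xdf]|\x8f[\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical helper: keep whole characters from the front of $text
# for as long as they fit within $length bytes.
sub truncate_forward {
    my ( $text, $length ) = @_;
    my $kept = 0;
    while ( $text =~ /\G($euc_char)/gc ) {
        last if $kept + length($1) > $length;    # next character will not fit
        $kept += length $1;
    }
    return substr( $text, 0, $kept );
}

# "ab", two two-byte characters, one half-width katakana: 8 bytes.
my $text = "ab\xa4\xa2\xa4\xa4\x8e\xb1";

for my $limit ( 0, 5, 6, 100 ) {
    printf "limit %3d -> %d bytes kept\n",
        $limit, length truncate_forward( $text, $limit );
}
```

One difference in behavior is worth noting: where chop_jchars() returns undef when it meets bytes it cannot match, this sketch simply stops at the last character it could recognize, so a caller would need its own check if it wants to detect malformed input.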