Truncating Japanese Text
Canon is a Japanese company, with Japanese text on
many of its web pages. Japanese text is usually
encoded in one of several possible multibyte encoding
schemes, and some of these schemes use variable
numbers of bytes to represent single Japanese
characters, or intermingle Japanese and regular ASCII
characters. This was a problem.
The summaries generated by HTML::Summary are
truncated at a fixed length, and this length is
specified in bytes, not characters. If Japanese text
is truncated at an arbitrary byte length, this might
mean truncation in the middle of a character.
Worse, our page abstracts can appear in result
listings for keyword searches. If a page summary
broken mid-character is inserted into running text,
the byte immediately following the summary could be
interpreted as the next byte of the previously
uncompleted Japanese character, upsetting the
character boundaries for the rest of the text.
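To make the failure concrete, here is a minimal sketch. It assumes EUC-encoded bytes and uses a simplified character pattern covering only ASCII and two-byte characters; split_euc() is a hypothetical helper written for this illustration, not part of any module.

```perl
use strict;
use warnings;

# Simplified EUC character: one ASCII byte or one two-byte pair.
my $euc_char = qr/[\x00-\x7f]|[\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical helper: split a byte string into EUC characters.
# Returns the characters found, plus any trailing bytes that do not
# form a complete character.
sub split_euc {
    my $bytes = shift;
    my @chars;
    push @chars, $1 while $bytes =~ /\G($euc_char)/gc;
    my $rest = substr( $bytes, pos($bytes) // 0 );
    return ( \@chars, $rest );
}

my $summary   = "\xc6\xfc\xcb\xdc\xb8\xec";   # three two-byte characters
my $following = "\xa4\xb3\xa4\xec";           # two more two-byte characters
my $broken    = substr( $summary, 0, 3 );     # byte limit cuts the second character

my ( $chars, $rest ) = split_euc( $broken . $following );
printf "parsed: %s  left over: %s\n",
    join( ' ', map { unpack 'H*', $_ } @$chars ),
    unpack( 'H*', $rest );
# prints "parsed: c6fc cba4 b3a4  left over: ec"
```

Only the first character survives intact. The orphaned byte cb pairs with the first byte of the following text, and every later character boundary is shifted by one byte.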
HTML::Summary includes another supporting module,
Lingua::JA::Jtruncate, which addresses this problem.
Lingua::JA::Jtruncate contains just one subroutine,
jtruncate(), used as follows:
use Lingua::JA::Jtruncate qw( jtruncate );
$truncated_jtext = jtruncate( $jtext, $length );
where $jtext is some Japanese text that you want
to truncate, $length is the maximum length in bytes,
and $truncated_jtext is the result.
Here’s how it works.
First, some regexes are defined that match
characters in each of the three main Japanese coding
schemes: EUC, Shift-JIS, and JIS.
%euc_code_set = (
    ASCII_JIS_ROMAN     => '[\x00-\x7f]',
    JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
    HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
    JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
);
%sjis_code_set = (
    ASCII_JIS_ROMAN     => '[\x21-\x7e]',
    HALF_WIDTH_KATAKANA => '[\xa1-\xdf]',
    TWO_BYTE_CHAR       =>
        '[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]',
);
%jis_code_set = (
    TWO_BYTE_ESC =>
        '(?:' .
        join( '|',
            '\x1b\x24\x40',
            '\x1b\x24\x42',
            '\x1b\x26\x40\x1b\x24\x42',
            '\x1b\x24\x28\x44',
        ) .
        ')'
    ,
    TWO_BYTE_CHAR => '(?:[\x21-\x7e][\x21-\x7e])',
    ONE_BYTE_ESC  => '(?:\x1b\x28[\x4a\x48\x42\x49])',
    ONE_BYTE_CHAR =>
        '(?:' .
        join( '|',
            '[\x21-\x5f]',            # JIS7 Half width katakana
            '\x0f[\xa1-\xdf]*\x0e',   # JIS8 Half width katakana
            '[\x21-\x7e]',            # ASCII / JIS-Roman
        ) .
        ')'
);
%char_re = (
    'euc'  => '(?:' . join( '|', values %euc_code_set ) . ')',
    'sjis' => '(?:' . join( '|', values %sjis_code_set ) . ')',
    'jis'  => '(?:' . join( '|', values %jis_code_set ) . ')',
);
Each regex in %char_re matches a single character
encoded in the scheme named by its hash key.
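As a rough illustration of what such a regex gives us, the sketch below rebuilds the EUC pattern from the same byte ranges and uses it to count characters rather than bytes. count_euc_chars() is a hypothetical helper introduced only for this example. The order in which values() returns the alternatives does not matter here, because each EUC alternative starts with a byte from a different range.

```perl
use strict;
use warnings;

my %euc_code_set = (
    ASCII_JIS_ROMAN     => '[\x00-\x7f]',
    JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
    HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
    JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
);
my $euc_re = '(?:' . join( '|', values %euc_code_set ) . ')';

# Hypothetical helper: count the characters in an EUC byte string by
# consuming one match at a time from the front. Returns undef if the
# string contains bytes that do not form a complete character.
sub count_euc_chars {
    my $bytes = shift;
    my $count = 0;
    pos($bytes) = 0;
    $count++ while $bytes =~ /\G$euc_re/gc;
    return pos($bytes) == length($bytes) ? $count : undef;
}

# One ASCII byte, one two-byte character, one half-width katakana
# (two bytes), and one three-byte character: 8 bytes, 4 characters.
my $text = "A" . "\xc6\xfc" . "\x8e\xb1" . "\x8f\xb0\xa1";
printf "%d bytes, %d characters\n", length($text), count_euc_chars($text);
# prints "8 bytes, 4 characters"
```

Cutting the same string at seven bytes leaves an incomplete three-byte character at the end, and count_euc_chars() returns undef for it.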
Now for the definition of the jtruncate()
subroutine; first, some fairly obvious sanity
checks:
sub jtruncate {
my $text = shift;
my $length = shift;
# sanity checks; test for undef first, because undef == 0 is true
return undef if not defined $length;
return undef if $length < 0;
return '' if $length == 0;
return $text if length( $text ) <= $length;
Now we save the original text; this is used later
if the truncation process fails for some reason.
my $orig_text = $text;
Now we use Lingua::JA::Jcode::getcode()
to detect the character encoding.
Lingua::JA::Jcode::getcode() is a simple
wrapper around the jcode.pl Perl library for Japanese
character code conversion. Kazumasa Utashiro kindly
agreed to let us distribute the code with
HTML::Summary.
my $encoding = Lingua::JA::Jcode::getcode(
\$text );
If getcode returns undef, or a value other than
euc, sjis, or jis, then it has
either failed to detect the encoding, or detected
that it is not one of those that we are interested
in. We then take the brute force approach, using
substr.
if ( not defined $encoding
     or $encoding !~ /^(?:euc|s?jis)$/ ) {
    return substr( $text, 0, $length );
}
The actual truncation of the string is done in
chop_jchars() - more about this later.
$text = chop_jchars($text, $length, $encoding );
chop_jchars() returns undef on failure.
If we have failed to truncate the Japanese text
properly, we resort to substr again. We had to decide
which mattered more: meeting the $length constraint,
or never returning a Japanese string with a broken
character encoding. We chose to meet the constraint:
return substr( $orig_text, 0, $length )
unless defined $text;
Next, a special case: JIS encoding uses escape
sequences to shift in and out of single-byte and
multi-byte modes. If the truncation process leaves
the text ending in multi-byte mode, we need to add
the single-byte escape sequence, which is three bytes
long. Therefore, we truncate (at least) three more
bytes from the JIS-encoded string, so we have room to
add the escape sequence without going over the
$length limit.
if ( $encoding eq 'jis' and
     $text =~ /$jis_code_set{ TWO_BYTE_CHAR }$/ ) {
    $text = chop_jchars( $text, $length - 3,
                         $encoding );
    return substr( $orig_text, 0, $length )
        unless defined $text;
    $text .= "\x1b\x28\x42";
}
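The following sketch shows why the closing escape sequence matters. It is not how jtruncate() decides; it is a separate illustration that tracks the shift state by scanning for the escape sequences listed in %jis_code_set. The helpers jis_final_mode() and close_jis() are hypothetical names used only here.

```perl
use strict;
use warnings;

my $two_byte_esc = qr/\x1b\x24\x40|\x1b\x24\x42|\x1b\x26\x40\x1b\x24\x42|\x1b\x24\x28\x44/;
my $one_byte_esc = qr/\x1b\x28[\x4a\x48\x42\x49]/;

# Hypothetical helper: report the mode a JIS byte string ends in,
# judged by the last escape sequence it contains.
sub jis_final_mode {
    my $bytes = shift;
    my $mode  = 'one-byte';    # JIS text starts in single-byte mode
    while ( $bytes =~ /($two_byte_esc)|($one_byte_esc)/g ) {
        $mode = defined $1 ? 'two-byte' : 'one-byte';
    }
    return $mode;
}

# Hypothetical helper: append the three-byte ASCII escape sequence
# if the text would otherwise end in two-byte mode.
sub close_jis {
    my $bytes = shift;
    $bytes .= "\x1b\x28\x42" if jis_final_mode($bytes) eq 'two-byte';
    return $bytes;
}

# Shift into two-byte mode, then three two-byte characters, with the
# closing escape sequence lost to truncation.
my $open = "\x1b\x24\x42" . "\x46\x7c\x4b\x5c\x38\x6c";

printf "before: %s, %d bytes\n", jis_final_mode($open), length $open;
my $closed = close_jis($open);
printf "after:  %s, %d bytes\n", jis_final_mode($closed), length $closed;
# prints "before: two-byte, 9 bytes" then "after:  one-byte, 12 bytes"
```

The three extra bytes are exactly the room that jtruncate() reserves by truncating to $length - 3.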
And we’re done!
return $text;
}
Now for chop_jchars(), which simply lops
off Japanese characters from the end of the string
until it is no longer than the requested length.
It’s pretty ugly, and slow for large strings
truncated to small values, but it does the job!
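sub chop_jchars
{
    my $text = shift;
    my $length = shift;
    my $encoding = shift;
    while( length( $text ) > $length ) {
        # no /o modifier here: the pattern depends on $encoding,
        # which can differ from one call to the next
        return undef
            unless $text =~ s!$char_re{ $encoding }$!!;
    }
    return $text;
}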
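For large strings truncated to small values, one possible alternative is to walk forward from the start of the string instead, so that the work grows with $length and not with the size of the text. The sketch below is an assumption-laden illustration, not the module's method: take_jchars() is a hypothetical name, and it handles only EUC, since JIS text would also need the escape sequences tracked.

```perl
use strict;
use warnings;

# EUC character pattern, using the same byte ranges as %euc_code_set.
my $euc_char = qr/[\x00-\x7f]|[\xa1-\xfe][\xa1-\xfe]|\x8e[\xa0-\xdf]|\x8f[\xa1-\xfe][\xa1-\xfe]/;

# Hypothetical alternative to chop_jchars(): consume whole characters
# from the front and stop before the first one that would take the
# result past $length bytes. Returns undef if the bytes it needs to
# read do not form a complete character.
sub take_jchars {
    my ( $text, $length, $char_re ) = @_;
    my $end = 0;
    pos($text) = 0;
    while ( $end < length $text and $end < $length ) {
        return undef unless $text =~ /\G$char_re/gc;   # unparseable bytes
        last if pos($text) > $length;                  # next character will not fit
        $end = pos($text);
    }
    return substr( $text, 0, $end );
}

my $jtext = "\xc6\xfc\xcb\xdc\xb8\xec";    # three two-byte characters
print unpack( 'H*', take_jchars( $jtext, 3, $euc_char ) ), "\n";
# prints "c6fc"
```

Like chop_jchars(), this returns a string of at most $length bytes that ends on a character boundary, so the caller's fallback to substr on undef would still apply.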