Departments


We Have Mail


Dear Mr. Plauger,

I like your Journal, although C proved not to be the lingua franca for my specialization, computational linguistics. Still, I like reading about computer programming, and the programs you publish are a joy for me to read. I was surprised to read the announcement "Natural Language Processing" on this month's cover and jumped into Mr. Suereth's article "A Natural Language Processor." Frankly stated, it turned out to be a disappointment. I am afraid the editorial board lacks expertise in the field of computational linguistics. Let me briefly describe the reasons for my disappointment.

The main reason is the discrepancy between the written text and the computer program in Mr. Suereth's article. In fact, the article does not describe the program, and the notions used in the article do not match the notions of the linguistic field. It would be great if sentences could be described as "the input transaction record," with the words as "fields" and the dictionary as "a master file of word information." Unfortunately, that is impossible; NLP is not the equivalent of database processing, as I hope to make clear to you. It is equally a simplification to conceive of the semantic component of NLP as Mr. Suereth seems to do: "The meaning is derived from a combination of information in the input sentence, the dictionary, and program code." Well, do the humans writing that program code not understand natural language themselves?

The terms transformational grammar and phrase structure grammar are vital to the professional computational linguist, although they are not the sole theoretical framework for NLP. Those terms, however, have a different meaning from the ones used by Mr. Suereth. The kernel of transformational grammar is the notion of "deep structure," an abstract representation of the meaning of natural language utterances. Mr. Suereth seems to use "transformational" in the way normal for programmers, as "data transformation." Because the deep structure of transformational grammar really is abstract, the TG model could hardly be made operable on computers. The term "phrase structure" is also used in a non-linguistic way by Mr. Suereth. Phrase structure grammar is often used for the syntactic component of NLP; it describes the syntactic structure of a sentence.

Mr. Suereth writes about "phrase structures," and seems not to know that a sentence has only one syntactic structure, defined as the structure of the phrases in the sentence. This phrase structure is not what Mr. Suereth seems to think it is: for the sentence "Sue is going to the store" he describes "the underlying structure" as "name-auxiliary verb-preposition-determiner-noun." In phrase structure terminology, however, the structure of this sentence would be defined like this, in bracket notation:

(S (NP (N Sue))
   (VP (V is going))
   (NP (PRP to) (DET the) (N store)))
Normally, in phrase structure grammar, the sentence structure is visualized as a tree, because that is easier for humans to understand. Mr. Suereth does not define the structure of the sentence; he defines the linear sequence of the word classes in the sentence. The program is not operating on the structure of the sentences in his database; the program merely maps the linear sequence of word classes to arrays. Mr. Suereth can do this because he knows in advance what type of sentences he will operate on. Therefore the program only works for this type of word-class sequence, and only if there are no ambiguities in it.

The program will even work when the sentences are nonsense: if you write "Jim is going in (on/to) the jump" or "The house is on (in/to) the jump," the program will not complain, and what is more, this can only be repaired by ad hoc measures. The same holds for all the problems that will come up. In other words: this program cannot be expanded as Mr. Suereth thinks it can. The program is useful for applications where the programmer knows all sentences in advance. The claim that the program can be expanded into a reader of books only holds if Mr. Suereth can prove that he knows in advance all sentences in all books. This kind of "natural language processor" is normally delivered with computer games. Their authors, however, present this part of their programs with a bit more modesty, not as natural language processors.

By the way, much more is already possible when it comes to natural language processing and natural language understanding than can be concluded from the articles on NLP in your April issue. The article would have been up to date in 1975, but that is beyond the scope of this letter.

With kind regards,

Dr. M. Boot
Rubensln. 40
3723 BR Bilthoven
The Netherlands

We were happy to present Mr. Suereth's offerings as a modest attempt at parsing a simple grammar and vocabulary. Those who know little about the field seemed to enjoy reading the code. Those who know a lot were disappointed. I guess the one thing I would do differently in the future is have the author better emphasize the modesty of his/her goals in any area that might be haunted by experts. pjp

Dear P.J. Plauger:

I have just read your article on Large Character Set Functions in the June 1993 issue of The C Users Journal. The section entitled "<stdlib.h> revisited" states that the functions strtod, strtol, and strtoul convert arithmetic representations to text strings, as do their new wide-char analogs. There is also some discussion of "the characters they generate."

I'm sure you meant to say that these functions convert the other way, i.e. strings to arithmetic representation, and that they don't generate strings, just read them. But I just wanted to check with you to make sure that I wasn't missing something.

Thanks for providing these informative articles. I'm looking forward to the next one in the series.

uunet!sol.metaware.com!marka ()

It's the way you said. If I said it the other way, I misspoke. Thanks for pointing out the error. pjp

Dear P.J. Plauger

I have a problem. I figure if ANYONE knows a workaround, it would be you.

I am writing code by hand that will interface with code produced by an application generator. If the persons running the application generator change structure sizes on me, my code can bomb. Well, if it has to be rewritten, it has to be, but I'm trying to avoid having 20,000 copies of this application duplicated and shipped to users — and then finding out there is a problem.

When I was working with Borland C, I could use

#if sizeof (structurename) != 132
#error
#endif
and it worked fine. However, this project is being done in Microsoft C, and they do not have this extension to the ANSI standard.

It wouldn't help, however, even if it DID happen to be ANSI. I'm not dealing with standards — I'm dealing with a real-world compiler, dealing with a real-world problem, and trying to figure out another way to generate an error at compile time so that I don't generate errors at run time.

Any suggestions on how I might accomplish this?

Stephen Thomas
CompuServe, Inc. Columbus OH
614-793-3121
CIS: 70004,1473
stevet@csi.compuserve.com

Try:

static char junk[sizeof (structurename)
                  != 132 ? 0 : 1];
It's Standard C that should generate a diagnostic if the structure has the wrong size. If the code compiles, you waste a byte of storage. pjp

Dear Dr. Plauger,

Thanks for a fine editing job at the C User's Journal.

The recent focus of your column, "Standard C," on the extensions proposed for non-English character sets is timely. You are raising the awareness of the C/C++ programming community to the needs of computer users in non-English-speaking countries at a time when a strong trend toward the internationalization of business markets is in progress and gaining strength. My impression is that American developers need to play catch-up in meeting the needs of potential customers who don't speak English.

The thrust of the extensions to C you are describing in your column is absolutely crucial to meeting the language needs of European, Asian, and Latin American business partners. These character set provisions, though, are by no means the whole story. A few examples:

And so it goes. I don't want to drag a long story out, but I believe The C Users Journal could provide a very useful service by publishing articles, or perhaps a column, showing your readers how to make their software products competitive in a global market.

Sincerely,

Harry Philips
Internet: hkp@tdkt.kksys.com
Fido: 1:292/36.4

Your points are well taken. You should know, however, that even the current C Standard permits *printf* to display right to left, or bottom to top, if the implementation so chooses. That doesn't solve the problem of mixed directions, but it's a step in the right direction. So too is ISO 10646, a character encoding that subsumes all the character sets of the world. So too is the locale machinery introduced with Standard C and embellished in POSIX. It provides for multiple cultures even within a country, selectable at run time.

Still we know that isolating and translating messages can be an important part of internationalizing code. We will continue to run articles on this and related topics as we get good submissions. pjp

Dear P.J. Plauger,

Your report on WG14 progress (in C User's Journal, May 1993 Vol 11, No. 5) is much appreciated. This new interface goes a long way toward allowing i18n code to be both correct and portable.

It seems to us that there is one essential feature still missing, however: The interface provides a way to get a wide char from a FILE, which is good, but it is still impossible to "get" from any other source of bytes.

Why does this matter? For one thing, wide string i/o could then be added by users on top of existing implementations of iostream, until vendors support it themselves. (C++ is good at things like this.) In any case, there are many other sources of bytes besides FILE.

What does this essential feature look like? Typically you need characters to read up to a delimiter, and then push it back. (This delimiter may be restricted to USASCII if the byte source itself only allows one byte of pushback.) To read a character at a time, you need to be able to tell how many bytes are in the character to come, with no more information than a peek at its first byte.

So to support this feature, we need a function that takes one byte (and maybe ignores it) and a shift state (ditto), and returns the number of bytes in the character it heads. Note that this is not the same thing as mblen or mbrlen, as I understand them.

Of course this function has already been implemented in some systems (notably Sun's) under the name euclen.

Nathan Myers
Rogue Wave Software
myersn@roguewave.com

I didn't make clear that you can also read and convert wide characters from a multibyte string in memory, by using sscanf or vsscanf in wide-character mode. I will soon have C++ iostreams built atop this machinery, available for public consumption, so you'll be able to do similar things in C++.

UNIX systems tend to favor shiftless multibyte encodings such as EUC, which also have other nice properties. I'm not convinced, however, that all multibyte encodings determine the number of bytes in a character from the shift state and the first byte of the sequence. Thus, working in memory with mblen or mbrlen is more robust. The latter should parse an arbitrary string if you let it look at no more than MB_CUR_MAX bytes at a time. pjp

Dear Mr. Plauger,

In your last column, "Standard C," you write that the functions from <ctype.h> should be used religiously. Of course I share this view, but I think programmers are sometimes unaware of the pitfalls. Consider the following code fragment, which may run flawlessly until an internationalized version of the isalnum function is used:

int c;
/* ... read c from somewhere ... */
if (isalnum(c))
   ...
else if (c == '|')
   ...
Now suppose there is an internationalized version of this program, and the programmer thinks it would be a good idea to call setlocale(LC_ALL,"") in main so that the user may select his preferences dynamically. The program can then fail: if the user selects the German variant of seven-bit ASCII, for example, '|' will be recognized as a letter. The problem is not at all theoretical; it is the reason why yacc rejects working grammars in some internationalized versions of UNIX System V.

To summarize:

1) Using functions from <ctype.h> is always a good idea;

2) Switching away from the default "C" locale in main should not be done blindly;

3) If a program mixes calls to character classification functions with comparisons against explicit character constants, it is a strong indication that more thought is necessary to produce a working internationalized version.

Best Regards,

Martin Weitzel

You have stated the issues beautifully. I absolutely agree. pjp

Dear Mr. Plauger:

I am writing to suggest you refrain from using the phrase "voted out" because it is confusing. You used it in this month's Standard C column (CUJ, April 1993), saying that "WG14 finally voted out an amendment to the C Standard". At first blush, I thought this was similar to a politician being voted out of office (i.e., the vote failed). Upon reading the rest of the paragraph, however, I realized that the vote passed (similar to "hammering out" an amendment). I think I have a fairly good command of the English language, but I am always confused by this phrase so I would ask that you try to avoid it in the future.

Thank you, as always, for an excellent publication and column, both of which are (otherwise) very clear and concise.

Ken Van Camp
<cp486a!kvancamp@cpmail.att.com>
AT&T Consumer Products Division
5 Wood Hollow Road Room 1H36
Parsippany, NJ 07054
(201)581-4513 — voice

Point well taken. pjp