July 1995/We Have Mail

Departments

We Have Mail

Dear Mr. Plauger,
I do enjoy CUJ, it is an excellent support with C and C++ problems. Recently I ran into a problem I could not solve. In an application I have to read out data from a text string. To save space the data is in a compact format. It goes like this. The input text is a geographical coordinate value, let's take 50x 07,51'. It comes as 500751. This is a part of a string of numbers, where the beginning position and the width is defined. The data should be converted into a float value. I make it the following way:

int iDegrees, iMinutes; float fValue; sscanf ( & (szInputText[COORD_BEGIN_POS]), "%2d%4d", & iDegrees, & iMinutes ); fValue = (float) iMunites; fValue /= 6000.0F; fValue += (float) iDegrees;
My problem is that the width of the data may vary. It would be great to define the width, and let the compiler generate that defined value into the format string. Instead of "%2d%4d" it would be easier to have

#define DEGREE_WIDTH 2 #define MINUTE_WIDTH 4 sscanf ( & (szInputText[COO:RD_BEGIN_POS]), "%DEGREE_WIDTHd%MINUTE_WIDTHd", & iDegrees, & iMinutes );
Should any of the widths change, I need to modify only once. Another variation to the theme could be if I could tell sscanf that behind the % there are two parameters: the first one is the width and the second is the target address.
In C++ one can use a derivative of the ios class and the setw manipulator. Is there a solution to this problem in C, or does the sscanf format string always need the explicit numbers? Thanks for your help in advance! Keep up the good work! Best regards,
Istvan Fay
Grobkrotzenburg
Germany
My preference is to gather the appropriate number of characters for each field into a text buffer, then use strtol to convert the buffer. Your first example can be made to work by using string concatenation, as in:

#define DEGREE_WIDTH "2" #define MINUTE_WIDTH "4" sscanf(&szInputText[COO:RD_BEGIN_POS], "%" DEGREE_WIDTH "d%" MINUTE_WIDTH "d", &iDegrees, &iMinutes);
This approach is simpler if the field widths are indeed hard-wired. — pjp
Editor:
I regret having to tell Bob Stout that median filtering is a very old technique, but it is known to be very useful at removing noise, and is worth bringing to people's atttention. In general, you can take N samples and discard the highest and lowest K. (See the November, 1994 Dr. Dobb's Journal, Algorithm Alley column for another example.)
Another filtering technique that is useful in very noisy environments consists of repeatedly discarding outliers, data points that lie furthest from the average, until only a few are left. This is used in clock-synchronization protocols to obtain very stable averages over several time exchanges. Some sample code is shown in Listing 1.
Colin
Yeah, I discovered filtering like this about 30 years ago, but as you point out, it's well worth showing to a fresh batch of programmers. — pjp
Editor,
Hello, I have subscribed to your magazine for over four years now, and would like to know if you plan on including any programs that take advantages of Unix's power. It seems your magazine is more DOS/Windows oriented (of course, I guess I can't blame you for that since so many use those OSs). Do you have any plans to incorporate more Unix-geared programs and articles in the future? I particularly would like to see more articles on X, and Ipc in Unix. Thank You.
Sincerely,
Raymond Chin-A-Young
As you observe, we are to some extent at the mercy of what's popular at any given time. That's what our contributors tend to contribute. But I am happy to see Unix on the ascendant once again, with the rising importance of network servers, if for no other reason. We certainly do intend to run more Unix articles in the future. In fact, we are looking for articles on C/C++ programming under Unix, particularly in the areas of network control, interfacing to more popular OSs (DOS, Windows, Mac), and providing assistance to script-driven services (a peculiar strength of Unix systems). If any of you would like to try your hand, help out your fellow programmers, etc., please send your proposals to marc@rdpub.com. There, you managed to elicit a special Call for Papers. Thanks. — pjp
Dear Dr. Plauger:
I would like to respond to the letter from Gene Norris in the December 1994 CUJ, in regard to casting the return value of functions such as printf to type void when the return value is not used. I work on several different Unix-based machines, including the HP 9000 and Sun SPARC. I have the HP lint and Sun alint which, I believe, are reasonably recent. The void cast keeps these versions of lint from complaining about functions returning values that are ignored. To demonstrate, I present a version of the classic "Hello world" program:
#include <stdio.h>
#include <stdlib.h>

main ()
{
printf("Hello, world\n");
return EXIT_SUCCESS;
}
On the HP, invoking:
lint -Aa -abhx hello.c
results in the following putative error message:
function returns value which is always ignored
printf
(-Aa invokes the ANSI/ISO version; -abhx disables the respective lint flags on this version of lint.)
If the printf call is cast to void:
(void)printf("Hello, world/n");
when the same lint command is run, no error is reported. The same is true with the Sun alint.
I know of no other way to get rid of such superfluous error messages. With a major program, where a source file may have many functions (printf, strcpy, tmpnam, etc.) with return values that are not used, the lint output would be cluttered to the point that a meaningful error message might be overlooked. I consider it a nuisance to use the void casts, but a worse nuisance to filter through the unnecessary lint output.
Have I overlooked a simpler way to get rid of these messages?
Roger Wells
Lacey WA
Software Engineer, U.S. Intelco Networks
(Opinions expressed are my own, not my employer's)
I don't know of one. You are right to identify this idiom as one dictated by various lint programs. It is not a part of the C Standard. But that hardly matters if it's widespread practice. — pjp
Dear PJP:
(Ref: CUJ March 1995 letter from Skip Pletcher, asking for info on creating batch file scripts that can be run from Windows).
I would suggest Mr. Pletcher look into WIN-BATCH, a shareware program for Windows. It is a big improvement on DOS BATCH, runs under Windows, and can mimic many of Windows' user-interfaces, e.g. scrolling pop up lists; etc. I don't have any references immediately at hand, but he can find it among the SIMTEL/WUARCHIVE archives.
Regards,
Scott Daniels
daniels@minga.pn.com
CIS: 73201,670
Editor,
I much enjoyed your issue which focused on algorithms. Efficient algorithms have long been of interest to me. On the whole, I found the articles to be well done and informative.
An exception would be Blase B. Cindric's "When the 'Best' Alogrithm Isn't." Mr. Cindric's basic conclusion (using Quicksort against highly ordered data is not efficient) is quite correct, but I feel he misses the best solution by a very large margin. Indeed, his article fails to focus on the crux of the issue at all. He suggests that a simple insertion sort algoritm is the key to efficiency, but his technique leads to improvement only where the number of elements to be inserted is very small relative to the number of sorted elements.
The real key to efficiency in a situation where unsorted data, U, is to be merged with sorted data, S, is to first sort the unsorted data! If efficiency is the goal:
1) Sort the unsorted U data (probably using Quicksort).
2) Merge the two sorted arrays starting with the last elements so that each element need only be moved a single time. This provides for a huge efficiency gain over the algorithm provided by Mr. Cindric. (In the worst case, the Cindric algorithm would move each data element U times!)
3) Find the insertion points using a binary-search algorithm that operates only on that portion of the array which has not been covered by previous insertions. This too can be a huge efficiency gain relative to Mr. Cindric's algorithm — particularly if the comparison function requires disk hits and/or extensive CPU resources. It is the difference between fewer than U * log S comparisons and U * S comparisons.
4) Consider using memcpy to move those runs of elements from each sorted array which fall between elements in the opposing array.
Is this more complicated? Somewhat, but the algorithm described above will be more efficient than either a simple Quicksort or Mr. Cindric's algorithm in essentially all cases that sorted and unsorted data are to be merged — even where the unsorted array is large in comparison with the pre-sorted array.
M.E. Stanley
Source Code Systems
Wichita KS 67206
I have learned through humbling experience that an algorithm can almost always be improved upon with a bit more thought. (Witness what K.B. Williams reveals about my earlier attempts at sorting, elsewhere in this issue.) At least we agree that Cindric's basic conclusion is correct — if you know that the data has special properties, that should color what you think of as the "best" algorithm for dealing with it. — pjp
Dear CUJ,
I agree with Mr. Cindric's main point in his article "When the 'Best' Algorithm Isn't," CUJ March 1995, namely that one can get better performance by tuning general algorithms for specific implementations. I have a few additional comments.
I would expect a merge/insert sort to perform better for the case cited in his article, where a small unsorted list is merged with a much larger sorted one. By pre-sorting the small list, it can be merged with the larger list in one pass. Based on the list sizes shown, from 5 to 200 times fewer compare and move operations would be required. The article assumed that a Quicksort routine was already available.
This method would require an additional small buffer to hold the new, sorted items during the merge/insert pass. This buffer could be of fixed size, if desired, in exchange for multiple merge/insert passes through the large list. Further heuristics in the style of Mr. Cindric's analysis would reveal how to best balance size of this buffer and the number of passes for optimal performance.
Mr. Cindric concludes that one can do better by taking "advantage of any special properties you know about the data set when choosing an algorithm." I'd like to point out the implicit assumption that the data set is then constrained to always have these "special properties." The example in the article meets this criterion.
Over time, though, useful code evolves. Perhaps someone "improves" it. Or maybe the input data properties change. What if the code is modularized deeply, say as a method within a C++ object? Good internal documentation and documentation maintenance (things we all do, right?) are necessary to prevent the input requirements from being lost or misrepresented. In contrast, a generic algorithm needs less documentation.
I look forward to more easily digestible, thought-provoking, tidbit-style articles like this. Thank you, and keep up the good work.
Mike Cepek
cepek@mgi.com
Thanks for the reminder. The most robust solution for the long haul is often not the most optimal solution for the situation today. — pjp
Dear Editor:
The article "When the 'Best' Algorithm Isn't" by Blase B. Cindric (CUJ March 1995) should also have mentioned another possible problem with Quicksort in certain situations. A lot of the performance in Quicksort depends on the pivot point used to split the original data into two subsets. If the data were random, then any choice for a pivot point would be as good as any other.
However, real data tends to have some pre-existing order in it. This makes the middle value a good choice for a pivot point, but not always. One way around this problem is to use a sample of five or so data values and take an average to construct a pivot point. This value might not even be in the data, but it will still work.
If the data to be sorted increases for the first half of the original file and then decreases for the second half, you will get a worst case performance from it. The sampling trick will not help, since the sample in a symmetric data set will tend to be the middle point.
The best improvement on Quicksort is to combine it with the Bubble or pairwise exchange sort, which is known to be the worst possible algorithm. As the left and right pointers move toward each other, let each of them be used to do pairwise exchanges on their side of the pivot point. This swapping is quite fast and can often be done with machine level instructions.
Yours,
Joe Celko
Atlanta GA
CompuServe: 71062, 1056
Veteran writer Celko makes some good observations, as usual. Again, see the article by K.B. Williams on sorting. — pjp
Dear Mr Plauger,
I was very interested to read Mr Holst's letter in the March 1995 issue of CUJ. Although I do not use the Borland C++ compiler myself, I would assume that it is broadly comparable with Microsoft's Visual C++. I would suggest that there is probably a compiler option that allows the user to use a large memory model, thus getting round the 64 Kbyte limit of the medium memory model which (I would guess) Mr Holst is probably using. Certainly, Visual C++ has such an option.
If Mr Holst is not already aware of it, he might like to read the article that I wrote and you published in September 1994, entitled "Using Windows Memory Management Services." For Mr Holst's benefit (as I am copying this message to him), this presented a set of C++ classes for accessing Windows global memory services. He might also like to read the exchange of letters between myself and Mr. Larsson, published in the January 1995 issue of CUJ, concerning this article.
Thanks for an interesting and educational magazine.
David Singleton
100265.3625@compuserve.com
P.J.,
In the "We Have Mail" section of the March 1994 issue you fielded a letter from a Niels Holst in reference to new and the Windows programming environment. I have responded to him directly on his implementation. The compiler already directly supports what he wants to do:
type far *s = new far type[[(args)]];
Be that as it may, I wanted to respond to you directly about your suggestion that the placement syntax of operator new be used. In the last week I have fielded two queries in the BCPPWIN forum on CompuServe directly relating to a side-effect the use of the placement syntax has when exceptions are involved.
When an exception is thrown in the constructor of a class the expected behaviour is that if new was used to allocate the object then the object will automatically be deleted. The ARM, however, has this to say starting on page 62:
"5.3.4 Delete
......
The effect of applying delete to a pointer not obtained from the new operator without a placement specification is undefined and usually harmful."
Basically speaking, when the placement syntax of new is used to allocate an object, then from within the constructor a compiler currently has no decent way to determine whether the object can, in fact, be safely deleted. The placement new syntax implies that the object in question might have been constucted on top of a static or automatic variable, so there's no way for it to safely delete the object in a generic manner. The delete, therefore, does not occur.
In terms of implementation, this behaviour derives from the common practice of having the constructor perform the actual allocation of a newed object. When a constructor is called it receives a null pointer if the object was allocated via a non-placement new statement, and a non-null pointer to the object's storage space in all other cases (including placement new).
The constructor itself calls ::operator new (or a class-specific overload of operator new) to get the required storage space if this pointer is null, and the original null pointer serves as an indicator that, should an exception be thrown, the object should be deleted. If it's not null, the object is not deleted.
However, changing this mechanism would not serve to clear up this side effect since the presence of the placement syntax implies the the object might be unsafe to delete. Only an inspection of the implementation of the specific operator in question will determine if it can or cannot be deleted, and there's no guarantee that that implementation will be visible at the time when such an inspection might be required.
Using placement new as a generic means of allocating objects is therefore dangerous in an ANSI-compliant C++ compiler (such as BC++ 4.0) in that it can and will result in unexpected memory leaks.
Paul LeBlanc
72662.1326@compuserve.com
Yes, the interaction of construction, destruction, and exceptions is fraught with peril. I get headaches trying to think through styles for writing robust code that can throw exceptions. You've just reminded me of an old headache that had faded from memory. Thanks, I think. — pjp
Dear editor,
The article on genetic algorhithms in the March 95 issue of CUJ delighted me. I had just completed my own GA in C++ and enjoyed the chance to compare my code with Keith Grant's. I am a hobbyist programmer with a yen for such challenges.
Not having read much technical literature on GAs, I decided to adopt a simple scheme: mate the top half of a sorted fitness array, and throw the results into the bottom half. The least-fit half just gets discarded. Then mutate the whole thing, and sort by fitness again. For crossover I used a random splice point instead of random uniform crossover through a mask. To my surprise the GA worked, but tended to oscillate. By mutating only those chromosomes below the top fit few, I fixed this, and managed to achieve steady convergence.
Next I decided to apply the knapsack problem exactly as coded in the article. Imagine my surprise when my GA seemed to out-perform Keith Grant's version. My GA returned the correct answer (weight = 17, value = 24) after only 20 generations on average. This was too good to be true, so I did some checking and realized that each of his generations involved 2 new chromosomes while each of mine produced a full one-half population renewal. With my GA, 20 generations with a population of 32 gives 32 + (1/2 * 32 * 20) = 352 new chromosomes, which was in close agreement with Keith's parameters of 350 out of a search space of 16,000.
The above confusion led me to look at the odds a bit. If one takes a 14-bit GA string, there are 16,384 possible answers. The duplication of knapsack items, however, allows for mutiple solutions. I found seven exact solutions to this simple problem, so the odds go down as low as one in 16,384/7 = 2,340; i.e. any random chromosome has a one in 2,340 chance of being an acceptable solution. Therefore, for 350 randomly chosen chromosomes, the probability is one in 2,340/350 = 6.69. This implies to me that both of our algorhithms perform about six times better than chance! While implementing the knapsack encoding, I discovered that results were very dependent on the fitness function. Changing the penalty levels in relation to positive reinforcement can produce different behavior. The GA was sensitive to variation in mutation rates also. My GA functions better at about 10% while Keith was using 20%. This led me to the conclusion that the attempt to optimally abstract a universal GA engine from all problem domains may be difficult at best. How do you solve problems without encoding specific information about the problem into the problem solver? Won't there always be subtle side-effects or cross-currents between problem and problem-solver? Are GAs basically trivial, while the fitness functions and coding schemes contain all the creative juice?
Every attempt to boost efficiency seemed to add more coding with less return, while the simplest of crossover-mutation concepts seemed to get the job done. I tried to incorporate rank biased probability breeding into my code as Keith Grant does, but results became more sporadic. Sometimes convergence was quick and sometimes not. I wonder if different problems will establish different dynamics. At any rate, the article posed a lot of questions to me. My thanks go to Keith for the inspiration and insight into this fascinating topic.
Swinton Roof
Memphis TN
Thanks for the feedback. — pjp