Open Source Meets Big Iron

Dr. Dobb's Journal June 2000

Open-source software finds a natural home

By Pete Beckman and Gregory V. Wilson

Pete worked at Los Alamos National Laboratory for many years, until leaving to become Technical Director of TurboLabs, a division of TurboLinux. He is now working on Linux clusters for high-availability and high-performance computing. Greg is a DDJ contributing editor and is coordinating the Software Carpentry project on behalf of CodeSourcery, LLC. He can be reached at gvwilson@ddj.com.

Dozens of studies during the last 20 years have shown that good working practices improve programmer productivity more than new languages, WYSIWYG interfaces, CASE tools, and other silver bullets (see, for instance, Steve McConnell's Rapid Development, Microsoft Press, 1996, ISBN 1556159005). Despite this, most programmers still start coding without a design, then go on to short-change testing and set wildly unrealistic delivery schedules.

One reason for this is that as long as companies can IPO for $100 million with flaky software, there is little incentive for programmers to raise their standards. Less cynically, good software-engineering skills aren't taught at universities, primarily because the academic system forces people to focus on publishability rather than good engineering. Finally, and crucially, existing tools make good practices harder to follow than they need to be.

As bad as this situation is for trained programmers, it is even worse for scientists and engineers. Scientists consider laboratory results valid only if equipment is calibrated, samples are free of contamination, and all relevant steps are recorded. Software, on the other hand, is rarely required to meet these standards, or any standards at all. Despite everyone's personal experience with buggy code, the correctness of scientific simulations is rarely questioned, and reproducibility is rarely -- if ever -- demanded.

Partly, this is because science and engineering students get even less exposure to good software-development practices than their peers in computer science. Once they graduate, staying on the leading edge in science or engineering is already more than a full-time job. Specialists in fluid mechanics, global climate change, and human genetics don't have time to learn how to quote shell variables in recursive makefiles.

Open Source, Open Science

The lack of software-engineering skills among scientists and engineers has become a critical bottleneck in many fields. Computer simulations are increasingly used to study problems that are too big, too small, too fast, too slow, too expensive, or too dangerous to study in the laboratory. Many scientists and engineers also now realize that publication and peer review of software need to be as integral a part of computational science as they are of experimental science (see "Catalyzing Open Source Development in Science," by J. Daniel Gezelter, Open Source/Open Science '99, http://www.openscience.org/talks/bnl/index.html).

The main stumbling block in this scenario is the degree to which scientists' and engineers' lack of software-engineering skills constrains their ability to develop and inspect software. Simply put, someone who does not know how to test software cannot tell whether someone else's software has been thoroughly tested. Similarly, successive waves of graduate students cannot contribute to a shared code base without a basic understanding of design, inspection, testing, and configuration issues. (Many of the difficulties encountered by past efforts to build "community codes" can be ascribed to this problem.) It will, therefore, not be enough to build better tools -- scientists also need examples of design documents, test plans, code reviews, and everything else that makes up good software engineering.

The Open Source model seems to be an elegant solution to these problems. Modern science, with its emphasis on sharing ideas and peer review, is in many ways the original open-source project. Open Source development can also provide scientists working in very specialized domains with a welcome degree of bankruptcy insurance.

For instance, supercomputing has long been caught in the following costly cycle:

1. Pay a small, specialized company several million dollars to develop an important piece of code.

2. Pay scientists to learn and port their software to the new product.

3. Watch the company be sold, go bankrupt, or move on to a different marketplace.

4. Pay a new company to develop a new version of the old software.

5. Pay scientists to port their old software to the new product.

6. Repeat for decades.

Beowulf and Extreme Linux

Issues such as these have been simmering for years, but a seismic shift in supercomputing has recently brought them to the forefront. In 1990, more than a dozen startups and large computer manufacturers were building "big iron" (see Past, Present, Parallel: A Survey of Available Parallel Computer Systems, edited by Arthur Trew and Greg Wilson, Springer-Verlag, 1991, ISBN 0387196641). Most of the startups were excited by the industry's need to design new plastics, simulate car crashes, or analyze global climate change. For established manufacturers, such as IBM and Fujitsu, supercomputing was an extension of their existing mainframe business; they delivered expensive machines to a small number of customers, along with onsite engineering support, enhanced compilers, and specialized software libraries.

As it turned out, the killer apps for both large and small computer manufacturers proved instead to be the Internet and desktop/server business solutions. The specialized supercomputer manufacturers folded or were bought, while the IBMs and Fujitsus concentrated on commodity machines, hardened servers, and Internet-commerce software. Even Cray Research, whose name had become synonymous with supercomputing, was eventually folded into SGI. This, and the decline of military research budgets at the end of the Cold War, led to a near collapse in the market for special-purpose high-performance hardware and software.

At the same time, the price of commercial off-the-shelf (COTS) components such as PC motherboards was plummeting, while their performance was doubling, and doubling again. In the early 1990s, various research projects were turning collections of desktop machines into powerful compute engines. Dollar for dollar, racks filled with dual-processor Pentium III motherboards can theoretically provide several times more crunch than special-purpose supercomputers. As a result, their popularity has skyrocketed.

One project in particular, Beowulf (http://www.beowulf.org/), focused on extremely cheap commodity components (Intel 486s) and the then-new Open Source operating system Linux. That combination turned out to be a winner -- so much so that machines of this kind are now often named after that project (see How to Build a Beowulf, by Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese, MIT Press, 1999, ISBN 026269218X).

Crucially, almost all Beowulf machines run Linux. This is not just because it is free (although that certainly helps their price/performance ratio), but also because it is completely open. No commercial operating system was designed to link thousands of separate machines: Fixed-size internal tables overflow, interprocess communication is painfully slow, and collective status monitoring is nonexistent. Issues such as these are much easier to address when the entire source base of the operating system can be inspected and modified. COTS supercomputing would simply be impossible without Open Source software.

A more recent variation on the Beowulf theme has been Extreme Linux (http://www.extremelinux.org/). Extreme Linux clusters usually use the fastest available memory hardware, and rack mounts instead of discrete cabinets. Most importantly, they use fast (that is, expensive) interconnection technology in order to improve overall performance on applications that are not easily parallelized. These clusters now offer the system software that supercomputer users expect, such as batch job schedulers, high-quality compilers, parallel debuggers, and global file systems.

Parallel Hardware Meets Serial Wetware

The biggest problem that cluster builders now face is configuring and building the complex software such systems require. It shouldn't, for example, take someone with a Ph.D. in fluid mechanics six weeks to figure out how to build an adaptive configuration script, but tools like autoconf are exceptionally cryptic (even by computing's generous standards), difficult to debug, and hard to maintain.
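The heart of what autoconf does -- probing the build environment by trying things and recording what works -- is conceptually simple. A minimal sketch in Python illustrates the idea; the compiler name "cc" and the single-header probe are assumptions for illustration, not autoconf's actual mechanism:

```python
# Minimal autoconf-style feature probe: try to compile a one-line C
# program that includes a given header, and record whether it worked.
# The compiler name "cc" is an assumption for illustration.
import os
import subprocess
import tempfile

def have_header(header, cc="cc"):
    """Return True if a trivial program including `header` compiles."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "probe.c")
        with open(src, "w") as f:
            f.write(f"#include <{header}>\nint main(void) {{ return 0; }}\n")
        try:
            result = subprocess.run(
                [cc, "-c", src, "-o", os.path.join(d, "probe.o")],
                capture_output=True)
        except FileNotFoundError:  # no compiler on the PATH
            return False
        return result.returncode == 0

defines = {"HAVE_STDIO_H": have_header("stdio.h")}
```

A scientist can read, debug, and extend a probe like this directly, which is precisely what the m4/shell layers of autoconf make so difficult.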

Given the high entry cost of writing good, maintainable software, many scientists choose to cobble something together to get them through to their next publishable result or funding milestone. Later, as the code is extended and ported to new architectures, the lack of adequate, easy-to-use tools pushes scientists to transform the code into an ever-less-maintainable set of rules and exceptions.

The cost of this is hard to estimate. One data point: the biggest machines at the U.S. national laboratories cost more than $100 million each, and their useful lifetimes are four or five years. This works out to roughly $2700 per hour for hardware alone! When such a machine sits idle for a couple of hours while someone reverse engineers a 10-year-old makefile to add a new thread library to a program, almost any improvement in productivity is clearly worth the investment.
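The arithmetic behind that figure is straightforward -- a quick sketch, amortizing the hardware price alone over every wall-clock hour of the machine's lifetime:

```python
# Amortized hardware cost per hour for a $100 million machine with a
# useful lifetime of four to five years, as described in the text.
HOURS_PER_YEAR = 365 * 24  # 8760

def hourly_cost(price_dollars, lifetime_years):
    """Hardware price spread over every wall-clock hour of the lifetime."""
    return price_dollars / (lifetime_years * HOURS_PER_YEAR)

for years in (4, 5):
    print(f"{years}-year lifetime: ${hourly_cost(100e6, years):,.0f} per hour")
# 4-year lifetime: $2,854 per hour
# 5-year lifetime: $2,283 per hour
```

The $2700 figure in the text falls near the middle of that range, and ignores power, staff, and facility costs entirely.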

Eight Megabytes and Constantly Swapping

What was once a small set of simple tools has grown complex and inconsistent, and is now being asked to solve problems far larger than its original authors envisaged. Few people would use a 25-year-old compiler; surely, we've learned enough in a quarter of a century about configuring, building, and testing programs to start work on a better toolkit.

Many people's first reaction to this suggestion is that these tools are so entrenched that they'll never be displaced. Of course, if you work in scientific computing, you can hear seemingly sane people say the same thing about Fortran-77...

The second reaction of many developers is to paint a GUI over existing tools. However, this doesn't address fundamental issues such as interoperability and expressiveness. The functionality of make, for example, is hard to get at if you want to use it programmatically -- its dependency detector and rule engine could be useful in many contexts, but since they're not implemented as Perl modules, COM objects, or something similar, developers must either write their own from scratch or do some heavyweight hacking. This can be done -- an IDE can churn out a makefile and launch make as a child process, for example -- but that's fundamentally the same as saying that Fortran-77 is Turing equivalent, and therefore adequate.
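To make the point concrete, here is a hypothetical sketch of what make's core might look like if it were packaged as an embeddable library: a dependency graph plus a rule engine that rebuilds only out-of-date targets. The class and its API are invented for illustration; nothing like it actually ships with make:

```python
# Hypothetical "make as a library": a dependency graph whose rule
# engine rebuilds a target only when it is older than a dependency.
import os

class BuildGraph:
    def __init__(self):
        self.rules = {}  # target -> (dependencies, action)

    def rule(self, target, deps, action):
        """Register how to build `target` from `deps`."""
        self.rules[target] = (list(deps), action)

    def out_of_date(self, target):
        """True if `target` is missing or older than any dependency."""
        deps, _ = self.rules.get(target, ([], None))
        if not os.path.exists(target):
            return True
        mtime = os.path.getmtime(target)
        return any(os.path.getmtime(d) > mtime
                   for d in deps if os.path.exists(d))

    def build(self, target, seen=None):
        """Depth-first: bring dependencies up to date, then the target."""
        seen = seen if seen is not None else set()
        if target in seen or target not in self.rules:
            return  # already visited, or a leaf source file
        seen.add(target)
        deps, action = self.rules[target]
        for d in deps:
            self.build(d, seen)
        if action and self.out_of_date(target):
            action(target, deps)
```

A tool built this way could be driven directly from an IDE, a test harness, or a five-line script, instead of generating a makefile and launching make as a child process.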

The basic problem is that more and more people (such as scientists and engineers) need to write software as part of their job, but the tools they are given are things that only a programmer could love. Both of us have Ph.D.s in computer science, and can use existing tools pretty well, but have accepted that they are simply not adequate for the other 99.9 percent of the population.

Open Source, Open Issues

One of the reasons the Los Alamos National Laboratory has set up the Software Carpentry project is to see how well the Open Source community can meet the needs of that 99.9 percent. To date, Open Source has mostly been about hard-core programmers building software for other hard-core programmers; there has been relatively little serious effort to build software for other communities. Even in its heartland, Open Source has not done much to make its offerings accessible to people who have better things to do than read source code.

And despite what some of its more enthusiastic advocates claim, the Open Source model does not inevitably lead to better software. When a commercial project runs into trouble, it is still usually shipped because the company needs to make some kind of return on its investment. When an Open Source project runs into trouble, on the other hand, it is usually just abandoned. This selectivity skews the apparent success rate in Open Source's favor. As enthusiastic as some scientists and engineers are about Open Source development, they need to be shown that it can meet their requirements.

Software Carpentry

This brings us to the Software Carpentry project (http://www.software-carpentry.com/). The aim of this project is to create a new generation of easy-to-use software engineering tools, and to document both those tools and the working practices they are meant to support. The Advanced Computing Laboratory at Los Alamos National Laboratory is providing $860,000 of funding for Software Carpentry in 2000-01, which is being administered by CodeSourcery, LLC. All of the project's designs, tools, test suites, and documentation will be made available under the terms of an Open Source license.

The first stage of the Software Carpentry project is a design competition, with $100,000 in prizes for entries in four categories: a configuration tool to replace autoconf, a build tool to replace make, an issue-tracking system, and a regression-testing harness.

These categories were selected because the working practices they support are essential to medium-scale software engineering. In addition, they are small enough that the project will be able to demonstrate results by the end of the first year. All entries will be published on the Web, along with comments from a 17-member judging panel that includes noted software developers, authors, and computational scientists.

We see several potential benefits from the design competition. First and foremost, applying the Open Source "thousand eyeballs" model to the design stage ought to lead to better designs. Second, it will produce examples of good design documents, test plans, and other artifacts (along with expert commentary) as a by-product.

Third, design competitions are a great way for up-and-coming developers to attract attention. Even if someone doesn't win, making it into the finals ought to catch the eye of people in need of good software architects.

Finally, we hope that if people believe that they are being listened to when something is being designed, they are more likely to throw their weight behind it when it is being implemented and deployed. There is a lot of fracture right now in some parts of the Open Source community (GUI toolkits, Linux desktops, and the like). We think that the right time to build consensus is before code is laid down, rather than after everyone has months of work to defend.

Once winners have been announced, Software Carpentry will fund their implementation, review, testing, and documentation. All tools will be required to run on both Linux and Windows NT, and be implemented primarily in (or be scriptable with) Python. Some people have questioned the project's decision to use a single language for implementing these tools. This was done to make the tools easier to install, maintain, learn, configure, and extend. Mixing languages might have made life easier for a few developers, but would certainly have made it harder for the majority of users.

Conclusion

The combination of commodity hardware and Open Source software has the potential to free supercomputing from its dependence on expensive few-of-a-kind machines. At the same time, better software engineering practices are essential if computational science is to stand beside its theoretical and experimental partners. We believe that the Software Carpentry project is a step in this direction, and hope that you will become involved in designing, implementing, and using a new generation of software tools.

DDJ