Article	Figure 1	Figure 2	Figure 3	Table 1
Table 2	apr2007.tar

Setting Up a Compute Cluster Using Grid Engine

Rayson Ho

A sports car is all about performance, yet it has much lower throughput than a truck. Consider trying to move with a sports car -- it takes many round trips because the car can carry so little in comparison to a truck. I know, because I've done it. You get the same saturation when you have lots of computational jobs to run. Although a high-performance cluster can get each job done very quickly, the throughput is way less than that of a high-throughput cluster. Thus, it actually takes less time to finish all the jobs on a high-throughput cluster even though each job may take longer to finish.

The job scheduler is the brain of a high-throughput cluster. Without a job scheduler, a high-throughput cluster is just a network of workstations. A job scheduler handles job submissions from users with different priorities (users in different workgroups). It takes care of jobs with different hardware and software resource requirements. It handles job interdependencies and workflows. When there are no free machines available, the job scheduler queues up the job; and when a machine becomes available, the job scheduler starts a new job from the queue. Grid Engine is the most common job scheduler with all the functionalities. But unlike renting a moving truck, Grid Engine is free and open source.

About Grid Engine

In 2000, Sun Microsystems acquired Gridware, Inc. Gridware developed the resource management software CODINE and GRD. In August 2001, Sun released the source code under SISSL (Sun Industry Standards Source License) and renamed the project Grid Engine. Since its release to the open source community, Grid Engine has been ported to new platforms, and new features have been added by the community in each release. Furthermore, Sun continues to hire developers to enhance it and markets it as N1 Grid Engine. Periodically, new add-on modules are donated by Sun back to the project. Grid Engine was used to schedule rendering jobs for the movie The Ant Bully. It is also the job scheduler for the Sun Grid, and several Linux cluster distributions include Grid Engine as the default job scheduler.

Architecture

The simplest Grid Engine cluster has one master host and one execution host. The job container -- the qmaster -- and the scheduler daemons run on the master host. The qmaster daemon accepts and queues up job submissions from the users. Periodically, qmaster updates the scheduler with the job and machine load information. The scheduling decisions made by the scheduler daemon are then sent back to qmaster, which will send the jobs to the targeted execution hosts (see Figure 1 and Table 1).

When the execution daemon, the execd, on the execution host receives a job request from the qmaster, it runs the job under the user ID of the job's owner. And when the job is done, it notifies the qmaster and returns the job execution results. Note that the master host is by default an administration host. Furthermore, a host can have all four roles, or in other words, you can configure the master host to run jobs. Although this configuration is not recommended for very busy clusters (those with thousands of submitted and dispatched jobs per day), many clusters are configured this way to utilize the master host's computation power.

While a single execution host cluster cannot handle a large number of jobs concurrently, this is actually the configuration that most people have when they first get started with installing and using Grid Engine. More execution hosts can be added after the initial install. My advice is to start small and become familiar with Grid Engine before deploying in the real production environment.

Installation

System Requirements

While it is possible to run Grid Engine on less powerful machines, it is recommended to have at least the following available disk space and memory:

	Master host	Execution host
-----------------------------------------------------------------
Free Memory	100MB	20 MB
Available Disk space	500MB	500MB

For clusters with more than several hundred execution hosts, and/or millions of dispatched jobs per month, it is recommended to have a powerful two-way machine as the master host. This way, the qmaster daemon and the job scheduler can each run on a separate processor. Also, the qmaster is fully multi-threaded, thus it can take advantage of any additional processors available in the system. In general, the busier the cluster is, the more powerful the master machine needs to be. In terms of memory requirements, the qmaster process can consume more than 2GB when there are hundreds of thousands of submitted jobs. Several production Grid Engine clusters have more than 1000 execution hosts, with millions of jobs dispatched per month, and dual-Opteron master hosts handle the workload quite nicely.

Operating Systems Supported

Grid Engine supports all the common operating systems and hardware platforms. As an open source project, the PowerPC Linux, the BSD, and the MacOSX ports were done by the Grid Engine community.

The Grid Engine project homepage has pre-compiled binaries for the common platforms available for download. For the less common ones, however, compiling from source is the only option (see Table 2). Last but not least, the availability of a shared file system (NFS) mounted on both the master and the execution hosts is highly recommended.

Getting Started

Installing a Grid Engine cluster usually takes less than half an hour. You just need to follow these steps carefully.

1. Go to the Grid Engine Project download page. Download the common files tarball. This tarball contains the install scripts and platform independent configuration files.

2. Download the platform-specific binary and data files tarball(s). Note that you will need to download a platform-specific tarball for each platform if you have a heterogeneous cluster.

3. Create the sgeadmin account on the master host and all the execution hosts. Note that the user ID (uid) must be the same cluster-wide, or else it will cause problems when Grid Engine writes to the shared filesystem (NFS). Using NIS may be easier for clusters with a large number of execution hosts.

4. Unpack the tarballs in the shared file system using the sgeadmin account in directory on the shared file system; this directory will be your SGE_ROOT.

sgeadmin> gunzip -c \
  sge-6.0u9-common.tar.gz |tar vxf -

sgeadmin> gunzip -c \
  sge-6.0u9-bin-lx24-amd64.tar.gz | tar vxf -

5. Log in as root and execute the qmaster install script:

# ./install_qmaster

The program will walk you through the basic cluster setup interactively with detailed explanations in each screen. Usually, the default value works, but there a few things you will need to know beforehand:

In the Grid Engine TCP/IP service screen, two unused port numbers are asked for: one for qmaster and one for execd. It is best to use the IANA assigned ports 6444 and 6445. Note that newer versions of /etc/services have the entries for the Grid Engine ports, and thus the install program will use those instead of asking.

Grid Engine supports multiple cells, with each cell roughly equivalent to an independent cluster. However, for most installations, it is better to configure only one cell and leave the name of the cell as "default". If multiple clusters are needed, it is best to pick a new SGE_ROOT for each cluster.

In the Setup spooling screen, you will be asked for a spooling method, and the default is berkeleydb spooling, which is not supported for NFSv3 or earlier versions. Unless the shared file system is NFSv4 or later, you will need to choose classic spooling.

If everything goes well, the install program will start the qmaster and scheduler daemon:

Grid Engine qmaster and scheduler startup
-----------------------------------------
Starting qmaster and scheduler daemon. Please wait
   starting sge_qmaster
   starting sge_schedd
Hit  to continue >>

6. Log in to the execution host as root and execute the install script:

# ./install_execd

Again, the default values usually work, so just press enter to pick the default values, and you will get the execd daemon up and running on the execution hosts:

Grid Engine execution daemon startup
------------------------------------
Starting execution daemon. Please wait ...
   starting sge_execd
Hit <RETURN> to continue >>

7. Run the install_execd script on the master host if you want it to run jobs.

Setting up the User Environment

Grid Engine command-line tools require several variables in the shell environment. Put the following line in the login script so that the environment will be set up correctly every time you log in.

For C shell:

source /default/common/settings.csh

For Bourne shell:

<your SGE_ROOT PATH>/default/common/settings.sh

Setting up a submit host with qconf

The qconf command can be used for changing the cluster configurations. To add a submit host, you can use the -as switch (i.e., add submit host).

# qconf -as td189
td189 added to submit host list

Similarly, an administrative host can be added with the -ah switch.

A Few Basic Grid Engine Commands

By now you should have a Grid Engine cluster running. So, let's run some basic Grid Engine commands. Here are the main ones I will discuss:

qsub -- Submit a batch job to Grid Engine.
qstat -- Show the status of Grid Engine jobs and queues.
qhost -- Show the status of Grid Engine hosts, queues, jobs.
qdel -- Delete Grid Engine jobs from queues.

Submitting Your First Job

You will need to create a job script. A job script is basically a regular shell script, but Grid Engine directives can be embedded. When Grid Engine reads in the job script, it parses the directives and modifies the job's default behavior.

A job script can have thousands of lines, with any number of Grid Engine directives embedded, and can spawn one or more other scripts. But a simple job script can be as simple as:

#!/bin/sh

sleep 1000
exit 0

A number of examples are in examples/jobs. To submit a job, you will use the qsub command:

rayson> qsub examples/jobs/sleeper.sh
Your job 1 ("Sleeper") has been submitted

The qsub command prints the job id assigned by the qmaster. The job id is unique within a cluster. Since this is the first job, the job id is 1.

To verify that the job is submitted to the cluster, use the qstat command:

rayson> qstat
job-ID  prior   name    user    state submit/    queue  slots ja-task-ID
                                      start at
------------------------------------------------------------------------
     1  0.55500 Sleeper rayson  qw    11/26/2006 02:57:55       1

The qhost command will show you the load of the machines in the cluster (the following output was collected from a cluster with two dual-processor nodes):

rayson> qhost
HOSTNAME        ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
------------------------------------------------------------------------
global          -               -     -       -       -       -       -
td159           lx24-amd64      2  0.00    2.0G  410.5M    4.0G  108.0K
td189           lx24-amd64      2  1.00    3.9G  327.2M    6.0G     0.0

By default, qhost without any options shows the load information of all the hosts in the cluster.

When the job starts, the state of the job reported by qstat will change from "qw" to "r", meaning that it has changed from the queued and waiting state to the running state.

rayson> qstat
job-ID  prior   name     user   state submit/    queue  slots ja-task-ID
                                      start at
------------------------------------------------------------------------
     1  0.55500 Sleeper  rayson r     11/26/2006 03:17:55 all.q@td159  1

To delete a submitted or running job, use the qdel command:

rayson> qdel 1
rayson has deleted job 1

Otherwise, when the job finishes, the qacct command will show the job's history.

rayson> qacct -j 1
========================================
qname        all.q
hostname     td159
group        Domain Users
owner        rayson
project      NONE
department   defaultdepartment
jobname      Sleeper
jobnumber    1
taskid       undefined
account      sge
priority     0
qsub_time    Sun Nov 26 02:57:55 2006
start_time   Sun Nov 26 03:17:55 2006
end_time     Sun Nov 26 03:18:55 2006
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 61
...

Also note that the job's standard output and standard error are redirected to files in the home directory of the job's owner.

rayson> ls ~/sleep.*
sleep.e1  sleep.o1

Here are some more basic Grid Engine commands:

qhold and qrls -- Use qhold to tell Grid Engine to hold back jobs from execution, and use qrls to release the hold.
qrsh, qsh, qtcsh -- You can use the qrsh to start an interactive command, and Grid Engine will run it on the most lightly loaded host.

Advanced Topics

Setting up a small Grid Engine cluster or using the basic commands may seem easy, but what makes Grid Engine really powerful is that different scheduling and user management policies can be configured. Different sites have different scheduling requirements to match the resources available in the cluster, and the priority of different projects to which the jobs belong. I will only be able to cover some of the common situations here, and I will give pointers to other documents for more complex ones.

Automatic Job Re-run

By default, unless jobs are submitted with qsub -r y, jobs are not rerun even if they are dispatched on machines that crash in the middle of job execution. To enable automatic job re-run, do the following.

Set reschedule_unknown to 00:45:00 with qconf -mconf. Also run qconf -mq all.q, and set rerun to TRUE. Note that you can choose a period longer than 45 minutes (00:45:00), but if the time period is too short, then random network congestion can cause jobs to be rescheduled very often and affect the cluster's throughput.

Shadow Master

Since the master host handles all job scheduling and dispatching, any failure would cause job dispatch to stop. Furthermore, users would not be able to submit new jobs, and all Grid Engine commands would stop working. While dispatched jobs would continue to run, failure of the qmaster would cause a great impact in productivity. Luckily, the shadow master facility allows one or more shadow masters to monitor the qmaster and elect a new master machine if the original one stops responding.

The shadow master daemon can be run on any machines with administration role. When the qmaster stops updating the heartbeat file, the shadow master will start the qmaster and scheduler on the local host. Thus, the cluster will be able to function as long as at least one shadow master is not crashed.

The details on setting up a shadow master are documented in this HOWTO:

http://gridengine.sunsource.net/howto/shadow.html
maxujobs, max_jobs, and max_u_jobs

By default, there is no limit on the number of jobs a user can submit. Thus, a malicious user can keep on submitting jobs until the master host runs out of memory, essentially DoS'ing (Denial-of-Service) the whole cluster. To prevent this, there are two configuration parameters that the Grid Engine administrator can use:

max_u_jobs -- The number of active jobs each user can have simultaneously in Grid Engine.

max_jobs -- The total number of active jobs allowed in Grid Engine. A value of 0 means no limit; use qconf -mconf global to change the value in the configuration. Note that any unfinished jobs are counted as active jobs.

Similarly, maxujobs in the scheduler configuration controls the number of running jobs each user can have. Use qconf -msconf to change this value; a value of 0 means this limit is not set.

The Resource Quota feature in the next version (Grid Engine 6.1) provides administrators a flexible and powerful way to control resource consumption and limit.

Array Jobs

Quite often, you have a large number of mostly identical jobs to run, with the only difference being the input parameters and/or data sets. While it is possible to submit each as an independent job, Grid Engine has the array job facility to make managing these jobs easier by requiring only one array job to be submitted. You can stop or delete the whole array job with just one command, instead of having to stop or delete each of the jobs in the workflow.

Suppose you want to run 20 graphics-rendering jobs with 20 unique inputs, the ordinary way to do this is:

rayson> qsub render.sh data.1
rayson> qsub render.sh data.2
...
rayson> qsub render.sh data.20

Or, you can do this with an array job with 20 tasks:

rayson> qsub -t 1-20 render.array.sh data

You should organize the input files in such a way that it is easy to find inside in each array task. And, in the render.array.sh job script, you will need to find out which data file the current tasks should process by checking the environment variable $SGE_TASK_ID. The render.array.sh may look like:

#!/bin/sh

render.sh $1.$SGE_TASK_ID

Thus, the first array task will render data.1, and the second array task will render data.2.

Consumables and Load Sensors

The amounts of free memory on an execution host, or the number of software licenses, are examples of consumable resources. The resource is used by each job until it is completely consumed. And, when each of the jobs finishes, the resource is de-allocated and made available to the next job. For example, if there are only 10 software licenses available in the whole cluster, and more than 10 jobs requiring the software license are dispatched, then some of the jobs will not be able to run because they will not get the license checked out properly. A common practice is to create a load sensor to report to Grid Engine the amount of consumable resources available -- in our case, the number of software licenses. With this knowledge, Grid Engine can determine how many jobs should be dispatched.

Since this is a huge topic, I won't cover the details here. The configuration details are available at:

http://gridengine.sunsource.net/howto/consumable.html
http://gridengine.sunsource.net/howto/loadsensor.html
http://wiki.gridengine.info/wiki/index.php/Olesen-FLEXlm-Integration

Scheduling

Configuring scheduling policies is also a huge topic. A 23-page Sun Blueprint (Scheduler Policies for Job Prioritization in the N1 Grid Engine 6 System) written by a Sun Grid Computing Engineering Group developer is available at:

http://www.sun.com/blueprints/1005/819-4325.html

Graphical User Interface and Web User Interface

qmon -- qmon is a Motif-based X GUI for Grid Engine. Users can submit and manage jobs without touching the command line, and administrators can change configurations without remembering the command-line options.
xml-qstat -- xml-qstat is an open source project. Originally written by Chris Dagdigian on the BioTeam as a personal project while he was taking a Web Services and SOA night class, the xml-qstat is now a fully functional Web GUI for monitoring Grid Engine clusters. A number of sites are using it in production environments.

Before xml-qstat was released, a number of Web GUIs for Grid Engine existed. The biggest difference is that xml-qstat takes advantage of the XML format output of qstat, instead of parsing the text output, and thus xml-qstat is more robust (see Figures 2 and 3).

Application Programming Interface -- DRMAA

DRMAA (Distributed Resource Management Application API) is the default interface for application developers to create jobs within an application and submit them to Grid Engine, without the need to write wrappers to call any of the Grid Engine command-line programs. Applications can query job status and can wait for jobs to finish. Language bindings for C/C++, Java, Perl, Python, and Ruby are available.

DRMAA is supported by other job schedulers, including Condor, PBS (Portable Batch System), Gridway, and XGrid. Currently, the EGEE project is adopting DRMAA. Thus, an application that uses DRMAA to create jobs can interface with most of the common job schedulers. An interesting example is the product RepliCator by eXludus that handles data movement in compute clusters. DRMAA is also used by meta-schedulers to interface with local job schedulers and form a compute grid to further share and load balance jobs between clusters.

Conclusion

Grid Engine is an open source job scheduler used in many high-throughput clusters. With the continuous development resources from Sun, and contributions from the open source community, the feature list just keeps growing in each Grid Engine release. It is impossible to document all the useful features in Grid Engine in one article. For those who want to learn more, joining the Grid Engine mailing list is a good way to go beyond the first step.

References

N1 Grid Engine -- http://www.sun.com/software/gridware/

Grid Engine project homepage -- http://gridengine.sunsource.net/

Sun independent Grid Engine page -- http://www.gridengine.info/

Grid Engine wiki -- http://wiki.gridengine.info/wiki

xml-qstat project -- http://www.xml-qstat.org/

DRMAA work group -- http://www.drmaa.org/

BioTeam -- http://www.bioteam.net/

Rayson has been doing system programming since 2000. He is interested in parallel systems, operating system internals, and processor micro-architectures. When not in front of the computer, he enjoys sailing, and jazz music. He can be reached at rayrayson@gmail.com.