Rich can be contacted at rich@amtec.com.
The S language is a high-level, object-oriented system designed for data analysis and graphics. Originally written by Richard A. Becker, John M. Chambers, and Allan R. Wilks of AT&T Bell Laboratories' Statistics Research Department, the S language is useful for a wide range of applications. In fact, most current S users aren't involved with statistics, and most S applications focus on basic quantitative computations and graphics.
S is relatively easy to work with. In its simplest form, you type an expression, and S evaluates it and displays the answer (something like a desk calculator). However, S can operate with large collections of data at once, so one expression might produce a graph, fit a line to a set of points, or carry out similarly complex operations.
A commercial implementation of the S language can be found in the S-Plus data analysis and statistics software from Mathsoft's StatSci Division (Seattle, WA). (The source code for S is licensed by Bell Labs, but distributed exclusively by StatSci.) S is available for systems ranging from Windows-based PCs to UNIX-based workstations (HP, SGI, NeXT, Sun, and others).
The S-Plus system consists of the S language, 1200 or so language extensions that deal with statistical and mathematical analysis functions, and a development environment. More specifically, the major areas in which S-Plus extends S are time series, survival analysis, "modern regression" (including LMS regression and projection-pursuit regression), classical statistical tests, graphic-device drivers, and dynamic loading. All of the examples presented in this article are based on the S-Plus implementation.
The real advantage of the object-oriented approach is evident when designing a large system that will do similar, but not identical, things to a variety of data objects. By specifying classes of data objects for which identical effects will occur, you can define a single generic function that embraces the similarities across object types, but permits individual implementations or methods for each defined class. For example, if you type a print(object) expression, you expect S to print the object in a suitable format. If all the various predefined printing routines were combined into a single function, the print function would need to be modified every time a new class of objects was created. With object-oriented programming, however, print is truly generic; it need not be modified to accommodate new classes of objects. Instead, the objects carry their own methods with them. Thus, when you create a class of objects, you can also create a set of methods to specify how those objects will behave with respect to certain generic operations.
In S, both character vectors and factors are originally created from vectors of character strings, and when printed, both give essentially the same information; see Listing One. The distinct look of the printed factor arises because factors are a distinct class of object, with their own printing method, the function print.factor.
Generic functions (that is, functions such as print or plot that take an arbitrary object as argument) in S tend to be extremely simple thanks to the utility function UseMethod, an internally implemented function that finds the appropriate method and evaluates a call to it. As shown in Listing Two (a), the typical generic function consists of a single call to UseMethod. When the generic function is called, UseMethod determines the class of the argument x, finds the appropriate method, then constructs and evaluates a call of the form method(x, ...), where "..." represents additional arguments to the method.
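As a minimal sketch of this mechanism, the following defines a generic whose body is a single call to UseMethod, plus one method for factors. The names describe and sep are hypothetical, chosen only for illustration; the code assumes an S dialect with the class mechanism described here (it was written against R), so details may differ between releases.

```r
# Hypothetical generic: one call to UseMethod, like print and plot.
describe <- function(x, ...) UseMethod("describe")

# Method for class "factor". 'sep' is an extra argument that reaches
# the method through the generic's "...".
describe.factor <- function(x, sep = ", ", ...) {
    paste(levels(x), collapse = sep)
}

shades <- factor(c("White", "Black", "Gray"))
described <- describe(shades, sep = " | ")   # dispatches to describe.factor
described
```

The call names only the generic; the choice of describe.factor is made by UseMethod from the class of the first argument.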
Although most generic functions have the simple structure shown in Listing Two (a) for print and plot, a slightly more complicated definition is sometimes needed. For example, the assign generic function stores objects in different classes of databases; therefore, it's important to dispatch not on the class of the object being assigned but on the class of the database receiving it. The call to UseMethod therefore has a second argument specifying which argument of assign is to be searched for its class attribute; see Listing Two (b).
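The same idea can be sketched with a made-up generic. Here store, logbook, and the returned strings are all hypothetical; the only point is that the second argument to UseMethod selects which argument supplies the class. The sketch assumes R-compatible S semantics.

```r
# Hypothetical generic that dispatches on 'where' (the container),
# not on 'value' (the object being stored).
store <- function(value, where, ...) UseMethod("store", where)

store.default <- function(value, where, ...) "stored in an ordinary container"
store.logbook <- function(value, where, ...) "stored in a logbook"

book <- structure(list(), class = "logbook")

in.book <- store(factor("a"), book)     # class of 'book' picks the method
in.list <- store(factor("a"), list())   # the factor's class is ignored
```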
The browser function acts generically when called with an argument, but has a specific action when called with no arguments (in part because you need an argument to find a method). This is embodied in its definition, as in Listing Two (c).
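A stripped-down sketch of the same pattern follows. The generic status and its messages are hypothetical; the structure (test nargs(), dispatch only when there is something to dispatch on) mirrors the description above and assumes R-compatible S semantics.

```r
# Hypothetical generic with a fixed action when called with no arguments.
status <- function(object, ...) {
    if (nargs()) UseMethod("status")
    else "called with no arguments"
}

status.default <- function(object, ...) "called with an object"

no.args   <- status()      # no argument, so no dispatch takes place
with.args <- status(1:3)   # dispatches to status.default
```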
In S, an object's class attribute determines its method. If the class attribute is missing, the default class is assumed. For example, factors are of class factor, while vectors, having no class attribute, are of class default. (Data types that existed before S-Plus 3.0 have no class attribute, because classes and methods were new with that release. Thus, vectors, matrices, arrays, lists, and time-series objects are classless.)
A class attribute is just a character vector of any length. The first element in the class attribute is the most-specific class of the object. For example, an ordered factor has class attribute c("ordered", "factor"), and is said to be class "ordered." (Ordered factors have a specific level ordering.) Subsequent elements in the class attribute denote classes from which the specific class inherits.
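A quick way to see this is to inspect the class attribute of an ordered factor. This is a small sketch that assumes the ordered generator and the inherits function behave as they do in R; the variable name sizes is arbitrary.

```r
sizes <- ordered(c("small", "large", "small"), levels = c("small", "large"))

all.classes   <- class(sizes)      # most-specific class comes first
most.specific <- class(sizes)[1]
is.a.factor   <- inherits(sizes, "factor")   # does it inherit from factor?
```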
Methods are named using the convention action.class, where action is the name of the generic function, and class is the class to which the method is specific. For example, plot.factor is the plot method for factors, and is.na.data.frame is the missing-value test method for data frames.
If the most-specific class of an object has no method, S searches the classes from which the object inherits for the appropriate method. Every class inherits from class default, so the default method is used if no more-specific method exists.
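The search order can be sketched with a generic that has a method for factors and a default method, but none for ordered factors. The generic kind and its return strings are hypothetical, and the sketch assumes R-compatible S semantics.

```r
kind <- function(x, ...) UseMethod("kind")   # hypothetical generic

kind.factor  <- function(x, ...) "handled by kind.factor"
kind.default <- function(x, ...) "handled by kind.default"

sizes <- ordered(c("small", "large"))

# There is no kind.ordered, so the search moves on to the inherited class.
from.ordered <- kind(sizes)
# A plain numeric vector has no specific method, so the default is used.
from.numeric <- kind(3.14)
```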
Inheritance lets you define a new class using only those features that distinguish it from classes from which it inherits. To take full advantage of this, you must define methods incrementally so that a specific method can act like a pre- or postprocessor to a more general method. For example, a method for printing ordered factors should be able to draw on an existing method for printing factors. This is done via NextMethod, which finds the next most-specific method after the current method and creates and evaluates a call to it. Like UseMethod, NextMethod is internally implemented.
For instance, Listing Three (a) is the definition of print.ordered. Like all print methods, print.ordered returns its first argument. Values for all methods should be compatible. In this case, the call to NextMethod finds the function print.factor. print.ordered appends the ordered levels to its output. The specific method for ordered factors is a postprocessor to the method for factors in general, and most of print.factor is preprocessing for print.default; see Listing Three (b).
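The chaining of methods through NextMethod can be sketched without any printing at all. In this hypothetical example (the generic label and the classes special and general are invented for illustration), each method adds its own piece and hands the rest of the work to the next method in line. It assumes R-compatible S semantics.

```r
label <- function(x, ...) UseMethod("label")   # hypothetical generic

label.default <- function(x, ...) "object"
label.general <- function(x, ...) paste("general", NextMethod("label"))
label.special <- function(x, ...) paste("special", NextMethod("label"))

obj <- structure(list(), class = c("special", "general"))

# label.special calls label.general, which calls label.default.
full.label <- label(obj)
```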
To build objects of a specific class, you need to define a constructor (or generator) function. Typically, generator functions have the name of the object they create--vector, factor, and so on. Listing Three (c) is the definition of the factor generator function. Here, the generator function explicitly sets the class attribute. Not all generator functions produce objects with a nonnull class attribute. For example, numeric generates numeric vectors, which have no class attribute. You can view the class of any object with the class function, as in Listing Three (d), or you can modify it by using class on the left side of an assignment, as in Listing Three (e). However, modifying the class attribute should not be done lightly: Assigning a class implies that the object is a compatible-value object of that class.
Object-oriented programming often distinguishes between the public (or external) view and the private (or internal) view of a class implementation. The public view is the conceptual view of the object and the functions that operate on it. Ideally, the casual user should not be concerned with the private view--the public view should be adequate for most situations.
When developing new methods, you must be clear at all times about which view you are using, because the private view, unlike the public view, is implementation dependent. If the implementation of a class changes, examine methods that use the private view to see if they are still valid. The private view is generally more efficient, particularly for the most commonly used methods, but public methods are easier to maintain.
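Factors give a convenient illustration of the two views. The sketch below assumes the representation shown in the factor generator (integer codes that index a levels attribute) and R-compatible S semantics; the variable names are arbitrary.

```r
shades <- factor(c("White", "Black", "Gray"))

# Public view: work only through functions defined for factors.
public <- as.character(shades)

# Private view: rely on the internal representation, that is,
# integer codes indexing the levels attribute.
codes   <- unclass(shades)
private <- levels(shades)[codes]
```

Both give the same labels today, but only the first would survive a change in how factors are stored.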
New classes in S are created by identifying one or more defining attributes (or, for objects derived from S's list type, defining components) shared by all members of the class, and then assigning a class attribute to all objects containing those attributes. The class attribute allows the new class to use the S generic dispatch mechanism.
As with many programming tasks, the key to successfully defining new classes is to abstract the identifying features of a given data object, clearly distinguishing objects within the class from those outside it. For example, a data object with the attribute dim is necessarily an array. Testing for this attribute is equivalent to testing for membership in the class.
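A membership test based on a defining attribute might look like the following sketch. The helper has.dim is hypothetical; the code assumes R-compatible S semantics.

```r
# Hypothetical membership test: an object is array-like if it has a dim attribute.
has.dim <- function(x) !is.null(attr(x, "dim"))

v <- 1:6
before <- has.dim(v)   # a plain vector has no dim attribute

dim(v) <- c(2, 3)      # adding the attribute makes it a 2 x 3 array
after <- has.dim(v)
```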
To illustrate new S-Plus classes, I'll define a class of graphical shapes. In this model, shapes are specified as a sequence of points. Open shapes, such as line segments and arcs, are specified by their endpoints. Closed shapes, such as circles and squares, are specified by starting points and points that uniquely determine the shape. A circle is specified as a center and a point on the circle. A square is specified by one corner and a side length, while a rectangle is specified by two diagonal corners.
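As a rough sketch of how one of the other shapes might be generated, here is a hypothetical rectangle generator that takes two diagonal corners as xy vectors. The function name, the component layout, and the class name are assumptions made for illustration, not definitions from the listings; the code assumes R-compatible S semantics.

```r
# Hypothetical generator: a rectangle from two diagonally opposite corners.
rectangle <- function(corner1, corner2) {
    val <- list(x = range(c(corner1[1], corner2[1])),   # left and right edges
                y = range(c(corner1[2], corner2[2])))   # bottom and top edges
    class(val) <- "rectangle"
    val
}

# The corners may be given in either order.
r <- rectangle(c(0.8, 0.1), c(0.2, 0.6))
```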
The goal of defining these shapes is to create a rudimentary freehand drawing using a graphics window. For this reason, I'll define the classes so that objects can be created easily using a sequence of mouse clicks via the locator function. Listing Four (a) is a generator function for circles. The circle function lets you express the circle in several natural ways, thanks to the helper function as.point, defined in Listing Four (b). You can give the center as either a list containing x,y components, as you might get from the locator function, or you can give it as an xy vector. You can give the radius as a scalar, or give a second point from which the radius can be calculated. Listing Five (a) shows how to define a simple circle from the S-Plus command line.
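When the radius comes from a second point, it is just the distance between the two points. The sketch below works that out for a list of the shape two mouse clicks would produce; the coordinates are made up for the example, and the code assumes R-compatible S semantics.

```r
# Two points in the x,y list form: the center first, then a point on the circle.
pts <- list(x = c(0, 3), y = c(0, 4))

# Distance between the two points, computed with diff on each coordinate.
radius <- sqrt(diff(pts$x)^2 + diff(pts$y)^2)
radius
```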
You store the circle as a list for ease of access to individual elements; however, the default printing for lists seems rather formal for a circle, where you only need to see a center and radius. Thus, it makes sense to define a method for use with the print generic function; see Listing Five (b). Listing Five (c) shows the simpler display that this method produces for the same circle.
When a method is defined, its arguments should match those of the generic. It may have extra arguments (hence the "..." built into every generic).
You define the draw function as a generic function; you can draw shapes with draw, and as long as you define appropriate methods for all classes of shapes, draw will operate correctly; see Listing Five (d). The call to UseMethod signals the evaluator that draw is generic. The evaluator should therefore first look for a specific method based on the class of the object, starting with the most-specific class and moving up through less-specific classes until the most-general class is reached. All S-Plus objects share the same general class, class default. Listing Five (e), for example, is a version of the method draw.circle. If you call draw with an object of class circle as its argument, the S evaluator finds the appropriate method and draws a circle on the current graphics device.
Three groups of S functions, all defined as calls to .Internal, are treated specially by the methods mechanism: the Ops group, containing standard operators for arithmetic, comparison, and logic; the Math group, containing the elementary, vectorized mathematics functions (sin, exp, and so on); and the Summary group, containing functions (such as max and sum) that take a vector and return a single summary value. Table 1 lists the functions in each group.
Rather than writing individual methods for each function in a group, you can define a single method for the group as a whole. There are 17 functions in the Ops group (19 if you count both the unary and infix forms of + and -) and 26 in the Math group, so the savings in programming can be significant. Of course, in writing a group method, you must ensure that it gives the appropriate answer for all functions in the group.
Group methods have names of the form group.class. Thus, Math.factor is the Math group method for objects of class factor, and Summary.data.frame is the Summary group method for objects of class data.frame. If the method handles all the functions in the group in the same way, it can be quite simple; see Listing Six (a). (One caution: The Summary group does not include either mean or median, both of which are implemented as S-Plus code.)
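A group method for the Math group can be just as short. In this sketch the class money and its generator are hypothetical; the method lets the built-in code do the arithmetic through NextMethod and then restores the class. It assumes R-compatible S semantics.

```r
# Hypothetical class with a trivial generator.
money <- function(x) {
    class(x) <- "money"
    x
}

# One method covers floor, round, sqrt, and the rest of the Math group.
# .Generic holds the name of the function that was actually called.
Math.money <- function(x, ...) {
    value <- NextMethod(.Generic)
    money(unclass(value))
}

amounts <- money(c(1.25, 2.75))
floored <- floor(amounts)
```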
The economy of the group method is still significant even if a few of the functions need to be handled separately. As an example of a nontrivial group method, I'll define a group of operators for the finite field Z7, which consists of the elements {a7=0, b7=1, c7=2, d7=3, e7=4, f7=5, g7=6} (the integers 0 to 6). The usual operations are defined so that any operation on two elements of the set yields an element of the set; for example, c7*e7 = b7 = 1 and d7/f7 = c7 = 2. Addition, subtraction, and multiplication are simply the usual arithmetic operations performed modulo 7, but division requires extra work to determine each element's multiplicative inverse. Also, elements of the field can be meaningfully combined with integers, but not with other real numbers or complex numbers.
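The two worked examples can be checked with ordinary integer arithmetic. This is only a sketch of the arithmetic, not of the class; the variable names are arbitrary and the code assumes R-compatible S semantics.

```r
# Multiplication is ordinary multiplication modulo 7: 2 * 4 = 8, and 8 mod 7 = 1.
product <- (2 * 4) %% 7

# Division needs the multiplicative inverse of the divisor.
# The inverse of 5 is the nonzero element k with (5 * k) mod 7 equal to 1.
candidates <- 1:6
inverse.of.5 <- candidates[(5 * candidates) %% 7 == 1]

# 3 / 5 is then 3 times that inverse, reduced modulo 7.
quotient <- (3 * inverse.of.5) %% 7
```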
You can define a new class, zseven, to represent the finite field Z7. Listing Six (b) shows the generator function for this. Listing Six (c) shows the value returned for a typical input vector. You suppress the printing of the class attribute by defining a method for print, as in Listing Six (d). But the significant part of the work is to define a group method Ops.zseven that will behave correctly for all 17 functions in the Ops group. Most of these are binary operations, so you define your method to have two arguments, e1 and e2, as in Ops.zseven <- function(e1, e2) {}. While performing calculations, you want to ignore the class of your operands, so you begin with the assignment e1 <- unclass(e1). You do not unclass e2 immediately, because the operation may be one of the unary operators (+, -, and !). You also test that e1 is a value that makes sense in Z7 arithmetic; see Listing Seven (a). (The object .Generic is created in the evaluation frame, and contains the name of the function being called.)
You can now include e2 in your calculations; division must be treated specially, but everything else passes on to the generic methods incorporated in S-Plus's internal code; see Listing Seven (b). Finally, ensure that numeric results are of class zseven, while logical results are passed back unchanged, as in Listing Seven (c). The complete method looks like Listing Seven (d). Alternatively, you can ignore the special case of division in the group method and write an individual method for division, as in Listing Eight.
Individual methods override group methods. In this example, the overhead of testing makes it simpler to incorporate the special case within the group method. A working version of the inverse function that these methods call can be defined as in Listing Nine. Listings Ten (a) and (b) test a few examples of this, and produce the expected answers.
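The precedence of individual methods over group methods can be seen in a toy example. The class tagged and the returned strings are hypothetical, and the methods do no real arithmetic; the sketch assumes R-compatible S semantics.

```r
# Hypothetical class with both a group method and one individual method.
Ops.tagged <- function(e1, e2) "group method"
"/.tagged" <- function(e1, e2) "individual method"

x <- structure(1, class = "tagged")

divided    <- x / x   # the individual method for "/" takes precedence
multiplied <- x * x   # every other operator falls through to Ops.tagged
```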
Replacement functions typically replace either an element or attribute of their arguments and appear on the left side of an S assignment arrow. S interprets the expression f(x) <- value as x <- "f<-"(x, value), so that the replacement function to be defined has a name of the form f<-. All replacement functions act generically: Methods can be written for them.
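A small sketch makes the translation concrete. The replacement function radius<- and the list shape are hypothetical, used only to show that the two forms of the call are equivalent; the code assumes R-compatible S semantics.

```r
# Hypothetical replacement function: sets the radius component of a list.
"radius<-" <- function(x, value) {
    x$radius <- abs(value)
    x
}

shape <- list(center = c(0, 0), radius = 1)

# The replacement form ...
radius(shape) <- 2.5
after.replacement <- shape$radius

# ... is interpreted as an ordinary call followed by reassignment.
shape <- "radius<-"(shape, value = 4)
after.call <- shape$radius
```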
In class zseven, you define a replacement to ensure that any new value remains in the class, that is, that all elements in an object of class zseven are from the set {0, 1, 2, 3, 4, 5, 6}. The public method in Listing Eleven accomplishes this; it does not use any special knowledge of the implementation of the class zseven, just the public view that zseven is simply the integers mod 7.
Becker, R.A., J.M. Chambers, and A.R. Wilks. The New S Language. London, U.K.: Chapman and Hall, 1988 (the "Blue Book").
Chambers, J.M. and T.J. Hastie. Statistical Models in S. London, U.K.: Chapman and Hall, 1992 (the "White Book").
Spector, P. An Introduction to S and S-Plus. Belmont, CA: Duxbury Press, 1994.
Table 1: Functions affected by group methods.

Group     Functions in Group
Ops       + (unary and infix), - (unary and infix), *, /, ! (unary not),
          sign, ^, %%, %/%, <, >, <=, >=, ==, !=, |, &
Math      abs, acos, acosh, asin, asinh, atan, atanh, ceiling, cos, cosh,
          cummax, cumsum, cumprod, exp, floor, gamma, lgamma, log, log10,
          round, signif, sin, sinh, tan, tanh, trunc
Summary   all, any, max, min, prod, range, sum

Copyright © 1995, Dr. Dobb's Journal

Listing One
> xxx <- c("White", "Black", "Gray","Gray", "White","White")
> yyy <- factor(xxx)
> print(xxx)
[1] "White" "Black" "Gray" "Gray" "White" "White"
> print(yyy)
[1] White Black Gray Gray White White
Listing Two
(a)
> plot
function(x, ...)
    UseMethod("plot")
> print
function(x, ...)
    UseMethod("print")
(b)
> assign
function(x, value, frame, where = NULL)
    UseMethod("assign", where)
(c)
> browser
function(object, ...)
if(nargs()) UseMethod("browser") else {
    nframe <- sys.parent()
    msg <- paste(deparse(sys.call(nframe)), collapse = " ")
    if(nchar(msg) > 30)
        msg <- paste(substring(msg, 1, 30), ". . .")
    browser.default(nframe,
        message = paste("Called from:", msg))
}
Listing Three
(a)
> print.ordered
function(x, ...)
{
    NextMethod("print")
    cat("\n", paste(levels(x), collapse = " < "), "\n")
    invisible(x)
}
(b)
> print.factor
function(x, quote = F, abbreviate.arg = F, ...)
{
    if(length(xx <- check.factor(x)))
        stop(paste("cannot be interpreted as a factor:\n\t", xx))
    xx <- x
    l <- levels(x)
    class(x) <- NULL
    if(abbreviate.arg)
        l <- abbreviate(l)
    if(any(is.na(x))) {
        l <- c(l, "NA")
        x[is.na(x)] <- length(l)
    }
    else x <- l[x]
    NextMethod("print", quote = quote)
    if(any(is.na(match(l, unique(x))))) {
        cat("Levels:\n")
        print.atomic(l)
    }
    invisible(xx)
}
(c)
> factor
function(x, levels = sort(unique(x)),
    labels = as.character(levels), exclude = NA)
{
    if(length(exclude) > 0) {
        storage.mode(exclude) <- storage.mode(levels)
        # levels <- complement(levels, exclude)
        levels <- levels[is.na(match(levels, exclude))]
    }
    y <- match(x, levels)
    names(y) <- names(x)
    levels(y) <- if(length(labels) == length(levels)) labels
        else if(length(labels) == 1)
            paste(labels, seq(along = levels), sep = "")
        else stop(paste("invalid labels argument, length",
            length(labels), "should be", length(levels), "or 1"))
    class(y) <- "factor"
    y
}
(d)
> class(kyphosis)
[1] "data.frame"
(e)
> class(myobject) <- "myclass"
Listing Four
(a)
circle <-
function(center, radius, point.on.edge)
{
    center <- as.point(center)
    val <- NULL
    if(length(center$x) == 2) {
        # Two points given: the first is the center, the second lies on the circle
        val <- list(center = list(x = center$x[1], y = center$y[1]),
            radius = sqrt(diff(center$x)^2 + diff(center$y)^2))
    }
    else if(length(center$x) == 1) {
        if(missing(radius)) {
            point.on.edge <- as.point(point.on.edge)
        }
        else if(is.atomic(radius)) {
            # Radius given directly as a number
            val <- list(center = center, radius = abs(radius))
        }
        else {
            # Radius given as a second point in list form
            point.on.edge <- as.point(radius)
        }
        if(is.null(val)) {
            # Compute the radius as the distance from the center to the edge point
            val <- list(center = list(x = center$x[1], y = center$y[1]),
                radius = sqrt((point.on.edge$x - center$x)^2 +
                    (point.on.edge$y - center$y)^2))
        }
    }
    class(val) <- "circle"
    val
}
(b)
as.point <-
function(p)
{
    if(is.numeric(p) && length(p) == 2)
        list(x = p[1], y = p[2])
    else if(is.list(p) && !is.null(p$x) && !is.null(p$y))
        p
    else if(is.matrix(p))
        list(x = p[, 1], y = p[, 2])
    else stop("Cannot interpret input as point")
}
Listing Five
(a)
> simple.circle <- circle(center = c(0.5, 0.5), radius = 0.25)
> simple.circle
$center:
$center$x:
[1] 0.5
$center$y:
[1] 0.5
$radius:
[1] 0.25
attr(, "class"):
[1] "circle" "closed"
(b)
print.circle <-
function(x, ...)
{
    cat(" Center: x =", x$center$x, "\n",
        "         y =", x$center$y, "\n",
        "Radius:", x$radius, "\n")
    # Like other print methods, return the first argument invisibly
    invisible(x)
}
(c)
> simple.circle
Center: x = 0.5
y = 0.5
Radius: 0.25
(d)
draw <-
function(x, ...)
    UseMethod("draw")
(e)
draw.circle <-
function(x, ...)
{
    center <- x$center
    radius <- x$radius
    symbols(center, circles = radius, add = T, inches = F, ...)
}
Listing Six
(a)
> Summary.data.frame
function(x, ...)
{
    x <- as.matrix(x)
    if(!is.numeric(x))
        stop("not defined on a data frame with non-numeric variables")
    NextMethod(.Generic)
}
(b)
zseven <-
function(x)
{
    if(any(x %% 1 != 0)) {
        x <- as.integer(x)
        warning("Non-integral values coerced to integer")
    }
    x <- x %% 7
    class(x) <- "zseven"
    x
}
(c)
> zseven(c(5,10,15))
[1] 5 3 1
(d)
print.zseven <-
function(x, ...)
{
    x <- unclass(x)
    NextMethod("print")
}
Listing Seven
(a)
# Test that e1 is a whole number
if(is.complex(e1) || any(e1 %% 1 != 0))
    stop("Operation not defined for e1")
# Allow for unary operators
if(missing(e2)) {
    if(.Generic == "+")
        value <- e1
    else if(.Generic == "-")
        value <- - e1
    else value <- !e1
}
(b)
else {
    e2 <- unclass(e2)
    # Test that e2 is a whole number
    if(is.complex(e2) || any(e2 %% 1 != 0))
        stop("Operation not defined for e2")
    # Treat division as special case
    if(.Generic == "/")
        value <- e1 * inverse(e2, base = 7)
    else value <- NextMethod(.Generic)
}
(c)
switch(mode(value),
    numeric = zseven(value),
    logical = value)
(d)
Ops.zseven <-
function(e1, e2)
{
    e1 <- unclass(e1)
    # Test that e1 is a whole number
    if(is.complex(e1) || any(e1 %% 1 != 0))
        stop("Operation not defined for e1")
    # Allow for unary operators
    if(missing(e2)) {
        if(.Generic == "+")
            value <- e1
        else if(.Generic == "-")
            value <- - e1
        else value <- !e1
    }
    else {
        e2 <- unclass(e2)
        # Test that e2 is a whole number
        if(is.complex(e2) || any(e2 %% 1 != 0))
            stop("Operation not defined for e2")
        # Treat division as special case
        if(.Generic == "/")
            value <- e1 * inverse(e2, base = 7)
        else value <- NextMethod(.Generic)
    }
    switch(mode(value),
        numeric = zseven(value),
        logical = value)
}
Listing Eight
"/.zseven" <-
function(e1, e2)
{
    e1 <- unclass(e1)
    e2 <- unclass(e2)
    # Test that e1 is a whole number
    if(is.complex(e1) || any(e1 %% 1 != 0))
        stop("Operation not defined for e1")
    # Test that e2 is a whole number
    if(is.complex(e2) || any(e2 %% 1 != 0))
        stop("Operation not defined for e2")
    zseven(e1 * inverse(e2, base = 7))
}
Listing Nine
inverse <-
function(x, base = 7)
{
    set <- 1:base
    # Find the element e2 of the set such that e2*x=1
    n <- length(x)
    set <- outer(x, set) %% base
    return.val <- numeric(n)
    for(i in 1:n) {
        return.val[i] <- min(match(1, set[i, ]))
    }
    return.val
}
Listing Ten
(a)
> x7 <- zseven(c(3,4,5))
> y7 <- zseven(c(2,5,6))
> x7 * y7
[1] 6 6 2
> x7 / y7
[1] 5 5 2
> x7 + y7
[1] 5 2 4
> x7 - y7
[1] 1 6 6
> x7 == y7
[1] F F F
> x7 >= y7
[1] T F F
> -x7
[1] 4 3 2
(b)
> -x7 + x7
[1] 0 0 0
Listing Eleven
"[<-.zseven" <-
function(x, ..., value)
{
    # any() lets the test work when several values are replaced at once
    if(is.complex(value) || any(value %% 1 != 0))
        stop("Replacement not meaningful for this value")
    x <- NextMethod("[<-")
    x <- x %% 7
    x
}