The S Programming Language

Dr. Dobb's Journal February 2000

By Al Stevens

Al is a DDJ contributing editor. He can be contacted at astevens@ddj.com.

Today is Judy's birthday. My brother Walt is visiting from his home in Jamaica, and I suggested we all go out for dinner and some bar-hopping to celebrate. He seemed surprised and asked wouldn't I prefer instead to be alone with Judy on this special night for "a quiet romantic evening together." I looked at him incredulously and, after a pause, blurted out, "But she's 58!" She was coming around the corner out of the kitchen at the time and heard the whole thing. It's been kind of quiet around here since then. This is going to cost me.

The Editor and its Scripts

In January of last year, I introduced a programmer's editor project (named unimaginatively "Editor"), which is meant to become the integrated editor in Quincy 99. Quincy is a Win32 integrated development environment that supports development of GUI and console Win32 applications. Quincy uses gcc-mingw32, an open source port of the GNU C/C++ compiler suite that runs under Win32 and supports calls to the Win32 API. You can find Quincy 99 and instructions for where to find the compiler at http://www.midifitz.com/alstevens/quincy99/ and from DDJ.

When I integrate the Editor program into Quincy 99, I'll also do a major facelift to the development suite including a Standard C++ library, improved debugging features, and better project management; Quincy 99 then becomes Quincy 2000. That is the plan. I'm still waiting for the gcc volunteers to finish the Standard library, which should be soon. I have a preliminary version of it and it looks good so far. In the meantime, I continue to work on the Editor. I first discussed that project here about a year ago. It's currently available as a standalone program, and I encourage everyone to download it, use it, and send comments to me about it. You will find the Editor at http://www.midifitz.com/alstevens/editor/.

User-Defined Improvements

A substantial number of programmers downloaded and tested Quincy 99, and, with their generous help, I made many improvements and corrections to the program.

Teachers and students around the world are using Quincy because it is a free Win32 development platform and because it resembles a high-end IDE. Several of these users have offered suggestions to improve Quincy's performance in the classroom environment, which often involves networks. Although its original purpose was to support a single user in a C++ self-teaching situation, these students and teachers convinced me to make Quincy work in a network, and give users more control over where to find source files and where the compiler writes compiled files.

An oft-repeated request was to have the editor highlight syntax -- keywords and comments -- with different text colors. To that purpose, I added syntax highlighting to the Editor program, and then I worried about it. The Editor is an exercise in using Standard C++ containers, iterators, and algorithms to implement text editing and template abstractions to implement selected block marking and undo/redo. I refer you to the columns in the beginning months of 1999 for discussions of those features.

The most elegant programming solutions are not always the most efficient, unless, of course, efficiency is your sole measure of elegance. In my view, elegant code is reliable, reusable, maintainable, extensible, and, most importantly, readable. After that, it can be efficient if possible. Squeezing that last nanosecond out of a tightly written algorithm is pointless if the optimized code has become too abstruse for anyone to understand. Having written editors in the past, I understand that maintaining a text buffer, rendering the text on the screen, and keeping the data and its visual representation in sync can be a time-intensive process. When I wrote text editors for the machines of yore, I employed cryptic optimizations to minimize the cycles consumed by those operations. I worried that adding code to the Editor to scan each displayed line of text for keywords and comments and changing colors accordingly might add too much overhead and make scrolling and paging look jerky. Setting aside my concerns, I got text highlighting working on my trusty old P300. Then, says I, what is the least amount of hardware that anyone ought to expect this Editor and Quincy to run on? The slowest thing I have capable of running Windows 98 is a P120 laptop, which, in this day of 700-MHz machines, is really low end. I figured if the Editor works on that old laptop, it should be acceptable for any target suitable for Win32 development. (This attitude is unlike the Bill Gates model of software development, which is: By the time you get it ready to ship, the hardware will have caught up.) The Editor with syntax highlighting works fine on that old P120.
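The kind of per-line scan described above can be sketched roughly as follows. This is not the Editor's code: the names Kind, Span, and scanline and the short keyword list are assumptions made for illustration, it is written in C++11, and it recognizes only keywords and a trailing // comment (no string literals or block comments).

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical names: Kind, Span, and scanline are not from the Editor source.
enum class Kind { Plain, Keyword, Comment };

struct Span {
    std::size_t begin;   // offset of the first character
    std::size_t length;  // number of characters
    Kind kind;
};

// Report the keywords and a trailing // comment found in one line of text.
// Text that is not reported keeps the default color.
std::vector<Span> scanline(const std::string& line)
{
    static const std::set<std::string> keywords = {
        "char", "else", "for", "if", "int", "return", "while"
    };
    std::vector<Span> spans;
    std::size_t i = 0;
    while (i < line.size()) {
        unsigned char c = static_cast<unsigned char>(line[i]);
        if (c == '/' && i + 1 < line.size() && line[i + 1] == '/') {
            // everything from here to the end of the line is comment
            spans.push_back({i, line.size() - i, Kind::Comment});
            break;
        }
        if (std::isalpha(c) || c == '_') {
            // collect a whole word, then look it up
            std::size_t start = i;
            while (i < line.size() &&
                   (std::isalnum(static_cast<unsigned char>(line[i])) ||
                    line[i] == '_'))
                ++i;
            if (keywords.count(line.substr(start, i - start)))
                spans.push_back({start, i - start, Kind::Keyword});
        }
        else
            ++i;
    }
    return spans;
}
```

A renderer would call scanline once for each visible line and switch text colors at the reported offsets, so the cost grows with the number of lines on the screen rather than with the size of the file.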

A Resurrected Script

Never mind going back only one year to revisit a project, how about going back 10 years? In the May 1989 issue, I described a homebrew C variant named "S," which is a script language I designed for applications that need scripts. This was in the days before JavaScript and VBA. I implemented an S interpreter and provided a shell program that used the interpreter to run source code programs from the command line. I designed the interpreter to be reusable; an application would provide a shell process and some intrinsic functions that users of scripts could call. The following month, I integrated the interpreter into a communications program named "Smallcom," which was an ongoing project for the column at that time. The application and the interpreter were written in C and ran in MS-DOS text mode from the command line.

The Editor program needs a scripting tool to implement complex macros. It already has a simple keystroke macro recorder, which works well for repetitive editing tasks but cannot implement features such as smart indenting, brace matching, and such. Originally, I put hooks into the Editor so that users, who are presumably C++ programmers, can implement such features by extending the program in source code and recompiling it; the script language was C++ itself. Something about that approach stuck in my craw. Real program editors have real script facilities.

I retrieved the old S interpreter code to see if it could serve as a script interpreter for the Editor program. It looked as though it could, but I wrote it in C and used many archaic C idioms that contemporary C++ programmers, or at least this C++ programmer, find cumbersome. So I rewrote the interpreter in C++, and you can download it. The package includes a shell program that tests the interpreter by loading and executing text source code files written in the S language; see "Resource Center," page 7. I haven't integrated the interpreter into the Editor yet, but that's next. If you have any of the Dr. Dobb's CD releases, you might want to compare this month's C++ version with the C version from 10 years ago.

The S Programming Language

S is a small variant of C that implements functions, local and global variables, literal constants, and three data types: char, int, and string. S supports for, while, if, and else. It has no preprocessor. An S script, like a C program, must have a main function to get things started. The shell that drives the interpreter provides intrinsic functions that S scripts can call. These functions may return any of the data types and may accept any of the data types as arguments.

When S becomes the Editor's script language, the Editor will provide intrinsic functions that return a string of text from a specified line in the text buffer, return the current insertion cursor position, return the range of a selected block, position the insertion cursor, insert text into and delete text from the buffer, do search and replace operations, support generic dialog box data entry, and anything else that I need when I start writing scripts.

Listing One is si.cpp, the shell program that exercises the interpreter. It demonstrates what a shell process must provide to use the interpreter. The shell has to provide two functions that the interpreter uses to get script source code characters to interpret. Those functions are named getsource and ungetsource. The shell's versions of them, at the bottom of si.cpp, simply call Standard C's getc and ungetc on the script file that the shell opens from a command-line argument.
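Because the interpreter reads its input only through getsource and ungetsource, a shell is free to supply script text from somewhere other than a file. The sketch below is not part of the downloadable package. It serves characters from an in-memory std::string, as an editor might when the script lives in a text buffer. The names scripttext, scriptpos, and setscript are assumptions; only the two function signatures come from interp.h.

```cpp
#include <cstddef>
#include <cstdio>    // for EOF
#include <string>

// Hypothetical shell-side state: the script text and a read position.
static std::string scripttext;
static std::size_t scriptpos = 0;

// Hypothetical helper: load a new script and rewind to its start.
void setscript(const std::string& text)
{
    scripttext = text;
    scriptpos = 0;
}

// ----- functions that the interpreter requires (signatures from interp.h)
int getsource()
{
    if (scriptpos >= scripttext.size())
        return EOF;
    return static_cast<unsigned char>(scripttext[scriptpos++]);
}

void ungetsource(int c)
{
    // Like ungetc, pushing back EOF does nothing. This sketch assumes the
    // caller pushes back only the character it just read, so stepping the
    // position back by one is enough.
    if (c != EOF && scriptpos > 0)
        --scriptpos;
}
```

The interpreter itself would not need to change to read from a buffer instead of a file, which is the point of routing all input through these two functions.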

The shell also provides a table of Intrinsic objects that describe the shell's intrinsic functions to the interpreter. The example shell in si.cpp provides four intrinsic functions -- printf, getchar, putchar, and getversion -- to illustrate how the interpreter interacts with intrinsic functions. The table provides a string with the function's return type and name and the address of the shell function to call. The casts in the intrinsic table initializers coerce the functions into having the same signature for purposes of initializing the array.

The interpreter passes all arguments to intrinsic functions as a pointer to an array of int variables. For string arguments the corresponding entry in the array is really a pointer to a null-terminated character array. Arguments of type int and char are passed as ints. When an intrinsic function returns a string to the interpreter, as the getversion function does, it passes a char* value. The interpreter copies the string text into its own memory, so the intrinsic function's copy can be safely discarded.
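The calling convention described above can be sketched in a self-contained way. The code below is an illustration, not an excerpt from interp.cpp. It widens each argument slot to std::intptr_t so that a pointer fits on 64-bit targets (the listing in this article uses int, which assumes that a pointer fits in an int), and the names argslot, intrinsicfn, Entry, istrlen, imax, and callintrinsic are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

typedef std::intptr_t argslot;             // assumption: holds an int or a pointer
typedef argslot (*intrinsicfn)(argslot*);  // every intrinsic has this signature

// String argument: the slot holds a pointer to a null-terminated array.
argslot istrlen(argslot* p)
{
    return static_cast<argslot>(
        std::strlen(reinterpret_cast<const char*>(p[0])));
}

// Integer arguments: the slots hold the values themselves.
argslot imax(argslot* p)
{
    return p[0] > p[1] ? p[0] : p[1];
}

struct Entry {
    const char* signature;   // return type and name, as in the article's table
    intrinsicfn fn;
};

static const Entry table[] = {
    { "int strlen", istrlen },
    { "int max",    imax    },
    { "",           0       }    // an empty signature terminates the table
};

// Find a function by name (the word after the return type) and call it.
// Returns false if no table entry matches.
bool callintrinsic(const std::string& name, argslot* args, argslot& result)
{
    for (const Entry* e = table; e->signature[0] != '\0'; ++e) {
        std::string sig(e->signature);
        if (sig.substr(sig.find(' ') + 1) == name) {
            result = e->fn(args);
            return true;
        }
    }
    return false;
}
```

Giving every intrinsic the same signature is a design choice of this sketch. It trades the convenience of writing natural parameter lists for a table that needs no casts in its initializers.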

As the interpreter compiles the source code into bytecode and, later, as it interprets the bytecode, it does some syntax checking. If it finds an error, the interpreter throws an exception of type SIException with a code that identifies what is wrong, a (possibly empty) std::string object with some text that expands on the error, and the line number in the source code where the error was found. The example shell translates the error into a message to display on the console.
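A shell's error reporting can be sketched as follows. The block repeats a cut-down copy of SIException and the first few error codes from Listing Two so that it stands alone. The runscript stub and the formaterror helper are hypothetical; runscript stands in for a real call to the interpreter. The handler catches by const reference, which avoids copying the exception object.

```cpp
#include <sstream>
#include <string>

// Cut-down copies of declarations from Listing Two.
enum errs { EARLYEOF, UNRECOGNIZED, DUPL_DECLARE, UNDECLARED, SYNTAX };

class SIException {
public:
    errs ercode;
    int lineno;
    std::string msg;
    SIException(errs er = SYNTAX, int lno = 0, std::string m = std::string())
        : ercode(er), lineno(lno), msg(m)
        {  }
};

// Message text indexed by error code, in the same order as the enum.
static const char* const erm[] = {
    "Unexpected end of file", "Unrecognized", "Duplicate ident",
    "Undeclared ident",       "Syntax Error"
};

// Hypothetical stand-in for si.interpret(): always reports an undeclared name.
void runscript()
{
    throw SIException(UNDECLARED, 12, "count");
}

// Hypothetical helper: run the script and return an error message,
// or an empty string if the script ran without error.
std::string formaterror()
{
    try {
        runscript();
    }
    catch (const SIException& ex) {
        std::ostringstream os;
        os << erm[ex.ercode] << ' ' << ex.msg << " on line " << ex.lineno;
        return os.str();
    }
    return std::string();
}
```

Returning the message as a string rather than printing it lets a GUI shell such as an editor show the text in a dialog or status bar instead of on the console.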

Listing Two is interp.h, the header file that an application shell includes to use the interpreter. The application instantiates an object of type SInterpreter, with the address of the array of Intrinsic objects as an initializer. The application then calls the interpreter's interpret function, which reads the source code and runs the script.

Observe that both si.cpp and interp.h have namespace statements that are commented out. This is due to a bug in the Visual C++ 5.0 STL container templates or in the compiler, I don't know which. Instantiating STL containers parameterized on types that have scope qualifiers causes the compiler to issue an error that the type name is not known. I can uncomment the statements, and the gcc compiler compiles the programs without error. For the same reason, interp.h declares the Intrinsic, Token, Datum, and Symbol classes outside the SInterpreter class. Those classes really ought to be inside SInterpreter or in a namespace, but VC++ 5.0 won't permit it. Because Editor is still an MFC application, I have to use VC++ to compile it, and I must keep the interpreter compatible with VC++.

The interpreter is implemented in interp.cpp (available electronically). It uses typical interpreter logic beginning with a lexical scan that converts the source code into bytecode. Then it interprets the bytecode to run the program with a recursive descent parser. For now, the bytecode retains the text identifiers, and the interpreter searches the symbol table every time it encounters an identifier to resolve and dereference. I hope later to replace that logic by building a table of identifiers and using offset tokens in the bytecode instead of string identifiers. I also plan to replace the interpreting recursive descent parser by compiling expressions into a postfix stack architecture to improve performance. Eventually I can differentiate between source-code scripts and compiled scripts. Then maybe a debugger. Too much code, too little time.
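The recursive descent approach can be illustrated with a small stand-alone evaluator. This is a sketch rather than the code in interp.cpp: it parses characters directly instead of bytecode tokens, and it handles only integers, parentheses, unary minus, and the four arithmetic operators. It borrows the function names primary, mult, and plus from the declarations in Listing Two. The Parser class and the evaluate function are assumptions.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

class Parser {
    std::string src;
    std::size_t pos;

    // Skip blanks; return the next character without consuming it
    // ('\0' at the end of the text).
    char peek()
    {
        while (pos < src.size() && src[pos] == ' ')
            ++pos;
        return pos < src.size() ? src[pos] : '\0';
    }
    // primary: number | '(' plus ')' | '-' primary
    int primary()
    {
        char c = peek();
        if (c == '(') {
            ++pos;
            int v = plus();
            if (peek() != ')')
                throw std::runtime_error("Unmatched ()");
            ++pos;
            return v;
        }
        if (c == '-') {
            ++pos;
            return -primary();
        }
        if (!std::isdigit(static_cast<unsigned char>(c)))
            throw std::runtime_error("Syntax Error");
        int v = 0;
        while (pos < src.size() &&
               std::isdigit(static_cast<unsigned char>(src[pos])))
            v = v * 10 + (src[pos++] - '0');
        return v;
    }
    // mult: primary { ('*' | '/') primary }
    int mult()
    {
        int v = primary();
        for (;;) {
            char c = peek();
            if (c != '*' && c != '/')
                return v;
            ++pos;
            int rhs = primary();
            if (c == '*')
                v *= rhs;
            else {
                if (rhs == 0)
                    throw std::runtime_error("Divide by zero");
                v /= rhs;
            }
        }
    }
    // plus: mult { ('+' | '-') mult }
    int plus()
    {
        int v = mult();
        for (;;) {
            char c = peek();
            if (c != '+' && c != '-')
                return v;
            ++pos;
            int rhs = mult();
            v = (c == '+') ? v + rhs : v - rhs;
        }
    }
public:
    explicit Parser(const std::string& s) : src(s), pos(0)
        {  }
    // Evaluate the whole text; trailing characters are an error.
    int expression()
    {
        int v = plus();
        if (peek() != '\0')
            throw std::runtime_error("Syntax Error");
        return v;
    }
};

int evaluate(const std::string& text)
{
    return Parser(text).expression();
}
```

Each level of the grammar calls the level with the next-higher precedence, so operator precedence falls out of the call structure: evaluate("2 + 3 * 4") multiplies before it adds.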

Abstraction With #define

Some C++ programmers don't like the preprocessor -- and for good reason. The language keeps taking on features that don't get along well with the preprocessor. An example is namespaces. The preprocessor does its #define translations without considering namespaces. Consequently, any macro a program #defines, irrespective of namespaces, has the potential for colliding with things that are otherwise properly protected by namespaces. That's why they tell you not to use underscore prefixes on your identifiers, particularly with your #define macros. Identifiers that begin with one underscore followed by an uppercase letter or with two underscores are reserved for the language implementation, and if your macro should happen to collide with something in a standard header, it could cause all kinds of trouble.
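A contrived sketch shows the problem. All of the names here (stats, count, total, and samples) are made up for the illustration. The macro is defined after the namespace, as it might be by a header included later, and from then on the qualified name stats::count no longer means what it says.

```cpp
namespace stats {
    int total = 100;
    int count = 3;
}

// A macro defined later, perhaps in an unrelated header.
// The preprocessor knows nothing about namespaces.
#define count total

// The programmer wrote stats::count, but the compiler sees stats::total.
int samples()
{
    return stats::count;
}

#undef count    // limit the damage to this example
```

The code compiles without a diagnostic, and samples returns the value of stats::total. Had the macro expanded to something that is not an identifier, the result would have been a compile error on a line that looks perfectly correct.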

Unlike many of my colleagues, I kind of like #define. Observe how I used it in interp.h to define an abstraction of the overloaded operators for the Datum class. I can hear the horrified gasps of disapproval already. Abstraction with #define? How simply awful!

The Datum class represents objects of the S language types. Each Datum object is either a string, char, or int. The class overloads the S language operators so the interpreter can perform those operations on objects of the types declared in the scripts. The code to overload most of the arithmetic operators is the same with the exception of the operator itself. Same with the relational operators and the unary operators. The only mechanism in C++ for passing an operator as an argument to a function is provided by the function-like macros of the #define preprocessor directive. There is no other way to do it. I wrote UNARY, LOGICAL, ARITHMETIC, and RELATIONAL macros to form abstractions of reusable code for overloaded operators. Observe those macros and how the class calls them. Later, if I want to add more operators to S (it doesn't support bitwise logical operators, or the += and -= arithmetic operator formats, for example), I simply add another macro call statement to the class and put the code in the interpreter to use the operator.
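Stripped to its essentials, the technique looks like this. The Number class is a hypothetical stand-in for Datum that holds only an int, and the macro bodies are simplified from those in Listing Two (no type checking). The last macro call in the class shows how a bitwise operator, which S does not support, could be added with a single line.

```cpp
// Simplified from Listing Two; Number is a hypothetical stand-in for Datum.
#define ARITHMETIC(op)                                  \
Number operator op (const Number& n) const              \
{                                                       \
    return Number(value op n.value);                    \
}
#define RELATIONAL(op)                                  \
bool operator op (const Number& n) const                \
{                                                       \
    return value op n.value;                            \
}

class Number {
public:
    int value;
    explicit Number(int v = 0) : value(v)
        {  }
    ARITHMETIC(+)
    ARITHMETIC(-)
    ARITHMETIC(*)
    RELATIONAL(<)
    RELATIONAL(==)
    ARITHMETIC(&)   // bitwise AND, added with one macro call
};
```

Each macro call expands into a complete member function with the operator token pasted into both the function name and the expression in its body. A compound operator such as += would need a different macro, because these member functions are const and return a new object.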

If that kind of programming offends some sensibilities, so be it. I think it's kind of elegant, myself.

DDJ

Listing One

#include <stdio.h>
#include "interp.h"
// ----- intrinsic functions
int iprntf(int* p)           //   printf   
{
    printf(reinterpret_cast<char*>(p[0]),p[1],p[2],p[3],p[4]);
    return 0;
}
int igtch()                 //  getchar
{
    return getchar();
}
int iptch(int* c)           //  putchar   
{
    return putchar(*c);
}
const char* getver()        // return a string (a literal, so const)
{
    return "Version 1.0";
}
Intrinsic funcs[] = {
    Intrinsic("int printf",       reinterpret_cast<ifunc>(iprntf)),
    Intrinsic("int getchar",      reinterpret_cast<ifunc>(igtch) ),
    Intrinsic("int putchar",      reinterpret_cast<ifunc>(iptch) ),
    Intrinsic("string getversion",reinterpret_cast<ifunc>(getver)),
    Intrinsic("",        0)
};
// ---------- error messages 
const char *erm[]={  "Unexpected end of file", "Unrecognized",
               "Duplicate ident",        "Undeclared ident",
               "Syntax Error",           "Unmatched {}",
               "Unmatched ()",           "Missing",
               "Not a function",         "Misplaced break",
               "Out of place",           "Not an identifier",
               "Mismatched arguments",   "Divide by zero",
               "Invalid constant",       "No main function"
};
static FILE *fp;
int main(int argc, char *argv[])
{
    if (argc == 2)  {
        if ((fp = fopen(argv[1], "r")) != 0) {
            try {
                SInterpreter si(funcs);
                si.interpret();
            }
            catch (const SIException& sex) {    // catch by reference, no copy
                printf("\n%s %s on line %d\n",erm[sex.ercode], 
                            sex.msg.c_str(), sex.lineno);
            }
            fclose(fp);
        }
    }
    return 0;
}
// ----- functions that the interpreter requires
int getsource(void)     {   return getc(fp);    }
void ungetsource(int c) {   ungetc(c, fp);      }


Listing Two

// ---------------- interp.h --------------------
#include <ctype.h>   // isalpha, isdigit (used by SInterpreter::isidentchar)
#include <vector>
#include <string>

// namespace DDJScriptInterpreter   {

// ----------- error codes
enum errs { EARLYEOF,           UNRECOGNIZED,
            DUPL_DECLARE,       UNDECLARED,
            SYNTAX,             BRACERR,
            PARENERR,           MISSING,
            NOTFUNC,            BREAKERR,
            OUTOFPLACE,         NOTIDENT,
            MISMATCHEDARG,      DIVIDEERR,
            INVALIDCONSTANT,    NOMAIN     };
class SIException {
public:
  errs ercode;
  int lineno;
  std::string msg;
  SIException(errs er = SYNTAX, int lno = 0, std::string m = std::string()) : 
                ercode(er), lineno(lno), msg(m)
        {  }
};
typedef int(*ifunc)(void*);
// --- intrinsic function table (provided by shell application)
class Intrinsic {
public:
    std::string signature;
    ifunc fn;
    Intrinsic(const std::string& sig = std::string(), ifunc f = 0) : 
                signature(sig), fn(f)
        {  }
};
typedef short int token;
enum DatumType { unknown, number, strng };
#define UNARY(op)                                                       \
Datum operator op () const                                              \
{                                                                       \
    nostring();                                                         \
    return Datum(op value);                                             \
}
#define RELATIONAL(op)                                                  \
bool operator op (const Datum& d) const                                 \
{                                                                       \
    sametype(d);                                                        \
    return (type == strng) ? (strval op d.strval) : (value op d.value); \
}
#define ARITHMETIC(op)                                                  \
Datum operator op (const Datum& d) const                                \
{                                                                       \
    nostring();                                                         \
    d.nostring();                                                       \
    return Datum(value op d.value);                                     \
}
#define LOGICAL(op)                                                     \
bool operator op (const Datum& d) const                                 \
{                                                                       \
    nostring();                                                         \
    d.nostring();                                                       \
    return value op d.value;                                            \
}
class Datum {
    void nostring() const
        { if (type == strng) throw SIException(); }
    void sametype(const Datum& d) const
        { if (type != d.type) throw SIException(); }
public:
    DatumType type;
    int value;          // number value
    std::string strval; // string value
    Datum() : type(unknown), value(0)
        {  }
    explicit Datum(int val) : type(number), value(val)
        {  }
    explicit Datum(std::string str) : type(strng), value(0), strval(str)
        {  }
    Datum& operator=(const Datum& d)
        { type = d.type; value = d.value; strval = d.strval; return *this; }
    Datum operator+(const Datum& d) const
    {
        sametype(d);
        if (type == strng)
            return Datum(strval + d.strval);    // concatenate strings
        return Datum(value + d.value);          // sum numbers
    }
    bool operator!() const
    {
        nostring();
        return !value;
    }
    UNARY(-)
    ARITHMETIC(*)
    ARITHMETIC(/)
    ARITHMETIC(-)
    RELATIONAL(<=)
    RELATIONAL(>=)
    RELATIONAL(!=)
    RELATIONAL(==)
    RELATIONAL(<)
    RELATIONAL(>)
    LOGICAL(&&)
    LOGICAL(||)
};
class Token {
public:
    token tok;
    Datum datum;
    int tokennumber;
    Token(token t = 0) : tok(t)
        {  }
    bool operator<(const Token& t) const
        { return tok < t.tok; }
    bool operator==(const Token& t) const
        { return tok == t.tok; }
    Token& operator=(const Token& t)
        { tok = t.tok; datum = t.datum; return *this; }  
};
typedef std::vector<Token>          token_buffer;
typedef token_buffer::iterator      token_iter;
enum SymbolType { none, variable, ifunction, pfunction };
class Symbol {
public:
    SymbolType type;
    std::string name;
    Datum datum;
    int entry;          // subscript to function's first entry in token buffer
    ifunc fn;           // points to intrinsic function
    Symbol(SymbolType ty = none, const std::string nm = std::string() ) : 
            type(ty), name(nm), entry(0), fn(0)
        {  }
    bool operator<(const Symbol& s) const
        { return name < s.name; }
    bool operator==(const Symbol& s) const
        { return name == s.name; }
    Symbol& operator=(const Symbol& s)
        { type = s.type; name = s.name; datum = s.datum; 
                             entry = s.entry; fn = s.fn; return *this; }
};
typedef std::vector<Symbol>         symbol_table;
typedef symbol_table::iterator      symbol_iter;
class SInterpreter  {
    token_iter tokiter; // iterates the token buffer during interpreting
private:
    class Keyword {
    public:
        std::string kw;
        Token kwtoken;
        Keyword(const char* k, Token tk) : kw(k), kwtoken(tk)
            {  }
    };
    symbol_table    symboltable;
    token_buffer    tokens;
    int currentscope;   // index of first symbol table entry for current scope
    static token tokentbl[];
    static Keyword keywords[];
    Datum frtn;         // return value from a function     
    bool breaking, returning;
    int skipping;
    int linenumber;
    bool scanned;       // true when lexical scan is complete
    int LineNumber();   // current source file line number  
    void initialize();  // initialize data variables
    // functions for lexical scan
    void lexicalscan();
    bool declarator(bool islocal, bool isparameter = false);
    void declarators(bool islocal);
    Token compilenextsourcetoken();
    int escseq();
    int getsourcechar();
    int getrawsourcechar();

    // functions for compiling and interpreting program
    Token nexttoken();
    void prevtoken();
    Token needtoken(token tkn);
    Datum function(Symbol sym);
    bool findsymbol(int& ndx, const std::string& name, int fromscope = 0);
    void compound_statement(int scope);
    void statement();
    void outofscope();

    void statements();
    void skip_statements();

    bool istoken(token tkn);
    void skippair(token ltkn, token rtkn);

    Datum primary();
    Datum mult();
    Datum plus();
    Datum le();
    Datum eq();
    Datum and();
    Datum expression();

    bool isidentchar(int c)
    {
        return isalpha(c) || isdigit(c) || c == '_';
    }
    bool iswhite(int c)
    {
        return c == ' ' || c == '\t';
    }
public:
    explicit SInterpreter(const Intrinsic* inf);
    int interpret();
};
// } // namespace DDJScriptInterpreter
// ------ functions provided by the shell
int getsource();
void ungetsource(int ch);
