The authors are researchers at the IBM T.J. Watson Research Center and can be contacted at P.O. Box 704, Yorktown Heights, NY 10598 or at jrhyne@ibm.com. Note: Parts of this article were presented at the X-Window technical conference earlier this year.
About four years ago, we began working on enhancements to the X-Window system to provide a stylus-based user interface for handheld computers. This article focuses on those X11 extensions, specifically those that support stylus-driven applications.
We use the term PaperLike Interface (PLI) to distinguish the emerging generation of notepad computers from those machines that rely on keyboard and mouse interaction. Our group has been researching the technology associated with this new class of machines, and we've built several prototypes that run on AIX and X11.
The specifications for our research machines are a moving target, but our goal is to build a machine with a 640 x 480 display (16 gray levels), under 6 pounds, and comparable to a 32-bit personal computer in speed and storage. Currently, the system software for our prototype consists of AIX and a modified X11, Release 4. The operating system includes TCP/IP, sockets, and NFS, and it is quite feasible to run large, compute-intensive applications on a host machine while running an X server on the notepad prototype.
The distributed nature of X applications is vital to our development plans. One of our sample applications is a cooperative meeting application in which several networked users draw on a shared drawing surface (single client, multiple servers). When we acquire wireless LAN capability early next year, the distributed computing model will be even more important.
The software architecture of the system is partitioned into three areas: application, server, and kernel. X systems use a distributed architecture, with multiple client-side applications communicating over a channel (which can be a local area network) with one or more X servers that provide graphics display and input event handling services. The kernel is the component in which hardware dependencies such as device drivers are contained.
The application layer is itself subdivided into four layers. At the topmost level is code that is purely application specific. This code calls on services provided by the next lower layer, the OSF/Motif widget set. (Widgets are user interface components such as dialogs, list boxes and text-edit fields.) The third level down is the so-called intrinsics layer (Xt) of the X11 toolkit, and finally there is the Xlib library of primitives that implement the X client/server protocol.
Implementing our PLI system required modifications to all these areas of the system.
PLI applications are built using an extended version of the OSF/Motif widget set. We've added new widgets to this set, and these widgets connect with an X11 server that has been modified to support an extended protocol. Dispatching stroke events to widgets required modifying the Xt intrinsics layer of the X11 toolkit. And, of course, supporting stylus-oriented interaction required modifications to digitizer device drivers.
We'll describe the modifications to each of these layers, in turn, starting with the OSF/Motif widgets.
A principal new widget we created is called the WritingArea widget. This widget receives strokes from the server and invokes application-supplied callback functions. It is subclassed from the Motif DrawingArea widget and uses that widget's exposure callback and other resources.
The WritingArea widget is basically a primitive stroke-receiver widget combined with replaceable behavior modules invoked as callbacks. Callbacks are provided for stroke receipt and acceptance, for stroke processing, and for exposure events. The stroke receipt callback decides whether to accept or reject the stroke. If the stroke is accepted, the stroke processing callback is invoked with the array of coordinates comprising the stroke. The exposure callback is invoked whenever the server determines that part of the widget's window needs to be redisplayed by the application. The widget maintains a list of active strokes and redisplays them after the application's exposure callback has completed. This widget may also be configured so that it does not store or display accepted strokes. This configuration is useful for applications which will store the strokes and redisplay them during exposure callback processing.
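The callback structure just described can be sketched in plain C. The type and function names here (WritingArea, StrokeReceiptCB, and so on) are illustrative stand-ins, not the actual Motif widget interface:

```c
/* Illustrative sketch of the WritingArea callback model; these names
   are stand-ins, not the actual Motif widget API. */
typedef struct { int x, y; } Point;
typedef struct { Point *pts; int npts; } Stroke;

/* Return nonzero to accept the stroke, zero to reject it. */
typedef int  (*StrokeReceiptCB)(const Stroke *s);
/* Invoked with the coordinate array of an accepted stroke. */
typedef void (*StrokeProcessCB)(const Stroke *s);

typedef struct {
    StrokeReceiptCB on_receipt;
    StrokeProcessCB on_process;
} WritingArea;

/* Mirror of the widget's dispatch: the receipt callback decides
   acceptance; only accepted strokes reach the processing callback. */
static void writing_area_dispatch(WritingArea *wa, const Stroke *s)
{
    if (wa->on_receipt && wa->on_receipt(s) && wa->on_process)
        wa->on_process(s);
}

/* Example policies a particular application might install. */
static int processed_points = 0;
static int accept_if_nonempty(const Stroke *s) { return s->npts >= 2; }
static void count_points(const Stroke *s) { processed_points = s->npts; }
```

The replaceable-module design means an application changes behavior simply by installing different callback functions, without subclassing the widget.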
Exposure event processing follows a typical sequence in which application graphics are generated, followed by the display of writing baselines when appropriate. Library procedures are provided for baseline generation; they accommodate user-specific parameters such as horizontal and vertical spacing and the presence or absence of horizontal segmentation guides (tick marks, for example).
The WritingReco widget provides resources for configuring the recognizers, in addition to those provided by the WritingArea widget. The WritingReco widget relies on services provided by the Recognition/Presentation Toolkit, which is currently being constructed. The various components are shown in Figure 1.
The Recognition/Presentation toolkit supplies the callback routines needed by the WritingReco widget. It also simplifies the programming interface to the recognizers, by providing a consistent user interface to recognition-related services such as error correction, prototype, and recognizer management. In addition, it provides a library of reusable functions for recognition and recognition-related services which would otherwise have to be written by each application developer.
In the PLI interface, the error correction paradigm is that the user selects an erroneous displayed symbol by touching it with the pen, both to replace it with the correct symbol and to correct the recognizer. Error correction is therefore a special mode in which the toolkit receives and interprets strokes, rather than passing them to the recognizers and the application.
A possible design for error correction has an error correction button placed on the title line of the window border. Touching this button places the toolkit in error correction mode. When the user touches a displayed symbol, the touch stroke location is used to select the corresponding symbol from the recognition results.
One of the possible error correction styles is activated; for example, the next symbol from the set of possibilities might be displayed. The user exits the error correction mode by again touching the error correction button. The application designer or user selects an error correction style for each of the application's recognition objects by defining resource values in the usual way.
Other functions, such as adjustment of recognition parameters or training to introduce a new symbol, are accessed by touching another button in the title bar, then touching anywhere in a WritingArea widget's window. A pop-down menu appears, from which the user selects the desired function. Subsequently, a recognizer control panel may appear, or a training window. When the user dismisses these windows, the toolkit exits the special mode and the application resumes normal behavior.
Implementation of these functions is complicated because an application main window may contain several WritingReco widgets. Each one is associated with an instance of the recognition object which contains recent recognition results, strokes, and result display regions, as well as the parameters for recognizing strokes received in the widget's window.
A form for data entry, for example, may be composed of several WritingReco widgets and their associated recognition toolkit instances. A particular widget/toolkit pair might select a recognition vocabulary of numbers, if only entry of numbers is allowed. This sort of restriction is valuable because recognition accuracy and speed are improved, and the user is alerted to entry errors by the display of special symbols where the recognizer is unable to find a suitable match. For example, an "A" entered by the user in a numeric entry field might appear displayed as a "?".
Touching one of the recognition function buttons causes a global variable to be set, which is checked by each recognition object. A stroke received while the variable is set will be routed to the corresponding toolkit function rather than being sent for recognition.
The X11R4 protocol extension for PLI consists of a stroke event and seven requests.
The stroke event has several subcases identified by the detail byte. These subcases include: the start of a stroke, motion during a stroke, the end of a stroke, and proximity (which occurs when the pen position is detectable but the pen is not touching the display surface).
To help the application determine whether to accept the stroke or request the stroke path, the stroke event contains the starting and ending coordinates of the stroke and the maximum and minimum values for X and Y. It also contains a set of flags which indicate whether the start and end points are inside or outside the window. These flags were selected because the corresponding tests were frequently used in previous prototype applications to determine stroke acceptance.
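A C sketch of these fixed-size event fields, and of one quick acceptance test they enable, follows. The structure layout and names are assumptions for illustration, not the extension's actual wire format:

```c
/* Illustrative layout of the fixed-size stroke event; field names and
   packing are assumptions, not the extension's actual wire format. */
typedef struct {
    short start_x, start_y;  /* first point of the stroke  */
    short end_x,   end_y;    /* last point of the stroke   */
    short min_x,   min_y;    /* bounding box of the path   */
    short max_x,   max_y;
    unsigned flags;          /* START_INSIDE / END_INSIDE  */
} StrokeEvent;

#define START_INSIDE 0x1u
#define END_INSIDE   0x2u

/* One acceptance test an application can make without fetching the
   full path: does the whole bounding box lie inside the window
   rectangle (wx, wy, ww, wh)? */
static int bbox_inside_window(const StrokeEvent *ev,
                              int wx, int wy, int ww, int wh)
{
    return ev->min_x >= wx && ev->min_y >= wy &&
           ev->max_x < wx + ww && ev->max_y < wy + wh;
}
```

Because the bounding box and endpoint flags arrive in the event itself, many strokes can be accepted or rejected without the round trip needed to fetch the full coordinate path.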
The stroke event structure is of fixed size, and thus cannot contain the sequence of coordinates generated by the digitizer. To obtain these coordinates, an application makes a request which returns a variable-length data structure. This same request also converts the coordinates from the screen-relative form retained by the server to a window-relative form.
Using another kind of request, applications can accept or reject a stroke. The stroke event contains a server-generated ID used to identify the stroke to be accepted or rejected. The protocol requires that each stroke eventually be accepted or rejected by the applications that see it. When this condition is met, the server will erase the stroke ink and delete the stroke from its queue. The protocol allows strokes to be forced from the server queue, and this may be needed when a client hangs without accepting or rejecting some strokes. Strokes are automatically accepted for a client which dies; to reject them might lead to creation of unwanted pointer events.
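The server's bookkeeping for this accept/reject protocol can be sketched as follows; the structure and names are hypothetical, not the server's actual code:

```c
/* Hypothetical server bookkeeping for the accept/reject protocol:
   each queued stroke carries the count of clients that saw it and
   have not yet replied; when the count reaches zero, the server can
   erase the ink and drop the stroke from its queue. */
typedef struct {
    unsigned id;   /* server-generated stroke ID carried in the event */
    int pending;   /* clients that have not yet accepted or rejected  */
    int erased;    /* set once every reply (or forced removal) is in  */
} QueuedStroke;

/* Called on each accept or reject reply; acceptance and rejection
   are bookkept identically here, though routing reacts differently. */
static void stroke_reply(QueuedStroke *s)
{
    if (s->pending > 0 && --s->pending == 0)
        s->erased = 1;   /* deink and dequeue */
}
```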
Stroke replies contain scaled coordinates rather than pixel coordinates (see the discussion in the "Device Driver" section for details) and cannot be drawn using the XDrawLine library function. To simplify application programming, the extension provides an XDrawStroke function and protocol request with similar parameters. The server converts the stroke coordinates and invokes the line-drawing procedure.
There is also a request which allows a client to request realignment of the digitizer and the display. The client that performs the function is typically invoked from the window manager's menu.
Another similar request allows a client to set the pointer button being emulated by the stylus. This is not set from the window manager menu, but from a small icon permanently displayed on the screen. There is a request to enqueue a stroke, which is used to help debug the server and the toolkits. Finally, applications can query the server for details about the display and digitizer capabilities by using yet another request.
The stroke processing functions of the X11 server have been grouped into a server extension, with a corresponding extension to the X11 protocol. The design of these functions is somewhat surprising, as a stylus is neither a keyboard nor a mouse, but may be called upon to emulate either.
Experiments with our early prototypes led to the following observations:
We observed that users tended to work in a particular window, and this suggested routing strokes to a particular window until that window's application rejected a stroke. When the server receives a stroke rejection, it selects another candidate window for the rejected stroke and all that follow it. This routing scheme permits an application to capture handwriting which runs outside of window boundaries. It also permits an application to recognize a stroke before deciding whether to reject or accept it. However, this algorithm has the property that a misbehaving client can cause all strokes to be routed to it and defeat pointer emulation. When this happens, the server becomes useless until the client is killed by some external means (such as telneting in from another workstation).
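This routing rule can be sketched as a small state machine; the sketch is an illustrative reconstruction, not the server's actual code:

```c
/* Illustrative reconstruction of the routing rule: strokes continue to
   go to the current stroke window until its client rejects one; the
   rejected stroke and all that follow go to a new candidate. */
typedef struct { int current_window; } StrokeRouter;

/* next_candidate is whichever window the server would select for a
   rejected stroke (for example, the stroke window under the pen). */
static int route_stroke(StrokeRouter *r, int rejected, int next_candidate)
{
    if (rejected)
        r->current_window = next_candidate;
    return r->current_window;
}
```

The sticky choice of window is what lets handwriting spill over a boundary and still reach the application the user is working with.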
Alternative solutions considered were: moving the recognition function to the server and using recognition results to assist in the routing decision, or routing strokes to all windows at the same time and letting them decide whether to accept or reject the stroke. Moving the recognizer seemed infeasible because each application requires a distinct symbol set and applies differing criteria to weight recognition results. In addition, the interface to the recognition software is quite complicated. We may revisit this decision in the future, as we better understand the requirements for recognition and its software architecture. At first glance, routing strokes to all clients at the same time seems an invitation to chaos. However, applications may be designed with this behavior in mind and should agree on a unique recipient virtually all of the time.
There are several cases to consider. In the first case, the stroke falls entirely within a nonstroke window and is handled by pointer emulation, described below.
In the second case, the stroke lies entirely within the window, so there is only one routing candidate.
In the third case, in which the stroke is partly outside the stroke window, there are two variations, depending on whether the other candidate window is a stroke or nonstroke window.
If it is a stroke window, the acceptance/rejection test is based on where the salient point of the gesture or character falls. If the primary application recognizes the stroke and its salient point falls inside the window, the application accepts the stroke. The other application may also recognize the stroke, but finds that the salient point falls outside the visible region of its window, and so rejects it. If the other application is not performing recognition, it should reject any stroke which lies partially outside the visible region of its window. If neither window is performing recognition, both will reject the stroke and it will disappear. We hope the user will find this response reasonably intuitive and will then make the stroke again within the proper boundaries.
If the stroke falls partly outside the stroke window onto a nonstroke window, the stroke is not turned into a pointer event unless there are no stroke candidates, or all stroke candidates have rejected the stroke. Therefore, the stroke window will see the stroke events, but the nonstroke window will not see pointer events unless the stroke window rejects the stroke. A misbehaving stroke application can prevent a stroke that enters its window from being turned into pointer events. The user can make the stroke again, avoiding the window of the misbehaving application, if pointer emulation was intended. The stroke remains on the display until all candidates have accepted or rejected it. The user expected the stroke to disappear (as a result of pointer emulation), and its failure to disappear is a clue that an application is misbehaving.
The fourth and last case is one in which the user drags the stylus as if it were a pointer. This case is difficult because the pointer emulation decision must occur at the start of the stroke. In the meantime, the motion of the stylus may cross several windows (which can be either stroke or nonstroke windows).
What will likely trouble the user is that the drag echo won't occur until the user has lifted the stylus; this is not what is expected.
Special handling is necessary here. Without it, most users performing a drag would quickly discover that the button-down event appeared at the wrong position and that they had missed the target they were trying to hit; the problem is especially pronounced when dragging a window border to resize it, because the borders are narrow. Our handling exploits the observation that the mouse is held essentially still at the start of a drag (and so is the stylus): if the start of the stroke lies in a nonstroke window, and the stylus remains relatively stationary for a brief period (for example, 100 msec), then the stroke is converted to a series of pointer events and never routed as a stroke.
X11 allows applications to indicate interest in getting reports of various kinds of events which occur in each of their windows. We extended this mechanism to stroke events, and used it to trigger pointer emulation. If a window is tagged for pointer events, but not for stroke events, then a stroke which would be routed to this window is converted into pointer events.
The conversion is a natural one: The stroke start becomes a button-down, the stroke end becomes a button-up, and the intermediate reports become pointer motions. The stylus thus naturally mimics the mouse, and experienced mouse users rarely make mistakes in employing the stylus. The stylus leaves an ink trail in this mode and although this is initially noticeable, for instance while moving or resizing a window, it does not impede the user and none of our subjects has asked us to eliminate it. The server deinks strokes as soon as it determines that pointer emulation is active, and the ink is usually gone within a fraction of a second.
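The conversion can be sketched as follows; the event representation is simplified and the names are illustrative, not actual X protocol structures:

```c
/* Simplified pointer-emulation mapping: first sample -> button press,
   last sample -> button release, everything between -> motion. The
   event representation is illustrative, not an X protocol structure. */
enum PtrEvType { BUTTON_PRESS, MOTION, BUTTON_RELEASE };
typedef struct { enum PtrEvType type; int x, y; } PtrEvent;

/* Convert n stroke samples (n >= 2) into n pointer events. */
static int emulate_pointer(int (*pts)[2], int n, PtrEvent *out)
{
    for (int i = 0; i < n; i++) {
        out[i].x = pts[i][0];
        out[i].y = pts[i][1];
        out[i].type = (i == 0)     ? BUTTON_PRESS
                    : (i == n - 1) ? BUTTON_RELEASE
                    :                MOTION;
    }
    return n;
}
```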
We currently provide multiple-button support via a small icon which the user may touch to select the button being emulated. This provides the needed function, but encourages frequent user errors because users forget to restore the original button setting.
This kernel component manages the hardware interface to the digitizer, generates ink on the display, and provides a standard interface to the X11 server. Anticipating frequent changes to digitizer and display hardware as well as the need to support several operating systems, we constructed the PLI driver in three parts:
The device driver is opened by the server. Digitizer reports are then read as a character stream. The server can be notified when data is available; in AIX, the select system call is used. The server may control the behavior of the device driver by writing to it. If supported by the operating system, the device driver may place its data directly in a circular buffer accessible to the server, to avoid the system call overhead and double copying of the data.
When the pen touches the writing surface, the device driver begins to report a stream of coordinates to the server. At the same time, the device driver is generating an ink trace on the display. The stream of coordinates from pen-down to pen-up is called a stroke, and is the primary data unit reported by the device driver. To avoid excessive overhead, the device driver buffers the coordinate stream and occasionally indicates, via select, that data is available for the server. Our current digitizer provides position reports even when the stylus is a small distance above the surface. The device driver does not buffer this data, but periodically reports the current position.
Inking is done in the device driver to provide realtime feedback. The X11 server runs as a single threaded application process and cannot guarantee realtime attention to the device driver. The device driver saves the critical display state, performs its inking, and restores the display state; thus, it can time-share the display with the X11 server. Unfortunately, not all displays are designed so that the state can be saved and restored, and in this case, the X11 server will need to be extensively modified to provide a separate inking thread with locks to control sharing of the display. The server will erase the ink, which eliminates the need for the device driver to buffer potentially large amounts of data in its memory.
The Bresenham line algorithm is used to connect successive digitizer points while the stylus switch is depressed. Because of the high sampling rate of the digitizer, the stylus rarely moves more than one or two pixel positions on the display between samples. The inking process is invoked only when the stylus has moved more than one pixel from the previous sample.
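A sketch of this inking step, using the standard Bresenham algorithm and the one-pixel movement threshold described above; the plot routine stands in for writing to the ink plane:

```c
#include <stdlib.h>

/* Sketch of the inking step: successive digitizer samples are joined
   with Bresenham's line algorithm, and the step is skipped entirely
   when the pen has moved at most one pixel since the last sample. */
static int plotted;                 /* pixels "drawn" (stand-in for the ink plane) */
static void plot(int x, int y) { (void)x; (void)y; plotted++; }

static void bresenham(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

/* Ink from the previous sample only when movement exceeds one pixel. */
static void ink_sample(int px, int py, int x, int y)
{
    if (abs(x - px) > 1 || abs(y - py) > 1)
        bresenham(px, py, x, y);
}
```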
Ink is generated on one of the four planes of the display. The server may freely use the other three planes, providing eight gray levels. The ink plane is combined with the display planes using XOR implemented in the display color map. Other ink-combining functions are possible, but preserving the contrast between ink and application graphics is critical.
There are three coordinate systems to contend with: digitizer coordinates, display screen coordinates, and window-relative coordinates.
The digitizer resolution is typically 2 to 16 times greater than the display resolution, and the digitizer resolution must be preserved for accurate recognition. To generate the ink trace, coordinates must be converted to display screen units. Furthermore, the server and applications want to see stroke information relative to the display screen or to windows on the display screen, and not in some coordinate system provided by the digitizer manufacturer.
The device driver addresses these issues by returning scaled screen coordinates which have been multiplied by a factor of 2, 4, 8, or 16. The subpixel resolution of the digitizer is preserved, and the conversion back to integral pixel coordinates can be done with a right shift.
The device driver uses a simple linear model to convert the digitizer coordinates to scaled display coordinates:
x' = ax + by + c
y' = dx + ey + f
The linear model requires eight parameters and compensates for scale, translation, and rotation between the digitizer and the display coordinate systems introduced when the display and digitizer are joined together. The computation uses integer arithmetic, because floating-point services are not usually available to device drivers.
The coefficients a through f are prescaled to prevent loss of significance during the computation. The resulting coordinates are pixel values scaled to preserve the dynamic range of the digitizer. Currently, we use a scaling factor of 2^2 (that is, 4).
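In C, a fixed-point version of the conversion might look like the following. The choice of 16.16 coefficients is an illustrative assumption; two fractional bits (a factor of 4) are retained in the result, as described above:

```c
#include <stdint.h>

/* Fixed-point sketch of the driver's conversion. Coefficients a..f are
   assumed prescaled into 16.16 fixed point (an illustrative choice);
   the result keeps two fractional bits, i.e. screen coordinates scaled
   by 2^2 = 4. A right shift by 2 recovers integral pixels. */
#define COEF_FRAC 16  /* fractional bits in the prescaled coefficients */
#define OUT_FRAC   2  /* fractional bits kept in the reported result   */

typedef struct { int32_t a, b, c, d, e, f; } Xform;

static int32_t to_scaled_x(const Xform *t, int32_t x, int32_t y)
{
    int64_t acc = (int64_t)t->a * x + (int64_t)t->b * y + t->c;
    return (int32_t)(acc >> (COEF_FRAC - OUT_FRAC));
}

static int32_t to_scaled_y(const Xform *t, int32_t x, int32_t y)
{
    int64_t acc = (int64_t)t->d * x + (int64_t)t->e * y + t->f;
    return (int32_t)(acc >> (COEF_FRAC - OUT_FRAC));
}

static int32_t to_pixel(int32_t scaled) { return scaled >> OUT_FRAC; }
```

The 64-bit accumulator avoids overflow in the products, and the single right shift per coordinate keeps the per-sample cost low, which matters at digitizer sampling rates.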
The eight parameters must be provided by the server, and are written to the device driver during its initialization. Generally, the parameters are obtained by displaying a crosshair at three locations on the display and asking the user to touch each crosshair. The crosshair coordinates and the averaged digitizer coordinates fully determine six parameters of the conversion function. The other two parameters are fixed at design time by the dynamic range of the digitizer and the resolution ratio between the digitizer and the display. One writes a command to the device driver to turn off the inking and set up the unity conversion function, and the driver subsequently reports the raw digitizer coordinates. After the six parameters are computed, they are written to the device driver and inking is restored.
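The six coefficients can be computed from the three crosshair correspondences by solving two 3x3 linear systems, for instance with Cramer's rule. This sketch runs in the server, where floating point is available; the function names are illustrative:

```c
#include <string.h>

/* Sketch of the calibration computation: three crosshair positions and
   the averaged digitizer readings at each determine the six
   coefficients. Two 3x3 systems are solved with Cramer's rule; names
   are illustrative. This runs in the server, where floating point is
   available (unlike the device driver). */
typedef struct { double a, b, c, d, e, f; } Calib;

static double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* dig[i]: averaged digitizer reading at crosshair i;
   scr[i]: display coordinate of crosshair i.
   Returns 0 if the crosshairs are collinear (no unique solution). */
static int solve_calibration(double dig[3][2], double scr[3][2], Calib *out)
{
    double M[3][3], Mi[3][3], coef[6];
    for (int i = 0; i < 3; i++) {
        M[i][0] = dig[i][0];
        M[i][1] = dig[i][1];
        M[i][2] = 1.0;
    }
    double det = det3(M);
    if (det == 0.0)
        return 0;
    for (int axis = 0; axis < 2; axis++)       /* 0: x' row, 1: y' row */
        for (int col = 0; col < 3; col++) {    /* Cramer: swap in column */
            memcpy(Mi, M, sizeof M);
            for (int i = 0; i < 3; i++)
                Mi[i][col] = scr[i][axis];
            coef[axis * 3 + col] = det3(Mi) / det;
        }
    out->a = coef[0]; out->b = coef[1]; out->c = coef[2];
    out->d = coef[3]; out->e = coef[4]; out->f = coef[5];
    return 1;
}
```

Averaging several digitizer readings per crosshair, as the text describes, reduces the effect of sample noise on the solved coefficients.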
This calibration procedure also compensates for visual parallax. Rather than calibrate the driver once during initialization, we permit the user to recalibrate at will as a way to compensate for periodic changes in viewing position.
The device driver also timestamps the beginning and end of each stroke. In our system, these timestamps are accurate to one sixtieth of a second. The primary use for the timestamp is to detect unintended breaks in a stroke. It is physically difficult for a user to lift and lower the pen in less than 0.07 seconds, so when an application sees a stroke ending and a new one beginning in an interval smaller than that, it may concatenate the two strokes and interpolate the missing data values.
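The gap test an application might apply can be sketched as follows; the tick arithmetic and the 4-tick threshold (roughly 0.07 s at 60 ticks per second) are assumptions consistent with the figures above:

```c
/* Sketch of the gap test: timestamps are in 1/60-second ticks, so a
   pen-up-to-pen-down gap under about 4 ticks (~0.07 s) suggests an
   unintended break; the application may then join the two strokes. */
#define TICKS_PER_SEC  60
#define MIN_LIFT_TICKS  4   /* roughly 0.07 s at 60 ticks per second */

static int should_concatenate(unsigned prev_end, unsigned next_start)
{
    return next_start - prev_end < MIN_LIFT_TICKS;
}
```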
The device-driver interface is further complicated by the possibility of internal buffer overflow. Internal buffer overflow causes immediate cessation of inking to alert the user that something is wrong. The X11 server receives a status report that the stroke ended prematurely; typically, it will discard the stroke as we have found that users tend to lift the pen when the ink ceases and will repeat the stroke when its visible part has been erased. All the inked coordinates are reported, so that the server can erase them.
The policy of the MIT X Consortium to distribute sample source code for X11R4 has greatly facilitated our work. Other proprietary window systems would not have permitted the kinds of modifications necessary to support stylus interaction for a PaperLike Interface.
We have recently contributed a preliminary X11R5 implementation of the PLI for the IBM RISC System/6000 to the MIT X Consortium. The code is available via anonymous FTP from MIT. The future of PLI is potentially a bright one. We hope that others will join us in exploring and developing this technology, and that computing users will find it fun and effective.
Copyright © 1991, Dr. Dobb's Journal