The authors are researchers at the IBM T.J. Watson Research Center and can be contacted at P.O. Box 704, Yorktown Heights, NY 10598 or at jrhyne@ibm.com. Note: Parts of this article were presented at the X-Window technical conference earlier this year.
About four years ago, we began working on enhancements to the X-Window system to provide a stylus-based user interface for handheld computers. This article focuses on those X11 extensions, specifically those that support stylus-driven applications.
We use the term PaperLike Interface (PLI) to distinguish the emerging generation of notepad computers from those machines that rely on keyboard and mouse interaction. Our group has been researching the technology associated with this new class of machines, and we've built several prototypes that run on AIX and X11.
The specifications for our research machines are a moving target, but our goal is to build a machine with a 640 x 480 display (16 gray levels), under 6 pounds, and comparable to a 32-bit personal computer in speed and storage. Currently, the system software for our prototype consists of AIX and a modified X11, Release 4. The operating system includes TCP/IP, sockets, and NFS, and it is quite feasible to run large, compute-intensive applications on a host machine while running an X server on the notepad prototype.
The distributed nature of X applications is vital to our development plans. One of our sample applications is a cooperative meeting application in which several networked users draw on a shared drawing surface (single client, multiple servers). When we acquire wireless LAN capability early next year, the distributed computing model will be even more important.
The software architecture of the system is partitioned into three areas: application, server, and kernel. X systems use a distributed architecture, with multiple client-side applications communicating over a channel (which can be a local area network) with one or more X servers that provide graphics display and input event handling services. The kernel is the component in which hardware dependencies such as device drivers are contained.
The application layer is itself subdivided into four layers. At the topmost level is code that is purely application specific. This code calls on services provided by the next lower layer, the OSF/Motif widget set. (Widgets are user interface components such as dialogs, list boxes and text-edit fields.) The third level down is the so-called intrinsics layer (Xt) of the X11 toolkit, and finally there is the Xlib library of primitives that implement the X client/server protocol.
Implementing our PLI system required modifications to all these areas of the system.
PLI applications are built using an extended version of the OSF/Motif widget set. We've added new widgets to this set, and these widgets connect with an X11 server that has been modified to support an extended protocol. Dispatching stroke events to widgets required modifying the Xt intrinsics layer of the X11 toolkit. And, of course, supporting stylus-oriented interaction required modifications to digitizer device drivers.
We'll describe the modifications to each of these layers, in turn, starting with the OSF/Motif widgets.
A principal new widget we created is called the WritingArea widget. This widget receives strokes from the server and invokes application-supplied callback functions. It is subclassed from the Motif DrawingArea widget and uses that widget's exposure callback and other resources.
The WritingArea widget is basically a primitive stroke-receiver widget combined with replaceable behavior modules invoked as callbacks. Callbacks are provided for stroke receipt and acceptance, for stroke processing, and for exposure events. The stroke receipt callback decides whether to accept or reject the stroke. If the stroke is accepted, the stroke processing callback is invoked with the array of coordinates comprising the stroke. The exposure callback is invoked whenever the server determines that part of the widget's window needs to be redisplayed by the application. The widget maintains a list of active strokes and redisplays them after the application's exposure callback has completed. This widget may also be configured so that it does not store or display accepted strokes. This configuration is useful for applications which will store the strokes and redisplay them during exposure callback processing.
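The callback structure just described can be sketched in plain C. The type and function names here (WritingArea, StrokeReceiptCB, and so on) are illustrative stand-ins, not the actual Motif widget interface:

```c
/* Illustrative sketch of the WritingArea callback model; these names
   are stand-ins, not the actual Motif widget API. */
typedef struct { int x, y; } Point;
typedef struct { Point *pts; int npts; } Stroke;

/* Return nonzero to accept the stroke, zero to reject it. */
typedef int  (*StrokeReceiptCB)(const Stroke *s);
/* Invoked with the coordinate array of an accepted stroke. */
typedef void (*StrokeProcessCB)(const Stroke *s);

typedef struct {
    StrokeReceiptCB on_receipt;
    StrokeProcessCB on_process;
} WritingArea;

/* Mirror of the widget's dispatch: the receipt callback decides
   acceptance; only accepted strokes reach the processing callback. */
static void writing_area_dispatch(WritingArea *wa, const Stroke *s)
{
    if (wa->on_receipt && wa->on_receipt(s) && wa->on_process)
        wa->on_process(s);
}

/* Example policies a particular application might install. */
static int processed_points = 0;
static int accept_if_nonempty(const Stroke *s) { return s->npts >= 2; }
static void count_points(const Stroke *s) { processed_points = s->npts; }
```

The replaceable-module design means an application changes behavior simply by installing different callback functions, without subclassing the widget.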
Exposure event processing follows a typical sequence in which application graphics are generated, followed by the display of writing baselines when appropriate. Library procedures are provided for baseline generation; they accommodate user-specific parameters such as horizontal and vertical spacing and the presence or absence of horizontal segmentation guides (tick marks, for example).
The WritingReco widget provides resources for configuring the recognizers, in addition to those provided by the WritingArea widget. The WritingReco widget relies on services provided by the Recognition/Presentation Toolkit, which is currently being constructed. The various components are shown in Figure 1.
The Recognition/Presentation toolkit supplies the callback routines needed by the WritingReco widget. It also simplifies the programming interface to the recognizers, by providing a consistent user interface to recognition-related services such as error correction, prototype, and recognizer management. In addition, it provides a library of reusable functions for recognition and recognition-related services which would otherwise have to be written by each application developer.
In the PLI interface, the error correction paradigm is that the user selects an erroneous displayed symbol by touching it with the pen, both to replace it with the correct symbol and to correct the recognizer. Error correction is therefore a special mode in which the toolkit receives and interprets strokes, rather than passing them to the recognizers and the application.
A possible design for error correction has an error correction button placed on the title line of the window border. Touching this button places the toolkit in error correction mode. When the user touches a displayed symbol, the touch stroke location is used to select the corresponding symbol from the recognition results.
One of the possible error correction styles is activated; for example, the next symbol from the set of possibilities might be displayed. The user exits the error correction mode by again touching the error correction button. The application designer or user selects an error correction style for each of the application's recognition objects by defining resource values in the usual way.
Other functions, such as adjustment of recognition parameters or training to introduce a new symbol, are accessed by touching another button in the title bar, then touching anywhere in a WritingArea widget's window. A pop-down menu appears, from which the user selects the desired function. Subsequently, a recognizer control panel may appear, or a training window. When the user dismisses these windows, the toolkit exits the special mode and the application resumes normal behavior.
Implementation of these functions is complicated because an application main window may contain several WritingReco widgets. Each one is associated with an instance of the recognition object which contains recent recognition results, strokes, and result display regions, as well as the parameters for recognizing strokes received in the widget's window.
A form for data entry, for example, may be composed of several WritingReco widgets and their associated recognition toolkit instances. A particular widget/toolkit pair might select a recognition vocabulary of numbers, if only entry of numbers is allowed. This sort of restriction is valuable because recognition accuracy and speed are improved, and the user is alerted to entry errors by the display of special symbols where the recognizer is unable to find a suitable match. For example, an "A" entered by the user in a numeric entry field might appear displayed as a "?".
Touching one of the recognition function buttons causes a global variable to be set, which is checked by each recognition object. A stroke received while the variable is set will be routed to the corresponding toolkit function rather than being sent for recognition.
The X11R4 protocol extension for PLI consists of a stroke event and seven requests.
The stroke event has several subcases identified by the detail byte. These subcases include: the start of a stroke, motion during a stroke, the end of a stroke, and proximity (which occurs when the pen position is detectable but the pen is not touching the display surface).
To help the application determine whether to accept the stroke or request the stroke path, the stroke event contains the starting and ending coordinates of the stroke and the maximum and minimum values for X and Y. It also contains a set of flags which indicate whether the start and end points are inside or outside the window. These flags were selected because the corresponding tests were frequently used in previous prototype applications to determine stroke acceptance.
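A C sketch of these fixed-size event fields, and of one quick acceptance test they enable, follows. The structure layout and names are assumptions for illustration, not the extension's actual wire format:

```c
/* Illustrative layout of the fixed-size stroke event; field names and
   packing are assumptions, not the extension's actual wire format. */
typedef struct {
    short start_x, start_y;  /* first point of the stroke  */
    short end_x,   end_y;    /* last point of the stroke   */
    short min_x,   min_y;    /* bounding box of the path   */
    short max_x,   max_y;
    unsigned flags;          /* START_INSIDE / END_INSIDE  */
} StrokeEvent;

#define START_INSIDE 0x1u
#define END_INSIDE   0x2u

/* One acceptance test an application can make without fetching the
   full path: does the whole bounding box lie inside the window
   rectangle (wx, wy, ww, wh)? */
static int bbox_inside_window(const StrokeEvent *ev,
                              int wx, int wy, int ww, int wh)
{
    return ev->min_x >= wx && ev->min_y >= wy &&
           ev->max_x < wx + ww && ev->max_y < wy + wh;
}
```

Because the bounding box and endpoint flags arrive in the event itself, many strokes can be accepted or rejected without the round trip needed to fetch the full coordinate path.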
The stroke event structure is of fixed size, and thus cannot contain the sequence of coordinates generated by the digitizer. To obtain these coordinates, an application makes a request which returns a variable-length data structure. This same request also converts the coordinates from the screen-relative form retained by the server to a window-relative form.
Using another kind of request, applications can accept or reject a stroke. The stroke event contains a server-generated ID used to identify the stroke to be accepted or rejected. The protocol requires that each stroke eventually be accepted or rejected by the applications that see it. When this condition is met, the server will erase the stroke ink and delete the stroke from its queue. The protocol allows strokes to be forced from the server queue, and this may be needed when a client hangs without accepting or rejecting some strokes. Strokes are automatically accepted for a client which dies; to reject them might lead to creation of unwanted pointer events.
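The server's bookkeeping for this accept/reject protocol can be sketched as follows; the structure and names are hypothetical, not the server's actual code:

```c
/* Hypothetical server bookkeeping for the accept/reject protocol:
   each queued stroke carries the count of clients that saw it and
   have not yet replied; when the count reaches zero, the server can
   erase the ink and drop the stroke from its queue. */
typedef struct {
    unsigned id;   /* server-generated stroke ID carried in the event */
    int pending;   /* clients that have not yet accepted or rejected  */
    int erased;    /* set once every reply (or forced removal) is in  */
} QueuedStroke;

/* Called on each accept or reject reply; acceptance and rejection
   are bookkept identically here, though routing reacts differently. */
static void stroke_reply(QueuedStroke *s)
{
    if (s->pending > 0 && --s->pending == 0)
        s->erased = 1;   /* deink and dequeue */
}
```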
Stroke replies contain scaled coordinates rather than pixel coordinates (see the discussion in the "Device Driver" section for details) and cannot be drawn using the XDrawLine library function. To simplify application programming, the extension provides an XDrawStroke function and protocol request with similar parameters. The server converts the stroke coordinates and invokes the line-drawing procedure.
There is also a request which allows a client to request realignment of the digitizer and the display. The client that performs the function is typically invoked from the window manager's menu.
Another similar request allows a client to set the pointer button being emulated by the stylus. This is not set from the window manager menu, but from a small icon permanently displayed on the screen. There is a request to enqueue a stroke, which is used to help debug the server and the toolkits. Finally, applications can query the server for details about the display and digitizer capabilities by using yet another request.
The stroke processing functions of the X11 server have been grouped into a server extension, with a corresponding extension to the X11 protocol. The design of these functions is somewhat surprising, as a stylus is neither a keyboard nor a mouse, but may be called upon to emulate either.
Experiments with our early prototypes led to the following observations:
We observed that users tended to work in a particular window, and this suggested routing strokes to a particular window until that window's application rejected a stroke. When the server receives a stroke rejection, it selects another candidate window for the rejected stroke and all that follow it. This routing scheme permits an application to capture handwriting which runs outside of window boundaries. It also permits an application to recognize a stroke before deciding whether to reject or accept it. However, this algorithm has the property that a misbehaving client can cause all strokes to be routed to it and defeat pointer emulation. When this happens, the server becomes useless until the client is killed by some external means (such as telneting in from another workstation).
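This routing rule can be sketched as a small state machine; the sketch is an illustrative reconstruction, not the server's actual code:

```c
/* Illustrative reconstruction of the routing rule: strokes continue to
   go to the current stroke window until its client rejects one; the
   rejected stroke and all that follow go to a new candidate. */
typedef struct { int current_window; } StrokeRouter;

/* next_candidate is whichever window the server would select for a
   rejected stroke (for example, the stroke window under the pen). */
static int route_stroke(StrokeRouter *r, int rejected, int next_candidate)
{
    if (rejected)
        r->current_window = next_candidate;
    return r->current_window;
}
```

The sticky choice of window is what lets handwriting spill over a boundary and still reach the application the user is working with.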
Alternative solutions considered were: moving the recognition function to the server and using recognition results to assist in the routing decision, or routing strokes to all windows at the same time and letting them decide whether to accept or reject the stroke. Moving the recognizer seemed infeasible because each application requires a distinct symbol set and applies differing criteria to weight recognition results. In addition, the interface to the recognition software is quite complicated. We may revisit this decision in the future, as we better understand the requirements for recognition and its software architecture. At first glance, routing strokes to all clients at the same time seems an invitation to chaos. However, applications may be designed with this behavior in mind and should agree on a unique recipient virtually all of the time.
There are several cases to consider. In the first case, the stroke falls entirely within a nonstroke window and is handled by pointer emulation, described below.
In the second case, the stroke lies entirely within the window, so there is only one routing candidate.
In the third case, in which the stroke is partly outside the stroke window, there are two variations, depending on whether the other candidate window is a stroke or nonstroke window.
If it is a stroke window, the acceptance/rejection test is based on where the salient point of the gesture or character falls. If the primary application recognizes the stroke and its salient point falls inside the window, the application accepts the stroke. The other application may also recognize the stroke, but finds that the salient point falls outside the visible region of its window, and so rejects it. If the other application is not performing recognition, it should reject any stroke which lies partially outside the visible region of its window. If neither window is performing recognition, both will reject the stroke and it will disappear. We hope the user will find this response reasonably intuitive and will then make the stroke again within the proper boundaries.
If the stroke falls partly outside the stroke window onto a nonstroke window, the stroke is not turned into a pointer event unless there are no stroke candidates, or all stroke candidates have rejected the stroke. Therefore, the stroke window will see the stroke events, but the nonstroke window will not see pointer events unless the stroke window rejects the stroke. A misbehaving stroke application can prevent a stroke that enters its window from being turned into pointer events. The user can make the stroke again, avoiding the window of the misbehaving application, if pointer emulation was intended. The stroke remains on the display until all candidates have accepted or rejected it. The user expected the stroke to disappear (as a result of pointer emulation), and its failure to disappear is a clue that an application is misbehaving.
The fourth and last case is one in which the user drags the stylus as if it were a pointer. This case is difficult because the pointer emulation decision must occur at the start of the stroke. In the meantime, the motion of the stylus may cross several windows (which can be either stroke or nonstroke windows).
What will likely trouble the user is that the drag echo won't occur until the user has lifted the stylus; this is not what is expected.
Special handling is necessary here. Without it, most users performing a drag would quickly discover that the button-down event appeared at the wrong position and that they had missed the target they were trying to hit; the problem is especially pronounced when dragging a window border to resize it, because the borders are narrow. Our handling exploits the observation that the mouse is held essentially still at the start of a drag (and so is the stylus): if the start of the stroke lies in a nonstroke window, and the stylus remains relatively stationary for a brief period (for example, 100 msec), then the stroke is converted to a series of pointer events and never routed as a stroke.
X11 allows applications to indicate interest in getting reports of various kinds of events which occur in each of their windows. We extended this mechanism to stroke events, and used it to trigger pointer emulation. If a window is tagged for pointer events, but not for stroke events, then a stroke which would be routed to this window is converted into pointer events.
The conversion is a natural one: The stroke start becomes a button-down, the stroke end becomes a button-up, and the intermediate reports become pointer motions. The stylus thus naturally mimics the mouse, and experienced mouse users rarely make mistakes in employing the stylus. The stylus leaves an ink trail in this mode and although this is initially noticeable, for instance while moving or resizing a window, it does not impede the user and none of our subjects has asked us to eliminate it. The server deinks strokes as soon as it determines that pointer emulation is active, and the ink is usually gone within a fraction of a second.
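The conversion can be sketched as follows; the event representation is simplified and the names are illustrative, not actual X protocol structures:

```c
/* Simplified pointer-emulation mapping: first sample -> button press,
   last sample -> button release, everything between -> motion. The
   event representation is illustrative, not an X protocol structure. */
enum PtrEvType { BUTTON_PRESS, MOTION, BUTTON_RELEASE };
typedef struct { enum PtrEvType type; int x, y; } PtrEvent;

/* Convert n stroke samples (n >= 2) into n pointer events. */
static int emulate_pointer(int (*pts)[2], int n, PtrEvent *out)
{
    for (int i = 0; i < n; i++) {
        out[i].x = pts[i][0];
        out[i].y = pts[i][1];
        out[i].type = (i == 0)     ? BUTTON_PRESS
                    : (i == n - 1) ? BUTTON_RELEASE
                    :                MOTION;
    }
    return n;
}
```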
We currently provide multiple-button support via a small icon which the user may touch to select the button being emulated. This provides the needed function, but encourages frequent user errors because users forget to restore the original button setting.
This kernel component manages the hardware interface to the digitizer, generates ink on the display, and provides a standard interface to the X11 server. Anticipating frequent changes to digitizer and display hardware as well as the need to support several operating systems, we constructed the PLI driver in three parts:
The device driver is opened by the server. Digitizer reports are then read as a character stream. The server can be notified when data is available; in AIX, the select system call is used. The server may control the behavior of the device driver by writing to it. If supported by the operating system, the device driver may place its data directly in a circular buffer accessible to the server, to avoid the system call overhead and double copying of the data.
When the pen touches the writing surface, the device driver begins to report a stream of coordinates to the server. At the same time, the device driver is generating an ink trace on the display. The stream of coordinates from pen-down to pen-up is called a stroke, and is the primary data unit reported by the device driver. To avoid excessive overhead, the device driver buffers the coordinate stream and occasionally indicates, via select, that data is available for the server. Our current digitizer provides position reports even when the stylus is a small distance above the surface. The device driver does not buffer this data, but periodically reports the current position.
Inking is done in the device driver to provide realtime feedback. The X11 server runs as a single threaded application process and cannot guarantee realtime attention to the device driver. The device driver saves the critical display state, performs its inking, and restores the display state; thus, it can time-share the display with the X11 server. Unfortunately, not all displays are designed so that the state can be saved and restored, and in this case, the X11 server will need to be extensively modified to provide a separate inking thread with locks to control sharing of the display. The server will erase the ink, which eliminates the need for the device driver to buffer potentially large amounts of data in its memory.
The Bresenham line algorithm is used to connect successive digitizer points while the stylus switch is depressed. Because of the high sampling rate of the digitizer, the stylus rarely moves more than one or two pixel positions on the display between samples. The inking process is invoked only when the stylus has moved more than one pixel from the previous sample.
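A sketch of this inking step, using the standard Bresenham algorithm and the one-pixel movement threshold described above; the plot routine stands in for writing to the ink plane:

```c
#include <stdlib.h>

/* Sketch of the inking step: successive digitizer samples are joined
   with Bresenham's line algorithm, and the step is skipped entirely
   when the pen has moved at most one pixel since the last sample. */
static int plotted;                 /* pixels "drawn" (stand-in for the ink plane) */
static void plot(int x, int y) { (void)x; (void)y; plotted++; }

static void bresenham(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

/* Ink from the previous sample only when movement exceeds one pixel. */
static void ink_sample(int px, int py, int x, int y)
{
    if (abs(x - px) > 1 || abs(y - py) > 1)
        bresenham(px, py, x, y);
}
```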
Ink is generated on one of the four planes of the display. The server may freely use the other three planes, providing eight gray levels. The ink plane is combined with the display planes using XOR implemented in the display color map. Other ink-combining functions are possible, but preserving the contrast between ink and application graphics is critical.
There are three coordinate systems to contend with: digitizer coordinates, display screen coordinates, and window-relative coordinates.
The digitizer resolution is typically 2 to 16 times greater than the display resolution, and the digitizer resolution must be preserved for accurate recognition. To generate the ink trace, coordinates must be converted to display screen units. Furthermore, the server and applications want to see stroke information relative to the display screen or to windows on the display screen, and not in some coordinate system provided by the digitizer manufacturer.
The device driver addresses these issues by returning scaled screen coordinates which have been multiplied by a factor of 2, 4, 8, or 16. The subpixel resolution of the digitizer is preserved, and the conversion back to integral pixel coordinates can be done with a right shift.
The device driver uses a simple linear model to convert the digitizer coordinates to scaled display coordinates:
x' = ax + by + c
y' = dx + ey + f
The linear model requires eight parameters and compensates for scale, translation, and rotation between the digitizer and the display coordinate systems introduced when the display and digitizer are joined together. The computation uses integer arithmetic, because floating-point services are not usually available to device drivers.
The coefficients a through f are prescaled to prevent loss of significance during the computation. The resulting coordinates are pixel values scaled to preserve the dynamic range of the digitizer. Currently, we use a scaling factor of 2^2 (that is, 4).
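In C, a fixed-point version of the conversion might look like the following. The choice of 16.16 coefficients is an illustrative assumption; two fractional bits (a factor of 4) are retained in the result, as described above:

```c
#include <stdint.h>

/* Fixed-point sketch of the driver's conversion. Coefficients a..f are
   assumed prescaled into 16.16 fixed point (an illustrative choice);
   the result keeps two fractional bits, i.e. screen coordinates scaled
   by 2^2 = 4. A right shift by 2 recovers integral pixels. */
#define COEF_FRAC 16  /* fractional bits in the prescaled coefficients */
#define OUT_FRAC   2  /* fractional bits kept in the reported result   */

typedef struct { int32_t a, b, c, d, e, f; } Xform;

static int32_t to_scaled_x(const Xform *t, int32_t x, int32_t y)
{
    int64_t acc = (int64_t)t->a * x + (int64_t)t->b * y + t->c;
    return (int32_t)(acc >> (COEF_FRAC - OUT_FRAC));
}

static int32_t to_scaled_y(const Xform *t, int32_t x, int32_t y)
{
    int64_t acc = (int64_t)t->d * x + (int64_t)t->e * y + t->f;
    return (int32_t)(acc >> (COEF_FRAC - OUT_FRAC));
}

static int32_t to_pixel(int32_t scaled) { return scaled >> OUT_FRAC; }
```

The 64-bit accumulator avoids overflow in the products, and the single right shift per coordinate keeps the per-sample cost low, which matters at digitizer sampling rates.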
The eight parameters must be provided by the server, and are written to the device driver during its initialization. Generally, the parameters are obtained by displaying a crosshair at three locations on the display and asking the user to touch each crosshair. The crosshair coordinates and the averaged digitizer coordinates fully determine six parameters of the conversion function. The other two parameters are fixed at design time by the dynamic range of the digitizer and the resolution ratio between the digitizer and the display. One writes a command to the device driver to turn off the inking and set up the unity conversion function, and the driver subsequently reports the raw digitizer coordinates. After the six parameters are computed, they are written to the device driver and inking is restored.
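The six coefficients can be computed from the three crosshair correspondences by solving two 3x3 linear systems, for instance with Cramer's rule. This sketch runs in the server, where floating point is available; the function names are illustrative:

```c
#include <string.h>

/* Sketch of the calibration computation: three crosshair positions and
   the averaged digitizer readings at each determine the six
   coefficients. Two 3x3 systems are solved with Cramer's rule; names
   are illustrative. This runs in the server, where floating point is
   available (unlike the device driver). */
typedef struct { double a, b, c, d, e, f; } Calib;

static double det3(double m[3][3])
{
    return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
}

/* dig[i]: averaged digitizer reading at crosshair i;
   scr[i]: display coordinate of crosshair i.
   Returns 0 if the crosshairs are collinear (no unique solution). */
static int solve_calibration(double dig[3][2], double scr[3][2], Calib *out)
{
    double M[3][3], Mi[3][3], coef[6];
    for (int i = 0; i < 3; i++) {
        M[i][0] = dig[i][0];
        M[i][1] = dig[i][1];
        M[i][2] = 1.0;
    }
    double det = det3(M);
    if (det == 0.0)
        return 0;
    for (int axis = 0; axis < 2; axis++)       /* 0: x' row, 1: y' row */
        for (int col = 0; col < 3; col++) {    /* Cramer: swap in column */
            memcpy(Mi, M, sizeof M);
            for (int i = 0; i < 3; i++)
                Mi[i][col] = scr[i][axis];
            coef[axis * 3 + col] = det3(Mi) / det;
        }
    out->a = coef[0]; out->b = coef[1]; out->c = coef[2];
    out->d = coef[3]; out->e = coef[4]; out->f = coef[5];
    return 1;
}
```

Averaging several digitizer readings per crosshair, as the text describes, reduces the effect of sample noise on the solved coefficients.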
This calibration procedure also compensates for visual parallax. Rather than calibrate the driver once during initialization, we permit the user to recalibrate at will as a way to compensate for periodic changes in viewing position.
The device driver also timestamps the beginning and end of each stroke. In our system, these timestamps are accurate to one sixtieth of a second. The primary use for the timestamp is to detect unintended breaks in a stroke. It is physically difficult for a user to lift and lower the pen in less than 0.07 seconds, so when an application sees a stroke ending and a new one beginning in an interval smaller than that, it may concatenate the two strokes and interpolate the missing data values.
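The gap test an application might apply can be sketched as follows; the tick arithmetic and the 4-tick threshold (roughly 0.07 s at 60 ticks per second) are assumptions consistent with the figures above:

```c
/* Sketch of the gap test: timestamps are in 1/60-second ticks, so a
   pen-up-to-pen-down gap under about 4 ticks (~0.07 s) suggests an
   unintended break; the application may then join the two strokes. */
#define TICKS_PER_SEC  60
#define MIN_LIFT_TICKS  4   /* roughly 0.07 s at 60 ticks per second */

static int should_concatenate(unsigned prev_end, unsigned next_start)
{
    return next_start - prev_end < MIN_LIFT_TICKS;
}
```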
The device-driver interface is further complicated by the possibility of internal buffer overflow. Internal buffer overflow causes immediate cessation of inking to alert the user that something is wrong. The X11 server receives a status report that the stroke ended prematurely; typically, it will discard the stroke as we have found that users tend to lift the pen when the ink ceases and will repeat the stroke when its visible part has been erased. All the inked coordinates are reported, so that the server can erase them.
The policy of the MIT X Consortium to distribute sample source code for X11R4 has greatly facilitated our work. Other proprietary window systems would not have permitted the kinds of modifications necessary to support stylus interaction for a PaperLike Interface.
We have recently contributed a preliminary X11R5 implementation of the PLI for the IBM RISC System/6000 to the MIT X Consortium. The code is available via anonymous FTP from MIT. The future of PLI is potentially a bright one. We hope that others will join us in exploring and developing this technology, and that computing users will find it fun and effective.
Copyright © 1991, Dr. Dobb's Journal