January 1993/The UUCP g Protocol

Features

The UUCP g Protocol

Ian Lance Taylor

Ian Lance Taylor took a course in COBOL at the Cambridge Rindge and Latin High School, from which he graduated in 1982. Since then he has written an ANSI C library for a comuter system which nobody has ever heard of, has ported the DECUS version of Zork from Fortran to C, has written a complete (UNIX only) UUCP package, and has read Knuth from cover to cover. He works for Cygnus Support, a company dedicated to supporting free software. He can be reached at ian@airs.com.

Introduction
The UUCP suite of programs is widely used in the UNIX world to transfer mail and news between computers. UUCP implementations are also available for many types of personal computers. This article is a detailed description of the original, and most frequently used, UUCP transport-layer protocol.
The basic UUCP service is unattended file transfer over telephone lines or a network connection. Other programs build on that base to provide remote program execution. The first UUCP implementation was written by Mike Lesk at AT&T in 1976. UUCP has been rewritten and enhanced several times since then, most notably by Peter Honeyman, David A. Nowitz, and Brian E. Redman in 1983, who wrote what is known as HoneyDanBer UUCP. I recently rewrote the program suite from scratch, in order to distribute it in sourcecode form under the GNU Public License. My implementation is modestly entitled Taylor UUCP.
UUCP is a point-to-point protocol, which means that it is used between two computers at a time. The protocol has two layers, a session layer and a transport layer. The two computers use the session layer to identify each other and to agree on which files should be transferred. The session layer relies on the transport layer to provide error-free data transfer across the communication link.
UUCP uses several different transport-layer protocols, which are optimized for different types of communication links. Not all UUCP implementations support every protocol. When two computers begin a UUCP session, they negotiate which transport layer protocol to use. The transport protocols are known by single character names. The most common ones are g, f, t, and e.
In the article I am going to discuss the g protocol. It is the original UUCP transport-layer protocol, and it is the only one supported by all UUCP implementations. It is based on a packet-driver protocol originally written by G.L. Chesson at Bell Labs. It is intended to work over unreliable connections, such as phone lines, but it requires an eight-bit transparent connection. If any special characters cannot be sent over the line, such as XON or XOFF, it will not work.
There is not enough room here to present the complete g protocol implementation used by Taylor UUCP, so this article merely shows the high points. Listing 1 shows a number of typedefs, macro definitions, variables, and prototypes that are used in the later code. The complete Taylor UUCP code is available for anonymous FTP from various sites, including ftp.uu.net.
Since the g protocol works over noisy phone lines, it has to be prepared for data to be modified, or even lost, as it is sent from one computer to the other. (Some types of network connections can also duplicate data. The g protocol does not always handle this correctly).
This protocol provides flow control. The computer sending data is permitted to send only a certain number of bytes to the receiving computer before it must wait for an acknowledgement. That prevents a fast computer from overwhelming a slower computer with more data than the latter can handle.
The g protocol communicates in packets. There are two types of packets, control packets and data packets. The header of a data packet tells the receiver how much data is present. It also has a checksum which is used to detect transmission errors.

Packet Header
All packets begin with a six-byte header. Control packets have no attached data, and are therefore exactly six bytes long. Data packets contain data following the header bytes. The header bytes are shown in Figure 1.
The first header byte is always the ASCII character DLE, which is octal 020 or control-P. If there is a transmission error, this byte can be used to locate the start of the next header.
The second header byte is known as K. For a control packet, the value is always 9. For a data packet, K indicates how many bytes of data follow the six-byte header. The amount of data is 2^K+4. The K value is always between 1 and 8, which means that the amount of data in a single packet can be any power of 2 between 32 (2¹⁺⁴) and 4,096 (2⁸⁺⁴).
The third and fourth header bytes form a two-byte (16-bit) checksum. The third byte is CHECKLOW and the fourth byte is CHECKHIGH. The full checksum is
(CHECKHIGH << 8) | CHECKLOW
The checksum differs for control and data packets, and is described further below.
The fifth header byte is the control byte. It is composed of three bit fields, known as TT (2 bits), XXX (3 bits), and YYY (3 bits). The value of the control byte is
(TT << 6) | (XXX << 3) | YYY
TT indicates the type of the packet:

0 — control packet

1 — alternate data channel

2 — data packet

3 — short data packet
The meanings of XXX and YYY depend upon the type of packet, and are described further below. The alternate-data-channel packet type (with a TT value of 1) is not used by UUCP.
The sixth and last header byte is a simple check on the validity of the header. It is the exclusive-OR of the second through fifth header bytes (the first byte is always DLE, and so does not have to be checked further). If the exclusive-OR check fails, the header data is invalid and is ignored.

Windows and Flow Control
Each data packet has a packet sequence number, which is the XXX value. The packet sequence number is always between 0 and 7. At the beginning of the session it starts at 1, goes up to 7, then wraps around to 0. The sequence numbers for incoming and outgoing packets are independent. After one computer sends out packet 7, the next packet it sends out is packet 0, regardless of how many intervening packets it has received.
The packet sequence number is used to detect data loss. If the packet following packet 5 is not packet 6, then information has been lost. Some protocols, such as the Internet TCP protocol, permit packets to arrive out of order, but the g protocol considers this an error. Control packets have no sequence number, so there is no way to detect a lost control packet.
The packet sequence number is also used for flow control. During protocol initialization each computer announces a window size. The window announced by one computer is the number of packets the other computer may send before it sees an acknowledgement.
There are two ways to send an acknowledgement. One is the YYY field of a data packet. The other is the YYY field of an RR control packet. (Control packets are described further below.) The two methods allow the acknowledgement to be either combined with data or not combined, depending on whether the acknowledging computer has any data to send. In either case, the acknowledgement gives the last packet number which was correctly received. Every packet must be acknowledged, and every packet must be acknowledged in order.
For example, suppose that computer A has announced a window size of 3, and that at some point during the conversation A has sent an acknowledgement for packet 6. This means that computer B may send 3 more packets — namely packets 7, 0, and 1 — but it may not send packet 2 until it has received an acknowledgement for packet 7. Note that the window size announced by A is a restriction on which packets B is permitted to send.
The window size may range from 1 to 7. A window size of 1 means that each packet must be acknowledged before the next one is sent. To see why the window size may not be 8, suppose that computer A has just sent packet 0, and received an acknowledgement for it. If the window size were 8, A could then send packets 1, 2, 3, 4, 5, 6, 7, and 0. If all eight packets were lost, B might time out, assume that its acknowledgement was lost, and acknowledge packet 0 again. When A saw the acknowledgement, it would not know whether the eight packets it sent were lost or whether B saw them all but all the acknowledgements but the last one were lost. With a maximum window size of 7, each packet acknowledgement is unambiguous.
The original UUCP implementation always used a window size of 3, and some implementations have followed its lead. Many newer implementations default to the maximum window size of 7.
The window size prevents a fast computer from overwhelming a slow computer. If there were no way to slow down the fast computer by forcing it to wait for an acknowledgement, it could eventually fill the input buffers of the slow computer and cause data to be lost. The problem would be detected, since there would be checksum failures, but the time required for the computers to get back in sync would slow the overall data transfer rate drastically.
Many systems rely on XON/XOFF handshaking for flow control. Data transmission stops when an XOFF character (control-S) is received, and starts again when an XON character (control-Q) is received. This does not work with the g protocol, which requires an eight-bit transparent communication line. (For example, one of the checksum bytes in the header might be XON or XOFF.)
Some modern error-correcting modems and serial ports can use various forms of hardware handshaking for flow control. This does not make the g protocol useless, since error detection is still necessary for the connection between the modem and the computer. In any case, a large window is relatively efficient even when it is not needed.

Data Packets
A data packet is indicated by a TT value of 2 or 3. The K value must be between 1 and 8, for a packet size between 32 and 4,096 bytes. This is the number of bytes which follow the header. For example, a header with a K value of 2 indicates a total packet size of six header bytes plus 64 data bytes for a total of 70 bytes.
During protocol initialization, each side announces the largest data packet it is prepared to receive. The original UUCP implementation always set the maximum data-packet size to just 64 bytes. Many later implementations have followed suit. While this small packet size was once reasonable, modern high-speed error-correcting modems are much more efficient when larger packet sizes are used.
Not every file contains a multiple of 32 bytes, which is the minimum packet size. Therefore, there has to be a way to transfer fewer bytes in a single packet. This is done with a short data packet, with a TT value of 3. The K value always indicates the physical number of bytes that are sent over the communication link. However, a short data packet contains a smaller number of logical bytes.
If the first data byte of a short data packet is less than 128, then that byte is subtracted from the physical packet length to get the logical packet length. The real data starts with the second byte. Otherwise a 15-bit value is formed with the low-order seven bits of the first byte and the complete second byte. The logical packet length is this value subtracted from the physical packet length, and the real data starts with the third byte. The logical length is given using subtraction because, if the logical length is only one byte less than the physical length, there is only a single byte available to specify the logical length.
The data-packet layout is designed so that the data can be manipulated as a block, without having to consider each individual byte (other than when computing the checksum). Some other communications protocols, such as Zmodem, use escape characters embedded within the data. This forces the code to examine each data byte and make a decision about it, which is less efficient.
Listing 2 shows the fgsenddata function, which sends out a data packet. The zdata argument points to the data, and the six bytes before zdata are available to hold the header (ensuring that these six bytes are available saves a call to memcpy in the common case). The cdata argument is the length of the data in zdata. If the packet size is larger than 64, then this code assumes that it is dealing with a newer UUCP implementation which will be able to handle different packet sizes within the same session.
Listing 2 shows a call to the igchecksum function, which is shown in Listing 3. Note that every byte sent with the data packet participates in the checksum, even though the values of some of the bytes may be unimportant in a short data packet.
The g protocol checksum is an ad hoc hashing scheme. It is the most time-consuming part of the entire protocol, and it is not even particularly good at detecting errors. A cyclic redundancy check would be more efficient to compute and would also catch more errors. For a brief but interesting discussion of this, see the book Design and Validation of Computer Protocols, by Gerard J. Holzmann. Note that after the checksum is computed it is manipulated still further before being placed in the packet header, as shown in Listing 2.

Control Packets
A control packet is indicated by a TT value of 0 and a K value of 9. The XXX value indicates the type of control packet:

0 — not used

1 — CLOSE

2 — RJ

3 — SRJ

4 — RR

5 — INITC

6 — INITB

7 — INITA
A CLOSE packet is sent when the conversation is finished, and the g protocol is shutting down. It is also sent if some fatal error is forcing one computer to drop the connection. The YYY value is ignored.
The RJ control is also known as NACK. It means that there was some problem with a received packet, such as a checksum error, and that data must be resent. The YYY value is the last successfully received packet. Every subsequent packet must be resent.
The SRJ control is actually not used by UUCP. Most UUCP programs will not recognize it. It is supposed to request retransmission of just packet YYY.
The RR control is also known as ACK. It is used to acknowledge a data packet without sending a data packet in return. (Acknowledgements were discussed above in the section on windows and flow control). The YYY value is the packet being acknowledged.
The INIT controls are used during protocol initialization. When the protocol starts up, each computer exchanges pairs of each type of INIT packet. The computer which placed the call sends an INITA, then the other computer responds with an INITA, and similarly for INITB and INITC. The YYY value for an INITA or INITC packet is the window size that the other computer should use. The YYY value for an INITB packet indicates the packet size that the other computer should use. This is encoded as one less than the K value for the desired packet size, so that it is a value from 0 to 7. The initialization exchange is not a negotiation. Each computer simply makes a request. Both computers may wind up using different window and packet sizes.
Listing 4 shows the code to send a control packet, and shows how the checksum value is computed.

Error Handling
Error handling is an important aspect of any communications protocol. The bulk of the Taylor UUCP implementation is devoted to it, although there is unfortunately not enough space to present the actual code. A protocol must be able to detect communication errors to ensure that it receives correct data. After an error occurs, the protocol must be able to quickly recover and continue transmitting information.
There is no way to ensure completely reliable data transfer across telephone lines. They can, in theory, arbitrarily corrupt data. In practice, however, telephone lines are fairly reliable. A fairly simple checksum is thus adequate to detect data corruption. The g protocol relies on checksums for error detection, and negative acknowledgements and timeouts for error recovery.
An error in a packet header can be detected by an invalid exclusive-OR byte or an invalid K value. After an error, the receiving computer must start looking for a new DLE byte. It must start looking at the second byte of the header. It may not skip the rest of the header, and it may certainly not skip the entire packet. If data were lost, then the start of the next packet might appear in the skipped data. Another DLE byte might be found that is actually part of the data in the corrupted packet, but the exclusive-OR byte should prevent it from being treated as a legitimate packet header.
An error in packet data is detected by an invalid checksum. If an invalid packet is seen, the receiving computer will send an RJ packet to tell the sending computer that the packet must be resent. The receiver must then start looking for a new DLE byte. Once again, it must not skip the packet data, since the checksum error might have been caused by lost data. The next packet header might be somewhere within the bytes which were assumed to be the data for the erroneous packet.
An RJ packet includes the sequence number of the last correctly-received packet. All subsequent packets must be resent. This is referred to as a go-back-N protocol. After an error, the sending computer must resend all the packets from a particular point, rather than just resend the erroneous packet. An SRJ packet could be used to request just the erroneous packet, but most UUCP implementations do not implement it. A go-back-N protocol can be less efficient than just resending a single packet, but it is easier to implement.
There are some concerns with a go-back-N protocol. If computer A is temporarily overloaded with work, it might not find time to process its input buffers, and data might be lost. When it notices the loss, it will send an RJ packet. Computer B is permitted to respond with an entire window full of data, but if it does computer A might simply get overloaded again. Taylor UUCP uses a simple slow-start algorithm, which temporarily shrinks B's sending window to a single packet and opens it up slowly as each acknowledgement is received. Part of this code can be seen at the end of Listing 2.
A lost packet will cause the next packet to be seen as an out-of-order packet, a case which requires a bit of thought. If A detects an invalid packet, it will send an RJ packet. It will then expect to see another copy of the invalid packet. However, B will probably have already sent the next few packets. It may even have sent an entire window full of packets. If A sends an RJ packet for each out-of-order packet, B will see a sequence of RJ packets all requesting the same invalid packet, which it will have to send multiple times. Taylor UUCP simply ignores out-of-order packets, and relies on a timeout to detect completely lost packets. More sophisticated approaches are possible.
Although phone lines cannot duplicate data, certain types of network connections can. This can confuse the g protocol. Suppose that the window size is 7, and that A sends out packets 4, 5, 6, 7, 0, 1, 2. After A received an acknowledgement for packet 4, it will send out packet 3. If packet 4 is duplicated, and the duplicate arrives at B just after packet 3 arrives, B will see the wrong data. This is a potential problem for any windowing protocol. It is generally avoided by making the window large enough to ensure that any duplicate packets will arrive before a complete window is sent. For example, the sequence number used by the Internet TCP protocol is a full 32 bits.

The UUCP Session Layer
The g protocol is used by the UUCP session layer to transmit UUCP commands and data files. UUCP commands are simple strings. When using the g protocol, they are sent as a sequence of data packets (not short data packets). The last data byte of the last packet in the sequence is a null byte, and the string is itself terminated by a null byte. Normally only one or two packets are required.
UUCP sends a file by simply sending the data in a sequence of data packets or short data packets. At the end of the file a short data packet is sent with a logical packet length of zero bytes.

References
Holzmann, Gerard J. 1991. Design and Validation of Computer Protocols. Englewood Cliffs, NJ: Prentice-Hall.