May 1997/Sockets: Down and Dirty Programming for the Web

INTERNET DEVELOPMENT

Sockets: Down and Dirty Programming for the Web

John W. Ross

Learn how to say "Hello, Web server" and you're on your way to writing real Web clients.

By now, the World Wide Web and the Internet have entered the public consciousness to an extent that nothing related to computers ever has. Many of the organizations we encounter daily, from companies to TV shows, now have a Web page. This is one of the most exciting and fastest growing areas of computing. Fortunes are being made by those developing applications for the Web, and I'm sure many more remain to be made. If you are a programmer or software developer looking for new markets, this is the place to be.

In this article, I show how to get started writing applications for the World Wide Web. I don't mean CGI programs but stand-alone applications that talk directly to a Web server (or any other type of server on the Internet for that matter). To do this you must learn a bit about the low-level details of network programming, hence the "down and dirty" of the title.

To illustrate the principles involved, I present a little utility that will retrieve the header information and title associated with any Web document — information that normally only your browser sees.

The World Wide Web consists of applications that use the Internet for communications, which in turn uses a protocol, or set of communications rules and standards, called TCP/IP. To write a World Wide Web application you only need to know two things: how to write the client part of a TCP/IP client-server application; and enough about the HTTP protocol to get a Web server to do something useful. I consider each of these topics separately, then put them together to create the utility.

Client-Server Programming with TCP/IP

This is a complex topic if you want to become fully conversant with it, and if you do I suggest you consult the excellent texts referenced in the bibliography. Fortunately, for the purposes of this article you don't have to become an expert, and I will cover what you will need to know for most applications. It helps to know a little about the protocol that Internet applications (including World Wide Web applications) use to communicate, so I start with that.

TCP/IP

The Internet (like many other networks) uses the TCP/IP communications protocol. This is a layered protocol, named after the standards defined in two of its layers: the Transmission Control Protocol and the Internet Protocol.

The point of using a layered protocol is that we only have to deal with the layer of interest to us. As programmers, we get to treat networks on an abstract level, and don't have to worry about the messy, low-level details. The lowest levels of TCP/IP specify wiring standards and how TCP/IP works with an underlying physical network like an Ethernet LAN. The next layer up, the IP layer, is the part of the protocol that provides a datagram delivery service between two points on a TCP/IP network (like the Internet). We don't have to worry about any of this — what we must concern ourselves with is the next layer up, the TCP, or transmission control layer.

The reason we don't use IP directly is that it is an unreliable delivery service. A data message sent with IP is chopped up into chunks called packets and these are sent out independently from the source to the destination. IP doesn't guarantee that all of the packets will arrive, or that they will arrive in the correct order. TCP uses the packet delivery service of IP to provide a reliable, stream-oriented delivery service. Some of the properties of TCP are a) it uses a stream-of-bytes orientation — no structure is imposed on the data, b) a virtual circuit connection is established wherein the sender and receiver communicate the status of all communications and c) data is buffered before transmission for efficiency. Reliability is provided through positive acknowledgement with retransmission in the event that packets get lost.

TCP/IP therefore provides a mechanism to establish a connection between two programs on the Internet and to reliably exchange data between them.

Internet Addressing

All computers on the Internet are identified by a unique 32-bit integer called an IP address. Actually it is the connection to the Internet that is so identified, but we can usually ignore this distinction. Programs that are going to exchange data across the Internet need to know each other's IP address. We can specify this either by using dotted decimal notation like 140.100.20.1, where each octet of the 32-bit integer is written as a decimal number and the numbers are separated by periods, or by using the more familiar domain-name system, such as www.someplace.org.
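To make the relationship between the two notations concrete, the following sketch packs four octets into a single 32-bit value. The function name packAddress is hypothetical, invented here for illustration only; a real program would normally leave this conversion to a library routine.

```c
/* Pack the four octets of a dotted-decimal address into one 32-bit value,
 * most significant octet first. packAddress is a hypothetical helper used
 * only for illustration. */
unsigned long packAddress(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((unsigned long)(a & 0xFF) << 24) |
           ((unsigned long)(b & 0xFF) << 16) |
           ((unsigned long)(c & 0xFF) << 8)  |
            (unsigned long)(d & 0xFF);
}
```

With this helper, packAddress(140, 100, 20, 1) gives the value 0x8C641401, since 140, 100, 20 and 1 are 8C, 64, 14 and 01 in hexadecimal.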

The domain-name system consists of a naming convention, a distributed database, and Internet programs called name servers that automatically convert domain names to the equivalent IP addresses.

When two applications communicate using TCP, they not only need to know IP addresses, but need some means of distinguishing among multiple destinations on the computers. To distinguish multiple destinations applications use an abstract destination point called a port, which is just a small, positive integer. Certain port numbers (those below 1024) have been reserved for particular applications — these assignments are called well-known port assignments. For example, telnet uses well-known port 23 and ftp uses 21. World Wide Web applications are assigned port 80.

Client/Server Applications

Any networked application depends on communicating some information between distinct points on the network (otherwise the network is not necessary). Virtually all TCP/IP applications use the client/server paradigm for organizing their communications. Because there is no mechanism to automatically start a remote application with TCP/IP, if two programs are to communicate with one another one of them must be always running and listening for communications requests. This half of the application is called the server — it is any program that waits for a communications request from the other half of the application, the client. On receiving the request the server generates the reply, and returns it to the client. A server listens at a particular port number, so that there can be multiple servers running on a particular machine.

Servers are considerably more difficult to create than clients, which are usually just conventional application programs. Fortunately, in our case the servers are already there — these are the World Wide Web servers to which we send requests and from which we receive replies. We need only worry about writing the client software. Browsers such as Netscape or Internet Explorer are just clients, albeit complex ones, that also communicate with servers.

Client/Server Programming

Programming applications using TCP/IP requires some sort of application programming interface (API) to interact with the network protocol software. The most popular API is the Berkeley Sockets interface, which formed the basis for network I/O in the BSD version of UNIX. Berkeley Sockets is available as part of the operating system in most versions of UNIX, and is available as a library to programmers of other operating systems (e.g., Winsock for Microsoft Windows).

A socket can be thought of as a generalization of the UNIX file access mechanism. A program makes a system call to create a socket; the operating system returns a small integer called a handle to reference that socket, and then the program can use the traditional UNIX read and write file I/O system calls to send and receive data. The socket works just like a file, and the handle is allocated from the same set of integers used for file descriptor handles. The difference is that the socket doesn't represent a file, but a network connection. Now I examine the details of programming with sockets.

Creating and Connecting a Socket

Separate system calls are used to create a socket and then connect it to a particular destination address. The socket call creates the socket:
skt = socket(PF_INET, SOCK_STREAM, 0);
Here skt is the integer handle returned by the call; PF_INET specifies that the caller is using the Internet protocol family (sockets were designed to work with a variety of protocols); SOCK_STREAM is the type of service required; and the third argument, which selects a specific protocol within the family, can be left as 0 for TCP connections. A value of -1 is returned if the call fails. Note that you must include the header files <sys/types.h> and <sys/socket.h>.

After creating a socket, the application calls connect to establish an active connection to the remote server. To do so, the application must specify the server's IP address and protocol port number. Once the connection is established, the application can send and receive data to and from the server.
connect(skt, (struct sockaddr *) &socketName, sizeof(socketName));
Here skt is the integer socket identifier returned by the previous call to socket, and socketName is a struct of type sockaddr_in (defined in header file <netinet/in.h>), which is cast to the generic struct sockaddr pointer that connect expects. This structure holds the address family (AF_INET for the Internet), the protocol port number and the IP address. These must be filled in before calling connect. connect returns zero on success, -1 otherwise.

The default protocol port number for World Wide Web clients and servers is 80, although it is quite common for servers to use some other port. In any event we will usually know what the port number is: it will either be 80 or some number specified by the client user. To plug the port number into the sockaddr_in structure the application must call the function htons. For instance:
int httpPort = 80;
socketName.sin_port = htons(httpPort);
This call is necessary because TCP/IP represents integers in network byte order, that is, most significant byte first. htons converts a short int from the native byte order of the machine that the client is running on to network byte order. These may be the same, but by using htons we can ignore such architectural details.
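One way to see what htons does is to look at the bytes it leaves in memory. The sketch below is only an illustration; portBytes is a hypothetical helper name, and the code assumes a 16-bit unsigned short, which holds on the usual UNIX platforms.

```c
#include <string.h>
#include <arpa/inet.h>   /* htons */

/* Store the in-memory bytes of a port number, after conversion with htons,
 * in out[0] and out[1]. portBytes is a hypothetical helper. */
void portBytes(unsigned short port, unsigned char out[2])
{
    unsigned short net = htons(port);   /* now in network byte order */

    memcpy(out, &net, sizeof net);
}
```

For port 80 the two bytes are 0 and 80 regardless of the machine's native byte order, because the most significant byte always comes first after the conversion.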

The IP address is specified as a 32-bit integer as discussed above. However, users typically don't know this integer. Most likely they will know the domain name of the computer the server is running on, or else the dotted-decimal format of its IP address. We would like to let the user specify the address in either format and convert it to the actual IP address.

To convert a domain name to an IP address, the application can call gethostbyname:
host = gethostbyname(hostname);
hostname is a null-terminated ASCII character string holding the domain name of the host, and host is a pointer to a hostent struct (defined in <netdb.h>). gethostbyname returns a null pointer on failure. If the lookup succeeds, we can plug the IP address into the sockaddr_in structure with
memcpy(&socketName.sin_addr, host->h_addr, host->h_length);
If the server address is specified in dotted decimal notation, the application can use the function inet_addr to do the mapping:
socketName.sin_addr.s_addr = inet_addr(hostname);
inet_addr returns -1 on failure (most systems provide the constant INADDR_NONE for this value).
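The pieces described so far can be combined roughly as follows. This is a minimal sketch rather than the article's own listing: the names fillAddress and connectTo are hypothetical, the code tries dotted-decimal notation before a name lookup (an assumption about the preferred order), and it writes h_addr_list[0], which is what the traditional h_addr name refers to.

```c
#include <string.h>
#include <unistd.h>      /* close */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>   /* inet_addr */
#include <netdb.h>       /* gethostbyname */

/* Fill in *name for the given host and port. Returns 0 on success, -1 on
 * failure. fillAddress is a hypothetical helper. */
int fillAddress(const char *hostname, int port, struct sockaddr_in *name)
{
    struct hostent *host;

    memset(name, 0, sizeof *name);
    name->sin_family = AF_INET;
    name->sin_port = htons((unsigned short) port);

    /* Try dotted-decimal notation first; no lookup is needed in that case. */
    name->sin_addr.s_addr = inet_addr(hostname);
    if (name->sin_addr.s_addr != INADDR_NONE)
        return 0;

    /* Otherwise treat the string as a domain name. */
    host = gethostbyname(hostname);
    if (host == NULL || host->h_addrtype != AF_INET)
        return -1;
    memcpy(&name->sin_addr, host->h_addr_list[0], (size_t) host->h_length);
    return 0;
}

/* Create a socket and connect it. Returns the socket handle, or -1 on
 * failure. connectTo is a hypothetical helper. */
int connectTo(const char *hostname, int port)
{
    struct sockaddr_in name;
    int skt;

    if (fillAddress(hostname, port, &name) < 0)
        return -1;
    skt = socket(PF_INET, SOCK_STREAM, 0);
    if (skt < 0)
        return -1;
    if (connect(skt, (struct sockaddr *) &name, sizeof name) < 0) {
        close(skt);      /* do not leak the handle on failure */
        return -1;
    }
    return skt;
}
```

A caller would then write something like skt = connectTo("www.someplace.org", 80) and check for -1 before using the handle.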

To ensure that the loader can find all these functions on System V-style versions of UNIX (e.g., Solaris) you may have to link with the socket and nsl libraries; i.e.,
cc file1.c file2.c ... -lsocket -lnsl
For those of you who prefer using Perl, you can apply these concepts quite easily. The functions gethostbyname, socket, and connect discussed above all have exact analogs in Perl, as do the other socket functions not covered in this article.

Reading and Writing on a TCP Connection

Once a socket is open, it may be written to or read from using the standard UNIX read and write system calls. There is one wrinkle when it comes to reading from a socket, however. Since TCP decouples the reading and writing process, it will not necessarily deliver data in the same blocks in which the data was written. For instance, say a server sends out two 80-byte strings in two calls to write. The client may receive all 160 bytes in a single call to read or, say, 64 bytes in the first call, 90 bytes in the second call, and six bytes in a third call. It all depends on the size of the datagrams used by various pieces of the network, on buffer sizes, and network delays. This means that the client cannot depend on all the data being delivered in one transfer, but must repeatedly call read until all the data has been obtained. I demonstrate how to program this in my example utility presented below.
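When the client knows how many bytes to expect, the repeated calls to read can be wrapped in a small loop. The following is a sketch under that assumption; readFully is a hypothetical name, and the code makes no attempt to retry after an interrupted system call.

```c
#include <unistd.h>      /* read */
#include <sys/types.h>   /* ssize_t */

/* Read until `want` bytes have arrived, the other side closes the
 * connection, or an error occurs. Returns the number of bytes actually
 * stored in buf, or -1 on error. readFully is a hypothetical helper. */
ssize_t readFully(int fd, char *buf, size_t want)
{
    size_t have = 0;

    while (have < want) {
        ssize_t got = read(fd, buf + have, want - have);

        if (got < 0)
            return -1;           /* read failed */
        if (got == 0)
            break;               /* end of data */
        have += (size_t) got;    /* may be fewer bytes than requested */
    }
    return (ssize_t) have;
}
```

In the 160-byte scenario above, a call such as readFully(skt, buf, 160) returns 160 whether the data arrives in one piece or in three.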

The HTTP Protocol

World Wide Web clients and servers exchange information using the Hypertext Transfer Protocol (HTTP). HTTP is an application-level protocol, so it is in the very top layer of the TCP/IP layered protocol. It is a fast, efficient protocol designed specifically for delivering hypertext materials. HTTP 1.0 is the current "standard" and HTTP 1.1 is the proposed standard as of this writing. Note that the W3 Consortium recommends basing future development on the HTTP 1.1 standard since HTTP 1.0 was never a true Internet standard. The Web pages listed in the bibliography provide links to the details of HTTP 1.0 and 1.1. For purposes of illustration in this article I only consider a small part of the protocol that is common to both.

HTTP is stateless; i.e., every time a client contacts a server, the server treats it as a fresh contact and retains no memory of previous contacts from that client.

HTTP interactions consist of a client opening a connection to an HTTP server, sending a request to the server and receiving the response from the server. A request contains request headers that define among other things the method to be used for the transaction. The response contains response headers describing the status of the transaction (e.g., success or failure) followed by the actual data requested.

An HTTP Request

Requests are sent to the HTTP server in a request header that consists of several request header fields. Each field is a line of text that is terminated with a carriage return-linefeed (CRLF) pair. The first field (or line) is the request line, which contains the method to be used. The GET method is the one I consider here — this is the method used to retrieve a particular entity from the server. Additional header fields may be included in the request. These additional fields allow the client to pass along information about the request and about the client itself to the server. The request is terminated with an empty line consisting of just a CRLF.

For the purposes of this article, I show how to send a simple GET request to the server without any additional fields. The syntax of the request line is
GET <SP> Request-URI <SP> HTTP/1.0 <CRLF>
where <SP> is a space and <CRLF> is a carriage return-linefeed pair. The Request-URI is a Uniform Resource Identifier and identifies the particular resource on the server the client wants to get. Finally, HTTP/1.0 tells the server what version of HTTP the client is using.

For instance,
GET /pub/WWW/TheProject.html HTTP/1.0
requests file TheProject.html from directory /pub/WWW relative to the server's document root. Note that this path cannot be empty, and must consist of at least / which indicates the server's document root directory.
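A request of this form is easy to assemble in a buffer before writing it to the socket. The sketch below is one possible way to do so; buildGetRequest is a hypothetical name, and substituting / for an empty path is an assumption about how a client might handle that case.

```c
#include <stdio.h>
#include <stddef.h>

/* Format a bare GET request (request line plus the terminating empty
 * line) into buf. Returns the request length, or -1 if it does not fit.
 * buildGetRequest is a hypothetical helper. */
int buildGetRequest(char *buf, size_t size, const char *uri)
{
    int n;

    if (uri == NULL || uri[0] == '\0')
        uri = "/";               /* the path may not be empty */
    n = snprintf(buf, size, "GET %s HTTP/1.0\r\n\r\n", uri);
    if (n < 0 || (size_t) n >= size)
        return -1;               /* request was truncated */
    return n;
}
```

The first CRLF ends the request line and the second supplies the empty line that terminates the request.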

An HTTP Response

When the server receives a request, it applies the specified method to the indicated object and returns its results to the client. This consists of a response header made up of response fields followed by the data associated with the object. The response header consists of text lines, terminated by CRLFs and separated from the data by an empty line containing only a CRLF.

The first response header field is the status line, and it will look something like
HTTP/1.0 200 Document Follows
that is, the protocol version, a three-digit status code and an explanation. Status codes beginning with 2 indicate success; those beginning with 3 indicate that further action must be taken to complete the request; those beginning with 4 indicate a client error, such as incorrect syntax; and those beginning with 5 indicate a server error on an apparently valid request.
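A client can pull the status code out of this line with a single sscanf call. The following sketch shows one way; parseStatus and statusClass are hypothetical names, and the parsing is deliberately loose (it does not check the version numbers).

```c
#include <stdio.h>

/* Return the three-digit status code from a status line, or -1 if the
 * line does not look like "HTTP/x.y nnn ...". parseStatus is a
 * hypothetical helper. */
int parseStatus(const char *line)
{
    int major, minor, code;

    if (sscanf(line, "HTTP/%d.%d %d", &major, &minor, &code) != 3)
        return -1;
    if (code < 100 || code > 599)
        return -1;
    return code;
}

/* The leading digit gives the class: 2 success, 3 further action needed,
 * 4 client error, 5 server error. */
int statusClass(int code)
{
    return code / 100;
}
```

For the status line shown above, parseStatus returns 200 and statusClass(200) returns 2.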

Additional response headers contain information about the server and about the response or entity being sent.

Example

I can now put the information in the preceding sections together to create an example. What I present is a short, complete utility called webhead. Given a uniform resource locator (URL) for a Web page, webhead requests the page, prints out the header information that the server returns for that page (browsers don't display this) and then extracts and displays the TITLE element from the Web page. For instance, executing
webhead www.w3.org/pub/WWW/
yields the following output:
HTTP/1.0 200 Document follows
Server: CERN/3.0A
Date: Sat, 18 Jan 1997 02:39:07 GMT
Content-Type: text/html
Content-Length: 5939
Last-Modified: Wed, 15 Jan 1997 15:45:23 GMT

TITLE: W3C - The World Wide Web Consortium
Listing 1 shows the main program for webhead. The first thing it does is call function getURL, shown in Listing 2, to parse the command line. webhead takes one argument — the Uniform Resource Locator for the Web page the user wants to retrieve. This takes the form:
[http://]hostname[:port]/[URI]
where material in brackets is optional. hostname identifies the host computer and may be a domain name or the host address in dotted-decimal notation. URI is the Uniform Resource Identifier, or path to the web page or directory of interest. It may be omitted, in which case the server will usually return a default web page from its document root.
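Splitting an argument of this form into its parts takes only a few string operations. The sketch below is not the getURL function from Listing 2; splitURL is a hypothetical name, and the choices to default the port to 80, to accept a missing trailing slash, and to match only a lower-case http:// prefix are assumptions made for illustration.

```c
#include <stdlib.h>      /* atoi */
#include <string.h>

/* Split [http://]hostname[:port]/[URI] into host, port and uri.
 * Returns 0 on success, -1 if the argument is malformed or a buffer is
 * too small. splitURL is a hypothetical helper. */
int splitURL(const char *url, char *host, size_t hostSize,
             int *port, char *uri, size_t uriSize)
{
    const char *p = url;
    const char *end;
    size_t len;

    if (strncmp(p, "http://", 7) == 0)
        p += 7;                          /* skip the optional scheme */

    end = p + strcspn(p, ":/");          /* host ends at ':' or '/' */
    len = (size_t)(end - p);
    if (len == 0 || len >= hostSize)
        return -1;
    memcpy(host, p, len);
    host[len] = '\0';

    *port = 80;                          /* default HTTP port */
    if (*end == ':') {
        *port = atoi(end + 1);
        if (*port <= 0 || *port > 65535)
            return -1;
        end += strcspn(end, "/");        /* skip past the port digits */
    }

    if (*end == '\0')
        end = "/";                       /* no path given: document root */
    if (strlen(end) >= uriSize)
        return -1;
    strcpy(uri, end);
    return 0;
}
```

Given www.w3.org/pub/WWW/ this yields host www.w3.org, port 80 and URI /pub/WWW/.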

Next, the application calls function openSock (see Listing 3). Given a hostname and a protocol port number, this function creates a socket and connects it to the port on the host using the socket routines discussed in the first section of the article.

Having established a connection to the server, the application uses the functions sendStr and recvStr, shown in Listing 4, to send and receive text strings to and from the server.

sendStr is straightforward. It just uses the write system call to write a text string to the socket. recvStr is a little more complex since it must account for the fact that the amount of data available is unknown at the time of a given call to read. recvStr loops on read, inputting a single character at a time. It returns a completed string on encountering a linefeed. Note that read returns a zero on end-of-data.
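A line reader along these lines might look like the following. It is a sketch of the technique just described, not the recvStr of Listing 4: readLine is a hypothetical name, and stripping the trailing CRLF and silently dropping characters that do not fit in the buffer are choices made here for illustration.

```c
#include <unistd.h>      /* read */
#include <sys/types.h>   /* ssize_t */

/* Read one line from fd, a character at a time, stopping at a linefeed.
 * The trailing CR/LF is not stored. Returns the length of the line
 * (0 for an empty line), or -1 on error or when no data is left.
 * readLine is a hypothetical helper. */
int readLine(int fd, char *buf, size_t size)
{
    size_t n = 0;
    int sawData = 0;
    ssize_t got;
    char c;

    if (size == 0)
        return -1;
    while ((got = read(fd, &c, 1)) == 1) {
        sawData = 1;
        if (c == '\n')
            break;                  /* line complete */
        if (n + 1 < size)
            buf[n++] = c;           /* excess characters are dropped */
    }
    if (got < 0 || !sawData)
        return -1;                  /* error, or end of data */
    if (n > 0 && buf[n - 1] == '\r')
        n--;                        /* strip the CR of a CRLF pair */
    buf[n] = '\0';
    return (int) n;
}
```

Reading a byte at a time costs one system call per character, which is acceptable for short headers but would be worth buffering in a heavier-duty client.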

Now let's return to the main program in Listing 1. After establishing a socket connection to the HTTP server's host the application calls sendStr to send the request, which consists only of a GET request line and an empty line. The remainder of the program comprises a simple state machine that uses recvStr to read the response from the server. If the server returns a status of 200, the application reads and echoes the remainder of the header, which is terminated by an empty line. The application assumes everything else is the Hypertext Markup Language (HTML) document that defines the Web page. The application scans the page looking for the TITLE element, which is contained between <TITLE> and </TITLE> markup tags (note that markup tags may be upper or lower case). The title of the document is then printed out.

Also note that a <TITLE> tag may not have a matching </TITLE> tag, but may be terminated by a </HEAD> or <BODY> tag. A production version of the program would have to treat the appearance of body text as an implicit </TITLE><BODY> pair. It would also have to deal with cases where there are no linefeeds in the document text.
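Some of these cases can be handled by searching the document text without regard to case and accepting more than one terminator. The sketch below assumes the document (or at least its head) is already in memory as a single string, so linefeeds do not matter; findNoCase and extractTitle are hypothetical names, and the code does not attempt to detect bare body text as an implicit end of the title.

```c
#include <ctype.h>
#include <string.h>

/* Case-insensitive substring search. Returns a pointer to the first match
 * of word in text, or NULL. findNoCase is a hypothetical helper. */
static const char *findNoCase(const char *text, const char *word)
{
    size_t len = strlen(word);

    for (; *text != '\0'; text++) {
        size_t i = 0;

        while (i < len && text[i] != '\0' &&
               tolower((unsigned char) text[i]) ==
               tolower((unsigned char) word[i]))
            i++;
        if (i == len)
            return text;
    }
    return NULL;
}

/* Copy the text following <TITLE> into title, stopping at the earliest of
 * </TITLE>, </HEAD> or <BODY (or at the end of the text). Returns 0 on
 * success, -1 if there is no <TITLE> tag. */
int extractTitle(const char *html, char *title, size_t size)
{
    static const char *enders[] = { "</title>", "</head>", "<body" };
    const char *start = findNoCase(html, "<title>");
    const char *end = NULL;
    size_t i, len;

    if (start == NULL || size == 0)
        return -1;
    start += strlen("<title>");
    for (i = 0; i < sizeof enders / sizeof enders[0]; i++) {
        const char *hit = findNoCase(start, enders[i]);

        if (hit != NULL && (end == NULL || hit < end))
            end = hit;              /* keep the earliest terminator */
    }
    if (end == NULL)
        end = start + strlen(start);
    len = (size_t)(end - start);
    if (len >= size)
        len = size - 1;             /* truncate to fit the buffer */
    memcpy(title, start, len);
    title[len] = '\0';
    return 0;
}
```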

If a non-success status code is returned, the server often follows this with a short web page explaining the problem. In this case the application simply prints out everything the server sends back.

To build webhead, just compile and link the four component files. I have built and run this without modifications on several versions of UNIX: Ultrix, Digital Unix, OSF/1, SunOS, Solaris and IRIX.

Summary

Developing programs to work with World Wide Web servers is an exciting field with lots of potential for new applications. The key components — client/server programming and the HTTP protocol — are not difficult to master at the level needed for this purpose. Using the material presented in this article and the example program as a template will put you in a good position to start developing your own applications.

Bibliography

D.E. Comer. Internetworking with TCP/IP, Vol. 1, 2nd ed. (Prentice Hall, Englewood Cliffs, NJ, 1991).

D.E. Comer and D.L. Stevens. Internetworking with TCP/IP, Vol. 3 (Prentice Hall, Englewood Cliffs, NJ, 1993).

W.R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols (Addison-Wesley, Reading, MA, 1994).

HTTP 1.0 Reference URL: http://www.w3.org/pub/WWW/Protocols/rfc1945/rfc1945.txt

HTTP 1.1 Reference URL: http://www.w3.org/pub/WWW/Protocols/rfc2068/rfc2068.txt

John Ross is a Senior Systems Architect at CoreLAN Communications, a company that specializes in developing custom Internet and Web-based applications. He may be reached at jross@corelan.com.