Dr. Dobb's Journal May 2000
In 1997, Bell Atlantic (BA) delivered a request for information to Open Text Corporation (for whom I worked at the time) for developing a Tariff Management System -- a web-based, collaborative system for managing its regulated documents. BA's existing system was reaching the end of its life, and, having lived through the traumas associated with migrating proprietary data formats from one system to the next every few years, the company wanted its critical data to be based on standards and be vendor neutral. Hence, they settled on SGML.
The fundamental product of the system was to be tens of thousands of pages of documents, regulated by the FCC and seven state authorities. But there was a catch. The system had to be capable of delivering the product in multiple formats -- paper, PDF, and potentially HTML -- and had to obey strict, legislated rules regarding page layout. The structured data of the SGML source files had to be maintained by page-oriented processes in a methodology known as "loose-leaf publishing." In a loose-leaf system, each page is treated as a separate object with its own revision history. When you deliver changes to a regulator, you deliver only a package of changed pages, not the entire document. So the challenge was to make this model work with SGML and to provide automatically generated electronic pages from document chapters.
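To make the loose-leaf idea concrete, here is a minimal sketch of a page-level revision model and of selecting the package of changed pages. It is illustrative only: the type and function names (Leaf, changedPages) and the page-numbering scheme are assumptions, not anything taken from TopLeaf or Livelink.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of one loose-leaf page. Each page carries its own
// revision level, independent of every other page in the document.
struct Leaf {
    std::string pageNumber;  // e.g. "12", or "12.1" for an inserted page
    int revision;            // 0 = original issue, 1 = first revision, ...
};

// Return only the pages the regulator does not yet hold in their current
// form: pages that are new, or whose revision is higher than the one filed.
std::vector<Leaf> changedPages(const std::vector<Leaf>& filed,
                               const std::vector<Leaf>& current) {
    std::map<std::string, int> filedRevision;
    for (const Leaf& leaf : filed)
        filedRevision[leaf.pageNumber] = leaf.revision;

    std::vector<Leaf> package;
    for (const Leaf& leaf : current) {
        auto it = filedRevision.find(leaf.pageNumber);
        if (it == filedRevision.end() || leaf.revision > it->second)
            package.push_back(leaf);
    }
    return package;
}
```

Given a filed set of pages 1, 2, and 3, and a current set in which page 2 has been revised and a page 3.1 inserted, the function returns just those two pages, which is the package that would go to the regulator.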
The solution we proposed was an integration of Open Text Livelink (http://www.opentext.com/), Turn-Key Systems TopLeaf (http://www.turnkey.com.au), Arbortext Adept Editor (http://www.arbortext.com/), and Adobe Acrobat Distiller (http://www.adobe.com/). The development took place in four countries, spanning three continents. This is vastly different from most systems employed by other regulated bodies today. For most, electronic loose-leaf systems require explicit management of each page as a separate object. They are most often based on proprietary word-processor technology and require significant management overhead to ensure that only the correct pages get delivered when they are supposed to. BA's new Tariff Management System (TMS) needed to eliminate all of those layers and provide an intuitive method for allowing tariff authors to work without having to pay any particular attention to which page they were on.
Tariffs are regulated documents that define products, services, areas of operation, and rates. They are required, by legislation, to be delivered on paper. They are also required to follow literally thousands of minute regulations regarding where and how information is to appear on a page. This is because they are essentially legal contracts between the information provider and the regulator. If you are going to issue a change to a contract, you want to be sure that you do so with extreme care. Otherwise, the entire contract could be invalidated. Consequently, each page gets its own revision level and date information (when issued, and when the change becomes effective), and tracks the page history; see Figure 1. Each regulator has its own variants in terms of how this information is to appear, so any system for managing this aspect must ideally provide a simple end-user interface.
When I parachuted into the project in early 1998, my first task was to figure out how to make a system capable of maintaining pages of documents that were authored and maintained using an SGML editor. Consultation with Arbortext revealed that Adept Publisher could not do it, not even with extensive customization, unless the SGML files were divided into individual pages. This seemed sensible, but there were many context issues involved due to page boundaries and transitions. For example, what if you're in a third-level list item at the page boundary? If that's the end of the file, what kind of SGML structuring could you use to define the context in the next file? Maybe it could be done, but there had to be a better way.
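The context problem can be seen with a small sketch that reports which elements are still open at an arbitrary break point. This is a deliberately simplified scanner written for illustration: it handles only plain start-tags and end-tags (no comments, marked sections, or SGML tag minimization), and the element names used with it are hypothetical rather than taken from BA's DTD.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Scan the markup up to breakOffset and return the stack of elements that
// are still open at that point, outermost first.
std::vector<std::string> openElementsAt(const std::string& markup,
                                        std::size_t breakOffset) {
    std::vector<std::string> stack;
    const std::size_t limit = std::min(breakOffset, markup.size());
    std::size_t pos = 0;
    while (pos < limit) {
        std::size_t lt = markup.find('<', pos);
        if (lt == std::string::npos || lt >= limit) break;
        std::size_t gt = markup.find('>', lt);
        if (gt == std::string::npos) break;

        bool closing = (lt + 1 < gt && markup[lt + 1] == '/');
        std::size_t nameStart = lt + (closing ? 2 : 1);
        // The name ends at whitespace or at the closing '>' of the tag.
        std::size_t nameEnd = markup.find_first_of(" \t\n>", nameStart);
        std::string name = markup.substr(nameStart, nameEnd - nameStart);

        if (closing) {
            if (!stack.empty() && stack.back() == name) stack.pop_back();
        } else {
            stack.push_back(name);
        }
        pos = gt + 1;
    }
    return stack;
}
```

For a break that falls inside a third-level list item, the returned stack is seven elements deep (a section, then three nested list and item pairs). Every one of those would have to be reopened somehow at the top of the next per-page file, which is the structuring problem described above.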
I looked at several tools, ranging from FrameMaker+SGML to Xyvision XP Publisher. BA finally accepted TopLeaf from Turn-Key Systems. The first hurdle to overcome, other than time-zones and the obvious coordination issues, was technical -- getting TopLeaf to talk to Livelink.
One of the attractive aspects of Livelink is that, in Version 8, it has no GUI other than a web browser (either Netscape Navigator or Microsoft Internet Explorer). Because Livelink is inherently web based, access to the system could be provided (even to BA's clients) through a simple URL. So in theory, by building a custom solution, we could provide a completely web-based interface to TopLeaf, which on Windows NT is a traditional Windows application.
But TopLeaf had no API. It was designed for a desktop environment, not a multitier client/server one. We needed to quickly demonstrate that it was technically feasible to do this. Consequently, I struck an agreement that Turn-Key Systems would fund the development of the TopLeaf API, while Open Text would fund the development of the calling interface.
Livelink Version 8 introduced several architectural changes that let us more easily develop a plug-in to the Livelink server -- namely, what Open Text refers to as "drop-ins." A drop-in is simply C++ code that extends the operation of the Livelink server by providing a way to instantiate C++ objects through Livelink's own API -- Livelink Builder (see the accompanying text box entitled "Livelink's API"). The Livelink Builder uses an object-oriented scripting language called "OScript." Objects, features, and methods are packaged in Livelink as unique OSpaces. The TopLeaf API is a set of functions exposed in a DLL and a command-line interface. For this application, it was simpler to launch a process to issue a CLI instruction than to integrate the DLL function calls. So the drop-in we developed became a generically useful tool to execute a system call, something that does not ship with the standard Livelink. It is functionally identical to C's system function -- feed it a command, and it executes it.
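As a rough picture of what "functionally identical to C's system function" means, the following sketch wraps std::system in a single entry point. It is not the drop-in itself and does not use any Livelink or TopLeaf interface; the function name runCommand and the choice to reject an empty command are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for the drop-in's single entry point: hand it a
// command line, and it runs that command through the platform's command
// processor, returning whatever status std::system reports.
int runCommand(const std::string& commandLine) {
    if (commandLine.empty())
        return -1;  // nothing to run (a convention chosen for this sketch)
    return std::system(commandLine.c_str());
}
```

The meaning of the returned value is platform defined, so a caller that needs more than "zero or nonzero" would have to decode it for the operating system in use.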
The difficulty we encountered was that the drop-in would not function in Livelink's multithreaded environment when run from the Livelink service. We spent weeks tracking down the problem. It worked fine from Livelink Builder's interface, which was single threaded. As soon as a thread was created to run the external process in the service, however, it would die. Finally, I stumbled on the bug -- the drop-in had been compiled as a multithreaded DLL, but for synchronous operation when executed from a Livelink thread, it had to be compiled to be single threaded. So even though Livelink is multithreaded, we could not, without substantial rework to Livelink, run this bit of customization in a true multithreaded environment.
The TopLeaf API exposed only a few functions for us at first -- things such as deposit an SGML file into TopLeaf, typeset it, get it back, print it. As the complexity of the typesetting requirements became more fully understood, additional functions were added. My approach was to tackle the integration in bits and pieces: First, demonstrate that Livelink could call a TopLeaf function from Livelink's web-browser GUI, and get a resultant status back. Next, rethink the entire problem of page-revision control. Finally, implement the functions necessary to achieve that control.
In this iterative process, I decided that all issues related to page generation were questions of formatting alone, and intrinsically had nothing to do with the SGML data itself. The TopLeaf and Livelink databases could easily maintain the revision information required. TopLeaf is quite capable of laying out different bits of information onto explicit locations on the page (see Figure 1, for example).
By relying on TopLeaf to implement all the page-related issues, we could separate the page-maintenance aspects from the SGML source data. Consequently, there is nothing embedded in the SGML data that is required for the maintenance of the page's revision history. (There is one bit, however, that does get inserted. As TopLeaf processes the initial SGML data, it inserts processing instructions to indicate where the page boundaries occur.) Turn-Key Systems extended TopLeaf to allow storage of page-revision information within special markers associated with the paged SGML data, but not embedded in it. These are referred to as "leaf indicators." And so TopLeaf stores and outputs all the required formatting instructions related to pagination, at the instruction of Livelink.
Where possible, I attempted to ensure that GUI functions in TopLeaf had an analog from the web interface. In several cases, GUI functions were collapsed into a single operation. For example, TopLeaf permits typesetting, previewing, and printing all to be carried out as separate tasks. Typesetting is by and large an automated task from the Livelink interface, which is run whenever a new version of the SGML file is checked in. To generate a printed copy of a page, Livelink users run a Render function, which runs (via the TopLeaf API) typesetting, followed by a print operation to generate PostScript. On Livelink's container objects (where the SGML files reside), users run analogous functions, which iterate over each child in the container.
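The collapsing of typeset and print into one Render operation, and its application to containers, can be sketched as follows. This is only an outline of the control flow: the Node type is hypothetical, the two steps are recorded as strings rather than issued as real TopLeaf API calls, and descending into nested containers is an assumption of the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical repository node: either an SGML document or a container
// holding further nodes.
struct Node {
    std::string name;
    bool isContainer;
    std::vector<Node> children;
};

// Render a document as "typeset, then print". For a container, apply the
// same operation to each child in turn. The calls are appended to `calls`
// in the order they would be issued.
void render(const Node& node, std::vector<std::string>& calls) {
    if (node.isContainer) {
        for (const Node& child : node.children)
            render(child, calls);
        return;
    }
    calls.push_back("typeset " + node.name);
    calls.push_back("print " + node.name);
}
```

Rendering a container with two documents yields four recorded steps, a typeset and a print for each document, in container order.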
Because none of us at Open Text had experience with TopLeaf, and because the API was being developed at the same time as the project, we had a substantial learning curve. As we implemented functions in Livelink, I would first figure out what TopLeaf needed, then script the API calls by writing them down in sequence to achieve a required result. Once that was done, the function calls could be implemented in Livelink.
By the spring of 1999, we had implemented most of the required high-level functionality. The system was capable of generating tables of contents, lists of effective pages, and ancillary material such as title pages and indices.
Indices presented an interesting problem. The notion of an index in this environment is a document consisting of section headings and the associated page numbers. Section headings can repeat many times throughout a tariff, and so an index is a listing of all common section headings. TopLeaf was not able to provide this type of information, so we wrote a custom program to convert TopLeaf's table of contents into an SGML file that complied with BA's DTD. The result was a configurable filter that could swap SGML tags and build a doubly linked list of section names and their page numbers. The resultant file is then sorted alphabetically and passed back to TopLeaf for typesetting.
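The core of such an index, grouping repeated headings and ordering them alphabetically, can be sketched in a few lines. Note the substitutions: this example uses std::map where the filter described above built a doubly linked list and sorted it afterward, it skips the SGML tag swapping entirely, and the TocEntry type is invented for the illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One line of a table of contents: a section heading and the page it is on.
struct TocEntry {
    std::string heading;
    std::string page;  // kept as text, since page labels need not be numeric
};

// Group the pages by heading. std::map keeps its keys in sorted order, so
// iterating over the result walks the headings alphabetically (by byte
// value, which means uppercase sorts before lowercase).
std::map<std::string, std::vector<std::string>>
buildIndex(const std::vector<TocEntry>& toc) {
    std::map<std::string, std::vector<std::string>> index;
    for (const TocEntry& entry : toc)
        index[entry.heading].push_back(entry.page);
    return index;
}
```

A table of contents in which "Rates" appears on pages 3 and 9 and "Definitions" on page 1 produces two index entries, with "Definitions" first and both pages listed under "Rates".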
The term "rendition" was used within this application to make users aware that the output of these programs was not a print image of their SGML file. Rather, output consists of renditions of those changes made to the SGML file since it was last published. While the system does generate and store PostScript files, it can also take that PostScript and create multiple viewable formats of it. The prime format, of course, is PDF.
Initially, we used GhostScript to generate the PDF. I wrote a simple executive program to control the data flow from TopLeaf back into Livelink. In the first version of the program, which was written as a Windows NT service in C++, the executive established change handles on input directories defined through a startup configuration file. When a file landed in a monitored directory, a rendition object was instantiated and methods were invoked to call GhostScript to write the PDF. The program continued developing to provide for automatic uploading of the PostScript and the PDF rendition back into Livelink using Livelink's C API. The uploaded files got deposited into a predefined location in the Livelink document manager (declared in the program's configuration file). This program, which I named "PS2HTML," was also intended to extract the text from the PostScript, place an HTML wrapper around it, and deliver a faithful content representation of what was in the PostScript without regard to format. (For more information, see the accompanying text box entitled "PS2HTML.") Additional formats were also generated. Using ImageMagick, GIF files of each page were output, one per page, and an HTML wrapper was generated to allow forward/backward viewing control of each page. When uploaded into Livelink, these files give a quick screen-resolution-quality view of the pages without requiring users to have Acrobat Reader installed.
As I enhanced PS2HTML, we ran into a snag: The font BA required was Palatino. Because GhostScript cannot embed Type 1 fonts in its PDF (with the exception of Times and Helvetica), the PDF produced by GhostScript only contained Type 3 fonts, which are nonsearchable image translations of the real font. This meant that Adobe's Acrobat Distiller was the only tool available that could embed the required font. Acrobat Distiller is a single-threaded windowed application and is not designed for operation in a server environment. But since we sometimes have to work with what we've got, I modified PS2HTML to make the renditioning engine configurable -- you could tell it to use either GhostScript or Distiller (or another one of your choosing) by setting some system-level environment variables and a couple of flags in the startup configuration file.
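One way to make an engine selectable from the environment is sketched below. The variable name PS2HTML_ENGINE, the accepted value "distiller", and the fallback to Ghostscript are all assumptions made for this example; the actual variables and configuration flags used by PS2HTML are not given here.

```cpp
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

enum class Engine { Ghostscript, Distiller };

// Map a setting string to an engine, ignoring case. Unknown or empty
// values fall back to Ghostscript (an arbitrary default for this sketch).
Engine engineFromSetting(std::string value) {
    for (char& c : value)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return value == "distiller" ? Engine::Distiller : Engine::Ghostscript;
}

// Read the hypothetical PS2HTML_ENGINE environment variable.
Engine engineFromEnvironment() {
    const char* raw = std::getenv("PS2HTML_ENGINE");
    return engineFromSetting(raw ? raw : "");
}
```

Keeping the string-to-engine mapping separate from the environment lookup means the same function can also serve a flag read from a configuration file.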
The complexity of PS2HTML grew, because now I was converting it from a console application into a windowed application in order to pass messages from one invisible PS2HTML window to Acrobat Distiller and have a completion message returned (see Listing One). Without that messaging interface, I could not guarantee that multiple PostScript files would be properly handled by Acrobat Distiller, and indeed, there were increasing numbers of cases where multiple web requests to render changes resulted in bad timing issues at the server. This could only be addressed by synchronizing Distiller with the calling application.
In its current version, PS2HTML implements a BSD-like control file interface, so that rather than just watching Windows NT directories for a PostScript file, it now looks for a ".cf" file that defines which file on the file system is the PostScript file for the job. This has the net effect of making the synchronization of the entire process much more robust: Processing PostScript does not begin until the configuration file is written, and since the request originates from within Livelink, Livelink can inform PS2HTML as to which object the initiating request was for, and which user requested it. This allows output to be uploaded back to the same container parent where the rendered SGML updates reside.
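The control-file idea can be illustrated with a small parser. The layout of the real ".cf" files is not described here, so this sketch assumes a simple key=value format with the hypothetical keys file, object, and user; the RenderJob type is likewise invented for the example.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// What a render request needs to carry: the PostScript file to process,
// the object that initiated the request, and the requesting user.
struct RenderJob {
    std::string postscriptPath;
    std::string objectId;
    std::string user;
};

// Parse an assumed key=value control file. Lines without '=' and unknown
// keys are ignored. Returns true only if a PostScript file was named,
// since without one there is nothing to process.
bool parseControlFile(std::istream& in, RenderJob& job) {
    std::string line;
    while (std::getline(in, line)) {
        std::size_t eq = line.find('=');
        if (eq == std::string::npos) continue;
        std::string key = line.substr(0, eq);
        std::string value = line.substr(eq + 1);
        if (key == "file") job.postscriptPath = value;
        else if (key == "object") job.objectId = value;
        else if (key == "user") job.user = value;
    }
    return !job.postscriptPath.empty();
}
```

The robustness comes from ordering rather than from the parser: if the producer writes the PostScript file first and the control file last, a watcher that reacts only to control files never picks up a half-written PostScript file.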
The Bell Atlantic Tariff Management System is a ground-breaking approach to loose-leaf publishing in general. And while the BA application specifically addresses the requirements associated with highly controlled document pages that must follow sometimes arcane and archaic process models, the underlying technology holds a great deal of promise for distributed environments where precision typesetting of high-volume data is required. This includes most regulated industries and legal documentation, such as legislation, but could also play into financial applications and the insurance industry.
Most regulators still require paper as the delivery mechanism, with various electronic formats as unofficial versions of the paper. There is still reluctance to accept electronic format due to the ability to manipulate the data in the electronic version outside of change-control processes. Attitudes are slowly changing on this front, thanks in part to digital signature technology. One of BA's regulators, the New York Public Services Commission, in April of 1999, announced it would begin accepting PDFs as official versions. This is a substantial breakthrough. The PSC receives filings from more than 80 different sources, some small, and some large -- such as Bell Atlantic. For some sources, where their submissions are small document sets, traditional technologies and processes still work well, so electronic submission is a huge leap. For others, the management costs associated with page-oriented systems can be staggering. Regulators are trying to find a happy medium, but forcing migration to a specific set of standards, such as XML and SGML, is not what they want to do. So in the meantime, there is a tendency to tread very cautiously.
The TMS architecture (see Figure 4) is sound, but there is room for improvement. Eventually, I would like to see tighter integration between the TopLeaf and Livelink repositories, ideally replacing Livelink's document store with TopLeaf's alone. And migrating into a true multithreaded environment will, I think, be ultimately necessary for large-scale applications. As for now, the major challenges have been addressed, and the TMS demonstrates how all of the products are highly capable.
The core team for this project consisted of Marc Stewart (Open Text Professional Services), John Cockram and Jeff Maynard (Turn-Key Systems), Harry Read (ABMCL), Denise English (Bell Atlantic), and myself. The opinions expressed are those of the author, and not necessarily those of Open Text Corp., its affiliates, Turn-Key Systems, ABMCL, or Bell Atlantic.
DDJ
Listing One
if (USE_DISTILLER) {
    //distiller structures setup
    DISTILLRECORD dr;
    COPYDATASTRUCT cds;
    BOOL ok = FALSE;
    UINT res = 0;
    cds.cbData = 0;
    cds.dwData = 0;
    cds.lpData = NULL;
    strcpy(dr.fileList,"");
    strcpy(dr.outputFile,"");
    dr.param = 0;
    //instantiate a window for the PS2HTML application
    CPS2HTMLWnd *app_window = new CPS2HTMLWnd;
    PS2HTMLApp.m_pMainWnd = app_window;
    //ConvertString is the command-line call to run Distiller.
    //WinExec needs a show state and returns a value greater than 31 on
    // success; it does not set errno. SW_SHOWNORMAL is an assumed show
    // state; the window is hidden below in any case.
    res = WinExec(ConvertString, SW_SHOWNORMAL);
    if (res <= 31) {
        fprintf(stderr, "Could not start distiller. Error: %u\n", res);
        //release the window before bailing out
        app_window->DestroyWindow();
        delete app_window;
        return FALSE;
    }
    //locate the Distiller window
    CWnd *hDistillerCWnd = CWnd::FindWindow("Distiller", NULL);
    if (hDistillerCWnd != NULL)
    {
        //Hide Distiller's window. This is primarily done to prevent someone who
        // happens to be logged in at the server from pressing the "Cancel"
        // button while Distiller is running.
        hDistillerCWnd->ShowWindow(SW_HIDE);
        strcpy (dr.outputFile, outfile);
        strcpy (dr.fileList, sourcefile);
        dr.param = EQ_NO_SAVE_DIALOG;
        cds.dwData = DM_DISTILL;
        cds.cbData = sizeof(DISTILLRECORD);
        cds.lpData = (PVOID)&dr;
        //tell Distiller to start distilling
        ok = (BOOL)hDistillerCWnd->SendMessage(WM_COPYDATA,
                 (WPARAM)app_window->m_hWnd, (LPARAM) &cds);
        if (ok)
            ok = (BOOL)hDistillerCWnd->SendMessage(WM_TIMER, ID_TIMER, 0L);
        COPYDATASTRUCT *data ;
        data = app_window->GetMsgData();
        //parse the data to determine if the message is a completion
        // message (DM_DONE) or some other, and act accordingly...
        /* ..... */
    }
    else
    {
        fprintf(stderr, "Could not find the Distiller window.\n");
    }
    //clean up
    app_window->DestroyWindow();
    delete app_window;
}
} //end of the enclosing function (its opening is not part of this excerpt)
Listing Two
<elt name="title" context="sec" >
<!-- define what to do when the start-tag is encountered-->
<stag>
  <!-- define what to do when entering the tag -->
  <onentry>
    <!-- assign some variables -->
    <var,_LssNum,0>
    <var,_LsssNum,0>
    <var,_LaNum,0>
    <var,_LANumX,0>
    <var,_LanNum,0>
    <var,_LansNum,0>
    <var,_LansNumX,0>
    <var,_LExhibitNum,0>
    <!-- execute typesetting functions to clear and then set an indent,
         then set the font to a defined font (Fc14) in Bold. -->
    <eval,<imz><ima,<_HTab1>><Fc14><B>>
    <!-- Read the data stream up to and including the end tag for this
         element, and assign into the variable _Lsec -->
    <get,_Lsec,</title>,
      <!-- assign TOC-related variables -->
      <eval,<TT,1,<_LsNum>>>
      <eval,<TOC,1,<TTR>,<_Lsec>>>
      <eval,<TOCR>>
      <!-- Output the section number, and also define that as a running-head -->
      <eval,<_LsNum>.<nr,1><_LsNum>.<im><_Lsec> <n><im><_Lsec><epx,0,1,1>^p>
      <!-- Output what was read in from the data stream, followed by the end-tag -->
      <*/title>
    >
  </onentry>
</stag> ^(end of the start-tag declaration ^)
<etag> ^(start of the end-tag declaration ^)
  <onentry>
    <!-- Look at the next tag in the data stream -->
    <peek,_peeked,<nop>>
    <eval,<match,<_peeked>,/sec, <fl,1p><h,1><epx,0>^p > >
  </onentry>
</etag>
</elt> ^(end of the element's typesetting mapping ^)