If you're not careful, a complex project can set you spinning in circles. This tool can help set your project straight.
A development team has been given the task of creating a transaction-processing system that will receive text files containing transactions of various types, perform data validation, and load the data into the database. After a brief design meeting, the team decides that the following components must be created:
- A parsing class to read the data from the files.
- An exception class for error handling.
- A class to handle the database interaction.
- A base class to serve as the transaction interface.
- A set of derived classes that each represent a specific type of transaction.
The team divvies up the assignments and goes their separate ways. Bob will write the base transaction class, Jim the derived transaction classes, Sarah the exception class, John the parser class, and Jane the database class.
After all of the components are created, compiled, and linked, the testing begins. On day one, Bob announces that a function in the base transaction class was supposed to return int instead of bool and schedules a compile for about noon. Jim says that he will recompile and retest the derived classes when Bob is done. Sarah, John, and Jane are happy that they are unaffected.
On day two, Jim announces that there is a bug in one of the derived transaction classes and that he will recompile and retest later in the morning. (No one else cares, leaving Jim to wonder why he said anything at all.)
On day three, Sarah wants to modify a public function in the exception class and will recompile at about 10:00 a.m. This is a big deal. Bob, Jim, and Jane all must recompile and retest their code.
John notices that Jim's tran_3 class has a date utility function that sounds like something he has been trying to work the bugs out of and decides, in the spirit of code reuse, to include it in his parser.
On day four, Bob announces that he is going to change the int return type back to bool. This means that Jim is going to have to recompile and retest the derived classes. Then John says that he will recompile and retest the parser when Jim is done. Now everyone cares about Jim's recompile. Sarah says that she must recompile and retest, then Jane, then Bob, and then Jim. The team now realizes that if any of them makes a change to their code, all of the components in the system must be retested.
Finding Cyclic Dependencies
This is an extremely simplistic example, but it shows how easily a cyclic physical dependency can creep into a project. A physical dependency is created by using an #include directive to bring the definition of another class into a translation unit. In the real world, a cyclic dependency may be much more difficult to find. Imagine a project with scores of files, each with a dozen #include directives. It is easy to see how a cycle can remain undetected, manifesting itself only as very long compile/link cycles and bugs in code thought to be sufficiently tested. Poor communication within a team may also compound the problem, as not all of the players may be aware that the actions taken with their own code cascade throughout the entire team.
A simple UML diagram can make the cyclic dependencies a bit easier to spot. Figure 1 is a class diagram of the original project. The dashed lines are drawn from each class to the classes on which it depends. (Note that the multiple paths from trans_b to parse are acceptable provided that the headers contain include guards to prevent multiple inclusion.) Figure 2 shows the resulting cycles created by the inclusion of tran_3.h into parse.h. Although only the cycle itself is in bold, all of the components in the project depend upon a component in the cycle, so the entire project is affected by a change to any of the components in the cycle. Figure 3 shows one possible solution to break the cycle. Placing the date functionality in its own class is an obvious solution, since Jim never should have included it within the derived transaction in the first place.
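The include guards just mentioned are what make a header that is reachable by multiple paths harmless to include more than once. A minimal sketch of what the base transaction header might look like; the guard macro, the members, and the derived class below are my invention, not the article's actual headers (those are in the downloadable project files):

```cpp
// trans_b.h -- sketch of Bob's base transaction class. The include
// guard ensures the definition is compiled only once per translation
// unit, even when the header is reached along several include paths.
#ifndef TRANS_B_H
#define TRANS_B_H

class trans_b {
public:
    virtual ~trans_b() {}
    virtual bool process() = 0;   // the return type Bob kept changing
};

#endif // TRANS_B_H

// One of Jim's derived transaction classes (illustrative only):
class tran_3 : public trans_b {
public:
    bool process() { return true; }
};
```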
This solution brings about a few interesting changes. The parse and tran_3 components are now free to use the common code in the date component, but tran_3 is also free to be modified without affecting any other components. This arrangement works out well for tran_3 because it is probably going to be subjected to the most pressure for functional changes as the project matures and the business rules governing this transaction migrate. The parse class has been altered as well but in the opposite manner. It now has a dependency that it did not have in the original design. It is obvious by looking at Figure 3 that the date component must be very stable because a change will cascade through the entire project.
Robert Martin formalized this concept [1] by defining the Stable Dependencies Principle: a package should depend only upon packages that are more stable than it is. The design in Figure 3 will work fine provided that the most depended-upon components are also the most stable. The design in Figure 2 not only creates the situation where a change to any component will cascade throughout the entire system, but it also guarantees that this will happen by making all of the components dependent upon the least stable component. It is this property that makes all cyclic dependencies inherently destabilizing.
The occurrence of cyclic dependencies and possible solutions are covered from a similar object-oriented design perspective in [2] and [3]. John Lakos also covers this topic focusing on the proliferation of link cycles in Chapter 4 of Large-Scale C++ Software Design [4]. In this book, the focus is on the nature of the dependency graph, and how that affects the number of components that must be linked in order to test each.
For example, in order for a component to be tested, a test driver must link to that component and all of the components upon which it depends. If the original seven components were not dependent upon each other at all, it would take seven test drivers a total of seven links to test the entire system. (This system would not provide very much functionality, but it would test very quickly.) The system depicted in Figure 1 requires seven test drivers and twenty-one links (an average of 3.0 components linked for each one tested). Once the cycle has been established (Figure 2), the same seven components require thirty-one links to test (4.4 components linked for each one tested). The solution shown in Figure 3 adds an additional component, and therefore an additional test driver, but the eight components require a total of only twenty-nine links (3.6 components linked for each one tested). In general, an unconnected system with N components requires N links, while the same system completely connected requires N^2 links. By comparison, a graph resembling a binary tree, which is more likely to represent a real-world dependency graph than the other two extremes, requires N log N links. Testing a single component in a system with hundreds of files may take several minutes or hours to link, depending on the nature of the dependencies.
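These link totals can be reproduced mechanically: each component's test driver links the component itself plus everything it transitively depends on. A sketch of that count, using a map of node names to edge lists (the function names and the toy graph in the comment are mine, not Figure 1's actual edges):

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::string> > Graph;

// Collect a node and everything reachable from it.
void reach(const Graph& g, const std::string& node,
           std::set<std::string>& seen) {
    if (!seen.insert(node).second) return;          // already counted
    Graph::const_iterator it = g.find(node);
    if (it == g.end()) return;                      // node has no entry: a leaf
    for (std::size_t i = 0; i < it->second.size(); ++i)
        reach(g, it->second[i], seen);
}

// Total components linked in order to test every component once.
std::size_t total_links(const Graph& g) {
    std::size_t total = 0;
    for (Graph::const_iterator it = g.begin(); it != g.end(); ++it) {
        std::set<std::string> seen;
        reach(g, it->first, seen);
        total += seen.size();                       // this driver links this many
    }
    return total;
}
```

For a three-component chain a -> b -> c this yields 3 + 2 + 1 = 6 links, whereas an unconnected trio would need only 3.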
I suggest that you read the sources listed below for a much deeper understanding of the subject. The remainder of this article will focus on identifying cyclic dependencies programmatically.
The Program
The dep_graph class searches out dependencies and builds a dependency graph (see the sidebar "Graphs and Graph Traversal") similar to the graphs shown in Figures 1-3. dep_graph has a very simple interface for constructing a directed graph, processing that graph, outputting the dependencies of each file, and outputting all possible paths for each file.
The dep_graph class requires three input files:
- files.txt: A file that contains the names of the files to be processed.
- search_paths.txt: A file containing the search paths used to find these files.
- defines.txt: A list of the preprocessor definitions that are to be provided to the compiler.
During the construction of a graph, the input files are read, the search paths are used to find the files to process, the preprocessor definitions are applied to the files, and a directed graph based on the #include statements that exist in each of the files is created. After the object is created, the graph can be processed and the output produced.
Each of the names listed in files.txt (as well as each of the names included in those files) is treated as a node in a potential graph. (I say potential graph because the program will map all of the paths regardless of whether the nodes form one graph or several. If a change to the dependency graph severs a path, the output will reflect the paths that exist in both resulting graphs.)
The physical dependencies between files can be determined using the #include directives alone [4]. This is because a file cannot use a component that is defined in a different file without either including the definition of the class that represents that component (to use an instance of it) or indirectly including a file that includes that definition (to use a reference to it).
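Extracting those #include targets from a file that has been read into a string can be sketched as follows. This is a simplified scanner, not the article's actual code; unlike the real program it does not run the preprocessor first, so includes inside comments or false #if branches would still be picked up:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Pull the target of every #include line out of a source string.
// Handles both the "file.h" and <file.h> forms.
std::vector<std::string> find_includes(const std::string& src) {
    std::vector<std::string> deps;
    std::istringstream in(src);
    std::string line;
    while (std::getline(in, line)) {
        std::string::size_type pos = line.find("#include");
        if (pos == std::string::npos) continue;
        std::string::size_type open = line.find_first_of("\"<", pos);
        if (open == std::string::npos) continue;
        char close = (line[open] == '<') ? '>' : '"';
        std::string::size_type end = line.find(close, open + 1);
        if (end != std::string::npos)
            deps.push_back(line.substr(open + 1, end - open - 1));
    }
    return deps;
}
```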
The processing for a particular file in files.txt begins with that file being read into a string. (If the file cannot be found in any of the search paths, it is ignored.) After the file has been read, the preprocessor definitions are evaluated and all excluded text (based on preprocessor directive conditional statements) is removed. The purpose of this step is to allow for conditional inclusion of files. This may be necessary for source code that is to be compiled on multiple platforms. Listing 1 shows tran_3.h and parse.h where I use conditional inclusion of files to test the various configurations shown in Figures 1-3.
A separate class (function object) handles the preprocessing stage. The workings of this class are beyond the scope of this paper, so I will not spend much time on details. In order for a class to serve as a preprocessor, the class must have a default constructor and a function call operator that takes the file to be processed as a string& and the file containing the preprocessor definitions as a const string&. The operator()(string&, const string&) does the actual work of evaluating the directives and pruning the file.
Aside: I am sure that there are utilities available specifically for this purpose. My primary reason for writing the utility myself was that upon reading about postfix and infix expression evaluation in Algorithms in C++ by Robert Sedgewick [5], it seemed like a fun exercise. The important thing is that the dep_graph object can use a preprocessor object to create a file that does not contain ambiguous #include directives. The preprocessor is provided as a template parameter, so users of the dep_graph class have the ability to wrap their own preprocessor in a function object and use it instead.
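A minimal function object satisfying the interface described above might look like the following. It is only a sketch: this toy version strips full-line // comments and ignores the definitions argument entirely, and my reading of that argument as the defines file's name (rather than its contents) is an assumption, not something the article specifies:

```cpp
#include <sstream>
#include <string>

// Default-constructible function object with the required
// operator()(string&, const string&) signature. The first argument
// is the file contents, modified in place; the second (the defines)
// is unused in this trivial stand-in.
struct null_preprocessor {
    void operator()(std::string& file, const std::string& /*defines*/) const {
        std::istringstream in(file);
        std::ostringstream out;
        std::string line;
        while (std::getline(in, line))
            if (line.compare(0, 2, "//") != 0)   // drop full-line comments
                out << line << '\n';
        file = out.str();
    }
};
// It could then be plugged in as the template parameter, e.g.
//   dep_graph<null_preprocessor> g(/* ... */);
```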
Once a file has been preprocessed, it is searched for #include statements. Each node and its associated edges are stored in a map<string, vector<string> >. The traversal of the graph begins with a call to the dep_graph::check_deps function. This version takes no parameters and is not called recursively. It calls the recursive version, check_deps(const string& cur_dep), passing the name of one of the nodes in the graph as its argument. This call marks the start of a particular path. The recursive function calls itself, passing each of the connecting nodes as arguments. The function returns when it finds a node with an empty edge list (no outgoing edges) or when it encounters a node in the path more than once. This is the simple algorithm for performing the depth-first search of the directed graph.
check_deps(current node name)
    add this node to the current path
    if this node has no edges, return
    if this node is already in the path, set cycle flag and return
    check_deps(next node name in the current node's list of connecting nodes)

Listing 2 shows the implementation of this function.
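Following that pseudocode, the recursive routine might be sketched as below. The actual implementation appears in Listing 2; the function names and the choice to return the cycle flag (rather than set a member variable) are mine:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, std::vector<std::string> > Graph;

// Depth-first search from `node`; returns true if any path from it
// revisits a node already on the current path (i.e. contains a cycle).
bool check_deps(const Graph& g, const std::string& node,
                std::vector<std::string>& path) {
    // A node already on the current path means we have come full circle.
    if (std::find(path.begin(), path.end(), node) != path.end())
        return true;                                  // set cycle flag
    path.push_back(node);                             // add node to the path
    bool cycle = false;
    Graph::const_iterator it = g.find(node);
    if (it != g.end())                                // no entry means no edges
        for (std::size_t i = 0; i < it->second.size(); ++i)
            if (check_deps(g, it->second[i], path))   // recurse on each edge
                cycle = true;
    path.pop_back();
    return cycle;
}
```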
Using the Program to Map Physical Dependencies
There are two functions that print data to an output file. The function print_all_files(ostream& out) creates a list of each file that is either listed in files.txt or has been #included in one of those files. The function print_all_paths(ostream& out) prints all dependency paths beginning at each of the files either listed in files.txt or that have been #included in one of those files.
The analysis is performed iteratively by starting with a short file list and expanding it based on the results in the output file. This control is necessary so that the program does not run down all the dark alleys of the vendor library files without the user specifically asking it to do so.
This is an example session based on the code described above. First, I start with the list of the *.cpp files in the project. Figure 6 shows the contents of files.txt and the resulting output.txt. Each file listed in files.txt produces one path for each of its #include directives.
This is obviously not enough information, so I take the filenames that terminate the paths in the first output.txt and add them to files.txt. Figure 7 shows the result of this second run. A comparison to Figure 1 shows that all of the paths represented in the UML diagram have been produced on a separate line in the output.txt file. In this case, the system was sufficiently analyzed in just two iterations. A larger project may take several. As expected, the output contains no cycles.
The requirement that a file must be explicitly listed in files.txt may seem like a burden at first. However, had there been the usual set of standard #include files for iostream and STL libraries, the output would very quickly grow to an unusable size. This requirement allows the user to chase down a specific path through a library without viewing paths that are not covered in the current development.
Figure 8 contains the output.txt file that was created after parse.h included tran_3.h (by placing VER_2 in the defines.txt file). This particular dependency graph is so small that all paths result in a cycle, although this is not always the case. The program stops pursuing a path as soon as it has encountered the same node twice and labels that path with the tag {cycle}.
Figure 9 shows files.txt and output.txt after the date class was added and parse.h and tran_3.h were modified to include it (by placing VER_3 in defines.txt). Note that there are no cyclic dependencies.
Conclusion
This program provides a simple interface to produce a list of dependencies and all dependency paths for a desired group of files. The current driver that I use to run this program simply passes the command-line arguments to the constructor of a dep_graph object and then calls the two output functions. The dep_graph object can also be placed inside an enhanced driver if a more user-friendly interface is desired.
A utility such as this is best run periodically on a system as part of the development cycle, to prevent either the initial design or subsequent enhancements from introducing unforeseen and unwanted dependencies, particularly cyclic dependencies. (This can take place as soon as the initial headers have been created, long before any of the actual implementation has been written.)
If you are interested in a more robust set of applications that provides features such as adherence to certain coding standards and project statistics (levels of dependency, component dependency count, cumulative component dependency count, etc.), Appendix C in Large-Scale C++ Software Design by John Lakos [4] describes such a set of applications.
Source Code
The source code for this article is available for download at <www.cuj.com/code>. dep_code.zip within bergin.zip contains the code for the dependency graph and preprocessor utilities, as well as the driver. The program compiles under Hewlett-Packard 3.31 and Microsoft Visual C++ 6.0. The dependency graph class keeps its implementation in the header so that the template member functions can be instantiated; this is a trade-off made to gain the flexibility of a templatized preprocessor. project_files.zip, also included in bergin.zip, contains the stubbed files that make up the example project used in the article, so the program can be run on real code.
Notes
[1] R.C. Martin. "Stability," C++ Report, February 1997.
[2] R.C. Martin. "Granularity," C++ Report, November-December 1996.
[3] R.C. Martin. "Design Principles and Design Patterns," <www.objectmentor.com>, 2000.
[4] John Lakos. Large-Scale C++ Software Design (Addison-Wesley, 1996).
[5] Robert Sedgewick. Algorithms in C++ (Addison-Wesley, 1998).
Thomas Bergin has been programming C++ applications for the last six years after having spent ten years as an environmental geologist. He is currently working at the Kansas City, MO office of the National Association of Insurance Commissioners.