November 2000/Extracting Data from X-Y Plots

Scientific Numerical

Extracting Data from X-Y Plots

Rainer Thierauf

Scanners now input text with reasonable accuracy — why not graphical data as well?

Introduction

It is often necessary for scientists working in their field's theoretical branch to understand data that their colleagues from the experimental branch have measured. To "understand" data, a theoretician usually works from a theoretical model and tries to derive a mathematical function that fits the experimental data. If there is a poor fit between the data and the function, then the model is apparently incorrect and needs either to be improved upon or entirely discarded. On the other hand, if the fit is good, then the model is vindicated and the data is "understood" within that theory.

Experimental data is usually published in the form of a diagram such as shown in Figure 1. Theoreticians, however, prefer to have experimental data available in raw format, listing the x and y values of the data points, such as shown below:
16.7 81.4
19.3 45.7
22.6 24.6
25.6 18.8
28.4 17.6
...
The above table lists the first five data points (from the left) shown in Figure 1. The theoretician needs the numerical values in order to compare theory (i.e., the mathematical function derived from the model) to fact (the experimental data). Extracting the data as accurately as possible from a diagram is a cumbersome and difficult task, and is usually delegated to a graduate student. As a graduate student in theoretical physics, I had to do this many times. When I asked fellow (suffering) graduate students for the best approach, they suggested the following algorithm:

1. Use a photocopier to enlarge the diagram.
2. Select a data point from the diagram and project it onto the y axis, yielding the y value of the data point.
3. Repeat step 2 for the x axis, yielding the x value of the data point.
4. Repeat steps 2 and 3 for the error bars (if any) of the data point.
5. Repeat steps 2, 3, and 4 for all data points in the diagram.

I have run across diagrams that contained 50 or more data points, error bars being the rule rather than the exception. Such diagrams make data extraction a highly repetitive and mind-numbing job. Just the kind of job meant for computers.

Data Extraction with SCANDAT

When I was a graduate student, I wrote a program called SCANDAT, which implements the data extraction algorithm presented above. SCANDAT is a Windows MDI (Multiple-Document Interface) application. It uses MFC (Microsoft Foundation Classes) mostly for the MFC CScrollView class, which greatly simplifies the task of displaying the diagram in the client window. SCANDAT takes advantage of the device coordinate system of the client window and sets up a mapping between this coordinate system and the physical coordinate system of the diagram under consideration. When used with care, SCANDAT can save a great deal of time and extract data more precisely than the manual method, especially when logarithmic scales are involved (as in Figure 1).

SCANDAT begins by displaying an empty document. As the first step, the user imports a diagram (menu selection: Edit->Import Bitmap), which was presumably scanned from a publication. Only device-independent bitmaps (Windows bitmaps, *.BMP) are supported. Initially, SCANDAT displays the bitmap to scale. The user can choose to magnify the diagram (menu selection: View->Zoom). In fact, I recommend using magnification for all of SCANDAT's operations, as this improves the accuracy of the extracted data.

In the second step, the user sets up the mapping between the device and the physical coordinate system. To this end, the user defines the physical axes by marking two points on each axis. These defining points are labeled x_A, x_B, y_A, and y_B. The user begins by selecting "Define x_A" from the Edit menu and then moves the mouse to a tick mark on the physical x axis and left-clicks there. A dialog pops up, into which the user enters the physical value for the selected point, which can be read right off the tick mark itself. Point x_A is now defined. The user continues in the same way for points x_B, y_A, and y_B, each time selecting the proper item from the Edit menu, left-clicking the axis point and entering its physical value into the pop-up dialog. I recommend selecting the two points defining each axis as far apart as possible from each other. This improves accuracy.

If one or both of the axes use a logarithmic scale, the user needs to check the corresponding items in the Options menu (menu selections: Options->x log scale and Options->y log scale). SCANDAT records the device and physical coordinates of the selected points, thereby fixing the transformation between the device and the physical coordinate system. Note that both physical axes can now be expressed as linear functions in the device coordinate system. The general form for these functions is:
y_d(x_d) = m*x_d + c                (1)
where x_d and y_d are device coordinates along one of the physical axes. The above equation describes a line in the device coordinate system. The coefficients m and c can be expressed entirely in terms of the device coordinates of the two points (A and B) defining the axis:
m = (B.y_d - A.y_d)/(B.x_d - A.x_d)  (2)
c = A.y_d - A.x_d*m                (3)
where A.x_d and B.x_d are the x device coordinates and A.y_d and B.y_d are the y device coordinates of points A and B.
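
As a rough illustration, the line through the two axis-defining points could be computed as in the sketch below. The types DevicePoint and Line and the function lineThrough are hypothetical names chosen for this example and do not come from the SCANDAT source. The slope follows equation (2); the intercept is fixed by requiring that the line pass through point A. The sketch assumes the axis is not vertical.

```cpp
#include <cassert>

// Hypothetical helper types; the names are illustrative, not from SCANDAT.
struct DevicePoint {
    double x;  // x device coordinate (pixels)
    double y;  // y device coordinate (pixels)
};

struct Line {
    double m;  // slope in device coordinates
    double c;  // intercept in device coordinates
};

// Coefficients of the line y_d = m*x_d + c through the axis points A and B.
// Precondition: the axis is not vertical, i.e. a.x != b.x.
Line lineThrough(const DevicePoint& a, const DevicePoint& b) {
    assert(a.x != b.x && "a vertical axis needs separate handling");
    Line line;
    line.m = (b.y - a.y) / (b.x - a.x);  // slope, as in equation (2)
    line.c = a.y - a.x * line.m;         // chosen so the line passes through A
    return line;
}
```

A quick sanity check for any such routine is to substitute both defining points back into equation (1): each must reproduce its own y device coordinate.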

During the third and final step, the user acquires the data. The user initiates this step by selecting the menu item Edit->Acquire Data and then left-clicks each data point in turn. The status bar at the bottom of the application window displays the device and physical coordinates of the current mouse position during this step. If the data points have no attached error bars (menu items Options->x error bars and Options->y error bars unchecked), then each mouse click marks one data point. At the position of the click, SCANDAT draws a small circle to mark the selection.

If one error bar option has been selected, the tip of the error bar must be clicked after its data point. SCANDAT draws a line from the data point to the tip of the error bar. If both error bar options have been selected, the first click marks the data point; the second and third clicks mark the tips of the x and y error bars. Note that the error bar options refer only to the data points marked after the checking of the option. Therefore, it is possible to mark data points with and without error bars in the same diagram. Marking points and error bars can be undone and redone (menu selections: Edit->Undo and Edit->Redo).

Once all data points have been marked, the user saves the data points to an ASCII file (menu selection: File->Export Data). The format for each data point is:
<x-value> <y-value> [<dx> <dy>]
where dx and dy are the sizes of the error bars (if any) in x and y direction. This is the data format I have found to be most useful. The source code (method OnFileExport of class CScanDoc) can easily be modified to support other formats.
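
The export format can be sketched in a few lines of standard C++. This is not the code of OnFileExport; the DataPoint record and the functions writePoint and exportPoints are hypothetical names used only for illustration. The sketch distinguishes only between points with and without error bars, since the text does not say how a point with a single error bar is written.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one extracted data point; not SCANDAT's own class.
struct DataPoint {
    double x;           // physical x value
    double y;           // physical y value
    bool hasErrorBars;  // assumption: error bars are either both present or both absent
    double dx;          // size of the x error bar
    double dy;          // size of the y error bar
};

// Writes one point as "<x-value> <y-value> [<dx> <dy>]" followed by a newline.
void writePoint(std::ostream& out, const DataPoint& p) {
    out << p.x << ' ' << p.y;
    if (p.hasErrorBars) {
        assert(p.dx >= 0.0 && p.dy >= 0.0);  // sizes, not signed offsets
        out << ' ' << p.dx << ' ' << p.dy;
    }
    out << '\n';
}

// Formats a whole series of points, one per line.
std::string exportPoints(const std::vector<DataPoint>& points) {
    std::ostringstream out;
    for (const DataPoint& p : points) {
        writePoint(out, p);
    }
    return out.str();
}
```

Writing through a std::ostream reference means the same function can target a file stream, a string stream, or standard output, which makes it easy to adapt the output to another format.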

Some diagrams contain multiple series of data points that the user may want to extract to different files. In this case, the user should work with multiple documents, one for each series of data points.

SCANDAT supports MFC serialization. This means that the user can save a SCANDAT-type document and reopen it later. The SCANDAT-type document contains all information entered by the user, such as the imported bitmap itself, the scales, the selected options, and the data points marked by the user. The opening of a SCANDAT-type document, however, should not be confused with the import of a bitmap into a document.

Implementation

SCANDAT sets up a 2-D coordinate transformation between the device coordinates of a data point and its physical coordinates. If you're interested in the math, see the sidebar "Walking through the Math." The calculations are implemented by class CAxis, which represents an abstraction of a physical axis. Listing 1 shows the interface to CAxis, which consists mostly of get and set functions (omitted in the listing) to manipulate the private data members.

The methods that actually do something are SetScale and GetPhysVal. SetScale sets up the mapping between the device coordinate system and the physical coordinate system. GetPhysVal calculates the physical value of data point D (input parameter) for the axis upon which the method was called. GetPhysVal calls two private member functions, project_data_point and get_physical_value. The first projects a data point onto an axis and determines the device coordinates of that projected point. The second calculates the physical coordinate of the projected point along the axis, given its device coordinates. Listing 2 shows the source code for these methods.
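
The two-stage idea (project, then convert) can be illustrated with the following sketch. It is not the code from Listing 2. It assumes an orthogonal projection onto the axis and linear interpolation between the two defining points, performed on the logarithm of the physical value for a logarithmic axis. AxisSketch, projectFraction, and physicalValue are hypothetical names introduced for this example.

```cpp
#include <cassert>
#include <cmath>

struct DevicePoint {
    double x;  // x device coordinate
    double y;  // y device coordinate
};

// Hypothetical stand-in for an axis: two marked points and their physical values.
struct AxisSketch {
    DevicePoint a;   // device coordinates of the first defining point
    DevicePoint b;   // device coordinates of the second defining point
    double physA;    // physical value entered for point A
    double physB;    // physical value entered for point B
    bool logScale;   // true if the axis is logarithmic
};

// Orthogonal projection of d onto the line AB, expressed as a fraction t
// along the axis: t = 0 at A, t = 1 at B.
double projectFraction(const AxisSketch& axis, const DevicePoint& d) {
    const double ux = axis.b.x - axis.a.x;
    const double uy = axis.b.y - axis.a.y;
    const double len2 = ux * ux + uy * uy;
    assert(len2 > 0.0);  // the two defining points must differ
    return ((d.x - axis.a.x) * ux + (d.y - axis.a.y) * uy) / len2;
}

// Physical value of d along the axis, interpolated from the defining points.
double physicalValue(const AxisSketch& axis, const DevicePoint& d) {
    const double t = projectFraction(axis, d);
    if (axis.logScale) {
        assert(axis.physA > 0.0 && axis.physB > 0.0);
        // On a log axis, equal device distances correspond to equal ratios.
        const double la = std::log10(axis.physA);
        const double lb = std::log10(axis.physB);
        return std::pow(10.0, la + t * (lb - la));
    }
    return axis.physA + t * (axis.physB - axis.physA);
}
```

For example, a point that projects halfway between tick marks labeled 1 and 100 on a logarithmic axis comes out as 10, whereas linear interpolation would give 50.5. Because the projection works with the direction from A to B rather than with a slope, this particular sketch also copes with a vertical axis without special handling.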

Note that SetScale employs a trick to handle the possibility of division by zero. This can occur if the device x coordinates of the axis defining points are identical. This means that the slope of the corresponding axis is infinity. In other words, that axis is a vertical line. The trick in this case consists of replacing infinity with a very large number. This saves me from having to check for this special case at every step. Of course it introduces a certain amount of inaccuracy. This inaccuracy, however, disappears within the overall inaccuracy of the extraction, which I estimate at about one percent. Physicists use this kind of trick (some call it "Epsilontik") all the time, making mathematicians cringe.
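
A minimal sketch of such a guard is shown below. The constant kHugeSlope and the functions guardedSlope and xOnLine are hypothetical; the value SCANDAT actually substitutes for infinity is not stated here, so 1.0e10 is only an assumed placeholder.

```cpp
#include <cassert>

// Assumed stand-in for "a very large number"; not SCANDAT's actual constant.
const double kHugeSlope = 1.0e10;

// Slope of the line through (ax, ay) and (bx, by) in device coordinates.
// A vertical line gets a huge finite slope instead of a division by zero.
double guardedSlope(double ax, double ay, double bx, double by) {
    assert(ax != bx || ay != by);  // the two defining points must differ
    const double dx = bx - ax;
    if (dx == 0.0) {
        return kHugeSlope;
    }
    return (by - ay) / dx;
}

// x device coordinate of the point at height y on the line with slope m
// through (ax, ay). Used here to show how far the substitute line strays.
double xOnLine(double m, double ax, double ay, double y) {
    return ax + (y - ay) / m;
}
```

With the assumed slope of 1.0e10, a point 1000 pixels up the axis is displaced horizontally by only 1.0e-7 pixels from the true vertical line, far below the resolution of any scan.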

While MFC supports the loading of bitmap resources stored within the executable, MFC does not provide support for reading and writing bitmap files. For this purpose, I created a bare-bones bitmap class CBMap, which does just that. This class, as well as the full source code for SCANDAT, is provided in the CUJ online archives (see www.cuj.com/code).

Conclusion

SCANDAT provides a quick mechanism to extract data from (scanned) diagrams. Used carefully, it can achieve a one percent accuracy level, meaning that the data extracted with SCANDAT deviates by one percent or less from the original data. At this accuracy level, a reproduction of the original diagram using SCANDAT data cannot be distinguished from the original with the naked eye. The same could not be said of my best manual efforts. This accuracy level certainly allows theorists to use SCANDAT data when comparing theory to experiment. Furthermore, it allows theorists to publish their theoretical function and the SCANDAT-extracted data in the same diagram.

Acknowledgements

The original version of SCANDAT was written in 1993 while I was a graduate student at the Institute of Theoretical Physics in Tuebingen, Germany. That application was developed using Borland Turbo C/C++ v3.1 for Windows 3.1. I would like to thank my former fellow graduate student Detlev Bueckers (Detlev, if you are reading this, get in touch with me), who pointed out that bitmaps can easily be displayed within a Windows application and who gave me my first introduction to Windows programming. Without his help, I would still be doing data extraction the hard way.

Rainer Thierauf has a Ph.D. in nuclear physics from the University of Tuebingen, Germany. He currently works as an independent contractor for COMSYS in Beaverton, Oregon. He can be reached at rathierauf@aol.com.