Wondering where to stuff that extra century information in your current date fields? Here are a few candidate hiding places.
Date compression is a technique that enables legacy data and software to handle dates beyond the year 2000. This technique is used when expanding the size of a non-compliant date field (typically, a field with two-digit year representation) is not an option. Many compression formats have already been proposed, offering different degrees of compression and readability. Here I introduce a family of formats enabling storage of dates in a form equivalent to MMDDCCYY format. The major advantage of these formats is that they require no more than six characters. They accomplish this by using numbers with a base larger than 10 (e.g., hexadecimal), and by using information hidden in the day of the week (Sunday, Monday, etc.) to calculate the century digits (CC) missing in MMDDYY format. The algorithms associated with these formats are based on a property of the Gregorian Calendar: its 400-year period. That is, dates separated by a multiple of 400 years fall on the same day of the week.
The problems associated with calendar date processing are not language specific. So, although I discuss only a C/C++ interface here, a similar approach can be applied in any other language environment, such as COBOL, Fortran, and so on. C/C++ implementations typically represent dates as long integers or as character arrays. The long integer format does not present a problem because it can hold dates in MMDDCCYY format or its equivalents; if a long variable can hold the date, conversion is trivial. The overflow problem appears when the date is stored as a six-character array. The most challenging task is to convert a date stored in MMDDYY format to a format convertible to MMDDCCYY without expanding the date variable: the converted date must be no longer than the original. This is the problem I consider in this article.
Compression Formats and Algorithms
The formats and algorithms presented here are based on the 400-year period of the Gregorian Calendar. (See the sidebar, "The 400-year Period of the Gregorian Calendar".) The periodic property enables calculation of the missing century (CC) value given the month (MM), day (DD), year (YY), and a day of the week (W). The formats also make use of numbers with a base larger than 10. For example, the MM fields can be replaced by a single-character M field with a base of 16 or higher; DD fields can be replaced by a single-character D field with a base of 32 or higher.
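The periodic property is easy to check numerically. The following is a minimal sketch, not the article's code: it uses Sakamoto's well-known day-of-week method, and the function names `day_of_week` and `same_weekday_after_400_years` are illustrative assumptions.

```c
#include <assert.h>

/* Day of week for a Gregorian date: 0 = Sunday ... 6 = Saturday.
   Sakamoto's method; meaningful only for Gregorian years. */
int day_of_week(int y, int m, int d)
{
    static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};

    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    if (m < 3)
        y -= 1;   /* January and February count with the previous year */
    return (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;
}

/* Nonzero if the date and the same date 400 years later
   fall on the same day of the week. */
int same_weekday_after_400_years(int y, int m, int d)
{
    return day_of_week(y, m, d) == day_of_week(y + 400, m, d);
}
```

For example, `day_of_week(1600, 1, 1)` and `day_of_week(2000, 1, 1)` both yield 6 (Saturday).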
(Mh)(Wh)DDYY Format
This format represents dates with a hexadecimal month field, a hexadecimal day-of-week field, and decimal day-of-month and year fields. To convert a MMDDYY date to this format I first replace the MM field with a single hexadecimal character. This frees up one character for storage of the day-of-week (W) value. The resulting format becomes (Mh)WDDYY. Here (Mh) stands for a month represented as a hexadecimal number; W is a day-of-week number where Sunday is 0, Monday is 1, and so on. Suppose I substitute the missing CC field with an assumed value (17, 18, 19, or 20) and calculate the corresponding set of Ws. The periodic property of the calendar ensures there is only one CC in any given 400-year interval that produces a day-of-week equal to the W value in the (Mh)WDDYY format.
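The century search described above can be sketched as follows. This is an illustration under stated assumptions rather than the routine from Figure 1: `recover_century` and its `first_cc` parameter (the first century of the 400-year window being searched) are hypothetical names, and the day of week is computed with Sakamoto's method.

```c
#include <assert.h>

/* Day of week, 0 = Sunday ... 6 = Saturday (Sakamoto's method). */
static int day_of_week(int y, int m, int d)
{
    static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    if (m < 3)
        y -= 1;
    return (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;
}

static int is_leap(int y)
{
    return y % 4 == 0 && (y % 100 != 0 || y % 400 == 0);
}

/* Try each of the four centuries first_cc .. first_cc + 3 and return
   the one for which MM/DD/CCYY falls on weekday w (0 = Sunday).
   Returns -1 if no century in the window matches. */
int recover_century(int first_cc, int mm, int dd, int yy, int w)
{
    int cc;

    assert(mm >= 1 && mm <= 12 && dd >= 1 && dd <= 31);
    assert(yy >= 0 && yy <= 99 && w >= 0 && w <= 6);
    for (cc = first_cc; cc < first_cc + 4; ++cc) {
        int year = cc * 100 + yy;
        if (mm == 2 && dd == 29 && !is_leap(year))
            continue;   /* February 29 does not exist in this century */
        if (day_of_week(year, mm, dd) == w)
            return cc;
    }
    return -1;
}
```

January 1 of the years 1600, 1700, 1800, and 1900 falls on Saturday, Friday, Wednesday, and Monday respectively, so each W value selects at most one century in the window.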
This format works for any 400-year interval. To expand date storage to an 800-year interval the format uses two different offsets for the W field. Numbers from 0h to 6h represent days of the week in the interval 1600 to 1999. Numbers from 7h to Dh serve for the 2000 to 2399 interval. The format becomes (Mh)(Wh)DDYY. If I select an offset of 1 for (Mh), thus representing January as 2h, February as 3h, ..., and December as Dh, then use of the (Mh)(Wh)DDYY format becomes detectable via software: the first M digit in the original format is always 0 or 1, while in the new format it is 2 or greater. For a definition of detectability see the sidebar, "Evaluating Date Compression Formats." Figure 1 shows routines to convert six-character dates to and from this format.
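To make the layout concrete, here is a minimal encode/decode sketch. It is not the code from Figure 1; the names `mwddyy_encode` and `mwddyy_decode` are assumptions, uppercase hexadecimal digits are assumed, and the month offset is 1 (so January is 2h and December is Dh).

```c
#include <assert.h>
#include <string.h>

/* Day of week, 0 = Sunday ... 6 = Saturday (Sakamoto's method). */
static int day_of_week(int y, int m, int d)
{
    static const int t[] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    if (m < 3)
        y -= 1;
    return (y + y / 4 - y / 100 + y / 400 + t[m - 1] + d) % 7;
}

static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Encode a Gregorian date (years 1600..2399) as (Mh)(Wh)DDYY plus NUL. */
void mwddyy_encode(int year, int mm, int dd, char out[7])
{
    static const char hex[] = "0123456789ABCDEF";
    int w;

    assert(year >= 1600 && year <= 2399);
    assert(mm >= 1 && mm <= 12 && dd >= 1 && dd <= 31);
    w = day_of_week(year, mm, dd);
    out[0] = hex[mm + 1];                       /* month, offset by 1 */
    out[1] = hex[w + (year >= 2000 ? 7 : 0)];   /* weekday plus period */
    out[2] = (char)('0' + dd / 10);
    out[3] = (char)('0' + dd % 10);
    out[4] = (char)('0' + (year % 100) / 10);
    out[5] = (char)('0' + year % 10);
    out[6] = '\0';
}

/* Decode; returns 0 on success, -1 if the string is not in this format. */
int mwddyy_decode(const char *in, int *year, int *mm, int *dd)
{
    int m, wh, d, yy, cc, first_cc, i;

    if (strlen(in) != 6) return -1;
    for (i = 2; i < 6; ++i)
        if (in[i] < '0' || in[i] > '9') return -1;
    m = hex_value(in[0]) - 1;
    wh = hex_value(in[1]);
    d = (in[2] - '0') * 10 + (in[3] - '0');
    yy = (in[4] - '0') * 10 + (in[5] - '0');
    if (m < 1 || m > 12 || wh < 0 || wh > 13 || d < 1 || d > 31)
        return -1;
    first_cc = (wh >= 7) ? 20 : 16;             /* which 400-year period */
    for (cc = first_cc; cc < first_cc + 4; ++cc) {
        if (day_of_week(cc * 100 + yy, m, d) == wh % 7) {
            *year = cc * 100 + yy;
            *mm = m;
            *dd = d;
            return 0;
        }
    }
    return -1;
}
```

With these assumptions, December 31, 1999 (a Friday) encodes as "D53199" and January 1, 2000 (a Saturday) as "2D0100". A plain MMDDYY string such as "123199" is rejected by the decoder because its first character is below 2.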
(PDDDDD)h Format
A widely used format in the fields of astronomy and chronology is the continuous day counting method. This format uses six decimal digits (DDDDDD) to count days past 01/01/4713 BC. This so-called Julian Period was introduced in 1583 by the French scientist Joseph Scaliger. The start of the Julian Day (JD) is mean midday at the 0th (Greenwich) meridian. The mean Greenwich midnight preceding this mean midday defines the beginning of the considered calendar date. To convert this format to (DDDDD)h I first divide the JD into periods of 400 years. Then any date can be represented as a period number (P) and a number of days between the period start and the considered date. The starting date for a period is arbitrary, so my format uses convenient dates such as January 1st, 1600 and January 1st, 2000.
This date format is essentially equivalent to JD. The only difference is an offset equal to the Julian Day of a period's starting date (2,305,447 for January 1st, 1600; and 2,451,544 for January 1st, 2000, which is 146,097 days later). Each 400-year period consists of 146,097 days [3], so it takes six decimal characters to represent any date in a 400-year interval.
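As a rough check of the relationship between the Julian Day count and a period offset, the sketch below uses the widely published Fliegel-Van Flandern integer formula for the Julian Day Number at noon of a Gregorian date. The names `julian_day_number` and `days_into_period` are illustrative, not the article's, and the count is assumed to start at 0 on the first day of a period.

```c
#include <assert.h>

/* Julian Day Number (the JD at noon) of a Gregorian calendar date,
   using the Fliegel-Van Flandern integer formula. Relies on integer
   division truncating toward zero, as C99 guarantees. */
long julian_day_number(int y, int m, int d)
{
    long a = (m - 14) / 12;   /* -1 for January and February, else 0 */

    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);
    return (1461L * (y + 4800L + a)) / 4
         + (367L * (m - 2 - 12 * a)) / 12
         - (3L * ((y + 4900L + a) / 100)) / 4
         + d - 32075L;
}

/* Days elapsed since January 1 of start_year (0 on that day itself). */
long days_into_period(int y, int m, int d, int start_year)
{
    return julian_day_number(y, m, d) - julian_day_number(start_year, 1, 1);
}
```

Whatever convention is chosen for the offset, the starting dates of two consecutive periods are exactly 146,097 days apart, so `days_into_period(2000, 1, 1, 1600)` is 146097.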
To free space for a period number I convert this six-digit number to hexadecimal. Decimal 146097 is equal to 23AB1h, so it takes only five characters to represent such a date in (DDDDD)h format. Conversion between MMDDYY dates and (DDDDD)h is simple, and is presented in Figure 2. It is now possible to use the first character in a six-character array to store P. Years from 1600 through 1999 are in the first period; years from 2000 through 2399 are in the second period, and so on. (Note that the Gregorian calendar is valid only after year 1582.) Thus we come to a derived format, (PDDDDD)h.
If you use the (DDDDD)h or (PDDDDD)h format, there are several ways to detect whether a given date is stored in the conventional format (MMDDYY) or compressed. If you stick to the (DDDDD)h format, which is valid for only 400 years, you can use the fact that it is only five characters long to distinguish it from the six-character MMDDYY. If you use the (PDDDDD)h format, you can use the fact that an MM field always starts with either 0 or 1: let 2 represent the first 400-year period in (PDDDDD)h, 3 the second, and so on. If the first digit is 2 or greater, the date is compressed.
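A possible shape for the (PDDDDD)h conversion is sketched below. It is not the code from Figure 2. The names `pd_encode` and `pd_decode` are hypothetical, the day count is assumed to be 0 on the first day of each period, uppercase hexadecimal digits are assumed, and P is restricted to the characters '2' through '9' (years 1600 through 4799) to keep the sketch short.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int is_leap(int y)
{
    return y % 4 == 0 && (y % 100 != 0 || y % 400 == 0);
}

/* Days from 0001-01-01 (proleptic Gregorian) to the given date. */
static long day_number(int y, int m, int d)
{
    static const int cum[] = {0, 31, 59, 90, 120, 151,
                              181, 212, 243, 273, 304, 334};
    long p = y - 1;
    long n = p * 365 + p / 4 - p / 100 + p / 400 + cum[m - 1] + (d - 1);
    if (m > 2 && is_leap(y))
        n += 1;
    return n;
}

/* Encode as P followed by five hex digits, plus NUL. */
void pd_encode(int year, int mm, int dd, char out[7])
{
    int period;
    long n;

    assert(year >= 1600 && year <= 4799);
    assert(mm >= 1 && mm <= 12 && dd >= 1 && dd <= 31);
    period = (year - 1600) / 400;
    n = day_number(year, mm, dd) - day_number(1600 + 400 * period, 1, 1);
    snprintf(out, 7, "%c%05lX", '2' + period, (unsigned long)n);
}

/* Decode; returns 0 on success, -1 if the string is not in this format. */
int pd_decode(const char *in, int *year, int *mm, int *dd)
{
    static const int mdays[] = {31, 28, 31, 30, 31, 30,
                                31, 31, 30, 31, 30, 31};
    char *end;
    long n;
    int y, m, len;

    if (strlen(in) != 6 || in[0] < '2' || in[0] > '9') return -1;
    n = strtol(in + 1, &end, 16);
    if (*end != '\0' || n < 0 || n >= 146097L) return -1;
    y = 1600 + 400 * (in[0] - '2');
    for (;;) {                       /* peel off whole years */
        len = is_leap(y) ? 366 : 365;
        if (n < len) break;
        n -= len;
        ++y;
    }
    for (m = 1; ; ++m) {             /* then whole months */
        len = mdays[m - 1] + (m == 2 && is_leap(y));
        if (n < len) break;
        n -= len;
    }
    *year = y;
    *mm = m;
    *dd = (int)n + 1;
    return 0;
}
```

Under these assumptions, January 1, 1600 encodes as "200000", December 31, 1999 as "223AB0", and January 1, 2000 as "300000".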
(DDD)64 Format
Just as hexadecimal numbers can be used to represent dates in a more compressed format, larger bases can bring even higher degrees of compression. A numeric system with a base of 64 is very convenient, for example. The relationship between this system and octal numbers is similar to the relationship between binary and hexadecimal numbers. A binary number can be converted to hexadecimal by dividing it into blocks of four binary digits (starting with the least significant bit) and in each block substituting the hexadecimal number equivalent to these four digits. Similarly, to convert an octal number to base 64, first divide the number into blocks of two octal digits (starting with the least significant octal digit). Each two-digit block can then be replaced by a single base-64 digit.
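The pairing of octal digits can be shown in a few lines. This is only an illustrative sketch; `octal_to_base64_values` is a hypothetical helper that returns the numeric value (0 to 63) of each base-64 digit rather than a printable character.

```c
#include <assert.h>
#include <string.h>

/* Convert a string of octal digits to base-64 digit values (0..63),
   most significant first, by grouping the octal digits in pairs from
   the least significant end. Returns the number of values written. */
int octal_to_base64_values(const char *octal, int out[], int max_out)
{
    size_t len = strlen(octal);
    size_t i;
    int k = 0;

    assert((len + 1) / 2 <= (size_t)max_out);
    for (i = 0; i < len; ++i)
        assert(octal[i] >= '0' && octal[i] <= '7');

    i = 0;
    if (len % 2 == 1)                 /* odd length: leading single digit */
        out[k++] = octal[i++] - '0';
    while (i < len) {                 /* each pair is 8 * high + low */
        out[k++] = (octal[i] - '0') * 8 + (octal[i + 1] - '0');
        i += 2;
    }
    return k;
}
```

For instance, octal 1777 (decimal 1023) splits into the blocks 17 and 77, giving the base-64 digit values 15 and 63, and indeed 15 x 64 + 63 = 1023.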
To convert a MMDDYY date to base 64 I first convert it to a decimal number of days past January 1, 1600. I then convert the number of days to an octal number of six digits. Finally, I divide these six digits into three two-digit blocks and convert each of them to an index into a table of base 64 digits. The table is organized as follows:
Index      Base 64 Digit
0 - 9      '0'-'9'
10 - 35    'A'-'Z'
36 - 61    'a'-'z'
62 - 63    '#'-'$'
This process converts the date to a three-character string in which each D symbol represents a number in base 64. This format can represent dates only from 01/01/1600 through 09/23/2317, which satisfies virtually all practical needs. Figure 3 shows the high-level code to convert from MMDDYYYY to (DDD)64. The full source code for date conversion is available from the CUJ ftp site (see p. 3 for downloading instructions).
The remaining three bytes can be used to store other information, such as time-of-day. Only one base-64 digit is needed to store each of hours (hh), minutes (mm), and seconds (ss) that represent time-of-day. Combining the (DDD)64 format with the compressed time format results in (hmsDDD)64. If I select an offset of 2 for the h field, this format also becomes detectable.
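The following sketch applies the digit table to a day count and to a time of day. It is not the code from Figure 3: the names `d64_encode`, `d64_decode`, and `hms64_encode` are hypothetical, and January 1, 1600 is assumed to be day 0. Dividing by 64 is equivalent to peeling off two octal digits at a time.

```c
#include <assert.h>
#include <string.h>

/* Digit table: '0'-'9', 'A'-'Z', 'a'-'z', then '#' and '$'. */
static const char B64[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz#$";

/* Days from 0001-01-01 (proleptic Gregorian) to the given date. */
static long day_number(int y, int m, int d)
{
    static const int cum[] = {0, 31, 59, 90, 120, 151,
                              181, 212, 243, 273, 304, 334};
    long p = y - 1;
    long n = p * 365 + p / 4 - p / 100 + p / 400 + cum[m - 1] + (d - 1);
    if (m > 2 && y % 4 == 0 && (y % 100 != 0 || y % 400 == 0))
        n += 1;
    return n;
}

/* Encode a date as three base-64 digits plus NUL. */
void d64_encode(int year, int mm, int dd, char out[4])
{
    long n;

    assert(mm >= 1 && mm <= 12 && dd >= 1 && dd <= 31);
    n = day_number(year, mm, dd) - day_number(1600, 1, 1);
    assert(n >= 0 && n < 64L * 64L * 64L);
    out[0] = B64[n / 4096];
    out[1] = B64[(n / 64) % 64];
    out[2] = B64[n % 64];
    out[3] = '\0';
}

/* Return the day count stored in a (DDD)64 string, or -1 if invalid. */
long d64_decode(const char *in)
{
    long n = 0;
    int i;

    if (strlen(in) != 3) return -1;
    for (i = 0; i < 3; ++i) {
        const char *p = strchr(B64, in[i]);
        if (p == NULL) return -1;
        n = n * 64 + (long)(p - B64);
    }
    return n;
}

/* Encode a time of day as three base-64 digits; the hour carries an
   offset of 2 so the first character is never '0' or '1'. */
void hms64_encode(int hh, int mi, int ss, char out[4])
{
    assert(hh >= 0 && hh <= 23 && mi >= 0 && mi <= 59 && ss >= 0 && ss <= 59);
    out[0] = B64[hh + 2];
    out[1] = B64[mi];
    out[2] = B64[ss];
    out[3] = '\0';
}
```

With day 0 on January 1, 1600, January 1, 2000 is day 146097 and encodes as "Zgn"; the time 23:59:59 encodes as "Pxx".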
Test Driver
Figure 4 presents a brief test program that prompts the user for a MMDDYY date and calls the conversion routines to convert the date to the formats described above. The program also prompts the user for two century digits (e.g., "17," "18," etc.), since a MMDDYY date alone does not carry enough information to represent the equivalent of a MMDDYYYY date.
Recommendations
The sidebar "Evaluating Date Compression Formats" presents some criteria that you can use to evaluate the formats I have presented in this article, as well as other formats. In general, you can use the (Mh)(Wh)DDYY and (PDDDDD)h formats to represent the equivalent of a MMDDCCYY date in six characters. The (Mh)(Wh)DDYY is easier to read, but the (PDDDDD)h is more compatible with the DDDDDD format used in many scientific applications. (DDD)64 is useful for "squeezing" extra data into a six-character space.
The (DDDDD)h format without a period specifier (P) cannot be recommended because dates related to the first 400-year period (1600 - 1999) and the second (2000 - 2399) will both be in wide use for 110 years or more. You could still use (DDDDD)h if you selected a starting date other than January 1, 1600. For example, if you use January 1, 1800 as the starting date, (DDDDD)h will cover a range up to December 31, 2199.
Notes and References
[1] Daniel Zwillinger, Editor-In-Chief. CRC Standard Mathematical Tables and Formulae, 30th Edition (CRC Press, 1996). See page 737, "Day of Week for Any Given Day."
[2] I. A. Klimishin. Kalendar I Hronologia (Calendar and Chronology). (Nauka, Moscow, 1985). Text is in Russian.
[3] 400 Gregorian years = (365 x 400) + (24 x 3) + (25 x 1) = 146,097 days. Here 24 x 3 is the number of leap years in the three centuries that start with a non-leap year, and 25 x 1 is the number of leap years in the one century that starts with a leap year. So it takes six decimal digits to represent the day count in a 400-year interval.
Leon Iofin has a Ph.D. in Applied Mechanics from St. Petersburg Polytechnic Institute, Russia. He has been working for EER Systems, Inc. for three years as a Senior Systems Engineer. He has been using C++ for seven years for Windows programming. His current interests are image processing, biometrics, pattern recognition, and computational mathematics. You can reach him at liofin@erols.com.