The MacOS Find-By-Content Engine

Dr. Dobb's Journal December 2000

Putting a built-in search engine to work

By Chilton Webb

Chilton is the editor and developer of MacintoshDeveloper.com (http://www.ddj.com/macintoshdeveloper/). He can be contacted at chilton@tca.net.

I remember years ago, staring in wonder at a new technology that was going to revolutionize everything Macintosh. It was part of the then-new operating system dubbed "Copland." No, it wasn't the high-quality icons or the preemptive multitasking. It was V-Twin -- a search engine unparalleled in a desktop computer. But it was just a search engine, so what?

Hardly. V-Twin had the ability to instantly scan gigabytes of data and extract precisely what you were looking for, even if you weren't specific in your request. It was like telling someone to go to the store and get you "something," and having them return with a bottle of Dr. Pepper and some donuts -- exactly what you wanted.

Alas, as hopes of Copland began to fade, it seemed V-Twin would never come to pass. But it did -- and it's in every MacOS since OS 8.1. Today we call it "Find-By-Content" (FBC), and for programmers it's darn near indispensable. In this article, I'll examine the technology and show how you can embed FBC in your application.

FBC Searches

FBC is a system-level search facility implemented as a Code Fragment Manager library. To use it, weak-link against the library and use Gestalt to confirm that FBC is available before making any calls to it. Two Gestalt names are involved: the selector gestaltFBCVersion ('fbcv'), which reports the version installed on the user's Mac, and the constant gestaltFBCCurrentVersion, which is the version your headers were built against. Compare the two. If they differ, it's up to you whether you want to proceed.

Next you'll want to know if the engine is currently offline. Really, the only time the engine won't be available is when a drive is being indexed. To find out if indexing is currently underway, head to Gestalt again. The gestaltFBCIndexingState selector returns either safe (0) or critical (1). If it's "critical," you can go ahead and issue your search, but you won't get your results back until indexing returns to safe mode. At this point, you're ready to search.
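The availability checks above boil down to a small decision, and a minimal sketch of it in C may help. So that it compiles anywhere, the function takes the Gestalt results as plain parameters rather than calling Gestalt itself. The enum, the constant names, and the function name are illustrative and are not part of the FBC headers; the treatment of the "critical" state simply follows the description above.

```c
#include <stdint.h>

/* Indexing states as described in the text; these names are illustrative. */
enum { INDEXING_SAFE = 0, INDEXING_CRITICAL = 1 };

typedef enum {
    FBC_UNAVAILABLE,        /* Gestalt reported an error: FBC is not installed */
    FBC_VERSION_MISMATCH,   /* installed version differs from the expected one */
    FBC_READY_BUT_INDEXING, /* usable, but results wait for indexing to settle */
    FBC_READY               /* usable right now                                */
} FbcReadiness;

/*
 * gestaltErr       result code of the Gestalt call for the version selector
 *                  (0 means the selector is known, so FBC is present)
 * installedVersion response value from that call
 * expectedVersion  version the application was built and tested against
 * indexingState    response value for the indexing-state selector
 */
FbcReadiness classify_fbc(int gestaltErr, int32_t installedVersion,
                          int32_t expectedVersion, int32_t indexingState)
{
    if (gestaltErr != 0)
        return FBC_UNAVAILABLE;
    if (installedVersion != expectedVersion)
        return FBC_VERSION_MISMATCH;
    if (indexingState == INDEXING_CRITICAL)
        return FBC_READY_BUT_INDEXING;
    return FBC_READY;
}
```

In a real application the four arguments would come from two Gestalt calls, and a version mismatch need not be fatal. The sketch only makes the decision explicit so the caller can choose what to do.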

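A typical FBC search consists of four distinct parts: creating a search session, telling the session where to search, running the search, and retrieving the results.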

This process is straightforward. To create your initial session, you call FBCCreateSearchSession with a pointer to an FBCSearchSession variable. This function builds a new session or returns an OSStatus error if there's a problem.

Next, you tell your newly created FBCSearchSession where to search. You can specify a folder, an entire drive, or multiple drives. The easiest way is to add every volume at once: pass your session variable to FBCAddAllVolumesToSession.

Some drives may not be indexed, so they won't be searched. If you want more control over that, pass FBCVolumeIsIndexed a volume reference number; it returns True if the volume is indexed and False otherwise. If the volume is indexed, call FBCAddVolumeToSession with your session and the volume reference number to add it to your search session. Conversely, you can remove a volume from the search session by passing FBCRemoveVolumeFromSession a pointer to your session and the volume reference number of the drive you wish to remove.
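If you go the volume-by-volume route, the filtering step can be sketched as follows. This is only an illustration: the VolumeTest typedef, select_indexed_volumes, and the demo_is_indexed stand-in are hypothetical names. In a real application the test would be FBCVolumeIsIndexed, and each volume kept would then be handed to FBCAddVolumeToSession.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Shape of an "is this volume indexed?" test. In a real application this role
   would be played by FBCVolumeIsIndexed; the typedef itself is illustrative. */
typedef bool (*VolumeTest)(short vRefNum);

/* Copies the volume reference numbers that pass isIndexed into out[].
   Returns how many were kept; out must have room for count entries. */
size_t select_indexed_volumes(const short *vRefNums, size_t count,
                              VolumeTest isIndexed, short *out)
{
    assert(vRefNums != NULL && isIndexed != NULL && out != NULL);

    size_t kept = 0;
    for (size_t i = 0; i < count; i++) {
        if (isIndexed(vRefNums[i]))
            out[kept++] = vRefNums[i];
    }
    return kept;
}

/* Stand-in used only for demonstration: pretends that volumes -1 and -3
   are indexed and everything else is not. */
bool demo_is_indexed(short vRefNum)
{
    return vRefNum == -1 || vRefNum == -3;
}
```

Passing the test in as a function pointer keeps the selection logic separate from the engine, which also makes it easy to try out without an indexed drive at hand.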

In the event you encounter trouble at any point after you've created a search session, you should shut down the session by passing the session variable to FBCDestroySearchSession.
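The shape of that cleanup rule can be sketched in portable C. Everything here is a stand-in: Status plays the role of OSStatus, Session plays the role of FBCSearchSession, and the three function pointers correspond to FBCCreateSearchSession, FBCAddAllVolumesToSession, and FBCDestroySearchSession. The real prototypes may differ, so treat this as the pattern only, not as the FBC API.

```c
#include <assert.h>
#include <stddef.h>

typedef int Status;              /* stands in for OSStatus; 0 means no error */
typedef struct Session Session;  /* opaque, stands in for FBCSearchSession   */

/* The three calls the pattern needs, passed in as function pointers. */
typedef struct {
    Status (*create)(Session **out);
    Status (*addAllVolumes)(Session *s);
    Status (*destroy)(Session *s);
} SessionOps;

/* Creates a session and adds every volume to it. If anything fails after
   creation, the session is destroyed before the error is returned. */
Status open_configured_session(const SessionOps *ops, Session **out)
{
    assert(ops != NULL && out != NULL);
    *out = NULL;

    Session *s = NULL;
    Status err = ops->create(&s);
    if (err != 0)
        return err;              /* nothing was created, nothing to clean up */

    err = ops->addAllVolumes(s);
    if (err != 0) {
        ops->destroy(s);         /* trouble after creation: shut it down */
        return err;
    }

    *out = s;
    return 0;
}

/* ---- Demonstration-only fakes, not part of any real API ---- */
struct Session { int volumeCount; };
static struct Session demoSession;
int demoFailAdd = 0;             /* set to 1 to simulate a failure */
int demoDestroyCalls = 0;        /* counts how often cleanup ran   */

Status demo_create(Session **out) { demoSession.volumeCount = 0; *out = &demoSession; return 0; }
Status demo_add_all(Session *s)   { if (demoFailAdd) return -1; s->volumeCount = 2; return 0; }
Status demo_destroy(Session *s)   { (void)s; demoDestroyCalls++; return 0; }

const SessionOps demoOps = { demo_create, demo_add_all, demo_destroy };
```

The caller either gets a fully configured session or an error with nothing left to clean up, so no later code has to remember which half-built state it is in.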

The FBC engine controls all memory management. You shouldn't have to do any memory management on your own.

Most likely, your exposure to the FBC engine has been searching documents for a specific string of text in Sherlock. You can do that in your application, too. To perform a query string search, call FBCDoQuerySearch and pass it your session variable, query string, and any additional options you want considered.

You have three powerful options at this point. FBC lets you search on a query string, as Sherlock does. You can also search for files that match the results of previous searches, and for files whose contents are similar to those of other files.

To search for files matching a series of search results, call FBCDoExampleSearch and pass it the indexes of hit files from previous searches. To find files that are similar to other files on a volume, call FBCBlindExampleSearch with an array of FSSpecs for the files you want to match. While these two functions showcase the more advanced capabilities of the engine, designing and implementing an interface for them is up to you. For example, FBC in Sherlock relies on contextual menus in the Finder to search for files that are like other files.

Once you've performed your search, retrieve information on each matching document ("hit"). Pass FBCGetHitCount your session and a pointer to a UInt32, and it fills in the count of matched documents. For each match, call FBCGetHitDocument to retrieve that document's FSSpec, and call FBCGetHitScore to obtain a score for the file (returned as a decimal fraction). FBCGetMatchedWords creates a list of the matched words for the file.
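Once the hits are in hand, you'll probably want to present them. The sketch below assumes you have already copied each hit's name and score into a small record of your own; the HitInfo struct and both functions are hypothetical helpers, not FBC calls. It also assumes the score is a fraction between 0.0 and 1.0, and it sorts explicitly rather than relying on any ordering from the engine.

```c
#include <assert.h>
#include <stdlib.h>

/* One search hit as an application might store it (illustrative layout). */
typedef struct {
    const char *name;   /* display name of the matching document         */
    float       score;  /* relevance as a fraction, assumed 0.0 to 1.0   */
} HitInfo;

/* qsort comparator: higher scores sort first. */
static int by_score_descending(const void *a, const void *b)
{
    float sa = ((const HitInfo *)a)->score;
    float sb = ((const HitInfo *)b)->score;
    return (sa < sb) - (sa > sb);
}

/* Orders the hits from most to least relevant. */
void rank_hits(HitInfo *hits, size_t count)
{
    assert(hits != NULL || count == 0);
    if (count > 1)
        qsort(hits, count, sizeof hits[0], by_score_descending);
}

/* Converts a fractional score to a whole percentage for a relevance bar,
   clamping out-of-range values and rounding to the nearest integer. */
int score_to_percent(float score)
{
    if (score < 0.0f) score = 0.0f;
    if (score > 1.0f) score = 1.0f;
    return (int)(score * 100.0f + 0.5f);
}
```

A results list would call rank_hits once after collecting the hits, then score_to_percent for each row it draws.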

If you want to search again, you don't need to create a new session: call FBCReleaseSessionHits to release the hits from the previous search, then search again with the same session. To issue multiple searches on the same drives, call FBCCloneSearchSession, which is slightly faster than going through the whole setup process again.

When finished with your search, make sure you pass your search session to FBCDestroySearchSession so it can free that block of memory.

FBC features two additional capabilities. You can index a specific list of files on demand, or return a summary of the most relevant sentences from any one file. To index files, call FBCIndexItems with a pointer to an FSSpec array. Combining this feature with narrowing your search to certain files makes it possible to use FBC for documents specific to your application, such as an e-mail client, web browser, or text editor. FBCIndexItems indexes every file you pass it, regardless of whether it has indexed that file before, so take care to avoid reindexing unchanged files.

You can find out if a volume has been indexed by calling FBCVolumeIsIndexed, which returns True if that is the case. FBCVolumeIndexTimeStamp tells you when a volume was last indexed. Pass it the volume reference number and a pointer to a UInt32, and it fills in the date in a format similar to that returned by GetDateTime.

FBCVolumeIsRemote tells you if a particular volume is remote. You may want to exclude remote volumes from a search. For instance, the iMac source-code server at MacintoshDeveloper.com was originally going to allow TCP/IP file sharing, so you could perform searches using FBC on the volume without using a web browser. Trials of this revealed that FBC searches over the Internet are horribly slow, and bog down both the client and the server for the duration of the search. If you decide to allow remote volumes to be searched, at least give users the ability to decide if they really want to do that.
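Returning to the index time stamp: GetDateTime counts seconds from midnight, January 1, 1904, so a value in that style can be converted or compared with a little arithmetic. The sketch below assumes the time stamp uses exactly that convention (the text only says "similar"), and the function names are illustrative. Note that classic MacOS clocks keep local time, so the converted value is local rather than UTC.

```c
#include <stdint.h>

/* Seconds between 1904-01-01 00:00:00 and 1970-01-01 00:00:00:
   24,107 days (66 years including 17 leap days) times 86,400. */
#define MAC_TO_UNIX_EPOCH_OFFSET 2082844800u

/* Converts a GetDateTime-style value to seconds since 1970-01-01.
   The result is negative for dates before 1970. */
int64_t mac_seconds_to_unix(uint32_t macSeconds)
{
    return (int64_t)macSeconds - (int64_t)MAC_TO_UNIX_EPOCH_OFFSET;
}

/* Whole days elapsed between the time a volume was indexed and "now",
   both given as GetDateTime-style values. Returns 0 if the index time
   stamp is not in the past. */
uint32_t index_age_in_days(uint32_t indexedAt, uint32_t now)
{
    if (now <= indexedAt)
        return 0;
    return (now - indexedAt) / 86400u;
}
```

An application could use index_age_in_days to warn users that a volume's index is stale before they rely on the search results.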

One impressive FBC capability is its ability to summarize text very rapidly. If you pass FBCSummarize a text buffer, it returns a summary of the text in that buffer. To use it, pass it the input text and the length of that text, an output buffer pointer, and a pointer to the output length. You can specify the maximum number of sentences it returns, or 0. If you pass it 0, FBC returns one sentence for every 10 sentences in your input buffer, or one sentence if the input buffer holds fewer than 10.

Lastly, you can use FBCSetHeapReservation to reserve heap space in your heap zone for your callback routines. If you don't, 200 KB is reserved for you. You can also install an FBCCallbackProcPtr callback that reports the status of a search to users while the search runs (to update a progress bar, for example). To do so, call FBCSetCallback with your proc. If you don't explicitly set a callback, the default is used, which just calls WaitNextEvent.
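As for the summary length FBCSummarize settles on when you pass 0, the rule described above can be written out as a small function, which is handy for sizing an output buffer ahead of time. This is a sketch of the rule as stated, not the engine's code; the function name is hypothetical, and rounding down for inputs that are not a multiple of 10 is an assumption.

```c
#include <stdint.h>

/* Upper bound on the number of sentences in a summary.
 *
 * requested       the maximum sentence count passed by the caller, or 0
 *                 to accept the default
 * inputSentences  number of sentences in the input text
 *
 * With requested == 0: one sentence per 10 input sentences (rounded down,
 * which is an assumption), or one sentence for inputs shorter than 10.
 */
uint32_t summary_sentence_limit(uint32_t requested, uint32_t inputSentences)
{
    if (requested != 0)
        return requested;
    if (inputSentences < 10)
        return 1;
    return inputSentences / 10;
}
```

For a 45-sentence document with no explicit maximum, this gives a limit of four sentences.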

Conclusion

Find-By-Content is a powerful tool that is woefully underused by MacOS applications. It is easy to implement and can add power to any application with very little code. It has been said that over a third of the time people use computers is spent searching for something. Giving your users one more tool to aid them in their quest will be greatly appreciated. There are FBC plug-ins for SuperCard and REALBasic, as well as PowerPlant classes for using it. Finally, FBC is Carbon compliant, so your apps will have this power well into the next generation of the MacOS.

DDJ