Features – Issues in Document Retrieval with DOCS Open

Matthew S. DelNero is a Litigation Technical Support Specialist at Mintz, Levin, Cohn, Ferris, Glovsky and Popeo, P.C. in Boston, Massachusetts. At Mintz Levin, Matthew is focused on improvements to the Litigation Section’s internal research capabilities. As part of that process, he has worked extensively with DOCS Open. Matthew graduated summa cum laude from Tufts University in 1998. As a junior at the university, he was elected a member of the Phi Beta Kappa society. Matthew begins his legal studies at Harvard Law School in September 1999.


Introduction

At a conference in September, I mentioned to a vendor that Mintz Levin is working with PC Docs, Inc. to correct a few major problems we are experiencing with DOCS Open. He gave a knowing laugh, commenting on the “love-hate” relationship many firms have with DOCS. I offer this article not as a critique of DOCS Open, but as an investigation of issues behind the peculiar relationship many of us have with this widely used document management system.

I should state before proceeding further that I write from the standpoint of a DOCS Open user, and not that of a database expert. Working on a team dedicated to enriching this firm’s use of DOCS Open, I have gained knowledge as to the inner workings of the system. Nonetheless, I am certain that a person with expertise in this area would offer a different perspective.

Using DOCS Open

For the most basic of document retrieval features, DOCS Open works flawlessly. Such is the case when one knows the precise number or title of the document to be retrieved. Most users, however, need more sophisticated features to conduct research of internal work product. While the search features of DOCS Open are powerful, users must be particularly skilled in the search logic in order to obtain good results.

The dilemma is that most attorneys are not, and should not be expected to be, well-versed in this area. Such is the case with an important DOCS Open feature, content searching. At first glance, content searching seems like the perfect answer to the question “how does a firm best research its internal work product?” Unfortunately, this solution is probably a little too good to be true. I recently spoke with an IS professional at a large New York-based firm who informed me that they did not even activate the content search option, since they expected attorneys would be disappointed with it. At Mintz Levin, we have experienced numerous difficulties with content searching. Although the content search feature is not inherently flawed, it is also not one-hundred percent reliable.

The Verity Search Engine

To begin, while the Verity search engine is a powerful tool, the syntax it requires is rather complex. After reviewing Appendix A of the DOCS Open Users Guide, I felt more comfortable with the various operators involved in content searching. Nonetheless, most attorneys do not have the time (or perhaps, patience) to learn this language. While the common Boolean operators, such as AND, OR, and NOT, pose few problems, most users do not grasp the difference between the MANY, STEM, WILDCARD, SOUNDEX, PHRASE, NEAR, ACCRUE and CASE operators. Some of these operators can also cause considerable instability under certain conditions.

For example, we have found that using the NEAR operator, as opposed to NEAR/n (where “n” is an integer between 1 and 1,024), often leaves DOCS Open in a Not Responding state. It is then necessary to close DOCS and Microsoft Word via Windows Close Program. For this reason, many users here are wary of the NEAR operator. Yet we have not experienced any such difficulties when using NEAR/n.

I looked into the methodology behind each operator to gain some understanding of this problem. Both the NEAR and the NEAR/n operators employ relevancy rankings. With the NEAR operator, the document that contains all search terms in closest proximity receives the highest relevancy ranking, and the others are ranked in relation to it. The highest score, “1”, is assigned to documents in which the words are located next to each other. Scores are calculated for every document in which the search terms are present, but a document that scores less than .75 is not retrieved. The NEAR/n operator also assigns scores based on the relative distance between search terms. In this case, however, if the terms are further apart than the integer assigned, no calculation is attempted and the document is not retrieved.

When the NEAR operator causes a Not Responding message, I notice that the clock displayed in the search screen stops moving. The PC’s clock also freezes, and does not change until DOCS Open is closed through the Windows task manager. It seems that the attempted relevancy calculation causes the client’s memory to fill, and that the PC consumes all the CPU time it can obtain, so that it never yields to wait for I/O (input/output).
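The difference in workload between the two operators can be illustrated with a small Python sketch. Only the .75 cutoff and the NEAR/n distance limit come from the description above. The scoring formula, the function names, and the sample documents are assumptions made up for illustration; this is not Verity's actual algorithm.

```python
def min_gap(words, a, b):
    """Smallest distance, in words, between an occurrence of a and one of b."""
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    gaps = [abs(i - j) for i in pos_a for j in pos_b if i != j]
    return min(gaps) if gaps else None


def toy_score(gap):
    # Hypothetical formula: adjacent terms (gap of 1) score 1.0 and the
    # score falls off linearly with distance. Verity's real formula is
    # not described in this article.
    return max(0.0, 1.0 - (gap - 1) / 1024)


def near(docs, a, b, cutoff=0.75):
    """Plain NEAR: score every document holding both terms, then filter."""
    hits, scored = [], 0
    for name, text in docs.items():
        gap = min_gap(text.lower().split(), a, b)
        if gap is None:
            continue
        scored += 1                      # a calculation is made regardless
        score = toy_score(gap)
        if score >= cutoff:
            hits.append((name, score))
    hits.sort(key=lambda h: -h[1])
    return hits, scored


def near_n(docs, a, b, n):
    """NEAR/n: skip the calculation when the terms are more than n apart."""
    hits, scored = [], 0
    for name, text in docs.items():
        gap = min_gap(text.lower().split(), a, b)
        if gap is None or gap > n:
            continue                     # no calculation attempted
        scored += 1
        hits.append((name, toy_score(gap)))
    hits.sort(key=lambda h: -h[1])
    return hits, scored


# Invented sample documents.
SAMPLE = {
    "a": "merger agreement signed",
    "b": "merger " + "filler " * 300 + "agreement",
    "c": "merger only",
}

if __name__ == "__main__":
    print(near(SAMPLE, "merger", "agreement"))
    print(near_n(SAMPLE, "merger", "agreement", 5))
```

In this toy model both searches return the same single hit, but plain NEAR performs a calculation for every document containing both terms before discarding the low scorers, while NEAR/n never scores the distant match at all.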

It is my understanding that the Verity full-text search engine requires at least some processing to be done on the client PC. As such, the client does not simply receive finished results; it also receives quantities of information that it must process itself. It is quite possible that calculating the relevancy rankings for all documents containing the search terms is too much for most client PCs to handle. (The PCs on which I have attempted NEAR searches have all had 64MB RAM and 266MHz Pentium processors.) One should note, however, that when we search by both “document contents and profile fields” instead of just “document contents”, search results are returned and the PC avoids a “Not Responding” status. The difficulty encountered with the NEAR operator exemplifies the sometimes unpredictable nature of DOCS Open.

Given these problems, our firm is investigating options to simplify the content search process. There is a DOCS Open plug-in, created by Kramer Lee & Associates (U.K.) and distributed in America by Levit & James, Inc., called EZSearch. This plug-in appears to be a good solution to the complexity of Verity syntax. EZSearch is a custom-built, graphical interface which provides a simplified form for users to enter their search queries. Users need not understand even the simplest of Boolean logic to enter a content search. EZSearch translates data entered on the form to Verity Syntax. For example, the design of the EZSearch form prevents users from employing the NEAR function without the “/n” integer. With this simplified DOCS Open plug-in, I expect that more attorneys will employ content searching as part of their research.
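The kind of translation a form-based front end performs can be sketched in a few lines of Python. This is not EZSearch's code. The function name, the form fields, and the angle-bracket spelling of the operators are all assumptions for illustration, and the exact syntax a given DOCS Open installation accepts should be checked against Appendix A of the Users Guide.

```python
def build_query(all_words=(), any_words=(), near_pair=None, within=10):
    """Turn simple form fields into one query string (hypothetical format)."""
    # The range 1 to 1,024 is the limit for "n" given in this article.
    if not 1 <= within <= 1024:
        raise ValueError("within must be between 1 and 1,024")

    parts = []
    if all_words:
        parts.append("(" + " <AND> ".join(all_words) + ")")
    if any_words:
        parts.append("(" + " <OR> ".join(any_words) + ")")
    if near_pair:
        first, second = near_pair
        # The form always supplies a distance, so bare NEAR is never emitted.
        parts.append(f"({first} <NEAR/{within}> {second})")

    if not parts:
        raise ValueError("enter at least one search term")
    return " <AND> ".join(parts)


if __name__ == "__main__":
    print(build_query(all_words=["merger", "escrow"],
                      near_pair=("indemnification", "cap"),
                      within=5))
```

Because the distance is a required form field with a default, a user of such a form cannot produce the bare NEAR query that causes trouble, which is the design property described above.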

Unfortunately, simplifying Verity Syntax does not provide all the answers. Using any enterprise document management system, including DOCS Open, for internal research poses a number of issues. Full-text retrieval is not a perfect science. Two terms, “recall” and “precision,” should be used in discussing full-text retrieval. Recall measures how many of the relevant documents in the collection a search actually returns, whereas precision is the ratio of relevant hits to the total number of hits returned. There is generally an inverse relationship between the two. That is, the broader the search and the more documents it returns (higher recall), the lower the ratio of relevant hits to that number (lower precision). Broader searches tend to turn up more relevant documents, yet are less helpful because the relevant documents are buried among too many irrelevant ones. Narrower searches run the risk of missing key documents. For this reason, users often become frustrated when conducting content searches. At large firms, there are hundreds of thousands, if not millions, of documents in various DOCS Open libraries. Mixed in with useful briefs and memoranda are fax covers, rough drafts of letters, and other less than useful documents.
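A short worked example may make the trade-off concrete. The Python sketch below uses the standard information-retrieval definitions, in which recall is the share of all relevant documents that a search returns and precision is the share of returned documents that are relevant. The document numbers are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one search."""
    returned, relevant = set(returned), set(relevant)
    relevant_hits = returned & relevant
    precision = len(relevant_hits) / len(returned) if returned else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented data: four documents in the library are truly relevant.
RELEVANT = {101, 102, 103, 104}

# A narrow search: three hits, two of them relevant.
NARROW = [101, 102, 900]

# A broad search: all four relevant documents plus sixteen irrelevant ones.
BROAD = [101, 102, 103, 104] + list(range(900, 916))

if __name__ == "__main__":
    print("narrow:", precision_recall(NARROW, RELEVANT))
    print("broad: ", precision_recall(BROAD, RELEVANT))
```

Here the narrow search finds only half of the relevant documents but two of its three hits are useful, while the broad search finds every relevant document at the cost of a hit list in which only one entry in five is worth reading.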

The logical answer is to search by document TYPE in the profile screen, along with the content search. Unfortunately, this method is not always practical, given the amount of time it can take to return a hit list. When conducting a profile (SQL) search, the query optimizer attempts to use an index (which is relatively quick). Yet if the TYPE being searched accounts for over twenty percent of the documents, the optimizer forces a full table scan. Performing this kind of scan is a lengthy process. At many firms, a common document type, such as “brief”, meets this criterion. Thus, when a user performs a profile search on such a type, a serial search is run and results are produced at a much slower rate. Given that it is unrealistic for users to wait more than a couple of minutes for search returns, they generally should avoid searching for documents by any profile fields which meet the twenty percent criterion.
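An administrator could estimate which document types are likely to cross this line by counting how often each one occurs. The Python sketch below assumes the list of TYPE values has already been exported from the library. The function name and the sample counts are invented, and the twenty percent figure is simply the one quoted above; the point at which a real optimizer abandons an index depends on the database server and its statistics.

```python
from collections import Counter


def types_likely_to_scan(doc_types, threshold=0.20):
    """Return {type: share} for every type above the threshold share."""
    counts = Counter(doc_types)
    total = sum(counts.values())
    return {
        doc_type: count / total
        for doc_type, count in counts.items()
        if count / total > threshold
    }


# Invented library of 100 profiled documents.
LIBRARY = (["brief"] * 30 + ["memo"] * 15
           + ["letter"] * 45 + ["fax cover"] * 10)

if __name__ == "__main__":
    for doc_type, share in types_likely_to_scan(LIBRARY).items():
        print(f"{doc_type}: {share:.0%} of documents")
```

In this invented library, “brief” and “letter” would be flagged as types to avoid in profile searches, while “memo” and “fax cover” stay under the threshold.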

The lesson from the profile search example is that, where possible, larger TYPE categories should be divided into smaller sub-types for users to select. Unfortunately, getting users to enter the right TYPE, especially in large firms, can be a daunting task, and there is a countervailing benefit to leaving commonly used types undivided that rests on a basic psychological issue. Especially at a busy law firm, users often profile documents in a hurry. If confronted with too many document types to choose from, they will pick one such as “miscellaneous”. When there are only a few commonly used types to choose from, users are less likely to select “miscellaneous”.

Profile searches by TYPE cannot be conducted efficiently when the twenty percent criterion is met. I recommend that users employ only content searching and those profile fields which are guaranteed not to meet that criterion. Begin with a focus on recall (broader searches) and then use the “narrow” function (located in the DOCS Open menu under “search”) to gain higher precision.

On a final note, ensure that profile searches can perform at their optimal level, regardless of whether or not a search meets the twenty percent threshold. The health of the SQL database indices is key to this point. To correct unusually slow search speeds, PC Docs, Inc. Technical Support recommended that we drop and rebuild all DOCS Open indices in the SQL server. We were directed to a SQL procedure, provided on the PC Docs Technical Support web site, capable of performing that function. As always, we were grateful to PC Docs, Inc. Technical Support, which has been highly receptive to solving any problems Mintz Levin has experienced with DOCS Open. Although profile searches are still notably slower than full-text searches, they now return results faster than they did before.

As I work more with DOCS Open, I am increasingly aware that it is a powerful tool, yet one which requires careful maintenance. In gaining knowledge of its capabilities and limitations, I am able to instruct users as to proper methods of accessing internal work product. The most important lesson is that DOCS Open will not take care of itself. Administrators must keep on top of issues as they arise, and make good use of the vast resources provided by PC Docs, Inc. Technical Support services. It is only by acknowledging and understanding the complexities of DOCS Open that users can best manipulate it to serve their document retrieval needs.
